Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 100]
cs.CV [Total: 166]
cs.AI [Total: 42]
cs.SD [Total: 4]
cs.LG [Total: 110]
cs.MA [Total: 3]
cs.MM [Total: 2]
eess.AS [Total: 7]
eess.IV [Total: 16]

cs.CL

[1] Large Language Models in the Travel Domain: An Industrial Experience

Sergio Di Meglio, Aniello Somma, Luigi Libero Lucio Starace, Fabio Scippacercola, Giancarlo Sperlì, Sergio Di Martino

Main category: cs.CL

TL;DR: The paper evaluates two LLMs (Mistral 7B and Mixtral 8x7B) for generating consistent property descriptions on a booking platform, finding Mixtral superior in quality but more resource-intensive.

Details

Motivation: Inconsistent third-party data in property booking platforms frustrates users and harms market share, prompting the need for reliable solutions.

Method: Case study integrating Mistral 7B (fine-tuned with QLoRA) and Mixtral 8x7B (refined system prompt) into CALEIDOHOTELS, assessing performance on description quality and hallucinations.

Result: Mixtral 8x7B outperformed Mistral 7B in completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), but required significantly more resources (50GB VRAM, $1.61/hour vs. 5GB, $0.16/hour).

Conclusion: The study highlights trade-offs between model quality and resource efficiency, offering practical insights for deploying LLMs to improve data consistency in booking platforms.

Abstract: Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market share. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, utilized with a refined system prompt. Both models were assessed based on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), while producing shorter, more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.

[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Qinfeng Song, Kaixuan Yang, Jiangbo Zhang, Yaoying Wang, Ruimeng Li, Biyi Zhou

Main category: cs.CL

TL;DR: ElectriQ is a benchmark for evaluating LLMs in electric power marketing, addressing gaps in domain expertise and empathy. It includes a dialogue dataset, four evaluation metrics, and a knowledge augmentation method, and shows that smaller models, once fine-tuned and augmented, can surpass GPT-4o in professionalism and user-friendliness.

Details

Motivation: Current systems like China's 95598 hotline have slow response times and lack domain expertise. General-purpose LLMs also fall short in domain-specific tasks and empathy.

Method: ElectriQ introduces a dialogue dataset, four evaluation metrics, and a knowledge augmentation method to enhance LLMs for power marketing.

Result: Across 13 LLMs, fine-tuned and knowledge-augmented smaller models (e.g., LLama3-8B) surpass GPT-4o in professionalism and user-friendliness.

Conclusion: ElectriQ provides a foundation for developing domain-specific LLMs for power marketing, improving service quality.

Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China’s 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

Navid Yazdanjue, Morteza Rakhshaninejad, Hossein Yazdanjouei, Mohammad Sadegh Khorshidi, Mikko S. Niemela, Fang Chen, Amir H. Gandomi

Main category: cs.CL

TL;DR: A hierarchical framework combining fine-tuned language models and semi-supervised ensemble learning detects and classifies illicit marketplace content across diverse platforms with high accuracy.

Details

Motivation: Illicit marketplaces on hidden internet platforms pose detection challenges due to limited labeled data, evolving language, and heterogeneous sources.

Method: Uses ModernBERT for semantic extraction, combines it with engineered features, and employs a two-stage semi-supervised ensemble for classification.

Result: Outperforms baselines with 0.96489 accuracy, 0.93467 F1-score, and 0.95388 TMCC.

Conclusion: The framework is robust, generalizes well, and effectively detects illicit content in real-world scenarios.

Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
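Editor's sketch: the abstract above mentions entropy-based weighted voting but does not give the formula. The following is a minimal illustration under one plausible reading, in which each classifier's class distribution is weighted by one minus its normalized entropy. The function names and the example probabilities are hypothetical and are not taken from the paper.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to the range [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def entropy_weighted_vote(distributions):
    """Average class distributions, giving low-entropy (confident) classifiers more weight."""
    # Assumption: weight = 1 - normalized entropy; the paper may define it differently.
    weights = [max(0.0, 1.0 - normalized_entropy(d)) for d in distributions]
    total = sum(weights)
    if total == 0:  # every classifier is maximally uncertain: fall back to a plain average
        weights = [1.0] * len(distributions)
        total = float(len(distributions))
    n_classes = len(distributions[0])
    combined = [
        sum(w * d[c] for w, d in zip(weights, distributions)) / total
        for c in range(n_classes)
    ]
    return combined.index(max(combined)), combined

# Made-up outputs of three classifiers for the classes (not-sale, sale).
votes = [[0.05, 0.95], [0.55, 0.45], [0.60, 0.40]]
label, combined = entropy_weighted_vote(votes)
print(label, [round(p, 3) for p in combined])
```

In this toy input a simple majority vote would pick class 0, while the entropy weighting lets the single confident classifier dominate and picks class 1.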

[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

Jinyu Liu, Xiaoying Song, Diana Zhang, Jason Thomale, Daqing He, Lingzi Hong

Main category: cs.CL

TL;DR: A hybrid framework combining ML models and LLMs improves subject analysis by guiding LLM predictions and post-editing to reduce hallucinations.

Details

Motivation: Subject access in libraries is crucial, but traditional ML models struggle with unseen cases, and LLMs over-generate or hallucinate.

Method: A hybrid approach uses embedding-based ML models to predict the number of LCSH labels that guides the LLM, and to post-edit the LLM's predicted terms with actual LCSH terms.

Result: The framework produces more controlled and vocabulary-aligned subject terms.

Conclusion: Integrating ML and LLMs enhances subject analysis efficiency and accuracy.

Abstract: Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
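Editor's sketch: the two steps described in the abstract above (cap the number of labels, then replace generated terms with real vocabulary entries) can be illustrated roughly as follows. The paper uses embedding-based models; this sketch swaps in string similarity from Python's difflib as a stand-in, and the vocabulary, cutoff, and function name are hypothetical.

```python
from difflib import get_close_matches

# Tiny stand-in vocabulary; the real LCSH is far larger.
VOCAB = [
    "Machine learning",
    "Natural language processing",
    "Libraries--Automation",
    "Information retrieval",
]

def post_edit(llm_terms, k, vocab=VOCAB, cutoff=0.8):
    """Map free-text LLM terms to the closest vocabulary headings and keep at most k."""
    lowered = {v.lower(): v for v in vocab}
    edited = []
    for term in llm_terms:
        # String similarity is an assumption here; the paper matches with embeddings.
        match = get_close_matches(term.lower(), list(lowered), n=1, cutoff=cutoff)
        if match and lowered[match[0]] not in edited:
            edited.append(lowered[match[0]])
    return edited[:k]  # k would come from the label-count predictor

llm_output = ["machine-learning", "Information retreival", "Quantum basket weaving"]
print(post_edit(llm_output, k=2))
```

Terms with no sufficiently close heading are dropped, which is one simple way to keep hallucinated subjects out of the final output.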

[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

Victor Eiti Yamamoto, Hideaki Takeda

Main category: cs.CL

TL;DR: A novel KG integration method combining label and triple matching addresses the context matching gap, achieving accuracy competitive with leading OAEI systems and supervised methods.

Details

Motivation: Current KG integration methods lack focus on context matching, which is crucial for diverse and complex real-world KGs.

Method: Proposes label matching (using string manipulation, fuzzy matching, vector similarity) and triple matching to align entities and predicates.

Result: Achieves high accuracy, competitive with OAEI leaders and supervised methods, and introduces a new dataset for evaluation.

Conclusion: The method effectively integrates diverse KGs by addressing context matching, demonstrating competitive performance and practicality.

Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
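Editor's sketch: the label-matching step described above combines string manipulation with fuzzy matching. A minimal version, assuming simple normalization rules and a difflib similarity threshold (both hypothetical, and omitting the vector-similarity component), might look like this:

```python
import re
from difflib import SequenceMatcher

def normalize(label):
    """Split camelCase, turn separators into spaces, and lowercase."""
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    label = re.sub(r"[_\-]+", " ", label)
    return " ".join(label.lower().split())

def similarity(a, b):
    """Fuzzy similarity of two labels after normalization, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_labels(source, target, threshold=0.85):
    """Pair each source label with its best target label if the score clears the threshold."""
    pairs = []
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            pairs.append((s, best, round(score, 3)))
    return pairs

# Made-up predicate labels from two knowledge graphs.
kg_a = ["birthPlace", "date_of_birth", "spouse"]
kg_b = ["birth place", "dateOfBirth", "employer"]
print(match_labels(kg_a, kg_b))
```

Labels that have no close counterpart, such as "spouse" here, are left unmatched so that a later step (triple matching, in the paper's design) can decide what to do with them.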

[6] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Xiaoyu Pan, Yang Bai, Ke Zou, Yang Zhou, Jun Zhou, Huazhu Fu, Yih-Chung Tham, Yong Liu

Main category: cs.CL

TL;DR: EH-Benchmark is introduced to evaluate and mitigate hallucinations in Medical Large Language Models (MLLMs) for ophthalmic diagnosis, improving accuracy and reliability.

Details

Motivation: MLLMs face accuracy issues due to hallucinations from limited ophthalmic knowledge, poor visual reasoning, and lack of multimodal data. Existing benchmarks fail to address these challenges.

Method: A three-phase framework (Knowledge-Level Retrieval, Task-Level Case Studies, Result-Level Validation) is proposed to categorize and mitigate hallucinations in MLLMs.

Result: The framework significantly reduces hallucinations, enhancing model accuracy, interpretability, and reliability.

Conclusion: EH-Benchmark effectively addresses MLLM hallucinations in ophthalmology, offering a practical solution for improved diagnosis.

Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.

[7] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Esmail Gumaan

Main category: cs.CL

TL;DR: The paper rigorously defines and analyzes hallucination in LLMs, distinguishing intrinsic and extrinsic types, and introduces a hallucination risk metric. It provides detection and mitigation strategies, a unified workflow, and evaluation protocols.

Details

Motivation: Address the challenge of hallucination in LLMs by providing formal definitions, theoretical analysis, and practical solutions.

Method: Uses learning-theoretic frameworks (PAC-Bayes and Rademacher complexity) to derive bounds on hallucination risk. Surveys detection (e.g., token-level uncertainty) and mitigation (e.g., retrieval-augmented generation) strategies.

Result: Proposes a unified workflow for detection and mitigation, along with evaluation protocols (datasets, metrics) to quantify and reduce hallucinations.

Conclusion: Lays a theoretical foundation and practical guidelines for tackling hallucination in LLMs, offering actionable solutions for researchers and practitioners.

Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.

[8] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Jizhou Guo

Main category: cs.CL

TL;DR: LENS is a novel ensemble method for LLMs that learns model confidence from internal representations, outperforming traditional techniques.

Details

Motivation: Existing ensemble methods for LLMs rely on simplistic approaches like voting, ignoring context-dependent model reliability.

Method: LENS trains lightweight linear confidence predictors using layer-wise hidden states and normalized probabilities to weight predictions.

Result: LENS significantly outperforms traditional ensemble methods on multiple-choice and boolean QA tasks.

Conclusion: Internal representations offer valuable signals for model confidence, enhancing ensemble learning with negligible additional computation and no changes to model parameters.

Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
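Editor's sketch: the idea of weighting each model's answer by a learned confidence can be illustrated as below. The abstract specifies a lightweight linear predictor over hidden states and normalized probabilities; the sigmoid, the summed-confidence aggregation, and all feature and weight values here are assumptions made for illustration only.

```python
import math

def confidence(features, weights, bias):
    """Linear score over the features, squashed into (0, 1) with a sigmoid (an assumption)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def weighted_answer(predictions):
    """predictions: list of (answer, confidence). Return the answer with the largest summed confidence."""
    totals = {}
    for answer, conf in predictions:
        totals[answer] = totals.get(answer, 0.0) + conf
    return max(totals, key=totals.get)

# Hypothetical per-model features (e.g. a pooled hidden-state value and the answer
# probability) with hypothetical, already-trained predictor parameters.
models = [
    {"answer": "B", "features": [0.9, 0.8], "weights": [2.0, 1.5], "bias": -1.0},
    {"answer": "A", "features": [0.1, 0.4], "weights": [1.0, 2.0], "bias": -1.5},
    {"answer": "A", "features": [0.2, 0.3], "weights": [1.5, 1.0], "bias": -1.0},
]
preds = [(m["answer"], confidence(m["features"], m["weights"], m["bias"])) for m in models]
chosen = weighted_answer(preds)
print(chosen)
```

With these made-up numbers, plain majority voting would return "A", while the confidence-weighted vote returns "B" because the one model backing it is far more confident.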

[9] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

Main category: cs.CL

TL;DR: The paper proposes a method to improve multilingual ASR for dysarthric speech by generating synthetic dysarthric-like speech using a voice conversion model, enhancing recognition performance.

Details

Motivation: Addressing data scarcity for dysarthric speech in non-English languages by leveraging English dysarthric data and converting healthy non-English speech.

Method: Fine-tune a voice conversion model on English dysarthric speech, apply it to convert healthy non-English speech into dysarthric-like speech, and use this to fine-tune a multilingual ASR model.

Result: VC with speaker and prosody conversion outperforms off-the-shelf MMS and conventional augmentation techniques, validated on Spanish, Italian, and Tamil datasets.

Conclusion: The approach effectively simulates dysarthric characteristics and improves ASR performance for dysarthric speech in non-English languages.

Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.

[10] Reading Between the Timelines: RAG for Answering Diachronic Questions

Kwun Hang Lau, Ruiyuan Zhang, Weijie Shi, Xiaofang Zhou, Xiaojun Cheng

Main category: cs.CL

TL;DR: A new RAG framework enhances temporal query handling by disentangling queries into subject and temporal components, improving accuracy by 13-27%.

Details

Motivation: Address the deficit of conventional RAG in handling longitudinal queries requiring temporal coherence.

Method: Disentangles queries into subject and temporal window, uses a specialized retriever for temporal relevance, and introduces ADQAB for evaluation.

Result: Substantial gains in answer accuracy (13-27% over standard RAG) on the ADQAB benchmark.

Conclusion: Provides a validated pathway for RAG systems to handle nuanced, evolutionary analysis for complex questions.

Abstract: While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user’s query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at https://github.com/kwunhang/TA-RAG.
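Editor's sketch: the query disentangling and period-spanning retrieval described above can be approximated in a few lines. This is not the TA-RAG implementation; the "from YYYY to YYYY" pattern, the word-overlap score standing in for embedding similarity, and the one-document-per-year policy are all simplifying assumptions.

```python
import re

def split_query(query):
    """Separate a 'from YYYY to YYYY' window from the remaining subject text."""
    m = re.search(r"from (\d{4}) to (\d{4})", query)
    if not m:
        return query.strip(), None
    subject = (query[:m.start()] + query[m.end():]).strip(" ?")
    return " ".join(subject.split()), (int(m.group(1)), int(m.group(2)))

def overlap(a, b):
    """Jaccard overlap of lowercase word sets; a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(query, docs, top_k=3):
    """Pick the best-matching document for each year in the window, in chronological order."""
    subject, window = split_query(query)
    if window is None:  # no temporal window: plain similarity ranking
        return sorted(docs, key=lambda d: overlap(subject, d["text"]), reverse=True)[:top_k]
    picked = []
    for year in range(window[0], window[1] + 1):
        in_year = [d for d in docs if d["year"] == year]
        if in_year:
            picked.append(max(in_year, key=lambda d: overlap(subject, d["text"])))
    return picked

# Made-up corpus of dated snippets.
docs = [
    {"year": 2019, "text": "Acme revenue grew on strong exports"},
    {"year": 2020, "text": "Acme revenue fell during lockdowns"},
    {"year": 2020, "text": "Weather report for the coast"},
    {"year": 2021, "text": "Acme revenue recovered"},
    {"year": 2023, "text": "Acme revenue hit a record"},
]
evidence = retrieve("How did Acme revenue change from 2019 to 2021?", docs)
print([(d["year"], d["text"]) for d in evidence])
```

Selecting per year rather than taking a global top-k is what keeps the evidence set contiguous across the queried period, even when one year's best document scores lower than several documents from another year.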

[11] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O’Brien

Main category: cs.CL

TL;DR: The paper explores feature universality in Gemma-2 language models (2B and 9B) to see if differently scaled models converge on similar internal concepts. Using Sparse Autoencoders (SAEs), the study aligns and compares feature spaces, finding middle layers show the strongest overlap. Multi-token subspace analysis further supports universality.

Details

Motivation: To determine if language models of varying scales develop comparable internal features, which could enhance cross-model interpretability.

Method: Uses Sparse Autoencoders (SAEs) on residual-stream activations, aligns features via activation correlation, and compares them with SVCCA and RSA. Extends analysis to multi-token subspaces.

Result: Middle layers exhibit the strongest feature overlap, while early and late layers show less similarity. Multi-token subspaces also demonstrate semantic similarity.

Conclusion: Large language models develop broadly similar, interpretable features regardless of size, supporting universality as a basis for cross-model interpretability.

Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model’s residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
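Editor's sketch: aligning features "via activation correlation" can be pictured as matching each feature in one model to the most correlated feature in the other, measured over the same tokens. The code below is a toy illustration with made-up activations; it uses Pearson correlation and a greedy best match, which may differ from the paper's exact procedure, and it omits the SVCCA and RSA comparison.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length activation vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def align_features(acts_a, acts_b):
    """For each feature of model A, find the most correlated feature of model B."""
    mapping = {}
    for i, fa in enumerate(acts_a):
        scores = [pearson(fa, fb) for fb in acts_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        mapping[i] = (j, round(scores[j], 3))
    return mapping

# Rows are SAE features, columns are activations on the same five tokens (made-up numbers).
small = [[0.0, 1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0, 1.0]]
large = [[2.9, 0.1, 0.0, 0.2, 1.1], [0.1, 0.0, 3.0, 0.0, 0.0], [0.0, 2.1, 0.1, 3.9, 0.0]]
mapping = align_features(small, large)
print(mapping)
```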

[12] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Qixuan Hu, Xumou Zhang, Jinman Kim, Florence Bourgeois, Adam G. Dunn

Main category: cs.CL

TL;DR: The paper evaluates methods for predicting serious adverse event (SAE) outcomes in clinical trials using registration data, achieving 77.6% AUC for classification and 18.6% RMSE for regression.

Details

Motivation: To improve clinical trial design by predicting SAE outcomes beforehand, reducing unnecessary risks to participants.

Method: Used transfer learning with pretrained models (ClinicalT5, BioBERT) for feature extraction, combined with a sliding window approach for long texts, and downstream models for prediction.

Result: Best model achieved 77.6% AUC for classification and 18.6% RMSE for regression, with the sliding window method outperforming alternatives.

Conclusion: ClinicalTrials.gov data is underutilized; predicting trial outcomes can enhance design and flag safety discrepancies.

Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit exposing participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results. Two prediction models were developed: a classifier predicting whether the experimental arm will have a higher SAE rate than the control arm (evaluated by area under the receiver operating characteristic curve; AUC), and a regression model predicting the proportion of SAEs in control arms (evaluated by root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing SAEs in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at ClinicalTrials.gov remains underutilised. The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
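Editor's sketch: a sliding window over a long token sequence, as mentioned in the abstract above, might be implemented along these lines. The window size, stride, toy encoder, and mean pooling are all assumptions; the paper's best model feeds embeddings into a Transformer and MLP rather than averaging them.

```python
def sliding_windows(tokens, size, stride):
    """Yield overlapping token windows that together cover the whole sequence."""
    if len(tokens) <= size:
        yield tokens
        return
    start = 0
    while True:
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break
        start += stride

def toy_embed(window):
    """Placeholder encoder (not a language model): [window length, mean token length]."""
    return [float(len(window)), sum(len(t) for t in window) / len(window)]

def embed_long_text(tokens, size=4, stride=2):
    """Embed every window of a non-empty token list, then mean-pool into one vector."""
    vectors = [toy_embed(w) for w in sliding_windows(tokens, size, stride)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

tokens = "adults with type 2 diabetes receiving the study drug".split()
windows = list(sliding_windows(tokens, size=4, stride=2))
print(len(windows), windows[-1])
print(embed_long_text(tokens))
```

A stride smaller than the window size makes neighbouring windows overlap, so text near a window boundary is still seen with some surrounding context.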

[13] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Jindong Li, Yali Fu, Jiahong Liu, Linxiao Cao, Wei Ji, Menglin Yang, Irwin King, Ming-Hsuan Yang

Main category: cs.CL

TL;DR: A survey on discrete tokenization methods for LLMs, categorizing 8 VQ variants, analyzing their principles, and discussing challenges and future directions.

Details

Motivation: The need for effective mechanisms to transform multimodal data into discrete representations compatible with LLMs, given the lack of comprehensive surveys on VQ techniques in LLM-based systems.

Method: Presents a structured taxonomy and analysis of 8 VQ variants, examining algorithmic principles, training dynamics, and integration challenges with LLM pipelines.

Result: Identifies key challenges (e.g., codebook collapse, unstable gradients) and discusses applications in classical, single-modality, and multimodal LLM systems.

Conclusion: Bridges traditional VQ and modern LLM applications, offering a foundational reference for efficient multimodal systems, with ongoing updates.

Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.
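Editor's sketch: the core operation behind vector quantization is a nearest-neighbour lookup that replaces each continuous feature vector with the index of its closest codebook entry. The example below shows only that lookup with a fixed, made-up codebook; codebook learning and the eight variants covered by the survey are out of scope.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist2(vector, codebook[i]))

def tokenize(features, codebook):
    """Map a sequence of continuous feature vectors to discrete token ids."""
    return [quantize(f, codebook) for f in features]

# Made-up 2-D codebook and features, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.8], [0.2, 0.9], [0.95, 1.1]]
ids = tokenize(features, codebook)
print(ids)  # [0, 1, 2, 1]
```

The resulting ids can be treated like ordinary vocabulary tokens, which is what makes the representation compatible with an LLM's input pipeline.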

[14] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

Lee Harris

Main category: cs.CL

TL;DR: The paper introduces the Language Model Chain (LMC) algorithm, which accepts a model's answer only if it appears among the candidate answers and passes the remaining texts to progressively slower, more predictive models, improving accuracy and reducing hallucinations.

Details

Motivation: Address the high cost and unreliability (hallucinations) of language models, ensuring correct information extraction without wasted resources.

Method: Proposes the LMC algorithm, where a language model’s response is validated against candidate answers, and incorrect responses are passed to a slower, more predictive model in a cascade.

Result: Applied to extract patient dates of birth from medical documents, LMC significantly improved speed and accuracy while reducing hallucinations.

Conclusion: The LMC algorithm is a novel contribution to knowledge extraction, warranting further exploration.

Abstract: Language models can capture complex relationships in given text, but they are notoriously costly and prone to producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this algorithm, a language model’s response to a given prompt about given text is considered correct only if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that it should be explored much further in the future.
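
The cascade described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the author's implementation: the two "models" are hypothetical stand-in functions, and the candidate answers are assumed to be every date-like string found in the document via a regular expression.

```python
import re
from typing import Callable, Dict, List, Optional, Set

Model = Callable[[str], str]  # maps document text to a predicted answer

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def candidate_answers(text: str) -> Set[str]:
    """Assumption: candidates are all date-like strings present in the text."""
    return set(DATE.findall(text))


def language_model_chain(documents: Dict[str, str],
                         models: List[Model]) -> Dict[str, Optional[str]]:
    """Run models from fastest to slowest; escalate only rejected documents."""
    results: Dict[str, Optional[str]] = {doc_id: None for doc_id in documents}
    pending = list(documents)
    for model in models:
        still_pending = []
        for doc_id in pending:
            answer = model(documents[doc_id])
            if answer in candidate_answers(documents[doc_id]):
                results[doc_id] = answer      # accepted: matches a candidate
            else:
                still_pending.append(doc_id)  # rejected: try the next model
        pending = still_pending
        if not pending:
            break
    return results


slow_calls = []  # records which texts reached the slower model


def fast_model(text: str) -> str:
    """Hypothetical cheap model: only handles text that starts with 'DOB: '."""
    return text[5:15] if text.startswith("DOB: ") else "1900-01-01"


def slow_model(text: str) -> str:
    """Hypothetical stronger model: finds the date following 'DOB: '."""
    slow_calls.append(text)
    match = re.search(r"DOB: (\d{4}-\d{2}-\d{2})", text)
    return match.group(1) if match else ""


docs = {
    "a": "DOB: 1980-05-17. Admitted 2021-03-02.",
    "b": "Admitted 2021-03-02. DOB: 1975-11-30.",
}
results = language_model_chain(docs, [fast_model, slow_model])
```

In this toy run the fast model's made-up answer for document "b" is rejected because it does not appear among the candidates, so only that one document reaches the slower model.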

[15] Predicting stock prices with ChatGPT-annotated Reddit sentiment

Mateusz Kmak, Kamil Chmurzyński, Kamil Matejuk, Paweł Kotzbach, Jan Kocoń

Main category: cs.CL

TL;DR: The paper examines if social media sentiment predicts stock movements, focusing on GameStop and AMC. It tests three sentiment analysis methods, finding weak correlation with prices but stronger signals from comment volume and search trends.

Details

Motivation: To understand the impact of online sentiment, particularly from Reddit's r/wallstreetbets, on stock prices, inspired by events like the GameStop short squeeze.

Method: Uses two existing sentiment analysis methods and a new ChatGPT-annotated RoBERTa model to analyze informal language and emojis. Tests predictive power via correlation and causality metrics.

Result: Social media sentiment shows weak correlation with stock prices, while comment volume and Google search trends are stronger predictors.

Conclusion: Retail investor behavior is complex, and traditional sentiment analysis may miss nuances in online discussions that influence markets.

Abstract: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit’s r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment’s role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models’ predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.

[16] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

Aman Gupta, Yingying Zhuang, Zhou Yu, Ziji Zhang, Anurag Beniwal

Main category: cs.CL

TL;DR: The paper evaluates prompt translation strategies in multilingual RAG-based systems, showing optimized strategies improve cross-lingual knowledge sharing and task performance.

Details

Motivation: To address performance variability in multilingual LLMs and unclear impacts of prompt translation strategies in RAG systems.

Method: Systematic evaluation of prompt translation strategies (pre-translation vs. cross-lingual prompting) for classification tasks in multilingual RAG-enhanced LLMs.

Result: Optimized prompting improves cross-lingual knowledge sharing and downstream task performance.

Conclusion: Advocates for multilingual resource sharing and cross-lingual prompt optimization, especially for low-resource languages.

Abstract: Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages and thereby improve performance on the downstream classification task. The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.

[17] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

Brittney Exline, Melanie Duffin, Brittany Harbison, Chrissa da Gomez, David Joyner

Main category: cs.CL

TL;DR: The paper explores how native vs. non-native English speaker status impacts peer feedback in online U.S. CS courses, finding differences in sentiment and ratings.

Details

Motivation: To understand how language background affects peer feedback experiences in online graduate CS courses, given the high enrollment of international students.

Method: Analyzed sentiment of peer reviews from 500 students using the Twitter-roBERTa model, correlating sentiment scores and feedback ratings with language background.

Result: Native English speakers rated feedback less favorably; non-native speakers wrote more positively but received less positive sentiment. Language background had modest but complex effects.

Conclusion: Language background influences peer feedback dynamics, highlighting nuanced challenges in online education for non-native speakers.

Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master’s degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students’ language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.

[18] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Haoran Sun, Shaoning Zeng

Main category: cs.CL

TL;DR: Proposes a Hierarchical Memory (H-MEM) architecture for LLM Agents to improve long-term memory organization and retrieval, outperforming baselines in dialogue tasks.

Details

Motivation: Existing memory mechanisms for LLM Agents lack structured organization and efficient retrieval, limiting reasoning capabilities.

Method: Introduces H-MEM, a multi-level memory architecture with positional indexing and index-based routing for efficient retrieval.

Result: Outperforms five baseline methods on the LoCoMo dataset, showing effectiveness in long-term dialogue scenarios.

Conclusion: H-MEM enhances LLM Agents’ reasoning by improving memory organization and retrieval efficiency.

Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graphs, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
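
The layer-by-layer, index-based routing described above might look roughly like the following. It is a toy sketch under assumptions: the `MemoryNode` structure, the dot-product scoring, and the example memories are hypothetical and not drawn from the paper, which encodes the child pointers inside the memory vectors themselves.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class MemoryNode:
    text: str
    vector: List[float]
    # Indices of semantically related sub-memories in the next layer.
    children: List[int] = field(default_factory=list)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def route(query: Sequence[float],
          layers: List[List[MemoryNode]],
          top_k: int = 1) -> List[MemoryNode]:
    """Descend the hierarchy, scoring only nodes reachable via index pointers."""
    if not layers or not layers[0]:
        return []
    candidates = list(range(len(layers[0])))
    selected: List[MemoryNode] = []
    for layer in layers:
        ranked = sorted(candidates,
                        key=lambda i: dot(query, layer[i].vector),
                        reverse=True)
        selected = [layer[i] for i in ranked[:top_k]]
        # Follow the stored indices instead of scanning the whole next layer.
        candidates = [c for node in selected for c in node.children]
        if not candidates:
            break
    return selected


# Hypothetical two-layer memory: abstract topics, then concrete episodes.
layers = [
    [MemoryNode("travel", [1.0, 0.0], children=[0, 1]),
     MemoryNode("work", [0.0, 1.0], children=[2])],
    [MemoryNode("trip to Rome", [0.9, 0.1]),
     MemoryNode("trip to Oslo", [0.6, 0.4]),
     MemoryNode("project deadline", [0.1, 0.9])],
]
hits = route([1.0, 0.2], layers)
```

Only the two children of the winning top-level node are scored in the second layer, which is where the savings over exhaustive similarity search would come from.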

[19] Multi-Relation Extraction in Entity Pairs using Global Context

Nilesh, Atul Gupta, Avinash C Panday

Main category: cs.CL

TL;DR: A novel input embedding method for document-level relation extraction captures global context by representing entities as standalone segments, improving accuracy over sentence-focused approaches.

Details

Motivation: Existing methods fail to capture the full document context for relation extraction, as they only focus on sentences where entities appear.

Method: Introduces a global context-aware input embedding approach, representing entities independently of their positions in the document.

Result: Tested on DocRED, Re-DocRED, and REBEL datasets, the method accurately predicts relationships, outperforming previous approaches.

Conclusion: The method advances global context modeling and multi-sentence reasoning, with practical benefits for NLP applications requiring detailed entity insights.

Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.

[20] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan, Yihan Jiao, Dan Yang, Lei Liu, Jie Feng, Duolin Sun, Yue Shen, Jian Wang, Peng Wei, Jinjie Gu

Main category: cs.CL

TL;DR: The paper introduces Placeholder-RAG-Benchmark, a fine-grained benchmark to evaluate LLM-specific capabilities in RAG systems, focusing on filtering, combination, and reasoning abilities.

Details

Motivation: Current benchmarks lack granular evaluation of LLM-specific document utilization in RAG systems, prompting the need for a systematic framework.

Method: The authors propose a placeholder-based approach to decouple LLM’s parametric knowledge from external knowledge, assessing multi-level filtering, combination, and reference reasoning.

Result: Experiments reveal limitations in LLMs’ generation capabilities, especially in error resilience and context faithfulness.

Conclusion: The benchmark offers a reproducible framework for improving RAG system reliability and efficiency.

Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM’s ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs’ roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM’s parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system’s generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.

[21] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen, Aske Plaat, Niki van Stein

Main category: cs.CL

TL;DR: The study investigates the faithfulness of Chain-of-Thought (CoT) prompting in Large Language Models (LLMs), revealing its effectiveness in larger models like Pythia-2.8B but not in smaller ones like Pythia-70M.

Details

Motivation: To determine if CoT-generated thoughts truly reflect the internal reasoning process of LLMs and to assess its impact on model performance and interpretability.

Method: Combined sparse autoencoders with activation patching to analyze monosemantic features in Pythia-70M and Pythia-2.8B under CoT and plain prompting, focusing on GSM8K math problems.

Result: Patching CoT features into noCoT runs significantly raises answer log-probabilities in Pythia-2.8B but has no reliable effect in Pythia-70M; CoT also yields higher activation sparsity and feature interpretability in the larger model, indicating more modular computation.

Conclusion: CoT induces more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

Abstract: Chain-of-thought (CoT) prompting boosts the accuracy of Large Language Models on multi-step tasks, yet whether the generated “thoughts” reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model’s confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

[22] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra, Suparna De, Nishanth Sastry, Saeed Fadaei

Main category: cs.CL

TL;DR: The paper introduces a method to create synthetic PII-revealing text data for privacy research, using LLMs, and evaluates its utility with three metrics.

Details

Motivation: To address the lack of open-source labeled datasets for studying PII disclosures on social platforms like Reddit, which pose privacy risks.

Method: Develops a taxonomy of 19 PII categories, uses three LLMs (Llama2-7B, Llama3-8B, zephyr-7b-beta) to generate synthetic data resembling Reddit posts, and evaluates it with reproducibility, unlinkability, and indistinguishability metrics.

Result: A synthetic PII-labeled dataset is created and released, meeting reproducibility, unlinkability, and indistinguishability criteria.

Conclusion: The synthetic dataset enables reproducible research into PII privacy risks without compromising user privacy, fostering safer online interactions.

Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users’ Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.

[23] Enhancing RAG Efficiency with Adaptive Context Compression

Shuyu Guo, Zhaochun Ren

Main category: cs.CL

TL;DR: ACC-RAG dynamically adjusts compression rates for RAG, improving efficiency without losing accuracy.

Details

Motivation: Fixed compression rates in RAG over-compress simple queries or under-compress complex ones, leading to inefficiency.

Method: Uses a hierarchical compressor and context selector to adaptively compress based on input complexity.

Result: Outperforms fixed-rate methods, achieves 4x faster inference, and maintains/improves accuracy.

Conclusion: ACC-RAG optimizes RAG efficiency dynamically, balancing speed and accuracy.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and achieves over 4 times faster inference than standard RAG while maintaining or improving accuracy.
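
The idea of an input-dependent context budget can be illustrated with a toy selector like the one below. This is only a sketch of the general principle: ACC-RAG uses a learned hierarchical compressor and context selector over embeddings, whereas here query-term coverage is an assumed stand-in for the sufficiency decision, and the function name `adaptive_select` and the example passages are hypothetical.

```python
from typing import List


def adaptive_select(query: str, units: List[str],
                    threshold: float = 1.0) -> List[str]:
    """Keep context units in order until the query looks sufficiently covered.

    `units` are assumed to be ordered from most to least important. Coverage
    of query terms is a crude proxy for a learned sufficiency signal.
    """
    terms = set(query.lower().split())
    if not terms:
        return []
    kept: List[str] = []
    covered = set()
    for unit in units:
        if len(covered) / len(terms) >= threshold:
            break  # enough information retained; stop adding context
        kept.append(unit)
        covered |= terms & set(unit.lower().split())
    return kept


units = [
    "paris is the capital of france",
    "france borders spain and italy",
    "the euro is the currency of france",
]
simple = adaptive_select("capital france", units)
complex_ = adaptive_select("capital france currency", units)
```

The simpler query stops after one unit while the more demanding one keeps all three, so the effective compression rate varies per input rather than being fixed in advance.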

[24] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Baptiste Lefort, Eric Benhamou, Beatrice Guez, Jean-Jacques Ohana, Ethan Setrouk, Alban Etienne

Main category: cs.CL

TL;DR: A hierarchical framework combines LLMs and DRL for portfolio optimization, achieving 26% annualized return and outperforming benchmarks.

Details

Motivation: To integrate sentiment signals from financial news with traditional market indicators for better portfolio performance.

Method: Three-tier architecture: base RL agents process hybrid data, meta-agents aggregate decisions, and a super-agent merges decisions using market and sentiment data.

Result: 26% annualized return and Sharpe ratio of 1.2, outperforming benchmarks.

Conclusion: The framework offers scalable cross-modal integration, hierarchical stability, and open-source reproducibility.

Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
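
For readers unfamiliar with the two headline metrics, one common way to compute an annualized return and a Sharpe ratio from periodic returns is sketched below. The abstract does not state the exact convention used, so the 252 trading periods per year, the zero default risk-free rate, and the sample returns are all assumptions for illustration.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def annualized_return(returns: Sequence[float],
                      periods_per_year: int = 252) -> float:
    """Geometric annualized return from simple per-period returns."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (periods_per_year / len(returns)) - 1.0


def sharpe_ratio(returns: Sequence[float],
                 risk_free_rate: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample std. dev."""
    per_period_rf = risk_free_rate / periods_per_year
    excess = [r - per_period_rf for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


# Made-up daily returns, purely for demonstration.
daily = [0.01, -0.005, 0.002, 0.007]
ann = annualized_return(daily)
sharpe = sharpe_ratio(daily)
```

Under this convention, a Sharpe ratio of 1.2 means the annualized mean excess return is 1.2 times the annualized volatility of those returns.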

[25] Augmented Vision-Language Models: A Systematic Review

Anthony C Davis, Burhan Sadiq, Tianmin Shu, Chien-Ming Huang

Main category: cs.CL

TL;DR: The paper reviews neural-symbolic systems integrating VLMs with external symbolic systems to enhance reasoning, interpretability, and adaptability.

Details

Motivation: Current visual-language models lack interpretability, adaptability, and logical reasoning, prompting the need for neural-symbolic integration.

Method: Systematic literature review categorizing techniques for improving visual-language understanding via neural-symbolic systems.

Result: Neural-symbolic systems offer interpretable outputs, better reasoning, and reduced retraining needs.

Conclusion: Integrating VLMs with symbolic systems is a promising solution for advancing visual-language understanding.

Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural-symbolic systems that can enhance reasoning and memory abilities. These neural-symbolic systems provide more interpretable explanations for their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

[26] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Jingwei Zhao, Yuhua Wen, Qifei Li, Minchi Hu, Yingying Zhou, Jingyao Xue, Junyang Wu, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li

Main category: cs.CL

TL;DR: Survey of deep learning methods for intent recognition, focusing on multimodal approaches and Transformer-based models.

Details

Motivation: To address the growing need for natural human-computer interaction by advancing intent recognition beyond text to include multimodal data.

Method: Reviews unimodal to multimodal techniques, datasets, methodologies, and applications in intent recognition.

Result: Highlights breakthroughs from Transformer-based models and identifies current challenges.

Conclusion: Provides insights into multimodal intent recognition (MIR) and suggests future research directions.

Abstract: Intent recognition aims to identify users’ underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

[27] OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

Zhangcheng Qiang, Kerry Taylor, Weiqing Wang, Jing Jiang

Main category: cs.CL

TL;DR: A new benchmark dataset, OAEI-LLM-T, is introduced to address hallucinations in LLM-based ontology matching (OM) tasks, derived from seven TBox datasets, categorizing OM-specific hallucinations into two main and six sub-categories.

Details

Motivation: Hallucinations in LLMs pose challenges for OM tasks, necessitating a structured approach to evaluate and mitigate them.

Method: The dataset OAEI-LLM-T is created from seven TBox datasets, capturing hallucinations from ten LLMs performing OM tasks, categorized into two primary and six sub-categories.

Result: The dataset aids in building an LLM leaderboard for OM tasks and fine-tuning LLMs for OM applications.

Conclusion: OAEI-LLM-T provides a valuable resource for evaluating and improving LLM performance in OM tasks by addressing hallucinations.

Abstract: Hallucinations are often inevitable in downstream tasks using large language models (LLMs). To tackle the substantial challenge of addressing hallucinations for LLM-based ontology matching (OM) systems, we introduce a new benchmark dataset OAEI-LLM-T. The dataset evolves from seven TBox datasets in the Ontology Alignment Evaluation Initiative (OAEI), capturing hallucinations of ten different LLMs performing OM tasks. These OM-specific hallucinations are organised into two primary categories and six sub-categories. We showcase the usefulness of the dataset in constructing an LLM leaderboard for OM tasks and for fine-tuning LLMs used in OM tasks.

[28] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Kathleen Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II

Main category: cs.CL

TL;DR: The paper addresses challenges in deriving operational intelligence from confidential data, focusing on NLP and LLM tools for Knowledge Graph construction in aviation maintenance. It evaluates 16 tools, highlights performance limitations, and provides recommendations for trusted applications.

Details

Motivation: The need to balance data confidentiality with integration for operational intelligence, especially in mission-critical industries like aviation, drives this research.

Method: The study breaks down Knowledge Extraction into functional components (NER, Coreference Resolution, etc.), evaluates 16 NLP tools and LLMs, and uses a public FAA dataset for zero-shot performance testing.

Result: Significant performance limitations of NLP and LLM tools in confidential environments are observed, raising concerns about their Technical Readiness Level for aviation.

Conclusion: The paper recommends enhancing trust in NLP/LLM tools and provides an open-source dataset for further evaluation in mission-critical applications.

Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

[29] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Md Talha Mohsin

Main category: cs.CL

TL;DR: This paper compares five leading LLMs (GPT, Claude, Perplexity, Gemini, DeepSeek) in FinNLP tasks using 10-K filings, evaluating performance via human annotation, automated metrics, and diagnostics. GPT outperforms others in coherence and relevance.

Details

Motivation: Systematic comparisons of LLMs in financial analysis are lacking despite their growing influence, prompting this study.

Method: The study evaluates five LLMs using domain-specific prompts and three methodologies: human annotation, automated metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics.

Result: GPT performs best in coherence and relevance; Claude and Perplexity follow, while Gemini and DeepSeek show more variability. Outputs are sensitive to prompt wording and source material.

Conclusion: GPT is the most reliable LLM for FinNLP tasks, but performance varies with prompts and data, highlighting the need for careful implementation.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the ‘Magnificent Seven’ technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
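For readers unfamiliar with the lexical metrics named above, the following is a minimal sketch of Jaccard and cosine similarity over whitespace tokens. The tokenizer and the function names are illustrative assumptions; the abstract does not say how the paper tokenizes text or builds its vectors, and ROUGE is not covered here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not the paper's)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the sets of tokens."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    x = "revenue grew in 2023"
    y = "revenue fell in 2023"
    print(jaccard(x, y), cosine(x, y))
```

Both scores only measure word overlap, so two answers with opposite meaning ("grew" vs. "fell") can still score highly, which is one reason to pair such metrics with human annotation as the study does.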

[30] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Jinkun Zhao, Yuanshuai Wang, Xingjian Zhang, Ruibo Chen, Xingchuang Liao, Junle Wang, Lei Huang, Kui Zhang, Wenjun Wu

Main category: cs.CL

TL;DR: The paper proposes CoE-Ops, a collaboration-of-expert framework for AIOps, integrating a large language model task classifier and retrieval-augmented generation, showing significant improvements in task routing and problem resolution.

Details

Motivation: Existing AIOps models are limited to specific tasks due to domain constraints. Combining multiple models, inspired by ensemble learning and LLM training, can enhance efficiency.

Method: Introduces CoE-Ops with a general-purpose LLM task classifier and retrieval-augmented generation for handling diverse AIOps tasks.

Result: CoE-Ops improves routing accuracy for high-level AIOps tasks by 72% over existing CoE methods, raises problem-resolution accuracy by up to 8% over single AIOps models, and outperforms larger-scale MoE models by up to 14%.

Conclusion: CoE-Ops effectively addresses AIOps challenges by leveraging collaboration of experts, demonstrating superior performance in task routing and problem-solving.

Abstract: With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operational requirements of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, as has been shown both in earlier ensemble learning and in the recent LLM training domain. Inspired by these works, to address similar challenges in AIOps, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework’s capability in handling Question-Answering tasks at both the high level (code, build, test, etc.) and the low level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.

Sumit Soman, H. G. Ranjani, Sujoy Roychowdhury, Venkata Dharma Surya Narayana Sastry, Akshat Jain, Pranav Gangrade, Ayaaz Khan

Main category: cs.CL

TL;DR: The paper proposes integrating graph representations of flowcharts from Visual Language Models (VLMs) into text-based RAG systems to improve QA performance for technical documents, particularly in telecom.

Details

Motivation: Text-based RAG systems struggle with QA tasks involving answers in figures like flowcharts. The study aims to bridge this gap by leveraging VLMs for better retrieval.

Method: The approach involves processing technical documents, classifying image types, creating graph representations, and integrating them with text embeddings for retrieval. A telecom QA dataset is used for benchmarking.

Result: Graph representations from fine-tuned VLMs show lower edit distance to ground truth, indicating robustness. The hybrid approach improves retrieval performance without requiring VLMs during inference.

Conclusion: The method enhances QA for technical documents by combining graph and text embeddings, offering cost-effective deployment without compromising performance.

Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual Large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark this approach on a QA dataset created from proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
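The abstract does not specify which edit distance is used or how graphs are serialized. As a rough illustration only, the sketch below computes the classic Levenshtein distance between two sequences; representing a flowchart as a list of edge strings is a hypothetical choice made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists).

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `a` into `b`.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical edge-list serializations of a flowchart.
    truth = ["Start->Check", "Check->Retry", "Check->Done"]
    predicted = ["Start->Check", "Check->Done"]
    print(edit_distance(predicted, truth))
```

A lower distance to the ground-truth serialization means fewer missing or wrong edges in the extracted graph.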

[32] PARROT: An Open Multilingual Radiology Reports Dataset

Bastien Le Guellec, Kokou Adambounou, Lisa C Adams, Thibault Agripnidis, Sung Soo Ahn, Radhia Ait Chalal, Tugba Akinci D Antonoli, Philippe Amouyel, Henrik Andersson, Raphael Bentegeac, Claudio Benzoni, Antonino Andrea Blandino, Felix Busch, Elif Can, Riccardo Cau, Armando Ugo Cavallo, Christelle Chavihot, Erwin Chiquete, Renato Cuocolo, Eugen Divjak, Gordana Ivanac, Barbara Dziadkowiec Macek, Armel Elogne, Salvatore Claudio Fanni, Carlos Ferrarotti, Claudia Fossataro, Federica Fossataro, Katarzyna Fulek, Michal Fulek, Pawel Gac, Martyna Gachowska, Ignacio Garcia Juarez, Marco Gatti, Natalia Gorelik, Alexia Maria Goulianou, Aghiles Hamroun, Nicolas Herinirina, Krzysztof Kraik, Dominik Krupka, Quentin Holay, Felipe Kitamura, Michail E Klontzas, Anna Kompanowska, Rafal Kompanowski, Alexandre Lefevre, Tristan Lemke, Maximilian Lindholz, Lukas Muller, Piotr Macek, Marcus Makowski, Luigi Mannacio, Aymen Meddeb, Antonio Natale, Beatrice Nguema Edzang, Adriana Ojeda, Yae Won Park, Federica Piccione, Andrea Ponsiglione, Malgorzata Poreba, Rafal Poreba, Philipp Prucker, Jean Pierre Pruvo, Rosa Alba Pugliesi, Feno Hasina Rabemanorintsoa, Vasileios Rafailidis, Katarzyna Resler, Jan Rotkegel, Luca Saba, Ezann Siebert, Arnaldo Stanzione, Ali Fuat Tekin, Liz Toapanta Yanchapaxi, Matthaios Triantafyllou, Ekaterini Tsaoulia, Evangelia Vassalou, Federica Vernuccio, Johan Wasselius, Weilang Wang, Szymon Urban, Adrian Wlodarczak, Szymon Wlodarczak, Andrzej Wysocki, Lina Xu, Tomasz Zatonski, Shuhang Zhang, Sebastian Ziegelmayer, Gregory Kuchcinski, Keno K Bressem

Main category: cs.CL

TL;DR: PARROT is a multilingual, open-access dataset of fictional radiology reports for NLP testing, validated by a human vs. AI differentiation study.

Details

Motivation: To create a large, multicentric, open-access dataset for testing NLP applications in radiology without privacy constraints.

Method: Radiologists contributed fictional reports with metadata. A study compared human and AI-generated report detection accuracy.

Result: 2,658 reports in 13 languages were collected. Participants achieved 53.9% accuracy in distinguishing human vs. AI reports, with radiologists performing better.

Conclusion: PARROT enables NLP development across linguistic and clinical boundaries, offering privacy-free validation.

Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

[33] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Rui Jiao, Yue Zhang, Jinku Li

Main category: cs.CL

TL;DR: RELIANCE is a framework to improve factual accuracy in LLMs’ reasoning steps, combining fact-checking, reinforcement learning, and interpretability modules. It shows significant improvements in factual robustness while maintaining benchmark performance.

Details

Motivation: Addressing factual inaccuracies in LLMs' reasoning steps, which can mislead users in high-stakes domains like healthcare and legal analysis.

Method: Integrates a fact-checking classifier, GRPO reinforcement learning, and a mechanistic interpretability module.

Result: Improves factual robustness by up to 49.90%, with maintained or improved benchmark performance.

Conclusion: RELIANCE enhances factual accuracy in reasoning and provides insights for future training methodologies.

Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.

[34] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

Paul Minchella, Loïc Verlingue, Stéphane Chrétien, Rémi Vaucher, Guillaume Metzler

Main category: cs.CL

TL;DR: SigBERT is a temporal survival analysis framework for EHR data, using signature extraction and LASSO-penalized Cox models to improve risk estimation.

Details

Motivation: Existing survival analysis methods struggle with textual data complexity, especially sequential clinical reports.

Method: SigBERT processes timestamped reports by averaging word embeddings, applies signature extraction for temporal dynamics, and integrates features into a LASSO-Cox model.

Result: Achieved a C-index of 0.75 on an oncology dataset, showing improved risk estimation.

Conclusion: SigBERT advances narrative-based survival analysis by effectively leveraging sequential medical data.

Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
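To make the reported C-index concrete, here is a minimal sketch of Harrell's concordance index on toy data. It is a generic textbook formulation, not the paper's evaluation code; the function name and the toy values are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk score (higher = expected earlier event)

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. It is concordant when
    the model gave i the higher risk; tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.75 means roughly three out of four comparable patient pairs are ordered correctly.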

[35] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Shirley V Wang, Georg Hahn, Sushama Kattinakere Sreedhara, Mufaddal Mahesri, Haritha S. Pillai, Rajendra Aldis, Joyce Lii, Sarah K. Dutcher, Rhoda Eniafe, Jamal T. Jones, Keewan Kim, Jiwei He, Hana Lee, Sengwee Toh, Rishi J Desai, Jie Yang

Main category: cs.CL

TL;DR: The paper introduces an efficient validation process for code-based algorithms in claims databases, using NLP and adaptive sampling to reduce manual chart review time and resources.

Details

Motivation: To enhance the robustness of database study results by validating code-based algorithms more efficiently, addressing the time and resource constraints of manual chart reviews.

Method: Combines NLP to speed up human chart reviews and a multi-wave adaptive sampling approach with pre-defined stopping criteria to optimize resource use.

Result: NLP reduced review time by 40%, and adaptive sampling could have avoided 77% of chart reviews without significantly compromising precision.

Conclusion: The proposed method enables more routine validation of algorithms, improving the reliability of findings from database studies.

Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
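The abstract does not state the actual stopping criteria, so the following is only a sketch of the general idea: review charts in waves and stop once the confidence interval around the algorithm's positive predictive value (PPV) is narrow enough. The Wilson interval, the wave size, and the width threshold are all assumptions chosen for illustration.

```python
import math


def wilson_half_width(successes, n, z=1.96):
    """Half-width of the Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z * z / n
    return z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom


def review_in_waves(labels, wave_size, max_half_width):
    """Simulate multi-wave chart review with a precision-based stop.

    labels         : chart-review outcomes in review order
                     (1 = outcome confirmed, 0 = not confirmed)
    wave_size      : charts reviewed per wave (hypothetical parameter)
    max_half_width : stop once the 95% interval half-width is at or
                     below this value (hypothetical criterion)

    Returns (charts_reviewed, estimated_ppv).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    reviewed = 0
    confirmed = 0
    while reviewed < len(labels):
        wave = labels[reviewed:reviewed + wave_size]
        reviewed += len(wave)
        confirmed += sum(wave)
        if wilson_half_width(confirmed, reviewed) <= max_half_width:
            break  # estimate is precise enough; skip remaining charts
    return reviewed, confirmed / reviewed


if __name__ == "__main__":
    charts = [1, 1, 1, 0] * 50  # 200 toy charts with a true PPV of 0.75
    print(review_in_waves(charts, wave_size=20, max_half_width=0.10))
```

In this toy run the rule stops well before all 200 charts are reviewed, which is the kind of saving the paper quantifies for its own case study.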

[36] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Naomi Omeonga wa Kayembe

Main category: cs.CL

TL;DR: The paper redefines arbitrariness as a functional mechanism in human systems, extending Saussure’s linguistic concept to law and social dynamics, and formalizes it using Shannon’s entropy model.

Details

Motivation: To challenge the conflation of arbitrariness with injustice and demonstrate its foundational role in structuring systems like language, law, and social interactions.

Method: Extends Saussure’s concept of arbitrariness to other domains, introduces the “Motivation -> Constatability -> Contestability” chain, and formalizes arbitrariness using Shannon’s entropy model (A = H(L|M)).

Result: Arbitrariness is shown as a deliberate design protecting authority from accountability, while also being central to control and care in interpersonal relations.

Conclusion: Arbitrariness is a neutral, functional operator with cross-domain applicability, offering insights for human systems and AI explainability.

Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure’s concept of l’arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the “Motivation -> Constatability -> Contestability” chain, arguing that motivation functions as a crucial interface rendering an act’s logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like “immotivization” or “Conflict Lateralization” (exemplified by “the blur of the wolf drowned in the fish”), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon’s entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
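The formula A = H(L|M) is a conditional entropy. The abstract does not define L and M, so the sketch below treats them as two generic discrete variables, with M as the conditioning variable, and estimates H(L|M) in bits from observed (m, l) pairs. The function name and the toy data are assumptions.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Estimate H(L|M) in bits from a sequence of (m, l) observations.

    H(L|M) = - sum over (m, l) of p(m, l) * log2( p(m, l) / p(m) )
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one observation")
    joint = Counter(pairs)                   # counts of each (m, l)
    marginal = Counter(m for m, _ in pairs)  # counts of each m
    h = 0.0
    for (m, _), count in joint.items():
        h -= (count / n) * math.log2(count / marginal[m])
    return h


if __name__ == "__main__":
    # L fully determined by M: nothing left to explain once M is known.
    determined = [("m1", "a"), ("m1", "a"), ("m2", "b"), ("m2", "b")]
    # L unrelated to M: knowing M leaves one full bit of uncertainty.
    unrelated = [("m1", "a"), ("m1", "b"), ("m2", "a"), ("m2", "b")]
    print(conditional_entropy(determined), conditional_entropy(unrelated))
```

Under this reading, A is zero when M fully accounts for L and grows as L becomes less predictable from M.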

[37] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma, Wei Tao, Yiwen Guo

Main category: cs.CL

TL;DR: The paper introduces a benchmark dataset for evaluating Spoken Dialogue Models (SDMs) to address gaps in understanding their effectiveness compared to text-based LLMs, focusing on challenges like ambiguity and context-dependency in spoken conversations.

Details

Motivation: The lack of comprehensive research on SDMs' practical effectiveness in emulating human conversations compared to text-based LLMs, especially given the complexity of spoken dialogue.

Method: Creation of a benchmark dataset with 1,079 instances in English and Chinese, paired with an LLM-based evaluation method aligned with human judgment.

Result: The dataset and evaluation method enable a thorough assessment of SDMs’ performance in handling spoken dialogue challenges.

Conclusion: The work provides a foundation for better understanding and improving SDMs by addressing practical challenges in spoken conversations.

Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users’ spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterographs, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

[38] Math Natural Language Inference: this should be easy!

Valeria de Paiva, Qiyue Gao, Hai Hu, Pavel Kovalev, Yikang Liu, Lawrence S. Moss, Zhiheng Qian

Main category: cs.CL

TL;DR: The paper investigates whether LLMs can perform NLI tasks on mathematical texts (Math NLI), using human- and LLM-generated hypotheses, and evaluates their performance and consistency.

Details

Motivation: To assess the capability of contemporary LLMs in handling NLI tasks involving mathematical language and to compare their performance with human-labeled data.

Method: Constructed a Math NLI corpus with human-provided hypotheses and labels, and another with LLM-generated hypotheses. Evaluated performance and inter-group consistency of LLMs.

Result: Positive: Majority vote of LLMs can match human-labeled data in some settings. Negative: LLMs struggle with mathematical language and basic inferences, though less prone to hypothesis-only errors than older models.

Conclusion: LLMs show promise in Math NLI but still face challenges. The provided corpora aim to support future research in this area.

Abstract: We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only “inference” in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
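The majority-vote finding can be illustrated with a small sketch. The label names, the tie handling, and the helper names below are assumptions; the abstract does not describe how the paper aggregates votes or breaks ties.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels)
    top_count = max(counts.values())
    winners = [label for label, c in counts.items() if c == top_count]
    return winners[0] if len(winners) == 1 else None


def agreement_with_gold(votes_per_item, gold):
    """Fraction of items where the majority label equals the gold label.

    Items with a tied vote count as disagreements.
    """
    if len(votes_per_item) != len(gold):
        raise ValueError("votes and gold labels must align")
    hits = sum(1 for votes, g in zip(votes_per_item, gold)
               if majority_vote(votes) == g)
    return hits / len(gold)


if __name__ == "__main__":
    # Three hypothetical models labelling three premise-hypothesis pairs.
    votes = [
        ["entailment", "entailment", "neutral"],
        ["contradiction", "neutral", "contradiction"],
        ["neutral", "entailment", "contradiction"],  # three-way tie
    ]
    gold = ["entailment", "contradiction", "neutral"]
    print(agreement_with_gold(votes, gold))
```

Counting ties as misses is a conservative choice; a real study might instead fall back to a designated model or drop tied items.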

[39] Exploring In-Context Learning for Frame-Semantic Parsing

Diego Garat, Guillermo Moncecchi, Dina Wonsever

Main category: cs.CL

TL;DR: The paper explores using In-Context Learning (ICL) with Large Language Models (LLMs) for Frame Semantic Parsing (FSP) without fine-tuning, achieving competitive results.

Details

Motivation: To investigate the feasibility of using ICL for FSP tasks without the need for model fine-tuning, leveraging the FrameNet database.

Method: Automatically generates task-specific prompts for Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) using frame definitions and examples, tested on six LLMs.

Result: Achieves F1 scores of 94.3% for FI and 77.4% for FSRL on a subset of violent event frames.

Conclusion: ICL is a practical and effective alternative to fine-tuning for domain-specific FSP tasks.

Abstract: Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.

[40] Context-aware Rotary Position Embedding

Ali Veisi, Delaram Fartoot, Hamidreza Amirzadeh

Main category: cs.CL

TL;DR: CARoPE is a dynamic, context-aware extension of RoPE for positional encoding in Transformers, improving performance and efficiency.

Details

Motivation: RoPE's static frequency patterns limit context-sensitive modeling, prompting the need for a dynamic solution.

Method: CARoPE dynamically generates head-specific frequency patterns using token embeddings and integrates them into RoPE.

Result: CARoPE outperforms RoPE and baselines, achieving lower perplexity and faster training on GPT-2 variants.

Conclusion: CARoPE is a scalable, expressive, and efficient upgrade for positional encoding in Transformers.

Abstract: Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
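As background, the sketch below implements standard RoPE on a single vector and checks the property that makes it a relative encoding: the query-key dot product depends only on the position offset. The optional `phase_shifts` argument is a hypothetical hook showing where an input-dependent phase could enter; it is not CARoPE's actual formulation, which the abstract does not spell out.

```python
import math


def rope_rotate(vec, position, base=10000.0, phase_shifts=None):
    """Apply rotary position embedding to one vector of even length.

    Each pair (vec[2i], vec[2i+1]) is rotated by position * theta_i,
    with theta_i = base ** (-2i / d) as in standard RoPE.
    `phase_shifts` (hypothetical) adds one extra angle per pair.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = position * theta
        if phase_shifts is not None:
            angle += phase_shifts[i]
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Same offset (2) at different absolute positions gives the same score.
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

With fixed frequencies the attention score is a function of the offset alone; making the angles depend on token content, as CARoPE proposes, trades that strict invariance for context sensitivity.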

[41] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: SMART-Editor is a framework for global coherence in layout and content editing across structured and unstructured domains, outperforming baselines with reward-guided strategies.

Details

Motivation: To address the limitation of prior models that perform only local edits, ensuring global coherence in edits across diverse domains.

Method: Uses Reward-Refine (inference-time refinement) and RewardDPO (training-time preference optimization) for reward-aligned edits.

Result: Outperforms baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images.

Conclusion: Reward-guided planning is effective for semantically consistent and visually aligned edits, validated by benchmarks and evaluations.

Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

[42] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

Jeffrey Eben, Aitzaz Ahmad, Stephen Lau

Main category: cs.CL

TL;DR: A component-based retrieval architecture improves scalability for LLM-based natural language interfaces in enterprise databases by decomposing schemas and metadata into semantic units, outperforming baselines without fine-tuning.

Details

Motivation: Scaling LLM-based natural language interfaces to enterprise-level data catalogs is challenging due to reliance on domain-specific fine-tuning and underutilized semantic context in database metadata.

Method: Introduces a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, separately indexed for targeted retrieval, prioritizing table identification and column-level information.

Result: The method maintains high recall and accuracy, outperforming baselines on massive databases with varying structure and metadata, without requiring specialized fine-tuning.

Conclusion: The solution enables practical text-to-SQL systems deployable across diverse enterprise settings, addressing scalability gaps in natural language database interfaces.

Abstract: Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.
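
The idea of indexing schema components separately and capping the number of retrieved tables can be sketched as follows. This is a minimal illustration, not the paper's system: the token-overlap score stands in for whatever retriever the authors use, and `build_index`, `retrieve_tables`, `table_budget`, and the example schema are all hypothetical names and data.

```python
from collections import defaultdict

def tokenize(text):
    # Split snake_case identifiers so "total_amount" matches "total amount".
    return set(text.lower().replace("_", " ").split())

def overlap(query_tokens, text):
    tokens = tokenize(text)
    return len(query_tokens & tokens) / len(tokens) if tokens else 0.0

def build_index(schema):
    """Flatten a schema into separately indexed (table, kind, text) units."""
    units = []
    for table, info in schema.items():
        units.append((table, "table_name", table))
        if info.get("description"):
            units.append((table, "table_description", info["description"]))
        for column in info.get("columns", []):
            units.append((table, "column", column))
    return units

def retrieve_tables(question, units, table_budget=2):
    """Score each table by its best-matching unit; keep at most table_budget."""
    q = tokenize(question)
    scores = defaultdict(float)
    for table, _kind, text in units:
        scores[table] = max(scores[table], overlap(q, text))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked if scores[t] > 0][:table_budget]

schema = {
    "orders": {"description": "customer purchase orders",
               "columns": ["order_id", "customer_id", "total_amount"]},
    "customers": {"description": "registered customer accounts",
                  "columns": ["customer_id", "email"]},
    "warehouses": {"description": "storage locations",
                   "columns": ["warehouse_id", "city"]},
}
print(retrieve_tables("total amount per customer", build_index(schema)))
# prints ['orders', 'customers']
```

Because columns are indexed as their own units, a table can be retrieved through a matching column even when its name and description share no words with the question.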

[43] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Xinwei Wu, Haojie Li, Hongyu Liu, Xinyu Ji, Ruohan Li, Yule Chen, Yigeng Zhang

Main category: cs.CL

TL;DR: The paper investigates how large language models (LLMs) handle ambiguous Chinese text, revealing their fragility and limitations compared to humans.

Details

Motivation: To assess the trustworthiness of LLMs in processing ambiguous narrative text, particularly in Chinese, and identify their shortcomings.

Method: Created a benchmark dataset of ambiguous sentences with disambiguated pairs, categorized into 3 main and 9 subcategories, and tested LLMs on it.

Result: LLMs struggle with ambiguity: they fail to distinguish ambiguous from unambiguous text, overconfidently assign single meanings, and overthink interpretations.

Conclusion: Current LLMs have fundamental limitations in handling ambiguity, necessitating improved approaches for real-world applications.

Abstract: In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.

[44] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Ananya Sadana, Yash Kumar Lal, Jiawei Zhou

Main category: cs.CL

TL;DR: ISO-Bench evaluates multimodal models’ ability to infer causal dependencies between visual and text steps, revealing poor performance (best F1: 0.57) compared to humans (0.98 F1).

Details

Motivation: To address the challenge of understanding causal relationships across modalities in real-world environments.

Method: Introduces ISO-Bench, a benchmark where models decide if a visual step occurs before or after a text snippet. Evaluates ten vision-language models.

Result: Models perform poorly (best F1: 0.57), with chain-of-thought reasoning offering slight improvement (up to 0.62 F1), far behind humans (0.98 F1).

Conclusion: Highlights the need for better causal understanding in multimodal models and suggests directions for improvement.

Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

[45] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi

Main category: cs.CL

TL;DR: The paper studies implicit user feedback from interaction logs to improve language models, finding feedback content and prompt quality impact performance.

Details

Motivation: To improve LMs continuously without disruptive direct feedback by analyzing implicit feedback from interaction logs.

Method: Analyzed user feedback in WildChat and LMSYS datasets, studying feedback occurrence and harvesting learning signals.

Result: Feedback content improves performance on short questions (MTBench) but not complex ones (WildBench). Prompt quality affects feedback usefulness.

Conclusion: Implicit feedback has potential but limitations, with effectiveness tied to feedback content and initial prompt quality.

Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user’s initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

[46] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum

Main category: cs.CL

TL;DR: GEAK, an AI-driven framework, generates efficient GPU kernels for AMD hardware using LLMs, outperforming baselines in correctness and speed.

Details

Motivation: The growing demand for scalable, hardware-optimized GPU kernels in AI workloads drives the need for automated, high-performance solutions.

Method: GEAK uses LLMs with Reflexion-style feedback to generate Triton-based GPU kernels, evaluated on benchmarks for AMD GPUs.

Result: GEAK achieved 63% correctness and 2.59X speedup over baselines, demonstrating superior performance.

Conclusion: GEAK shows promise in democratizing expert-level kernel performance and accelerating hardware adoption.

Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness of up to 63% and execution speedup of up to 2.59X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
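
A Reflexion-style loop of the kind mentioned above can be sketched generically: generate a candidate, evaluate it, and feed the failure message back into the next attempt. This is only an assumed outline, not GEAK's pipeline; `generate` and `evaluate` are hypothetical callables (an LLM call and a kernel test harness would take their places), and no Triton API is used here.

```python
def refine_until_correct(generate, evaluate, max_rounds=3):
    """Generate, test, and retry with feedback until a candidate passes.

    generate(feedback) -> candidate   (feedback is None on the first round)
    evaluate(candidate) -> (ok, feedback_message)
    Returns (passing candidate or None, list of (candidate, ok) attempts).
    """
    feedback = None
    history = []
    for _ in range(max_rounds):
        candidate = generate(feedback)
        ok, feedback = evaluate(candidate)
        history.append((candidate, ok))
        if ok:
            return candidate, history
    return None, history
```

The `max_rounds` cap is the knob that trades extra inference-time compute for a higher chance of reaching a passing candidate.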

[47] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Yunhao Liang, Ruixuan Ying, Takuya Taniguchi, Zhe Cui

Main category: cs.CL

TL;DR: The paper introduces a method using Negative samples to improve Positive sample selection for few-shot in-context learning (ICL), enhancing performance beyond methods relying only on Positive samples.

Details

Motivation: Existing ICL methods focus on Positive samples, ignoring the potential of Negative samples to improve example selection and mitigate biases.

Method: Construct Positive and Negative sample corpora using Zero-Shot-CoT, then use semantic similarity to select and combine examples from both for ICL demonstrations.

Result: The method outperforms approaches using only Positive samples, showing Negative samples aid in better Positive example selection.

Conclusion: Incorporating Negative samples improves ICL performance by refining Positive example selection, validating their utility in the learning process.

Abstract: Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-CoT. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, and then concatenate them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar Positive examples for context, validating that the additional information in Negative samples aids in enhancing ICL performance through improved Positive sample selection.
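
The selection procedure described in the abstract can be sketched in a few lines. This is an illustration under assumptions: Jaccard word overlap replaces the semantic similarity model, and `select_demonstrations`, `top_k`, and the toy corpora are hypothetical.

```python
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def top_k(query, corpus, k):
    # sorted() is stable, so ties keep their corpus order.
    return sorted(corpus, key=lambda ex: -jaccard(query, ex))[:k]

def select_demonstrations(query, positives, negatives, k=1):
    """Pick Positive demos directly, then more Positives via nearby Negatives."""
    demos = list(top_k(query, positives, k))      # step 1: Positives near the query
    near_negatives = top_k(query, negatives, k)   # step 1: Negatives near the query
    for negative in near_negatives:               # step 2: Positives near those Negatives
        for positive in top_k(negative, positives, k):
            if positive not in demos:
                demos.append(positive)            # step 3: concatenate
    return demos

positives = ["area of a circle", "perimeter of a square field", "speed of a train"]
negatives = ["perimeter of a square garden", "speed of a car"]
print(select_demonstrations("area of a round field", positives, negatives))
# prints ['area of a circle', 'speed of a train']
```

The Negative examples are used only as a bridge for retrieval in this sketch; only Positive examples end up in the demonstration list, matching the description above.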

[48] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Carolina Zheng, Nicolas Beltran-Velez, Sweta Karlekar, Claudia Shi, Achille Nazaret, Asif Mallik, Amir Feder, David M. Blei

Main category: cs.CL

TL;DR: MTMs use sparse autoencoders to create interpretable topics, outperforming traditional and neural topic models in coherence and enabling controllable text generation.

Details

Motivation: Traditional topic models fail to capture abstract semantics due to bag-of-words limitations, and neural variants are constrained by word-list topics.

Method: MTMs leverage sparse autoencoders to define topics over semantically rich features, enabling deeper conceptual themes and controllable generation.

Result: MTMs match or exceed baselines in coherence, are preferred in evaluations, and allow effective steering of LLM outputs.

Conclusion: MTMs offer a superior, interpretable, and controllable approach to topic modeling.

Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

[49] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Daeyong Kwon, SeungHeon Doh, Juhan Nam

Main category: cs.CL

TL;DR: MusT-RAG enhances LLMs for music tasks using RAG, outperforming fine-tuning with a specialized database (MusWikiDB).

Details

Motivation: LLMs lack music-specific knowledge, limiting their effectiveness in music-related tasks.

Method: Proposes MusT-RAG, combining RAG with MusWikiDB for retrieval and fine-tuning.

Result: MusT-RAG outperforms fine-tuning, improving performance on MQA benchmarks.

Conclusion: MusT-RAG effectively adapts LLMs for music tasks, with MusWikiDB proving superior to general corpora.

Abstract: Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs’ effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs’ music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

[50] Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

Sophie Kearney, Shu Yang, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Jason Moore, Marylyn Ritchie, Li Shen

Main category: cs.CL

TL;DR: TAP-GPT adapts TableGPT2 for Alzheimer’s disease diagnosis using structured biomarker data, outperforming general-purpose LLMs and tabular foundation models.

Details

Motivation: Early and accurate diagnosis of Alzheimer's disease requires analyzing heterogeneous biomarkers, which LLMs can address with their multimodal and interpretable capabilities.

Method: The framework constructs few-shot tabular prompts, finetunes TableGPT2 with qLoRA, and leverages its tabular understanding for binary classification (AD vs. cognitively normal).

Result: TAP-GPT outperforms advanced general-purpose LLMs and a tabular foundation model in AD prediction tasks.

Conclusion: This is the first LLM application for tabular biomarker prediction, opening doors for LLM-driven biomedical informatics frameworks.

Abstract: Early and accurate diagnosis of Alzheimer’s disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer’s Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.
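
Constructing a few-shot tabular prompt, as described above, amounts to serializing labeled rows followed by an unlabeled query row. The sketch below is only one plausible serialization, not TAP-GPT's actual prompt format; `build_few_shot_prompt`, the column names, and the values are hypothetical and purely illustrative.

```python
def build_few_shot_prompt(examples, query, label_key="label"):
    """Serialize labeled rows plus one unlabeled query row as a text table."""
    columns = list(query.keys())
    rows = [" | ".join(columns + [label_key])]          # header row
    for example in examples:
        cells = [str(example[c]) for c in columns] + [str(example[label_key])]
        rows.append(" | ".join(cells))
    # The query row ends with "?" where the model should fill in the label.
    rows.append(" | ".join([str(query[c]) for c in columns] + ["?"]))
    instruction = "Predict the label (AD or CN) for the last row."
    return instruction + "\n" + "\n".join(rows)

# Made-up toy rows; not real patient data.
examples = [
    {"age": 72, "cognitive_score": 21, "label": "AD"},
    {"age": 68, "cognitive_score": 29, "label": "CN"},
]
print(build_few_shot_prompt(examples, {"age": 75, "cognitive_score": 23}))
```

The resulting string would be passed to the language model as its input; fine-tuning and inference could then share the same serialization.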

Sneha Oram, Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: The paper explores pragmatic reasoning in LLMs for mental health, introduces the P-ReMe dataset, and benchmarks models like Mistral and Qwen. It also studies stigma using StiPRompts, finding Claude-3.5-haiku most responsible.

Details

Motivation: To address gaps in explainability and dialogue discourse for mental health chatbots by investigating LLMs' pragmatic reasoning.

Method: Introduces P-ReMe dataset, defines pragmatic phenomena (implicature, presupposition), formulates tasks, and benchmarks models (Llama3.1, Mistral, MentaLLaMa, Qwen). Uses StiPRompts to study stigma with GPT-4o mini, Deepseek-chat, Claude-3.5-haiku.

Result: Mistral and Qwen excel in reasoning tasks. Claude-3.5-haiku handles stigma more responsibly than GPT-4o mini and Deepseek-chat.

Conclusion: LLMs show promise in mental health reasoning, with Claude-3.5-haiku leading in stigma handling. Future work could refine definitions and expand datasets.

Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.

[52] Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Shimanto Bhowmik, Tawsif Tashwar Dipto, Md Sazzad Islam, Sheryl Hsu, Tahsin Reasat

Main category: cs.CL

TL;DR: The paper investigates challenges in Bengali NLP, evaluates 10 LLMs on translated datasets, and identifies performance gaps and tokenization issues.

Details

Motivation: Bengali is underrepresented in NLP, and its unique linguistic structure and lack of standardized benchmarks hinder progress.

Method: Evaluated 10 LLMs on 8 translated datasets, analyzed errors, and studied tokenization efficiency.

Result: Performance gaps exist for Bengali vs. English, with smaller models and Mistral struggling. DeepSeek showed robustness. Tokenization efficiency impacts accuracy.

Conclusion: Improved datasets and evaluation methods are needed for multilingual NLP. The work aims to advance underrepresented language research.

Abstract: Bengali is an underrepresented language in NLP research, and it remains challenging due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
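
One common way to quantify "excessive tokenization" is tokens per word. The sketch below assumes that metric; the abstract does not say which measure the authors used. The `byte_tokenizer` is a toy stand-in for a tokenizer that falls back to raw UTF-8 bytes, not the tokenizer of any evaluated model.

```python
def tokens_per_word(texts, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words if total_words else 0.0

def byte_tokenizer(text):
    # Toy fallback: one token per UTF-8 byte, spaces dropped.
    return list(text.replace(" ", "").encode("utf-8"))

english = ["bengali language"]
# The same phrase in Bengali script, written with escapes (9 code points).
bengali = ["\u09ac\u09be\u0982\u09b2\u09be \u09ad\u09be\u09b7\u09be"]

print(tokens_per_word(english, byte_tokenizer))  # prints 7.5
print(tokens_per_word(bengali, byte_tokenizer))  # prints 13.5
```

Each Bengali code point occupies three bytes in UTF-8, so a tokenizer with little Bengali vocabulary can produce far more tokens per word than it does for English text of similar length.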

[53] Text-to-SQL Task-oriented Dialogue Ontology Construction

Renato Vukovic, Carel van Niekerk, Michael Heck, Benjamin Ruppik, Hsien-Chin Lin, Shutong Feng, Nurul Lubis, Milica Gasic

Main category: cs.CL

TL;DR: TeQoDO is a method for autonomously constructing task-oriented dialogue ontologies using LLMs without supervision, outperforming transfer learning and scaling well for larger datasets.

Details

Motivation: LLMs lack explainability and trustworthiness due to reliance on parametric knowledge. Task-oriented dialogue systems use external databases with explicit ontologies, but building these requires manual effort.

Method: TeQoDO uses an LLM to autonomously build a TOD ontology from scratch, leveraging its SQL programming capabilities and dialogue theory provided in prompts.

Result: TeQoDO outperforms transfer learning, scales to larger datasets (Wikipedia, ArXiv), and performs competitively on dialogue state tracking.

Conclusion: TeQoDO advances LLM explainability by enabling unsupervised ontology construction, with potential for broader applications.

Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

[54] Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su, Qingyuan Li, Hao Zhang, YuLei Qian, Yuchen Xie, Kehong Yuan

Main category: cs.CL

TL;DR: The paper identifies ‘Super Experts’ (SEs) in MoE LLMs, a small subset of experts critical for model performance, and analyzes their unique characteristics and impact.

Details

Motivation: Existing expert-level compression techniques lack a deep understanding of expert importance, prompting the study of SEs to fill this gap.

Method: The study identifies SEs through their rare but extreme activation outliers, analyzes their distribution and impact via pruning, and explores their role in attention mechanisms.

Result: SEs are crucial for model performance, especially in mathematical reasoning, and their pruning disrupts attention sinks, leading to poor outputs.

Conclusion: SEs play a vital role in MoE LLMs, and their study provides deeper insights into model mechanisms, with implications for future compression techniques.

Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model’s forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SE compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
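
Flagging experts by extreme activation outliers can be sketched as a simple thresholding step. The abstract does not state the authors' criterion, so the rule here (a maximum absolute activation more than `ratio` times the median across experts) is an assumption, as are the function name and the toy numbers.

```python
from statistics import median

def find_outlier_experts(max_abs_activation, ratio=10.0):
    """Return (layer, expert) keys whose peak activation dwarfs the median.

    max_abs_activation maps (layer, expert) -> max |activation| observed in
    that expert's down_proj output over some calibration data.
    """
    typical = median(max_abs_activation.values())
    return sorted(
        key for key, value in max_abs_activation.items()
        if typical > 0 and value > ratio * typical
    )

# Toy profile: one expert produces a massive activation.
profile = {(0, 0): 1.2, (0, 1): 0.9, (1, 0): 350.0, (1, 1): 1.1, (2, 3): 1.0}
print(find_outlier_experts(profile))  # prints [(1, 0)]
```

Using the median rather than the mean keeps the baseline from being inflated by the very outliers the check is meant to detect.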

[55] What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Alfio Ferrara, Sergio Picascia, Laura Pinnavaia, Vojimir Ranitovic, Elisabetta Rocchetti, Alice Tuveri

Main category: cs.CL

TL;DR: The study examines GPT-4o-mini’s implicit moderation of sensitive content, finding systematic sanitization and reduced derogatory language. It also compares LLMs’ zero-shot sensitivity classification with traditional methods.

Details

Motivation: To explore whether LLMs implicitly moderate sensitive content without explicit training, a less studied aspect compared to explicit moderation.

Method: Empirical analysis of GPT-4o-mini’s paraphrasing of sensitive content and evaluation of sensitivity shifts, alongside zero-shot classification performance comparison.

Result: GPT-4o-mini systematically moderates content, reducing derogatory and taboo language, and shows competitive zero-shot classification performance.

Conclusion: LLMs like GPT-4o-mini implicitly sanitize language effectively, offering potential for automated moderation without explicit training.

Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.

[56] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che

Main category: cs.CL

TL;DR: The paper introduces MPCC, a benchmark for evaluating multimodal planning in MLLMs, addressing gaps in current benchmarks by focusing on real-world tasks and complex constraints. Results show low performance across models, highlighting challenges in constraint-aware reasoning.

Details

Motivation: Current benchmarks lack the ability to assess multimodal real-world planning and handle complex constraints, limiting progress in multimodal reasoning.

Method: MPCC evaluates MLLMs on three real-world tasks (Flight, Calendar, Meeting Planning) with graded constraints (EASY, MEDIUM, HARD). Experiments test 13 advanced MLLMs.

Result: Closed-source models achieve 21.3% feasible plans; open-source models average below 11%. MLLMs struggle with constraint complexity and traditional prompting fails.

Conclusion: MPCC formalizes multimodal constraints, provides a rigorous evaluation framework, and underscores the need for advancements in constraint-aware reasoning for MLLMs.

Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.
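
To make the notion of a "feasible plan" concrete, the following is a minimal sketch of how budget, temporal, and spatial constraints might be checked for a flight itinerary. It is not the benchmark's evaluator: the `Leg` structure, the minutes-based time encoding, and the `is_feasible` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Leg:
    """One flight leg (hypothetical structure, not taken from MPCC)."""
    origin: str
    destination: str
    depart: int   # minutes since the start of the trip (assumed encoding)
    arrive: int   # minutes since the start of the trip (assumed encoding)
    price: float


def is_feasible(legs: List[Leg], budget: float) -> bool:
    """Return True only if the itinerary satisfies all three constraint types."""
    # Budget constraint: the total price must not exceed the budget.
    if sum(leg.price for leg in legs) > budget:
        return False
    # Every leg must arrive after it departs.
    if any(leg.arrive <= leg.depart for leg in legs):
        return False
    for prev, nxt in zip(legs, legs[1:]):
        # Temporal constraint: the next leg cannot leave before the previous one lands.
        if nxt.depart < prev.arrive:
            return False
        # Spatial constraint: the next leg must start where the previous one ended.
        if nxt.origin != prev.destination:
            return False
    return True


plan = [Leg("A", "B", 0, 120, 100.0), Leg("B", "C", 180, 300, 150.0)]
print(is_feasible(plan, budget=300.0))
print(is_feasible(plan, budget=200.0))
```

A plan that violates any single constraint is rejected, which is one reason feasibility rates can drop quickly as the number of constraints grows.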

[57] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Ailiang Lin, Zhuoyun Li, Kotaro Funakoshi

Main category: cs.CL

TL;DR: Causal2Vec enhances decoder-only LLMs for embedding tasks without altering their architecture or adding significant computational overhead, achieving state-of-the-art MTEB performance among models trained solely on publicly available retrieval datasets, with reduced sequence length and inference time.

Details

Motivation: Existing methods for embedding tasks with decoder-only LLMs either compromise semantic extraction by removing causal attention masks or increase computational costs with extra inputs.

Method: Causal2Vec uses a lightweight BERT-style model to pre-encode input text into a Contextual token, prepended to the LLM’s input sequence. It concatenates the last hidden states of Contextual and EOS tokens for the final embedding.

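Result: Causal2Vec achieves state-of-the-art performance on MTEB among models trained solely on publicly available retrieval datasets, reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.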

Conclusion: Causal2Vec effectively leverages decoder-only LLMs for embedding tasks without architectural changes or significant overhead, offering superior efficiency and performance.

Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model’s ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM’s input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
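
The pooling step described above (concatenating the last hidden states of the Contextual and EOS tokens) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function name `causal2vec_style_embedding` is hypothetical, and the sketch assumes the Contextual token sits at position 0 and the EOS token at the last position of the sequence.

```python
from typing import List


def causal2vec_style_embedding(hidden_states: List[List[float]]) -> List[float]:
    """Concatenate the Contextual-token and EOS-token hidden states.

    hidden_states holds one last-layer vector per input position.
    Assumption: index 0 is the prepended Contextual token, index -1 is EOS.
    """
    if len(hidden_states) < 2:
        raise ValueError("need at least a Contextual token and an EOS token")
    contextual = hidden_states[0]
    eos = hidden_states[-1]
    # The resulting embedding is twice as wide as a single hidden state.
    return contextual + eos


# Toy example with hidden size 2 and three positions.
states = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
print(causal2vec_style_embedding(states))
```

Compared with plain last-token pooling, which would return only the final vector, the concatenation keeps the pre-encoded whole-text signal alongside the EOS state.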

[58] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Peter Sandrini

Main category: cs.CL

TL;DR: The paper explores locally deployable, free language models as an alternative to cloud-based AI for translation, focusing on accessibility, privacy, and performance.

Details

Motivation: Address concerns about data privacy, security, and equitable access in cloud-based AI translation solutions.

Method: Evaluates three open-source models on CPU-based platforms, comparing them to commercial chatbots, focusing on functional performance.

Result: Local deployment offers benefits like enhanced data control and privacy, though it introduces challenges.

Conclusion: Supports democratizing AI for translators and small businesses by making LLMs more accessible and practical.

Abstract: The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

[59] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Yongbing Zhang, Fang Nan, Shengxiang Gao, Yuxin Huang, Kaiwen Tan, Zhengtao Yu

Main category: cs.CL

TL;DR: MRGSEM-Sum is an unsupervised multi-document summarization framework using multi-relational graphs and structural entropy minimization to address redundancy and complex relationships, outperforming existing methods.

Details

Motivation: Existing methods for multi-document summarization struggle with single-relational graphs and predefined cluster numbers, limiting their ability to handle rich relational information and adaptive redundancy reduction.

Method: The framework constructs a multi-relational graph for semantic and discourse relations, applies structural entropy minimization for adaptive clustering, and uses position-aware compression for summarization.

Result: Experiments on benchmark datasets show MRGSEM-Sum outperforms unsupervised methods and rivals supervised models, with human evaluations confirming high-quality summaries.

Conclusion: MRGSEM-Sum effectively addresses redundancy and relational complexity in multi-document summarization, achieving near-human performance.

Abstract: The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.

[60] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Salah Eddine Bekhouche, Azeddine Benlamoudi, Yazid Bounab, Fadi Dornaika, Abdenour Hadid

Main category: cs.CL

TL;DR: An enhanced Dense Passage Retrieval (DPR) framework for Arabic NLP, featuring a novel Attentive Relevance Scoring (ARS) to improve semantic relevance modeling and retrieval performance.

Details

Motivation: Arabic's complex morphology, optional diacritics, and dialect variations make it underrepresented in NLP research, necessitating tailored solutions.

Method: Proposes an ARS mechanism within a DPR framework, integrating pre-trained Arabic language models and architectural refinements.

Result: Improved retrieval performance and ranking accuracy for Arabic question answering.

Conclusion: The framework addresses Arabic NLP challenges effectively, with publicly available code for further research.

Abstract: Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is made publicly available at https://github.com/Bekhouche/APR.

[61] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Ante Wang, Yujie Lin, Jingyao Liu, Suhang Wu, Hao Liu, Xinyan Xiao, Jinsong Su

Main category: cs.CL

TL;DR: The paper introduces proactive critical thinking in AI, where models actively seek missing information to resolve queries, and evaluates it using GSM-MC and GSM-MCE benchmarks. Reinforcement learning significantly improves performance.

Details

Motivation: Prior work focused on passive critical thinking, rejecting flawed queries without constructive steps. This work aims to enhance AI's ability to collaborate with users by proactively addressing incomplete or misleading queries.

Method: The authors introduce GSM-MC and GSM-MCE benchmarks for assessing mathematical reasoning under incomplete or misleading conditions. They use reinforcement learning (RL) to improve models’ proactive critical thinking.

Result: Experiments show models struggle with proactive critical thinking, especially smaller ones, but RL significantly improves performance (e.g., Qwen3-1.7B’s accuracy rose from 0.15% to 73.98% on GSM-MC).

Conclusion: The work advances AI’s ability to collaborate with users through proactive critical thinking, demonstrating the effectiveness of RL in enhancing this capability.

Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to better resolve their queries. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B’s accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

[62] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri, Yerulan Kongrat, Adrian Santosh, Ruslan Tasmukhanov, Josemaria Vera, Muhammad Dehan Al Kautsar, Fajri Koto

Main category: cs.CL

TL;DR: The paper explores fine-tuning LLMs to generate role-specific responses in enterprise settings, comparing three modeling strategies and evaluating them on custom datasets.

Details

Motivation: Existing safety methods for LLMs lack role-specific access control, which is crucial for enterprise deployments.

Method: Three strategies are tested: BERT-based classifier, LLM-based classifier, and role-conditioned generation, evaluated on two datasets (adapted and synthetic).

Result: Performance is assessed across organizational structures and tested for robustness against prompt injection, role mismatch, and jailbreak attempts.

Conclusion: The study demonstrates the feasibility of fine-tuning LLMs for role-specific access control, addressing a gap in current safety methods.

Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

[63] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang, Zhihui Tang, Huaxia Yang, Qiuhong Gong, Tiantian Gu, Hongyang Ma, Yongxin Wang, Wubin Sun, Zeliang Lian, Kehang Mao, Yinan Jiang, Zhicheng Huang, Lingyun Ma, Wenjie Shen, Yajie Ji, Yunhui Tan, Chunbo Wang, Yunlu Gao, Qianling Ye, Rui Lin, Mingyu Chen, Lijuan Niu, Zhihao Wang, Peng Yu, Mengran Lang, Yue Liu, Huimin Zhang, Haitao Shen, Long Chen, Qiguang Zhao, Si-Xuan Liu, Lina Zhou, Hua Gao, Dongqiang Ye, Lingmin Meng, Youtao Yu, Naixin Liang, Jianxiong Wu

Main category: cs.CL

TL;DR: The paper introduces CSEDB, a framework for evaluating LLMs in clinical settings, revealing moderate performance and highlighting risks in high-risk scenarios. Domain-specific models outperform general ones.

Details

Motivation: To address challenges in evaluating the safety and effectiveness of LLMs in clinical decision support.

Method: Developed CSEDB with 30 criteria, tested six LLMs using 2,069 Q&A items reviewed by specialists.

Result: Moderate performance (57.2% avg), 13.3% drop in high-risk scenarios. Domain-specific models performed better.

Conclusion: CSEDB provides a standardized metric for LLM evaluation, aiding safer and more effective deployment in healthcare.

Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

[64] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Keer Lu, Zheng Liang, Youquan Li, Jiejun Tan, Da Pan, Shusen Zhang, Guosheng Dong, Huang Leng

Main category: cs.CL

TL;DR: Med-R³ is a medical retrieval-augmented reasoning framework using progressive reinforcement learning to jointly optimize retrieval and reasoning, outperforming existing models.

Details

Motivation: Existing methods lack joint optimization of retrieval and reasoning and rely on supervised fine-tuning, limiting generalization. Medical domain demands are not adequately addressed.

Method: Develop logical reasoning first, then optimize retrieval to align with knowledge corpus, and finally jointly optimize retrieval-reasoning coordination using reinforcement learning.

Result: Med-R³ achieves state-of-the-art performance: LLaMA3.1-8B-Instruct + Med-R³ surpasses GPT-4o-mini by 3.93%, and Qwen2.5-14B augmented with Med-R³ shows a larger gain of 13.53%.

Conclusion: Med-R³ effectively addresses the limitations of existing methods by jointly optimizing retrieval and reasoning, demonstrating superior performance in medical scenarios.

Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite its potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce Med-R$^3$, a Medical Retrieval-augmented Reasoning framework driven by progressive Reinforcement learning. In this framework, we first develop the model’s ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model’s retrieval and reasoning coordination. Extensive experiments indicate that Med-R$^3$ achieves state-of-the-art performance, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

[65] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Alva West, Luodan Zhang, Liuliu Zhang, Minjun Zhu, Yixuan Weng, Yue Zhang

Main category: cs.CL

TL;DR: T-Detect introduces a heavy-tailed discrepancy score for detecting machine-generated text, outperforming traditional Gaussian-based methods by up to 3.9% in AUROC.

Details

Motivation: The need for robust detection methods for adversarial or non-native English texts, which challenge existing Gaussian-based zero-shot detectors.

Method: T-Detect replaces Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution, leveraging leptokurtosis in adversarial texts.

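Result: T-Detect improves AUROC by up to 3.9% in targeted domains and, when integrated into a two-dimensional detection framework (CT), achieves state-of-the-art performance with an AUROC of 0.926 on the Books domain of RAID.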

Conclusion: T-Detect provides a theoretically justified, robust method for detecting machine-generated text, validated by superior performance and adversarial resilience.

Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student’s t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
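
As a rough illustration of what swapping Gaussian normalization for a heavy-tailed one can look like, the sketch below contrasts a plain z-score with a score normalized by the standard deviation of a Student's t-distribution. The exact T-Detect formula is not given in the abstract, so this is only one plausible reading: the function names, the degrees-of-freedom parameter `nu`, and the `scale` parameter are assumptions. The only fact relied on is that a t-distribution with `nu > 2` degrees of freedom and scale `s` has variance `s**2 * nu / (nu - 2)`.

```python
import math


def gaussian_score(log_lik: float, mean: float, std: float) -> float:
    """Standard z-score normalization of a passage log-likelihood."""
    return (log_lik - mean) / std


def t_normalized_score(log_lik: float, mean: float, scale: float, nu: float = 5.0) -> float:
    """Normalize against the second moment of a Student's t-distribution.

    Assumed form for illustration only, not the paper's published formula.
    """
    if nu <= 2.0:
        raise ValueError("variance of a t-distribution is finite only for nu > 2")
    # Heavier tails (smaller nu) inflate the expected spread, damping extreme scores.
    t_std = scale * math.sqrt(nu / (nu - 2.0))
    return (log_lik - mean) / t_std


print(gaussian_score(-10.0, -12.0, 1.0))
print(t_normalized_score(-10.0, -12.0, 1.0, nu=4.0))
```

With the same discrepancy, the t-normalized score is smaller in magnitude than the z-score, so a single outlying log-likelihood moves the detector's decision less.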

[66] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Alexandre Misrahi, Nadezhda Chirkova, Maxime Louis, Vassilina Nikoulina

Main category: cs.CL

TL;DR: DiffLoRA introduces a parameter-efficient adaptation of differential attention with low-rank adapters, showing mixed results but notable gains in specific tasks like HumanEval.

Details

Motivation: To combine the efficiency of LoRA with the performance benefits of differential attention in Transformer models.

Method: DiffLoRA uses low-rank adapters on positive and negative attention terms to retain efficiency while leveraging differential attention.

Result: Mixed performance across tasks, with a significant +11-point improvement over LoRA on HumanEval.

Conclusion: DiffLoRA shows promise in specific domains but generally underperforms other parameter-efficient methods, warranting further analysis of attention patterns.

Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.

Salam Thabet Doghmash, Motaz Saad

Main category: cs.CL

TL;DR: The paper focuses on detecting and cleaning hate speech in Arabic text using deep learning and transformers, achieving high accuracy in detection and a good BLEU score for text masking.

Details

Motivation: Addressing the growing issue of hate speech in social media by developing methods to detect and clean such content in Arabic text.

Method: Used deep learning models and transformers for hate speech detection and treated text cleaning as a machine translation task.

Result: Achieved 92% Macro F1 score and 95% accuracy in detection, and a BLEU score of 0.3 for text masking.

Conclusion: The proposed methods are effective for hate speech detection and cleaning in Arabic text; the authors consider the masking result good relative to state-of-the-art machine translation systems.

Abstract: Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean hate speech from a given text. Cleaning here means replacing each bad word with stars, one per letter of the word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we treat it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The best hate speech detection model achieves a 92% Macro F1 score and 95% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with state-of-the-art machine translation systems.
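
The target output of the cleaning task (each offensive word replaced by as many stars as it has letters) can be illustrated with a simple lexicon lookup. Note that this is not the paper's approach, which learns the masking as a machine translation task; the function name `mask_words` and the placeholder lexicon are assumptions made purely to show the expected input and output shape.

```python
import re


def mask_words(text: str, lexicon: set) -> str:
    """Replace every word found in `lexicon` with one '*' per character."""
    def _mask(match: "re.Match") -> str:
        word = match.group(0)
        return "*" * len(word) if word.lower() in lexicon else word

    # \w+ matches runs of Unicode word characters, so non-Latin scripts are covered too.
    return re.sub(r"\w+", _mask, text)


# Hypothetical placeholder lexicon; a real system would need a curated word list.
print(mask_words("you are a badword indeed", {"badword"}))
```

A lookup like this only catches words already in the list, which is one motivation for learning the mapping from dirty to masked sentences instead.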

[68] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Nasim Shirvani-Mahdavi, Devin Wingfield, Amin Ghasemi, Chengkai Li

Main category: cs.CL

TL;DR: The paper explores using large language models to generate natural language explanations for logical rules in knowledge graphs, evaluating their correctness and clarity.

Details

Motivation: To improve KG completeness, detect errors, and enhance reasoning by making complex logical rules understandable through natural language explanations.

Method: Extracts logical rules using AMIE 3.5.1 from datasets FB15k-237, FB-CVT-REV, and FB+CVT-REV, and tests prompting strategies like zero-/few-shot and chain-of-thought reasoning.

Result: Promising performance in explanation correctness and clarity, though challenges remain.

Conclusion: Large language models show potential for explaining KG rules, but further research is needed.

Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

[69] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Yunxiang Yan, Tomohiro Sawada, Kartik Goyal

Main category: cs.CL

TL;DR: The paper introduces a cascaded question disclosure framework to better evaluate LLMs’ problem-solving capabilities, showing that it yields better model comparisons and intermediate reasoning traces, and narrows the performance gaps observed under standard QA evaluation.

Details

Motivation: Standard QA benchmarks are indirect and may overestimate differences in LLM performance, necessitating a more accurate and scalable evaluation method.

Method: A stagewise approach where partial question information is revealed to elicit generalized reasoning in LLMs, tested on diverse datasets.

Result: The framework provides better model comparisons, improves intermediate reasoning traces, and narrows performance gaps compared to standard QA.

Conclusion: The cascaded approach offers a more accurate and scalable evaluation of LLMs, challenging the prevalent QA paradigm.

Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models’ problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.

[70] LiMe: a Latin Corpus of Late Medieval Criminal Sentences

Alessandra Bassani, Beatrice Del Bo, Alfio Ferrara, Marta Mangini, Sergio Picascia, Ambra Stefanello

Main category: cs.CL

TL;DR: The paper introduces the LiMe dataset, a corpus of 325 annotated medieval Latin documents, to improve masked language models and NLP tasks for Latin.

Details

Motivation: The disparity in data for Latin compared to modern languages hinders the performance of language models. The LiMe dataset aims to address this gap.

Method: The dataset comprises 325 documents from medieval manuscripts, meticulously annotated by experts for use in masked language modeling and supervised NLP tasks.

Result: The LiMe dataset provides a valuable resource for enhancing Latin language models and NLP applications.

Conclusion: The LiMe dataset is a significant step toward improving computational linguistics tools for Latin, bridging the data gap with modern languages.

Abstract: The Latin language has received attention from the computational linguistics research community, which has built, over the years, several valuable resources, ranging from detailed annotated corpora to sophisticated tools for linguistic analysis. With the recent advent of large language models, researchers have also started developing models capable of generating vector representations of Latin texts. The performance of such models still lags behind that of models for modern languages, given the disparity in available data. In this paper, we present the LiMe dataset, a corpus of 325 documents extracted from a series of medieval manuscripts called Libri sententiarum potestatis Mediolani and thoroughly annotated by experts, for use in masked language modeling as well as supervised natural language processing tasks.

[71] Explaining vague language

Paul Égré, Benjamin Spector

Main category: cs.CL

TL;DR: The paper compares two accounts of vagueness in language—Lipman’s game-theoretic approach and Égré et al.’s Bayesian account—and argues that the latter, which includes semantic content, is more adequate.

Details

Motivation: To resolve the puzzle of why vague language is more useful than precise language by comparing and contrasting two theoretical approaches.

Method: Comparative analysis of Lipman’s game-theoretic model (mixed strategies) and Égré et al.’s Bayesian account (semantic content).

Result: The semantic account (Égré et al.) is shown to be more informative and explanatory of vagueness than the purely strategic account (Lipman).

Conclusion: A semantic layer is necessary for a complete and adequate explanation of vagueness in language.

Abstract: Why is language vague? Vagueness may be explained and rationalized if it can be shown that vague language is more useful to speaker and hearer than precise language. In a well-known paper, Lipman proposes a game-theoretic account of vagueness in terms of mixed strategy that leads to a puzzle: vagueness cannot be strictly better than precision at equilibrium. More recently, Égré, Spector, Mortier and Verheyen have put forward a Bayesian account of vagueness establishing that using vague words can be strictly more informative than using precise words. This paper proposes to compare both results and to explain why they are not in contradiction. Lipman’s definition of vagueness relies exclusively on a property of signaling strategies, without making any assumptions about the lexicon, whereas Égré et al.’s involves a layer of semantic content. We argue that the semantic account of vagueness is needed, and more adequate and explanatory of vagueness.

[72] Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

Riya Sawhney, Samrat Yadav, Indrajit Bhattacharya, Mausam

Main category: cs.CL

TL;DR: The paper introduces FUn-FuSIC, a novel model for few-shot KBQA with unanswerable questions, outperforming existing methods.

Details

Motivation: Addressing the challenge of handling unanswerable questions in KBQA with limited labeled data.

Method: Extends FuSIC-KBQA with Feedback for Unanswerability (FUn), using iterative repair and verifiers for better answerability assessment.

Result: FUn-FuSIC outperforms state-of-the-art models on few-shot KBQA with unanswerable questions.

Conclusion: FUn-FuSIC is effective for few-shot KBQA, setting a new benchmark for both answerable and unanswerable questions.

Abstract: Real-world applications of KBQA require models to handle unanswerable questions with a limited volume of in-domain labeled training data. We propose the novel task of few-shot transfer for KBQA with unanswerable questions and contribute two new datasets for performance evaluation. We present FUn-FuSIC, a novel solution for our task that extends FuSIC-KBQA, the state-of-the-art few-shot transfer model for answerable-only KBQA. We first note that FuSIC-KBQA’s iterative repair makes a strong assumption that all questions are answerable. As a remedy, we propose Feedback for Unanswerability (FUn), which performs iterative repair using feedback from a suite of strong and weak verifiers, and an adaptation of self-consistency for unanswerability to better assess the answerability of a question. Our experiments show that FUn-FuSIC significantly outperforms suitable adaptations of multiple LLM-based and supervised SoTA models on our task, while establishing a new SoTA for answerable few-shot transfer as well.

[73] Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra

Main category: cs.CL

TL;DR: The paper addresses LLMs’ struggle with irrelevant information in math word problems by proposing a prompting framework to generate adversarial variants. A dataset (PROBLEMATHIC) is introduced, showing LLMs’ susceptibility to numerical noise (~26% average relative performance drop). Fine-tuning on adversarial samples improves robustness (~8% gain). Generalizability is tested with GSM-8K-Adv, where performance drops by up to 6%.

Details

Motivation: LLMs perform well on MWPs but falter with irrelevant information. The goal is to enhance their robustness to such noise.

Method: A prompting framework generates adversarial MWPs with irrelevant variables. A dataset (PROBLEMATHIC) is created, and LLMs are fine-tuned on adversarial samples.

Result: LLMs suffer an average relative performance drop of ~26% on adversarial MWPs. Fine-tuning on adversarial samples improves performance on adversarial MWPs by ~8%. On GSM-8K-Adv, performance drops by up to 6%.

Conclusion: The framework and dataset improve LLMs’ robustness to irrelevant information, though challenges remain in generalizability.

Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
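The "relative performance drop" reported above can be read as the loss in accuracy expressed as a fraction of clean accuracy. The sketch below assumes that reading; the function name `relative_drop` and the two accuracy values are hypothetical and chosen only so the arithmetic lands near the reported ~26%, they are not figures from the paper.

```python
def relative_drop(clean_acc: float, adversarial_acc: float) -> float:
    """Accuracy lost on adversarial problems, as a fraction of clean accuracy."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return (clean_acc - adversarial_acc) / clean_acc


# Hypothetical accuracies (not taken from the paper).
clean = 0.80
adversarial = 0.592
drop = relative_drop(clean, adversarial)
print(f"relative drop: {drop:.0%}")
```

Under this reading, a relative drop differs from an absolute one: the hypothetical model above loses about 21 accuracy points, which is 26% of its clean accuracy.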

[74] Neutral Residues: Revisiting Adapters for Model Extension

Franck Signe Talla, Edouard Grave, Hervé Jégou

Main category: cs.CL

TL;DR: The paper introduces ‘neutral residues,’ an improved adapter method for extending pretrained LLMs to new domains without degrading original domain performance.

Details

Motivation: Addressing the trade-off in domain adaptation techniques like finetuning or LoRA, which often degrade performance on the original domain.

Method: Improves adapters by jointly optimizing data, architecture, and training, ensuring new residual blocks output near-zeros on the original domain.

Result: Outperforms finetuning, LoRA, and vanilla adapters in adapting a model to a new language while retaining English performance.

Conclusion: Neutral residues offer a superior solution for domain adaptation in LLMs, balancing new domain learning and original domain retention.

Abstract: We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as finetuning or low-rank adaptation (LoRA) are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off, between performing well on the new domain vs. degrading performance on the original domain. Here, we revisit and improve adapters to extend LLMs from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as finetuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.

[75] Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

T. G. D. K. Sumanathilaka, Nicholas Micallef, Julian Hough

Main category: cs.CL

TL;DR: The paper proposes a method using LLMs and a knowledge base to improve Word Sense Disambiguation (WSD) by combining prompt augmentation, POS tagging, synonyms, and few-shot prompting, showing significant performance gains.

Details

Motivation: Lexical ambiguity in digital communications challenges traditional WSD methods, limiting efficiency in translation, information retrieval, and question-answering systems.

Method: A novel approach combining systematic prompt augmentation with a KB, incorporating POS tagging, synonyms, aspect-based sense filtering, and few-shot COT prompting.

Result: Demonstrates substantial improvement in WSD performance, evaluated using FEWS test data and sense tags.

Conclusion: The method advances accurate word interpretation in digital communication, leveraging LLMs and human-in-loop prompt augmentation.

Abstract: Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-loop approach for prompt augmentation in which the prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering, and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.

[76] Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

Jiahao Yuan, Zixiang Di, Shangzixin Zhao, Zhiqing Cui, Hanqing Wang, Guisong Yang, Usman Naseem

Main category: cs.CL

TL;DR: Cultural Palette is a multi-agent framework for aligning LLMs with diverse cultural values by blending region-specific responses dynamically.

Details

Motivation: LLMs struggle with monocultural biases and adapting to unknown cultures post-fine-tuning, necessitating a culturally adaptive solution.

Method: Uses a three-step process: synthesizing a cultural dataset, forming continent-level alignment agents, and dynamically blending responses via a Meta Agent.

Result: Outperforms existing baselines in cultural alignment across various countries.

Conclusion: Cultural Palette effectively addresses cultural biases in LLMs through adaptive blending of regional nuances.

Abstract: Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods struggle to adapt to unknown culture after fine-tuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework that redefines cultural alignment as an adaptive “color-blending” process for country-specific adaptation. Our approach harnesses cultural geography across five continents (Africa, America, Asia, Europe, Oceania) through three key steps: First, we synthesize the Pentachromatic Cultural Palette Dataset using GPT-4o, refining continental-level dialogues with Hofstede’s cultural dimensions to establish foundational cultural representations. Second, five continent-level alignment agents form specialized cultural communities that generate region-specific draft responses. Third, a Meta Agent employs Cultural MoErges to dynamically blend these cultural “colors” through attention-gated parameter merging, akin to mixing pigments on a palette, resolving conflicts while preserving cultural nuances to produce the final culturally-aligned response. Extensive experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.

[77] Inside-Out: Hidden Factual Knowledge in LLMs

Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart

Main category: cs.CL

TL;DR: The paper introduces a framework to measure hidden knowledge in LLMs, showing they encode more facts internally than they express externally, with an average relative gap of 40%. Some knowledge is so deeply hidden that it is never generated, even across 1,000 sampled answers, revealing limitations in LLM generation.

Details

Motivation: To quantify and demonstrate the gap between the factual knowledge LLMs encode internally and what they express in outputs, addressing a previously undefined phenomenon.

Method: Proposes a formal definition of knowledge, distinguishing external (observable outputs) and internal (intermediate computations) knowledge. Applies this framework to three LLMs in closed-book QA.

Result: LLMs consistently encode more factual knowledge internally than they express externally, with an average relative gap of 40%. In some cases a model internally knows an answer perfectly yet fails to generate it even once across 1,000 sampled answers.

Conclusion: Reveals fundamental limitations in LLM generation and constraints on improving performance via repeated sampling, as some correct answers remain practically inaccessible.

Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model’s observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average relative gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) put a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.
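The per-question knowledge measure described in the abstract (the fraction of correct-incorrect answer pairs in which the correct answer is ranked higher) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `knowledge_score`, the treatment of ties as losses, and the toy scores are all assumptions.

```python
from itertools import product


def knowledge_score(correct_scores, incorrect_scores):
    """Fraction of (correct, incorrect) answer pairs in which the
    correct answer gets the strictly higher score (ties count as losses)."""
    pairs = list(product(correct_scores, incorrect_scores))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for c, i in pairs if c > i)
    return wins / len(pairs)


# Hypothetical scores for one question's answer candidates.
# "External" scores stand in for token-level probabilities,
# "internal" scores for a scorer that reads intermediate computations.
external = knowledge_score([0.30, 0.10], [0.20, 0.05])
internal = knowledge_score([0.90, 0.80], [0.20, 0.05])

# Hidden knowledge arises when internal knowledge exceeds external knowledge.
hidden_gap = internal - external
print(external, internal, hidden_gap)
```

In this toy case the external scorer ranks three of the four pairs correctly and the internal scorer ranks all four correctly, so the question exhibits hidden knowledge under the abstract's definition.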

[78] Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer

Alexandra DeLucia, Mark Dredze

Main category: cs.CL

TL;DR: The paper evaluates multi-document summarization (MDS) models across training approaches, domains, and dimensions, focusing on zero-shot domain transfer challenges and metric issues.

Details

Motivation: To analyze why MDS models trained on one domain fail in another and assess the limitations of existing summarization metrics.

Method: Evaluates MDS models across four training approaches (direct, chunk-then-summarize, extract-then-summarize, GPT-style) in zero-shot domain transfer (News, Science, Conversation). Measures reference similarity, quality, and factuality.

Result: Identifies domain-transfer failures as decreased factuality, higher deviation from target, and lower summary quality. Highlights issues with popular summarization metrics.

Conclusion: MDS models struggle with domain transfer, and current metrics may not fully capture performance. Future work should address these gaps.

Abstract: Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models can be grouped into four approaches: end-to-end with special pre-training (“direct”), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality), to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer “failure” as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

[79] Splits! A Flexible Dataset and Evaluation Framework for Sociocultural Linguistic Investigation

Eylon Caplan, Tania Chakraborty, Dan Goldwasser

Main category: cs.CL

TL;DR: The paper introduces Splits!, a 9.7M-post Reddit dataset for studying Sociocultural Linguistic Phenomena (SLP), with a framework to validate hypotheses efficiently.

Details

Motivation: Current computational studies of SLPs are limited to narrow analyses, slowing progress. A scalable dataset and method are needed.

Method: Splits! dataset (53K users, 6 demographics, 89 topics) is validated via self-identification and replication. A framework uses retrieval methods to test hypotheses and measures ‘unexpectedness.’

Result: The framework reduces manual inspection needs by 1.5-1.8x, aiding efficient discovery of SLPs.

Conclusion: Splits! and the framework enable systematic, scalable SLP research, accelerating insights into sociocultural language variation.

Abstract: Variation in language use, shaped by speakers’ sociocultural background and specific context of use, offers a rich lens into cultural perspectives, values, and opinions. However, the computational study of these Sociocultural Linguistic Phenomena (SLP) has often been limited to bespoke analyses of specific groups or topics, hindering the pace of scientific discovery. To address this, we introduce Splits!, a 9.7 million-post dataset from Reddit designed for systematic and flexible research. The dataset contains posts from over 53,000 users across 6 demographic groups, organized into 89 discussion topics to enable comparative analysis. We validate Splits! via self-identification and by successfully replicating several known SLPs from existing literature. We complement this dataset with a framework that leverages efficient retrieval methods to rapidly validate potential SLPs (PSLPs) by automatically evaluating whether a given hypothesis is supported by our data. Crucially, to distinguish between novel and obvious insights, the framework incorporates a human-validated measure of a hypothesis’s “unexpectedness.” We demonstrate that the two-stage process reduces the number of statistically significant findings requiring manual inspection by a factor of 1.5-1.8x, streamlining the discovery of promising phenomena for further investigation.

[80] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Siddhant Gupta, Drishti Sharma, Jebish Purbey, Kanwal Mehreen, Muhammad Arham, Suman Debnath, Hamza Farooq

Main category: cs.CL

TL;DR: Mantra-14B, a Hindi-English bi-lingual LLM, achieves a ~3% average improvement in benchmark scores across both languages and outperforms models twice its size, using fine-tuning on curated data rather than resource-heavy methods.

Details

Motivation: Addressing the underrepresentation of non-English languages in LLMs by improving performance for Hindi and English without compromising native capabilities.

Method: Instruction tuning of models like Qwen-2.5-14B-Instruct and Phi-4 using a curated dataset of 485K English-Hindi samples, experimenting with data ratios and avoiding vocabulary expansion or architectural changes.

Result: Mantra-14B outperforms larger models, showing improved multilingual performance without significant computational overhead.

Conclusion: Modest fine-tuning with culturally informed data can bridge performance gaps for under-resourced languages, as demonstrated by Mantra-14B.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM Mantra-14B with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research towards under-represented and low-resource languages.

[81] Robust and Fine-Grained Detection of AI Generated Texts

Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya TippaReddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, Modabbir Adeeb, Srinadh Vura, Suman Debnath, Hamza Farooq

Main category: cs.CL

TL;DR: The paper introduces token classification models for detecting human-LLM co-authored texts, trained on a large dataset of 2.4M texts across 23 languages, and evaluates their performance on unseen domains, generators, and adversarial inputs.

Details

Motivation: Existing detection systems struggle with identifying AI-generated content in shorter or partially co-authored texts, necessitating a more robust solution.

Method: Developed token classification models trained on a dataset of human-machine co-authored texts, tested on diverse scenarios including unseen domains and adversarial inputs.

Result: Models performed well on unseen domains, generators, non-native speaker texts, and adversarial inputs, with detailed performance findings across various conditions.

Conclusion: The introduced models and dataset provide a robust framework for detecting human-LLM co-authored texts, addressing gaps in existing systems.

Abstract: An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs appear day by day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are authored entirely by a human or an LLM, so we focus on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification, trained on an extensive collection of human-machine co-authored texts, which performed well on texts from unseen domains, unseen generators, texts by non-native speakers, and texts with adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, across 23 languages. We also present findings on our models’ performance for each domain and generator. Additional findings include comparisons of performance against each adversarial method, across input text lengths, and of the characteristics of generated texts relative to the original human-authored texts.
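Framing detection as token classification means each token receives its own label, so a co-authored text can be split into human-written and machine-written segments. The sketch below shows only that post-processing step, under assumptions: the label names `"human"` and `"machine"`, the function `label_spans`, and the example sentence are hypothetical, and the per-token labels are hard-coded rather than produced by a model.

```python
from itertools import groupby


def label_spans(tokens, labels):
    """Group consecutive tokens that share a label into (label, text) spans."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, " ".join(tokens[start:start + length])))
        start += length
    return spans


# Hypothetical per-token predictions for a partially machine-written sentence.
tokens = "The meeting went well and the results exceeded all expectations".split()
labels = ["human"] * 4 + ["machine"] * 6

for label, text in label_spans(tokens, labels):
    print(f"{label}: {text}")
```

A document-level classifier would assign this sentence a single label; the token-level view is what makes the boundary between the two authors recoverable.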

[82] Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Ziqiao Ma, Jing Ding, Xuejun Zhang, Dezhi Luo, Jiahe Ding, Sihan Xu, Yuchen Huang, Run Peng, Joyce Chai

Main category: cs.CL

TL;DR: The paper revisits Referring Expression Generation (REG) from a pragmatic perspective, highlighting current vision-language models’ (VLMs) failures in adhering to Gricean maxims. It introduces a new dataset (RefOI) and identifies three key pragmatic failures in VLMs, advocating for pragmatically informed models and evaluations.

Details

Motivation: Current evaluations of VLMs often neglect the pragmatic dimension of REG, reducing it to a region-based captioning task. This work aims to address this gap by focusing on pragmatic competence in REG.

Method: The authors introduce a new dataset (RefOI) with 1.5k images annotated with referring expressions. They systematically evaluate state-of-the-art VLMs to identify pragmatic failures.

Result: Three key pragmatic failures are identified: inability to uniquely identify referents, inclusion of irrelevant information, and misalignment with human pragmatic preferences. Standard evaluations fail to capture these issues.

Conclusion: The study calls for pragmatically informed models and evaluation frameworks to better align with human communication principles.

Abstract: Referring Expression Generation (REG) is a core task for evaluating the pragmatic competence of vision-language systems, requiring not only accurate semantic grounding but also adherence to principles of cooperative communication (Grice, 1975). However, current evaluations of vision-language models (VLMs) often overlook the pragmatic dimension, reducing REG to a region-based captioning task and neglecting Gricean maxims. In this work, we revisit REG from a pragmatic perspective, introducing a new dataset (RefOI) of 1.5k images annotated with both written and spoken referring expressions. Through a systematic evaluation of state-of-the-art VLMs, we identify three key failures of pragmatic competence: (1) failure to uniquely identify the referent, (2) inclusion of excessive or irrelevant information, and (3) misalignment with human pragmatic preference, such as the underuse of minimal spatial cues. We also show that standard automatic evaluations fail to capture these pragmatic violations, reinforcing superficial cues rather than genuine referential success. Our findings call for a renewed focus on pragmatically informed models and evaluation frameworks that align with real human communication.

[83] Leveraging LLMs to Create Content Corpora for Niche Domains

Franklin Zhang, Sonya Zhang, Alon Halevy

Main category: cs.CL

TL;DR: A method for creating domain-specific corpora using LLMs, validated in behavior education with high user satisfaction.

Details

Motivation: Challenges in curating domain-specific content from unstructured web data.

Method: LLM-enhanced techniques for structured content extraction and semantic deduplication.

Result: Extracted 3,531 unique challenges from 15K webpages; user satisfaction score of 4.3/5.

Conclusion: Proposed framework effectively generates high-quality domain-specific corpora.

Abstract: Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategic framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
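The deduplication step mentioned above can be illustrated with a greedy near-duplicate filter. This is only a sketch of the general idea: it swaps in word-overlap (Jaccard) similarity as a simple stand-in for the LLM-enhanced semantic comparison the abstract describes, and the function names, the 0.6 threshold, and the example challenges are all hypothetical.

```python
def word_set(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(items, threshold=0.6):
    """Keep an item only if it is not too similar to any item already kept."""
    kept = []
    for item in items:
        words = word_set(item)
        if all(jaccard(words, word_set(k)) < threshold for k in kept):
            kept.append(item)
    return kept


# Hypothetical extracted challenges (not from the 30DayGen corpus).
challenges = [
    "Drink eight glasses of water every day",
    "drink eight glasses of water each day",
    "Read ten pages of a book every night",
]
unique = deduplicate(challenges)
print(unique)
```

Word overlap misses paraphrases that share few words, which is one reason a pipeline like the one described might rely on a semantic comparison instead.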

[84] The Pragmatic Mind of Machines: Tracing the Emergence of Pragmatic Competence in Large Language Models

Kefan Yu, Qingcheng Zeng, Weihao Xuan, Wanxin Li, Jingyi Wu, Rob Voigt

Main category: cs.CL

TL;DR: The paper introduces ALTPRAG, a dataset to evaluate LLMs’ pragmatic competence across training stages, showing improvements with scale and alignment techniques.

Details

Motivation: To understand how LLMs acquire pragmatic competence (e.g., implicature resolution) during training, which is currently unclear.

Method: ALTPRAG dataset evaluates LLMs at three training stages (pre-training, SFT, preference optimization) using contrastive reasoning tasks to infer speaker intentions.

Result: Base models show pragmatic sensitivity, improving with scale. SFT and RLHF enhance performance, especially in cognitive-pragmatic scenarios.

Conclusion: Pragmatic competence emerges compositionally during LLM training, offering insights for aligning models with human communication norms.

Abstract: Current large language models (LLMs) have demonstrated emerging capabilities in social intelligence tasks, including implicature resolution and theory-of-mind reasoning, both of which require substantial pragmatic understanding. However, how LLMs acquire this pragmatic competence throughout the training process remains poorly understood. In this work, we introduce ALTPRAG, a dataset grounded in the pragmatic concept of alternatives, to evaluate whether LLMs at different training stages can accurately infer nuanced speaker intentions. Each instance pairs two equally plausible yet pragmatically divergent continuations and requires the model to (i) infer the speaker’s intended meaning and (ii) explain when and why a speaker would choose one utterance over its alternative, thus directly probing pragmatic competence through contrastive reasoning. We systematically evaluate 22 LLMs across 3 key training stages: after pre-training, supervised fine-tuning (SFT), and preference optimization, to examine the development of pragmatic competence. Our results show that even base models exhibit notable sensitivity to pragmatic cues, which improves consistently with increases in model and data scale. Additionally, SFT and RLHF contribute further gains, particularly in cognitive-pragmatic scenarios. These findings highlight pragmatic competence as an emergent and compositional property of LLM training and offer new insights for aligning models with human communicative norms.

[85] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song

Main category: cs.CL

TL;DR: AutoSchemaKG is a framework for autonomous knowledge graph construction without predefined schemas, using LLMs to extract triples and induce schemas from text. It builds ATLAS, a large KG, improving QA tasks and LLM factuality.

Details

Motivation: To eliminate the need for manual schema design in knowledge graph construction and enhance scalability and accuracy.

Method: Leverages large language models to extract knowledge triples and induce schemas from text, organizing entities and events into semantic categories.

Result: Constructs ATLAS, a KG with 900M+ nodes and 5.9B edges, outperforming baselines in QA tasks and achieving 95% schema alignment with human-crafted ones.

Conclusion: Dynamically induced schemas in billion-scale KGs can effectively complement parametric knowledge in LLMs, demonstrating high accuracy without manual intervention.

Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

[86] Framing Political Bias in Multilingual LLMs Across Pakistani Languages

Afrozah Nadeem, Mark Dras, Usman Naseem

Main category: cs.CL

TL;DR: The paper evaluates political bias in 13 LLMs across five Pakistani languages, revealing liberal-left tendencies consistent with Western training data but more authoritarian framing in regional languages, emphasizing the need for multilingual bias auditing.

Details

Motivation: Political and economic bias evaluations in LLMs often overlook low-resource, multilingual regions like Pakistan, where linguistic identity is tied to political and ideological contexts.

Method: A culturally adapted Political Compass Test (PCT) and multi-level framing analysis were used to assess ideological stance and stylistic framing across 11 socio-political themes in five Pakistani languages.

Result: LLMs predominantly reflect liberal-left orientations consistent with Western training data but show more authoritarian framing in regional languages, with consistent model-specific bias patterns across languages.

Conclusion: The study underscores the necessity for culturally grounded, multilingual bias auditing frameworks in global NLP to address language-conditioned ideological biases.

Abstract: Large Language Models (LLMs) increasingly shape public discourse, yet most evaluations of political and economic bias have focused on high-resource, Western languages and contexts. This leaves critical blind spots in low-resource, multilingual regions such as Pakistan, where linguistic identity is closely tied to political, religious, and regional ideologies. We present a systematic evaluation of political bias in 13 state-of-the-art LLMs across five Pakistani languages: Urdu, Punjabi, Sindhi, Pashto, and Balochi. Our framework integrates a culturally adapted Political Compass Test (PCT) with multi-level framing analysis, capturing both ideological stance (economic/social axes) and stylistic framing (content, tone, emphasis). Prompts are aligned with 11 socio-political themes specific to the Pakistani context. Results show that while LLMs predominantly reflect liberal-left orientations consistent with Western training data, they exhibit more authoritarian framing in regional languages, highlighting language-conditioned ideological modulation. We also identify consistent model-specific bias patterns across languages. These findings show the need for culturally grounded, multilingual bias auditing frameworks in global NLP.

[87] Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models

Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin

Main category: cs.CL

TL;DR: ToTh is a new framework for LLM reasoning that uses three parallel agents (abductive, deductive, inductive) to create structured reasoning graphs, outperforming CoT and other methods.

Details

Motivation: Current LLM reasoning methods like CoT lack logical structure and internal coherence checks.

Method: ToTh models reasoning as collaboration among three agents, forming a reasoning graph evaluated via Bayesian belief propagation and NLI.

Result: ToTh outperforms CoT and other methods on symbolic and numerical benchmarks, producing interpretable reasoning.

Conclusion: ToTh offers a robust, cognitively inspired approach for improving LLM reasoning, with potential for broader applications.

Abstract: Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.

[88] Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Chupei Wang, Jiaqiu Vince Sun

Main category: cs.CL

TL;DR: The paper investigates how intra-context interference affects retrieval in LLMs, revealing a decline in accuracy as interference accumulates, despite clear positioning of final values. Mitigation attempts via prompt engineering show limited success, highlighting a working memory bottleneck.

Details

Motivation: To understand the effects of intra-context interference on LLM retrieval, inspired by cognitive science's proactive interference paradigm, and assess LLMs' ability to handle such interference.

Method: Introduces PI-LLM, an evaluation framework that streams semantically related key-value updates and queries final values, measuring retrieval accuracy under accumulating interference.

Result: LLM retrieval accuracy declines log-linearly toward zero as interference accumulates, with errors arising from retrieval of previously overwritten values. Prompt engineering mitigations yield only limited success.

Conclusion: LLMs face a working memory bottleneck in handling interference, suggesting the need for improved methods to suppress irrelevant content during retrieval.

Abstract: Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs’ ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models’ ability to suppress irrelevant content during retrieval.
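
The evaluation setup described above can be sketched as a small generator that interleaves repeated updates to a few keys and records only the last value per key as ground truth. Everything concrete here is an assumption for illustration: the function names `build_pi_stream` and `score`, the keys and values, the prompt wording, and the random interleaving are hypothetical, and the abstract does not specify how PI-LLM formats or orders its updates.

```python
import random


def build_pi_stream(keys, values, updates_per_key, seed=0):
    """Build an interleaved stream of key-value updates plus the ground truth.

    Returns (lines, final_values), where final_values maps each key to the
    last value assigned to it; every earlier value acts as interference.
    """
    rng = random.Random(seed)
    updates = []
    for key in keys:
        for _ in range(updates_per_key):
            updates.append((key, rng.choice(values)))
    rng.shuffle(updates)  # interleave updates to different keys

    final_values = {}
    lines = []
    for key, value in updates:
        lines.append(f"{key}: {value}")
        final_values[key] = value  # later updates overwrite earlier ones
    return lines, final_values


def score(answers, final_values):
    """Fraction of keys whose answer matches the final (most recent) value."""
    correct = sum(answers.get(k) == v for k, v in final_values.items())
    return correct / len(final_values)


# Hypothetical keys and values, chosen only to make the stream readable.
keys = ["fruit", "color", "city"]
values = ["apple", "red", "Paris", "pear", "blue", "Rome"]
lines, truth = build_pi_stream(keys, values, updates_per_key=5)
prompt = "\n".join(lines) + "\nWhat is the current value of each key?"
```

Raising `updates_per_key` increases the number of stale values per key while leaving the final values in place, which is the quantity the paper varies when it reports accuracy against accumulated interference.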

[89] Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics

Asifullah Khan, Muhammad Zaeem Khan, Saleha Jamshed, Sadia Ahmad, Aleesha Zainab, Kaynat Khatib, Faria Bibi, Abdul Rehman

Main category: cs.CL

TL;DR: A survey on LLM advancements, covering reasoning, adaptability, efficiency, ethics, and emerging techniques like Chain-of-Thought prompting and Mixture-of-Experts. Highlights challenges like computational costs and biases, with future focus on multimodal inputs and ethical alignment.

Details

Motivation: To summarize key developments in LLMs, addressing gaps in reasoning, efficiency, and ethics, while exploring their broader applications like Agentic AI.

Method: Reviews techniques like Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback, alongside efficiency strategies like MoE architecture.

Result: Identifies advancements in LLM capabilities and emerging methods, but notes challenges like biases, high costs, and ethical risks.

Conclusion: Future research should prioritize multimodal handling, ethical alignment, and addressing underexplored areas like interpretability and sustainability.

Abstract: This survey paper outlines the key developments in the field of Large Language Models (LLMs), including enhancements to their reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include the Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minor input. A significant focus is placed on efficiency, detailing scaling strategies, optimization techniques, and the influential Mixture-of-Experts (MoE) architecture, which strategically routes inputs to specialized subnetworks to boost predictive accuracy, while optimizing resource allocation. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. Additionally, it explores the role of LLMs in Agentic AI and their use as Autonomous Decision-Making Systems, and categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. The survey also identifies underexplored areas such as interpretability, cross-modal integration, and sustainability. While significant advancements have been made in LLMs, challenges such as high computational costs, biases, and ethical risks remain. Overcoming these requires a focus on bias mitigation, transparent decision-making, and explicit ethical guidelines. Future research will generally focus on enhancing the model’s ability to handle multiple inputs, thereby making it more intelligent, safe, and reliable.

[90] Prompt Engineering Techniques for Mitigating Cultural Bias Against Arabs and Muslims in Large Language Models: A Systematic Review

Bushra Asseri, Estabrag Abdelaziz, Areej Al-Wabil

Main category: cs.CL

TL;DR: The paper reviews prompt engineering strategies to mitigate cultural bias in large language models (LLMs) towards Arabs and Muslims, identifying five effective approaches with varying success.

Details

Motivation: Addressing cultural bias in LLMs, particularly against Arabs and Muslims, to reduce harmful stereotypes and marginalization.

Method: A mixed-methods systematic review of 8 empirical studies (2021-2024) following PRISMA and Kitchenham’s methodology, analyzing bias mitigation techniques.

Result: Five prompt engineering approaches were identified, with structured multi-step pipelines showing the highest overall effectiveness (up to 87.7% bias reduction) at the cost of greater technical expertise. Cultural prompting was more accessible while remaining substantially effective.

Conclusion: Prompt engineering is accessible for bias mitigation, but more research is needed on culturally adaptive techniques and integrating complementary methods.

Abstract: Large language models have demonstrated remarkable capabilities across various domains, yet concerns about cultural bias - particularly towards Arabs and Muslims - pose significant ethical challenges by perpetuating harmful stereotypes and marginalization. Despite growing recognition of bias in LLMs, prompt engineering strategies specifically addressing Arab and Muslim representation remain understudied. This mixed-methods systematic review examines such techniques, offering evidence-based guidance for researchers and practitioners. Following PRISMA guidelines and Kitchenham’s systematic review methodology, we analyzed 8 empirical studies published between 2021 and 2024 investigating bias mitigation strategies. Our findings reveal five primary prompt engineering approaches: cultural prompting, affective priming, self-debiasing techniques, structured multi-step pipelines, and parameter-optimized continuous prompts. Although all approaches show potential for reducing bias, effectiveness varied substantially across studies and bias types. Evidence suggests that certain bias types may be more resistant to prompt-based mitigation than others. Structured multi-step pipelines demonstrated the highest overall effectiveness, achieving up to 87.7% reduction in bias, though they require greater technical expertise. Cultural prompting offers broader accessibility with substantial effectiveness. These results underscore the accessibility of prompt engineering for mitigating cultural bias without requiring access to model parameters. The limited number of studies identified highlights a significant research gap in this critical area. Future research should focus on developing culturally adaptive prompting techniques, creating Arab and Muslim-specific evaluation resources, and integrating prompt engineering with complementary debiasing methods to address deeper stereotypes while maintaining model utility.

[91] WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

Jian Zhang, Linhao Zhang, Bokai Lei, Chuhan Wu, Wei Jia, Xiao Zhou

Main category: cs.CL

TL;DR: The paper introduces a specialized benchmark for evaluating speech-based LLMs, addressing gaps in existing methods by incorporating speech-specific challenges and a query-aware evaluation approach.

Details

Motivation: Current benchmarks for LLMs are text-based and overlook speech-specific challenges like prosody and homophones, hindering the optimization of Audio LLMs in real-world applications.

Method: The authors curate real-world chat data, introduce speaker and acoustic diversity, and augment the dataset with speech-specific phenomena. They also design a query-aware evaluation method using customized checklists and prompts.

Result: Testing reveals significant performance differences among speech models across scenarios, with query-aware evaluation enabling finer-grained assessment.

Conclusion: The proposed benchmark offers valuable insights for improving speech model development and evaluation, addressing the limitations of existing methods.

Abstract: Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities of direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech’s unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method to use customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The use of query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.

[92] Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Main category: cs.CL

TL;DR: PAPO, a novel policy gradient algorithm, improves multimodal reasoning in LLMs by integrating perception-aware supervision, achieving significant performance gains and reducing perception errors.

Details

Motivation: Current RLVR methods are suboptimal for multimodal tasks due to poor visual input perception, prompting the need for a solution that integrates perception and reasoning.

Method: PAPO introduces Implicit Perception Loss (KL divergence) and Double Entropy Loss, enhancing RLVR algorithms like GRPO and DAPO without extra data or models.

Result: PAPO improves performance by 4.4%-17.5% on benchmarks, with higher gains (8.0%-19.1%) in vision-dependent tasks, and reduces perception errors by 30.5%.

Conclusion: PAPO advances RL by integrating perception-aware supervision, enabling visually grounded reasoning; the authors state that code and data will be made publicly available.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
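
For readers unfamiliar with the quantity the Implicit Perception Loss is built on, the sketch below computes a plain KL divergence between two discrete distributions. It is only the generic definition, D_KL(p || q) = sum_i p_i log(p_i / q_i). The abstract does not say which two distributions PAPO compares, nor how the term is signed or weighted in the objective, so the example distributions and the `kl_divergence` helper are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; eps guards against log(0)
    when q_i is zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token distributions over a two-token vocabulary.
p = [0.5, 0.5]
q = [0.9, 0.1]
divergence = kl_divergence(p, q)
```

KL divergence is zero when the two distributions match and grows as they diverge, and it is not symmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.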

[93] KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen

Main category: cs.CL

TL;DR: K2RAG is a novel framework combining dense/sparse vector search, knowledge graphs, and summarization to enhance RAG efficiency and accuracy in LLMs.

Details

Motivation: Addressing the resource-intensive nature of fine-tuning LLMs and scalability/accuracy limitations in naive RAG implementations.

Method: Integrates dense/sparse vector search, knowledge graphs, and text summarization; includes preprocessing for efficiency.

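Result: Achieved a mean answer similarity of 0.57 and a third-quartile (Q3) similarity of 0.82; the summarization step cut the average training time of individual components by 93%, execution was up to 40% faster than traditional knowledge graph-based RAG systems, and VRAM use was about a third of that of several naive RAG implementations.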

Conclusion: K2RAG improves accuracy, efficiency, and scalability in knowledge expansion for LLMs compared to naive RAG.

Abstract: Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.
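
One common way to combine dense and sparse retrieval results is reciprocal rank fusion (RRF), sketched below. This is a swapped-in technique chosen for illustration: the abstract does not state how K2RAG merges its dense and sparse search results, and the document ids, the `reciprocal_rank_fusion` helper, and the constant `k=60` are assumptions rather than details from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["d3", "d1", "d7"]   # hypothetical output of a dense retriever
sparse_hits = ["d1", "d9", "d3"]  # hypothetical output of a sparse retriever
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear near the top of both lists (`d1` and `d3` here) rise above documents found by only one retriever. RRF uses ranks rather than raw scores, so it avoids having to normalize dense similarity scores against sparse ones.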

[94] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

Main category: cs.CL

TL;DR: DocPolarBERT is a layout-aware BERT model for document understanding that uses relative polar coordinates instead of absolute 2D positional embeddings, achieving state-of-the-art results with less pre-training data.

Details

Motivation: Traditional document understanding models rely on absolute 2D positional embeddings, which may not be efficient. DocPolarBERT aims to improve this by using a relative polar coordinate system.

Method: Extends self-attention to incorporate text block positions in a relative polar coordinate system, reducing dependency on large pre-training datasets.

Result: Achieves state-of-the-art performance despite being pre-trained on a dataset more than six times smaller than the IIT-CDIP corpus.

Conclusion: A well-designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective solution for document understanding.

Abstract: We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
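
To make the idea of relative polar coordinates concrete, the sketch below converts the offset between two text blocks into a distance and an angle. It assumes each block is represented by the (x, y) center of its bounding box; that choice, the `relative_polar` helper, and the sample coordinates are hypothetical, and the abstract does not describe how DocPolarBERT discretizes or injects these values into self-attention.

```python
import math


def relative_polar(src, dst):
    """Return (distance, angle) of dst relative to src.

    src and dst are (x, y) block centers. The angle is in radians in
    [-pi, pi], measured from the positive x-axis.
    """
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


# Hypothetical block centers in page coordinates.
header = (0.0, 0.0)
field = (3.0, 4.0)
distance, angle = relative_polar(header, field)
```

Unlike absolute (x, y) positions, this pair depends only on where one block sits relative to another, so it is unchanged if the whole layout is shifted on the page.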

[95] Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires

Simon Münker

Main category: cs.CL

TL;DR: LLMs homogenize moral diversity, failing to represent cultural nuances despite their size, challenging their use in social science and AI alignment.

Details

Motivation: To investigate whether LLMs accurately represent diverse cultural moral frameworks or merely average them, highlighting gaps in AI alignment.

Method: Applied the Moral Foundations Questionnaire across 19 cultural contexts, comparing LLMs’ outputs to human moral intuitions.

Result: LLMs systematically homogenize moral diversity; increased model size doesn’t consistently improve cultural representation.

Conclusion: Current AI alignment lacks nuanced cultural representation; new objectives and metrics are needed to preserve moral diversity.

Abstract: Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs’ origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn’t consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.

[96] ILID: Native Script Language Identification for Indian Languages

Yash Ingle, Pruthwik Mishra

Main category: cs.CL

TL;DR: The paper releases a dataset of 250K sentences in 23 languages (English and all 22 official Indian languages), along with baseline language identification models that outperform state-of-the-art pre-trained transformer models.

Details

Motivation: Language identification is critical for NLP tasks but challenging in noisy, short, and code-mixed contexts, especially for similar Indian languages sharing scripts.

Method: Developed a dataset and baseline models using state-of-the-art machine learning and fine-tuned transformer models.

Result: The models outperform existing transformer models for language identification.

Conclusion: The dataset and models are publicly available, advancing language identification for diverse Indian languages.

Abstract: The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of diverse Indian languages that exhibit lexical and phonetic similarities, but have distinct differences. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences consisting of 23 languages including English and all 22 official Indian languages labeled with their language identifiers, where data in most languages are newly created. We also develop and release baseline models using state-of-the-art approaches in machine learning and fine-tuning pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models for the language identification task. The dataset and the code are available at https://yashingle-ai.github.io/ILID/ and in Huggingface open source libraries.

[97] RAVine: Reality-Aligned Evaluation for Agentic Search

Yilong Xu, Xiang Long, Zhi Zheng, Jinhua Gao

Main category: cs.CL

TL;DR: RAVine is a new evaluation framework for agentic search systems, addressing limitations in current benchmarks by aligning with realistic user queries, improving ground truth accuracy, and evaluating the iterative search process.

Details

Motivation: Existing evaluation frameworks for agentic search systems misalign with real-world user scenarios, introduce noise in ground truth extraction, and overlook the iterative search process.

Method: RAVine proposes a reality-aligned framework with multi-point queries, attributable ground truth construction, and evaluation of iterative interactions and efficiency.

Result: RAVine benchmarks models and provides insights to advance agentic search systems.

Conclusion: RAVine addresses key limitations in current frameworks, offering a more accurate and comprehensive evaluation for agentic search systems.

Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine – a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines the model’s interaction with search tools throughout the iterative process, and accounts for efficiency factors. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at https://github.com/SwordFaith/RAVine.

[98] FinGAIA: A Chinese Benchmark for AI Agents in Real-World Financial Domain

Lingfeng Zeng, Fangqi Lou, Zixuan Wang, Jiajie Xu, Jinyi Niu, Mengping Li, Yifan Dong, Qi Qi, Wei Zhang, Ziwei Yang, Jun Han, Ruilun Feng, Ruiqi Hu, Lejie Zhang, Zhengbo Feng, Yicheng Ren, Xin Guo, Zhaowei Liu, Dongpo Cheng, Weige Cai, Liwen Zhang

Main category: cs.CL

TL;DR: FinGAIA is a benchmark for evaluating AI agents in finance, testing 10 agents with 407 tasks across seven sub-domains. ChatGPT performed best (48.9% accuracy) but lags behind experts. Five failure patterns were identified, guiding future research.

Details

Motivation: The financial sector lacks exploration of AI agents' multi-step, multi-tool collaboration. FinGAIA aims to objectively assess and improve AI agent capabilities in finance.

Method: FinGAIA includes 407 tasks across seven financial sub-domains, organized into three hierarchical levels. Ten AI agents were evaluated in a zero-shot setting.

Result: ChatGPT achieved the highest accuracy (48.9%), but still fell short of financial experts by over 35 percentage points. Five recurring failure patterns were identified.

Conclusion: FinGAIA is the first financial-focused AI agent benchmark, highlighting gaps and guiding future research to enhance AI capabilities in finance.

Abstract: The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at https://github.com/SUFE-AIFLM-Lab/FinGAIA.

[99] Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

Main category: cs.CL

TL;DR: The paper introduces Multi-Hypothesis Distillation (MHD), a sequence-level KD method for multilingual translation models, leveraging multiple translations to improve student learning and address issues like low variability and gender bias.

Details

Motivation: The teacher model's output distribution offers valuable insights beyond beam search, motivating the need for a method that better represents this distribution for student learning.

Method: Proposes MHD, which generates multiple translations per source sentence using $n$-best lists and alternative decoding methods to enhance variability and lexical richness.

Result: For low-resource languages, sampling methods increase variability and lexical richness, improving student performance and reducing gender bias amplification.

Conclusion: MHD effectively improves student model performance by better representing the teacher’s distribution and addressing key limitations of traditional KD methods.

Abstract: This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model’s output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage $n$-best lists from beam search to guide the student’s learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam-search-based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.
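The idea of exposing the student to several teacher hypotheses per source sentence can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_teacher` is a hypothetical stand-in for a real teacher model's $n$-best decoding, and dropping exact duplicate hypotheses is an assumption made here for clarity.

```python
from typing import Callable, List, Tuple


def build_mhd_corpus(
    sources: List[str],
    teacher_nbest: Callable[[str, int], List[str]],
    n: int,
) -> List[Tuple[str, str]]:
    """Pair every source sentence with up to n distinct teacher hypotheses."""
    corpus: List[Tuple[str, str]] = []
    for src in sources:
        seen = set()
        for hyp in teacher_nbest(src, n):
            if hyp not in seen:  # assumption: exact duplicates are dropped
                seen.add(hyp)
                corpus.append((src, hyp))
    return corpus


def toy_teacher(src: str, n: int) -> List[str]:
    """Hypothetical teacher: canned 'translations' for illustration only."""
    variants = [src.upper(), src.title(), src.upper()]
    return variants[:n]


# Each source now contributes several (source, hypothesis) training pairs
# instead of the single best beam-search output.
corpus = build_mhd_corpus(["hello world"], toy_teacher, n=3)
```

In a real setup the callable would wrap the teacher's beam search or sampling routine; the student is then trained on the enlarged synthetic corpus in the usual way.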

[100] Unveiling the Influence of Amplifying Language-Specific Neurons

Inaya Rahmanisa, Lyzander Marciano Andrylie, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Main category: cs.CL

TL;DR: Amplifying language-specific neurons in LLMs improves performance in target languages, especially low-resource ones, but harms cross-lingual transfer.

Details

Motivation: To explore the role of language-specific neurons in multilingual behavior and their potential for improving performance in specific languages.

Method: Amplify language-specific neurons across 18 languages, evaluate effectiveness using Language Steering Shift (LSS) score, and test on downstream tasks (commonsense reasoning, knowledge, translation).

Result: Optimal amplification steers output to target languages, improving self-language performance but degrading cross-language results.

Conclusion: Language-specific neuron amplification benefits low-resource languages but offers limited advantage for cross-lingual transfer.

Abstract: Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior when deactivated. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
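A neuron-level intervention of this kind can be sketched as scaling selected activations by a factor. The abstract does not specify the exact intervention, so the multiplicative form, the function name `amplify_neurons`, and the toy values below are all assumptions for illustration.

```python
from typing import List, Sequence


def amplify_neurons(
    activations: Sequence[float],
    neuron_ids: Sequence[int],
    factor: float,
) -> List[float]:
    """Return a copy of the activations with the selected neurons scaled by factor."""
    selected = set(neuron_ids)
    return [a * factor if i in selected else a for i, a in enumerate(activations)]


# Toy hidden state; indices 1 and 2 play the role of language-specific neurons.
hidden = [0.5, -1.0, 2.0, 0.0]
steered = amplify_neurons(hidden, neuron_ids=[1, 2], factor=2.0)

# With this formulation, factor=0.0 corresponds to deactivating the neurons.
silenced = amplify_neurons(hidden, neuron_ids=[1, 2], factor=0.0)
```

Under this sketch, comparing amplification factors amounts to sweeping `factor` and measuring how often the output lands in the target language.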

cs.CV

[101] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

Santosh Patapati, Trisanth Srinivasan

Main category: cs.CV

TL;DR: NovaDrive is a vision-language architecture for autonomous vehicles, combining front-camera images, HD maps, LiDAR depth, and waypoints in a single branch. It uses cross-attention and a smoothness loss for real-time performance, achieving higher success rates and efficiency.

Details

Motivation: Autonomous vehicles need fast, accurate navigation in complex environments. Current methods often rely on recurrent memory, which can be inefficient. NovaDrive aims to simplify and improve performance.

Method: NovaDrive uses a single-branch architecture with cross-attention blocks to align waypoints with HD maps and refine attention over images/depth. It employs a smoothness loss to avoid abrupt changes and fine-tunes a vision-language backbone for real-time inference.

Result: NovaDrive improves success rate (84%, +4%), path efficiency (SPL 0.66, +0.11), and reduces collisions (1.2%, -1.4%) compared to prior methods. Key contributions include waypoint tokens, partial fine-tuning, and cross-attention fusion.

Conclusion: NovaDrive enhances autonomous driving performance and efficiency, with potential applications in other embodied-AI domains. Its design reduces fuel/battery usage and simplifies updates.

Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion are the main contributors to these gains. Beyond safety, NovaDrive’s shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
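A loss that discourages abrupt steering and speed changes can be sketched as a penalty on the difference between consecutive commands. The abstract does not give the formula, so the squared-difference form, the weights `w_steer` and `w_speed`, and the sample trajectories are assumptions.

```python
from typing import List, Tuple


def smoothness_loss(
    actions: List[Tuple[float, float]],
    w_steer: float = 1.0,
    w_speed: float = 1.0,
) -> float:
    """Mean weighted squared change between consecutive (steering, speed) commands."""
    if len(actions) < 2:
        return 0.0  # a single command has nothing to be smooth relative to
    total = 0.0
    for (s0, v0), (s1, v1) in zip(actions, actions[1:]):
        total += w_steer * (s1 - s0) ** 2 + w_speed * (v1 - v0) ** 2
    return total / (len(actions) - 1)


# A steady trajectory incurs no penalty; an oscillating one is penalised.
smooth = [(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
jerky = [(0.0, 10.0), (1.0, 12.0), (0.0, 10.0)]
loss_smooth = smoothness_loss(smooth)
loss_jerky = smoothness_loss(jerky)
```

In training, a term like this would typically be added to the main driving objective with a small coefficient, so that it regularises the output without overriding the task loss.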

[102] Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo

Main category: cs.CV

TL;DR: The paper introduces Phi-Ground, a model family for GUI grounding in Computer Use Agents (CUAs), achieving state-of-the-art performance on benchmarks.

Details

Motivation: Improving GUI grounding accuracy for CUAs, as current models underperform (<65% accuracy) on benchmarks like ScreenSpot-pro and UI-Vision.

Method: Empirical study on grounding model training, covering data collection to model training, leading to the development of Phi-Ground.

Result: Phi-Ground achieves SOTA performance on five benchmarks for models under 10B parameters, with scores of 43.2 (ScreenSpot-pro) and 27.2 (UI-Vision).

Conclusion: The study clarifies grounding model construction and benefits perception tasks, with Phi-Ground demonstrating significant improvements.

Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from “Iron Man”, are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

[103] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

Ruslan Khrulev

Main category: cs.CV

TL;DR: A new benchmark (EGE-Math Solutions Assessment Benchmark) evaluates VLMs on grading handwritten math solutions, revealing gaps in reasoning and rubric alignment.

Details

Motivation: Existing benchmarks focus on problem-solving, but this work aims to assess VLMs' ability to understand, identify mistakes, and grade student solutions.

Method: The benchmark uses 122 scanned solutions from the Russian Unified State Exam (EGE) with expert grades, testing seven VLMs in three inference modes.

Result: Results show limitations in VLMs’ mathematical reasoning and alignment with human grading rubrics.

Conclusion: The study highlights new research opportunities in AI-assisted assessment and provides an open-source benchmark for further work.

Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math

[104] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning

Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Main category: cs.CV

TL;DR: LLaVA-MORE introduces a family of MLLMs with diverse visual backbones and unified training for fair comparisons, exploring model size, architecture, and performance trade-offs.

Details

Motivation: Prior work lacks exploration of trade-offs between model size, architecture, and performance, and suffers from inconsistent training and evaluation protocols.

Method: Integrates recent language models with diverse visual backbones, uses unified training, and evaluates multimodal reasoning, generation, and instruction following.

Result: Provides insights into MLLM design, including the impact of LLM and visual encoders, image resolution, and pre-training datasets.

Conclusion: Offers a reproducible framework for fair comparisons and guidance for future MLLM development, with publicly available code and models.

Abstract: Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs – including Phi-4, LLaMA-3.1, and Gemma-2 – to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.

[105] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, Guanghua Yang

Main category: cs.CV

TL;DR: A framework for fast reconstruction and real-time rendering of urban scenes, with robustness to appearance variations, using parallel training, LOD control, and enhancement modules.

Details

Motivation: To address the challenges of efficiently reconstructing and rendering large urban scenes while handling appearance inconsistencies across multi-view captures.

Method: Scene partitioning for parallel training, visibility-based image selection, controllable LOD strategy, appearance transformation module, and enhancement modules (depth regularization, scale regularization, antialiasing).

Result: The method effectively reconstructs urban-scale scenes, outperforming previous approaches in efficiency and quality.

Conclusion: The proposed framework achieves high visual fidelity and robustness, with publicly available source code.

Abstract: We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.

[106] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Giuseppe Cartella, Vittorio Cuculo, Alessandro D’Amelio, Marcella Cornia, Giuseppe Boccignone, Rita Cucchiara

Main category: cs.CV

TL;DR: ScanDiff uses diffusion models and Vision Transformers to generate diverse and realistic human gaze scanpaths, outperforming existing methods in accuracy and variability.

Details

Motivation: Existing deep learning models for gaze scanpath prediction fail to capture human visual exploration variability, limiting their realism and applicability.

Method: Combines diffusion models with Vision Transformers to model scanpath variability and introduces textual conditioning for task-driven generation.

Result: ScanDiff outperforms state-of-the-art methods in free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths.

Conclusion: ScanDiff advances gaze prediction by better capturing human visual behavior complexity, with potential applications in HCI, autonomous systems, and cognitive robotics.

Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.

[107] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Main category: cs.CV

TL;DR: Deep learning-based super-resolution (SR) improves classification accuracy on low-quality echocardiograms, with SRResNet showing significant gains and computational efficiency.

Details

Motivation: Poor-quality echocardiographic imaging in resource-constrained settings limits diagnostic model effectiveness. SR techniques, underexplored for echocardiography, could enhance image quality.

Method: Applied SRGAN and SRResNet to enhance low-quality 2D echocardiograms from the CAMUS dataset. Evaluated performance on two tasks: 2CH vs. 4CH view classification and ED vs. ES phase classification.

Result: SRResNet significantly improved performance metrics and offered computational efficiency. SR recovered diagnostic value in degraded scans.

Conclusion: SR is a viable tool for AI-assisted care in resource-constrained settings, enhancing diagnostic accuracy with limited resources.

Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

[108] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

Ranxi Lin, Canming Yao, Jiayi Li, Weihang Liu, Xin Lou, Pingqiang Zhou

Main category: cs.CV

TL;DR: A spike-based NeRF framework (PATA) reduces computational costs by 64% in time steps and 61.55% in power while maintaining rendering quality, using adaptive time-step training.

Details

Motivation: NeRF models are computationally expensive, limiting their use in resource-constrained scenarios. SNNs offer energy efficiency, making them a viable alternative.

Method: Proposes PATA, a spike-based NeRF framework with dynamic time-step training, adapting to scene complexity for efficient inference.

Result: PATA reduces inference time steps by 64% and power consumption by 61.55% while preserving rendering fidelity.

Conclusion: PATA effectively balances rendering quality and computational efficiency, making NeRF models more practical for resource-limited environments.

Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.

[109] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

Alexandru Buburuzan

Main category: cs.CV

TL;DR: The paper introduces MObI and AnydoorMed, two novel methods for synthetic data generation in autonomous driving and medical imaging, using diffusion models for realistic and controllable inpainting.

Details

Motivation: The need for realistic and controllable synthetic data in safety-critical applications like autonomous driving and medical imaging drives this work.

Method: MObI uses a diffusion model for multimodal object inpainting with 3D bounding box conditioning, while AnydoorMed applies a similar approach for medical image inpainting.

Result: Both methods achieve high realism and controllability, enabling seamless object insertion and anomaly inpainting while maintaining semantic consistency.

Conclusion: The methods show that foundation models for inpainting can be adapted across diverse modalities, advancing realistic synthetic data generation.

Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly’s structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.

[110] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati, Trisanth Srinivasan, Murari Ambati

Main category: cs.CV

TL;DR: XYZ-Drive integrates vision, map, and waypoint data into a single model for autonomous driving, achieving high success rates and efficiency.

Details

Motivation: Autonomous cars require both geometric accuracy and semantic understanding, but existing methods handle them separately. XYZ-Drive aims to unify these aspects for better performance.

Method: XYZ-Drive uses a vision-language model with goal-centered cross-attention to fuse front-camera frames, overhead maps, and waypoints, processed by a fine-tuned LLaMA-3.2 11B model.
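The query-based fusion described here can be illustrated with plain scaled dot-product attention, where a waypoint token acts as the query over image or map patch tokens. This is a generic single-head sketch without learned projections, not the XYZ-Drive layer itself; the token values and the function names are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def cross_attention(
    query: Vector, keys: List[Vector], values: List[Vector]
) -> Tuple[Vector, Vector]:
    """One query token attends over patch tokens; returns (fused vector, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return fused, weights


# Hypothetical 2-D embeddings: the first patch points the same way as the waypoint.
waypoint = [1.0, 0.0]
patches = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
fused, weights = cross_attention(waypoint, patches, patches)
```

The attention weights show which patches the waypoint "highlights". Simple concatenation, the ablation baseline mentioned in the abstract, has no such goal-dependent weighting.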

Result: Achieves 95% success and 0.80 SPL on MD-NEX, outperforming PhysNav-DG by 15%, with fewer collisions and higher efficiency. Ablations confirm the importance of each modality and fusion method.

Conclusion: Early token-level fusion of intent and map layout enables accurate, transparent, and real-time autonomous driving.

Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25 m × 25 m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts performance by 3%, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
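For readers unfamiliar with the metric, SPL is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is a binary success flag, l_i the shortest-path length, and p_i the length of the path actually travelled. The sketch below assumes that common definition; the function name `spl` and the sample episodes are illustrative and are not taken from the paper or from MD-NEX.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, actual_len).
    Assumes shortest_len > 0 for every episode.
    """
    total = 0.0
    for success, shortest_len, actual_len in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent took the
            # shortest path; detours shrink the score proportionally.
            total += shortest_len / max(actual_len, shortest_len)
        # Failed episodes contribute 0 regardless of path length.
    return total / len(episodes)


# Hypothetical episodes: optimal success, success with a 2x detour, failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_score = spl(toy_episodes)  # (1.0 + 0.5 + 0.0) / 3
```

Because failures count as zero and detours are penalised, SPL can never exceed the plain success rate, which is consistent with 0.80 SPL alongside 95% success.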

[111] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

Dmitry Demidov, Zaigham Zaheer, Omkar Thawakar, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: The paper introduces E-FineR, a training-free method for fine-grained image classification, leveraging LLMs and VLMs for open-set recognition without predefined labels, achieving SOTA results and interpretability.

Details

Motivation: Traditional fine-grained classification methods are limited by fixed vocabularies and closed-set paradigms, hindering scalability and adaptability to novel classes.

Method: Proposes E-FineR, a training-free framework combining LLMs and VLMs, avoiding reliance on guessed class names and enhancing interpretability.

Result: E-FineR achieves state-of-the-art performance in fine-grained recognition and excels in zero-shot and few-shot classification without training or human intervention.

Conclusion: E-FineR enables flexible, language-driven image classification, scalable for real-world applications where expert annotations are scarce.

Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it performs on par with the existing SOTA while being training-free and requiring no human intervention. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available at https://github.com/demidovd98/e-finer.

[112] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

Sanghun Jung, Jingjing Zheng, Ke Zhang, Nan Qiao, Albert Y. C. Chen, Lu Xia, Chi Liu, Yuyin Sun, Xiao Zeng, Hsiang-Wei Huang, Byron Boots, Min Sun, Cheng-Hao Kuo

Main category: cs.CV

TL;DR: The paper proposes a state-of-the-art open-vocabulary 3D instance segmentation (OV-3DIS) method by combining complementary concepts, refining them, and introducing innovations like Alpha-CLIP and SMS score.

Details

Motivation: Existing OV-3DIS methods use vision-language models (VLMs) but lack synergy between concepts. The paper aims to integrate and refine these concepts for better performance.

Method: A two-stage approach: 1) 3D proposal generation via tracking-based aggregation and iterative merging/removal, and 2) instance classification using Alpha-CLIP (mask-enhanced CLIP) and SMS score for similarity normalization.

Result: Achieves SOTA performance on ScanNet200 and S3DIS, outperforming even closed-vocabulary methods in AP and AR metrics.

Conclusion: The proposed framework effectively combines and refines existing concepts, demonstrating superior performance in OV-3DIS tasks.

Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.

[113] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention

Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu

Main category: cs.CV

TL;DR: X-NeMo is a zero-shot diffusion-based method for animating static portraits using facial movements from a driving video, addressing identity leakage and expression capture issues.

Details

Motivation: Prior approaches suffer from identity leakage and poor capture of subtle/extreme expressions. X-NeMo aims to overcome these challenges.

Method: Uses a 1D identity-agnostic latent motion descriptor, cross-attention for motion control, and a dual GAN decoder for disentanglement.

Result: X-NeMo outperforms state-of-the-art methods, producing expressive animations with high identity resemblance.

Conclusion: X-NeMo effectively mitigates identity leakage and captures detailed facial motion, offering a robust solution for portrait animation.

Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.

[114] Single Image Rain Streak Removal Using Harris Corner Loss and R-CBAM Network

Jongwook Si, Sungyoung Kim

Main category: cs.CV

TL;DR: A novel image restoration network with Corner Loss and R-CBAM Block improves rain streak removal while preserving details, outperforming previous methods on standard datasets.

Details

Motivation: Single-image rain streak removal requires preserving structural details and visual quality, which existing methods struggle with.

Method: Introduces Corner Loss to protect boundaries and textures, and R-CBAM Block to dynamically adjust feature importance in spatial and channel dimensions.

Result: Achieves PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H, outperforming prior approaches.

Conclusion: The proposed method effectively removes rain streaks while maintaining image details, demonstrating superior performance.

Abstract: The problem of single-image rain streak removal goes beyond simple noise suppression, requiring the simultaneous preservation of fine structural details and overall visual quality. In this study, we propose a novel image restoration network that effectively constrains the restoration process by introducing a Corner Loss, which prevents the loss of object boundaries and detailed texture information during restoration. Furthermore, we integrate a Residual Convolutional Block Attention Module (R-CBAM) Block into the encoder and decoder to dynamically adjust the importance of features in both spatial and channel dimensions, enabling the network to focus more effectively on regions heavily affected by rain streaks. Quantitative evaluations conducted on the Rain100L and Rain100H datasets demonstrate that the proposed method significantly outperforms previous approaches, achieving a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H.
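The PSNR values quoted above are on a logarithmic scale derived from mean squared error. As a rough sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), the helper below works on flat lists of pixel values. The function name, the 8-bit peak value of 255, and the toy pixel data are assumptions for illustration; the summary does not say which colour channel or value range the authors used.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat lists of pixel intensities.
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel example: every pixel is off by 5 intensity levels.
clean = [100.0, 120.0, 140.0, 160.0]
derained = [105.0, 115.0, 145.0, 155.0]
toy_psnr = psnr(clean, derained)  # MSE = 25, so 10 * log10(255^2 / 25)
```

Since the scale is logarithmic, a gap of about 7 dB, as between the Rain100L and Rain100H results, corresponds to roughly a fivefold difference in mean squared error.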

[115] Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Xu Cao, Takafumi Taketomi

Main category: cs.CV

TL;DR: A neural inverse rendering method jointly reconstructs geometry, reflectance, and lighting from multi-view images without light calibration or intermediate cues.

Details

Motivation: Prior methods require light calibration or intermediate cues, limiting flexibility and accuracy. This work aims to optimize all scene parameters directly from raw images.

Method: Uses neural implicit fields for geometry and reflectance, with shadow-aware volume rendering. A spatial network predicts signed distance and reflectance latent codes, while a reflectance network estimates reflectance values.

Result: Outperforms state-of-the-art methods in shape and lighting accuracy, generalizes to view-unaligned multi-light images, and handles complex geometry and reflectance.

Conclusion: The proposed method is effective for joint reconstruction of scene parameters, offering improved accuracy and flexibility over existing approaches.

Abstract: We propose a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. Unlike prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, our method jointly optimizes all scene parameters from raw images in a single stage. We represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. A spatial network first predicts the signed distance and a reflectance latent code for each scene point. A reflectance network then estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. The proposed method outperforms state-of-the-art normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles objects with challenging geometry and reflectance.

[116] CNN-based solution for mango classification in agricultural environments

Beatriz Díaz Peón, Jorge Torres Gómez, Ariel Fajardo Márquez

Main category: cs.CV

TL;DR: A fruit detection and classification system using CNNs and cascade detectors was designed for mango quality assessment, achieving accuracy and efficiency.

Details

Motivation: To automate fruit quality assessment for farm inventory management, ensuring reliable and efficient classification.

Method: Used ResNet-18 for classification and a cascade detector for detection, balancing speed and computational resources. Results were displayed via a MATLAB App Designer interface.

Result: The system provided a reliable solution for fruit classification and detection, suitable for agricultural quality control.

Conclusion: The integration of CNNs and cascade detectors offers an effective approach for automated fruit quality assessment in agriculture.

Abstract: This article exemplifies the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors offers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.

[117] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Yingjie Zhou, Jiezhang Cao, Zicheng Zhang, Farong Wen, Yanwei Jiang, Jun Jia, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: The paper introduces THQA-10K, the largest dataset for assessing AI-Generated Talking Heads (AGTHs) quality, evaluates 12 T2I models and 14 talkers, and proposes a SOTA objective assessment method.

Details

Motivation: Addressing the lack of comprehensive studies on AGTH quality, the paper aims to improve evaluation standards and expose distortions in existing models.

Method: Created THQA-10K dataset with 10,457 AGTHs, conducted subjective ratings by volunteers, and proposed an objective assessment method using first frame, Y-T slice, and tone-lip consistency.

Result: The proposed objective method achieves SOTA performance in AGTH quality assessment, and the dataset highlights distortions and talker performance.

Conclusion: The work provides a robust framework for AGTH quality assessment, with the dataset and method serving as benchmarks for future research.

Abstract: Speech-driven methods for portraits are figuratively known as “Talkers” because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis for subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.

Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou

Main category: cs.CV

TL;DR: The paper proposes a 4-modal framework (text, audio, video, motion) for motion retrieval, using sequence-level contrastive learning to improve alignment and performance over existing methods.

Details

Motivation: Existing motion retrieval methods lack intuitive interaction and sequential representation, limiting performance and user experience.

Method: A fine-grained joint embedding space aligns four modalities (text, audio, video, motion) via sequence-level contrastive learning. Synthetic audio is added to datasets for evaluation.

Result: The framework outperforms state-of-the-art methods, with significant improvements in retrieval metrics (e.g., 10.16% in R@10 for text-to-motion).

Conclusion: Multi-modal motion retrieval, especially with audio, enhances performance and user immersion, advancing motion acquisition.

Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities – text, audio, video, and motion – within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
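The R@1 and R@10 figures above are recall-at-K scores, usually computed as the fraction of queries whose ground-truth item appears among the top K retrieved candidates. The sketch below illustrates that computation under the assumption that query i is paired with gallery item i; the function name and the toy similarity matrix are hypothetical and do not reproduce the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i][j] is the score between query i (e.g. a text embedding)
    and gallery item j (e.g. a motion embedding); the correct match for
    query i is assumed to be gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3-query, 3-item similarity matrix.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct item ranked first
    [0.8, 0.5, 0.1],  # query 1: correct item ranked second
    [0.2, 0.7, 0.6],  # query 2: correct item ranked second
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_2 = recall_at_k(toy_similarity, 2)
```

In this toy case only one of three queries is retrieved at rank 1, while all three are found within the top 2, which shows why R@10 is always at least as high as R@1.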

[119] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery

Youngsun Jang, Dongyoun Kim, Chulwoo Pack, Kwanghee Won

Main category: cs.CV

TL;DR: A new dataset for segmenting flooded areas in satellite images is introduced, addressing a gap in existing benchmarks. It includes 10 images per location from five Midwestern USA states, tested with state-of-the-art models, showing modest results and highlighting the need for future multimodal approaches.

Details

Motivation: Existing datasets for flooded area segmentation in satellite imagery are insufficient. This study aims to fill that gap by creating a dedicated dataset.

Method: Collected satellite images of the 2019 Midwestern USA floods, ensuring uniform resolution. Tested semantic segmentation models and conducted an ablation study on window sizes.

Result: Models showed modest performance, indicating a need for improved multimodal and temporal learning strategies.

Conclusion: The dataset addresses a critical gap and will be publicly available, encouraging further research in flood segmentation.

Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on https://github.com/youngsunjang/SDSU_MidWest_Flood_2019.

[120] Adversarial-Guided Diffusion for Multimodal LLM Attacks

Chengwei Xia, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang

Main category: cs.CV

TL;DR: The paper proposes an adversarial-guided diffusion (AGD) method to generate adversarial images for multimodal large language models (MLLMs) without distorting clean images, achieving robust attack performance against defenses.

Details

Motivation: To deceive MLLMs into generating targeted responses while minimizing image distortion, addressing limitations of traditional adversarial attacks.

Method: Introduces AGD, which injects target semantics into the noise component of reverse diffusion, leveraging the full-spectrum property of diffusion models.

Result: AGD outperforms state-of-the-art methods in attack efficacy and robustness against defenses like low-pass filtering.

Conclusion: AGD is a robust and effective approach for adversarial attacks on MLLMs, resistant to common defenses.

Abstract: This paper addresses the challenge of generating adversarial images using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
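The frequency argument above can be illustrated with a toy 1D experiment: a moving-average filter (one simple kind of low-pass filter) wipes out a purely high-frequency perturbation but leaves a perturbation that also has low-frequency content largely intact. This is only a sketch of the intuition, not the AGD method; the signals, the filter width, and the function names are all assumptions made for illustration.

```python
def moving_average(signal, width=2):
    """Simple low-pass filter: mean over a sliding window (valid positions only)."""
    return [
        sum(signal[i:i + width]) / width
        for i in range(len(signal) - width + 1)
    ]


def energy(signal):
    """Sum of squared sample values."""
    return sum(x * x for x in signal)


# Hypothetical perturbations added to a clean 1D signal.
high_freq = [1.0, -1.0] * 4   # sign flips every sample: highest frequency only
low_freq = [1.0] * 8          # constant offset: lowest frequency only
broadband = [h + l for h, l in zip(high_freq, low_freq)]  # both bands present

# Filtering acts linearly, so each component can be examined separately.
high_after = energy(moving_average(high_freq))   # alternating part cancels out
broad_after = energy(moving_average(broadband))  # low-frequency part survives
```

Here the high-frequency perturbation loses all of its energy after filtering, while the broadband one keeps the energy carried by its low-frequency part, matching the abstract's claim that a full-spectrum adversarial signal is harder to remove with low-pass defenses.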

[121] Sparse Reconstruction of Optical Doppler Tomography with Alternative State Space Model and Attention

Zhenghong Li, Jiaxiang Ren, Wensheng Cheng, Yanzuo Liu, Congwu Du, Yingtian Pan, Haibin Ling

Main category: cs.CV

TL;DR: A novel sparse ODT reconstruction framework (ASSAN) reduces the need for densely sampled A-scans, improving efficiency without compromising image fidelity.

Details

Motivation: Current ODT requires densely sampled A-scans for high-fidelity B-scans, leading to prolonged scanning time and increased storage demands.

Method: ASSAN uses 1D SSM for intra-A-scan representation and 1D gated self-attention for inter-A-scan features, enhanced by sequential 1D convolutions.

Result: ASSAN outperforms state-of-the-art methods in reconstruction on real animal data.

Conclusion: ASSAN effectively reduces raw A-scan requirements while maintaining high-fidelity ODT imaging.

Abstract: Optical coherence Doppler tomography (ODT) is an emerging blood flow imaging technique. The fundamental unit of ODT is a 1D depth-resolved trace called a raw A-scan (or A-line). A 2D ODT image (B-scan) is formed by reconstructing a cross-sectional flow image via Doppler phase-subtraction of raw A-scans along the B-line. To obtain a high-fidelity B-scan, densely sampled A-scans are currently required, leading to prolonged scanning time and increased storage demands. Addressing this issue, we propose a novel sparse ODT reconstruction framework with an Alternative State Space Attention Network (ASSAN) that effectively reduces the number of raw A-scans needed. Inspired by the distinct distributions of information along the A-line and B-line, ASSAN applies a 1D State Space Model (SSM) to each A-line to learn the intra-A-scan representation, while using 1D gated self-attention along the B-line to capture the inter-A-scan features. In addition, an effective feedforward network based on sequential 1D convolutions along different axes is employed to enhance local features. In validation experiments on real animal data, ASSAN shows clear effectiveness in the reconstruction in comparison with state-of-the-art reconstruction methods.
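Doppler phase subtraction between neighbouring A-scans is often implemented by taking, at each depth, the phase of one complex sample multiplied by the conjugate of its neighbour. The sketch below shows that generic step, assuming complex-valued A-scans of equal length; the function name and sample values are illustrative, and the paper's actual reconstruction pipeline may differ.

```python
import cmath


def doppler_phase_shift(a_scan_prev, a_scan_next):
    """Per-depth phase difference (radians) between two adjacent A-scans.

    Each A-scan is a list of complex samples, one per depth position.
    Multiplying by the conjugate subtracts the phases and keeps the
    result wrapped to the interval [-pi, pi].
    """
    return [
        cmath.phase(nxt * prev.conjugate())
        for prev, nxt in zip(a_scan_prev, a_scan_next)
    ]


# Hypothetical A-scans with two depth samples each (magnitude, phase).
scan_a = [cmath.rect(1.0, 0.1), cmath.rect(2.0, 1.0)]
scan_b = [cmath.rect(1.0, 0.4), cmath.rect(2.0, 0.5)]
toy_shifts = doppler_phase_shift(scan_a, scan_b)  # about [0.3, -0.5]
```

Each B-scan column needs at least one neighbouring A-scan to form this difference, which helps explain why dense sampling along the B-line is costly and why reducing the number of required A-scans is attractive.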

[122] Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images

Xiaoyu Ji, Ali Shakouri, Fengqing Zhu

Main category: cs.CV

TL;DR: The paper proposes a method to improve food crystal agglomeration detection in 2D microscopic images using supervised learning and post-processing.

Details

Motivation: Manual annotation of agglomeration is challenging due to water transparency and limited perspective.

Method: A supervised baseline model generates segmentation pseudo-labels, followed by an instance classification model with pixel-wise segmentation. Post-processing preserves crystal properties.

Result: The method improves true positive classification accuracy and size distribution predictions.

Conclusion: The approach successfully classifies agglomerated instances, addressing annotation variability.

Abstract: Food crystal agglomeration is a phenomenon that occurs during crystallization, trapping water between crystals and affecting food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post-processing module is designed and included in both steps. Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.

[123] DeepForest: Sensing Into Self-Occluding Volumes of Vegetation With Aerial Imaging

Mohamed Youssef, Jian Peng, Oliver Bimber

Main category: cs.CV

TL;DR: A novel method using drones and synthetic-aperture imaging with CNNs improves deep canopy vegetation sensing, outperforming traditional remote sensing by up to 12x in dense forests.

Details

Motivation: Overcome the limitation of remote sensing in penetrating dense canopy layers to access below-canopy volumetric vegetation data for ecosystem insights.

Method: Uses synthetic-aperture imaging with drones and pre-trained 3D CNNs (MSE loss) to process focal stacks, reducing out-of-focus signals and combining spectral channels.

Result: Achieves ~x7 average improvement (min ~x2, max ~x12) in dense forests (220-1680 trees/ha) and MSE of 0.05 compared to classical multispectral imaging.

Conclusion: The approach enables detailed volumetric vegetation analysis, offering insights into plant health and environmental conditions deep within canopies.

Abstract: Access to below-canopy volumetric vegetation data is crucial for understanding ecosystem dynamics. We address the long-standing limitation of remote sensing to penetrate deep into dense canopy layers. LiDAR and radar are currently considered the primary options for measuring 3D vegetation structures, while cameras can only extract the reflectance and depth of top layers. Using conventional, high-resolution aerial images, our approach allows sensing deep into self-occluding vegetation volumes, such as forests. It is similar in spirit to the imaging process of wide-field microscopy, but can handle much larger scales and strong occlusion. We scan focal stacks by synthetic-aperture imaging with drones and reduce out-of-focus signal contributions using pre-trained 3D convolutional neural networks with mean squared error (MSE) as the loss function. The resulting volumetric reflectance stacks contain low-frequency representations of the vegetation volume. Combining multiple reflectance stacks from various spectral channels provides insights into plant health, growth, and environmental conditions throughout the entire vegetation volume. Compared with simulated ground truth, our correction leads to ~x7 average improvements (min: ~x2, max: ~x12) for forest densities of 220 trees/ha - 1680 trees/ha. In our field experiment, we achieved an MSE of 0.05 when comparing with the top-vegetation layer that was measured with classical multispectral aerial imaging.

[124] YOLO-ROC: A High-Precision and Ultra-Lightweight Model for Real-Time Road Damage Detection

Zicheng Lin, Weichao Pan

Main category: cs.CV

TL;DR: The paper introduces YOLO-ROC, a lightweight and high-precision model for road damage detection, addressing challenges like multi-scale feature extraction and computational efficiency.

Details

Motivation: Existing deep learning methods struggle with detecting small-scale road damage and have high computational demands, limiting real-time deployment.

Method: Proposes YOLO-ROC with a BMS-SPPF module for multi-scale feature extraction and a channel compression strategy to reduce complexity.

Result: Achieves 67.6% mAP50, outperforming YOLOv8n by 2.11%, with significant improvements in small-target detection and reduced model size (2.0 MB).

Conclusion: YOLO-ROC is effective for real-time road damage detection, offering high precision, lightweight design, and strong generalization.

Abstract: Road damage detection is a critical task for ensuring traffic safety and maintaining infrastructure integrity. While deep learning-based detection methods are now widely adopted, they still face two core challenges: first, the inadequate multi-scale feature extraction capabilities of existing networks for diverse targets like cracks and potholes, leading to high miss rates for small-scale damage; and second, the substantial parameter counts and computational demands of mainstream models, which hinder their deployment for efficient, real-time detection in practical applications. To address these issues, this paper proposes a high-precision and lightweight model, YOLO - Road Orthogonal Compact (YOLO-ROC). We designed a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and implemented a hierarchical channel compression strategy to reduce computational complexity. The BMS-SPPF module leverages a bidirectional spatial-channel attention mechanism to improve the detection of small targets. Concurrently, the channel compression strategy reduces the parameter count from 3.01M to 0.89M and GFLOPs from 8.1 to 2.6. Experiments on the RDD2022_China_Drone dataset demonstrate that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%. Notably, the mAP50 for the small-target D40 category improved by 16.8%, and the final model size is only 2.0 MB. Furthermore, the model exhibits excellent generalization performance on the RDD2022_China_Motorbike dataset.
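As a quick check on the compression figures quoted in the abstract, the relative reductions can be computed directly. This is only arithmetic on the reported numbers; the helper name `relative_reduction` is hypothetical and does not come from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction when going from `before` to `after`."""
    return (before - after) / before


# Figures reported in the abstract (parameters in millions, compute in GFLOPs).
params_cut = relative_reduction(3.01, 0.89)
gflops_cut = relative_reduction(8.1, 2.6)

print(f"parameters reduced by {params_cut:.1%}")
print(f"GFLOPs reduced by {gflops_cut:.1%}")
```

Under these numbers the channel compression removes roughly 70% of the parameters and roughly 68% of the GFLOPs.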

[125] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience

Yanming Xiu

Main category: cs.CV

TL;DR: The paper focuses on ensuring AR safety by detecting harmful virtual content using vision-language models and proposes future directions for scalable, human-aligned AR safeguards.

Details

Motivation: The increasing integration of AR into daily life necessitates addressing risks like task-detrimental content and perceptual manipulation to ensure safety and trustworthiness.

Method: Developed two systems, ViDDAR and VIM-Sense, leveraging vision-language models and multimodal reasoning to detect harmful AR content.

Result: Proposed future directions: automated quality assessment, multimodal attack detection, and efficient VLM adaptation for AR devices.

Conclusion: Aims to establish a scalable, human-aligned framework for AR safety, seeking feedback on perceptual modeling, multimodal content, and lightweight model adaptation.

Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.

[126] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning

Fan Lyu, Linglan Zhao, Chengyan Liu, Yinying Mei, Zhang Zhang, Jian Zhang, Fuyuan Hu, Liang Wang

Main category: cs.CV

TL;DR: The paper redefines Semi-FSCIL as GSemi-FSCIL by including base and novel classes in unlabeled data, proposes ALDC to address distribution bias, and achieves state-of-the-art results.

Details

Motivation: Existing Semi-FSCIL assumes unlabeled data is only from novel classes, which is impractical. The paper aims to align with real-world scenarios by including base and all novel classes in unlabeled data.

Method: Proposes Ambiguity-guided Learnable Distribution Calibration (ALDC) to dynamically correct biased feature distributions using base samples.

Result: ALDC outperforms existing methods on three benchmark datasets, achieving state-of-the-art performance.

Conclusion: The GSemi-FSCIL framework and ALDC strategy effectively address the challenge of distinguishing unlabeled samples from base and novel classes, improving model performance.

Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.

[127] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Sungguk Cha, DongWook Kim, Taeseung Hahn, Mintae Kim, Youngsub Han, Byoung-Ki Jeon

Main category: cs.CV

TL;DR: RL-QR is a reinforcement learning framework for query rewriting in RAG systems, improving retrieval performance without human annotations. It shows gains for multi-modal and lexical retrievers but struggles with semantic/hybrid ones.

Details

Motivation: Optimizing queries for diverse, unstructured real-world documents in RAG systems is challenging, and existing methods often require human-annotated datasets.

Method: RL-QR uses reinforcement learning (GRPO) to train retriever-specific query rewriters, synthesizing scenario-question pairs for training.

Result: RL-QR improves NDCG@3 by 11% for multi-modal and 9% for lexical retrievers, but not for semantic/hybrid retrievers due to training misalignments.

Conclusion: RL-QR offers a scalable, annotation-free solution for query optimization in RAG systems, with potential for further refinement in semantic retrieval.

Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with $\text{RL-QR}_{\text{multi-modal}}$ achieving an 11% relative gain in NDCG@3 for multi-modal RAG and $\text{RL-QR}_{\text{lexical}}$ yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR’s potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
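For readers unfamiliar with the reported metric, the following is a minimal sketch of NDCG@k with linear gain, one common variant; the abstract does not say which variant the authors use. The function names and the toy relevance lists are illustrative assumptions, not data from the paper.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount is log2(rank + 2): 1 for the top result.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 3) -> float:
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels of retrieved documents, in ranked order.
before_rewrite = [0, 1, 0, 1]
after_rewrite = [1, 0, 1, 0]

print(ndcg_at_k(before_rewrite, k=3))
print(ndcg_at_k(after_rewrite, k=3))
```

A "relative gain" such as the 11% quoted above would then be `(after - before) / before`, computed on the metric averaged over queries.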

[128] Automated Mapping the Pathways of Cranial Nerve II, III, V, and VII/VIII: A Multi-Parametric Multi-Stage Diffusion Tractography Atlas

Lei Xie, Jiahao Huang, Jiawei Zhang, Jianzhong He, Yiang Pan, Guoqiang Xie, Mengjun Li, Qingrun Zeng, Mingchu Li, Yuanjing Feng

Main category: cs.CV

TL;DR: The study presents the first comprehensive diffusion tractography atlas for automated mapping of cranial nerve (CN) pathways in the human brain, using multi-stage fiber clustering on data from 50 subjects.

Details

Motivation: Mapping CN pathways is challenging due to their unique anatomical structures and the complexity of the skull base environment. This work aims to provide a detailed and automated solution.

Method: The atlas is generated using multi-stage fiber clustering on approximately 1,000,000 streamlines from 50 HCP subjects, validated against expert annotations and clinical cases.

Result: The atlas successfully identifies 8 fiber bundles associated with 5 CN pairs, showing high spatial correspondence with manual annotations across multiple datasets.

Conclusion: This atlas enhances automated CN pathway mapping, improving analysis and visualization of brain structures, with demonstrated robustness across diverse datasets.

Abstract: Cranial nerves (CNs) play a crucial role in various essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative insights into the spatial relationships between individual CNs and key tissues. However, mapping a comprehensive and detailed CN atlas is challenging because of the unique anatomical structures of each CN pair and the complexity of the skull base environment. In this work, we present what we believe to be the first study to develop a comprehensive diffusion tractography atlas for automated mapping of CN pathways in the human brain. The CN atlas is generated by clustering the streamlines produced by multi-parametric fiber tractography for each pair of CNs. Instead of disposable clustering, we explore a new strategy of multi-stage fiber clustering for multiple analyses of approximately 1,000,000 streamlines generated from 50 subjects of the Human Connectome Project (HCP). Quantitative and visual experiments demonstrate that our CN atlas achieves high spatial correspondence with expert manual annotations on multiple acquisition sites, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset and two clinical cases of pituitary adenoma patients. The proposed CN atlas can automatically identify 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V and facial-vestibulocochlear nerve CN VII/VIII, and its robustness is demonstrated experimentally. This work contributes to the field of diffusion imaging by facilitating more efficient and automated mapping of the pathways of multiple pairs of CNs, thereby enhancing the analysis and understanding of complex brain structures through visualization of their spatial relationships with nearby anatomy.

[129] A Deep Dive into Generic Object Tracking: A Survey

Fereshteh Aghaee Meibodi, Shadi Alijani, Homayoun Najjaran

Main category: cs.CV

TL;DR: A comprehensive review of generic object tracking methods, focusing on Siamese-based, discriminative, and transformer-based trackers, with emphasis on the latter’s advancements.

Details

Motivation: Addressing challenges like occlusions, distractors, and appearance variations in object tracking by reviewing diverse tracking paradigms.

Method: Analyzing core design principles, innovations, and limitations of Siamese-based, discriminative, and transformer-based trackers through qualitative and quantitative comparisons.

Result: A novel categorization, unified visual and tabular comparison of methods, and summary of evaluation benchmarks.

Conclusion: Transformer-based trackers show rapid advancements due to robust spatio-temporal modeling, making them a promising direction in the field.

Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.

[130] Towards Measuring and Modeling Geometric Structures in Time Series Forecasting via Image Modality

Mingyang Yu, Xiahui Guo, Peng Chen, Zhenkai Li, Yang Shu

Main category: cs.CV

TL;DR: Proposes TGSI and SATL to evaluate and improve time series forecasting by focusing on geometric structure, outperforming baselines in accuracy and structure metrics.

Details

Motivation: Traditional metrics like MSE fail to capture the geometric structure of time series, which is crucial for understanding temporal dynamics.

Method: Introduces TGSI for geometric evaluation and SATL, a multi-component loss (first-order difference, frequency domain, perceptual feature) for training.

Result: Models with SATL outperform baselines in both MSE and TGSI metrics without extra inference cost.

Conclusion: SATL effectively enhances structure modeling in time series forecasting, validated by superior performance across datasets.

Abstract: Time series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
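The first two SATL components described above can be illustrated with a small sketch. It is not the authors' implementation: the naive DFT stands in for an FFT, keeping only the `keep` lowest frequency bins is an assumed way of "minimizing noise", and the perceptual feature loss is omitted because it needs pre-trained networks. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-wise mean squared error between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def first_order_difference_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """MSE between the first-order differences of the two series."""
    d_pred = [b - a for a, b in zip(pred, pred[1:])]
    d_target = [b - a for a, b in zip(target, target[1:])]
    return mse(d_pred, d_target)


def dft(x: Sequence[float]) -> List[complex]:
    """Naive one-sided DFT (O(n^2)); an FFT would give the same values."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n // 2 + 1)
    ]


def frequency_loss(pred: Sequence[float], target: Sequence[float], keep: int) -> float:
    """Mean squared spectral error over the `keep` lowest bins (keep >= 1, assumed)."""
    p, t = dft(pred)[:keep], dft(target)[:keep]
    return sum(abs(a - b) ** 2 for a, b in zip(p, t)) / len(p)


# A forecast with the right shape but a constant offset.
target = [math.sin(2 * math.pi * t / 16) for t in range(32)]
shifted = [v + 0.5 for v in target]

print(mse(shifted, target))                          # penalizes the offset
print(first_order_difference_loss(shifted, target))  # ignores the offset
```

The example shows why a shape-oriented term complements MSE: the shifted forecast has a clear point-wise error, while its first-order differences match the target.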

[131] Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels

Haoxian Ruan, Zhihua Xu, Zhijing Yang, Guang Ma, Jieming Xie, Changxiang Fan, Tianshui Chen

Main category: cs.CV

TL;DR: The paper introduces SATL, a method for multi-label image recognition with partial labels, improving performance by dynamically learning category-specific thresholds and using differential ranking loss.

Details

Motivation: Traditional methods for MLR-PL use fixed thresholds, ignoring varying score distributions across categories, leading to inaccurate pseudo-labels and poor performance.

Method: Proposes Semantic-Aware Threshold Learning (SATL), which calculates and updates score distributions and thresholds dynamically, and uses differential ranking loss to enhance threshold discrimination.

Result: Experiments on COCO and VG-200 show SATL significantly outperforms traditional methods in limited-label scenarios.

Conclusion: SATL effectively addresses the limitations of fixed-threshold approaches, improving accuracy and performance in multi-label recognition with partial labels.

Abstract: Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.

[132] PixNerd: Pixel Neural Field Diffusion

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang

Main category: cs.CV

TL;DR: PixNerd (Pixel Neural Field Diffusion) is a single-stage, efficient, end-to-end solution for image generation that avoids the two-stage training paradigm built on a pre-trained VAE and achieves competitive results on benchmarks.

Details

Motivation: Current diffusion transformers rely on pre-trained VAEs, leading to accumulated errors and decoding artifacts. PixNerd aims to simplify the process by modeling patch-wise decoding with neural fields.

Method: Proposes PixNerd, a single-scale, single-stage framework using neural field representation for efficient end-to-end image generation, eliminating the need for VAEs or cascade pipelines.

Result: Achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on 512×512, with competitive scores on GenEval (0.73) and DPG (80.9) benchmarks.

Conclusion: PixNerd offers a simpler, more efficient alternative to traditional two-stage methods, demonstrating strong performance in image generation tasks.

Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

[133] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2

Solha Kang, Eugene Kim, Joris Vankerschaver, Utku Ozbulak

Main category: cs.CV

TL;DR: The paper explores adapting the Segment Anything Model 2 (SAM2) for low-cost 3D tumor segmentation in breast MRI, using minimal input and achieving strong performance despite being a zero-shot model.

Details

Motivation: Manual interpretation of 3D breast MRI scans is labor-intensive and subjective, and commercial AI tools are often inaccessible in low- and middle-income countries due to high costs and infrastructure demands.

Method: SAM2 is adapted for 3D segmentation using a single bounding box annotation on one slice, with three slice-wise tracking strategies (top-to-bottom, bottom-to-top, center-outward) evaluated for propagation.

Result: Center-outward propagation yields the most consistent and accurate segmentations, with SAM2 performing well despite no training on volumetric medical data. Performance varies with tumor size, location, and shape.

Conclusion: General-purpose foundation models like SAM2 can enable accessible and affordable 3D medical image analysis in resource-constrained settings with minimal supervision.

Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
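The three slice-wise tracking strategies differ only in the order in which slices are visited. The sketch below shows one plausible way to build those orders; it assumes the prompted slice sits at index `start` and that center-outward means two passes away from that slice, which the abstract does not spell out. No SAM2 call is made, and the function names are hypothetical.

```python
from typing import List, Tuple


def top_to_bottom(num_slices: int) -> List[int]:
    """Visit slices from index 0 to the last index."""
    return list(range(num_slices))


def bottom_to_top(num_slices: int) -> List[int]:
    """Visit slices from the last index back to 0."""
    return list(range(num_slices - 1, -1, -1))


def center_outward(num_slices: int, start: int) -> Tuple[List[int], List[int]]:
    """Return two passes that both move away from the prompted slice.

    The first pass includes the prompted slice and moves toward the last
    index; the second pass starts next to it and moves toward index 0.
    """
    if not 0 <= start < num_slices:
        raise ValueError("start must be a valid slice index")
    forward = list(range(start, num_slices))
    backward = list(range(start - 1, -1, -1))
    return forward, backward


forward, backward = center_outward(num_slices=7, start=3)
print(forward)   # [3, 4, 5, 6]
print(backward)  # [2, 1, 0]
```

In this reading, each pass would be handed to the tracker as an ordered frame sequence, so no slice is more than half the volume away from the annotated one.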

[134] iLRM: An Iterative Large 3D Reconstruction Model

Gyeongjin Kang, Seungtae Nam, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park

Main category: cs.CV

TL;DR: iLRM introduces an iterative refinement model for scalable 3D Gaussian splatting, improving reconstruction quality and speed by decoupling scene representation, reducing computational costs, and injecting high-resolution details.

Details

Motivation: Address scalability and efficiency issues in feed-forward 3D reconstruction, particularly the prohibitive computational costs of transformer-based methods.

Method: Iterative refinement with decoupled scene representation, two-stage attention, and high-resolution information injection.

Result: Outperforms existing methods in quality and speed, with superior scalability for larger input views.

Conclusion: iLRM offers a scalable, efficient solution for high-quality 3D reconstruction, advancing the field of explicit 3D representations.

Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

[135] UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, Liwei Wang

Main category: cs.CV

TL;DR: UniLIP extends CLIP to reconstruction, generation, and editing with a two-stage training scheme and self-distillation, outperforming previous unified models.

Details

Motivation: To unify reconstruction, generation, and editing in CLIP without degrading its comprehension performance or requiring additional decoders.

Method: Introduces a two-stage training scheme, self-distillation, and a dual-condition architecture connecting MLLM and diffusion transformer.

Result: Scores 0.87 and 0.53 on GenEval and WISE benchmarks for generation, and 3.62 on ImgEdit Benchmark for editing, surpassing competitors.

Conclusion: UniLIP expands CLIP’s capabilities, maintaining comprehension while excelling in generation and editing tasks.

Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM’s strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks, respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

Main category: cs.CV

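TL;DR: The paper introduces BLiM, a framework for text-video retrieval with MLLMs, and Candidate Prior Normalization (CPN), a training-free calibration step that counters candidate prior bias; together they improve R@1 by 6.4 on average over previous state-of-the-art models.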

Details

Motivation: To mitigate candidate prior bias in MLLM-based retrieval, which favors candidates with higher priors over relevance.

Method: Proposes BLiM, leveraging bidirectional likelihood estimation, and CPN for score calibration.

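Result: On four text-video retrieval benchmarks, BLiM with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, reducing candidate prior bias and emphasizing query-candidate relevance.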

Conclusion: BLiM and CPN effectively address prior bias, enhancing retrieval and broader multi-modal tasks.

Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
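To make candidate prior bias concrete, here is a minimal sketch of one way a training-free calibration could work: subtracting a scaled log-prior of the candidate from its log-likelihood given the query. This form, the `alpha` weight, and the toy numbers are assumptions for illustration; the abstract does not give the exact CPN formula.

```python
from typing import Dict, Tuple


def calibrated_score(log_likelihood: float, log_prior: float, alpha: float = 1.0) -> float:
    """Assumed calibration: log p(candidate | query) - alpha * log p(candidate)."""
    return log_likelihood - alpha * log_prior


def best_candidate(scores: Dict[str, Tuple[float, float]], alpha: float) -> str:
    """Pick the candidate with the highest calibrated score.

    `scores` maps a candidate name to (log p(c | q), log p(c)).
    With alpha = 0 this reduces to plain candidate-likelihood retrieval.
    """
    return max(scores, key=lambda c: calibrated_score(*scores[c], alpha=alpha))


# Hypothetical log-probabilities: a generic caption is likely under any query,
# while a specific caption is only likely given the matching video.
toy_scores = {
    "generic caption": (-10.0, -12.0),
    "specific caption": (-11.0, -20.0),
}

print(best_candidate(toy_scores, alpha=0.0))  # generic caption
print(best_candidate(toy_scores, alpha=1.0))  # specific caption
```

With `alpha = 0` the generic caption wins on raw likelihood; once its high prior is discounted, the query-specific caption ranks first.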

[137] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis

Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung

Main category: cs.CV

TL;DR: A new benchmark, Layout Error Detection (LED), is introduced to evaluate structural robustness in document layout predictions, addressing limitations of traditional metrics like IoU and mAP.

Details

Motivation: Existing metrics fail to detect critical structural errors (e.g., region merging, splitting, missing content) in document layout analysis.

Method: LED defines eight standardized error types and three tasks: error existence detection, error type classification, and element-wise error type classification. A synthetic dataset (LED-Dataset) is created by injecting realistic errors.

Result: LED effectively differentiates structural understanding in models, revealing biases and trade-offs unseen by traditional metrics.

Conclusion: LED provides a more comprehensive evaluation of structural robustness in document layout analysis, filling a gap left by conventional metrics.

Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.
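
To illustrate why overlap metrics can miss structural errors such as region merging, here is a minimal sketch of IoU for axis-aligned boxes. The box coordinates and region names are hypothetical and are not drawn from LED-Dataset.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical ground truth: a title region directly above a paragraph region.
title = (0, 0, 100, 20)
paragraph = (0, 20, 100, 100)

# Hypothetical prediction that merges both regions into one box.
merged = (0, 0, 100, 100)

print(iou(merged, paragraph))  # high overlap despite the merge
print(iou(merged, title))      # the title region is effectively lost
```

In this example the merged box still overlaps the paragraph at IoU 0.8, so a common 0.5 matching threshold would count it as a correct detection even though the title has been absorbed.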

[138] Training-free Geometric Image Editing on Diffusion Models

Hanshen Zhu, Zhen Zhu, Kaile Zhang, Yiming Gong, Yuliang Liu, Xiang Bai

Main category: cs.CV

TL;DR: A decoupled pipeline for geometric image editing outperforms state-of-the-art methods by separating object transformation, inpainting, and refinement.

Details

Motivation: Handling large or complex geometric transformations in image editing is challenging for single-step diffusion methods.

Method: Proposes a decoupled pipeline with object transformation, source inpainting, and target refinement using FreeFine, a training-free diffusion approach.

Result: FreeFine excels in image fidelity and edit precision, especially for demanding transformations, as shown on the GeoBench benchmark.

Conclusion: The decoupled approach with FreeFine improves geometric editing performance and is validated by the GeoBench benchmark.

Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, which proves difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine

[139] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection

Xihang Hu, Fuming Sun, Jiazhe Liu, Feilong Xu, Xiaoli Zhang

Main category: cs.CV

TL;DR: ST-SAM is a concise, annotation-efficient framework for semi-supervised camouflaged object detection, using self-training and hybrid prompts to outperform existing methods with minimal labeled data.

Details

Motivation: To reduce reliance on costly pixel-level annotations and address issues like prediction bias and high computational overhead in existing SSCOD methods.

Method: Employs self-training to dynamically filter and expand high-confidence pseudo-labels, and uses hybrid prompts to leverage the Segment Anything Model for specialized tasks.

Result: Achieves state-of-the-art performance with only 1% labeled data, matching fully supervised methods while training only a single network.

Conclusion: ST-SAM establishes a new paradigm for annotation-efficient SSCOD, offering scalability and simplicity without relying on specific models or loss functions.

Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model’s potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.

[140] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving

Xuewei Tang, Mengmeng Yang, Tuopu Wen, Peijin Jia, Le Cui, Mingshang Luo, Kehua Sheng, Bo Zhang, Diange Yang, Kun Jiang

Main category: cs.CV

TL;DR: PriorFusion integrates semantic, geometric, and generative priors to improve road perception in autonomous driving, outperforming existing methods in accuracy and robustness.

Details

Motivation: Autonomous vehicles need reliable road perception in complex environments without HD maps, but current methods fail to fully exploit structured priors in road elements.

Method: Proposes PriorFusion, using instance-aware attention, shape-prior features, and a diffusion-based framework to generate accurate predictions.

Result: Significant improvement in perception accuracy on large-scale datasets, especially in challenging conditions.

Conclusion: PriorFusion enhances road element perception, producing more accurate and coherent predictions.

Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.

[141] Forgetting of task-specific knowledge in model merging-based continual learning

Timm Hess, Gido M van de Ven, Tinne Tuytelaars

Main category: cs.CV

TL;DR: Linear merging of models in continual learning preserves shared knowledge but degrades task-specific knowledge. Incremental training outperforms parallel training in merging.

Details

Motivation: To explore how merging models affects shared and task-specific knowledge in continual learning.

Method: Controlled computer vision experiments with visual cues to analyze model merging.

Result: Merging preserves shared knowledge but degrades task-specific knowledge. Incremental training yields better merging results than parallel training.

Conclusion: Linear merging is effective for shared knowledge in continual learning, with incremental training being superior for model merging.

Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
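
For readers unfamiliar with the term, linear merging is commonly understood as a weighted average of the parameters of several models that share an architecture. The sketch below shows that operation on plain Python dictionaries; the parameter names and values are hypothetical, and the paper's exact merging setup may differ.

```python
def linear_merge(state_dicts, weights=None):
    """Weighted average of parameter dicts that share the same keys and shapes."""
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n  # uniform averaging by default
    merged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(length)
        ]
    return merged


# Two hypothetical models with identical parameter layouts.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 6.0], "layer.bias": [1.0]}

merged = linear_merge([model_a, model_b])
print(merged)
```

Because every parameter is averaged, components on which the models agree survive the merge, while components present in only one model are scaled down. That is consistent with the reported degradation of unshared task-specific knowledge, although the paper's own analysis should be consulted for the actual mechanism.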

[142] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

Main category: cs.CV

TL;DR: The paper explores how text-to-image diffusion models internally represent content and style in artworks, using cross-attention heatmaps to analyze their emergent understanding of these concepts.

Details

Motivation: To understand how diffusion models encode content and style in artworks without explicit guidance, given their remarkable generative capabilities.

Method: Leveraging cross-attention heatmaps to attribute image regions to content or style tokens in transformer-based text-to-image diffusion models.

Result: Diffusion models show varying degrees of content-style separation, with content tokens affecting object regions and style tokens influencing backgrounds and textures.

Conclusion: The study reveals an emergent understanding of content-style distinction in diffusion models, enhancing insights into their internal representations of artistic concepts.

Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

[143] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Sarbajit Pal, Amitabha Das

Main category: cs.CV

TL;DR: The paper analyzes hyperparameter impacts on seven lightweight models for real-time image classification, highlighting RepVGG-A2’s balance of accuracy and efficiency.

Details

Motivation: To optimize hyperparameters for lightweight models in resource-constrained applications like edge devices.

Method: Trains models on ImageNet-1K with consistent settings, conducts ablation studies on hyperparameters, and evaluates performance metrics.

Result: Cosine learning rate decay and adjustable batch size improve accuracy and convergence; RepVGG-A2 achieves >80% Top-1 accuracy with efficient inference.

Conclusion: Provides practical guidance for building efficient real-time image processing models, with code and logs publicly available.

Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
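
As a reference for the schedule named in the results, here is a minimal sketch of standard cosine learning rate decay. The step counts and learning rates are illustrative only; the paper's actual settings are in its repository and are not reproduced here.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative schedule: 100 steps, peak learning rate 0.1.
for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_lr(step, 100, 0.1), 4))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the end of training.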

[144] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Xiaobao Wei, Sixiang Chen, Zhuo Li, Yang Wang, Liyun Li, Xianming Liu, Ming Lu, Shanghang Zhang

Main category: cs.CV

TL;DR: FastDriveVLA is a reconstruction-based vision token pruning framework for autonomous driving, prioritizing foreground information to reduce computational costs while maintaining performance.

Details

Motivation: Current visual token pruning methods perform poorly in autonomous driving. Human drivers focus on foreground areas, so retaining such tokens is crucial for decision-making.

Method: Proposes FastDriveVLA with ReconPruner, a plug-and-play pruner using MAE-style pixel reconstruction and adversarial foreground-background training.

Result: Achieves state-of-the-art performance on the nuScenes benchmark across pruning ratios.

Conclusion: FastDriveVLA effectively prunes visual tokens while preserving foreground information, enhancing autonomous driving systems.

Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.

[145] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang, Hongbin Lin, Yueru Luo, Suzhong Fu, Chao Zheng, Xinrui Yan, Shuqi Mei, Kun Tang, Shuguang Cui, Zhen Li

Main category: cs.CV

TL;DR: FASTopoWM is a novel framework for lane segment topology reasoning in autonomous driving, improving temporal perception by integrating fast-slow systems and latent world models.

Details

Motivation: Existing methods lack effective temporal information use, rely too much on historical queries, and are vulnerable to pose estimation failures.

Method: FASTopoWM introduces a fast-slow framework with parallel supervision and latent world models for better temporal propagation.

Result: Outperforms state-of-the-art methods in lane segment detection (37.4% mAP) and centerline perception (46.3% OLS).

Conclusion: FASTopoWM effectively addresses limitations of current methods, enhancing lane topology reasoning for autonomous driving.

Abstract: Lane segment topology reasoning provides comprehensive bird’s-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, the stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).

[146] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

Yingkai Wang, Yaoyao Zhu, Xiuding Cai, Yuhao Xiao, Haotian Wu, Yu Yao

Main category: cs.CV

TL;DR: A domain generalization framework for medical image segmentation improves robustness to domain shifts by using feature perturbations and adaptive consistency constraints, outperforming existing methods.

Details

Motivation: Domain shifts in medical imaging degrade segmentation performance due to variations in imaging conditions, limiting practical deployment.

Method: The framework introduces implicit feature perturbations guided by domain statistics, using a learnable semantic direction selector and covariance-based sampler, along with an adaptive consistency constraint.

Result: Outperforms existing domain generalization approaches on multi-center benchmarks, achieving robust segmentation across diverse domains.

Conclusion: The proposed framework effectively addresses domain shifts in medical image segmentation, enhancing generalizability and reliability.

Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

Qiang Lu, Waikit Xiu, Xiying Li, Shenyu Hu, Shengbo Sun

Main category: cs.CV

TL;DR: A novel two-stage framework combining open-vocabulary detection and cross-modal learning improves traffic sign recognition by addressing long-tail distribution and multi-scale feature extraction challenges.

Details

Motivation: Traffic sign recognition is crucial for autonomous driving but suffers from long-tail data distribution and small, multi-scale targets, reducing performance.

Method: Proposes NanoVerse YOLO for detection (RepVL-PAN and SPD-Conv modules) and TSR-MCL for classification (contrasting visual and semantic features).

Result: Achieves 78.4% mAP on TT100K dataset, with 91.8% accuracy and 88.9% recall, outperforming existing methods.

Conclusion: The framework effectively mitigates data imbalance and scale variation issues, enhancing recognition in complex scenarios.

Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.

[148] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

Xingyue Peng, Yuandong Lyu, Lang Zhang, Jian Zhu, Songtao Wang, Jiaxin Deng, Songxin Lu, Weiliang Ma, Dangen She, Peng Jia, XianPeng Lang

Main category: cs.CV

TL;DR: A robust road surface reconstruction framework using occlusion-aware 2D Gaussian surfels and semantic-guided color enhancement, outperforming prior methods in complex urban environments.

Details

Motivation: Road surface reconstruction is crucial for autonomous driving but is challenged by occlusions, visual clutter, and appearance degradation due to lighting/weather changes.

Method: Integrates planar-adapted Gaussian representation, segmentation-guided video inpainting, and semantic-aware color correction in HSV space.

Result: Produces visually coherent and geometrically faithful reconstructions, outperforming existing methods in real-world conditions.

Conclusion: The framework effectively addresses challenges in road surface reconstruction, enhancing accuracy and robustness for autonomous driving applications.

Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.

[149] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

Ahmet Can Ömercikoğlu, Mustafa Mansur Yönügül, Pakize Erdoğmuş

Main category: cs.CV

TL;DR: The study evaluates the impact of image resolution on three deep learning face detectors (YOLOv11, YOLOv12, MTCNN), finding that YOLOv11 achieves the best detection accuracy, especially at higher resolutions, YOLOv12 has slightly better recall, and MTCNN is competitive in landmark localization but slower at inference.

Details

Motivation: Real-world conditions like low-resolution imagery degrade face detection performance, prompting a need to evaluate resolution's impact on popular models.

Method: Evaluated YOLOv11, YOLOv12, and MTCNN on the WIDER FACE dataset across resolutions (160x160, 320x320, 640x640) using precision, recall, mAP50, mAP50-95, and inference time metrics.

Result: YOLOv11 outperforms in accuracy at higher resolutions, YOLOv12 has better recall, and MTCNN is slower but competitive in landmark localization.

Conclusion: The study offers insights for selecting resolution-aware face detection models based on operational needs.

Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model’s performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.

[150] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025

Radu-Andrei Bourceanu, Neil De La Fuente, Jan Grimm, Andrei Jardan, Andriy Manucharyan, Cornelius Weiss, Roman Pflugfelder

Main category: cs.CV

TL;DR: The report reviews six key computer vision papers, covering ResNet, Vision Transformer (ViT), GANs, Latent Diffusion Models (LDMs), DINO, and Masked Autoencoders (MAE), highlighting their innovations and impact.

Details

Motivation: To analyze the evolution of design patterns in computer vision, focusing on foundational architectures, generative models, and self-supervised learning techniques.

Method: Examines six influential papers, detailing their contributions: ResNet (residual connections), ViT (Transformer for images), GANs (adversarial training), LDMs (denoising in latent space), DINO (self-distillation), and MAE (masked reconstruction).

Result: Identifies advancements like deeper networks (ResNet), attention-based models (ViT), high-fidelity synthesis (LDMs), and scalable pre-training (MAE).

Conclusion: The evolution of computer vision design patterns has led to breakthroughs in recognition, generation, and self-supervised learning, with each method addressing specific challenges.

Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
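
The "momentum-updated teacher" mentioned for DINO refers to keeping the teacher's weights as an exponential moving average (EMA) of the student's weights. A minimal sketch of that update follows, using plain Python dictionaries; the parameter name `w` and the momentum value are illustrative, not taken from the report.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return teacher weights moved slightly toward the student: m * t + (1 - m) * s."""
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Hypothetical single-parameter networks.
teacher = {"w": [1.0, 1.0]}
student = {"w": [0.0, 2.0]}

# One update with an illustrative momentum of 0.9.
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

A momentum close to 1 makes the teacher change slowly, so it acts as a smoothed average of recent student states, not a copy of the latest one.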

[151] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

Main category: cs.CV

TL;DR: The paper identifies challenges in pruning layers of large vision-language models (LVLMs) due to modality divergence and proposes a novel framework, Short-LVLM (SVL), to address these issues effectively.

Details

Motivation: The practical use of LVLMs is hindered by their large size and high computational costs. While NLP layer pruning methods exist, their effectiveness for LVLMs is uncertain due to vision-language modality differences.

Method: The authors empirically test NLP pruning methods on LVLMs, identify challenges (non-essential tokens and inter-layer feature gaps), and propose Short-LVLM, a framework leveraging important tokens and mitigating feature gaps.

Result: Short-LVLM achieves a better performance-efficiency balance, is training-free, model-agnostic, and highly compatible.

Conclusion: The study demonstrates the limitations of NLP pruning for LVLMs and introduces Short-LVLM as an effective solution, offering practical advantages for deployment.

Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.

[152] VMatcher: State-Space Semi-Dense Local Feature Matching

Ali Youssef

Main category: cs.CV

TL;DR: VMatcher is a hybrid Mamba-Transformer network for efficient semi-dense feature matching, combining Mamba’s linear complexity with Transformer’s attention for state-of-the-art performance.

Details

Motivation: Existing feature matching methods rely on Transformers, which are computationally expensive due to quadratic complexity. VMatcher aims to improve efficiency without sacrificing performance.

Method: VMatcher integrates Mamba’s Selective State-Space Model (SSM) with Transformer’s attention, proposing hierarchical architectures for robust and efficient matching.

Result: VMatcher sets new benchmarks efficiently, demonstrating robustness and practicality for real-time applications.

Conclusion: VMatcher offers a computationally efficient and high-performing solution for feature matching, suitable for real-time use.

Abstract: This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer’s attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba’s highly efficient long-sequence processing with the Transformer’s attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher

Wei Li, Xun Gong, Jiao Li, Xiaobin Sun

Main category: cs.CV

TL;DR: The paper introduces Adaptive Grouped Alignment (AGA), a framework for learning medical visual representations that captures structured semantics from paired images and reports, without relying on hard negatives or oversimplifying clinical reports.

Details

Motivation: Current vision-language pretraining methods in medicine often oversimplify clinical reports and rely on impractical large-scale hard negatives. AGA addresses these limitations by leveraging structured semantics and adaptive grouping.

Method: AGA uses a bidirectional grouping mechanism based on sparse similarity matrices, dynamically learned thresholds, and weighted group representations. It employs an Instance Aware Group Alignment loss and a Bidirectional Cross-modal Grouped Alignment module for fine-grained alignment.

Result: AGA achieves strong performance on image-text retrieval and classification tasks in both fine-tuning and zero-shot settings, as validated on public and private datasets.

Conclusion: AGA effectively captures structured semantics in medical data, eliminating the need for external negatives and outperforming existing methods in vision-language pretraining for medical applications.

Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
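The grouping step described in the abstract can be sketched as follows, under simplifying assumptions: the threshold is a fixed non-negative constant rather than the learned gates of the paper, and group weights are similarity scores normalized to sum to one. The names `group_by_similarity` and `group_representation` are hypothetical.

```python
def group_by_similarity(sim, top_k=2, threshold=0.0):
    """For each row of `sim`, keep the top_k columns whose similarity
    exceeds `threshold` and return (indices, normalized weights).

    `threshold` is assumed non-negative, so kept similarities are positive.
    """
    groups = []
    for row in sim:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        kept = [j for j in ranked if row[j] > threshold]
        total = sum(row[j] for j in kept)
        weights = [row[j] / total for j in kept] if kept else []
        groups.append((kept, weights))
    return groups


def group_representation(features, indices, weights):
    """Weighted average of the selected feature vectors."""
    dim = len(features[0])
    return [sum(w * features[j][d] for j, w in zip(indices, weights)) for d in range(dim)]


# Toy token-by-patch similarity matrix (2 tokens, 3 patches).
sim = [[0.6, 0.2, 0.1],
       [0.0, 0.3, 0.9]]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Visual groups: each token selects its best-matching patches.
visual_groups = group_by_similarity(sim, top_k=2, threshold=0.15)
token0_group = group_representation(patch_feats, *visual_groups[0])

# Language groups: the same routine on the transposed matrix (patch -> tokens).
sim_t = [list(col) for col in zip(*sim)]
language_groups = group_by_similarity(sim_t, top_k=1, threshold=0.15)
```

Running the routine on the matrix and on its transpose is one simple way to obtain the bidirectional grouping; the paper's alignment losses are not reproduced here.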

[154] UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie

Main category: cs.CV

TL;DR: UniEmo is a unified framework integrating emotional understanding and generation, using hierarchical feature extraction and diffusion models, enhanced by feedback loops for improved performance.

Details

Motivation: Emotional understanding and generation are complementary but often treated separately; UniEmo aims to unify them for mutual enhancement.

Method: Proposes a hierarchical emotional understanding chain with expert queries for multi-scale feature extraction, fused with a diffusion model for generation, using emotional correlation and condition loss. Includes joint training and a data filtering algorithm for feedback.

Result: UniEmo outperforms state-of-the-art methods in both emotional understanding and generation tasks.

Conclusion: The unified approach with dual feedback processes enhances emotional understanding and generation, validated by extensive experiments.

Abstract: Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model’s understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

[155] Machine learning and machine learned prediction in chest X-ray images

Shereiff Garrett, Abhinav Adhikari, Sarina Gautam, DaShawn Marquis Morris, Chandra Mani Adhikari

Main category: cs.CV

TL;DR: The paper explores machine learning for medical image analysis, comparing CNN and DenseNet-121 on chest X-rays, with DenseNet-121 showing better focus on critical areas.

Details

Motivation: To leverage machine learning for accurate medical diagnosis using chest X-rays without explicit programming.

Method: Implemented baseline CNN and DenseNet-121 on 5,824 chest X-ray images for binary classification.

Result: Both models performed well, but DenseNet-121 better localized critical image regions for decision-making.

Conclusion: DenseNet-121 is more effective for medical image analysis due to its superior focus on essential features.

Abstract: Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. This approach helps to solve seemingly intricate problems with significant accuracy without explicit programming by recognizing complex relationships in data. Using a dataset of 5,824 chest X-ray images, we implement two machine learning models, a baseline convolutional neural network (CNN) and DenseNet-121, and analyze their machine-learned predictions for identifying patients with ailments. Both the baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 focuses on the essential parts of the input chest X-ray images in its decision-making more accurately than the baseline CNN.
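Gradient-weighted class activation mapping (Grad-CAM), used above to compare the two models, can be sketched in a few lines. The example assumes the feature maps of the last convolutional layer and the gradients of the class score with respect to them are already available as nested lists; in practice they would come from a deep learning framework. The function name `grad_cam` is illustrative.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations, gradients: lists of K feature maps, each H x W (nested lists).
    Each channel weight is the spatial mean of its gradient map; the heatmap
    is the ReLU of the weighted sum of the activation maps.
    """
    height, width = len(activations[0]), len(activations[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        alpha = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heat[i][j] += alpha * act[i][j]
    # ReLU keeps only regions that positively support the class.
    return [[max(0.0, v) for v in row] for row in heat]


# Toy input: two 2x2 feature maps with constant gradients.
acts = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
grads = [[[1, 1], [1, 1]], [[-2, -2], [-2, -2]]]
heatmap = grad_cam(acts, grads)
```

The resulting map is usually upsampled to the input resolution and overlaid on the X-ray to see which regions drove the prediction.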

[156] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen, Zexiao Wang, Haidong Cao, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: The paper proposes MP^2A, a progressive alignment strategy for adapting CLIP to unlabeled tasks, mitigating noise and improving domain-invariant feature learning.

Details

Motivation: Existing methods align source and target domains in one shot, struggling with noisy samples and error propagation, especially in multi-source scenarios.

Method: MP^2A trains on high-confidence target samples first, gradually adding harder ones to refine alignment and reduce noise impact.

Result: MP^2A outperforms recent CLIP-based MS-UDA methods on benchmarks like ImageCLEF, Office-Home, and DomainNet.

Conclusion: The progressive alignment strategy effectively mitigates confirmation bias and enhances domain-invariant feature learning, achieving state-of-the-art performance.

Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is further amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to an unlabeled downstream task. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments showcase that MP^2A achieves state-of-the-art performance when compared with the most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
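The progressive strategy can be illustrated with a confidence-threshold curriculum. This is a minimal sketch under assumptions: the abstract does not give the schedule, so a linear decay from a high to a low threshold is used here, and the names `confidence_threshold` and `select_pseudo_labeled` as well as the default values are hypothetical.

```python
def confidence_threshold(round_idx, num_rounds, start=0.9, end=0.5):
    """Linearly decay the pseudo-label confidence threshold over rounds
    (assumed schedule, not taken from the paper)."""
    if num_rounds <= 1:
        return end
    frac = round_idx / (num_rounds - 1)
    return start + frac * (end - start)


def select_pseudo_labeled(confidences, threshold):
    """Indices of target samples whose pseudo-label confidence is high enough."""
    return [i for i, c in enumerate(confidences) if c >= threshold]


# Toy confidences for four target samples (e.g. max softmax probability).
confidences = [0.95, 0.75, 0.55, 0.30]
subsets = [
    select_pseudo_labeled(confidences, confidence_threshold(r, 3))
    for r in range(3)
]
```

Each round trains on a superset of the previous one, so the most reliable samples shape the representation before noisier ones are admitted.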

[157] NeRF Is a Valuable Assistant for 3D Gaussian Splatting

Shuangkang Fang, I-Chao Shen, Takeo Igarashi, Yufeng Wang, ZeSheng Wang, Yi Yang, Wenrui Ding, Shuchang Zhou

Main category: cs.CV

TL;DR: NeRF-GS combines Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) to overcome 3DGS limitations, achieving state-of-the-art performance.

Details

Motivation: To address 3DGS's sensitivity to initialization, limited spatial awareness, and weak inter-Gaussian correlations by leveraging NeRF's continuous spatial representation.

Method: Jointly optimizes NeRF and 3DGS by aligning their spatial features and optimizing residual vectors for implicit features and Gaussian positions.

Result: Outperforms existing methods on benchmark datasets, demonstrating complementary strengths of NeRF and 3DGS.

Conclusion: NeRF-GS shows that combining NeRF and 3DGS is effective for 3D scene representation, offering insights into hybrid approaches.

Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

[158] Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection

Taeheon Lim, Joohyung Lee, Kyungjae Lee, Jungchan Cho

Main category: cs.CV

TL;DR: The paper introduces resolution-adaptive federated learning (RAF) to address resolution drift in non-classification tasks like human pose estimation, improving performance without overfitting.

Details

Motivation: Existing FL research focuses on classification tasks, leaving non-classification tasks like human pose estimation underexplored. The paper identifies resolution drift as a critical issue in such tasks.

Method: RAF uses heatmap-based knowledge distillation, transferring knowledge from higher-resolution (teacher) to lower-resolution (student) outputs to enhance resolution robustness.

Result: RAF mitigates resolution drift, improves performance, and integrates seamlessly into existing FL frameworks. t-SNE analysis shows its potential for other spatial-detail tasks.

Conclusion: RAF effectively addresses resolution drift in FL for non-classification tasks, with broader applicability to tasks requiring spatial detail preservation.

Abstract: The Federated Learning (FL) approach enables effective learning across distributed systems, while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. This paper identifies and investigates a critical issue termed “resolution-drift,” where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of non-independent and identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but also can be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.
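One way to picture heatmap-based distillation between resolutions is sketched below. It is an assumption-laden illustration: the teacher heatmap is average-pooled down to the student resolution and compared with a mean squared error, whereas the exact loss and resizing used by RAF are not stated in the abstract. The names `avg_pool` and `heatmap_distill_loss` are hypothetical.

```python
def avg_pool(heatmap, factor):
    """Downsample a 2D heatmap by averaging non-overlapping factor x factor blocks."""
    height, width = len(heatmap), len(heatmap[0])
    if height % factor or width % factor:
        raise ValueError("heatmap size must be divisible by the pooling factor")
    pooled = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [heatmap[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def heatmap_distill_loss(teacher, student):
    """MSE between the pooled high-resolution teacher heatmap and the
    lower-resolution student heatmap (assumed loss form)."""
    factor = len(teacher) // len(student)
    target = avg_pool(teacher, factor)
    diffs = [(t - s) ** 2
             for t_row, s_row in zip(target, student)
             for t, s in zip(t_row, s_row)]
    return sum(diffs) / len(diffs)


# 4x4 teacher heatmap with a keypoint in the top-left quadrant.
teacher = [[1.0, 1.0, 0.0, 0.0],
           [1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
loss_match = heatmap_distill_loss(teacher, [[1.0, 0.0], [0.0, 0.0]])
loss_miss = heatmap_distill_loss(teacher, [[0.0, 0.0], [0.0, 0.0]])
```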

[159] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories

Lemar Abdi, Francisco Caetano, Amaan Valiuddin, Christiaan Viviers, Hamdi Joudeh, Fons van der Sommen

Main category: cs.CV

TL;DR: Proposes a reconstruction-free OOD detection method using Stein score-based denoising diffusion models (SBDDM) for efficient and accurate anomaly detection in medical imaging.

Details

Motivation: Unsupervised OOD detection is needed for rare pathological cases, but current methods are computationally expensive and unreliable.

Method: Uses forward diffusion trajectories of SBDDM to capture trajectory curvature via Stein scores, enabling anomaly scoring with minimal steps.

Result: Achieves state-of-the-art performance with relative improvements of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, respectively, while reducing computational cost at inference.

Conclusion: SBDDM is a practical solution for real-time, reliable computer-aided diagnosis in medical imaging.

Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.

[160] Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh

Main category: cs.CV

TL;DR: A machine learning system using hyperspectral imaging detects honey adulteration with sugar syrup, achieving 96.39% accuracy.

Details

Motivation: To provide an automated, accurate alternative to chemical-based honey adulteration detection methods.

Method: Uses LDA for feature extraction and KNN for classification in two subsystems: botanical origin identification and adulteration detection.

Result: Achieves 96.39% cross-validation accuracy in detecting adulteration.

Conclusion: The system is effective and suitable as an alternative to traditional methods.

Abstract: This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.
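The two-step design (LDA for feature extraction, then KNN for classification) can be sketched in a deliberately reduced form. The assumptions are substantial: only two classes and two features are used, the LDA is the two-class Fisher direction, and all data values are made up for illustration. The paper's subsystems handle several botanical origins and adulteration levels on hyperspectral features, which this sketch does not reproduce.

```python
from collections import Counter


def mean_vec(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(r[d] for r in rows) / len(rows) for d in range(len(rows[0]))]


def fisher_direction(x0, x1):
    """Two-class Fisher LDA direction w = Sw^{-1} (m1 - m0) for 2-D features."""
    m0, m1 = mean_vec(x0), mean_vec(x1)
    s = [[0.0, 0.0], [0.0, 0.0]]  # within-class scatter matrix Sw
    for rows, m in ((x0, m0), (x1, m1)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Apply the closed-form inverse of the 2x2 scatter matrix to diff.
    return [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
            (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]


def project(w, x):
    """Project a 2-D sample onto the discriminant direction."""
    return w[0] * x[0] + w[1] * x[1]


def knn_predict(train_z, train_y, z, k=3):
    """Majority vote among the k nearest projected training samples."""
    nearest = sorted(range(len(train_z)), key=lambda i: abs(train_z[i] - z))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


# Made-up two-band features for pure and adulterated samples.
pure = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.0], [1.0, 1.2]]
adulterated = [[4.0, 5.0], [5.0, 4.0], [4.5, 5.5], [5.0, 5.0]]
w = fisher_direction(pure, adulterated)
train_z = [project(w, x) for x in pure + adulterated]
train_y = ["pure"] * len(pure) + ["adulterated"] * len(adulterated)
label_a = knn_predict(train_z, train_y, project(w, [1.2, 1.5]))
label_b = knn_predict(train_z, train_y, project(w, [4.8, 4.6]))
```

Projecting before the nearest-neighbor search is the point of the design: distances are measured along the direction that best separates the classes rather than in the raw spectral space.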

[161] Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification

Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante, Abdelmalik Taleb-Ahmed

Main category: cs.CV

TL;DR: The paper enhances a dual-teacher self-supervised framework for art style classification by replacing MLP projections with Kolmogorov-Arnold Networks (KANs), improving accuracy and modeling nonlinear feature interactions.

Details

Motivation: Art style classification is challenging due to limited labeled data and complex stylistic elements. Existing methods struggle with global context and nonlinear feature interactions.

Method: The authors replace MLP projection and prediction heads in a dual-teacher framework with KANs to better model nonlinear feature correlations and global compositional context.

Result: Experiments on WikiArt and Pandora18k show improved Top-1 accuracy and better linear probe accuracy compared to the base dual-teacher architecture.

Conclusion: KANs effectively disentangle complex style manifolds, outperforming MLP projections in accuracy and feature modeling.

Abstract: Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies, while leveraging KANs’ spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual-teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.

[162] Adjustable Spatio-Spectral Hyperspectral Image Compression Network

Martin Hermann Paul Fuchs, Behnood Rasti, Begüm Demir

Main category: cs.CV

TL;DR: The paper introduces HyCASS, a learning-based model for adjustable hyperspectral image compression in spectral and spatial dimensions, analyzing the effects of spectral and spatial compression.

Details

Motivation: The rapid growth of hyperspectral data archives necessitates efficient storage, yet the joint effects of spectral and spatial compression in learning-based HSI compression remain unexplored.

Method: HyCASS employs six modules (spectral/spatial encoders/decoders, CR adapter encoder/decoder) using convolutional layers and transformer blocks to capture redundancies.

Result: Experiments on benchmark datasets show HyCASS outperforms existing models, providing guidelines for balancing spectral and spatial compression.

Conclusion: HyCASS offers an effective solution for adjustable HSI compression, with publicly available code and pre-trained models.

Abstract: With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, the individual and joint effects of spectral and spatial compression on learning-based HSI compression have not yet been thoroughly examined. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .

[163] I Am Big, You Are Little; I Am Right, You Are Wrong

David A. Kelly, Akchunya Chanchal, Nathan Blake

Main category: cs.CV

TL;DR: The paper explores how minimal sufficient pixel sets reveal differences in how image classifiers (like ConvNext and EVA) focus on images, showing misclassified images require larger pixel sets.

Details

Motivation: Understanding how different vision models make decisions is limited, despite their widespread use. The study aims to uncover insights into model behavior by analyzing their focus (concentration) on minimal pixel sets.

Method: The authors propose using minimal sufficient pixel sets to measure a model’s concentration—key pixels that define an image for the model. They compare the size, position, and overlap of these sets across architectures.

Result: Different architectures (e.g., ConvNext, EVA) exhibit distinct concentration patterns in size and position. Misclassified images correlate with larger pixel sets than correctly classified ones.

Conclusion: The study highlights how minimal pixel sets can reveal model behavior, showing architectural differences and linking misclassification to larger required pixel sets.

Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model’s classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model’s ‘concentration’: the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixel sets than correct classifications.

[164] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie, Congxuan Zhang, Fagan Wang, Peng Liu, Feng Lu, Zhen Chen, Weiming Hu

Main category: cs.CV

TL;DR: The CST Anti-UAV dataset addresses limitations in UAV tracking by featuring tiny UAVs in complex scenes, with 220 sequences and over 240k annotations. Evaluating 20 SOT methods on it shows that the best method reaches only 35.92% state accuracy, underscoring the need for better benchmarks.

Details

Motivation: Existing UAV datasets lack diversity and complexity, limiting real-world applicability. CST Anti-UAV aims to fill this gap for better anti-UAV systems.

Method: Introduces CST Anti-UAV, a thermal infrared dataset with 220 sequences and 240k annotations, focusing on tiny UAVs in complex scenes. Evaluates 20 SOT methods.

Result: The state-of-the-art method achieves only 35.92% state accuracy on CST Anti-UAV, compared with 67.69% on Anti-UAV410, highlighting tracking challenges in complex environments.

Conclusion: CST Anti-UAV exposes benchmark limitations and drives advancements in UAV tracking and anti-UAV systems.

Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.

[165] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Ting Huang, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: 3D-R1 enhances 3D vision-language models with a synthetic CoT dataset, GRPO-based reinforcement learning, and dynamic view selection, yielding an average 10% improvement across 3D scene benchmarks.

Details

Motivation: Current 3D VLMs lack robust reasoning due to limited spatial data and static viewpoints.

Method: Constructs Scene-30K dataset, uses RLHF (GRPO) with three rewards, and implements dynamic view selection.

Result: Achieves 10% average improvement on 3D scene benchmarks.

Conclusion: 3D-R1 effectively boosts reasoning and generalization in 3D scene understanding.

Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
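
The group-relative advantage that GRPO is built on can be sketched as follows. This is a minimal illustration, not the authors' training code: the abstract names three rewards but does not say how they are combined, so the weighted sum, the weights, and the function names `combined_reward` and `group_relative_advantages` are assumptions.

```python
import math

def combined_reward(perception, semantic, fmt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (combination rule is an assumption)."""
    w_p, w_s, w_f = weights
    return w_p * perception + w_s * semantic + w_f * fmt

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical responses to one prompt: (perception, semantic, format)
group = [(0.8, 0.7, 1.0), (0.2, 0.4, 1.0), (0.5, 0.5, 0.0), (0.9, 0.9, 1.0)]
rewards = [combined_reward(*g) for g in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.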

[166] ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran, Hikaru Shindo, Patrick Schramowski, Simone Schaub-Meyer, Kristian Kersting, Stefan Roth

Main category: cs.CV

TL;DR: ART improves VRD by using instruction tuning and adaptive sampling to generalize better and handle novel relations.

Details

Motivation: Existing VRD models struggle with generalization and complex relations, prompting the need for a more adaptable approach.

Method: ART adapts VLMs through instruction tuning and adaptive sampling, converting VRD datasets into instructional format.

Result: ART outperforms baselines, generalizes to unseen relations, and aids in scene segmentation.

Conclusion: Instruction tuning and adaptive sampling in ART enhance VRD performance and generalization.

Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART’s practical value by using the predicted relations for segmenting complex scenes.
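
Converting a relation-classification annotation into an instruction-tuning sample might look like the sketch below. The prompt template, the field names `instruction` and `response`, and the helper `to_instruction_sample` are all hypothetical; the abstract does not give ART's actual format or its adaptive sampling algorithm.

```python
def to_instruction_sample(subject, obj, predicate, candidates):
    """Turn one (subject box, object box, predicate) annotation into a prompt/answer pair."""
    sub_name, sub_box = subject
    obj_name, obj_box = obj
    prompt = (
        f"Subject: {sub_name} at {list(sub_box)}. "
        f"Object: {obj_name} at {list(obj_box)}. "
        f"Which relation holds between them? Options: {', '.join(candidates)}."
    )
    return {"instruction": prompt, "response": predicate}

# Hypothetical annotation with boxes given as (x1, y1, x2, y2)
sample = to_instruction_sample(
    subject=("person", (10, 20, 110, 220)),
    obj=("horse", (40, 120, 300, 330)),
    predicate="riding",
    candidates=["riding", "next to", "holding"],
)
```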

[167] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning

Julia Werner, Oliver Bause, Julius Oexle, Maxime Le Floch, Franz Brinkmann, Jochen Hampe, Oliver Bringmann

Main category: cs.CV

TL;DR: A multi-task neural network is introduced for video capsule endoscopy, combining self-localization and anomaly detection in the small intestine, achieving high accuracy with minimal parameters.

Details

Motivation: The short battery life of capsule endoscopy devices and data sparsity challenge AI integration. This work aims to reduce energy consumption and improve functionality by combining tasks in a single model.

Method: Developed a multi-task neural network with restricted parameters, using the Galar dataset, multi-task methods, and Viterbi decoding for time-series analysis.

Result: Achieved 93.63% accuracy in localization and 87.48% in anomaly detection with only 1 million parameters, outperforming single-task models.

Conclusion: The multi-task approach advances AI in capsule endoscopy, offering efficient, accurate, and deployable solutions for real-time decision-making.

Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility of deploying such a model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
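
Viterbi decoding over per-frame class probabilities can be sketched as below. The three-section tract, the transition matrix, and the frame probabilities are made-up illustration values, and the paper's actual state space and transition model are not given in the abstract.

```python
import math

def viterbi(emission_probs, transition, prior):
    """Most likely state sequence given per-frame class probabilities.

    emission_probs: list of per-frame probability lists (one entry per state)
    transition[i][j]: probability of moving from state i to state j
    prior[i]: probability of starting in state i
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(prior)
    score = [log(prior[s]) + log(emission_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for frame in emission_probs[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor for state j at this frame.
            best_i = max(range(n_states), key=lambda i: score[i] + log(transition[i][j]))
            new_score.append(score[best_i] + log(transition[best_i][j]) + log(frame[j]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical tract with 3 sections; the capsule can only stay or move forward.
transition = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
prior = [1.0, 0.0, 0.0]
frames = [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
path = viterbi(frames, transition, prior)
```

Taking the per-frame argmax of `frames` would jump backward from section 1 to section 0 at the third frame; the decoded path cannot, because the transition matrix assigns zero probability to backward moves.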

[168] FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

Donghyun Lee, Dawoon Jeong, Jae W. Lee, Hongil Yoon

Main category: cs.CV

TL;DR: FastPoint accelerates 3D point cloud processing by predicting distance trends in farthest point sampling, achieving 2.55x speedup without accuracy loss.

Details

Motivation: Efficiently handling large and irregular point clouds is challenging in deep neural networks.

Method: FastPoint predicts distance trends between sampled points to avoid exhaustive pairwise distance computations.

Result: Achieves 2.55x speedup on NVIDIA RTX 3090 GPU while preserving model performance.

Conclusion: FastPoint offers a significant acceleration for 3D point cloud models without compromising accuracy.

Abstract: Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
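
For context, standard farthest point sampling, the operation FastPoint accelerates, can be sketched as follows. This is the exhaustive baseline, not FastPoint itself: the abstract does not describe the distance-curve predictor, so none is shown. The function name and the toy point set are illustrative.

```python
def farthest_point_sampling(points, k, start=0):
    """Standard FPS: repeatedly pick the point farthest from the chosen set."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [start]
    # min_dist[i] = squared distance from point i to its nearest chosen point
    min_dist = [sq_dist(p, points[start]) for p in points]
    sample_dists = []
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        sample_dists.append(min_dist[nxt] ** 0.5)
        chosen.append(nxt)
        # Updating against every point is the cost a predictor would aim to avoid.
        min_dist = [min(d, sq_dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen, sample_dists

points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5), (5, 5, 6), (10, 0, 0)]
indices, dists = farthest_point_sampling(points, k=4)
```

The recorded distances in `dists` never increase from one pick to the next, which is the kind of regular trend the abstract says can be predicted.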

[169] Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

Mutian Xu, Chongjie Ye, Haolin Liu, Yushuang Wu, Jiahao Chang, Xiaoguang Han

Main category: cs.CV

TL;DR: The paper introduces Stable-Sim2Real, a two-stage depth diffusion model for 3D data simulation, improving realism and performance in real-world 3D tasks.

Details

Motivation: Bridging the gap between synthetic and real-captured 3D data, which current methods struggle to achieve due to predefined physical priors.

Method: A two-stage depth diffusion model: first stage finetunes Stable-Diffusion for residual generation, and the second refines results using a 3D discriminator.

Result: Enhanced performance in real-world 3D tasks and high similarity between simulated and real-captured data.

Conclusion: Stable-Sim2Real effectively improves 3D data simulation, offering a robust solution for real-world applications.

Abstract: 3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.

[170] Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda, Stefan Roth, Simone Schaub-Meyer

Main category: cs.CV

TL;DR: EMAT improves few-shot classification and segmentation, especially for small objects, with fewer parameters and novel evaluation settings.

Details

Motivation: Current SOTA struggles with small objects in FS-CS tasks, and existing evaluation settings waste costly annotations.

Method: Proposes EMAT with memory-efficient masked attention, learnable downscaling, and parameter-efficiency enhancements.

Result: Outperforms SOTA on PASCAL-5i and COCO-20i with 4x fewer parameters. Introduces better evaluation settings.

Conclusion: EMAT is efficient and effective for FS-CS, addressing small object challenges and improving evaluation practicality.

Abstract: Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

[171] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions

Jinshan Zhen, Yuanyue Ge, Tianxiao Zhu, Hui Zhao, Ya Xiong

Main category: cs.CV

TL;DR: A vision-based pipeline using RGB-D sensing and deep learning for accurate mass estimation of strawberries, handling occlusions and pose variations effectively.

Details

Motivation: Challenges in mass estimation due to occlusions and pose variations in field-grown strawberries.

Method: Combines YOLOv8-Seg for segmentation, CycleGAN for occlusion recovery, tilt-angle correction, and polynomial regression for mass mapping.

Result: Mean errors of 8.11% (isolated) and 10.47% (occluded); CycleGAN outperformed LaMa in occlusion recovery.

Conclusion: Robust solution for automated harvesting and yield monitoring, addressing traditional limitations.

Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, a cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed the large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
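
The two occlusion-recovery metrics can be computed on binary masks roughly as sketched here. IoU follows its standard definition; PAR is taken to be recovered area divided by ground-truth area, which is an assumption because the abstract does not define it. The masks are toy data.

```python
def mask_metrics(pred, truth):
    """IoU and pixel area ratio for two binary masks given as nested 0/1 lists.

    Assumes the ground-truth mask is non-empty.
    """
    inter = union = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
    iou = inter / union
    par = pred_area / truth_area  # assumed definition: recovered / true area
    return iou, par

truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
iou, par = mask_metrics(pred, truth)
```

Under this reading, a PAR above 1 means the completed mask is larger than the true fruit outline, and a PAR below 1 means it is smaller.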

[172] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

Timing Li, Bing Cao, Jiahe Feng, Haifang Cao, Qinghau Hu, Pengfei Zhu

Main category: cs.CV

TL;DR: Hy-CycleAlign, a hyperbolic space-based image registration method, improves cross-modal alignment and fusion quality with a dual-path cyclic framework and hyperbolic constraints.

Details

Motivation: Existing Euclidean space-based registration methods fail to handle cross-modal misalignment effectively, leading to suboptimal fusion.

Method: Proposes Hy-CycleAlign, a dual-path cyclic registration framework in hyperbolic space, with a Hyperbolic Hierarchy Contrastive Alignment (H²CA) module.

Result: Outperforms existing methods in alignment and fusion, demonstrating hyperbolic space’s effectiveness for multi-modal registration.

Conclusion: Hy-CycleAlign is a novel, effective solution for cross-modal image registration, enhancing fusion quality.

Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.
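
To see why a hyperbolic embedding can be more sensitive than a Euclidean one, consider the geodesic distance in the Poincaré ball with curvature -1. The abstract does not say which hyperbolic model Hy-CycleAlign uses, so the Poincaré ball is an assumption made only for this illustration.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = [a - b for a, b in zip(u, v)]
    arg = 1 + 2 * sq_norm(diff) / ((1 - sq_norm(u)) * (1 - sq_norm(v)))
    return math.acosh(arg)

# The same Euclidean gap of 0.1, near the origin and near the boundary.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))
```

The same Euclidean offset yields a much larger hyperbolic distance near the boundary of the ball than near the origin, so small differences between points embedded far from the origin are amplified.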

[173] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer

Main category: cs.CV

TL;DR: The paper introduces 3D-MOOD, the first end-to-end monocular 3D object detector for open-set settings, addressing challenges like novel environments and object categories. It achieves state-of-the-art results by jointly training 2D and 3D tasks and using geometry prior and canonical image space.

Details

Motivation: Existing monocular 3D object detection methods are limited to closed-set settings, failing in real-world scenarios with new environments and object categories. This paper aims to solve this by enabling open-set detection.

Method: The proposed 3D-MOOD lifts open-set 2D detection into 3D space using a 3D bounding box head, incorporates geometry prior for object queries, and designs a canonical image space for cross-dataset training.

Result: 3D-MOOD achieves state-of-the-art performance on both closed-set (Omni3D) and open-set (Omni3D to Argoverse 2, ScanNet) benchmarks.

Conclusion: The paper successfully addresses open-set monocular 3D object detection, demonstrating superior performance and generalization across diverse scenes and datasets.

Abstract: Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.

[174] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization

Maxime Pietrantoni, Gabriela Csurka, Torsten Sattler

Main category: cs.CV

TL;DR: The paper introduces Gaussian Splatting Feature Fields (GSFFs) for visual localization, combining 3DGS with implicit feature fields for accurate and privacy-preserving pose estimation.

Details

Motivation: To improve visual localization by leveraging dense geometric information and differentiable rasterization from 3DGS, while addressing privacy concerns.

Method: Proposes GSFFs, aligning 3D scale-aware feature fields with 2D encoders via contrastive learning, and uses 3D clustering for regularization and segmentation.

Result: Achieves state-of-the-art performance on real-world datasets for both privacy-preserving and non-privacy-preserving localization.

Conclusion: GSFFs effectively combine explicit geometry and implicit features, enabling robust and privacy-aware visual localization.

Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.

[175] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

Zhenyang Li, Xiaoyang Bai, Tongchen Zhang, Pengfei Shen, Weiwei Xu, Yifan Peng

Main category: cs.CV

TL;DR: FlowGaussian-VR improves 3D video reconstruction by using a velocity field modeling scheme and adaptive densification, achieving better visual quality and trackable Gaussian trajectories.

Details

Motivation: Existing deformation networks overfit to irregular Gaussian trajectories in complex motion videos, and gradient-based densification fails for dynamic content.

Method: Proposes FlowGaussian-VR with velocity field rendering (VFR) for optical flow optimization and flow-assisted adaptive densification (FAD) for dynamic regions.

Result: Achieves over 2.5 dB PSNR gain, reduces blurry artifacts, and ensures regularized, trackable Gaussian trajectories.

Conclusion: FlowGaussian-VR effectively addresses challenges in dynamic scene reconstruction, enhancing visual quality and motion representation.

Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model’s effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and fewer blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.
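
For readers unfamiliar with the metric, PSNR can be computed as in this short sketch. The pixel values are toy data in the range [0, 1], and `psnr` is an illustrative helper rather than the paper's evaluation code.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

reference = [0.5, 0.5, 0.5, 0.5]
reconstruction = [0.6, 0.4, 0.6, 0.4]
value = psnr(reference, reconstruction)
```

Because PSNR is logarithmic in the mean squared error, a 2.5 dB gain corresponds to reducing the error by a factor of about 1.78.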

[176] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi, Mohamed Ilyas Lakhal, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: BeyondGloss is a gloss-free SLT framework using VideoLLMs for spatio-temporal reasoning, achieving SOTA results on benchmarks.

Details

Motivation: Addressing the modality gap in SLT and capturing fine hand movements.

Method: Uses VideoLLMs with fine-grained textual descriptions, contrastive alignment, and HaMeR distillation.

Result: Achieves state-of-the-art performance on Phoenix14T and CSL-Daily benchmarks.

Conclusion: BeyondGloss effectively bridges the modality gap and improves SLT performance.

Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.
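
A contrastive alignment between video features and text embeddings is commonly written as an InfoNCE loss, sketched below. The abstract only says "contrastive", so the InfoNCE form, the cosine similarity, the temperature of 0.07, and the toy embeddings are all assumptions.

```python
import math

def info_nce(video_embs, text_embs, temperature=0.07):
    """Video-to-text InfoNCE: pair i in one list matches pair i in the other."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    videos = [normalize(e) for e in video_embs]
    texts = [normalize(e) for e in text_embs]
    loss = 0.0
    for i, video in enumerate(videos):
        # Cosine similarity of this video to every text in the batch.
        logits = [sum(a * b for a, b in zip(video, text)) / temperature for text in texts]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(videos)

video_embs = [(1.0, 0.0), (0.0, 1.0)]
aligned = info_nce(video_embs, [(1.0, 0.0), (0.0, 1.0)])
shuffled = info_nce(video_embs, [(0.0, 1.0), (1.0, 0.0)])
```

The loss is near zero when each video is closest to its own description and grows when the pairs are mismatched, which is what pushes the two modalities into a shared space.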

[177] SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

Jessica Bader, Leander Girrbach, Stephan Alaniz, Zeynep Akata

Main category: cs.CV

TL;DR: The paper introduces SUB, a benchmark for evaluating Concept Bottleneck Models (CBMs) under distribution shifts, using synthetic images with controlled concept variations.

Details

Motivation: CBMs lack robustness to concept variations under distribution shifts, limiting their reliability in critical fields like medicine.

Method: A novel Tied Diffusion Guidance (TDG) method is used to generate synthetic images with precise control over bird class and attributes, creating the SUB benchmark.

Result: SUB enables rigorous evaluation of CBMs, highlighting their limitations under concept variations.

Conclusion: The SUB benchmark and TDG method contribute to developing more robust interpretable models.

Abstract: Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.

[178] MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

Yaoye Zhu, Zhe Wang, Yan Wang

Main category: cs.CV

TL;DR: MamV2XCalib is a V2X-based infrastructure camera calibration method using vehicle-side LiDAR, eliminating manual intervention and reference objects. It introduces a targetless LiDAR-camera calibration method with multi-scale features and 4D correlation volume, addressing calibration failures in V2X scenarios.

Details

Motivation: Traditional manual calibration methods for infrastructure cameras are time-consuming and labor-intensive, often requiring road closures. There's a need for an automated, efficient solution.

Method: MamV2XCalib leverages vehicle-side LiDAR and roadside cameras, combining multi-scale features and a 4D correlation volume for calibration. It uses Mamba to model temporal information and estimate rotation angles, handling data defects and viewpoint differences.

Result: Evaluated on V2X-Seq and TUMTraf-V2X datasets, MamV2XCalib shows better and more stable performance than previous methods, with fewer parameters.

Conclusion: MamV2XCalib provides an effective, robust, and automated solution for infrastructure camera calibration in V2X scenarios, outperforming traditional and single-car LiDAR-camera methods.

Abstract: As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.
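
What an extrinsic calibration is used for can be illustrated by projecting a LiDAR point into a camera image. The pinhole model, the intrinsics, and the single-axis rotation below are assumptions for illustration; this is not the MamV2XCalib network, which estimates the rotation rather than being given it.

```python
import math

def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def project(point, rotation, translation, fx, fy, cx, cy):
    """Map a 3D point into pixel coordinates with a pinhole camera model."""
    x, y, z = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:
        return None  # point lies behind the camera
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical intrinsics for a 1280x720 image.
intrinsics = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
point = (1.0, 0.0, 10.0)
correct = project(point, rotation_z(0.0), (0.0, 0.0, 0.0), **intrinsics)
rotated = project(point, rotation_z(math.pi / 2), (0.0, 0.0, 0.0), **intrinsics)
```

An error in the rotation moves the projected point across the image, so the mismatch between projected point clouds and image content carries the signal a calibration method can exploit.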

[179] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Zijian Dong, Longteng Duan, Jie Song, Michael J. Black, Andreas Geiger

Main category: cs.CV

TL;DR: MoGA reconstructs high-fidelity 3D Gaussian avatars from single-view images using a generative avatar model and 2D diffusion models, ensuring 3D consistency and realism.

Details

Motivation: The challenge is inferring unseen appearance and geometric details from a single view while maintaining 3D consistency, as previous methods using 2D diffusion models produce inconsistent and unrealistic results.

Method: MoGA integrates a generative avatar model as a prior, projects input images into its latent space, and enforces 3D constraints. It formulates avatar creation as a model inversion process, refining pose estimation and ensuring 3D regularization.

Result: MoGA outperforms state-of-the-art methods, generalizes well to real-world scenarios, and produces animatable Gaussian avatars.

Conclusion: MoGA effectively combines generative models and 2D diffusion to achieve high-fidelity 3D avatar reconstruction from single-view images.

Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.

[180] DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

Yuchen Zhou, Yan Luo, Xiangang Wang, Xingjian Gu, Mingzhou Lu

Main category: cs.CV

TL;DR: A directional pure 2D approach balances 3D occupancy prediction accuracy and real-time processing for autonomous driving by slicing 3D voxel features and using directional attention.

Details

Motivation: Current methods prioritize accuracy over real-time needs, hindering autonomous driving performance.

Method: Slices 3D voxel features to retain vertical geometry, uses directional attention for efficient feature extraction.

Result: Achieves 39.3% mIoU and 27.7 FPS on Occ3D-nuScenes, 14.8 FPS on edge devices.

Conclusion: The method effectively balances accuracy and efficiency, suitable for real-time AD systems.

Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird’s-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes benchmark, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

[181] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li, Keren Fu, Qijun Zhao

Main category: cs.CV

TL;DR: Vcamba integrates spatial and frequency features for video camouflaged object detection, outperforming state-of-the-art methods with lower computation cost.

Details

Motivation: High similarity between foreground and background in VCOD limits spatial appearance features' discriminability, necessitating enhanced feature representation and motion perception.

Method: Proposes Vcamba with RFVSS for spatial features, AFE for frequency learning, SLMP and FLMP for motion perception, and SFMF for feature fusion.

Result: Vcamba outperforms existing methods on 6 metrics across 2 datasets with reduced computation.

Conclusion: Vcamba’s spatio-frequency approach enhances VCOD accuracy and efficiency, validated by superior performance.

Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model Mamba enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.

[182] Medical Image De-Identification Benchmark Challenge

Linmin Pei, Granger Sutton, Michael Rutherford, Ulrike Wagner, Tracy Nolan, Kirk Smith, Phillip Farmer, Peter Gu, Ambar Rana, Kailing Chen, Thomas Ferleman, Brian Park, Ye Wu, Jordan Kojouharov, Gargi Singh, Jon Lemon, Tyler Willis, Milos Vukadinovic, Grant Duffy, Bryan He, David Ouyang, Marco Pereanez, Daniel Samber, Derek A. Smith, Christopher Cannistraci, Zahi Fayad, David S. Mendelson, Michele Bufano, Elmar Kotter, Hamideh Haghiri, Rajesh Baidya, Stefan Dvoretskii, Klaus H. Maier-Hein, Marco Nolden, Christopher Ablett, Silvia Siggillino, Sandeep Kaushik, Hongzhu Jiang, Sihan Xie, Zhiyu Wan, Alex Michie, Simon J Doran, Angeline Aurelia Waly, Felix A. Nathaniel Liang, Humam Arshad Mustagfirin, Michelle Grace Felicia, Kuo Po Chih, Rahul Krish, Ghulam Rasool, Nidhal Bouaynaya, Nikolas Koutsoubis, Kyle Naddeo, Kartik Pandit, Tony O’Sullivan, Raj Krish, Qinyan Pan, Scott Gustafson, Benjamin Kopchick, Laura Opsahl-Ong, Andrea Olvera-Morales, Jonathan Pinney, Kathryn Johnson, Theresa Do, Juergen Klenk, Maria Diaz, Arti Singh, Rong Chai, David A. Clunie, Fred Prior, Keyvan Farahani

Main category: cs.CV

TL;DR: The MIDI-B Challenge standardized benchmarking of DICOM image de-identification tools, ensuring HIPAA compliance while preserving research-critical metadata. Ten teams achieved high accuracy (97.91%-99.93%) using diverse methods.

Details

Motivation: To address the need for standardized de-identification of medical images (PHI/PII) while preserving non-PHI metadata for AI research, ensuring compliance with privacy laws.

Method: The challenge involved three phases (training, validation, test) using synthetic PHI/PII in real DICOM images. Teams used open-source/proprietary tools, OCR, and language models.

Result: Ten teams completed the test phase, achieving scores of 97.91% to 99.93% in correct de-identification actions.

Conclusion: The MIDI-B Challenge successfully demonstrated the feasibility of rule-based de-identification, highlighting diverse tool effectiveness and lessons for future improvements.

Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge’s design, implementation, results, and lessons learned.
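The scoring rule described in the abstract (correct actions as a percentage of required actions) can be sketched in a few lines. The function name and the counts below are hypothetical and chosen only for illustration; the challenge's actual scoring tooling is not described here.

```python
def deid_score(correct_actions: int, required_actions: int) -> float:
    """Return the percentage of required de-identification actions done correctly."""
    if required_actions <= 0:
        raise ValueError("required_actions must be positive")
    if not 0 <= correct_actions <= required_actions:
        raise ValueError("correct_actions must lie between 0 and required_actions")
    return 100.0 * correct_actions / required_actions


# Hypothetical counts, not taken from the challenge data.
example_score = deid_score(correct_actions=9_791, required_actions=10_000)
print(f"{example_score:.2f}%")
```

Because the denominator is the number of required actions, a submission is penalized both for identifiers it fails to remove and for research-critical metadata it fails to preserve, provided both kinds of action appear in the required set.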

[183] Consistent Point Matching

Halid Ziya Yerebakan, Gerardo Hermosillo Valadez

Main category: cs.CV

TL;DR: Incorporating a consistency heuristic into point-matching improves robustness in medical image matching, surpassing state-of-the-art results and enabling efficient, high-precision navigation without ML.

Details

Motivation: To enhance robustness in matching anatomical locations across medical images without relying on machine learning models or training data.

Method: Incorporates a consistency heuristic into a point-matching algorithm, validated on diverse CT and MRI datasets.

Result: Surpasses state-of-the-art results on the Deep Lesion Tracking dataset and effectively addresses landmark localization.

Conclusion: The method provides efficient, high-precision navigation between medical images with configurable speed-robustness trade-offs.

Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

[184] DivControl: Knowledge Diversion for Controllable Image Generation

Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng

Main category: cs.CV

TL;DR: DivControl is a decomposable pretraining framework for unified controllable image generation, improving generalization and reducing training costs.

Details

Motivation: Existing methods for controllable image generation either require separate models for each condition or use entangled representations, leading to poor generalization and high adaptation costs.

Method: DivControl factorizes ControlNet via SVD into disentangled components (learngenes and tailors) and uses a dynamic gate for knowledge diversion, enabling zero-shot generalization. It also employs a representation alignment loss for better condition fidelity.

Result: DivControl achieves state-of-the-art controllability with 36.4× less training cost, improves performance on basic conditions, and shows strong zero-shot and few-shot performance on unseen conditions.

Conclusion: DivControl offers superior scalability, modularity, and transferability for controllable image generation.

Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4× less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
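To make the decomposition concrete, the following is a minimal sketch of splitting one weight matrix into rank-1 components via SVD and recombining a subset of them with a softmax gate. It is not DivControl's implementation: the matrix shape, the rule that the first two components are shared, and the random gate logits are all assumptions made for illustration, since the abstract does not specify how components are assigned or how the gate is computed from instructions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a single weight matrix (shape chosen arbitrarily).
W = rng.standard_normal((8, 6))

# Thin SVD: W equals the sum over i of S[i] * outer(U[:, i], Vt[i]).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
components = np.stack([S[i] * np.outer(U[:, i], Vt[i]) for i in range(len(S))])

# Hypothetical split: the first components are treated as shared ("learngenes"),
# the remaining ones as condition-specific ("tailors").
num_shared = 2
learngenes = components[:num_shared].sum(axis=0)
tailors = components[num_shared:]


def soft_gate(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax producing soft routing weights over tailors."""
    shifted = logits - logits.max()
    weights = np.exp(shifted)
    return weights / weights.sum()


# Placeholder logits; in the paper these depend on the condition instruction.
gate = soft_gate(rng.standard_normal(len(tailors)))

# Condition-specific weight: shared part plus a gate-weighted mix of tailors.
W_cond = learngenes + np.tensordot(gate, tailors, axes=1)
```

Summing all rank-1 components recovers `W` exactly (up to floating-point error), which is what makes the shared and condition-specific parts separable in the first place.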

[185] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks

Changqing Xu, Ziqiang Yang, Yi Liu, Xinfang Liao, Guiqi Mo, Hao Zeng, Yintang Yang

Main category: cs.CV

TL;DR: A Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks (SNNs) is proposed, eliminating gradient approximation and reducing computational complexity. It includes a class-aware complexity adaptation mechanism for efficient resource allocation, achieving high test accuracies and low computational power consumption.

Details

Motivation: Training SNNs is challenging due to non-differentiability and high computational demands of gradient approximation methods, limiting deployment on edge devices.

Method: Proposes a Forward-Forward (FF) based training framework treating spiking activations as black-box modules, avoiding gradient approximation. Introduces a class-aware complexity adaptation mechanism for dynamic loss optimization.

Result: Achieves test accuracies of 99.58%, 92.13%, and 75.64% on MNIST, Fashion-MNIST, and CIFAR-10 datasets, outperforming existing FF-based SNN methods. Reduces memory access and computational power consumption.

Conclusion: The proposed framework effectively trains SNNs without gradient approximation, offering high accuracy and efficiency, making it suitable for edge devices.

Abstract: Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging due to their non-differentiability. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity. Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories. Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.

[186] Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis

Kunpeng Qiu, Zhiying Zhou, Yongxin Guo

Main category: cs.CV

TL;DR: A framework called Adaptively Distilled ControlNet improves medical image segmentation by using dual-model distillation for training and privacy-preserving generation.

Details

Motivation: Medical image annotation is limited by privacy and labeling efforts, restricting segmentation model performance.

Method: Uses a teacher model (conditioned on mask-image pairs) to regularize a mask-only student model via noise alignment and adaptive regularization. Only the student is used during sampling.

Result: Achieves state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19; SANet gains 2.6%/3.5% on Polyps.

Conclusion: The framework effectively enhances segmentation accuracy and privacy in medical image generation.

Abstract: Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.

[187] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan, Tobias Czempiel, Chi Xu, Daniel S. Elson, Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA is an interactive HSI segmentation framework combining RGB models and spectral analysis, achieving high accuracy with minimal user input and handling diverse spectral data.

Details

Motivation: Addressing challenges in HSI medical imaging like data limitations and hardware variations by leveraging user-guided segmentation and spectral fusion.

Method: Combines RGB foundation models with spectral analysis, using user clicks to guide segmentation and spectral similarity, independent of spectral band count/resolution.

Result: Achieves 81.0% (1-click) to 93.4% (5-click) DICE on a neurosurgical dataset and 81.1% (1-click) to 89.2% (5-click) DICE on a porcine dataset; effective in few-shot/zero-shot learning.

Conclusion: SAMSA provides a flexible, efficient framework for HSI medical image analysis, integrating diverse spectral datasets seamlessly.

Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets shows 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical hyperspectral dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA’s effectiveness in few-shot and zero-shot learning scenarios with minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
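As an illustration of a click-guided spectral similarity, the sketch below computes the standard spectral angle between the spectrum at a clicked pixel and every other pixel. This is the generic spectral angle measure suggested by the paper's title, not SAMSA's actual fusion strategy; the function name, the `(H, W, B)` layout, and the `eps` stabilizer are assumptions.

```python
import numpy as np


def spectral_angle_map(cube: np.ndarray, click_yx: tuple, eps: float = 1e-12) -> np.ndarray:
    """Angle in radians between each pixel spectrum and the clicked pixel's spectrum.

    cube: hyperspectral image of shape (H, W, B) with any number of bands B.
    click_yx: (row, column) of the user click.
    """
    reference = cube[click_yx[0], click_yx[1], :]
    dots = cube @ reference                                   # (H, W) dot products
    norms = np.linalg.norm(cube, axis=-1) * np.linalg.norm(reference)
    cosines = np.clip(dots / (norms + eps), -1.0, 1.0)        # guard arccos domain
    return np.arccos(cosines)


# Toy cube with 5 bands; values are random and purely illustrative.
rng = np.random.default_rng(0)
toy_cube = rng.uniform(0.1, 1.0, size=(4, 4, 5))
angles = spectral_angle_map(toy_cube, (1, 2))
```

The angle depends only on the direction of each spectrum, so it is insensitive to overall brightness, and the same function applies unchanged to cubes with a different band count.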

[188] OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction

Yang Gao, Po-Chien Luan, Kaouther Messaoud, Lan Feng, Alexandre Alahi

Main category: cs.CV

TL;DR: OmniTraj, a Transformer-based model, addresses zero-shot transfer challenges in human trajectory prediction by explicitly conditioning on temporal metadata, achieving state-of-the-art performance without fine-tuning.

Details

Motivation: The paper addresses the limitation of pre-trained models in zero-shot transfer to unseen datasets with varying temporal dynamics, which restricts scalability and practical utility.

Method: The authors propose OmniTraj, a Transformer-based model pre-trained on a heterogeneous dataset, which uses explicit conditioning on frame rate for temporal generalization.

Result: OmniTraj reduces prediction error by over 70% in zero-shot transfer scenarios and achieves state-of-the-art results on four datasets after fine-tuning.

Conclusion: Explicit temporal conditioning is a highly effective solution for zero-shot transfer in human trajectory prediction, as demonstrated by OmniTraj’s performance.

Abstract: While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj

[189] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation

Jialei Chen, Wuhao Xu, Sipeng He, Baoru Huang, Dongchun Ren

Main category: cs.CV

TL;DR: I2V-GS introduces a method to synthesize autonomous driving data from infrastructure views using Gaussian Splatting, outperforming existing methods.

Details

Motivation: Current driving data collection is costly and inefficient; synthesizing data from real-world images offers a solution.

Method: Uses adaptive depth warp for dense training views, cascade inpainting for view expansion, and cross-view confidence-guided optimization.

Result: I2V-GS improves synthesis quality, outperforming StreetGaussian by significant margins in key metrics.

Conclusion: I2V-GS is a pioneering framework for infrastructure-vehicle view transformation, enhancing autonomous driving dataset generation.

Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-IoU, NTL-IoU, and FID by 45.7%, 34.2%, and 14.9%, respectively.

[190] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

Zihan Cheng, Liangtai Zhou, Dian Chen, Ni Tang, Xiaotong Luo, Yanyun Qu

Main category: cs.CV

TL;DR: A novel unified image restoration framework using latent diffusion models (LDMs) with Degradation-Aware Feature Fusion (DAFF) and Detail-Aware Expert Module (DAEM) for handling diverse degradations and enhancing detail recovery.

Details

Motivation: To address the challenges of All-in-One Image Restoration (AiOIR) by leveraging the generative capacity of diffusion models while mitigating detail loss.

Method: Proposes a framework integrating low-quality visual priors into LDMs, featuring DAFF for adaptive degradation handling and DAEM for detail recovery.

Result: Achieves state-of-the-art performance in multi-task and mixed degradation settings, demonstrating practical potential.

Conclusion: The framework effectively unifies image restoration tasks using diffusion priors, with code to be released.

Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

[191] Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation

Alfie Roddan, Chi Xu, Serine Ajlouni, Irini Kakaletri, Patra Charalampaki, Stamatia Giannarou

Main category: cs.CV

TL;DR: Proposes a method integrating risk estimation into pixel attribution for improved explainability in image classification, outperforming state-of-the-art.

Details

Motivation: Address overconfidence in deep learning models' predictions and pixel attribution, aiming for safer intraoperative tumor resections.

Method: Iteratively applies classification and pixel attribution to create PA maps, generating pixel-wise distributions and enhanced PA maps with risk estimation via CV.

Result: Outperforms state-of-the-art on pCLE data and ImageNet, providing improved explainability and risk estimation.

Conclusion: The method enhances explainability and risk awareness in image classification, aiding safer clinical decision-making.

Abstract: The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution methods are popular to infer explainability. However, overconfidence in deep learning models’ predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.
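The two statistics named in the abstract, the pixel-wise expectation and the coefficient of variation, can be sketched as follows. The sketch assumes the repeated passes have already produced a stack of PA maps of shape `(T, H, W)`; how the passes are made to differ is not stated in this summary, and the function name and `eps` term are illustrative choices.

```python
import numpy as np


def enhanced_pa_and_risk(pa_volume: np.ndarray, eps: float = 1e-8):
    """Return (enhanced PA map, pixel-wise risk) from a (T, H, W) stack of PA maps.

    The enhanced map is the per-pixel mean over the T passes; the risk is the
    coefficient of variation, i.e. standard deviation divided by |mean|.
    """
    mean_map = pa_volume.mean(axis=0)
    std_map = pa_volume.std(axis=0)
    risk_map = std_map / (np.abs(mean_map) + eps)  # eps avoids division by zero
    return mean_map, risk_map


# Toy volume: 10 passes over a 3x3 image, random values for illustration only.
rng = np.random.default_rng(0)
toy_volume = rng.uniform(0.0, 1.0, size=(10, 3, 3))
enhanced_map, risk_map = enhanced_pa_and_risk(toy_volume)
```

A pixel whose attribution barely changes across passes gets a CV near zero, while a pixel whose attribution fluctuates relative to its mean is flagged as high risk.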

[192] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

Emery Pierson, Lei Li, Angela Dai, Maks Ovsjanikov

Main category: cs.CV

TL;DR: The paper introduces a data-driven method for deep functional maps, replacing axiomatic regularization with a learned generative model for improved accuracy and generality in non-rigid shape matching.

Details

Motivation: Existing methods rely on axiomatic models for regularization, limiting accuracy and applicability. The goal is to replace these with data-driven approaches.

Method: Train a generative model of functional maps using score-based generative modeling, then use it to promote structural properties of ground truth maps.

Result: The learned model outperforms axiomatic approaches in zero-shot non-rigid shape matching and is category-agnostic.

Conclusion: Data-driven regularization fully replaces axiomatic methods, offering better results and broader applicability.

Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/
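For context, the axiomatic regularizers that the abstract says can be replaced are commonly written as simple penalties on the functional map matrix. The sketch below shows those conventional penalties, Laplacian commutativity and orthogonality, and not the paper's learned diffusion prior; the function names and the convention that `C` has shape `(k_tgt, k_src)` are assumptions.

```python
import numpy as np


def laplacian_commutativity(C: np.ndarray, evals_src: np.ndarray, evals_tgt: np.ndarray) -> float:
    """Squared Frobenius norm of C @ diag(evals_src) - diag(evals_tgt) @ C."""
    residual = C * evals_src[None, :] - evals_tgt[:, None] * C
    return float(np.sum(residual ** 2))


def orthogonality(C: np.ndarray) -> float:
    """Squared Frobenius norm of C.T @ C - I."""
    k = C.shape[1]
    return float(np.sum((C.T @ C - np.eye(k)) ** 2))


# Toy example: identical spectra and an identity map satisfy both constraints.
evals = np.array([0.0, 1.0, 2.5])
C_identity = np.eye(3)
commutativity_penalty = laplacian_commutativity(C_identity, evals, evals)
orthogonality_penalty = orthogonality(C_identity)
```

Both penalties vanish for the identity map between shapes with identical spectra, which reflects the near-isometry assumption behind these terms; that assumption is one reason such axiomatic regularization can be restrictive.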

[193] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen

Main category: cs.CV

TL;DR: The paper introduces RAGNet, a large-scale affordance segmentation benchmark, and AffordanceNet, a framework for robotic grasping, addressing the lack of reasoning-based data in open-world scenarios.

Details

Motivation: Current robotic grasping systems lack reasoning-based large-scale affordance prediction data, limiting their effectiveness in open-world scenarios.

Method: The authors build RAGNet, a benchmark with 273k images and 26k reasoning instructions, and propose AffordanceNet, a framework combining a VLM and a grasping network.

Result: Experiments show AffordanceNet’s strong open-world generalization ability in affordance segmentation and real-robot tasks.

Conclusion: The work provides a robust solution for affordance-based grasping in diverse scenarios, with publicly available data and code.

Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.

[194] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao, Yi Zhao, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: DIAS improves Object-Centric Learning by reducing slot redundancy and introducing self-distillation, achieving state-of-the-art performance in object discovery and recognition.

Details

Motivation: Existing Slot Attention methods suffer from redundant slots and lack of internal supervision, leading to poor object segmentation.

Method: DIAS re-initializes slots to reduce redundancy and uses self-distillation by aligning early and late attention maps.

Result: DIAS outperforms existing methods in object discovery, recognition, and advanced visual tasks.

Conclusion: DIAS addresses key limitations of Slot Attention, offering a more efficient and accurate approach for OCL tasks.

Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): (i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; (ii) We drive the bad attention map at the first aggregation iteration to approximate the good one at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art performance on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.
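
The two ingredients mentioned in the abstract, competitive attention across slots and alignment of an early attention map to a late one, can be illustrated with a small sketch. This is not the DIAS implementation. The cross-entropy form of the alignment loss, the function names, and the assumption that both maps share the same slot count and ordering are all choices made here for illustration.

```python
import math


def softmax_over_slots(logits):
    """Normalize logits[s][n] across slots s for every input feature n.

    Normalizing over the slot axis is what makes slots compete for features.
    """
    num_slots, num_feats = len(logits), len(logits[0])
    attn = [[0.0] * num_feats for _ in range(num_slots)]
    for n in range(num_feats):
        column = [logits[s][n] for s in range(num_slots)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        norm = sum(exps)
        for s in range(num_slots):
            attn[s][n] = exps[s] / norm
    return attn


def distillation_loss(student_attn, teacher_attn, eps=1e-8):
    """Mean cross-entropy of the student map against the teacher map.

    Assumed loss form: the teacher (last iteration) is treated as a fixed target.
    """
    num_slots, num_feats = len(student_attn), len(student_attn[0])
    total = 0.0
    for n in range(num_feats):
        for s in range(num_slots):
            total -= teacher_attn[s][n] * math.log(student_attn[s][n] + eps)
    return total / num_feats


# Toy logits (assumed): two slots, two features.
late_attn = softmax_over_slots([[2.0, 0.0], [0.0, 2.0]])
early_attn_bad = [late_attn[1], late_attn[0]]  # slots assigned the wrong features
```

In this toy case the loss is lower when the early map already agrees with the late map than when the slot assignments are swapped, which is the direction the self-distillation signal pushes.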

[195] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting

Di Li, Jie Feng, Jiahao Chen, Weisheng Dong, Guanbin Li, Yuhui Zheng, Mingtao Feng, Guangming Shi

Main category: cs.CV

TL;DR: The paper introduces SeqAffordSplat, a benchmark for sequential 3D affordance reasoning, and SeqSplatNet, an end-to-end framework leveraging LLMs and geometric pre-training to advance affordance understanding in complex scenes.

Details

Motivation: Current 3D affordance reasoning methods are limited to single-object, single-step interactions, failing to address long-horizon, multi-object tasks needed for real-world applications.

Method: Proposes SeqSplatNet, which uses a large language model to autoregressively generate text with segmentation tokens, guiding a conditional decoder for 3D mask production. Includes geometric pre-training and semantic feature fusion from 2D Vision Foundation Models.

Result: SeqSplatNet achieves state-of-the-art performance on the SeqAffordSplat benchmark, advancing affordance reasoning to sequential, scene-level tasks.

Conclusion: The work bridges the gap in 3D affordance reasoning, enabling complex, sequential interactions in 3DGS environments through innovative pre-training and feature fusion techniques.

Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.

[196] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions

Li Siyao, Yao Feng, Omid Tehari, Chen Change Loy, Michael J. Black

Main category: cs.CV

TL;DR: A novel ‘half-physics’ method enhances SMPL-X models by enabling dynamic physical interactions without unrealistic issues like interpenetration, while maintaining kinematic control and real-time performance.

Details

Motivation: Current 3D human models (e.g., SMPL-X) lack physical interaction capabilities, leading to unrealistic dynamics and interpenetration.

Method: Proposes a ‘half-physics’ mechanism to convert kinematic motion into physics simulation, ensuring plausible interactions without learning.

Result: Eliminates penetration and unrealistic dynamics, operates in real time, and generalizes across shapes and motions.

Conclusion: The method successfully integrates physical interactions into kinematic models, preserving motion fidelity and avoiding training complexities.

Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a “half-physics” mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.

[197] Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Dong Lao, Zhengyang Hu, Francesco Locatello, Yanchao Yang, Stefano Soatto

Main category: cs.CV

TL;DR: DivA is an unsupervised motion segmentation model that segments images into moving regions without semantic labels, achieving real-time performance and narrowing the gap with supervised methods.

Details

Motivation: To explore object emergence in visual perception without semantic supervision or pre-trained features, aiming for real-time, scalable motion segmentation.

Method: Uses a multi-modal conditional encoder-decoder architecture with optical flow and color image modalities, fostering information separation among latent codes (slots).

Result: DivA handles varying object counts and resolutions, achieves 104 FPS, and reduces the performance gap to supervised methods to ≤12%. It also improves contrastive learning for static classifiers.

Conclusion: DivA demonstrates robust unsupervised motion segmentation, bridging the gap with supervised methods and enabling efficient object proposal for downstream tasks.

Abstract: We investigate the emergence of objects in visual perception in the absence of any semantic annotation. The resulting model has received no supervision, does not use any pre-trained features, and yet it can segment the domain of an image into multiple independently moving regions. The resulting motion segmentation method can handle an unknown and varying number of objects in real-time. The core multi-modal conditional encoder-decoder architecture has one modality (optical flow) feed the encoder to produce a collection of latent codes (slots), and the other modality (color image) conditions the decoder to generate the first modality (flow) from the slots. The training criterion is designed to foster ‘information separation’ among the slots, while the architecture explicitly allocates activations to individual slots, leading to a method we call Divided Attention (DivA). At test time, DivA handles a different number of objects and different image resolution than seen at training, and is invariant to permutations of the slots. DivA achieves state-of-the-art performance while tripling the runtime speed of comparable methods, up to 104 FPS, and reduces the performance gap from supervised methods to 12% or less. Objects bootstrapped by DivA can then be used to prime static classifiers via contrastive learning. On fewer than 5,000 video clips, training DINO on DivA’s object proposals narrows the performance gap to ImageNet-based training by up to 30.2% compared to training directly on the video frames.

[198] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, Deva Ramanan

Main category: cs.CV

TL;DR: The paper proposes a method for dynamic scene reconstruction from sparse-view videos, addressing limitations of dense multi-view setups.

Details

Motivation: Prior methods require expensive, dense multi-view setups, which are impractical for diverse real-world scenes. The goal is to reconstruct dynamic human behaviors from sparse-view cameras.

Method: The approach aligns independent monocular reconstructions from each camera to ensure time- and view-consistent dynamic scene reconstructions.

Result: The method outperforms prior art, especially in novel view rendering, as demonstrated on PanopticStudio and Ego-Exo4D datasets.

Conclusion: The proposed sparse-view reconstruction method is effective, practical, and achieves high-quality results, with code and data publicly available.

Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.

[199] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo

Main category: cs.CV

TL;DR: A novel framework for video-to-4D generation using a Direct 4DMesh-to-GS Variation Field VAE and Gaussian Variation Field diffusion model, achieving high-quality dynamic 3D content from single video inputs.

Details

Motivation: Addressing the challenges of costly data construction and high-dimensional representation in 4D diffusion modeling for dynamic 3D content generation.

Method: Introduces a Direct 4DMesh-to-GS Variation Field VAE to encode canonical Gaussian Splats and their temporal variations, followed by a Gaussian Variation Field diffusion model trained on synthetic data.

Result: Superior generation quality compared to existing methods and remarkable generalization to in-the-wild video inputs.

Conclusion: The framework paves the way for generating high-quality animated 3D content from videos, even with synthetic training data.

Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.

[200] EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts

Yuto Haneji, Taichi Nishimura, Hirotaka Kameko, Keisuke Shirai, Tomoya Yoshida, Keiya Kajimura, Koki Yamamoto, Taiyu Cui, Tomohiro Nishimoto, Shinsuke Mori

Main category: cs.CV

TL;DR: The paper introduces the EgoOops dataset for mistake detection in text-following activities, combining video-text alignment and mistake labels, showing text integration improves accuracy.

Details

Motivation: Existing mistake detection methods focus on visually apparent errors in free-style activities, neglecting text-following tasks where text reference is essential.

Method: Proposes the EgoOops dataset with egocentric videos and three annotations: video-text alignment, mistake labels, and descriptions. Introduces a detection approach combining video-text alignment and classification.

Result: Experiments confirm procedural texts are crucial for accurate mistake detection in text-following activities.

Conclusion: The EgoOops dataset and proposed method address gaps in mistake detection, emphasizing the importance of text integration.

Abstract: Mistake action detection is crucial for developing intelligent archives that detect workers’ errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.

[201] Prompt-Based Exemplar Super-Compression and Regeneration for Class-Incremental Learning

Ruxiao Duan, Jieneng Chen, Adam Kortylewski, Alan Yuille, Yaoyao Liu

Main category: cs.CV

TL;DR: PESCR introduces a novel method for class-incremental learning by using a pre-trained diffusion model to compress and regenerate diverse exemplars, reducing memory use and improving performance.

Details

Motivation: Address the limitation of replay-based CIL methods, which suffer from poor exemplar diversity due to memory constraints.

Method: Uses a pre-trained diffusion model to compress images into visual/textual prompts, regenerating diverse exemplars later. Includes partial compression and diffusion-based augmentation to reduce domain gaps.

Result: Achieves 3.2% higher performance than previous state-of-the-art on ImageNet-100.

Conclusion: PESCR effectively enhances exemplar diversity and quantity, significantly boosting CIL performance without fine-tuning or storing the diffusion model.

Abstract: Replay-based methods in class-incremental learning (CIL) have attained remarkable success. Despite their effectiveness, the inherent memory restriction results in saving a limited number of exemplars with poor diversity. In this paper, we introduce PESCR, a novel approach that substantially increases the quantity and enhances the diversity of exemplars based on a pre-trained general-purpose diffusion model, without fine-tuning it on target datasets or storing it in the memory buffer. Images are compressed into visual and textual prompts, which are saved instead of the original images, decreasing memory consumption by a factor of 24. In subsequent phases, diverse exemplars are regenerated by the diffusion model. We further propose partial compression and diffusion-based data augmentation to minimize the domain gap between generated exemplars and real images. PESCR significantly improves CIL performance across multiple benchmarks, e.g., 3.2% above the previous state-of-the-art on ImageNet-100.

[202] FovEx: Human-Inspired Explanations for Vision Transformers and Convolutional Neural Networks

Mahadev Prasad Panda, Matteo Tiezzi, Martina Vilas, Gemma Roig, Bjoern M. Eskofier, Dario Zanca

Main category: cs.CV

TL;DR: FovEx is a novel XAI method combining foveation and gradients for efficient, architecture-agnostic explanations, outperforming existing techniques and aligning with human gaze patterns.

Details

Motivation: Current XAI methods are either architecture-dependent or computationally expensive, limiting their practicality and trustworthiness.

Method: FovEx integrates biologically inspired foveated perturbations with gradient-based explorations to efficiently identify and combine locations of interest for attribution maps.

Result: FovEx achieves state-of-the-art performance on transformers and convolutional models, with strong alignment to human gaze patterns (+14% NSS vs. RISE, +203% NSS vs. GradCAM).

Conclusion: FovEx bridges the interpretation gap between humans and machines, offering a versatile and efficient XAI solution.

Abstract: Explainability in artificial intelligence (XAI) remains a crucial aspect for fostering trust and understanding in machine learning models. Current visual explanation techniques, such as gradient-based or class-activation-based methods, often exhibit a strong dependence on specific model architectures. Conversely, perturbation-based methods, despite being model-agnostic, are computationally expensive as they require evaluating models on a large number of forward passes. In this work, we introduce Foveation-based Explanations (FovEx), a novel XAI method inspired by human vision. FovEx seamlessly integrates biologically inspired perturbations by iteratively creating foveated renderings of the image and combines them with gradient-based visual explorations to determine locations of interest efficiently. These locations are selected to maximize the performance of the model to be explained with respect to the downstream task and then combined to generate an attribution map. We provide a thorough evaluation with qualitative and quantitative assessments on established benchmarks. Our method achieves state-of-the-art performance on both transformers (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility among various architectures. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14% in NSS compared to RISE, +203% in NSS compared to GradCAM). This comparison enhances our confidence in FovEx’s ability to close the interpretation gap between humans and machines.
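
The NSS figures quoted above refer to Normalized Scanpath Saliency, which z-scores a saliency (or attribution) map and averages the result at human fixation locations. A minimal sketch follows. The function name and the (row, column) fixation format are assumptions, and it uses the population standard deviation, whereas some toolkits use the sample standard deviation.

```python
import math


def nss(saliency, fixations):
    """Normalized Scanpath Saliency of a 2D map at fixated (row, col) pixels."""
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    # Population standard deviation (divide by N); an assumption of this sketch.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        raise ValueError("NSS is undefined for a constant saliency map")
    if not fixations:
        raise ValueError("at least one fixation is required")
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)


# Toy map (assumed): a single salient pixel in the bottom-right corner.
toy_map = [[0.0, 0.0], [0.0, 4.0]]
```

A fixation on the salient pixel gives a positive score, and a fixation elsewhere gives a negative one, so higher NSS means the map places its mass where people actually looked.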

[203] SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image

Dimitrije Antić, Georgios Paschalidis, Shashank Tripathi, Theo Gevers, Sai Kumar Dwivedi, Dimitrios Tzionas

Main category: cs.CV

TL;DR: SDFit is a render-and-compare framework for 3D pose and shape estimation from images, addressing generalization and occlusion challenges with a learned morphable SDF model and foundational features.

Details

Motivation: Existing methods struggle with real-world generalization, lack refinement loops, and ignore pixel alignment. SDFit aims to overcome these by leveraging morphable SDFs and foundational models.

Method: SDFit uses a learned morphable SDF model, retrieves initial shapes from databases, and refines pose/shape via 2D-3D correspondences.

Result: SDFit matches SotA on unoccluded images and excels in robustness to occlusions/uncommon poses without retraining.

Conclusion: SDFit advances generalization in 3D pose/shape estimation, offering robustness and flexibility for real-world applications.

Abstract: Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle generalizing to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. This has three key innovations: First, it uses a learned category-specific and morphable signed-distance-function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search on the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild. Code is available at https://anticdimi.github.io/sdfit.

[204] DHCP: Detecting Hallucinations by Cross-modal Attention Patterns

Yudong Zhang, Ruobing Xie, Xingwu Sun, Yiqing Huang, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang

Main category: cs.CV

TL;DR: DHCP detects hallucinations in LVLMs by analyzing cross-modal attention patterns without extra training or inference steps.

Details

Motivation: LVLMs suffer from hallucination issues (object, attribute, relational), requiring accurate detection methods.

Method: Developed DHCP, a lightweight detector leveraging cross-modal attention pattern variations between hallucination and non-hallucination states.

Result: DHCP achieves remarkable performance in hallucination detection.

Conclusion: DHCP advances LVLM reliability by providing insights into hallucination identification and analysis.

Abstract: Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. However, they continue to suffer from significant hallucination issues, including object, attribute, and relational hallucinations. To accurately detect these hallucinations, we investigated the variations in cross-modal attention patterns between hallucination and non-hallucination states. Leveraging these distinctions, we developed a lightweight detector capable of identifying hallucinations. Our proposed method, Detecting Hallucinations by Cross-modal Attention Patterns (DHCP), is straightforward and does not require additional LVLM training or extra LVLM inference steps. Experimental results show that DHCP achieves remarkable performance in hallucination detection. By offering novel insights into the identification and analysis of hallucinations in LVLMs, DHCP contributes to advancing the reliability and trustworthiness of these models. The code is available at https://github.com/btzyd/DHCP.

[205] Understanding implementation pitfalls of distance-based metrics for image segmentation

Gasper Podobnik, Tomaz Vrtovec

Main category: cs.CV

TL;DR: The paper highlights inconsistencies in distance-based metric implementations (e.g., Hausdorff distance) across 11 open-source tools, revealing significant deviations and biases in medical imaging validation. It offers recommendations for tool selection and future improvements.

Details

Motivation: To address unrecognized discrepancies in distance-based metric implementations in medical imaging, which undermine benchmarking and introduce bias in clinical applications.

Method: Conceptual and empirical analysis of 11 open-source tools, evaluating computational steps and performance on 2D/3D image datasets.

Result: Deviations in Hausdorff distance exceeded 100 mm, with statistically significant differences between tools, impacting segmentation comparisons.

Conclusion: Prior cross-study comparisons may be invalid due to tool discrepancies. Recommendations for tool selection and future implementations are provided.

Abstract: Distance-based metrics, such as the Hausdorff distance (HD), are widely used to validate segmentation performance in (bio)medical imaging. However, their implementation is complex, and critical differences across open-source tools remain largely unrecognized by the community. These discrepancies undermine benchmarking efforts, introduce bias in biomarker calculations, and potentially distort medical device development and clinical commissioning. In this study, we systematically dissect 11 open-source tools that implement distance-based metric computation by performing both a conceptual analysis of their computational steps and an empirical analysis on representative two- and three-dimensional image datasets. Alarmingly, we observed deviations in HD exceeding 100 mm and identified multiple statistically significant differences between tools - demonstrating that statistically significant improvements on the same set of segmentations can be achieved simply by selecting a particular implementation. These findings cast doubts on the validity of prior comparisons of results across studies without accounting for the differences in metric implementations. To address this, we provide practical recommendations for tool selection; additionally, our conceptual analysis informs about the future evolution of implementing open-source tools.
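
To make the kind of discrepancy described above concrete, the sketch below computes a brute-force Hausdorff distance between two boundary point sets and shows how one implementation choice, whether voxel spacing is applied, changes the value. It is an illustration only: the summary does not say which choices cause the reported deviations, and the function names and toy coordinates are assumptions.

```python
import math


def directed_distances(a, b, spacing):
    """Distance from each point in a to its nearest point in b, in physical units."""
    result = []
    for p in a:
        nearest = min(
            math.sqrt(sum(((pi - qi) * s) ** 2 for pi, qi, s in zip(p, q, spacing)))
            for q in b
        )
        result.append(nearest)
    return result


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance: the larger of the two directed maxima."""
    return max(
        max(directed_distances(a, b, spacing)),
        max(directed_distances(b, a, spacing)),
    )


# Toy boundary points in (row, col) index coordinates (assumed data).
reference = [(0, 0), (0, 1), (0, 2)]
predicted = [(0, 0), (0, 1), (0, 5)]

hd_index_units = hausdorff(reference, predicted)               # spacing ignored
hd_millimetres = hausdorff(reference, predicted, (1.0, 2.0))   # 2 mm column spacing
```

With these inputs the same pair of segmentations yields 3.0 when spacing is ignored and 6.0 when a 2 mm column spacing is applied, which is why results from tools with different conventions are not directly comparable.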

[206] LidaRefer: Context-aware Outdoor 3D Visual Grounding for Autonomous Driving

Yeong-Seung Baek, Heung-Seon Oh

Main category: cs.CV

TL;DR: LidaRefer is a context-aware 3D visual grounding framework for outdoor scenes, addressing challenges like background dominance and lack of referential annotations. It uses object-centric feature selection, transformer-based alignment, and a novel supervision strategy (DiSCo) with pseudo-labeling, achieving top performance on Talk2Car-3D.

Details

Motivation: Outdoor 3D visual grounding is underexplored due to challenges like background dominance and lack of referential annotations for contextual learning.

Method: LidaRefer employs object-centric feature selection, a transformer-based encoder-decoder for cross-modal alignment, and DiSCo for explicit spatial relationship modeling. Pseudo-labeling is used for referential non-target objects.

Result: LidaRefer achieves state-of-the-art performance on the Talk2Car-3D dataset.

Conclusion: The proposed framework effectively addresses outdoor 3D VG challenges, demonstrating superior performance through innovative feature selection and supervision strategies.

Abstract: 3D visual grounding (VG) aims to locate objects or regions within 3D scenes guided by natural language descriptions. While indoor 3D VG has advanced, outdoor 3D VG remains underexplored due to two challenges: (1) large-scale outdoor LiDAR scenes are dominated by background points and contain limited foreground information, making cross-modal alignment and contextual understanding more difficult; and (2) most outdoor datasets lack spatial annotations for referential non-target objects, which hinders explicit learning of referential context. To this end, we propose LidaRefer, a context-aware 3D VG framework for outdoor scenes. LidaRefer incorporates an object-centric feature selection strategy to focus on semantically relevant visual features while reducing computational overhead. Then, its transformer-based encoder-decoder architecture establishes fine-grained cross-modal alignment between refined visual features and word-level text features, and captures comprehensive global context. Additionally, we present Discriminative-Supportive Collaborative localization (DiSCo), a novel supervision strategy that explicitly models spatial relationships between target, contextual, and ambiguous objects for accurate target identification. To enable this without manual labeling, we introduce a pseudo-labeling approach that retrieves 3D localization labels for referential non-target objects. LidaRefer achieves state-of-the-art performance on the Talk2Car-3D dataset under various evaluation settings.

[207] MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Xi Fang, Jiankun Wang, Xiaochen Cai, Shangqian Chen, Shuwen Yang, Haoyi Tao, Nan Wang, Lin Yao, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: MolParser is a novel OCSR method for extracting chemical structures from documents, outperforming existing methods by leveraging a large annotated dataset and active learning.

Details

Motivation: The rapid growth of chemistry publications and patents, with key information embedded in molecular figures, necessitates efficient extraction methods for large-scale applications.

Method: MolParser uses an extended SMILES encoding rule, builds the largest annotated dataset (MolParser-7M), and employs active learning and curriculum learning for training.

Result: MolParser outperforms classical and learning-based OCSR methods, especially in handling Markush structures and real-world document variations.

Conclusion: MolParser offers a robust solution for chemical structure extraction, with potential for broader applications, and the dataset is publicly available.

Abstract: In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly available on Hugging Face.

[208] Multi-Task Label Discovery via Hierarchical Task Tokens for Partially Annotated Dense Predictions

Jingdong Zhang, Hanrong Ye, Xin Li, Wenping Wang, Dan Xu

Main category: cs.CV

TL;DR: The paper proposes a novel method using hierarchical task tokens (global and fine-grained) to address the lack of pixel-wise supervision in multi-task dense prediction with partial annotations.

Details

Motivation: Existing methods for multi-task dense prediction with partial annotations rely on cross-task relations or adversarial training, but lack direct pixel-wise supervision and require heavy mapping networks.

Method: The approach learns compact hierarchical task tokens (global and fine-grained) to discover consistent pixel-wise supervision signals at feature and prediction levels. Global tokens enable cross-task feature interactions, while fine-grained tokens interact with task-specific feature maps to generate pseudo labels.

Result: Experiments on NYUD-v2, Cityscapes, and PASCAL Context datasets show significant improvements over state-of-the-art methods.

Conclusion: The proposed hierarchical task token method effectively addresses the supervision gap in partially annotated multi-task dense prediction, outperforming existing approaches.

Abstract: In recent years, simultaneous learning of multiple dense prediction tasks with partially annotated label data has emerged as an important research area. Previous works primarily focus on leveraging cross-task relations or conducting adversarial training for extra regularization, which achieve promising performance improvements, while still suffering from the lack of direct pixel-wise supervision and extra training of heavy mapping networks. To effectively tackle this challenge, we propose a novel approach to optimize a set of compact learnable hierarchical task tokens, including global and fine-grained ones, to discover consistent pixel-wise supervision signals in both feature and prediction levels. Specifically, the global task tokens are designed for effective cross-task feature interactions in a global context. Then, a group of fine-grained task-specific spatial tokens for each task is learned from the corresponding global task tokens. It is embedded to have dense interactions with each task-specific feature map. The learned global and local fine-grained task tokens are further used to discover pseudo task-specific dense labels at different levels of granularity, and they can be utilized to directly supervise the learning of the multi-task dense prediction framework. Extensive experimental results on challenging NYUD-v2, Cityscapes, and PASCAL Context datasets demonstrate significant improvements over existing state-of-the-art methods for partially annotated multi-task dense prediction.

[209] Revisiting the Evaluation Bias Introduced by Frame Sampling Strategies in Surgical Video Segmentation Using SAM2

Utku Ozbulak, Seyed Amir Mousavi, Francesca Tozzi, Niki Rashidian, Wouter Willaert, Wesley De Neve, Joris Vankerschaver

Main category: cs.CV

TL;DR: The study examines how inconsistent annotation density and frame rates in surgical video datasets affect zero-shot segmentation model evaluation, revealing biases in sparse sampling and advocating for real-time, high-FPS evaluation.

Details

Motivation: To address the variability in annotation protocols for surgical video segmentation and its impact on model evaluation, particularly in real-time AI-assisted surgery.

Method: Investigated the influence of annotation density and frame rate sampling on zero-shot segmentation models (using SAM2 for cholecystectomy), comparing sparse vs. high-FPS evaluations and surveying human preferences.

Result: Sparse evaluations misleadingly favor lower frame rates due to smoothing effects, while real-time conditions show higher frame rates improve stability. Human participants preferred high-FPS segmentations.

Conclusion: Inconsistent dataset protocols introduce evaluation bias; real-time, high-FPS benchmarking is crucial for accurate assessment in surgical video AI.

Abstract: Real-time video segmentation is a promising opportunity for AI-assisted surgery, offering intraoperative guidance by identifying tools and anatomical structures. Despite growing interest in surgical video segmentation, annotation protocols vary widely across datasets – some provide dense, frame-by-frame labels, while others rely on sparse annotations sampled at low frame rates such as 1 FPS. In this study, we investigate how such inconsistencies in annotation density and frame rate sampling influence the evaluation of zero-shot segmentation models, using SAM2 as a case study for cholecystectomy procedures. Surprisingly, we find that under conventional sparse evaluation settings, lower frame rates can appear to outperform higher ones due to a smoothing effect that conceals temporal inconsistencies. However, when assessed under real-time streaming conditions, higher frame rates yield superior segmentation stability, particularly for dynamic objects like surgical graspers. To understand how these differences align with human perception, we conducted a survey among surgeons, nurses, and machine learning engineers and found that participants consistently preferred high-FPS segmentation overlays, reinforcing the importance of evaluating every frame in real-time applications rather than relying on sparse sampling strategies. Our findings highlight the risk of evaluation bias that is introduced by inconsistent dataset protocols and bring attention to the need for temporally fair benchmarking in surgical video AI.
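The evaluation bias described above can be illustrated with a toy calculation. The sketch below is not the authors' protocol: the per-frame IoU values are made up, and the function name `sparse_vs_dense_score` is hypothetical. It only shows how scoring a clip on frames sampled at 1 FPS can hide quality drops that occur between the sampled frames.

```python
from statistics import mean
from typing import List, Tuple


def sparse_vs_dense_score(
    per_frame_iou: List[float], video_fps: int, eval_fps: int
) -> Tuple[float, float]:
    """Return (mean IoU on sparsely sampled frames, mean IoU on every frame)."""
    if eval_fps <= 0 or video_fps % eval_fps != 0:
        raise ValueError("eval_fps must be a positive divisor of video_fps")
    stride = video_fps // eval_fps          # e.g. 30 FPS video, 1 FPS labels -> 30
    sparse = per_frame_iou[::stride]        # only the annotated frames
    return mean(sparse), mean(per_frame_iou)


# Hypothetical 2-second clip at 30 FPS: the mask is good on most frames but
# flickers (IoU drops) on a few frames that fall between the 1 FPS samples.
ious = [0.9] * 60
for i in (7, 8, 22, 41, 42, 55):
    ious[i] = 0.3

sparse_score, dense_score = sparse_vs_dense_score(ious, video_fps=30, eval_fps=1)
print(f"sparse (1 FPS) mean IoU: {sparse_score:.2f}")
print(f"dense (every frame) mean IoU: {dense_score:.2f}")
```

In this made-up clip the sparse score ignores every flickering frame, which is one way a sparse protocol could overstate temporal stability.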

[210] Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models

Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang

Main category: cs.CV

TL;DR: The paper introduces PAR, a novel framework for accelerating LVLM inference by adaptively pruning tokens and layers, balancing performance and efficiency without retraining.

Details

Motivation: High computational costs of LVLMs limit their application. Existing methods either require retraining or fail to consistently select relevant tokens.

Method: PAR uses a meta-router to organize pruning flows across tokens and layers in a self-supervised manner.

Result: PAR achieves a superior balance between performance and efficiency and offers flexible pruning versions for various scenarios.

Conclusion: PAR provides an effective, flexible solution for LVLM inference acceleration, with publicly available code.

Abstract: Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of acceleration scenarios. The code for this work is publicly available at https://github.com/ASGO-MM/Pruning-All-Rounder.

[211] Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space

Jian Zhu, Zhengyu Jia, Tian Gao, Jiaxin Deng, Shidi Li, Lang Zhang, Fu Liu, Peng Jia, Xianpeng Lang

Main category: cs.CV

TL;DR: The paper introduces EOT-WM, a driving world model that unifies ego and other vehicle trajectories in videos for realistic driving simulation, outperforming state-of-the-art methods.

Details

Motivation: Existing world models focus on ego vehicle trajectories, neglecting interactions with other vehicles, limiting realistic simulation.

Method: EOT-WM projects trajectories into image coordinates, aligns them with video latents using a Spatial-Temporal Variational Auto Encoder, and employs a trajectory-injected diffusion Transformer for video generation.

Result: The model achieves 30% better FID and 55% better FVD on nuScenes, and can predict unseen driving scenes.

Conclusion: EOT-WM enhances driving simulation by integrating ego-other vehicle trajectories, improving realism and controllability.

Abstract: Advanced end-to-end autonomous driving systems predict other vehicles’ motions and plan the ego vehicle’s trajectory. The world model that can foresee the outcome of the trajectory has been used to evaluate the autonomous driving system. However, existing world models predominantly emphasize the trajectory of the ego vehicle and leave other vehicles uncontrollable. This limitation hinders their ability to realistically simulate the interaction between the ego vehicle and the driving scenario. In this paper, we propose a driving World Model named EOT-WM, unifying Ego-Other vehicle Trajectories in videos for driving simulation. Specifically, it remains a challenge to match multiple trajectories in the BEV space with each vehicle in the video to control the video generation. We first project ego-other vehicle trajectories in the BEV space into image coordinates for vehicle-trajectory matching via pixel positions. Then, trajectory videos are encoded by the Spatial-Temporal Variational Auto Encoder to align with driving video latents spatially and temporally in the unified visual space. A trajectory-injected diffusion Transformer is further designed to denoise the noisy video latents for video generation with the guidance of ego-other vehicle trajectories. In addition, we propose a metric based on control latent similarity to evaluate the controllability of trajectories. Extensive experiments are conducted on the nuScenes dataset, and the proposed model outperforms the state-of-the-art method by 30% in FID and 55% in FVD. The model can also predict unseen driving scenes with self-produced trajectories.
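The first step, projecting BEV trajectories into the image, can be sketched with a plain pinhole camera model. This is a minimal illustration, not the paper's pipeline. It assumes a forward-facing camera with no rotation relative to the ego frame, waypoints lying on a flat ground plane, and hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`) and camera height.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[float, float]


def bev_to_pixel(
    forward: float, left: float, cam_height: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Optional[Pixel]:
    """Project one ground-plane BEV waypoint (metres) to pixel coordinates.

    Assumed camera frame: x right, y down, z forward; the camera sits
    cam_height metres above the ground and is aligned with the ego heading.
    Returns None for points at or behind the camera plane.
    """
    if forward <= 0:
        return None
    x_cam = -left          # BEV "left" is negative x in the camera frame
    y_cam = cam_height     # the ground lies below the camera (y points down)
    z_cam = forward
    u = fx * x_cam / z_cam + cx
    v = fy * y_cam / z_cam + cy
    return (u, v)


def project_trajectory(
    waypoints: List[Tuple[float, float]], cam_height: float,
    fx: float, fy: float, cx: float, cy: float,
) -> List[Optional[Pixel]]:
    """Project a whole BEV trajectory, keeping None for invisible waypoints."""
    return [bev_to_pixel(f, l, cam_height, fx, fy, cx, cy) for f, l in waypoints]


# Hypothetical intrinsics for a 1600x900 image and a camera 1.5 m above ground.
pixels = project_trajectory(
    [(10.0, 0.0), (20.0, 2.0), (-5.0, 0.0)],
    cam_height=1.5, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0,
)
print(pixels)
```

A real setup would apply the full calibrated extrinsic transform for each camera before the perspective division.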

[212] Acknowledging Focus Ambiguity in Visual Questions

Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li, Anush Venkatesh, Danna Gurari

Main category: cs.CV

TL;DR: The paper introduces VQ-FocusAmbiguity, the first VQA dataset addressing ambiguity in image regions for questions, and benchmarks models on tasks related to recognizing and locating ambiguous regions.

Details

Motivation: Existing VQA datasets lack accounting for ambiguity in image regions referred to by questions, prompting the creation of VQ-FocusAmbiguity.

Method: The authors introduce a new dataset, analyze its properties, and benchmark modern models on tasks involving focus ambiguity recognition and region localization.

Result: The dataset proves challenging for modern models, highlighting gaps in handling ambiguity.

Conclusion: The dataset is shared publicly to encourage progress in addressing focus ambiguity in VQA.

Abstract: No published work on visual question answering (VQA) accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each plausible image region a question could refer to when arriving at valid answers. We next analyze and compare our dataset to existing datasets to reveal its unique properties. Finally, we benchmark modern models for two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions.

[213] Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction

Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu

Main category: cs.CV

TL;DR: A dual-stage framework for occlusion-robust two-hand reconstruction from monocular images, combining 2D priors from vision foundation models and diffusion-based 3D refinement.

Details

Motivation: Addressing challenges like misalignment and penetration artifacts in two-hand reconstruction due to complex postures and occlusions.

Method: Uses a fusion alignment encoder for 2D priors and a diffusion model for 3D refinement, guided by collision gradients.

Result: Achieves state-of-the-art performance on InterHand2.6M, HIC, and FreiHAND datasets, improving occlusion handling and interaction robustness.

Conclusion: The proposed method effectively aligns and refines two-hand interactions, offering a robust solution for monocular image reconstruction.

Abstract: Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models and diffusion-based generative 3D interaction refinement to achieve occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that aligns fused multimodal 2D priors like key points, segmentation maps, and depth cues from vision foundation models during training. This provides robust structured guidance, further enabling efficient inference without heavy foundation model encoders at test time while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model explicitly trained to convert interpenetrated 3D poses into plausible, penetration-free counterparts. Through collision gradient-guided denoising, the model rectifies artifacts while preserving natural spatial relationships between hands. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.

[214] Learning from Rendering: Realistic and Controllable Extreme Rainy Image Synthesis for Autonomous Driving Simulation

Kaibin Zhou, Kaifeng Huang, Hao Deng, Zelin Tao, Ziniu Liu, Lin Zhang, Shengjie Zhao

Main category: cs.CV

TL;DR: A learning-from-rendering rainy image synthesizer, CARLARain, is proposed to enhance realism and controllability in simulating extreme rainfall for autonomous driving perception models, improving semantic segmentation accuracy by 5%-8%.

Details

Motivation: Extreme rainfall conditions are rare and costly to capture in real-world settings, and existing simulators lack realism and controllability, limiting effective model evaluation.

Method: Combines rendering-based realism with learning-based controllability to create CARLARain, a simulator for paired rainy-clean images and labels under complex illumination.

Result: Improves semantic segmentation models’ accuracy (mIoU) by 5%-8% on synthetic data and enhances performance in real extreme rainy scenarios.

Conclusion: CARLARain effectively addresses the limitations of existing simulators, offering a reliable tool for evaluating and enhancing perception models in extreme weather.

Abstract: Autonomous driving simulators provide an effective and low-cost alternative for evaluating or enhancing visual perception models. However, the reliability of evaluation depends on the diversity and realism of the generated scenes. Extreme weather conditions, particularly extreme rainfalls, are rare and costly to capture in real-world settings. While simulated environments can help address this limitation, existing rainy image synthesizers often suffer from poor controllability over illumination and limited realism, which significantly undermines the effectiveness of the model evaluation. To that end, we propose a learning-from-rendering rainy image synthesizer, which combines the benefits of the realism of rendering-based methods and the controllability of learning-based methods. To validate the effectiveness of our extreme rainy image synthesizer on the semantic segmentation task, we require a continuous set of well-labeled extreme rainy images. By integrating the proposed synthesizer with the CARLA driving simulator, we develop CARLARain, an extreme rainy street scene simulator which can obtain paired rainy-clean images and labels under complex illumination conditions. Qualitative and quantitative experiments validate that CARLARain can effectively improve the accuracy of semantic segmentation models in extreme rainy scenes, with the models’ accuracy (mIoU) improved by 5% - 8% on the synthetic dataset and significantly enhanced in real extreme rainy scenarios under complex illuminations. Our source code and datasets are available at https://github.com/kb824999404/CARLARain/.
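The reported gains are measured in mIoU (mean intersection over union). For readers unfamiliar with the metric, the sketch below computes it on flattened label lists. It follows the common convention of averaging IoU over classes that appear in either the prediction or the ground truth, which may differ from the exact evaluation code used in the paper.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes present in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:                 # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy 4-pixel example with two classes.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
print(f"mIoU = {score:.4f}")
```

In the toy example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.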

[215] Vector-Quantized Vision Foundation Models for Object-Centric Learning

Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen

Main category: cs.CV

TL;DR: The paper proposes VQ-VFM-OCL (VVO), a unified architecture for Object-Centric Learning (OCL) that leverages Vision Foundation Model (VFM) representations more effectively by sharing quantized VFM features as reconstruction targets.

Details

Motivation: Existing OCL methods struggle with complex object textures and underutilize VFM representations.

Method: VVO unifies OCL methods by quantizing VFM representations as shared reconstruction targets, enhancing aggregation and supervision.

Result: VVO outperforms baselines in object discovery, recognition, and downstream tasks across various VFMs and decoders.

Conclusion: Shared quantization of VFM representations improves OCL performance, validated by experiments and mathematical analysis.

Abstract: Perceiving visual scenes as objects and background, as humans do, Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed slots. OCL’s self-supervision of reconstructing the input from these aggregated slots struggles with complex object textures, thus Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. However, existing methods leverage VFM representations in diverse ways and often fail to fully exploit their potential. In response, we propose a clean architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO), that unifies mainstream OCL methods. The key to our unification is simple yet effective: quantizing the same VFM representation in a shared manner and using it as the reconstruction target. Through mathematical modeling and statistical verification, we further analyze why VFM representations facilitate OCL aggregation and how their shared quantization as reconstruction targets strengthens OCL supervision. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. The implementation and model checkpoints are available on https://github.com/Genera1Z/VQ-VFM-OCL.
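Vector quantization, the operation at the core of VVO, replaces each feature vector with its nearest entry in a codebook. The sketch below shows only that generic lookup step on plain Python lists. The codebook values and the function name `quantize` are illustrative assumptions, and nothing here reproduces the paper's training procedure or how its codebook is learned.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature to its nearest code; return (indices, quantized vectors)."""
    if not codebook:
        raise ValueError("codebook must contain at least one code")
    indices = []
    quantized = []
    for f in features:
        best = min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Hypothetical 2-D features and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 0.8], [0.2, 1.7]]
indices, quantized = quantize(features, codebook)
print(indices)
```

In a setup like the one the abstract describes, the quantized vectors would stand in for the raw VFM features as the target a decoder has to reconstruct.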

[216] VRM: Knowledge Distillation via Virtual Relation Matching

Weijia Zhang, Fei Xie, Weidong Cai, Chao Ma

Main category: cs.CV

TL;DR: The paper revives relational knowledge distillation (KD) by addressing key issues like overfitting and spurious responses, introducing virtual relation matching (VRM) with affinity graphs and dynamic pruning, achieving state-of-the-art results.

Details

Motivation: Relational KD methods lag behind instance-matching ones due to susceptibility to overfitting and spurious responses. The paper aims to revive and improve relational KD.

Method: Proposes VRM, transferring affinity graphs with inter-sample, inter-class, and inter-view correlations, and dynamically pruning unreliable edges to mitigate spurious responses.

Result: VRM achieves 74.0% accuracy for ResNet50-to-MobileNetV2 on ImageNet and improves DeiT-T by 14.44% on CIFAR-100, setting new benchmarks.

Conclusion: VRM effectively revives relational KD, offering richer guidance and stronger regularization, outperforming existing methods across datasets and tasks.

Abstract: Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as their instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relation-based methods, including their susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to richer guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method, where it consistently sets new state-of-the-art records over a range of models, architectures, tasks, and set-ups. For instance, VRM for the first time hits 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves DeiT-T by 14.44% on CIFAR-100 with a ResNet56 teacher.
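As background, relation-based KD matches relations between samples rather than individual outputs. The sketch below shows a generic version of that idea: build a cosine-similarity affinity matrix over a batch for teacher and student, then penalize the difference. It is not VRM itself. The virtual views, inter-class relations, and edge pruning described in the abstract are omitted, and all names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine_affinity(embeddings: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise cosine similarity between all samples in a batch."""
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("zero-norm embedding has no direction")
        normed.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in normed] for u in normed]


def relation_matching_loss(
    student: Sequence[Sequence[float]], teacher: Sequence[Sequence[float]]
) -> float:
    """Mean squared difference between student and teacher affinity matrices."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher batches must have the same size")
    a_s, a_t = cosine_affinity(student), cosine_affinity(teacher)
    n = len(a_s)
    return sum((a_s[i][j] - a_t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


teacher_out = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]      # same relations, different scale
collapsed_student = [[1.0, 0.0], [1.0, 0.0]]    # both samples look identical

loss_aligned = relation_matching_loss(aligned_student, teacher_out)
loss_collapsed = relation_matching_loss(collapsed_student, teacher_out)
print(loss_aligned, loss_collapsed)
```

Because the loss depends only on pairwise relations, the student is free to use a different embedding scale (or dimension) than the teacher.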

[217] LiteGS: A High-performance Framework to Train 3DGS in Subminutes via System and Algorithm Codesign

Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang

Main category: cs.CV

TL;DR: LiteGS optimizes 3D Gaussian Splatting (3DGS) training by addressing low-level computation, mid-level data management, and top-level algorithm layers, achieving up to 13.4x speedup with comparable or better quality.

Details

Motivation: 3DGS is promising but suffers from high training costs, prompting the need for an optimized framework.

Method: LiteGS introduces hardware-aware optimizations, dynamic spatial sorting, and a robust densification criterion to improve efficiency.

Result: LiteGS trains up to 13.4x faster than the original 3DGS with comparable or superior quality, is up to 1.4x faster than the current SOTA among lightweight models, and sets a new accuracy record for high-quality reconstruction.

Conclusion: LiteGS significantly improves 3DGS training efficiency and quality, making it a superior alternative for 3D representation tasks.

Abstract: 3D Gaussian Splatting (3DGS) has emerged as a promising alternative in 3D representation. However, it still suffers from high training cost. This paper introduces LiteGS, a high performance framework that systematically optimizes the 3DGS training pipeline from multiple aspects. At the low-level computation layer, we design a “warp-based raster” associated with two hardware-aware optimizations to significantly reduce gradient reduction overhead. At the mid-level data management layer, we introduce dynamic spatial sorting based on Morton coding to enable a performant “Cluster-Cull-Compact” pipeline and improve data locality, therefore reducing cache misses. At the top-level algorithm layer, we establish a new robust densification criterion based on the variance of the opacity gradient, paired with a more stable opacity control mechanism, to achieve more precise parameter growth. Experimental results demonstrate that LiteGS accelerates the original 3DGS training by up to 13.4x with comparable or superior quality and surpasses the current SOTA in lightweight models by up to 1.4x speedup. For high-quality reconstruction tasks, LiteGS sets a new accuracy record and decreases the training time by an order of magnitude.
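Morton coding, which the abstract names as the basis for spatial sorting, interleaves the bits of quantized x, y, z coordinates so that points close in space tend to be close in the sorted order. The sketch below is a plain CPU illustration of that idea. The grid resolution, the bounding-box handling, and the function names are assumptions, and LiteGS's actual GPU implementation is not shown here.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z (x occupies the lowest bit)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def to_cell(value: float, lo: float, hi: float, bits: int) -> int:
    """Quantize a coordinate in [lo, hi] to an integer grid cell."""
    max_cell = (1 << bits) - 1
    cell = int((value - lo) / (hi - lo) * max_cell)
    return min(max(cell, 0), max_cell)      # clamp points outside the box


def sort_by_morton(points: List[Point], lo: float, hi: float, bits: int = 10) -> List[Point]:
    """Return points ordered along the Morton (Z-order) curve."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return sorted(
        points,
        key=lambda p: morton3d(*(to_cell(c, lo, hi, bits) for c in p), bits=bits),
    )


pts = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
ordered = sort_by_morton(pts, lo=0.0, hi=1.0, bits=4)
print(ordered)
```

Sorting primitives this way groups spatial neighbours into contiguous memory, which is the data-locality effect the abstract credits with reducing cache misses.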

[218] Tile and Slide : A New Framework for Scaling NeRF from Local to Global 3D Earth Observation

Camille Billouard, Dawa Derksen, Alexandre Constantin, Bruno Vallet

Main category: cs.CV

TL;DR: Snake-NeRF scales NeRF to large scenes by tiling without overlap, using out-of-core methods and a novel 2x2 tile progression strategy.

Details

Motivation: Current NeRF methods are limited to small scenes due to memory constraints during training.

Method: Divides scenes into non-overlapping NeRFs, crops images with overlap, and uses a 2x2 tile progression strategy with a segmented sampler.

Result: Enables processing large satellite images on a single GPU with linear time complexity and no quality loss.

Conclusion: Snake-NeRF effectively scales NeRF for large scenes without compromising quality.

Abstract: Neural Radiance Fields (NeRF) have recently emerged as a paradigm for 3D reconstruction from multiview satellite imagery. However, state-of-the-art NeRF methods are typically constrained to small scenes due to the memory footprint during training, which we study in this paper. Previous work on large-scale NeRFs palliates this by dividing the scene into NeRFs. This paper introduces Snake-NeRF, a framework that scales to large scenes. Our out-of-core method eliminates the need to load all images and networks simultaneously, and operates on a single device. We achieve this by dividing the region of interest into NeRFs that 3D tile without overlap. Importantly, we crop the images with overlap to ensure each NeRF is trained with all the necessary pixels. We introduce a novel 2×2 3D tile progression strategy and segmented sampler, which together prevent 3D reconstruction errors along the tile edges. Our experiments conclude that large satellite images can effectively be processed with linear time complexity, on a single GPU, and without compromise in quality.
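The combination of non-overlapping tiles and overlapping image crops can be pictured with a simplified 2D sketch. The code below splits a rectangular region into core tiles that do not overlap and pads each one by a margin to get its crop. The margin value, the flat 2D geometry, and the function name are assumptions for illustration. The paper's actual cropping, 2×2 progression order, and segmented sampler are not reproduced.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def tile_with_overlapping_crops(
    width: float, height: float, tile: float, margin: float
) -> List[Tuple[Box, Box]]:
    """Return (core_tile, padded_crop) pairs covering a width x height region.

    Core tiles partition the region without overlap; each crop extends its
    core tile by `margin` on every side, clamped to the region bounds.
    """
    if tile <= 0 or margin < 0:
        raise ValueError("tile must be positive and margin non-negative")
    pairs = []
    y = 0.0
    while y < height:
        x = 0.0
        while x < width:
            core = (x, y, min(x + tile, width), min(y + tile, height))
            crop = (
                max(core[0] - margin, 0.0),
                max(core[1] - margin, 0.0),
                min(core[2] + margin, width),
                min(core[3] + margin, height),
            )
            pairs.append((core, crop))
            x += tile
        y += tile
    return pairs


tiles = tile_with_overlapping_crops(100.0, 100.0, tile=50.0, margin=10.0)
for core, crop in tiles:
    print(core, crop)
```

Only one tile and its crop need to be in memory at a time, which is the general out-of-core idea the abstract describes.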

[219] Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Zhumei Wang, Zechen Hu, Ruoxi Guo, Huaijin Pi, Ziyong Feng, Sida Peng, Xiaowei Zhou, Mingtao Pei, Siyuan Huang

Main category: cs.CV

TL;DR: Mocap-2-to-3 is a framework for recovering metrically accurate 3D human motion from monocular input by leveraging 2D data pre-training and multi-view synthesis, outperforming state-of-the-art methods.

Details

Motivation: Existing methods for absolute human motion recovery from monocular inputs are limited by dependency on 3D training data and difficulty in estimating metric-scale poses.

Method: The framework decomposes 3D motion into multi-view syntheses, pre-trains a single-view diffusion model on 2D data, and fine-tunes a multi-view model with 3D data. It also introduces a novel motion representation decoupling local pose and global movements.

Result: The method achieves superior performance in camera-space motion realism and world-grounded positioning, with strong generalization.

Conclusion: Mocap-2-to-3 effectively addresses challenges in monocular motion recovery, offering improved accuracy and generalization.

Abstract: Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.

[220] Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

Main category: cs.CV

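TL;DR: Meta CLIP 2 is a recipe for training CLIP from scratch on worldwide web-scale data. It outperforms its English-only counterpart and mSigLIP on zero-shot ImageNet and sets new state-of-the-art results on multilingual benchmarks without system-level confounding factors such as translation or bespoke architecture changes.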

Details

Motivation: Addressing challenges in scaling CLIP training to worldwide web data, including lack of curation for non-English data and performance degradation in multilingual settings.

Method: Proposes Meta CLIP 2, a recipe for training CLIP from scratch on global web-scale image-text pairs, with minimal changes to ensure mutual benefits from English and non-English data.

Result: Surpasses English-only CLIP by 0.8% and mSigLIP by 0.7% in zero-shot ImageNet classification, achieving SOTA on multilingual benchmarks like CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3%).

Conclusion: Meta CLIP 2 effectively overcomes the ‘curse of multilinguality’ and enhances performance across English and multilingual tasks, setting new standards for CLIP models.

Abstract: Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting applications from zero-shot classification and retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than that of its English-only counterpart, i.e., the “curse of multilinguality” that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.
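For readers unfamiliar with the zero-shot classification protocol behind the ImageNet numbers, the following is a minimal sketch of the general CLIP-style procedure, not Meta CLIP 2's actual code. It assumes image and text embeddings have already been produced by some encoder; the function names, the toy vectors, and the temperature value are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_emb, class_text_embs, temperature=0.01):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: embedding of one image (hypothetical encoder output).
    class_text_embs: one embedding per class prompt, e.g. "a photo of a dog".
    temperature: assumed softmax temperature; real models learn this value.
    """
    image = l2_normalize(image_emb)
    logits = []
    for text_emb in class_text_embs:
        text = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(cosine / temperature)

    # Numerically stable softmax over the class logits.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

No training on the target classes is involved: changing the class list only requires encoding new prompts, which is why the same model can be evaluated on benchmarks in many languages.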

[221] WeakSupCon: Weakly Supervised Contrastive Learning for Encoder Pre-training

Bodong Zhang, Hamid Manoochehri, Xiwen Li, Beatrice S. Knudsen, Tolga Tasdizen

Main category: cs.CV

TL;DR: A novel weakly supervised feature representation learning method, WeakSupCon, is proposed to improve MIL classification by utilizing bag-level labels and contrastive losses.

Details

Motivation: Existing MIL approaches neglect weakly supervised feature representation learning, relying on self-supervised or pre-trained features.

Method: WeakSupCon employs multi-task learning with distinct contrastive losses for samples of different bag labels.

Result: WeakSupCon outperforms self-supervised methods in MIL classification across three datasets, even with limited resources.

Conclusion: WeakSupCon effectively leverages weak labels for feature learning, enhancing MIL performance.

Abstract: Weakly supervised multiple instance learning (MIL) is a challenging task given that only bag-level labels are provided, while each bag typically contains multiple instances. This topic has been extensively studied in histopathological image analysis, where labels are usually available only at the whole slide image (WSI) level, while each WSI could be divided into thousands of small image patches for training. The dominant MIL approaches focus on feature aggregation and take fixed patch features as inputs. However, weakly supervised feature representation learning in MIL settings is always neglected. Those features used to be generated by self-supervised learning methods that do not utilize weak labels, or by foundation encoders pre-trained on other large datasets. In this paper, we propose a novel weakly supervised feature representation learning method called Weakly Supervised Contrastive Learning (WeakSupCon) that utilizes bag-level labels. In our method, we employ multi-task learning and define distinct contrastive losses for samples with different bag labels. Our experiments demonstrate that the features generated using WeakSupCon with limited computing resources significantly enhance MIL classification performance compared to self-supervised approaches across three datasets. Our WeakSupCon code is available at github.com/BzhangURU/Paper_WeakSupCon

[222] VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong

Main category: cs.CV

TL;DR: VL-Cogito, a multimodal reasoning model, uses Progressive Curriculum Reinforcement Learning (PCuRL) to improve performance across diverse tasks by dynamically adjusting training difficulty and reasoning path length.

Details

Motivation: Existing models struggle with unstable performance in multimodal tasks due to their complexity and diversity.

Method: VL-Cogito employs PCuRL with two innovations: an online difficulty soft weighting mechanism and a dynamic length reward mechanism.

Result: VL-Cogito outperforms existing models in multimodal benchmarks across mathematics, science, logic, and general understanding.

Conclusion: The PCuRL framework effectively enhances multimodal reasoning, balancing efficiency and correctness.

Abstract: Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.

[223] PLMP – Point-Line Minimal Problems for Projective SfM

Kim Kiehn, Albin Ahlbäck, Kathlén Kohn

Main category: cs.CV

TL;DR: The paper classifies 291 minimal problems in Structure-from-Motion (SfM) involving points and lines observed by uncalibrated cameras, identifying 73 with unique solutions. It analyzes solution counts and introduces methods to factorize and identify minimal problems.

Details

Motivation: To systematically classify and understand minimal problems in SfM for uncalibrated cameras, enabling efficient solutions and insights into problem difficulty.

Method: The study identifies and classifies minimal problems, computes solution counts, and uses stabilizer subgroups to factorize and analyze problems.

Result: Found 291 minimal problems, 73 with unique solutions. Problems have up to 9 cameras, 7 points, and 12 lines. Solution counts are relatively low.

Conclusion: The work provides a comprehensive classification and tools for analyzing minimal SfM problems, aiding in understanding and solving them efficiently.

Abstract: We completely classify all minimal problems for Structure-from-Motion (SfM) where arrangements of points and lines are fully observed by multiple uncalibrated pinhole cameras. We find 291 minimal problems, 73 of which have unique solutions and can thus be solved linearly. Two of the linear problems allow an arbitrary number of views, while all other minimal problems have at most 9 cameras. All minimal problems have at most 7 points and at most 12 lines. We compute the number of solutions of each minimal problem, as this gives a measurement of the problem’s intrinsic difficulty, and find that these numbers are relatively low (e.g., when comparing with minimal problems for calibrated cameras). Finally, by exploring stabilizer subgroups of subarrangements, we develop a geometric and systematic way to 1) factorize minimal problems into smaller problems, 2) identify minimal problems in underconstrained problems, and 3) formally prove non-minimality.

[224] Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin

Main category: cs.CV

TL;DR: Ultra3D is a 3D generation framework that improves efficiency in sparse voxel modeling using VecSet representation and Part Attention, achieving high-resolution results without quality loss.

Details

Motivation: Existing sparse voxel frameworks are computationally inefficient due to quadratic complexity in attention mechanisms, hindering high-resolution 3D generation.

Method: Ultra3D uses VecSet for coarse layout generation and Part Attention for localized feature refinement, reducing token count and computation.

Result: Achieves up to 6.7x speed-up in latent generation and supports generation at 1024 resolution with state-of-the-art visual fidelity and user preference.

Conclusion: Ultra3D efficiently accelerates 3D generation while maintaining high quality, making it a promising solution for high-resolution modeling.

Abstract: Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
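The idea of restricting attention to tokens within the same part can be illustrated with a small sketch. This is a toy under stated assumptions, not Ultra3D's implementation: it uses single-head scaled dot-product attention in pure Python, and the names `part_attention`, `attention_pairs`, and `part_ids` are hypothetical. Each token attends only to tokens that share its part label, so the number of query-key pairs drops from n^2 to the sum of squared part sizes.

```python
import math


def part_attention(queries, keys, values, part_ids):
    """Scaled dot-product attention where token i only sees tokens in its own part."""
    n = len(queries)
    dim = len(queries[0])
    outputs = []
    for i in range(n):
        # Indices of tokens sharing the part label of token i (always includes i).
        group = [j for j in range(n) if part_ids[j] == part_ids[i]]
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in group
        ]
        # Softmax restricted to the group.
        max_score = max(scores)
        weights = [math.exp(s - max_score) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(weights[t] / total * values[j][c] for t, j in enumerate(group))
            for c in range(len(values[0]))
        ])
    return outputs


def attention_pairs(part_ids):
    """Return (global pair count, part-local pair count) for a labeling."""
    n = len(part_ids)
    sizes = {}
    for label in part_ids:
        sizes[label] = sizes.get(label, 0) + 1
    return n * n, sum(size * size for size in sizes.values())
```

With many parts of similar size, the part-local pair count grows roughly as n^2 divided by the number of parts, which is the kind of saving localized attention aims for.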

[225] Recovering Partially Corrupted Objects via Sketch-Guided Bidirectional Feature Interaction

Yongle Zhang, Yimin Liu, Yan Huang, Qiang Wu

Main category: cs.CV

TL;DR: A bidirectional feature interaction framework enhances sketch-guided object inpainting by integrating context and sketch information for better spatial control.

Details

Motivation: Existing text-guided diffusion models lack precise pixel-level spatial control, and sketch-guided methods often neglect contextual information from uncorrupted areas, leading to inconsistency.

Method: Proposes a bidirectional interaction framework (context-to-sketch and sketch-to-inpainting) within a pretrained Stable Diffusion model, using multi-scale latents and sketch-conditional affine transformation.

Result: Outperforms state-of-the-art methods on benchmark datasets, achieving enhanced spatial fidelity in object inpainting.

Conclusion: The bidirectional interaction effectively addresses sketch-guided inconsistency, improving structural control for partially corrupted objects.

Abstract: Text-guided diffusion models have achieved remarkable success in object inpainting by providing high-level semantic guidance through text prompts. However, they often lack precise pixel-level spatial control, especially in scenarios involving partially corrupted objects where critical uncorrupted cues remain. To overcome this limitation, sketch-guided methods have been introduced, using either indirect gradient modulation or direct sketch injection to improve structural control. Yet, existing approaches typically establish a one-way mapping from the sketch to the masked regions only, neglecting the contextual information from unmasked object areas. This leads to a disconnection between the sketch and the uncorrupted content, thereby causing sketch-guided inconsistency and structural mismatch. To tackle this challenge, we propose a sketch-guided bidirectional feature interaction framework built upon a pretrained Stable Diffusion model. Our bidirectional interaction features two complementary directions, context-to-sketch and sketch-to-inpainting, that enable fine-grained spatial control for partially corrupted object inpainting. In the context-to-sketch direction, multi-scale latents from uncorrupted object regions are propagated to the sketch branch to generate a visual mask that adapts the sketch features to the visible context and denoising progress. In the sketch-to-inpainting direction, a sketch-conditional affine transformation modulates the influence of sketch guidance based on the learned visual mask, ensuring consistency with uncorrupted object content. This interaction is applied at multiple scales within the encoder of the diffusion U-Net, enabling the model to restore object structures with enhanced spatial fidelity. Extensive experiments on two newly constructed benchmark datasets demonstrate that our approach outperforms state-of-the-art methods.

[226] RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

Yuwen Du, Anning Hu, Zichen Chao, Yifan Lu, Junhao Ge, Genjia Liu, Weitao Wu, Lanjun Wang, Siheng Chen

Main category: cs.CV

TL;DR: RoCo-Sim is a simulation framework for roadside collaborative perception, addressing data issues like calibration errors and multi-view consistency through dynamic editing and style transfer, outperforming SOTA methods.

Details

Motivation: Existing roadside perception methods focus on model design but neglect data issues, leading to poor performance. RoCo-Sim aims to enhance perception by tackling these data challenges.

Method: RoCo-Sim includes Camera Extrinsic Optimization, Multi-View Occlusion-Aware Sampler, DepthSAM for foreground-background modeling, and a Scalable Post-Processing Toolkit.

Result: RoCo-Sim improves 3D object detection, outperforming SOTA by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70.

Conclusion: RoCo-Sim fills a critical gap in roadside perception simulation, offering a robust solution for collaborative perception.

Abstract: Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D-to-2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of foreground; and (4) Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by 83.74 on Rcooper-Intersection and 83.12 on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon: https://github.com/duyuwen-duen/RoCo-Sim
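To make the role of camera extrinsics concrete, here is a minimal sketch of the standard pinhole projection from a 3D world point to pixel coordinates, together with a reprojection error that an extrinsic optimization could minimize. It is not RoCo-Sim's code: the function names are hypothetical, lens distortion is ignored, and using reprojection error as the objective is an assumption rather than something the abstract states.

```python
import math


def project(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation (3x3) and translation (length 3) are the camera extrinsics;
    fx, fy, cx, cy are the intrinsics (focal lengths and principal point).
    """
    # World -> camera coordinates: X_cam = R * X_world + t
    cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic mapping.
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return u, v


def reprojection_error(point_world, observed_uv, rotation, translation,
                       fx, fy, cx, cy):
    """Pixel distance between the projected point and where it was observed."""
    u, v = project(point_world, rotation, translation, fx, fy, cx, cy)
    return math.hypot(u - observed_uv[0], v - observed_uv[1])
```

A small error in the translation or rotation shifts every projected asset in the image, which is why inaccurate extrinsics make inserted foreground objects look misplaced.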

[227] PatchTraj: Unified Time-Frequency Representation Learning via Dynamic Patches for Trajectory Prediction

Yanghong Liu, Xingping Dong, Ming Li, Weixing Zhang, Yidong Lou

Main category: cs.CV

TL;DR: PatchTraj is a dynamic patch-based framework for pedestrian trajectory prediction, integrating time-frequency joint modeling to address limitations in existing methods. It achieves state-of-the-art performance on multiple datasets.

Details

Motivation: Existing methods inadequately model human motion dynamics and lack interaction between time representations and frequency components. PatchTraj aims to overcome these challenges.

Method: Decomposes trajectories into time sequences and frequency components, uses dynamic patch partitioning for multi-scale segmentation, and employs adaptive embedding with hierarchical feature aggregation. Cross-modal attention enhances temporal and spectral cues.

Result: Achieves significant improvements, notably 26.7% in ADE and 17.4% in FDE on the JRDB dataset, outperforming existing methods.

Conclusion: PatchTraj demonstrates strong expressive power and potential for embodied intelligence, offering accurate trajectory predictions.

Abstract: Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods expose two main limitations: they insufficiently model human motion dynamics, failing to balance local motion details with long-range spatiotemporal dependencies, and their time representations lack interaction with the corresponding frequency components when jointly modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based framework that integrates time-frequency joint modeling for trajectory prediction. Specifically, we decompose the trajectory into raw time sequences and frequency components, and employ dynamic patch partitioning to perform multi-scale segmentation, capturing hierarchical motion patterns. Each patch undergoes adaptive embedding with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches are further enhanced via cross-modal attention, facilitating complementary fusion of temporal and spectral cues. The resulting enhanced embeddings exhibit strong expressive power, enabling accurate predictions even when using a vanilla Transformer architecture. Extensive experiments on ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance. Notably, on the egocentric JRDB dataset, PatchTraj attains significant relative improvements of 26.7% in ADE and 17.4% in FDE, underscoring its substantial potential in embodied intelligence.
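The two preprocessing ideas named in the abstract, a frequency view of the trajectory and multi-scale patching, can be sketched in a few lines. This is an illustration under assumptions, not PatchTraj's method: it uses a plain discrete Fourier transform on one coordinate and fixed patch lengths as a stand-in for the paper's dynamic partitioning, and all function names are hypothetical.

```python
import cmath


def dft_magnitudes(sequence):
    """Magnitude spectrum of a 1D sequence via a direct discrete Fourier transform."""
    n = len(sequence)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(sequence)))
        for k in range(n)
    ]


def patchify(sequence, patch_len, stride):
    """Split a sequence into (possibly overlapping) patches of fixed length."""
    return [
        sequence[start:start + patch_len]
        for start in range(0, len(sequence) - patch_len + 1, stride)
    ]


def multi_scale_patches(sequence, patch_lens=(2, 4)):
    """Non-overlapping patches at several scales (fixed scales are an assumption)."""
    return {length: patchify(sequence, length, length) for length in patch_lens}
```

Short patches preserve local motion detail while long patches summarize longer-range structure; in the paper's design both the time sequence and its frequency components would be patched and embedded before fusion.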

[228] Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling

Christopher Xie, Armen Avetisyan, Henry Howard-Jenkins, Yawar Siddiqui, Julian Straub, Richard Newcombe, Vasileios Balntas, Jakob Engel

Main category: cs.CV

TL;DR: A human-in-the-loop system improves 3D scene layout estimation by integrating user feedback for local corrections, leveraging a multi-task version of SceneScript for better accuracy.

Details

Motivation: To enhance 3D scene layout estimation by incorporating human feedback for local error correction, enabling more accurate modeling of complex layouts.

Method: Introduces a local correction task where users identify errors, prompting automatic fixes. Uses a multi-task SceneScript model for global predictions and local corrections, structured as an NLP infilling task.

Result: The system maintains global prediction performance while significantly improving local correction ability, allowing iterative refinement via a user-friendly workflow.

Conclusion: Integrated into a human-in-the-loop system, the approach lets the final refined layout diverge from the training distribution, allowing more accurate modeling of complex layouts.

Abstract: We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through the introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as “infilling”, a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction “one-click fix” workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.

[229] VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming

Main category: cs.CV

TL;DR: VisNumBench evaluates MLLMs’ number sense, finding they lag behind humans, with no significant improvement from multimodal math or CoT models. Larger models show slight gains.

Details

Motivation: Assess if MLLMs can develop human-like intuitive number sense.

Method: Created VisNumBench with about 1,900 multiple-choice QA pairs from synthetic and real-world visual data, covering seven visual numerical attributes and four types of estimation tasks. Evaluated 17 MLLMs.

Result: MLLMs perform below human levels; no major gains from math/CoT models; larger models show modest improvements.

Conclusion: VisNumBench is a valuable tool for advancing MLLMs’ number sense abilities.

Abstract: Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs’ number sense abilities. Code and dataset are available at https://wwwtttjjj.github.io/VisNumBench/.

[230] Learning 3D Scene Analogies with Neural Contextual Scene Maps

Junho Kim, Gwangtak Bae, Eun Sun Lee, Young Min Kim

Main category: cs.CV

TL;DR: The paper introduces 3D scene analogies to align spatial relationships in 3D environments, enabling applications in AR/VR and robotics.

Details

Motivation: Machines need to understand scene contexts for tasks in noisy or unseen 3D environments, but data-driven learning alone is insufficient.

Method: Proposes neural contextual scene maps to extract descriptor fields for semantic and geometric contexts, aligning them coarsely to finely for robust map estimation.

Result: The approach effectively identifies scene analogies and transfers trajectories or object placements in diverse indoor scenes.

Conclusion: The method shows promise for robotics and AR/VR applications, with code available for further exploration.

Abstract: Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. Because data-driven learning cannot tractably cover the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications. The project page, including the code, is available at https://82magnolia.github.io/3d_scene_analogies/.

[231] Beyond the Encoder: Joint Encoder-Decoder Contrastive Pre-Training Improves Dense Prediction

Sébastien Quetin, Tapotosh Ghosh, Farhad Maleki

Main category: cs.CV

TL;DR: DeCon is a novel encoder-decoder self-supervised learning framework that jointly pre-trains both components, outperforming traditional methods in dense prediction tasks.

Details

Motivation: Current contrastive learning methods neglect joint pre-training of encoders and decoders, limiting potential performance gains in dense prediction tasks.

Method: DeCon extends SSL architectures to include decoders and introduces a weighted encoder-decoder contrastive loss with non-competing objectives for joint pre-training.

Result: DeCon achieves state-of-the-art results on COCO and other benchmarks, showing improved performance across diverse architectures and scenarios.

Conclusion: Joint pre-training of encoder-decoder architectures enhances representation power and boosts performance in dense prediction tasks.

Abstract: Contrastive learning methods in self-supervised settings have primarily focused on pre-training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre-training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder-decoder self-supervised learning (SSL) framework that supports joint contrastive pre-training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder-decoder contrastive loss with non-competing objectives to enable the joint pre-training of encoder-decoder architectures. By adapting an established contrastive SSL framework for dense prediction tasks, DeCon achieves new state-of-the-art results: on COCO object detection and instance segmentation when pre-trained on COCO dataset; across almost all dense downstream benchmark tasks when pre-trained on COCO+ and ImageNet-1K. Our results demonstrate that joint pre-training enhances the representation power of the encoder and improves performance in dense prediction tasks. This gain persists across heterogeneous decoder architectures, various encoder architectures, and in out-of-domain limited-data scenarios.

[232] ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling

Radu Beche, Sergiu Nedevschi

Main category: cs.CV

TL;DR: ClaraVid is a synthetic aerial dataset addressing limitations of existing datasets, offering high-resolution images, depth maps, and segmentation. It introduces DSP, a complexity metric, to benchmark neural reconstruction methods.

Details

Motivation: Existing synthetic aerial datasets are limited by task-specific constraints, unrealistic scenes, and rendering artifacts, hindering holistic scene understanding.

Method: ClaraVid provides 16,917 high-resolution images with dense depth maps, panoptic segmentation, and dynamic object masks. The Delentropic Scene Profile (DSP) is introduced to measure scene complexity.

Result: DSP reveals a correlation between scene complexity and reconstruction accuracy, with higher delentropy linked to increased errors.

Conclusion: ClaraVid and DSP advance aerial scene understanding by providing a robust dataset and a reliable complexity metric for neural reconstruction tasks.

Abstract: The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032x3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. The data and code are available on the project page at https://rdbch.github.io/claravid/
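As background for the complexity metric, the sketch below computes a delentropy value for a small grayscale image. It assumes the commonly used definition of delentropy as half the Shannon entropy of the joint histogram of the two image gradient components; whether ClaraVid's DSP uses exactly this form, and how it aggregates values into a scene profile, is not stated in the abstract. The function name and the central-difference gradient are illustrative choices.

```python
import math
from collections import Counter


def delentropy(image):
    """Delentropy (in bits) of a grayscale image given as a list of rows of ints.

    Gradients are central differences on interior pixels; the joint histogram
    of (dx, dy) pairs plays the role of the gradient density.
    """
    height, width = len(image), len(image[0])
    gradients = Counter()
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            dx = image[y][x + 1] - image[y][x - 1]
            dy = image[y + 1][x] - image[y - 1][x]
            gradients[(dx, dy)] += 1

    total = sum(gradients.values())
    if total == 0:
        raise ValueError("image must be at least 3x3")

    entropy = 0.0
    for count in gradients.values():
        p = count / total
        entropy -= p * math.log2(p)
    # The factor 1/2 follows the assumed definition stated above.
    return 0.5 * entropy
```

A flat image yields zero because every gradient is identical, while textured regions with many distinct gradient pairs score higher, which matches the intuition that such scenes are harder to reconstruct.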

[233] Accenture-NVS1: A Novel View Synthesis Dataset

Thomas Sugg, Kyle O’Brien, Lekh Poudel, Alex Dumouchelle, Michelle Jou, Marc Bosch, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani

Main category: cs.CV

TL;DR: ACC-NVS1 is a dataset for Novel View Synthesis, focusing on airborne and ground imagery, with 148,000 images from six diverse scenes.

Details

Motivation: To address challenges like varying altitudes and transient objects in Novel View Synthesis research, supplementing existing datasets.

Method: Data collected in Austin, TX, and Pittsburgh, PA (2023-2024) from airborne and ground cameras across six scenes.

Result: A dataset of 148,000 images, enhancing resources for research without serving as a benchmark.

Conclusion: ACC-NVS1 provides valuable supplementary data for Novel View Synthesis research, filling gaps in existing datasets.

Abstract: This paper introduces ACC-NVS1, a specialized dataset designed for research on Novel View Synthesis specifically for airborne and ground imagery. Data for ACC-NVS1 was collected in Austin, TX and Pittsburgh, PA in 2023 and 2024. The collection encompasses six diverse real-world scenes captured from both airborne and ground cameras, resulting in a total of 148,000 images. ACC-NVS1 addresses challenges such as varying altitudes and transient objects. This dataset is intended to supplement existing datasets, providing additional resources for comprehensive research, rather than serving as a benchmark.

[234] LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks

Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mengfei Shi, Xia Xie, Shengyong Chen

Main category: cs.CV

TL;DR: The paper introduces LIDAR, a lightweight network for efficient pixel-level crack segmentation using multimodal data, outperforming SOTA methods with minimal computational cost.

Details

Motivation: The challenge of achieving accurate crack segmentation with low computational cost and adaptive cross-modal feature fusion motivates the proposed method.

Method: LIDAR combines the LacaVSS module for adaptive cue modeling and the LD3CF module for cross-modal fusion, using LDMK convolution to reduce computational overhead.

Result: LIDAR achieves 0.8204 F1 and 0.8465 mIoU on a light-field depth dataset with only 5.35M parameters.

Conclusion: The proposed LIDAR network effectively addresses computational and adaptive fusion challenges in crack segmentation, demonstrating superior performance.

Abstract: Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
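The F1 and mIoU figures quoted above are standard segmentation metrics. As a rough illustration only, the sketch below computes both for a flattened binary crack mask. The function names are hypothetical, and averaging IoU over the two classes (crack and background) is an assumption; the summary does not state how the paper computes mIoU.

```python
def confusion(pred, target):
    """Count true/false positives and negatives for flat binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(pred, target):
    """F1 of the positive (crack) class."""
    tp, fp, fn, _ = confusion(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target):
    """IoU averaged over crack and background (an assumed convention)."""
    tp, fp, fn, tn = confusion(pred, target)
    ious = []
    for inter, union in ((tp, tp + fp + fn), (tn, tn + fp + fn)):
        if union:  # skip a class that is absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 1.0
```

For example, `f1_score([1, 1, 0, 0], [1, 0, 0, 0])` scores one correct crack pixel against one false alarm.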

[235] GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan

Main category: cs.CV

TL;DR: The paper explores enhancing CLIP’s fine-grained visual perception by leveraging generative models, identifying optimal conditioning, denoising, and generation strategies, and introduces GenHancer for improved performance.

Details

Motivation: To address CLIP's limitations in fine-grained visual details by integrating generative models, uncovering effective methods for representation enhancement.

Method: Investigates conditioning mechanisms, denoising configurations, and generation paradigms, proposing a two-stage training strategy and lightweight denoisers.

Result: GenHancer outperforms prior methods, achieving a 6.0% improvement on OpenAICLIP, and enhances multimodal large language models.

Conclusion: The study provides a versatile and effective approach (GenHancer) for enhancing CLIP’s visual representations, with publicly available models and codes.

Abstract: The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP’s visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth explorations, we have finally arrived at an effective method, namely GenHancer, which consistently outperforms prior arts on the MMVP-VLM benchmark, e.g., 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All the models and codes are made publicly available.

[236] Robust Adverse Weather Removal via Spectral-based Spatial Grouping

Yuhwan Jeong, Yunseo Yang, Youngho Yoon, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: The paper proposes SSGformer, a novel method for multi-weather image restoration using spectral decomposition and group-wise attention to handle diverse adverse weather degradations.

Details

Motivation: Adverse weather conditions create complex degradation patterns, and existing All-in-One models struggle to capture localized distortions.

Method: SSGformer decomposes images into high-frequency edge features and low-frequency information, uses multi-head linear attention, and introduces a group-wise attention mechanism with a grouping-mask for spatial similarity.

Result: Extensive experiments demonstrate SSGformer’s superiority in handling varied adverse weather degradations.

Conclusion: SSGformer effectively addresses diverse weather-related distortions, offering robust and consistent performance.

Abstract: Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address this issue, we propose Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on the spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling the varied and intricate adverse weather degradations.
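The decomposition described above (edge detection for high-frequency features, SVD for low-frequency content) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the choice of a Sobel operator, the `rank` parameter, and both function names are assumptions, since the abstract only says "conventional edge detection" and "Singular Value Decomposition".

```python
import numpy as np


def low_frequency_svd(gray, rank=4):
    """Keep the top `rank` singular components as a low-frequency approximation."""
    u, s, vt = np.linalg.svd(gray, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def sobel_edges(gray):
    """Gradient magnitude from 3x3 Sobel kernels, with edge-replicated borders."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 cross-correlation one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)
```

A truncated SVD keeps the dominant, smoothly varying structure of the image, while the gradient magnitude responds only where intensity changes sharply, so the two outputs are complementary.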

Byeongjun Kwon, Munchurl Kim

Main category: cs.CV

TL;DR: PRO is a tile-based framework for high-resolution depth estimation, addressing inefficiency and discontinuity in patch-based methods while improving generalization.

Details

Motivation: Existing depth estimation models struggle with high-resolution images due to training-inference resolution mismatch, patch reassembly issues, and reliance on synthetic data.

Method: PRO uses Grouped Patch Consistency Training and Bias Free Masking to enhance efficiency and generalization.

Result: PRO improves depth estimation accuracy and efficiency, validated on multiple datasets.

Conclusion: PRO effectively integrates with existing models, offering a scalable solution for high-resolution depth estimation.

Abstract: Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches, resulting in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after training on synthetic data. Zero-shot evaluations on Booster, ETH3D, Middlebury 2014, and NuScenes demonstrate that our PRO can be seamlessly integrated into existing depth estimation models.
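The consistency loss on overlapping patch regions can be illustrated with a small NumPy sketch. It is a simplified stand-in under stated assumptions: the loss form (mean absolute difference), the function name, and the `offsets` convention are hypothetical, and the real method applies the loss to model outputs inside a training step rather than to fixed arrays.

```python
from itertools import combinations

import numpy as np


def overlap_consistency_loss(patches, offsets):
    """Mean absolute depth difference over all pairwise patch overlaps.

    patches: list of 2D arrays holding the predicted depth of each patch
    offsets: list of (row, col) top-left positions in the full image
    """
    total, count = 0.0, 0
    for (pa, (ra, ca)), (pb, (rb, cb)) in combinations(zip(patches, offsets), 2):
        # Intersection rectangle in full-image coordinates.
        top, left = max(ra, rb), max(ca, cb)
        bottom = min(ra + pa.shape[0], rb + pb.shape[0])
        right = min(ca + pa.shape[1], cb + pb.shape[1])
        if bottom <= top or right <= left:
            continue  # the two patches do not overlap
        a = pa[top - ra:bottom - ra, left - ca:right - ca]
        b = pb[top - rb:bottom - rb, left - cb:right - cb]
        total += np.abs(a - b).sum()
        count += a.size
    return total / count if count else 0.0
```

Four patches cropped from one consistent depth map give a loss of zero; any patch whose prediction drifts from its neighbours raises the loss, which is the signal that discourages seams at patch borders.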

[238] HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors

Xincheng Yao, Yijun Yang, Kangwei Guo, Ruiqiang Xiao, Haipeng Zhou, Haisu Tao, Jian Yang, Lei Zhu

Main category: cs.CV

TL;DR: The paper introduces a high-quality dataset for hepatic vasculature segmentation in surgical videos and proposes a novel network (HRVVS) that outperforms state-of-the-art methods.

Details

Motivation: The lack of appropriate datasets and the complexity of hepatic vasculature segmentation in surgical videos motivated the creation of a new dataset and method.

Method: The HRVVS network integrates a pretrained VAR model into a hierarchical encoder and uses a dynamic memory decoder to minimize redundant information while preserving details.

Result: HRVVS significantly outperforms existing methods on surgical video datasets.

Conclusion: The proposed HRVVS and dataset advance hepatic vasculature segmentation, with code and data made publicly available.

Abstract: The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, little research has been reported in this domain. To address this issue, we first introduce a high-quality frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we designed a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at https://github.com/scott-yjyang/HRVVS.

[239] Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion

Xingyu Hu, Junjun Jiang, Chenyang Wang, Kui Jiang, Xianming Liu, Jiayi Ma

Main category: cs.CV

TL;DR: The paper proposes ‘TITA’, a unified image fusion framework that balances task-invariant and task-specific features to improve performance and generalization across diverse fusion tasks.

Details

Motivation: Existing unified fusion frameworks often overlook task-specific characteristics, limiting performance and generalization to unseen tasks.

Method: TITA uses Interaction-enhanced Pixel Attention (IPA) for task-invariant interaction and Operation-based Adaptive Fusion (OAF) for task-specific adaptation, along with Fast Adaptive Multitask Optimization (FAMO) to handle gradient conflicts.

Result: TITA achieves competitive performance in three fusion scenarios and generalizes well to unseen tasks.

Conclusion: TITA effectively balances task-invariant and task-specific features, outperforming specialized methods and demonstrating strong generalization.

Abstract: Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality through a unified framework applicable to diverse fusion tasks. While treating all fusion tasks as a unified problem facilitates task-invariant knowledge sharing, it often overlooks task-specific characteristics, thereby limiting the overall performance. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. However, this dependence during inference restricts the model’s generalization to unseen fusion tasks. To address these issues, we propose a novel unified image fusion framework named “TITA”, which dynamically balances both Task-invariant Interaction and Task-specific Adaptation. For task-invariant interaction, we introduce the Interaction-enhanced Pixel Attention (IPA) module to enhance pixel-wise interactions for better multi-source complementary information extraction. For task-specific adaptation, the Operation-based Adaptive Fusion (OAF) module dynamically adjusts operation weights based on task properties. Additionally, we incorporate the Fast Adaptive Multitask Optimization (FAMO) strategy to mitigate the impact of gradient conflicts across tasks during joint training. Extensive experiments demonstrate that TITA not only achieves competitive performance compared to specialized methods across three image fusion scenarios but also exhibits strong generalization to unseen fusion tasks. The source codes are released at https://github.com/huxingyuabc/TITA.

[240] EP-Diffuser: An Efficient Diffusion Model for Traffic Scene Generation and Prediction via Polynomial Representations

Yue Yao, Mohamed-Khalil Bouzidi, Daniel Goehring, Joerg Reichardt

Main category: cs.CV

TL;DR: EP-Diffuser is a diffusion-based model for predicting diverse traffic scene evolutions, outperforming SotA models in accuracy and plausibility while being parameter-efficient.

Details

Motivation: Existing models focus on the most likely future, but safe autonomous driving requires covering plausible motion alternatives.

Method: EP-Diffuser, a diffusion-based generative model, predicts diverse scene continuations conditioned on road layout and agent history.

Result: Achieves high accuracy and plausibility on Argoverse 2 and shows robustness in OoD tests on Waymo Open dataset.

Conclusion: EP-Diffuser is a scalable, efficient solution for diverse traffic scene prediction, enhancing autonomous vehicle safety.

Abstract: As the prediction horizon increases, predicting the future evolution of traffic scenes becomes increasingly difficult due to the multi-modal nature of agent motion. Most state-of-the-art (SotA) prediction models primarily focus on forecasting the most likely future. However, for the safe operation of autonomous vehicles, it is equally important to cover the distribution for plausible motion alternatives. To address this, we introduce EP-Diffuser, a novel parameter-efficient diffusion-based generative model designed to capture the distribution of possible traffic scene evolutions. Conditioned on road layout and agent history, our model acts as a predictor and generates diverse, plausible scene continuations. We benchmark EP-Diffuser against two SotA models in terms of accuracy and plausibility of predictions on the Argoverse 2 dataset. Despite its significantly smaller model size, our approach achieves both highly accurate and plausible traffic scene predictions. We further evaluate model generalization ability in an out-of-distribution (OoD) test setting using Waymo Open dataset and show superior robustness of our approach. The code and model checkpoints are available at: https://github.com/continental/EP-Diffuser.

[241] KAN or MLP? Point Cloud Shows the Way Forward

Yan Shi, Qingdong He, Yijun Liu, Xiaoyu Liu, Jingyong Su

Main category: cs.CV

TL;DR: PointKAN applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis, improving feature representation and efficiency over MLPs.

Details

Motivation: MLPs struggle with capturing local geometric features in point clouds due to fixed activation functions, poor parameter efficiency, and redundancy.

Method: Introduces Geometric Affine Module (GAM) for robust feature transformation, Local Feature Processing (LFP) for parallel feature extraction, and Efficient-KANs to reduce parameters.

Result: Outperforms PointMLP on benchmarks (ModelNet40, ScanObjectNN, ShapeNetPart) with fewer parameters and lower computational cost.

Conclusion: Demonstrates KANs’ potential in 3D vision, offering a new direction for point cloud research.

Abstract: Multi-Layer Perceptrons (MLPs) have become one of the fundamental architectural components in point cloud analysis due to their effective feature learning mechanism. However, when processing complex geometric structures in point clouds, MLPs’ fixed activation functions struggle to efficiently capture local geometric features, while suffering from poor parameter efficiency and high model redundancy. In this paper, we propose PointKAN, which applies Kolmogorov-Arnold Networks (KANs) to point cloud analysis tasks to investigate their efficacy in hierarchical feature representation. First, we introduce a Geometric Affine Module (GAM) to transform local features, improving the model’s robustness to geometric variations. Next, in the Local Feature Processing (LFP), a parallel structure extracts both group-level features and global context, providing a rich representation of both fine details and overall structure. Finally, these features are combined and processed in the Global Feature Processing (GFP). By repeating these operations, the receptive field gradually expands, enabling the model to capture complete geometric information of the point cloud. To overcome the high parameter counts and computational inefficiency of standard KANs, we develop Efficient-KANs in the PointKAN-elite variant, which significantly reduces parameters while maintaining accuracy. Experimental results demonstrate that PointKAN outperforms PointMLP on benchmark datasets such as ModelNet40, ScanObjectNN, and ShapeNetPart, with particularly strong performance in the few-shot learning task. Additionally, PointKAN achieves substantial reductions in parameter counts and computational complexity (FLOPs). This work highlights the potential of KAN-based architectures in 3D vision and opens new avenues for research in point cloud understanding.

[242] Step1X-Edit: A Practical Framework for General Image Editing

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang

Main category: cs.CV

TL;DR: The paper introduces Step1X-Edit, an open-source image editing model that rivals closed-source models like GPT-4o and Gemini2 Flash, using Multimodal LLM and diffusion decoding.

Details

Motivation: To bridge the performance gap between open-source and closed-source image editing models by developing a state-of-the-art open alternative.

Method: Uses Multimodal LLM for processing images and instructions, extracts latent embeddings, and integrates them with a diffusion decoder. A high-quality dataset is generated for training, and GEdit-Bench is created for evaluation.

Result: Step1X-Edit surpasses open-source baselines and nears the performance of proprietary models on GEdit-Bench.

Conclusion: Step1X-Edit advances open-source image editing, offering competitive performance against leading closed-source models.

Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between open-source algorithms and these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user’s editing instruction. A latent embedding is extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.

[243] Sparfels: Fast Reconstruction from Sparse Unposed Imagery

Shubhendu Jena, Amine Ouasfi, Mae Younes, Adnane Boukhayma

Main category: cs.CV

TL;DR: A method for sparse view reconstruction using surface element splatting, achieving fast performance on consumer GPUs and state-of-the-art results in sparse, uncalibrated settings.

Details

Motivation: Addressing the underexplored challenge of shape recovery from sparse, noisy, or unposed camera views, where existing methods rely on data priors or external geometry.

Method: Leverages a 3D foundation model with task heads (point maps, camera initializations) to instantiate a bundle-adjusting 2D Gaussian Splatting model, guided by image correspondences. Introduces a novel splatted color variance formulation for efficient computation.

Result: Achieves state-of-the-art performance in sparse, uncalibrated reconstruction and novel view synthesis on multi-view datasets.

Conclusion: The proposed pipeline is efficient, simple, and effective for sparse view reconstruction, outperforming existing methods in challenging settings.

Abstract: We present a method for sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations to instantiate a bundle adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization amid 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting in reconstruction and novel view benchmarks based on established multi-view datasets.
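To make the idea of a color variance along a ray concrete, the sketch below computes a weighted variance of per-splat colors using standard front-to-back alpha compositing weights. This is only an assumed reading: the abstract does not give the paper's formulation, so the weighting scheme, the normalization by accumulated weight, the single-channel colors, and the function names are all hypothetical.

```python
def composite_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return weights


def color_variance_along_ray(alphas, colors):
    """Weighted variance E[c^2] - E[c]^2 of scalar colors along one ray."""
    w = composite_weights(alphas)
    total = sum(w)
    if total == 0:
        return 0.0  # nothing was hit along the ray
    mean = sum(wi * ci for wi, ci in zip(w, colors)) / total
    second = sum(wi * ci * ci for wi, ci in zip(w, colors)) / total
    return max(second - mean * mean, 0.0)  # clamp tiny negative round-off
```

Under this reading, the variance is zero when every contributing splat agrees on the color (or when one opaque splat dominates) and grows when splats at different depths disagree, which is why driving it down would favor a single well-defined surface.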

[244] BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

Main category: cs.CV

TL;DR: The paper introduces GenBuster-200K, a large-scale AI-generated video dataset, and BusterX, a detection framework combining MLLM and reinforcement learning for explainable AI-video detection.

Details

Motivation: Addressing the lack of large-scale, high-quality AI-generated video datasets and the need for explainable detection methods due to rising misinformation risks.

Method: Proposes GenBuster-200K (200K high-res video clips) and BusterX (MLLM + reinforcement learning) for detection and explanation.

Result: BusterX outperforms state-of-the-art methods, validated by extensive comparisons and ablation studies.

Conclusion: GenBuster-200K and BusterX fill critical gaps in AI-video detection, offering scalability and explainability.

Abstract: Advances in AI generative models facilitate super-realistic video synthesis, amplifying misinformation risks via social media and eroding trust in digital content. Several research works have explored new deepfake detection methods on AI-generated images to alleviate these risks. However, with the fast development of video generation models, such as Sora and WanX, there is currently a lack of large-scale, high-quality AI-generated video datasets for forgery detection. In addition, existing detection approaches predominantly treat the task as binary classification, lacking explainability in model decision-making and failing to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, diverse latest generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework leveraging multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationale. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios. BusterX is the first framework to integrate MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.

[245] Uncovering Cultural Representation Disparities in Vision-Language Models

Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Srishti Yadav, Suman Debnath, Alejandro Salamanca, Desmond Elliott

Main category: cs.CV

TL;DR: The paper investigates cultural biases in Vision-Language Models (VLMs) using the Country211 dataset, revealing performance disparities across countries and question formats.

Details

Motivation: Concerns about potential biases in VLMs prompted an evaluation of their cultural biases through a country identification task.

Method: Evaluated VLMs on the Country211 dataset using open-ended and multiple-choice questions, including multilingual and adversarial settings.

Result: Significant performance variations were found, indicating biases inherited from pre-training data.

Conclusion: VLMs exhibit cultural biases influenced by training data, impacting their generalization across diverse global contexts.

Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities across a range of tasks, yet concerns about their potential biases exist. This work investigates the extent to which prominent VLMs exhibit cultural biases by evaluating their performance on an image-based country identification task at a country level. Utilizing the geographically diverse Country211 dataset, we probe several large vision language models (VLMs) under various prompting strategies: open-ended questions, multiple-choice questions (MCQs) including challenging setups like multilingual and adversarial settings. Our analysis aims to uncover disparities in model accuracy across different countries and question formats, providing insights into how training data distribution and evaluation methodologies might influence cultural biases in VLMs. The findings highlight significant variations in performance, suggesting that while VLMs possess considerable visual understanding, they inherit biases from their pre-training data and scale that impact their ability to generalize uniformly across diverse global contexts.
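The disparity analysis described above boils down to grouping predictions by the true country and comparing accuracies. The sketch below shows one simple way to do that; the record format, the function names, and the max-minus-min gap statistic are illustrative assumptions, not the paper's evaluation code.

```python
from collections import defaultdict


def per_country_accuracy(records):
    """Accuracy per true country from (true_country, predicted_country) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for true_country, predicted in records:
        totals[true_country] += 1
        hits[true_country] += int(predicted == true_country)
    return {country: hits[country] / totals[country] for country in totals}


def accuracy_gap(accuracy):
    """Spread between the best- and worst-served countries."""
    return max(accuracy.values()) - min(accuracy.values())
```

Running the same computation separately for each question format (open-ended, multiple-choice, multilingual, adversarial) would show whether the gap is stable or depends on how the model is prompted.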

[246] YOLO-FireAD: Efficient Fire Detection via Attention-Guided Inverted Residual Learning and Dual-Pooling Feature Preservation

Weichao Pan, Bohan Xu, Xu Wang, Chengze Lv, Shuoyang Wang, Zhenke Duan

Main category: cs.CV

TL;DR: YOLO-FireAD improves fire detection with attention-guided inverted residuals and dual-pooling downscale fusion, achieving higher accuracy and efficiency than YOLOv8 variants.

Details

Motivation: Addressing feature extraction limitations and information loss in existing YOLO-based models for fire detection in dynamic environments.

Method: Proposes YOLO-FireAD with Attention-guided Inverted Residual Block (AIR) and Dual Pool Downscale Fusion Block (DPDF) to enhance fire features and preserve multi-scale patterns.

Result: Outperforms YOLOv8 variants with 1.3-5.5% higher mAP75, fewer parameters (1.45M), and lower computational cost (4.6G).

Conclusion: YOLO-FireAD is efficient and accurate for fire detection, suitable for dynamic environments.

Abstract: Fire detection in dynamic environments faces continuous challenges, including interference from illumination changes, frequent false or missed detections, and the difficulty of achieving both efficiency and accuracy. To address the feature extraction limitations and information loss in existing YOLO-based models, this study proposes You Only Look Once for Fire Detection with Attention-guided Inverted Residual and Dual-pooling Downscale Fusion (YOLO-FireAD) with two core innovations: (1) Attention-guided Inverted Residual Block (AIR) integrates hybrid channel-spatial attention with inverted residuals to adaptively enhance fire features and suppress environmental noise; (2) Dual Pool Downscale Fusion Block (DPDF) preserves multi-scale fire patterns through learnable fusion of max-average pooling outputs, mitigating small-fire detection failures. Extensive evaluation on two public datasets shows the efficient performance of our model. Our proposed model keeps both the number of parameters (1.45M, 51.8% lower than YOLOv8n) and the computational cost (4.6G, 43.2% lower than YOLOv8n) low, and its mAP75 is 1.3-5.5% higher than that of the mainstream real-time object detection models YOLOv8n, YOLOv9t, YOLOv10n, YOLO11n, YOLOv12n and other YOLOv8 variants. For more details, please visit our repository: https://github.com/JEFfersusu/YOLO-FireAD
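The "learnable fusion of max-average pooling outputs" can be pictured as a weighted blend of two 2x2 downsampling results. The sketch below is a deliberately simplified stand-in: it uses a single fixed scalar `alpha` where a real block would learn the fusion, and the function names and the 2x2 window are assumptions not stated in the abstract.

```python
def pool2x2(fmap, reduce):
    """Downsample a 2D feature map by applying `reduce` to each 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            reduce([fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1]])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def dual_pool_fusion(fmap, alpha):
    """Blend max- and average-pooled maps: alpha * max + (1 - alpha) * avg."""
    pooled_max = pool2x2(fmap, max)
    pooled_avg = pool2x2(fmap, lambda v: sum(v) / len(v))
    return [
        [alpha * m + (1 - alpha) * a for m, a in zip(row_max, row_avg)]
        for row_max, row_avg in zip(pooled_max, pooled_avg)
    ]
```

Max pooling keeps a small, bright response (such as a distant flame) that averaging would dilute, while average pooling keeps the surrounding context, so blending the two retains both kinds of evidence after downscaling.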

[247] EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models

Feng Jiang, Zihao Zheng, Xiuping Cui, Maoliang Li, JIayu Chen, Xiang Chen

Main category: cs.CV

TL;DR: EaqVLA is a framework optimizing Vision-Language-Action (VLA) models via encoding-aligned quantization, addressing token alignment issues and improving performance.

Details

Motivation: Existing VLA models face high computing/storage costs, and token alignment issues hinder quantization.

Method: Proposes EaqVLA, using encoding-aligned quantization with mixed precision, guided by granularity analysis.

Result: EaqVLA achieves minimal quantization loss for end-to-end action control and faster inference than existing quantization methods; the speedup factor is not stated.

Conclusion: EaqVLA effectively optimizes VLA models, balancing cost and performance.

Abstract: With the development of Embodied Artificial Intelligence, end-to-end control policies such as the Vision-Language-Action (VLA) model have become mainstream. Existing VLA models face expensive computing/storage costs, which need to be optimized. Quantization is considered the most effective method, as it can not only reduce the memory cost but also achieve computation acceleration. However, we find that the token alignment of VLA models hinders the application of existing quantization methods. To address this, we propose an optimized framework called EaqVLA, which applies encoding-aligned quantization to VLA models. Specifically, we propose a complete analysis method to find the misalignment at various granularities. Based on the analysis results, we propose a mixed precision quantization with the awareness of encoding alignment. Experiments show that the proposed EaqVLA achieves better quantization performance (with the minimal quantization loss for end-to-end action control and xxx times acceleration) than existing quantization methods.
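To make the precision trade-off behind mixed precision quantization concrete, the sketch below applies plain symmetric uniform quantization at two bit widths and compares the worst-case rounding error. This is generic quantization, not EaqVLA's encoding-aligned scheme; the `quantize` and `max_error` helpers and the sample weights are illustrative assumptions.

```python
def quantize(values, bits):
    """Symmetric uniform quantization: snap values to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax   # one scale shared by the whole tensor
    if scale == 0.0:
        return list(values)                      # all-zero input needs no quantization
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_error(values, bits):
    """Largest absolute difference between the original and dequantized values."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


weights = [0.03, -0.41, 0.27, 0.9, -0.66]        # made-up weights for illustration
err4 = max_error(weights, 4)
err8 = max_error(weights, 8)
```

A mixed precision scheme would keep the higher bit width only where the extra accuracy matters and use the lower one elsewhere; how EaqVLA decides that from its alignment analysis is not described in the abstract.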

[248] DisTime: Distribution-based Time Representation for Video Large Language Models

Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu

Main category: cs.CV

TL;DR: DisTime enhances Video-LLMs’ temporal localization with continuous embeddings and a Distribution-based Time Decoder, creating InternVid-TG, a large dataset, and achieving top performance in time-sensitive tasks.

Details

Motivation: Video-LLMs struggle with precise temporal localization due to discrete time representations and limited datasets.

Method: DisTime uses learnable tokens for continuous temporal embeddings and a Distribution-based Time Decoder. It also introduces InternVid-TG, a large dataset with automated annotations.

Result: DisTime outperforms benchmarks in time-sensitive tasks and maintains competitive Video QA performance.

Conclusion: DisTime effectively improves temporal understanding in Video-LLMs and sets a new standard with its dataset and framework.

Abstract: Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
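One simple way to read a timestamp out of a temporal probability distribution is to take its expectation over time bins. The sketch below assumes time normalized to [0, 1], equal-width bins, and a softmax over raw scores; the bin layout and the `decode_time` helper are illustrative assumptions, not DisTime's actual decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_time(scores, duration):
    """Turn per-bin scores into a timestamp (seconds) via the distribution's mean."""
    probs = softmax(scores)
    n_bins = len(probs)
    centers = [(i + 0.5) / n_bins for i in range(n_bins)]  # normalized bin centers
    expected = sum(p * c for p, c in zip(probs, centers))
    return expected * duration


# Two equally likely neighbouring bins: the estimate lands on their shared boundary.
timestamp = decode_time([0.0, 5.0, 5.0, 0.0], duration=60.0)
```

Because the output is a weighted mean rather than the index of a single bin, the estimate can fall between bin centers, which is one intuition for how a distribution can represent continuous time and soften boundary ambiguity.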

[249] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

Geonu Lee, Yujeong Oh, Geonhui Jang, Soyoung Lee, Jeonghyo Song, Sungmin Cha, YoungJoon Yoo

Main category: cs.CV

TL;DR: A new benchmark, Continual-MEGA, is introduced for continual learning in anomaly detection, featuring a diverse dataset and a novel zero-shot generalization scenario. The proposed method outperforms existing approaches, and the ContinualAD dataset boosts model performance.

Details

Motivation: To better reflect real-world deployment scenarios in continual learning for anomaly detection, addressing gaps in existing benchmarks and evaluation settings.

Method: Introduces Continual-MEGA benchmark with a large, diverse dataset (including ContinualAD) and a novel zero-shot generalization scenario. Proposes a unified baseline algorithm for robustness and generalization.

Result: Key findings: (1) Existing methods need improvement, especially in pixel-level defect localization; (2) Proposed method outperforms prior approaches; (3) ContinualAD dataset enhances model performance.

Conclusion: The Continual-MEGA benchmark and proposed method advance continual learning in anomaly detection, with released code and dataset for broader use.

Abstract: In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, those not observed during continual adaptation. This poses a new problem setting in which continual adaptation is expected to also enhance zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code at https://github.com/Continual-Mega/Continual-Mega.

[250] Comparative Performance of Finetuned ImageNet Pre-trained Models for Electronic Component Classification

Yidi Shao, Longfei Zhou, Fangshuo Tang, Xinyi Shi, Dalang Chen, Shengtao Xia

Main category: cs.CV

TL;DR: The paper compares 12 ImageNet pre-trained models for electronic component classification, finding all effective, with MobileNet-V2 achieving the highest accuracy (99.95%) and EfficientNet-B0 the lowest (92.26%).

Details

Motivation: To evaluate the effectiveness of ImageNet pre-trained models in reducing labor costs and advancing technology in electronic component classification.

Method: Comparison of twelve ImageNet pre-trained models for classifying electronic components.

Result: All models performed well, with MobileNet-V2 (99.95%) and EfficientNet-B0 (92.26%) as the highest and lowest performers, respectively.

Conclusion: ImageNet pre-trained models are highly effective for electronic component classification, validating their practical use in manufacturing.

Abstract: Electronic component classification and detection are crucial in manufacturing industries, significantly reducing labor costs and promoting technological and industrial development. Pre-trained models, especially those trained on ImageNet, are highly effective in image classification, allowing researchers to achieve excellent results even with limited data. This paper compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Our findings show that all models tested delivered respectable accuracies. MobileNet-V2 recorded the highest at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. These results underscore the substantial benefits of using ImageNet pre-trained models in image classification tasks and confirm the practical applicability of these methods in the electronics manufacturing sector.

[251] ZIP: Scalable Crowd Counting via Zero-Inflated Poisson Modeling

Yiming Ma, Victor Sanchez, Tanaya Guha

Main category: cs.CV

TL;DR: ZIP, a crowd counting framework using Zero-Inflated Poisson likelihood, outperforms MSE-based methods by addressing spatial sparsity and discrete count data issues.

Details

Motivation: Current MSE-based methods dilute supervision signals due to spatial sparsity and mismatch with discrete count data.

Method: ZIP models blockwise counts with Zero-Inflated Poisson likelihood, handling excess zeros and respecting discreteness.

Result: ZIP surpasses state-of-the-art methods across benchmarks (ShanghaiTech, UCF-QNRF, NWPU-Crowd) and model scales.

Conclusion: ZIP is a scalable and superior alternative to MSE-based crowd counting methods.

Abstract: Most crowd counting methods directly regress blockwise density maps using Mean Squared Error (MSE) losses. This practice has two key limitations: (1) it fails to account for the extreme spatial sparsity of annotations - over 95% of 8x8 blocks are empty across standard benchmarks, so supervision signals in informative regions are diluted by the predominant zeros; (2) MSE corresponds to a Gaussian error model that poorly matches discrete, non-negative count data. To address these issues, we introduce ZIP, a scalable crowd counting framework that models blockwise counts with a Zero-Inflated Poisson likelihood: a zero-inflation term learns the probability a block is structurally empty (handling excess zeros), while the Poisson component captures expected counts when people are present (respecting discreteness). We provide a generalization analysis showing a tighter risk bound for ZIP than MSE-based losses and DMCount provided that the training resolution is moderately large. To assess the scalability of ZIP, we instantiate it on backbones spanning over 100x in parameters/compute. Experiments on ShanghaiTech A & B, UCF-QNRF, and NWPU-Crowd demonstrate that ZIP consistently surpasses state-of-the-art methods across all model scales.
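The Zero-Inflated Poisson likelihood named in the abstract has a standard closed form, sketched here for a single block. With zero-inflation probability `pi` and Poisson rate `lam`, P(0) = pi + (1 - pi) * exp(-lam) and P(k) = (1 - pi) * lam^k * exp(-lam) / k! for k > 0. The function name `zip_nll` and the sample values are illustrative; how the paper's network predicts `pi` and `lam` per block is not shown.

```python
import math


def zip_nll(count, pi, lam):
    """Negative log-likelihood of one block count under a Zero-Inflated Poisson.

    pi  : probability that the block is structurally empty, 0 <= pi < 1
    lam : Poisson rate (expected count when people may be present), lam > 0
    """
    if count == 0:
        # A zero is either structural (pi) or a Poisson zero ((1 - pi) * exp(-lam)).
        return -math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Positive counts can only come from the Poisson component.
    log_poisson = count * math.log(lam) - lam - math.lgamma(count + 1)
    return -(math.log(1.0 - pi) + log_poisson)


# An empty block is cheap to explain once the zero-inflation term is confident.
loss_empty_zip = zip_nll(0, pi=0.9, lam=2.0)
loss_empty_poisson = zip_nll(0, pi=0.0, lam=2.0)  # pi = 0 reduces to plain Poisson
```

With these values the plain Poisson loss on an empty block equals `lam`, while the zero-inflated loss is far smaller, so the many empty blocks no longer pull the rate toward zero.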

[252] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu

Main category: cs.CV

TL;DR: Dynamic Logits Calibration (DLC) is introduced to reduce hallucinations in LVLMs by dynamically aligning text generation with visual evidence during decoding, outperforming existing methods.

Details

Motivation: LVLMs often generate text contradicting visual input (hallucination), and current training-free decoding strategies have limitations like static constraints and inefficiency.

Method: DLC uses CLIP to assess semantic alignment between images and generated text, evaluates Relative Visual Advantage (RVA) of tokens, and adjusts logits adaptively. An adaptive weighting mechanism balances visual guidance and text quality.

Result: DLC significantly reduces hallucinations across benchmarks and LVLM architectures (e.g., LLaVA, InstructBLIP, MiniGPT-4) while maintaining high inference efficiency.

Conclusion: DLC is an effective, efficient decoding-time solution to mitigate hallucinations, enhancing LVLM reliability for practical use.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At the decoding phase, DLC step-wise employs CLIP to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs for practical use. Code will be released on GitHub.
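The idea of shifting logits by a token's advantage over a baseline can be sketched in a few lines. The abstract does not give DLC's formulas, so everything here is an assumption for illustration: the additive update, the fixed `strength` (DLC describes an adaptive weight instead), and the hand-written similarity scores standing in for CLIP outputs.

```python
def calibrate_logits(logits, candidate_scores, baseline_score, strength):
    """Shift each candidate's logit by its visual advantage over the baseline.

    logits           : token -> raw language-model logit
    candidate_scores : token -> image-text similarity if that token is appended
    baseline_score   : similarity of the text generated so far
    strength         : how strongly visual evidence may move the logits (fixed here)
    """
    return {
        token: logit + strength * (candidate_scores[token] - baseline_score)
        for token, logit in logits.items()
    }


logits = {"dog": 2.0, "cat": 2.2}          # the language model slightly prefers "cat"
similarity = {"dog": 0.31, "cat": 0.22}    # made-up similarity scores
adjusted = calibrate_logits(logits, similarity, baseline_score=0.25, strength=10.0)
best_token = max(adjusted, key=adjusted.get)
```

In this toy case the token that raises image-text similarity above the baseline overtakes the one the language model preferred, and the adjustment needs only scores for the candidate tokens rather than an extra forward pass of the language model.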

[253] DeepShade: Enable Shade Simulation by Text-conditioned Image Generation

Longchao Da, Xiangrui Liu, Mithun Shivakoti, Thirulogasankar Pranav Kutralingam, Yezhou Yang, Hua Wei

Main category: cs.CV

TL;DR: The paper introduces DeepShade, a diffusion-based model for synthesizing shade variations, addressing the lack of shade data in routing systems. It uses a Blender-based dataset and improves shade prediction for urban planning in extreme heat.

Details

Motivation: Heatwaves threaten public health, but current routing systems lack shade information due to noisy satellite data and limited training data. This work aims to fill this gap.

Method: 1. Build a dataset using Blender simulations for diverse urban layouts and solar angles. 2. Propose DeepShade, a diffusion model combining RGB and edge features with contrastive learning.

Result: DeepShade improves shade image generation, demonstrated by calculating shade ratios for route planning in Tempe, Arizona.

Conclusion: The work benefits urban planning in extreme heat and has practical environmental applications.

Abstract: Heatwaves pose a significant threat to public health, especially as global warming intensifies. However, current routing systems (e.g., online maps) fail to incorporate shade information due to the difficulty of estimating shade directly from noisy satellite imagery and the limited availability of training data for generative models. In this paper, we address these challenges through two main contributions. First, we build an extensive dataset covering diverse longitude-latitude regions, varying levels of building density, and different urban layouts. Leveraging Blender-based 3D simulations alongside building outlines, we capture building shadows under various solar zenith angles throughout the year and at different times of day. These simulated shadows are aligned with satellite images, providing a rich resource for learning shade patterns. Second, we propose DeepShade, a diffusion-based model designed to learn and synthesize shade variations over time. It emphasizes the nuance of edge features by jointly considering RGB with the Canny edge layer, and incorporates contrastive learning to capture the temporal change rules of shade. Then, by conditioning on textual descriptions of known conditions (e.g., time of day, solar angles), our framework provides improved performance in generating shade images. We demonstrate the utility of our approach by using our shade predictions to calculate shade ratios for real-world route planning in Tempe, Arizona. We believe this work will benefit society by providing a reference for urban planning in extreme heat and by pointing to potential practical applications in the environment.
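A shade ratio for a route can be computed once a shade image is available. The sketch below assumes a binary shade mask (1 = shaded, 0 = sunlit) and a route given as a list of (row, column) pixels; this representation and the `shade_ratio` helper are assumptions for illustration, not the paper's pipeline.

```python
def shade_ratio(shade_mask, route):
    """Fraction of route cells that fall on shaded pixels (1 = shaded, 0 = sunlit)."""
    if not route:
        raise ValueError("route must contain at least one cell")
    shaded = sum(shade_mask[r][c] for r, c in route)
    return shaded / len(route)


# Toy 3x4 shade mask; in practice this would come from a predicted shade image.
mask = [
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 1],
]
route_a = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (2, 3)]  # follows the shadow
route_b = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]  # cuts across open ground

ratio_a = shade_ratio(mask, route_a)
ratio_b = shade_ratio(mask, route_b)
```

Both routes have the same length here, so a planner that prefers shade would choose `route_a`; a real system would likely weigh the ratio against distance and recompute it as shadows move through the day.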

[254] Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Utkarsh Shandilya, Marsha Mariya Kappan, Sanyam Jain, Vijeta Sharma

Main category: cs.CV

TL;DR: The paper evaluates CLIP for human action recognition in healthcare, revealing its limitations under masking strategies and proposing a noise-based enhancement to improve accuracy and reduce bias.

Details

Motivation: Human action recognition is vital in healthcare, but traditional models struggle with generalization. Vision-language models like CLIP offer potential solutions.

Method: CLIP is tested on UCF-101 with three masking strategies: black masking, feature-specific masking, and isolation masking. A noise-based enhancement is proposed to address limitations.

Result: CLIP shows inconsistent performance under masking, but the proposed noise method improves accuracy and reduces bias.

Conclusion: Challenges remain for clinical applications, and future work should focus on improving generalizability in healthcare scenarios.

Abstract: Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.
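Percentage-based black masking can be illustrated with a small helper that blacks out a fixed share of an image. The sketch assumes a grayscale frame stored as nested lists and masks whole rows from the top; the paper's mask shapes and placement are not specified in the abstract, so `black_mask` is a hypothetical stand-in.

```python
def black_mask(image, fraction):
    """Return a copy of a grayscale image with the top `fraction` of rows set to 0 (black)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    masked_rows = round(fraction * len(image))
    return [
        [0] * len(row) if r < masked_rows else list(row)
        for r, row in enumerate(image)
    ]


# A 10x10 all-white frame masked at the three levels used in the study.
frame = [[255] * 10 for _ in range(10)]
variants = {level: black_mask(frame, level) for level in (0.1, 0.3, 0.5)}
black_counts = {
    level: sum(pixel == 0 for row in masked for pixel in row)
    for level, masked in variants.items()
}
```

Each variant would then be passed to the classifier in place of the original frame, so that accuracy can be compared across the 10%, 30%, and 50% masking levels.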

[255] Butter: Frequency Consistency and Hierarchical Fusion for Autonomous Driving Object Detection

Xiaojian Lin, Wenxin Zhang, Yuchu Jiang, Wangyu Wu, Yiran Guo, Kangxu Wang, Zongzheng Zhang, Guijin Wang, Lei Jin, Hao Zhao

Main category: cs.CV

TL;DR: Butter is a novel object detection framework for autonomous driving, enhancing hierarchical feature representations with FAFCE and PHFFNet modules, improving accuracy and efficiency.

Details

Motivation: Existing architectures like YOLO and DETR struggle with feature consistency across scales and balancing precision with computational efficiency in dynamic environments.

Method: Butter introduces FAFCE for adaptive frequency filtering to refine feature consistency and PHFFNet for progressive hierarchical feature fusion.

Result: Experiments on BDD100K, KITTI, and Cityscapes show Butter improves detection accuracy while reducing model complexity.

Conclusion: Butter offers a balanced solution for real-time autonomous driving, advancing hierarchical feature refinement and integration.

Abstract: Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

Main category: cs.CV

TL;DR: BusterX++ is a novel framework for cross-modal detection of synthetic media, using RL post-training and hybrid reasoning to outperform single-modality methods.

Details

Motivation: The rise of generative AI has increased misinformation risks, but current detection systems are limited by single-modality designs.

Method: BusterX++ employs reinforcement learning post-training, multi-stage training, thinking rewards, and hybrid reasoning.

Result: The framework shows stable and substantial performance improvements, validated by the GenBuster++ benchmark.

Conclusion: BusterX++ effectively addresses cross-modal synthetic media detection, offering enhanced transparency and interpretability.

Abstract: Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.

[257] MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image

DongFu Yin, Xiaotian Chen, Fei Richard Yu, Xuanchen Li, Xinhao Zhang

Main category: cs.CV

TL;DR: MVG4D is a novel framework for generating high-fidelity, temporally consistent 4D content from a single image using multi-view synthesis and 4D Gaussian Splatting.

Details

Motivation: Producing high-quality dynamic 4D content remains challenging due to issues like motion discontinuity and background degradation.

Method: MVG4D combines multi-view synthesis with 4D Gaussian Splatting, using an image matrix module for coherent multi-view images and a deformation network for temporal extension.

Result: MVG4D outperforms state-of-the-art methods in metrics like CLIP-I, PSNR, and FVD, reducing flickering and enhancing details.

Conclusion: MVG4D advances efficient and controllable 4D generation, improving AR/VR experiences.

Abstract: Advances in generative modeling have significantly enhanced digital content creation, extending from 2D images to complex 3D and 4D scenes. Despite substantial progress, producing high-fidelity and temporally consistent dynamic 4D content remains a challenge. In this paper, we propose MVG4D, a novel framework that generates dynamic 4D content from a single still image by combining multi-view synthesis with 4D Gaussian Splatting (4D GS). At its core, MVG4D employs an image matrix module that synthesizes temporally coherent and spatially diverse multi-view images, providing rich supervisory signals for downstream 3D and 4D reconstruction. These multi-view images are used to optimize a 3D Gaussian point cloud, which is further extended into the temporal domain via a lightweight deformation network. Our method effectively enhances temporal consistency, geometric fidelity, and visual realism, addressing key challenges in motion discontinuity and background degradation that affect prior 4D GS-based methods. Extensive experiments on the Objaverse dataset demonstrate that MVG4D outperforms state-of-the-art baselines in CLIP-I, PSNR, FVD, and time efficiency. Notably, it reduces flickering artifacts and sharpens structural details across views and time, enabling more immersive AR/VR experiences. MVG4D sets a new direction for efficient and controllable 4D generation from minimal inputs.

[258] Exemplar Med-DETR: Toward Generalized and Robust Lesion Detection in Mammogram Images and beyond

Sheethal Bhat, Bogdan Georgescu, Adarsh Bhandary Panambur, Mathias Zinnen, Tri-Thien Nguyen, Awais Mansoor, Karim Khalifa Elbarbary, Siming Bayer, Florin-Cristian Ghesu, Sasa Grbic, Andreas Maier

Main category: cs.CV

TL;DR: Exemplar Med-DETR, a multi-modal contrastive detector, improves abnormality detection in medical images by leveraging class-specific exemplar features and cross-attention, achieving state-of-the-art results across diverse datasets.

Details

Motivation: Existing methods struggle with learning effective class-specific features for abnormality detection in medical images, especially in dense tissues like mammograms.

Method: The paper introduces Exemplar Med-DETR, which uses cross-attention with intuitive class-specific exemplar features and an iterative training strategy.

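Result: Achieves mAP of 0.7 for mass detection and 0.55 for calcifications on dense breast mammograms (a 16 percentage point absolute improvement), and mAP of 0.25 for mass detection in chest X-rays and 0.37 for stenosis detection in angiography (gains of 4 and 7 percentage points, respectively).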

Conclusion: Exemplar Med-DETR demonstrates robust and generalizable performance, advancing medical imaging detection systems.

Abstract: Detecting abnormalities in medical images poses unique challenges due to differences in feature representations and the intricate relationship between anatomical structures and abnormalities. This is especially evident in mammography, where dense breast tissue can obscure lesions, complicating radiological interpretation. Despite leveraging anatomical and semantic context, existing detection methods struggle to learn effective class-specific features, limiting their applicability across different tasks and imaging modalities. In this work, we introduce Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection. It employs cross-attention with inherently derived, intuitive class-specific exemplar features and is trained with an iterative strategy. We achieve state-of-the-art performance across three distinct imaging modalities from four public datasets. On Vietnamese dense breast mammograms, we attain an mAP of 0.7 for mass detection and 0.55 for calcifications, yielding an absolute improvement of 16 percentage points. Additionally, a radiologist-supported evaluation of 100 mammograms from an out-of-distribution Chinese cohort demonstrates a twofold gain in lesion detection performance. For chest X-rays and angiography, we achieve an mAP of 0.25 for mass and 0.37 for stenosis detection, improving results by 4 and 7 percentage points, respectively. These results highlight the potential of our approach to advance robust and generalizable detection systems for medical imaging.

[259] Dual-Stream Global-Local Feature Collaborative Representation Network for Scene Classification of Mining Area

Shuqi Fan, Haoyi Wang, Xianju Li

Main category: cs.CV

TL;DR: A dual-branch fusion model for mining area scene classification, combining global and local features, achieves 83.63% accuracy, outperforming other models.

Details

Motivation: To improve geological monitoring and resource planning by accurately classifying complex mining landscapes.

Method: Proposes a dual-branch model with a Multi-scale Global Transformer Branch, Local Enhancement Collaborative Representation Branch, and Dual-Branch Deep Feature Fusion Module, using multi-loss computation.

Result: Achieves 83.63% accuracy, surpassing other models in performance metrics.

Conclusion: The model effectively integrates multi-scale and local features, enhancing classification accuracy for mining scenes.

Abstract: Scene classification of mining areas provides accurate foundational data for geological environment monitoring and resource development planning. This study fuses multi-source data to construct a multi-modal mine land cover scene classification dataset. A significant challenge in mining area classification lies in the complex spatial layout and multi-scale characteristics. By extracting global and local features, it becomes possible to comprehensively reflect the spatial distribution, thereby enabling a more accurate capture of the holistic characteristics of mining scenes. We propose a dual-branch fusion model utilizing collaborative representation to decompose global features into a set of key semantic vectors. This model comprises three key components: (1) Multi-scale Global Transformer Branch: It leverages adjacent large-scale features to generate global channel attention features for small-scale features, effectively capturing the multi-scale feature relationships. (2) Local Enhancement Collaborative Representation Branch: It refines the attention weights by leveraging local features and reconstructed key semantic sets, ensuring that the local context and detailed characteristics of the mining area are effectively integrated. This enhances the model’s sensitivity to fine-grained spatial variations. (3) Dual-Branch Deep Feature Fusion Module: It fuses the complementary features of the two branches to incorporate more scene information. This fusion strengthens the model’s ability to distinguish and classify complex mining landscapes. Finally, this study employs multi-loss computation to ensure a balanced integration of the modules. The overall accuracy of this model is 83.63%, which outperforms other comparative models. Additionally, it achieves the best performance across all other evaluation metrics.

[260] Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach

Yanming Xiu, Maria Gorlatova

Main category: cs.CV

TL;DR: The paper introduces a taxonomy for visual information manipulation (VIM) attacks in AR, creates a dataset (AR-VIM), and proposes a detection framework (VIM-Sense) combining vision-language models and OCR, achieving high accuracy and low latency.

Details

Motivation: AR virtual content can mislead users through subtle manipulations, necessitating methods to detect and mitigate such attacks.

Method: A taxonomy categorizes VIM attacks into formats (character, phrase, pattern) and purposes (replacement, obfuscation, extra wrong info). A dataset (AR-VIM) is built, and VIM-Sense, a multimodal framework, is proposed for detection.

Result: VIM-Sense achieves 88.94% accuracy on AR-VIM, outperforming baselines, with detection latencies of ~7 seconds in simulated and real-world tests.

Conclusion: The work provides a robust solution for detecting VIM attacks in AR, demonstrating the effectiveness of multimodal semantic reasoning.

Abstract: The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM. It consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect such attacks, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system reaches an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.

[261] Indian Sign Language Detection for Real-Time Translation using Machine Learning

Rajat Singhal, Jatin Gupta, Akhil Sharma, Anushka Gupta, Navya Sharma

Main category: cs.CV

TL;DR: The paper proposes a real-time Indian Sign Language (ISL) detection and translation system using a CNN, achieving 99.95% accuracy to bridge communication gaps for the deaf and hard-of-hearing in India.

Details

Motivation: Addressing the scarcity of skilled interpreters and accessible translation technologies for ISL, which hinders effective communication for deaf and mute communities in India.

Method: A Convolutional Neural Network (CNN) trained on a comprehensive ISL dataset, integrated with MediaPipe for hand tracking and motion detection.

Result: The model achieves 99.95% classification accuracy, demonstrating high precision in discerning nuanced visual features of signs.

Conclusion: The system offers a reliable, real-time solution for ISL translation, enhancing communication accessibility for deaf and mute communities in India.

Abstract: Sign languages, which rely on visual-spatial patterns of hand gestures and body movements, are the primary mode of communication for deaf and mute communities worldwide. Effective communication is fundamental to human interaction, yet individuals in these communities often face significant barriers due to a scarcity of skilled interpreters and accessible translation technologies. This research specifically addresses these challenges within the Indian context by focusing on Indian Sign Language (ISL). By leveraging machine learning, this study aims to bridge the critical communication gap for the deaf and hard-of-hearing population in India, where technological solutions for ISL are less developed compared to other global sign languages. We propose a robust, real-time ISL detection and translation system built upon a Convolutional Neural Network (CNN). Our model is trained on a comprehensive ISL dataset and demonstrates exceptional performance, achieving a classification accuracy of 99.95%. This high precision underscores the model’s capability to discern the nuanced visual features of different signs. The system’s effectiveness is rigorously evaluated using key performance metrics, including accuracy, F1 score, precision, and recall, ensuring its reliability for real-world applications. For real-time implementation, the framework integrates MediaPipe for precise hand tracking and motion detection, enabling seamless translation of dynamic gestures. This paper provides a detailed account of the model’s architecture, the data preprocessing pipeline, and the classification methodology.

[262] Priority-Aware Clinical Pathology Hierarchy Training for Multiple Instance Learning

Sungrae Hong, Kyungeun Kim, Juhyeon Kim, Sol Lee, Jisu Shin, Chanjae Song, Mun Yong Yi

Main category: cs.CV

TL;DR: A new MIL method addresses priority issues in clinical diagnosis by using vertical and horizontal hierarchies, improving accuracy and symptom prioritization.

Details

Motivation: Existing MIL approaches in clinical settings fail to address priority among pathological symptoms and diagnostic classes, leading to misdiagnosis.

Method: Proposes a method with vertical inter-hierarchy and horizontal intra-hierarchy to align MIL predictions and prioritize clinically serious classes.

Result: Experiments show reduced misdiagnosis and better symptom prioritization in multiclass scenarios.

Conclusion: The method effectively improves clinical MIL tasks by addressing priority issues and validating predictions against complex cases.

Abstract: Multiple Instance Learning (MIL) is increasingly being used as a support tool within clinical settings for pathological diagnosis decisions, achieving high performance and removing the annotation burden. However, existing approaches for clinical MIL tasks have not adequately addressed the priority issues that exist in relation to pathological symptoms and diagnostic classes, causing MIL models to ignore priority among classes. To overcome this clinical limitation of MIL, we propose a new method that addresses priority issues using two hierarchies: vertical inter-hierarchy and horizontal intra-hierarchy. The proposed method aligns MIL predictions across each hierarchical level and employs an implicit feature re-usability during training to facilitate clinically more serious classes within the same level. Experiments with real-world patient data show that the proposed method effectively reduces misdiagnosis and prioritizes more important symptoms in multiclass scenarios. Further analysis verifies the efficacy of the proposed components and qualitatively confirms the MIL predictions against challenging cases with multiple symptoms.

[263] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc

Main category: cs.CV

TL;DR: CoP introduces a multi-task learning framework for BEV 3D object detection, leveraging spatial occupancy to improve feature refinement and spatial representation, outperforming existing methods.

Details

Motivation: Existing BEV methods neglect environmental contexts like roads, limiting comprehensive perception. CoP addresses this by integrating spatial occupancy for better feature refinement.

Method: CoP uses dense occupancy ground truths (LDO), voxel-height-guided sampling (VHS), and a global-local collaborative feature fusion (CFF) module to enhance BEV representations.

Result: CoP achieves 49.5% mAP and 59.2% NDS on the nuScenes benchmark, outperforming existing vision-based frameworks.

Conclusion: CoP effectively bridges gaps in spatial representation and feature refinement, demonstrating superior performance in BEV 3D object detection.

Abstract: Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.

[264] TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang, Huan Wang

Main category: cs.CV

TL;DR: TARS, a token-adaptive preference strategy, improves multimodal large language models (MLLMs) by reducing hallucinations through min-max optimization, outperforming standard DPO and matching GPT-4o.

Details

Motivation: MLLMs often produce factually incorrect or visually ungrounded outputs, reducing reliability. Existing DPO strategies overfit to superficial cues, impairing grounding.

Method: TARS reformulates DPO as a min-max problem, maximizing token-level shifts under semantic constraints while minimizing preference loss.

Result: TARS reduces hallucination rates from 26.4% to 13.2% and cognition value from 2.5 to 0.4, using only 4.8k samples.

Conclusion: TARS effectively mitigates hallucinations in MLLMs, preserving causal grounding and outperforming standard DPO.

Abstract: Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
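For orientation, the sketch below computes the standard DPO loss for a single preference pair, which is the objective the abstract says TARS reformulates. It is not the TARS min-max objective: the token-level perturbations and semantic constraints are not shown. The function name, the argument names, and the value beta=0.1 are assumptions made for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities under the policy and under
    the frozen reference model. beta=0.1 is an illustrative choice.
    """
    # Implicit reward margin between the preferred and the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference model the margin is zero
# and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

A min-max variant in the spirit of the abstract would wrap this loss in an inner maximization over constrained perturbations of the inputs; the exact form is given in the paper, not here.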

[265] Color as the Impetus: Transforming Few-Shot Learner

Chaofei Qi, Zhitai Liu, Jianbin Qiu

Main category: cs.CV

TL;DR: The paper introduces ColorSense Learner, a bio-inspired meta-learning framework that mimics human color perception for few-shot learning, achieving strong generalization and robustness.

Details

Motivation: To leverage human color perception for improving few-shot learning by focusing on color-channel interactions, a neglected aspect in conventional methods.

Method: Proposes ColorSense Learner for inter-channel feature extraction and interactive learning, and ColorSense Distiller for knowledge distillation to enhance meta-learning.

Result: Demonstrates strong generalization, robustness, and transferability across eleven few-shot benchmarks.

Conclusion: The framework effectively bridges the gap in meta-learning by utilizing color perception, outperforming conventional methods.

Abstract: Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we pioneer an innovative viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network’s meta-learning capacity. We have conducted comprehensive coarse/fine-grained and cross-domain experiments on eleven few-shot benchmarks for validation. These experiments reveal that our methods have extremely strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.

[266] Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Kaining Ying, Henghui Ding, Guangquan Jie, Yu-Gang Jiang

Main category: cs.CV

TL;DR: OmniAVS introduces a new dataset and OISA, a method for multimodal reasoning in audio-visual segmentation, outperforming existing approaches.

Details

Motivation: To address challenges in integrating multimodal information and reasoning about audiovisual content in RAVS.

Method: Proposes OmniAVS dataset with diverse multimodal expressions and OISA, a model using MLLM for reasoning-based segmentation.

Result: OISA outperforms existing methods on OmniAVS and achieves competitive results on related tasks.

Conclusion: OmniAVS and OISA advance RAVS by enabling deeper multimodal understanding and reasoning.

Abstract: Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting their presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce Omnimodal Instructed Segmentation Assistant (OISA), to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.

cs.AI

[267] Unifying Post-hoc Explanations of Knowledge Graph Completions

Alessandro Lonardi, Samy Badreddine, Tarek R. Besold, Pablo Sanchez Martin

Main category: cs.AI

TL;DR: The paper proposes a unified framework for post-hoc explainability in Knowledge Graph Completion (KGC), balancing effectiveness and conciseness, and improves evaluation protocols for reproducibility.

Details

Motivation: Lack of formalization and consistent evaluations in post-hoc explainability for KGC hinders reproducibility and cross-study comparisons.

Method: Introduces a general framework for post-hoc explanations via multi-objective optimization and suggests improved evaluation protocols using metrics like Mean Reciprocal Rank and Hits@k.

Result: Unifies existing post-hoc explainability algorithms and refines evaluation standards.

Conclusion: The work aims to enhance reproducibility and impact in KGC explainability research by unifying methods and improving evaluations.

Abstract: Post-hoc explainability for Knowledge Graph Completion (KGC) lacks formalization and consistent evaluations, hindering reproducibility and cross-study comparisons. This paper argues for a unified approach to post-hoc explainability in KGC. First, we propose a general framework to characterize post-hoc explanations via multi-objective optimization, balancing their effectiveness and conciseness. This unifies existing post-hoc explainability algorithms in KGC and the explanations they produce. Next, we suggest and empirically support improved evaluation protocols using popular metrics like Mean Reciprocal Rank and Hits@$k$. Finally, we stress the importance of interpretability as the ability of explanations to address queries meaningful to end-users. By unifying methods and refining evaluation standards, this work aims to make research in KGC explainability more reproducible and impactful.
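The two metrics named in the abstract have simple standard definitions, sketched below. The rank list is hypothetical illustration data, not a result from the paper, and the function names are assumptions.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test queries whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks assigned to the correct entity for five test triples.
ranks = [1, 3, 2, 10, 50]

mrr = mean_reciprocal_rank(ranks)
hits3 = hits_at_k(ranks, 3)
hits10 = hits_at_k(ranks, 10)
```

MRR rewards placing the correct entity near the top of the ranking, while Hits@k only records whether it falls within the first k positions, so the two can move differently when an explanation is removed or added.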

[268] Data Readiness for Scientific AI at Scale

Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral

Main category: cs.AI

TL;DR: The paper introduces a two-dimensional readiness framework for assessing Data Readiness for AI (DRAI) in leadership-scale scientific datasets, focusing on common preprocessing patterns and domain-specific constraints.

Details

Motivation: To address challenges in transforming scientific data for scalable AI training, particularly for transformer-based generative models, by providing a standardized framework.

Method: Analyzes workflows in climate, nuclear fusion, bio/health, and materials domains, introducing a readiness framework with Data Readiness Levels and Data Processing Stages tailored to HPC environments.

Result: A conceptual maturity matrix is developed to characterize scientific data readiness and guide infrastructure development for reproducible AI in science.

Conclusion: The framework supports standardized, cross-domain data readiness for scalable and reproducible AI applications in scientific research.

Abstract: This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

Zhenyu Pan, Yutong Zhang, Jianshu Zhang, Haoran Lu, Haozheng Luo, Yuwei Han, Philip S. Yu, Manling Li, Han Liu

Main category: cs.AI

TL;DR: The study explores balancing reasoning ability and bias mitigation in Multimodal Large Language Models (MLLMs), identifying a 1:4 mix of debias-focused and reasoning-centric samples with reinforcement learning as optimal.

Details

Motivation: To address the trade-off between improving logical reasoning and reducing social biases in MLLMs.

Method: Benchmarked three bias-mitigation strategies (SFT, KD, RL) and varied sample proportions to analyze the reasoning-versus-bias trade-off.

Result: A 1:4 mix with reinforcement learning reduced stereotype scores by 10% while retaining 88% of reasoning accuracy.

Conclusion: The findings provide actionable insights for balancing fairness and capability in MLLMs.

Abstract: Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. To push their reasoning ability further, recent studies explore advanced prompting schemes and post-training fine-tuning. Although these techniques improve logical accuracy, they frequently leave the models’ outputs burdened with pronounced social biases. Clarifying how reasoning gains interact with bias mitigation, and whether the two objectives inherently trade off, therefore remains an open and pressing research problem. Our study begins by benchmarking three bias-mitigation strategies, namely supervised fine-tuning (SFT), knowledge distillation (KD), and rule-based reinforcement learning (RL), under identical conditions, establishing their baseline strengths and weaknesses. Building on these results, we vary the proportion of debias-focused and reasoning-centric samples within each paradigm to chart the reasoning-versus-bias trade-off. Our sweeps reveal a consistent sweet spot: a roughly 1:4 mix trained with reinforcement learning cuts stereotype scores by 10% while retaining 88% of the model’s original reasoning accuracy, offering concrete guidance for balancing fairness and capability in MLLMs.
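A minimal sketch of assembling a training set at a fixed mixing ratio, assuming the 1:4 figure is read as debias-focused to reasoning-centric samples. The pools, the function name mix_samples, and the sampling scheme are hypothetical. The paper's actual data pipeline is not described in the abstract.

```python
import random

def mix_samples(debias, reasoning, ratio=(1, 4), total=None, seed=0):
    """Draw a shuffled training set with the requested debias:reasoning ratio."""
    d_parts, r_parts = ratio
    # Largest number of whole ratio units both pools can supply.
    unit = min(len(debias) // d_parts, len(reasoning) // r_parts)
    if total is not None:
        unit = min(unit, total // (d_parts + r_parts))
    rng = random.Random(seed)
    mixed = (rng.sample(debias, unit * d_parts)
             + rng.sample(reasoning, unit * r_parts))
    rng.shuffle(mixed)
    return mixed

# Hypothetical sample pools, tagged by source for inspection.
debias_pool = [("debias", i) for i in range(100)]
reasoning_pool = [("reasoning", i) for i in range(1000)]

mixed = mix_samples(debias_pool, reasoning_pool, total=500)
n_debias = sum(1 for tag, _ in mixed if tag == "debias")
n_reasoning = sum(1 for tag, _ in mixed if tag == "reasoning")
```

Sweeping the ratio argument over several values is one way to reproduce the kind of proportion sweep the abstract describes.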

[270] Moravec’s Paradox: Towards an Auditory Turing Test

David Noever, Forrest McKee

Main category: cs.AI

TL;DR: AI systems struggle with auditory tasks humans find easy, failing more than 93% of challenges in a new auditory Turing test.

Details

Motivation: Highlight the gap between human and machine auditory processing, inspired by Moravec's paradox.

Method: Introduced an auditory Turing test with 917 challenges across seven categories, evaluating state-of-the-art models like GPT-4 and Whisper.

Result: AI models failed catastrophically (failure rate exceeding 93%), with the best model achieving only 6.9% accuracy vs. humans’ 52%.

Conclusion: Current AI lacks mechanisms for human-like auditory scene analysis, calling for new approaches integrating selective attention and context-aware perception.

Abstract: This research work demonstrates that current AI systems fail catastrophically on auditory tasks that humans perform effortlessly. Drawing inspiration from Moravec’s paradox (i.e., tasks simple for humans often prove difficult for machines, and vice versa), we introduce an auditory Turing test comprising 917 challenges across seven categories: overlapping speech, speech in noise, temporal distortion, spatial audio, coffee-shop noise, phone distortion, and perceptual illusions. Our evaluation of state-of-the-art audio models including GPT-4’s audio capabilities and OpenAI’s Whisper reveals a striking failure rate exceeding 93%, with even the best-performing model achieving only 6.9% accuracy on tasks that humans solved at 7.5 times higher success (52%). These results expose focusing failures in how AI systems process complex auditory scenes, particularly in selective attention, noise robustness, and contextual adaptation. Our benchmark not only quantifies the human-machine auditory gap but also provides insights into why these failures occur, suggesting that current architectures lack fundamental mechanisms for human-like auditory scene analysis. The traditional design of audio CAPTCHAs highlights common filters that humans evolved but machines fail to select in multimodal language models. This work establishes a diagnostic framework for measuring progress toward human-level machine listening and highlights the need for novel approaches integrating selective attention, physics-based audio understanding, and context-aware perception into multimodal AI systems.

[271] Argumentatively Coherent Judgmental Forecasting

Deniz Gorur, Antonio Rago, Francesca Toni

Main category: cs.AI

TL;DR: The paper introduces ‘argumentative coherence’ in judgmental forecasting, showing its practical value in improving accuracy for both human and LLM-based forecasts, despite users not naturally aligning with it.

Details

Motivation: To study the properties of forecasts from an argumentative perspective and advocate for the importance of coherence in reasoning.

Method: Formally define argumentative coherence and evaluate its impact through experiments with human and LLM-based forecasters, along with crowd-sourced user studies.

Result: Filtering incoherent predictions improves forecasting accuracy, but users do not naturally align with coherence.

Conclusion: Mechanisms to filter incoherent opinions are needed in argumentation-based judgmental forecasting.

Abstract: Judgmental forecasting employs human opinions to make predictions about future events, rather than exclusively historical data as in quantitative forecasting. When these opinions form an argumentative structure around forecasts, it is useful to study the properties of the forecasts from an argumentative perspective. In this paper, we advocate and formally define a property of argumentative coherence, which, in essence, requires that a forecaster’s reasoning is coherent with their forecast. We then conduct three evaluations with our notion of coherence. First, we assess the impact of enforcing coherence on human forecasters as well as on Large Language Model (LLM)-based forecasters, given that they have recently shown to be competitive with human forecasters. In both cases, we show that filtering out incoherent predictions improves forecasting accuracy consistently, supporting the practical value of coherence in both human and LLM-based forecasting. Then, via crowd-sourced user experiments, we show that, despite its apparent intuitiveness and usefulness, users do not generally align with this coherence property. This points to the need to integrate, within argumentation-based judgmental forecasting, mechanisms to filter out incoherent opinions before obtaining group forecasting predictions.

[272] Tractable Responsibility Measures for Ontology-Mediated Query Answering

Meghyn Bienvenu, Diego Figueira, Pierre Lafourcade

Main category: cs.AI

TL;DR: The paper analyzes the complexity of computing Shapley-value-based responsibility scores (WSMS) in ontology-mediated query answering, showing polynomial data complexity for first-order-rewritable queries but intractability for certain ontology languages. It identifies tractable cases in DL-Lite dialects.

Details

Motivation: To quantify the contributions of facts to query answers using responsibility measures, specifically focusing on WSMS in ontology-mediated query answering.

Method: Exploits database results to analyze complexity, focusing on first-order-rewritable queries and ontology languages encoding reachability. Examines combined complexity for atomic and conjunctive queries.

Result: Polynomial data complexity for first-order-rewritable queries, #P-hard for reachability-encoding ontologies. Intractability for atomic queries with conjunction and unions of conjunctive queries. Tractable cases identified in DL-Lite dialects.

Conclusion: WSMS computation is tractable for certain query classes in DL-Lite dialects but intractable for others, highlighting the need for structural restrictions to ensure efficiency.

Abstract: Recent work on quantitative approaches to explaining query answers employs responsibility measures to assign scores to facts in order to quantify their respective contributions to obtaining a given answer. In this paper, we study the complexity of computing such responsibility scores in the setting of ontology-mediated query answering, focusing on a very recently introduced family of Shapley-value-based responsibility measures defined in terms of weighted sums of minimal supports (WSMS). By exploiting results from the database setting, we can show that such measures enjoy polynomial data complexity for classes of ontology-mediated queries that are first-order-rewritable, whereas the problem becomes #P-hard when the ontology language can encode reachability queries (via axioms like $\exists R. A \sqsubseteq A$). To better understand the tractability frontier, we next explore the combined complexity of WSMS computation. We prove that intractability applies already to atomic queries if the ontology language supports conjunction, as well as to unions of ‘well-behaved’ conjunctive queries, even in the absence of an ontology. By contrast, our study yields positive results for common DL-Lite dialects: by means of careful analysis, we identify classes of structurally restricted conjunctive queries (which intuitively disallow undesirable interactions between query atoms) that admit tractable WSMS computation.
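To make "weighted sums of minimal supports" concrete, here is a brute-force sketch on a tiny reachability query. It is only an illustration of the general idea: the weight function 1/|S| is a hypothetical choice (the exact family of weights is defined in the paper), there is no ontology, and the enumeration is exponential, unlike the tractable algorithms the abstract refers to. All names are assumptions.

```python
from itertools import combinations

def minimal_supports(facts, holds):
    """Inclusion-minimal subsets of `facts` on which the monotone query `holds`."""
    found = []
    for size in range(len(facts) + 1):
        for subset in combinations(facts, size):
            s = frozenset(subset)
            if any(m <= s for m in found):
                continue  # contains a smaller support, so it is not minimal
            if holds(s):
                found.append(s)
    return found

def wsms_scores(facts, holds, weight):
    """Score each fact by summing weight(|S|) over minimal supports S containing it."""
    supports = minimal_supports(facts, holds)
    return {f: sum(weight(len(s)) for s in supports if f in s) for f in facts}

def reachable(edges, src="a", dst="c"):
    """Boolean query: is dst reachable from src using the given edge facts?"""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append(v)
    return dst in seen

# Hypothetical database of edge facts.
facts = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
scores = wsms_scores(facts, reachable, weight=lambda size: 1.0 / size)
```

Here the direct edge ("a", "c") is a minimal support on its own, the two-edge path shares its weight between its facts, and ("c", "d") belongs to no minimal support and so receives a score of zero.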

[273] Solution-aware vs global ReLU selection: partial MILP strikes back for DNN verification

Yuke Liao, Blaise Genest, Kuldeep Meel, Shaan Aryaman

Main category: cs.AI

TL;DR: The paper proposes a divide-and-conquer approach using partial MILP calls and introduces a solution-aware ReLU scoring (SAS) method to efficiently select critical ReLU variables, reducing binary variables by 6x while maintaining accuracy. Hybrid MILP implementation improves verification efficiency.

Details

Motivation: To address the inefficiency of previous methods in handling complex instances by reducing the number of costly binary variables and improving ReLU selection.

Method: Uses a divide-and-conquer strategy with partial MILP calls, introduces SAS for ReLU scoring, and adapts BaB-SR and BaB-FSB as global scoring functions. Implements Hybrid MILP with α,β-CROWN and partial MILP.

Result: SAS reduces binary variables by 6x, maintains accuracy, and Hybrid MILP reduces undecided instances by 40% with reasonable runtime (46s-417s).

Conclusion: The proposed SAS and Hybrid MILP approach significantly improves efficiency and accuracy in verifying large CNNs.

Abstract: To handle complex instances, we revisit a divide-and-conquer approach to break down the complexity: instead of few complex BaB calls, we rely on many small partial MILP calls. The crucial step is to select very few but very important ReLUs to treat using (costly) binary variables. The previous attempts were suboptimal in that respect. To select these important ReLU variables, we propose a novel solution-aware ReLU scoring (SAS), as well as adapt the BaB-SR and BaB-FSB branching functions as global ReLU scoring (GS) functions. We compare them theoretically as well as experimentally, and SAS is more efficient at selecting a set of variables to open using binary variables. Compared with previous attempts, SAS reduces the number of binary variables by around 6 times, while maintaining the same level of accuracy. Implemented in Hybrid MILP, calling first α,β-CROWN with a short time-out to solve easier instances, and then partial MILP, produces a very accurate yet efficient verifier, reducing by up to 40% the number of undecided instances to low levels (8-15%), while keeping a reasonable runtime (46s-417s on average per instance), even for fairly large CNNs with 2 million parameters.
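The sketch below illustrates why each "opened" ReLU costs a binary variable, using the textbook big-M MILP encoding of y = ReLU(x) with pre-activation bounds lower < 0 < upper. This is the standard encoding, not the paper's SAS scoring or its Hybrid MILP implementation, and all function names are assumptions. With a binary indicator the constraints pin y to ReLU(x) exactly. Relaxing the indicator to [0, 1] leaves only a linear upper bound.

```python
def relu_milp_feasible(x, y, a, lower, upper, tol=1e-9):
    """Check the four big-M constraints for one ReLU with indicator a."""
    return (y >= -tol
            and y >= x - tol
            and y <= upper * a + tol
            and y <= x - lower * (1 - a) + tol)

def exact_outputs(x, lower, upper, grid):
    """Grid values of y that are feasible for some binary a in {0, 1}."""
    return [y for y in grid
            if any(relu_milp_feasible(x, y, a, lower, upper) for a in (0, 1))]

def relaxed_upper_bound(x, lower, upper):
    """Largest feasible y once a is relaxed to the interval [0, 1]."""
    return upper * (x - lower) / (upper - lower)

lower, upper = -2.0, 3.0
grid = [i / 4 for i in range(-8, 13)]  # candidate outputs from -2.0 to 3.0

active = exact_outputs(1.5, lower, upper, grid)     # only ReLU(1.5) survives
inactive = exact_outputs(-1.0, lower, upper, grid)  # only ReLU(-1.0) survives
loose = relaxed_upper_bound(1.5, lower, upper)      # relaxation admits y above 1.5
```

The gap between the exact output and the relaxed bound is the imprecision a verifier accepts for every ReLU it leaves unopened, which is why choosing which few ReLUs receive binary variables matters.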

[274] How Far Are AI Scientists from Changing the World?

Qiujie Xie, Yixuan Weng, Minjun Zhu, Fuchen Shen, Shulin Huang, Zhen Lin, Jiahui Zhou, Zilan Mao, Zijie Yang, Linyi Yang, Jian Wu, Yue Zhang

Main category: cs.AI

TL;DR: The paper surveys the progress of AI Scientist systems, assessing their potential to revolutionize scientific research and identifying key challenges and goals.

Details

Motivation: To evaluate how close AI Scientists are to transforming scientific research and uncovering unknown phenomena.

Method: A prospect-driven review analyzing current achievements, bottlenecks, and critical components of AI Scientist systems.

Result: Identifies limitations and gaps in current AI Scientist systems, outlining goals for future scientific AI.

Conclusion: The survey aims to clarify the current state, missing elements, and ultimate objectives for AI in scientific discovery.

Abstract: The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist capable of uncovering phenomena previously unknown to humans, may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.

[275] AI Must not be Fully Autonomous

Tosin Adewumi, Lama Alkhaled, Florent Imbert, Hui Han, Nudrat Habib, Karl Löwenmark

Main category: cs.AI

TL;DR: The paper argues against fully autonomous AI (level 3) due to risks, advocating for human oversight. It presents theories, arguments, counterarguments, and evidence of AI risks.

Details

Motivation: The risks of fully autonomous AI, especially with the potential rise of artificial superintelligence (ASI), necessitate responsible human oversight.

Method: The paper discusses autonomy theories, AI, and agents, and provides 12 arguments, 6 counterarguments with rebuttals, and 15 pieces of evidence.

Result: The analysis supports the need for human oversight to mitigate risks associated with fully autonomous AI.

Conclusion: Fully autonomous AI (level 3) should be avoided; responsible human oversight is essential to manage risks.

Abstract: Autonomous Artificial Intelligence (AI) has many benefits. It also has many risks. In this work, we identify the 3 levels of autonomous AI. We are of the position that AI must not be fully autonomous because of the many risks, especially as artificial superintelligence (ASI) is speculated to be just decades away. Fully autonomous AI, which can develop its own objectives, is at level 3 and without responsible human oversight. However, responsible human oversight is crucial for mitigating the risks. To argue for our position, we discuss theories of autonomy, AI and agents. Then, we offer 12 distinct arguments and 6 counterarguments with rebuttals to the counterarguments. We also present 15 pieces of recent evidence of AI misaligned values and other risks in the appendix.

[276] DSBC : Data Science task Benchmarking with Context engineering

Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Suman Debnath, Hamza Farooq

Main category: cs.AI

TL;DR: The paper introduces a benchmark for evaluating data science agents powered by LLMs, testing three models across three approaches and eight task categories, revealing performance disparities and practical deployment factors.

Details

Motivation: Despite the rapid adoption of LLM-based data science agents, there is a lack of systematic benchmarks to evaluate their efficacy and limitations.

Method: The study evaluates three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, OpenAI-o4-Mini) using three approaches (zero-shot with context engineering, multi-step with context engineering, SmolAgent) across eight data science task categories. It also examines sensitivity to prompting issues and temperature parameters.

Result: Findings show distinct performance disparities among models and methodologies, highlighting critical factors for practical deployment.

Conclusion: The benchmark dataset and framework aim to support future research for more robust and effective data science agents.

Abstract: Recent advances in large language models (LLMs) have significantly impacted data science workflows, giving rise to specialized data science agents designed to automate analytical tasks. Despite rapid adoption, systematic benchmarks evaluating the efficacy and limitations of these agents remain scarce. In this paper, we introduce a comprehensive benchmark specifically crafted to reflect real-world user interactions with data science agents by observing usage of our commercial applications. We evaluate three LLMs: Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini across three approaches: zero-shot with context engineering, multi-step with context engineering, and with SmolAgent. Our benchmark assesses performance across a diverse set of eight data science task categories, additionally exploring the sensitivity of models to common prompting issues, such as data leakage and slightly ambiguous instructions. We further investigate the influence of temperature parameters on overall and task-specific outcomes for each model and approach. Our findings reveal distinct performance disparities among the evaluated models and methodologies, highlighting critical factors that affect practical deployment. The benchmark dataset and evaluation framework introduced herein aim to provide a foundation for future research of more robust and effective data science agents.

[277] LLM4Rail: An LLM-Augmented Railway Service Consulting Platform

Zhuo Li, Xianghuai Deng, Chiwei Feng, Hanmeng Li, Shenjie Wang, Haichao Zhang, Teng Jia, Conlin Chen, Louis Linchun Wu, Jia Wang

Main category: cs.AI

TL;DR: LLM4Rail is a novel LLM-augmented railway service platform using the QTAO framework for personalized services like ticketing, food recommendations, and chitchat. It includes the CRFD-25 dataset for railway catering and a zero-shot recommender system.

Details

Motivation: To meet the demand for individualized railway services by leveraging LLMs for personalized and accurate responses.

Method: Proposes the QTAO prompting framework for reasoning and action integration, and introduces the CRFD-25 dataset with a zero-shot conversational recommender system.

Result: LLM4Rail effectively provides personalized railway services, including food recommendations aligned with the CRFD-25 dataset.

Conclusion: LLM4Rail demonstrates the potential of LLMs in enhancing railway services through personalized and context-aware solutions.

Abstract: Large language models (LLMs) have significantly reshaped different walks of business. To meet the increasing demands for individualized railway service, we develop LLM4Rail - a novel LLM-augmented railway service consulting platform. Empowered by an LLM, LLM4Rail can provide custom modules for ticketing, railway food & drink recommendations, weather information, and chitchat. In LLM4Rail, we propose the iterative “Question-Thought-Action-Observation (QTAO)” prompting framework. It meticulously integrates verbal reasoning with task-oriented actions, that is, reasoning to guide action selection, to effectively retrieve external observations relevant to railway operation and service to generate accurate responses. To provide personalized onboard dining services, we first construct the Chinese Railway Food and Drink (CRFD-25) - a publicly accessible takeout dataset tailored for railway services. CRFD-25 covers a wide range of signature dishes categorized by cities, cuisines, age groups, and spiciness levels. We further introduce an LLM-based zero-shot conversational recommender for railway catering. To address the unconstrained nature of open recommendations, a feature similarity-based post-processing step is introduced to ensure all the recommended items are aligned with the CRFD-25 dataset.
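
An iterative Question-Thought-Action-Observation loop of the kind the abstract describes might look like the sketch below. Everything here is an assumption made for illustration: the `policy` callable stands in for the LLM, the `"finish"` action name and the tool registry are hypothetical, and the paper's actual prompt format is not reproduced.

```python
from typing import Callable, Dict, List, Tuple

# A tool maps a string argument to an observation string (hypothetical interface).
Tool = Callable[[str], str]
# One recorded step: (thought, action, observation).
Step = Tuple[str, str, str]
# The policy stands in for the LLM: given the question and the steps so far,
# it returns (thought, action, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def qtao_loop(
    question: str,
    policy: Policy,
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> str:
    """Alternate thought, action, and observation until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(question, history)
        if action == "finish":
            return argument  # the argument carries the final answer
        tool = tools.get(action)
        observation = tool(argument) if tool else f"unknown action: {action}"
        history.append((thought, action, observation))
    return "no answer within step budget"
```

Feeding each observation back into the next policy call is what lets the reasoning step choose the next action from what the tools actually returned.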

[278] Chatting with your ERP: A Recipe

Jorge Ruiz Gómez, Lidia Andrés Susinos, Jorge Alamo Olivé, Sonia Rey Osorno, Manuel Luis Gonzalez Hernández

Main category: cs.AI

TL;DR: A dual-agent LLM architecture for translating natural language to SQL in ERP systems improves reliability.

Details

Motivation: To enhance interaction with industrial ERP systems using natural language queries.

Method: Proposes a dual-agent architecture with reasoning and critique stages for reliable SQL generation.

Result: The agent successfully interprets queries and generates executable SQL.

Conclusion: The dual-agent approach improves reliability in natural language to SQL translation for ERP systems.

Abstract: This paper presents the design, implementation, and evaluation behind a Large Language Model (LLM) agent that chats with an industrial production-grade ERP system. The agent is capable of interpreting natural language queries and translating them into executable SQL statements, leveraging open-weight LLMs. A novel dual-agent architecture combining reasoning and critique stages was proposed to improve query generation reliability.
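
One way to picture a generate-then-critique loop for SQL is sketched below. It is not the paper's architecture: the critic here is a simple SQLite planning check (via `EXPLAIN`, which compiles a statement without running it) rather than a second LLM, and `propose`, `sqlite_critic`, and `generate_sql` are hypothetical names.

```python
import sqlite3
from typing import Callable, Optional


def sqlite_critic(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if SQLite can plan the statement, else the error text."""
    try:
        conn.execute("EXPLAIN " + sql)  # compiles the query, does not run it
        return None
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return str(exc)


def generate_sql(
    question: str,
    propose: Callable[[str, Optional[str]], str],
    conn: sqlite3.Connection,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask the proposer for SQL, feeding critic feedback into each retry."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        sql = propose(question, feedback)
        feedback = sqlite_critic(conn, sql)
        if feedback is None:
            return sql
    return None  # no candidate passed the critique within the budget
```

Returning `None` instead of the last failed candidate keeps unvalidated SQL from reaching the production database.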

[279] Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Mingzhe Li, Xin Lu, Yanyan Zhao

Main category: cs.AI

TL;DR: Self-Foveate is an LLM-driven method for synthesizing diverse and challenging instructions from unsupervised text, using a multi-level foveation approach.

Details

Motivation: Existing automated instruction synthesis methods lack diversity and difficulty, relying heavily on human effort.

Method: Proposes ‘Micro-Scatter-Macro’ multi-level foveation to guide LLMs in extracting fine-grained information from text.

Result: Validated effectiveness across multiple corpora and model architectures.

Conclusion: Self-Foveate enhances instruction synthesis, outperforming conventional methods.

Abstract: Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation. Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis. This approach introduces a “Micro-Scatter-Macro” multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method. We publicly release our data and codes: https://github.com/Mubuky/Self-Foveate

[280] Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery

Kacper Kadziolka, Saber Salehkaleybar

Main category: cs.AI

TL;DR: The paper explores causal discovery in large language models, showing that reasoning-first architectures outperform traditional methods. A modular in-context pipeline improves performance significantly.

Details

Motivation: To address the challenge of causal inference in large language models and improve robustness in causal discovery tasks.

Method: Uses OpenAI’s o-series and DeepSeek-R models on the Corr2Cause benchmark, introducing a modular in-context pipeline inspired by Tree-of-Thoughts and Chain-of-Thoughts.

Result: Reasoning-first architectures achieve significant gains, with the pipeline yielding nearly three-fold improvements over baselines.

Conclusion: Advanced reasoning models show promise, but structured in-context frameworks are crucial for maximizing their potential in causal discovery.

Abstract: Causal inference remains a fundamental challenge for large language models. Recent advances in internal reasoning with large language models have sparked interest in whether state-of-the-art reasoning models can robustly perform causal discovery, a task where conventional models often suffer from severe overfitting and near-random performance under data perturbations. We study causal discovery on the Corr2Cause benchmark using the emergent OpenAI’s o-series and DeepSeek-R model families and find that these reasoning-first architectures achieve significantly greater native gains than prior approaches. To capitalize on these strengths, we introduce a modular in-context pipeline inspired by the Tree-of-Thoughts and Chain-of-Thoughts methodologies, yielding nearly three-fold improvements over conventional baselines. We further probe the pipeline’s impact by analyzing reasoning chain length, complexity, and conducting qualitative and quantitative comparisons between conventional and reasoning models. Our findings suggest that while advanced reasoning models represent a substantial leap forward, carefully structured in-context frameworks are essential to maximize their capabilities and offer a generalizable blueprint for causal discovery across diverse domains.

[281] Causal Identification of Sufficient, Contrastive and Complete Feature Sets in Image Classification

David A Kelly, Hana Chockler

Main category: cs.AI

TL;DR: The paper introduces causal explanations for image classifiers, combining formal rigor with practical applicability, and introduces contrastive and complete causal explanations.

Details

Motivation: Existing explanation methods for image classifiers lack formal rigor or rely on impractical assumptions. Causal explanations bridge this gap.

Method: The paper defines causal explanations, proves their formal properties, and introduces contrastive and complete causal explanations. Algorithms are implemented to compute these efficiently.

Result: Experiments show varying patterns of sufficiency, contrastiveness, and completeness across models. Algorithms are efficient (6s/image on ResNet50) and black-box.

Conclusion: Causal explanations offer a rigorous yet practical solution for explaining image classifiers, with efficient and black-box computability.

Abstract: Existing algorithms for explaining the outputs of image classifiers are based on a variety of approaches and produce explanations that lack formal rigor. On the other hand, logic-based explanations are formally and rigorously defined but their computability relies on strict assumptions about the model that do not hold on image classifiers. In this paper, we show that causal explanations, in addition to being formally and rigorously defined, enjoy the same formal properties as logic-based ones, while still lending themselves to black-box algorithms and being a natural fit for image classifiers. We prove formal properties of causal explanations and introduce contrastive causal explanations for image classifiers. Moreover, we augment the definition of explanation with confidence awareness and introduce complete causal explanations: explanations that are classified with exactly the same confidence as the original image. We implement our definitions, and our experimental results demonstrate that different models have different patterns of sufficiency, contrastiveness, and completeness. Our algorithms are efficiently computable, taking on average 6s per image on a ResNet50 model to compute all types of explanations, and are totally black-box, needing no knowledge of the model, no access to model internals, no access to gradient, nor requiring any properties, such as monotonicity, of the model.
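
A black-box check for sufficiency and completeness could be sketched as below. This is one plausible reading, not the paper's formal definitions: an image is a flat list of pixel values, occluded pixels are replaced by a fixed `baseline` (an assumption), and the classifier is any callable returning a label and a confidence.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Black-box classifier: pixels in, (label, confidence) out. No gradients
# or model internals are needed.
Classifier = Callable[[Sequence[float]], Tuple[str, float]]


def keep_only(image: Sequence[float], keep: Set[int], baseline: float = 0.0) -> List[float]:
    """Keep the pixels indexed by `keep`; replace all others with `baseline`."""
    return [v if i in keep else baseline for i, v in enumerate(image)]


def is_sufficient(classify: Classifier, image: Sequence[float], keep: Set[int]) -> bool:
    """The kept pixels alone yield the same label as the full image."""
    return classify(keep_only(image, keep))[0] == classify(image)[0]


def is_complete(
    classify: Classifier, image: Sequence[float], keep: Set[int], tol: float = 1e-9
) -> bool:
    """Same label and (within `tol`) the same confidence as the full image."""
    label, conf = classify(image)
    masked_label, masked_conf = classify(keep_only(image, keep))
    return masked_label == label and abs(masked_conf - conf) <= tol
```

A pixel set can be sufficient without being complete, since matching the label is a weaker requirement than matching the confidence.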

[282] Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding

Shiyue Wang, Haozheng Xu, Yuhan Zhang, Jingran Lin, Changhong Lu, Xiangfeng Wang, Wenhao Li

Main category: cs.AI

TL;DR: This survey bridges classical and learning-based methods in Multi-Agent Path Finding (MAPF), presenting a unified framework, analyzing evaluation disparities, and suggesting future directions like game-theoretic MAPF and neural solvers.

Details

Motivation: MAPF is critical for real-world multi-robot coordination, but research is divided between classical and learning-based methods. This survey aims to unify and standardize the field.

Method: The survey reviews search-based, compilation-based, and data-driven MAPF methods, analyzing 200+ papers for evaluation disparities and proposing a taxonomy of metrics and benchmarks.

Result: Classical methods are tested on larger-scale instances (200x200 grids, 1000+ agents) compared to learning-based approaches (10-100 agents), highlighting evaluation inconsistencies.

Conclusion: The survey calls for standardized benchmarking and explores future directions like game-theoretic MAPF and neural solver architectures, serving as a reference for researchers and practitioners.

Abstract: Multi-Agent Path Finding (MAPF) is a fundamental problem in artificial intelligence and robotics, requiring the computation of collision-free paths for multiple agents navigating from their start locations to designated goals. As autonomous systems become increasingly prevalent in warehouses, urban transportation, and other complex environments, MAPF has evolved from a theoretical challenge to a critical enabler of real-world multi-robot coordination. This comprehensive survey bridges the long-standing divide between classical algorithmic approaches and emerging learning-based methods in MAPF research. We present a unified framework that encompasses search-based methods (including Conflict-Based Search, Priority-Based Search, and Large Neighborhood Search), compilation-based approaches (SAT, SMT, CSP, ASP, and MIP formulations), and data-driven techniques (reinforcement learning, supervised learning, and hybrid strategies). Through systematic analysis of experimental practices across 200+ papers, we uncover significant disparities in evaluation methodologies, with classical methods typically tested on larger-scale instances (up to 200 by 200 grids with 1000+ agents) compared to learning-based approaches (predominantly 10-100 agents). We provide a comprehensive taxonomy of evaluation metrics, environment types, and baseline selections, highlighting the need for standardized benchmarking protocols. Finally, we outline promising future directions including mixed-motive MAPF with game-theoretic considerations, language-grounded planning with large language models, and neural solver architectures that combine the rigor of classical methods with the flexibility of deep learning. This survey serves as both a comprehensive reference for researchers and a practical guide for deploying MAPF solutions in increasingly complex real-world applications.
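
The collision-free requirement at the core of MAPF is usually checked with vertex and edge (swap) conflicts. The sketch below is a generic illustration rather than code from the survey; it assumes grid cells as `(row, col)` tuples, one move per timestep, and agents that wait at their goal after arriving.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def position_at(path: List[Cell], t: int) -> Cell:
    """Position at timestep t; an agent stays on its last cell once done."""
    return path[min(t, len(path) - 1)]


def first_conflict(path_a: List[Cell], path_b: List[Cell]) -> Optional[Tuple[str, int]]:
    """Return ("vertex", t) or ("edge", t) for the earliest conflict, else None.

    A vertex conflict means both agents occupy the same cell at time t.
    An edge conflict means they swap cells between t and t + 1.
    """
    horizon = max(len(path_a), len(path_b))
    for t in range(horizon):
        a_now, b_now = position_at(path_a, t), position_at(path_b, t)
        if a_now == b_now:
            return ("vertex", t)
        a_next, b_next = position_at(path_a, t + 1), position_at(path_b, t + 1)
        if a_now == b_next and b_now == a_next:
            return ("edge", t)
    return None
```

Search-based solvers such as Conflict-Based Search repeatedly run a check of this kind and add constraints for the first conflict found.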

[283] DICE: Dynamic In-Context Example Selection in LLM Agents via Efficient Knowledge Transfer

Ruoyu Wang, Junda Wu, Yu Xia, Tong Yu, Ryan A. Rossi, Julian McAuley, Lina Yao

Main category: cs.AI

TL;DR: DICE introduces a dynamic, theoretically grounded method for selecting in-context demonstrations to improve LLM agent performance by addressing spurious dependencies and ensuring relevance at each reasoning step.

Details

Motivation: Existing in-context learning (ICL) methods for LLM agents are sensitive to demonstration choices, often leading to unstable performance. There's a lack of a general, principled approach for effective demo selection.

Method: DICE decomposes demonstration knowledge into transferable and non-transferable components using a causal lens, proposes a stepwise selection criterion, and integrates as a plug-in module without additional training.

Result: Experiments show DICE’s effectiveness and generality across diverse domains, improving agent performance with principled demo selection.

Conclusion: DICE provides a robust, framework-agnostic solution for dynamic demo selection, enhancing LLM agent performance and generalization.

Abstract: Large language model-based agents, empowered by in-context learning (ICL), have demonstrated strong capabilities in complex reasoning and tool-use tasks. However, existing works have shown that the effectiveness of ICL is highly sensitive to the choice of demonstrations, with suboptimal examples often leading to unstable or degraded performance. While prior work has explored example selection, including in some agentic or multi-step settings, existing approaches typically rely on heuristics or task-specific designs and lack a general, theoretically grounded criterion for what constitutes an effective demonstration across reasoning steps. Therefore, it is non-trivial to develop a principled, general-purpose method for selecting demonstrations that consistently benefit agent performance. In this paper, we address this challenge with DICE, Dynamic In-Context Example Selection for LLM Agents, a theoretically grounded ICL framework for agentic tasks that selects the most relevant demonstrations at each step of reasoning. Our approach decomposes demonstration knowledge into transferable and non-transferable components through a causal lens, showing how the latter can introduce spurious dependencies that impair generalization. We further propose a stepwise selection criterion with a formal guarantee of improved agent performance. Importantly, DICE is a general, framework-agnostic solution that can be integrated as a plug-in module into existing agentic frameworks without any additional training cost. Extensive experiments across diverse domains demonstrate our method’s effectiveness and generality, highlighting the importance of principled, context-aware demo selection for robust and efficient LLM agents.

[284] GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

Haoyang Liu, Yijiang Li, Haohan Wang

Main category: cs.AI

TL;DR: GenoMAS introduces a team of LLM-based scientists to automate gene expression analysis, combining structured workflows with autonomous adaptability, outperforming prior methods in preprocessing and gene identification.

Details

Motivation: Current automation tools for gene expression analysis are either too rigid or lack precision, limiting their effectiveness in rigorous scientific inquiry.

Method: GenoMAS uses six specialized LLM agents coordinated via typed message-passing protocols and a guided-planning framework to handle genomic data flexibly and coherently.

Result: Achieves 89.13% Composite Similarity Correlation for preprocessing and 60.48% F1 for gene identification on the GenoTEX benchmark, surpassing the best prior methods by 10.61% and 16.85%, respectively.

Conclusion: GenoMAS successfully balances reliability and adaptability, offering a robust solution for gene expression analysis with validated biological insights.

Abstract: Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F1 of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85%, respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

[285] Semantic Chain-of-Trust: Autonomous Trust Orchestration for Collaborator Selection via Hypergraph-Aided Agentic AI

Botao Zhu, Xianbin Wang, Dusit Niyato

Main category: cs.AI

TL;DR: Proposes an autonomous trust orchestration method using agentic AI and hypergraph for efficient trust evaluation in collaborative systems.

Details

Motivation: Addresses the complexity and resource consumption of trust evaluations in distributed collaboration due to task complexity, dynamic device resources, and assessment overhead.

Method: Uses agentic AI and hypergraph to autonomously evaluate trust during device idle periods, analyze task-resource alignment, and manage collaborators hierarchically via trust hypergraphs.

Result: Achieves resource-efficient trust evaluation, balancing overhead and accuracy, and supports multi-hop collaboration in large-scale systems.

Conclusion: The method effectively optimizes trust evaluation, enhancing resource utilization and collaborative task execution.

Abstract: In collaborative systems, the effective completion of tasks hinges on task-specific trust evaluations of potential devices for distributed collaboration. However, the complexity of tasks, the spatiotemporal dynamism of distributed device resources, and the inevitable assessment overhead dramatically increase the complexity and resource consumption of the trust evaluation process. As a result, ill-timed or overly frequent trust evaluations can reduce utilization rate of constrained resources, negatively affecting collaborative task execution. To address this challenge, this paper proposes an autonomous trust orchestration method based on a new concept of semantic chain-of-trust. Our technique employs agentic AI and hypergraph to establish and maintain trust relationships among devices. By leveraging its strengths in autonomous perception, task decomposition, and semantic reasoning, we propose agentic AI to perceive device states and autonomously perform trust evaluations of collaborators based on historical performance data only during device idle periods, thereby enabling efficient utilization of distributed resources. In addition, agentic AI performs task-specific trust evaluations on collaborator resources by analyzing the alignment between resource capabilities and task requirements. Moreover, by maintaining a trust hypergraph embedded with trust semantics for each device, agentic AI enables hierarchical management of collaborators and identifies collaborators requiring trust evaluation based on trust semantics, thereby achieving a balance between overhead and trust accuracy. Furthermore, local trust hypergraphs from multiple devices can be chained together to support multi-hop collaboration, enabling efficient coordination in large-scale systems. Experimental results demonstrate that the proposed method achieves resource-efficient trust evaluation.

[286] MemoCue: Empowering LLM-Based Agents for Human Memory Recall via Strategy-Guided Querying

Qian Zhao, Zhuo Sun, Bin Guo, Zhiwen Yu

Main category: cs.AI

TL;DR: The paper proposes a strategy-guided agent-assisted memory recall method, using a Recall Router framework and fine-tuned LLMs to improve memory recall performance.

Details

Motivation: Conventional memory modules are limited in size, hindering complete memory acquisition. Memory theories suggest proactive activation of memories via cues.

Method: A 5W Recall Map classifies queries into scenarios, and a hierarchical recall tree with Monte Carlo Tree Search optimizes strategy selection. Fine-tuned LLMs (MemoCue) generate responses.

Result: MemoCue outperforms LLM-based methods by 17.74% in recall inspiration and excels in human evaluations.

Conclusion: The proposed method effectively enhances memory recall by leveraging strategic cues and fine-tuned LLMs.

Abstract: Agent-assisted memory recall is a critical research problem in the field of human-computer interaction. In conventional methods, the agent can retrieve information from its equipped memory module to help the person recall incomplete or vague memories. The limited size of the memory module hinders the acquisition of complete memories and impacts the memory recall performance in practice. Memory theories suggest that the person’s relevant memory can be proactively activated through some effective cues. Inspired by this, we propose a novel strategy-guided agent-assisted memory recall method, allowing the agent to transform an original query into a cue-rich one via a judiciously designed strategy to help the person recall memories. To this end, there are two key challenges. (1) How to choose the appropriate recall strategy for diverse forgetting scenarios with distinct memory-recall characteristics? (2) How to obtain high-quality responses leveraging recall strategies, given only abstract and sparsely annotated strategy patterns? To address the challenges, we propose a Recall Router framework. Specifically, we design a 5W Recall Map to classify memory queries into five typical scenarios and define fifteen recall strategy patterns across the corresponding scenarios. We then propose a hierarchical recall tree combined with the Monte Carlo Tree Search algorithm to optimize the selection of strategy and the generation of strategy responses. We construct an instruction tuning dataset and fine-tune multiple open-source large language models (LLMs) to develop MemoCue, an agent that excels in providing memory-inspired responses. Experiments on three representative datasets show that MemoCue surpasses LLM-based methods by 17.74% in recall inspiration. Further human evaluation highlights its advantages in memory-recall applications.

[287] Personalized Education with Ranking Alignment Recommendation

Haipeng Liu, Yuxuan Liu, Ting Long

Main category: cs.AI

TL;DR: The paper introduces Ranking Alignment Recommendation (RAR) to improve personalized question recommendation by enhancing exploration efficiency in reinforcement learning.

Details

Motivation: Existing methods struggle with efficient exploration in identifying optimal questions for students during training.

Method: Proposes RAR, integrating collaborative ideas into the exploration mechanism for better efficiency.

Result: RAR improves recommendation performance and is adaptable to any RL-based recommender.

Conclusion: RAR offers a scalable solution for personalized question recommendation, with code publicly available.

Abstract: Personalized question recommendation aims to guide individual students through questions to enhance their mastery of learning targets. Most previous methods model this task as a Markov Decision Process and use reinforcement learning to solve it, but they struggle with efficient exploration, failing to identify the best questions for each student during training. To address this, we propose Ranking Alignment Recommendation (RAR), which incorporates collaborative ideas into the exploration mechanism, enabling more efficient exploration within limited training episodes. Experiments show that RAR effectively improves recommendation performance, and our framework can be applied to any RL-based question recommender. Our code is available at https://github.com/wuming29/RAR.git.

[288] TextQuests: How Good are LLMs at Text-Based Video Games?

Long Phan, Mantas Mazeika, Andy Zou, Dan Hendrycks

Main category: cs.AI

TL;DR: TextQuests is a new benchmark for evaluating AI agents’ long-context reasoning and autonomous problem-solving in exploratory environments, based on interactive fiction games.

Details

Motivation: Existing benchmarks lack the ability to assess AI agents' autonomous reasoning in exploratory, long-horizon tasks.

Method: The benchmark uses text-based interactive fiction games (Infocom suite) to evaluate agents’ intrinsic reasoning without external tools.

Result: TextQuests provides a framework to test AI agents on sustained, self-directed problem-solving in complex environments.

Conclusion: TextQuests aims to advance the development of AI agents with robust long-context reasoning and autonomous capabilities.

Abstract: Evaluating AI agents within complex, interactive environments that mirror real-world challenges is critical for understanding their practical capabilities. While existing agent benchmarks effectively assess skills like tool use or performance on structured tasks, they often do not fully capture an agent’s ability to operate autonomously in exploratory environments that demand sustained, self-directed reasoning over a long and growing context. To spur the development of agents capable of more robust intrinsic reasoning over long horizons, we introduce TextQuests, a benchmark based on the Infocom suite of interactive fiction games. These text-based adventures, which can take human players over 30 hours and require hundreds of precise actions to solve, serve as an effective proxy for evaluating AI agents on focused, stateful tasks. The benchmark is specifically designed to assess an LLM agent’s capacity for self-contained problem-solving by precluding the use of external tools, thereby focusing on intrinsic long-context reasoning capabilities in an exploratory environment characterized by the need for trial-and-error learning and sustained problem-solving within a single interactive session. We release TextQuests at https://textquests.ai.

[289] Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu

Main category: cs.AI

TL;DR: Seed-Prover, a lemma-style whole-proof reasoning model, leverages Lean feedback and self-summarization to refine proofs, achieving high success rates on IMO and Putnam problems. Seed-Geometry addresses Lean’s geometry limitations, contributing to IMO 2025 success.

Details

Motivation: LLMs struggle with theorem proving due to unclear supervision in natural language. Formal verification via Lean provides clear signals for effective training.

Method: Seed-Prover iteratively refines proofs using Lean feedback, proved lemmas, and self-summarization. Three test-time inference strategies enable deep and broad reasoning. Seed-Geometry enhances geometry reasoning.

Result: Seed-Prover proves 78.1% of formalized IMO problems, saturates MiniF2F, and exceeds 50% on PutnamBench. Seed-Geometry outperforms prior formal geometry engines.

Conclusion: The work advances automated mathematical reasoning by combining formal verification with long chain-of-thought reasoning, demonstrated by IMO 2025 performance.

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.

[290] CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

Main category: cs.AI

TL;DR: CoT-Self-Instruct is a synthetic data generation method using Chain-of-Thought reasoning to create high-quality prompts for LLM training, outperforming existing datasets in reasoning and instruction-following tasks.

Details

Motivation: To improve LLM training by generating high-quality synthetic data that enhances reasoning and instruction-following capabilities.

Method: Uses Chain-of-Thought (CoT) reasoning on seed tasks to generate synthetic prompts, followed by filtering with automatic metrics.

Result: Outperforms datasets like s1k and OpenMathReasoning in verifiable reasoning (MATH500, AMC23, etc.) and surpasses human/standard prompts in non-verifiable tasks (AlpacaEval 2.0, Arena-Hard).

Conclusion: CoT-Self-Instruct effectively enhances LLM training by generating superior synthetic data for both reasoning and instruction-following tasks.

Abstract: We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on the given seed tasks, and then to generate a new synthetic prompt of similar quality and complexity for use in LLM training, followed by filtering for high-quality data with automatic metrics. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, across MATH500, AMC23, AIME24 and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of human or standard self-instruct prompts on both AlpacaEval 2.0 and Arena-Hard.

[291] SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model

Mingkai Deng, Jinyu Hou, Yilin Shen, Hongxia Jin, Graham Neubig, Zhiting Hu, Eric Xing

Main category: cs.AI

TL;DR: SimuRA introduces a goal-oriented architecture for generalized agentic reasoning, overcoming autoregressive LLM limitations by using a world model for planning via simulation.

Details

Motivation: Current AI agents focus on one-task-one-agent approaches, lacking scalability and generality, while humans reason by simulating outcomes. SimuRA aims to create a more general and powerful AI agent.

Method: SimuRA uses a world model implemented with LLMs for flexible planning in diverse environments, leveraging natural language’s latent space.

Result: Experiments show SimuRA improves flight search success from 0% to 32.2%, with world-model-based planning outperforming autoregressive planning by up to 124%.

Conclusion: SimuRA demonstrates the potential for training a single, general agent model based on LLMs for superintelligent action across environments, with a web-browsing agent made available for public testing.

Abstract: AI agents built on large language models (LLMs) hold enormous promise, but current practice focuses on a one-task-one-agent approach, which not only falls short of scalability and generality, but also suffers from the fundamental limitations of autoregressive LLMs. On the other hand, humans are general agents who reason by mentally simulating the outcomes of their actions and plans. Moving towards a more general and powerful AI agent, we introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. Based on a principled formulation of an optimal agent in any environment, SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. The generalized world model is implemented using an LLM, which can flexibly plan in a wide range of environments using the concept-rich latent space of natural language. Experiments on difficult web browsing tasks show that SimuRA improves the success of flight search from 0% to 32.2%. World-model-based planning, in particular, shows consistent advantage of up to 124% over autoregressive planning, demonstrating the advantage of world model simulation as a reasoning paradigm. We are excited about the possibility for training a single, general agent model based on LLMs that can act superintelligently in all environments. To start, we make SimuRA, a web-browsing agent built on this architecture with pretrained LLMs, available as a research demo for public testing.

[292] FGeo-HyperGNet: Geometric Problem Solving Integrating FormalGeo Symbolic System and Hypergraph Neural Network

Xiaokai Zhang, Yang Li, Na Zhu, Cheng Qin, Zhenbing Zeng, Tuo Leng

Main category: cs.AI

TL;DR: FGeo-HyperGNet is a neural-symbolic system for geometric problem solving, combining formal reasoning with a hypergraph neural network to achieve state-of-the-art results.

Details

Motivation: Geometric problem solving is a challenge in AI and math; the paper aims to automate human-like reasoning.

Method: Uses a symbolic system (FormalGeo) for relational reasoning and algebraic calculations, and a neural component (HyperGNet) for theorem prediction and hypergraph updates.

Result: Achieved 93.50% TPA and 88.36% PSSR on FormalGeo7K dataset.

Conclusion: The neural-symbolic architecture is effective for readable and traceable geometric problem solving.

Abstract: Geometric problem solving is a long-standing challenge in the fields of mathematical reasoning and artificial intelligence. We built a neural-symbolic system, called FGeo-HyperGNet, to automatically perform human-like geometric problem solving. The symbolic component is a formal system built on FormalGeo, which can automatically perform geometric relational reasoning and algebraic calculations and organize the solution into a hypergraph with conditions as hypernodes and theorems as hyperedges. The neural component, called HyperGNet, is a hypergraph neural network based on the attention mechanism, including an encoder to encode the structural and semantic information of the hypergraph and a theorem predictor to provide guidance in solving problems. The neural component predicts theorems according to the hypergraph, and the symbolic component applies theorems and updates the hypergraph, thus forming a predict-apply cycle to ultimately achieve readable and traceable automatic solving of geometric problems. Experiments demonstrate the effectiveness of this neural-symbolic architecture. We achieved state-of-the-art results with a TPA of 93.50% and a PSSR of 88.36% on the FormalGeo7K dataset. The code is available at https://github.com/BitSecret/HyperGNet.
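
The predict-apply cycle described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the theorem table, the condition strings, and the `predict` function (a rule-based stand-in for the neural HyperGNet predictor) are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical toy theorem table: name -> (premise conditions, derived conditions).
Theorem = Tuple[FrozenSet[str], FrozenSet[str]]
THEOREMS: Dict[str, Theorem] = {
    "isosceles_base_angles": (frozenset({"AB=AC"}), frozenset({"angle_B=angle_C"})),
    "angle_sum": (frozenset({"angle_A=40", "angle_B=angle_C"}), frozenset({"angle_B=70"})),
}


def predict(conditions: Set[str], applied: Set[str]) -> Optional[str]:
    """Stand-in for the neural predictor: first unused theorem whose premises hold."""
    for name, (premises, _) in THEOREMS.items():
        if name not in applied and premises <= conditions:
            return name
    return None


def solve(known: Set[str], goal: str, max_steps: int = 10) -> Tuple[bool, List[str]]:
    """Alternate prediction and symbolic application until the goal is derived."""
    conditions = set(known)   # hypernodes: conditions known so far
    hyperedges = []           # hyperedges: (premises, theorem name, conclusions)
    applied: Set[str] = set()
    for _ in range(max_steps):
        if goal in conditions:
            break
        name = predict(conditions, applied)
        if name is None:      # no applicable theorem left
            break
        premises, conclusions = THEOREMS[name]
        conditions |= conclusions
        hyperedges.append((premises, name, conclusions))
        applied.add(name)
    return goal in conditions, [edge[1] for edge in hyperedges]


solved, trace = solve({"AB=AC", "angle_A=40"}, "angle_B=70")
```

The returned `trace` lists the theorems in the order they were applied, which is one simple way to keep a solution readable and traceable.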

[293] TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling

Cristian Sestito, Shady Agwa, Themis Prodromakis

Main category: cs.AI

TL;DR: TrIM is a novel dataflow for systolic arrays that improves energy efficiency and throughput in CNNs by reducing memory access and data redundancy.

Details

Motivation: The Von Neumann bottleneck and data redundancy in CNNs drive the need for energy-efficient computing paradigms like systolic arrays.

Method: TrIM employs a Triangular Input Movement dataflow to maximize local input utilization and minimize weight data movement, avoiding on-chip memory penalties.

Result: TrIM reduces memory access by ~10X, increases throughput by up to 81.8%, and uses up to 15.6X fewer registers compared to row stationary dataflow.

Conclusion: TrIM offers a promising solution for energy-efficient CNN computation in systolic arrays by addressing data redundancy and memory access issues.

Abstract: To keep pace with the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency by mitigating the Von Neumann bottleneck that relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are susceptible to this bottleneck, given the massive data they have to manage. Systolic arrays (SAs) are promising architectures to mitigate data transmission cost, thanks to high data utilization of Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (such as weight stationary and row stationary), in turn reducing the number of accesses to the main memory. In SAs, convolutions are managed either as matrix multiplications or by exploiting the raster-order scan of sliding windows. However, data redundancy is a primary concern affecting area, power, and energy. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. TrIM maximizes the local input utilization, minimizes the weight data movement, and solves the data redundancy problem. Furthermore, TrIM does not incur the significant on-chip memory penalty introduced by the row stationary dataflow. When compared to state-of-the-art SA dataflows, the high data utilization offered by TrIM guarantees ~10X less memory access. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary), while also requiring a limited number of registers (up to 15.6X fewer registers than row stationary).
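
To make the data redundancy concern concrete, the sketch below counts how many input values are laid out when a convolution is lowered to a matrix multiplication (the common im2col approach), compared with the number of unique input values. It is plain arithmetic under the assumption of a single channel and no padding. It does not model the TrIM dataflow or the paper's analytical model, and the function names are illustrative.

```python
def im2col_input_reads(height: int, width: int, kernel: int, stride: int = 1) -> int:
    """Input values laid out by im2col: one kernel-sized patch per output position."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    return out_h * out_w * kernel * kernel


def redundancy_factor(height: int, width: int, kernel: int, stride: int = 1) -> float:
    """Ratio of im2col reads to unique input values (1.0 means no redundancy)."""
    return im2col_input_reads(height, width, kernel, stride) / (height * width)


reads = im2col_input_reads(8, 8, 3)   # 36 output positions, 9 values each
factor = redundancy_factor(8, 8, 3)   # reads divided by the 64 unique inputs
```

For an 8x8 input and a 3x3 kernel, each input value is laid out about five times on average, which is the kind of repeated movement a dataflow with high local input reuse tries to avoid.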

[294] When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Haidong Xu, Meishan Zhang, Hao Ju, Zhedong Zheng, Erik Cambria, Min Zhang, Hao Fei

Main category: cs.AI

TL;DR: A text-to-expression model for digital humans focuses on emotional dynamics, outperforming baselines with diverse, fluid expressions, supported by the new EmoAva dataset.

Details

Motivation: Digital humans lack rich emotional expressions in current systems, limiting applications in dialogue and gaming.

Method: An end-to-end text-to-expression model learns expressive facial variations in a continuous latent space.

Result: The model outperforms baselines across multiple metrics on both existing datasets and the new EmoAva dataset.

Conclusion: The work advances emotional expression synthesis for digital humans.

Abstract: Enabling digital humans to express rich emotions has significant applications in dialogue systems, gaming, and other interactive scenarios. While recent advances in talking head synthesis have achieved impressive results in lip synchronization, they tend to overlook the rich and dynamic nature of facial expressions. To fill this critical gap, we introduce an end-to-end text-to-expression model that explicitly focuses on emotional dynamics. Our model learns expressive facial variations in a continuous latent space and generates expressions that are diverse, fluid, and emotionally coherent. To support this task, we introduce EmoAva, a large-scale and high-quality dataset containing 15,000 text-3D expression pairs. Extensive experiments on both existing datasets and EmoAva demonstrate that our method significantly outperforms baselines across multiple evaluation metrics, marking a significant advancement in the field.

[295] AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

Haoyu Wang, Christopher M. Poskitt, Jun Sun

Main category: cs.AI

TL;DR: AgentSpec is a domain-specific language for enforcing runtime constraints on LLM agents, ensuring safety across diverse applications with high effectiveness and low overhead.

Details

Motivation: Existing methods for mitigating safety risks in LLM agents lack robustness, interpretability, and adaptability, necessitating a better solution.

Method: AgentSpec allows users to define structured rules (triggers, predicates, enforcement mechanisms) to enforce safety boundaries. It is implemented in domains like code execution, embodied agents, and autonomous driving.

Result: AgentSpec prevents unsafe executions in 90%+ of code agent cases, eliminates hazardous actions in embodied agents, and ensures 100% compliance in AVs. Rule generation by LLMs achieves high precision and recall.

Conclusion: AgentSpec offers a practical, scalable, and efficient solution for LLM agent safety, combining interpretability, modularity, and low computational overhead.

Abstract: Agents built on LLMs are increasingly deployed across diverse domains, automating complex decision-making and task execution. However, their autonomy introduces safety risks, including security vulnerabilities, legal violations, and unintended harmful actions. Existing mitigation methods, such as model-based safeguards and early enforcement strategies, fall short in robustness, interpretability, and adaptability. To address these challenges, we propose AgentSpec, a lightweight domain-specific language for specifying and enforcing runtime constraints on LLM agents. With AgentSpec, users define structured rules that incorporate triggers, predicates, and enforcement mechanisms, ensuring agents operate within predefined safety boundaries. We implement AgentSpec across multiple domains, including code execution, embodied agents, and autonomous driving, demonstrating its adaptability and effectiveness. Our evaluation shows that AgentSpec successfully prevents unsafe executions in over 90% of code agent cases, eliminates all hazardous actions in embodied agent tasks, and enforces 100% compliance by autonomous vehicles (AVs). Despite its strong safety guarantees, AgentSpec remains computationally lightweight, with overheads in milliseconds. By combining interpretability, modularity, and efficiency, AgentSpec provides a practical and scalable solution for enforcing LLM agent safety across diverse applications. We also automate the generation of rules using LLMs and assess their effectiveness. Our evaluation shows that the rules generated by OpenAI o1 achieve a precision of 95.56% and recall of 70.96% for embodied agents, successfully identify 87.26% of the risky code, and prevent AVs from breaking laws in 5 out of 8 scenarios.
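
The rule structure named in this abstract (triggers, predicates, and enforcement mechanisms) can be sketched as a small runtime check. This is a hedged illustration only: it does not use AgentSpec's actual DSL syntax, and the `Rule` class, the `Action` record, and the `block` enforcement are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical action record, e.g. {"tool": "shell", "input": "ls"}.
Action = Dict[str, str]


@dataclass
class Rule:
    name: str
    trigger: str                                # tool name that activates the rule
    predicates: List[Callable[[Action], bool]]  # all must hold for the rule to fire
    enforce: Callable[[Action], Action]         # returns the action to run instead


def block(action: Action) -> Action:
    """Enforcement that replaces the proposed action with a no-op."""
    return {"tool": "noop", "input": f"blocked: {action['input']}"}


def apply_rules(action: Action, rules: List[Rule]) -> Action:
    """Check a proposed action against every rule before it is executed."""
    for rule in rules:
        if rule.trigger == action["tool"] and all(p(action) for p in rule.predicates):
            return rule.enforce(action)
    return action


RULES = [
    Rule(
        name="no_recursive_delete",
        trigger="shell",
        predicates=[lambda a: "rm -rf" in a["input"]],
        enforce=block,
    )
]

safe = apply_rules({"tool": "shell", "input": "ls /tmp"}, RULES)
unsafe = apply_rules({"tool": "shell", "input": "rm -rf /"}, RULES)
```

Because the check runs between the agent's decision and the tool call, the rule set can be changed without retraining or re-prompting the underlying model.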

[296] EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework

Yao Shi, Rongkeng Liang, Yong Xu

Main category: cs.AI

TL;DR: EducationQ is a multi-agent framework to evaluate LLMs’ teaching capabilities, revealing that smaller models can outperform larger ones in pedagogy, emphasizing the need for specialized optimization beyond scaling.

Details

Motivation: Current evaluations of LLMs as educational tools focus on knowledge recall, neglecting interactive pedagogy, which is resource-intensive and context-dependent to assess.

Method: EducationQ uses simulated educational scenarios with specialized agents (teaching, learning, evaluation) to test 14 LLMs across 13 disciplines and 10 difficulty levels. Mixed-methods include quantitative metrics, qualitative analysis, and expert case studies.

Result: Teaching effectiveness doesn’t linearly correlate with model scale or general reasoning; smaller open-source models sometimes outperform larger commercial ones. Human experts agreed 78% with automated qualitative analysis.

Conclusion: LLMs-as-teachers need specialized pedagogical optimization, not just scaling, suggesting future AI education tools should focus on targeted enhancements for teaching effectiveness.

Abstract: Large language models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI Organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI prioritize targeted enhancement of specific pedagogical effectiveness.

[297] Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining

Yu Shi, Yitong Duan, Jian Li

Main category: cs.AI

TL;DR: A novel framework combining LLMs and MCTS for automated alpha factor mining, outperforming traditional methods in predictive accuracy and interpretability.

Details

Motivation: Traditional alpha mining methods are inefficient or yield opaque results; this paper aims to improve search efficiency and interpretability.

Method: Integrates LLMs with MCTS, using financial backtesting feedback and a subtree avoidance mechanism for diverse exploration.

Result: Outperforms existing methods in predictive accuracy and trading performance, with more interpretable formulas.

Conclusion: Establishes a more effective and efficient paradigm for formulaic alpha mining.

Abstract: Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often struggle with search inefficiency or yield alpha factors that are difficult to interpret. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our framework leverages the LLM’s instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to enhance search diversity and prevent formulaic homogenization, further improving performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy and trading performance. The resulting formulas are also more amenable to human interpretation, establishing a more effective and efficient paradigm for formulaic alpha mining.
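
As a rough illustration of how quantitative feedback can steer tree search, the sketch below uses the standard UCT selection rule with backtest scores as rewards. This is an assumption made for illustration: the abstract does not state which selection rule the framework uses, and the `Node` class, the formula strings, and the exploration constant `c` are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    formula: str                 # candidate alpha formula at this node
    visits: int = 0
    total_reward: float = 0.0    # sum of backtest scores propagated through this node
    children: List["Node"] = field(default_factory=list)


def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    """Standard UCT: mean reward plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return math.inf          # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    return max(parent.children, key=lambda child: uct_score(parent, child))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one backtest score to every node on the path from root to leaf."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


root = Node("root", visits=10, total_reward=4.8)
a = Node("rank(close / delay(close, 5))", visits=8, total_reward=4.0)
b = Node("rank(volume / mean(volume, 20))", visits=2, total_reward=0.8)
root.children = [a, b]
chosen = select_child(root)
```

In this example the less-visited node `b` is selected despite its lower mean reward, because its exploration bonus is larger.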

[298] Enhancing AI System Resiliency: Formulation and Guarantee for LSTM Resilience Based on Control Theory

Sota Yoshihara, Ryosuke Yamamoto, Hiroyuki Kusumoto, Masanari Shimura

Main category: cs.AI

TL;DR: The paper introduces a resilience metric called “recovery time” for LSTM networks in control systems, derives a data-independent upper bound for it, and validates the approach experimentally.

Details

Motivation: To ensure the resilience of LSTM networks in safety-critical AI applications by quantifying and controlling their recovery time after anomalous inputs.

Method: The authors refine incremental input-to-state stability theory for LSTM networks to derive a data-independent upper bound on recovery time, enabling resilience-aware training.

Result: Experimental validation shows the effectiveness of the resilience estimation and control methods.

Conclusion: The framework provides a foundation for rigorous quality assurance in safety-critical AI applications using LSTM networks.

Abstract: This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce “recovery time” as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound gives us resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, enhancing a foundation for rigorous quality assurance in safety-critical AI applications.
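
The notion of recovery time can be illustrated on a toy system. The sketch below replaces the LSTM with a scalar linear contraction x[t+1] = a * x[t] + u[t], which is an assumption made purely for illustration, and compares the measured recovery time after an anomalous input with a simple input-independent bound for that toy system. It is not the paper's bound for LSTMs, and all function names are hypothetical.

```python
import math
from typing import List, Optional


def simulate(a: float, inputs: List[float], x0: float = 0.0) -> List[float]:
    """Roll out x[t+1] = a * x[t] + u[t]; returns the states x[0..T]."""
    xs = [x0]
    for u in inputs:
        xs.append(a * xs[-1] + u)
    return xs


def measured_recovery_time(nominal: List[float], perturbed: List[float],
                           anomaly_end: int, eps: float) -> Optional[int]:
    """Steps after anomaly_end until the two trajectories stay within eps."""
    gaps = [abs(p - n) for p, n in zip(perturbed, nominal)]
    for k in range(len(gaps) - anomaly_end):
        if all(g <= eps for g in gaps[anomaly_end + k:]):
            return k
    return None


def recovery_bound(a: float, gap_at_end: float, eps: float) -> int:
    """Bound for the toy system only: smallest k with |a|**k * gap_at_end <= eps."""
    if gap_at_end <= eps:
        return 0
    return math.ceil(math.log(eps / gap_at_end) / math.log(abs(a)))


a = 0.5
nominal_inputs = [0.0] * 30
anomalous_inputs = list(nominal_inputs)
anomalous_inputs[4] = 1.0                     # single anomalous input at step 4

nominal = simulate(a, nominal_inputs)
perturbed = simulate(a, anomalous_inputs)
measured = measured_recovery_time(nominal, perturbed, anomaly_end=5, eps=0.01)
bound = recovery_bound(a, gap_at_end=1.0, eps=0.01)
```

The bound depends only on the contraction rate `a`, the worst-case gap when the anomaly ends, and the tolerance `eps`, not on the particular input sequence, which mirrors the data-independent flavor of the result described above.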

[299] Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification

Qisheng Hu, Quanyu Long, Wenya Wang

Main category: cs.AI

TL;DR: HARIS is a hierarchical agent system for multi-hop claim verification, combining reasoning-driven searching and search-informed reasoning to improve accuracy and interpretability.

Details

Motivation: Multi-hop claim verification is complex, requiring dynamic reasoning and iterative information retrieval, which are interleaved processes.

Method: HARIS uses a high-level reasoning agent and a low-level search agent, trained with reinforcement learning, to specialize in reasoning and information retrieval.

Result: HARIS achieves strong performance on EX-FEVER and HOVER benchmarks, advancing multi-hop claim verification.

Conclusion: HARIS effectively models the interplay between reasoning and search, enhancing verification accuracy and interpretability.

Abstract: Multi-hop claim verification is inherently challenging, requiring multi-step reasoning to construct verification chains while iteratively searching for information to uncover hidden bridging facts. This process is fundamentally interleaved, as effective reasoning relies on dynamically retrieved evidence, while effective search demands reasoning to refine queries based on partial information. To achieve this, we propose Hierarchical Agent Reasoning and Information Search (HARIS), explicitly modeling the coordinated process of reasoning-driven searching and search-informed reasoning. HARIS consists of a high-level reasoning agent that focuses on constructing the main verification chain, generating factual questions when more information is needed, and a low-level search agent that iteratively retrieves more information, refining its search based on intermediate findings. This design allows each agent to specialize in its respective task, enhancing verification accuracy and interpretability. HARIS is trained using reinforcement learning with outcome-based rewards. Experimental results on the EX-FEVER and HOVER benchmarks demonstrate that HARIS achieves strong performance, greatly advancing multi-hop claim verification.

[300] Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yumeng Wang, Yi R. Fung

Main category: cs.AI

TL;DR: The paper introduces the RFMDataset to evaluate large reasoning models’ performance on mathematical proofs, revealing significant shortcomings and diverse error types.

Details

Motivation: To expose hidden reasoning failures in large models, which are often masked by high accuracy on numerical evaluations and potential benchmark leakage.

Method: Created the RFMDataset with 200 diverse proof problems and analyzed models’ performance, identifying 10 fine-grained error types.

Result: Models struggle with proofs (some <20% correct), exhibit diverse reasoning failures, and show hallucination/incompleteness.

Conclusion: Current models lack rigorous reasoning; formalized, fine-grained logical training is needed.

Abstract: Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, the reliance on purely numerical evaluation, and potential benchmark leakage often mask their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models’ performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which show fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models’ self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

[301] DrugMCTS: a drug repurposing framework combining multi-agent, RAG and Monte Carlo Tree Search

Zerui Yang, Yuwei Wan, Siyu Yan, Yudai Matsuda, Tong Xie, Bram Hoex, Linqi Song

Main category: cs.AI

TL;DR: DrugMCTS integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search to enhance drug repositioning, outperforming traditional LLMs and deep learning methods.

Details

Motivation: Current LLMs struggle with reasoning beyond pretrained knowledge, and existing methods like fine-tuning or RAG are limited in computational efficiency or data utilization.

Method: DrugMCTS combines RAG, multi-agent collaboration, and Monte Carlo Tree Search, using five specialized agents for structured reasoning.

Result: Experiments on DrugBank and KIBA show DrugMCTS achieves higher recall and robustness than general-purpose LLMs and deep learning baselines.

Conclusion: Structured reasoning, agent collaboration, and feedback-driven search are key to advancing LLM applications in drug repositioning.

Abstract: Recent advances in large language models have demonstrated considerable potential in scientific domains such as drug repositioning. However, their effectiveness remains constrained when reasoning extends beyond the knowledge acquired during pretraining. Conventional approaches, such as fine-tuning or retrieval-augmented generation, face limitations in either imposing high computational overhead or failing to fully exploit structured scientific data. To overcome these challenges, we propose DrugMCTS, a novel framework that synergistically integrates RAG, multi-agent collaboration, and Monte Carlo Tree Search for drug repositioning. The framework employs five specialized agents tasked with retrieving and analyzing molecular and protein information, thereby enabling structured and iterative reasoning. Extensive experiments on the DrugBank and KIBA datasets demonstrate that DrugMCTS achieves substantially higher recall and robustness compared to both general-purpose LLMs and deep learning baselines. Our results highlight the importance of structured reasoning, agent-based collaboration, and feedback-driven search mechanisms in advancing LLM applications for drug repositioning.

[302] AI Should Sense Better, Not Just Scale Bigger: Adaptive Sensing as a Paradigm Shift

Eunsu Baek, Keondo Park, Jeonggil Ko, Min-hwan Oh, Taesik Gong, Hyung-Sin Kim

Main category: cs.AI

TL;DR: The paper proposes adaptive sensing as a sustainable alternative to scaling AI models, showing its efficiency and potential across various applications.

Details

Motivation: Current AI scaling methods are costly and unsustainable, prompting a need for bio-inspired adaptive sensing.

Method: Adaptive sensing dynamically adjusts sensor parameters (e.g., exposure, sensitivity) to improve efficiency and robustness.

Result: Small models with adaptive sensing outperform larger models, demonstrating significant efficiency gains.

Conclusion: The paper advocates for integrating adaptive sensing into AI systems, addressing technical and ethical challenges for sustainable and equitable AI.

Abstract: Current AI advances largely rely on scaling neural models and expanding training datasets to achieve generalization and robustness. Despite notable successes, this paradigm incurs significant environmental, economic, and ethical costs, limiting sustainability and equitable access. Inspired by biological sensory systems, where adaptation occurs dynamically at the input (e.g., adjusting pupil size, refocusing vision), we advocate for adaptive sensing as a necessary and foundational shift. Adaptive sensing proactively modulates sensor parameters (e.g., exposure, sensitivity, multimodal configurations) at the input level, significantly mitigating covariate shifts and improving efficiency. Empirical evidence from recent studies demonstrates that adaptive sensing enables small models (e.g., EfficientNet-B0) to surpass substantially larger models (e.g., OpenCLIP-H) trained with significantly more data and compute. We (i) outline a roadmap for broadly integrating adaptive sensing into real-world applications spanning humanoid, healthcare, autonomous systems, agriculture, and environmental monitoring, (ii) critically assess technical and ethical integration challenges, and (iii) propose targeted research directions, such as standardized benchmarks, real-time adaptive algorithms, multimodal integration, and privacy-preserving methods. Collectively, these efforts aim to transition the AI community toward sustainable, robust, and equitable artificial intelligence systems.

[303] MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

Shouyi Lu, Zihan Lin, Chao Lu, Huanran Wang, Guirong Zhuo, Lianqing Zheng

Main category: cs.AI

TL;DR: MultiEditor is a dual-branch latent diffusion framework for editing images and LiDAR point clouds in driving scenarios, improving cross-modality consistency and rare-category vehicle detection.

Details

Motivation: Addressing the challenge of long-tailed data distribution in autonomous driving, especially for rare but safety-critical vehicle categories.

Method: Uses 3D Gaussian Splatting (3DGS) as a prior, with multi-level appearance control and depth-guided deformable cross-modality conditioning.

Result: Achieves high visual/geometric fidelity, editing controllability, and cross-modality consistency, enhancing detection accuracy for rare classes.

Conclusion: MultiEditor effectively improves multimodal data editing and perception model performance for underrepresented vehicle categories.

Abstract: Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to jointly edit images and LiDAR point clouds in driving scenarios. At the core of our approach is the introduction of 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism (comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement) to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.

[304] Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper introduces Tiny-BioMoE, a lightweight pretrained model for biosignal analysis, aimed at improving automatic pain assessment through multimodal physiological signals.

Details

Motivation: Accurate pain assessment is crucial for patient care and management. Current systems lack continuous monitoring and objectivity, which Tiny-BioMoE addresses by leveraging physiological signals.

Method: The study proposes Tiny-BioMoE, a model trained on 4.4 million biosignal image representations with 7.3 million parameters, for extracting high-quality embeddings for pain recognition tasks.

Result: Experiments show the model’s effectiveness across diverse biosignal modalities (e.g., electrodermal activity, blood volume pulse) in pain recognition.

Conclusion: Tiny-BioMoE offers a scalable and efficient solution for automatic pain assessment, with its architecture and weights publicly available.

Abstract: Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person’s state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed approach introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis. Trained on 4.4 million biosignal image representations and consisting of only 7.3 million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model’s effectiveness across diverse modalities in automatic pain recognition tasks. The model’s architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.

[305] Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper proposes a pipeline using electrodermal activity signals for automatic pain assessment, achieving results comparable or superior to traditional methods.

Details

Motivation: Reliable pain assessment is crucial for effective management and reducing distress. Automated systems can provide continuous, objective monitoring.

Method: The method uses electrodermal activity signals, creating multiple representations visualized in a single diagram, with various processing and filtering techniques.

Result: The approach consistently matches or outperforms traditional fusion methods, proving robust for integrating signal representations.

Conclusion: The proposed pipeline is a viable alternative for pain assessment, offering accuracy and versatility in signal integration.

Abstract: Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual’s condition. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.

[306] Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper proposes a respiration-based pain assessment method using a cross-attention transformer and multi-windowing strategy, showing strong performance with efficient models.

Details

Motivation: Pain assessment is critical for effective management, and automated systems can aid continuous monitoring and clinical decisions.

Method: A pipeline using respiration signals, a cross-attention transformer, and multi-windowing to capture short/long-term and global features.

Result: Respiration is effective for pain assessment; optimized compact models outperform larger ones. Multi-windowing enhances feature representation.

Conclusion: The method demonstrates the viability of respiration and efficient models for accurate pain assessment, improving clinical support.

Abstract: Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model’s representational capacity.
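
To make the multi-windowing idea concrete, here is a minimal sketch that splits one signal into short windows, long windows, and a single global view. The window sizes, the non-overlapping stride, and the function names are hypothetical; the abstract does not specify how the pipeline segments the respiration signal.

```python
def sliding_windows(signal, size, step):
    """Split a sequence into fixed-size windows taken every `step` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def multi_window_views(signal, short_size, long_size):
    """Return short windows, long windows, and the full signal as three views."""
    return {
        "short": sliding_windows(signal, short_size, short_size),
        "long": sliding_windows(signal, long_size, long_size),
        "global": [list(signal)],
    }

# A stand-in signal of 12 samples; real respiration data would be much longer.
signal = list(range(12))
views = multi_window_views(signal, short_size=3, long_size=6)
```

Each view could then be embedded separately and fused, for example through cross-attention, although that fusion step is not sketched here.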

[307] LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang

Main category: cs.AI

TL;DR: The paper introduces LLM-Crowdsourced, a benchmark-free evaluation method for LLMs, addressing issues like data contamination and subjectivity. It uses LLMs to generate, answer, and evaluate questions, ensuring dynamic, transparent, objective, and professional criteria.

Details

Motivation: Existing LLM evaluation methods are flawed due to data contamination, black-box operation, and subjective preferences, hindering comprehensive capability assessment.

Method: Proposes LLM-Crowdsourced, where LLMs generate questions, answer independently, and evaluate each other, integrating dynamic, transparent, objective, and professional criteria.

Result: Tests on eight LLMs in math and programming show the method’s effectiveness in distinguishing performance. Novel findings include Gemini’s superior question-design ability and the tendency of some LLMs to answer from memorization.

Conclusion: LLM-Crowdsourced offers a robust, comprehensive evaluation method, uncovering insights traditional methods miss, and demonstrating high consistency in results.

Abstract: Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs’ true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit “memorization-based answering” by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).
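
One way to picture the mutual-evaluation step is as a score matrix in which every model grades every other model’s answer. The sketch below averages the scores each model receives from its peers. Excluding self-assigned scores, the 0 to 10 scale, and the model names are assumptions for illustration and are not taken from the paper.

```python
def mutual_scores(scores):
    """Average the scores each model receives from the other models.

    `scores[evaluator][answerer]` is the grade one model gives another.
    Self-assigned grades are ignored (an assumption of this sketch).
    """
    result = {}
    for answerer in scores:
        received = [
            scores[evaluator][answerer]
            for evaluator in scores
            if evaluator != answerer and answerer in scores[evaluator]
        ]
        result[answerer] = sum(received) / len(received) if received else None
    return result

# Made-up grades from three hypothetical models on a single question.
scores = {
    "model_a": {"model_a": 10, "model_b": 6, "model_c": 8},
    "model_b": {"model_a": 7, "model_b": 10, "model_c": 9},
    "model_c": {"model_a": 9, "model_b": 5, "model_c": 10},
}
peer_average = mutual_scores(scores)
ranking = sorted(peer_average.items(), key=lambda kv: kv[1], reverse=True)
```

Dropping the diagonal keeps a model from inflating its own rank, which is one simple way to approach the “objective” criterion the abstract lists.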

[308] Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Hugo Garrido-Lestache, Jeremy Kedziora

Main category: cs.AI

TL;DR: TAAC is a reinforcement learning algorithm for multi-agent collaboration, using attention mechanisms and a penalized loss function to improve teamwork. It outperforms benchmarks in simulated soccer.

Details

Motivation: Enhancing multi-agent collaboration in cooperative environments by addressing the challenges of joint-action spaces and role diversity.

Method: Centralized Training/Centralized Execution with multi-headed attention in actor and critic, plus a penalized loss function for role diversity.

Result: TAAC outperforms benchmarks (PPO, MAAC) in win rates, goal differentials, Elo ratings, and collaborative behaviors.

Conclusion: TAAC effectively improves multi-agent collaboration and performance in cooperative tasks.

Abstract: This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).
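
The querying of teammates described above can be illustrated with single-head scaled dot-product attention, in which one agent’s query is compared against every teammate’s key and the teammates’ values are mixed accordingly. This is a generic sketch: TAAC uses multi-headed attention with learned projections, which are omitted here, and all vectors below are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one querying agent."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed

# One agent querying two teammates whose keys happen to be identical.
weights, mixed = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 0.0], [0.0, 2.0]],
)
```

Because the cost grows with the number of teammates rather than with the size of the joint-action space, attention of this kind is a common way to keep multi-agent policies tractable.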

cs.SD

[309] Balancing Information Preservation and Disentanglement in Self-Supervised Music Representation Learning

Julia Wilkins, Sivan Ding, Magdalena Fuentes, Juan Pablo Bello

Main category: cs.SD

TL;DR: A multi-view SSL framework combines contrastive and reconstructive objectives to disentangle music audio representations, balancing fidelity and semantics.

Details

Motivation: To explore the interaction between contrastive and reconstructive SSL paradigms in a unified framework for music audio.

Method: Proposes a multi-view SSL framework combining contrastive and reconstructive objectives to disentangle music representations.

Result: Effective combination of strategies enables disentanglement of music attributes without losing information integrity.

Conclusion: Contrastive and reconstructive strategies complement each other in disentangling music audio representations.

Abstract: Recent advances in self-supervised learning (SSL) methods offer a range of strategies for capturing useful representations from music audio without the need for labeled data. While some techniques focus on preserving comprehensive details through reconstruction, others favor semantic structure via contrastive objectives. Few works examine the interaction between these paradigms in a unified SSL framework. In this work, we propose a multi-view SSL framework for disentangling music audio representations that combines contrastive and reconstructive objectives. The architecture is designed to promote both information fidelity and structured semantics of factors in disentangled subspaces. We perform an extensive evaluation on the design choices of contrastive strategies using music audio representations in a controlled setting. We find that while reconstruction and contrastive strategies exhibit consistent trade-offs, when combined effectively, they complement each other; this enables the disentanglement of music attributes without compromising information integrity.
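
A combined objective of this kind is often written as a contrastive term plus a weighted reconstruction term. The sketch below uses InfoNCE and mean squared error as stand-ins. The abstract does not name the specific losses, the weighting, or the temperature, so all of these are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive term: negative log-softmax of the positive pair's similarity."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def mse(x, x_hat):
    """Reconstruction term: mean squared error between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def combined_loss(anchor, positive, negatives, x, x_hat, weight=1.0):
    """Contrastive loss plus `weight` times the reconstruction loss."""
    return info_nce(anchor, positive, negatives) + weight * mse(x, x_hat)

# Toy embeddings: the positive matches the anchor, the negative is orthogonal.
loss = combined_loss(
    anchor=[1.0, 0.0], positive=[1.0, 0.0], negatives=[[0.0, 1.0]],
    x=[0.2, 0.4], x_hat=[0.2, 0.4], weight=0.5,
)
```

The `weight` parameter is where the trade-off the abstract mentions would show up: a larger value favors information fidelity, a smaller one favors semantic structure.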

[310] “I made this (sort of)”: Negotiating authorship, confronting fraudulence, and exploring new musical spaces with prompt-based AI music generation

Bob L. T. Sturm

Main category: cs.SD

TL;DR: The author reflects on creating two AI-generated music albums, exploring authorship, identity, and AI’s role in music, using an LLM for self-interview and deeper analysis.

Details

Motivation: To explore the creative and philosophical implications of using AI for music generation, questioning authorship, identity, and the nature of AI’s influence on artistic expression.

Method: Created two albums using prompt-based AI music platforms, then used an LLM to interview the author, analyzing the process and outcomes.

Result: The albums and LLM interview raised questions about authorship, identity, and AI’s limitations in generating ‘unpolished’ music, revealing new artistic possibilities.

Conclusion: AI-mediated creativity challenges traditional notions of authorship and identity, opening new musical spaces and prompting deeper self-reflection.

Abstract: I reflect on my experience creating two music albums centered on state-of-the-art prompt-based AI music generation platforms. The first album explicitly poses the question: What happens when I collide my junk mail with these platforms? The second album is a direct response to the first, and toys with the inability of state-of-the-art prompt-based AI music generation platforms to generate music that is not “practiced”, “polished”, and “produced”. I seed a large language model (LLM) with information about these albums and have it interview me, which results in the exploration of several deeper questions: To what extent am I the author? Where am I in the resulting music? How is my musical identity changing as I am faced with machines that are in some ways far more talented than I? What new musical spaces does my work open, for me or anyone/thing else? I conclude by reflecting on my reflections, as well as LLM-mediated self-reflection as method.

[311] Identifying Hearing Difficulty Moments in Conversational Audio

Jack Collins, Adrian Buzea, Chris Collier, Alejandro Ballesta Rosen, Julian Maclaren, Richard F. Lyon, Simon Carlile

Main category: cs.SD

TL;DR: Machine learning models, especially audio language models, outperform traditional methods in detecting hearing difficulty moments in conversations.

Details

Motivation: Timely interventions in hearing assistive technology require identifying moments of hearing difficulty in real-time.

Method: Proposed and compared machine learning solutions, including audio language models, ASR hotword heuristic, and Wav2Vec fine-tuning.

Result: Audio language models significantly outperformed the ASR hotword heuristic and Wav2Vec fine-tuning.

Conclusion: Audio language models are highly effective for detecting hearing difficulty moments, offering superior performance over conventional methods.

Abstract: Individuals regularly experience Hearing Difficulty Moments in everyday conversation. Identifying these moments of hearing difficulty has particular significance in the field of hearing assistive technology where timely interventions are key for realtime hearing assistance. In this paper, we propose and compare machine learning solutions for continuously detecting utterances that identify these specific moments in conversational audio. We show that audio language models, through their multimodal reasoning capabilities, excel at this task, significantly outperforming a simple ASR hotword heuristic and a more conventional fine-tuning approach with Wav2Vec, an audio-only input architecture that is state-of-the-art for automatic speech recognition (ASR).

[312] DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv

Main category: cs.SD

TL;DR: The paper introduces DMF2Mel, a Dynamic Multiscale Fusion Network, to improve mel spectrogram reconstruction in imagined speech decoding, achieving significant performance gains over baselines.

Details

Motivation: Existing methods struggle with balancing temporal dependency modeling and information retention in long-sequence decoding for imagined speech.

Method: Proposes DMF2Mel with four components: DC-FAM for feature separation, HAMS-Net for cross-scale fusion, SplineMap attention for context modeling, and convMamba for long-range dependencies.

Result: DMF2Mel improves Pearson correlation coefficients by 48% for known subjects and 35% for unknown subjects on the SparrKULee dataset.

Conclusion: DMF2Mel effectively addresses challenges in imagined speech decoding, demonstrating superior performance in mel spectrogram reconstruction.

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related “foreground features” from noisy “background features” through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: https://github.com/fchest/DMF2Mel.
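
To put the reported numbers in context, the sketch below computes a Pearson correlation coefficient and backs out the baseline score implied by each improvement figure. It assumes the percentages are relative improvements (new = baseline × (1 + improvement)); the abstract does not say so explicitly, and the implied baselines are derived here rather than quoted from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def implied_baseline(score, relative_improvement):
    """Baseline score implied by a relative improvement (an assumed reading)."""
    return score / (1.0 + relative_improvement)

known_baseline = implied_baseline(0.074, 0.48)    # about 0.050
unknown_baseline = implied_baseline(0.048, 0.35)  # about 0.036
```

Under that reading, the absolute gains are roughly 0.024 and 0.012 in correlation, which is a useful reminder that large relative improvements can sit on small absolute values.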

cs.LG

[313] Neural Autoregressive Modeling of Brain Aging

Ridvan Yesiloglu, Wei Peng, Md Tauhidul Islam, Ehsan Adeli

Main category: cs.LG

TL;DR: NeuroAR, a generative autoregressive transformer model, synthesizes brain aging by predicting future MRI scans from past ones, outperforming state-of-the-art models like LDM and GANs in fidelity and realism.

Details

Motivation: Brain aging synthesis is crucial for clinical and computational neuroscience but faces challenges like high-dimensional data and subtle structural changes.

Method: NeuroAR uses autoregressive transformers to estimate future scan token maps from past and future scan embeddings, guided by cross-attention with age data.

Result: NeuroAR outperforms LDM and GANs in image fidelity and realism, validated by a pre-trained age predictor.

Conclusion: NeuroAR effectively models subject-specific brain aging trajectories with high fidelity, advancing brain aging synthesis.

Abstract: Brain aging synthesis is a critical task with broad applications in clinical and computational neuroscience. The ability to predict the future structural evolution of a subject’s brain from an earlier MRI scan provides valuable insights into aging trajectories. Yet, the high-dimensionality of data, subtle changes of structure across ages, and subject-specific patterns constitute challenges in the synthesis of the aging brain. To overcome these challenges, we propose NeuroAR, a novel brain aging simulation model based on generative autoregressive transformers. NeuroAR synthesizes the aging brain by autoregressively estimating the discrete token maps of a future scan from a convenient space of concatenated token embeddings of a previous and future scan. To guide the generation, it concatenates into each scale the subject’s previous scan, and uses its acquisition age and the target age at each block via cross-attention. We evaluate our approach on both the elderly population and adolescent subjects, demonstrating superior performance over state-of-the-art generative models, including latent diffusion models (LDM) and generative adversarial networks, in terms of image fidelity. Furthermore, we employ a pre-trained age predictor to further validate the consistency and realism of the synthesized images with respect to expected aging patterns. NeuroAR significantly outperforms key models, including LDM, demonstrating its ability to model subject-specific brain aging trajectories with high fidelity.

[314] LLM-Assisted Cheating Detection in Korean Language via Keystrokes

Dong Hyun Roh, Rajesh Kumar, An Ngo

Main category: cs.LG

TL;DR: A keystroke-based framework detects LLM-assisted cheating in Korean, using temporal and rhythmic features to distinguish between bona fide writing, paraphrasing, and transcribing ChatGPT responses. Models outperform humans in detection.

Details

Motivation: Address gaps in language coverage, cognitive context, and granularity of LLM involvement in detecting cheating.

Method: Dataset of 69 participants writing under three conditions (bona fide, paraphrasing, transcribing ChatGPT). Features extracted and evaluated under Cognition-Aware and Cognition-Unaware settings.

Result: Temporal features excel in Cognition-Aware scenarios; rhythmic features generalize better. Models outperform humans, especially in detecting paraphrased responses.

Conclusion: Keystroke dynamics reliably detect LLM-assisted writing across cognitive demands and strategies.

Abstract: This paper presents a keystroke-based framework for detecting LLM-assisted cheating in Korean, addressing key gaps in prior research regarding language coverage, cognitive context, and the granularity of LLM involvement. Our proposed dataset includes 69 participants who completed writing tasks under three conditions: Bona fide writing, paraphrasing ChatGPT responses, and transcribing ChatGPT responses. Each task spans six cognitive processes defined in Bloom’s Taxonomy (remember, understand, apply, analyze, evaluate, and create). We extract interpretable temporal and rhythmic features and evaluate multiple classifiers under both Cognition-Aware and Cognition-Unaware settings. Temporal features perform well under Cognition-Aware evaluation scenarios, while rhythmic features generalize better under cross-cognition scenarios. Moreover, detecting bona fide and transcribed responses was easier than paraphrased ones for both the proposed models and human evaluators, with the models significantly outperforming the humans. Our findings affirm that keystroke dynamics facilitate reliable detection of LLM-assisted writing across varying cognitive demands and writing strategies, including paraphrasing and transcribing LLM-generated responses.
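
As an illustration of what temporal keystroke features can look like, the sketch below derives hold times (key press to release) and flight times (release to next press) from a short event log. These are standard keystroke-dynamics quantities, but the abstract does not list the exact features used, so the feature names, the event format, and the sample timings are hypothetical.

```python
def keystroke_features(events):
    """Summarise timing features from keystroke events.

    `events` is a list of (key, press_time_ms, release_time_ms) tuples,
    ordered by press time.
    """
    hold = [release - press for _, press, release in events]
    flight = [
        events[i + 1][1] - events[i][2]  # next press minus current release
        for i in range(len(events) - 1)
    ]
    return {
        "mean_hold_ms": sum(hold) / len(hold),
        "mean_flight_ms": sum(flight) / len(flight) if flight else 0.0,
        "max_pause_ms": max(flight) if flight else 0.0,
    }

# Made-up events: two quick keys followed by a long pause before the third.
events = [("h", 0, 80), ("i", 150, 220), ("!", 900, 990)]
features = keystroke_features(events)
```

Long pauses of the kind captured by `max_pause_ms` are the sort of signal that might separate composing text from transcribing it, though which features actually discriminate is an empirical question the paper addresses.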

[315] Scientific Machine Learning with Kolmogorov-Arnold Networks

Salah A. Faroughi, Farinaz Mostajeran, Amin Hamed Mashhadzadeh, Shirko Faroughi

Main category: cs.LG

TL;DR: The paper reviews the shift from MLPs to KANs in scientific machine learning, highlighting KANs’ advantages in interpretability, flexibility, and performance. It categorizes progress in KAN-based models and identifies future research challenges.

Details

Motivation: The limitations of MLPs (e.g., poor interpretability, fixed activation functions) motivate the adoption of KANs for better modeling of complex nonlinear interactions.

Method: The review examines KAN-based models from three perspectives: data-driven learning, physics-informed modeling, and deep operator learning, focusing on design, training, and application.

Result: KANs outperform MLPs in accuracy, convergence, and spectral representation, demonstrating superior capability in capturing complex dynamics.

Conclusion: The paper identifies challenges in KAN development (e.g., computational efficiency, theoretical guarantees) and suggests future research directions for robustness and scalability.

Abstract: The field of scientific machine learning, which originally utilized multilayer perceptrons (MLPs), is increasingly adopting Kolmogorov-Arnold Networks (KANs) for data encoding. This shift is driven by the limitations of MLPs, including poor interpretability, fixed activation functions, and difficulty capturing localized or high-frequency features. KANs address these issues with enhanced interpretability and flexibility, enabling more efficient modeling of complex nonlinear interactions and effectively overcoming the constraints associated with conventional MLP architectures. This review categorizes recent progress in KAN-based models across three distinct perspectives: (i) data-driven learning, (ii) physics-informed modeling, and (iii) deep operator learning. Each perspective is examined through the lens of architectural design, training strategies, application efficacy, and comparative evaluation against MLP-based counterparts. By benchmarking KANs against MLPs, we highlight consistent improvements in accuracy, convergence, and spectral representation, clarifying KANs’ advantages in capturing complex dynamics while learning more effectively. Finally, this review identifies critical challenges and open research questions in KAN development, particularly regarding computational efficiency, theoretical guarantees, hyperparameter tuning, and algorithm complexity. We also outline future research directions aimed at improving the robustness, scalability, and physical consistency of KAN-based frameworks.

[316] Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods

Siwoo Park

Main category: cs.LG

TL;DR: The paper explores inverse capabilities of multimodal latent spaces in AI models, revealing their limitations in producing coherent and meaningful inverse mappings despite optimization efforts.

Details

Motivation: To investigate whether multimodal latent spaces, optimized for forward tasks, can support semantically meaningful inverse mappings.

Method: An optimization-based framework is applied bidirectionally across Text-Image and Text-Audio modalities to infer input characteristics from outputs.

Result: Optimization produces textually aligned outputs but lacks perceptual coherence and semantic interpretability in inverse mappings.

Conclusion: Multimodal latent spaces lack inherent structure for robust inverse mappings, highlighting the need for further research into invertible spaces.

Abstract: This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation: multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.

[317] Multi-Hazard Early Warning Systems for Agriculture with Featural-Temporal Explanations

Boyuan Zheng, Victor W. Chu

Main category: cs.LG

TL;DR: A multi-hazard forecasting framework for agriculture using deep learning and XAI, validated on US meteorological data, shows strong predictive accuracy and explainability.

Details

Motivation: Addressing the limitations of traditional single-hazard forecasting methods in capturing complex climate interactions, exacerbated by climate change.

Method: Combines sequential deep learning (BiLSTM) with XAI (TimeSHAP) for multi-hazard forecasting, using meteorological data from US agricultural regions (2010-2023).

Result: Demonstrates strong predictive accuracy, especially with BiLSTM, and provides temporal explanations for climatic feature impacts.

Conclusion: Advances explainability and applicability of multi-hazard EWS, aiding proactive risk management in agriculture.

Abstract: Climate extremes present escalating risks to agriculture, intensifying the need for reliable multi-hazard early warning systems (EWS). The situation is evolving due to climate change, and hence such systems should have the intelligence to continue learning from recent climate behaviours. However, traditional single-hazard forecasting methods fall short in capturing complex interactions among concurrent climatic events. To address this deficiency, in this paper, we combine sequential deep learning models and advanced Explainable Artificial Intelligence (XAI) techniques to introduce a multi-hazard forecasting framework for agriculture. In our experiments, we utilize meteorological data from four prominent agricultural regions in the United States (between 2010 and 2023) to validate the predictive accuracy of our framework on multiple severe event types, which are extreme cold, floods, frost, hail, heatwaves, and heavy rainfall, with tailored models for each area. The framework uniquely integrates attention mechanisms with TimeSHAP (a recurrent XAI explainer for time series) to provide comprehensive temporal explanations revealing not only which climatic features are influential but precisely when their impacts occur. Our results demonstrate strong predictive accuracy, particularly with the BiLSTM architecture, and highlight the system’s capacity to inform nuanced, proactive risk management strategies. This research significantly advances the explainability and applicability of multi-hazard EWS, fostering interdisciplinary trust and effective decision-making for climate risk management in the agricultural industry.

[318] FedCVD++: Communication-Efficient Federated Learning for Cardiovascular Risk Prediction with Parametric and Non-Parametric Model Optimization

Abdelrhman Gaber, Hassan Abd-Eltawab, John Elgallab, Youssif Abuzied, Dineo Mpanya, Turgay Celik, Swarun Kumar, Tamer ElBatt

Main category: cs.LG

TL;DR: FedCVD++ is an enhanced federated learning framework for CVD risk prediction, integrating parametric and non-parametric models with communication-efficient strategies. It outperforms centralized methods and reduces bandwidth usage.

Details

Motivation: The urgent need for privacy-preserving predictive systems due to the high global mortality from cardiovascular diseases (CVD).

Method: Combines parametric (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) with innovations like tree-subset sampling, XGBoost-based feature extraction, and federated SMOTE synchronization.

Result: Achieves state-of-the-art results (F1 = 0.80 for federated XGBoost, 0.81 for Random Forest) and reduces bandwidth consumption by 3.2X while maintaining 95% accuracy.

Conclusion: FedCVD++ is the first practical integration of non-parametric models into federated healthcare systems, offering superior scalability and privacy-preserving solutions under clinical constraints.

Abstract: Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2X while preserving 95% accuracy. Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.

[319] SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby

Main category: cs.LG

TL;DR: A neural network layer API for sequence modeling, enabling both layer-by-layer and step-by-step execution, with state management for streaming and correctness.

Details

Motivation: Simplify the creation of sequence models with explicit state representation to avoid bugs in streaming and parallel processing.

Method: Layers define state over time and a step method for state evolution, ensuring identical results to stateless layer-wise calls.

Result: Enables immediate streaming, reduces bugs, and supports composable, declarative model building with correctness guarantees.

Conclusion: SequenceLayers provides a robust, flexible framework for sequence modeling, available in JAX and TensorFlow 2.

Abstract: We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.
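
As a rough illustration of the layer-wise versus step-wise contract described in this abstract, the sketch below implements a toy causal moving-sum layer in plain Python. It does not use the SequenceLayers API: the class `CausalSum`, its methods `layer`, `initial_state`, and `step`, and the helper `run_stepwise` are hypothetical names chosen for this example only.

```python
from typing import List, Tuple


class CausalSum:
    """Toy causal layer: y[t] is the sum of the last `width` inputs up to t."""

    def __init__(self, width: int) -> None:
        self.width = width

    def layer(self, xs: List[float]) -> List[float]:
        # Layer-wise call: process the whole sequence at once, with no state.
        padded = [0.0] * (self.width - 1) + list(xs)
        return [sum(padded[t:t + self.width]) for t in range(len(xs))]

    def initial_state(self) -> List[float]:
        # Explicit state: a buffer holding the previous width - 1 inputs.
        return [0.0] * (self.width - 1)

    def step(self, x: float, state: List[float]) -> Tuple[float, List[float]]:
        # Step-wise call: consume one input, return the output and new state.
        window = state + [x]
        return sum(window), window[1:]


def run_stepwise(layer: CausalSum, xs: List[float]) -> List[float]:
    # Feed the sequence one element at a time, threading the state through.
    state = layer.initial_state()
    outputs = []
    for x in xs:
        y, state = layer.step(x, state)
        outputs.append(y)
    return outputs
```

Checking that `run_stepwise(layer, xs)` equals `layer.layer(xs)` for a given input is the kind of equivalence test the abstract refers to; a real library would run it across many layer types and input shapes.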

[320] Planning for Cooler Cities: A Multimodal AI Framework for Predicting and Mitigating Urban Heat Stress through Urban Landscape Transformation

Shengao Yi, Xiaojiang Li, Wei Tu, Tianhong Zhao

Main category: cs.LG

TL;DR: GSM-UTCI, a deep learning model, predicts hyperlocal UTCI for cities efficiently, aiding urban heat mitigation planning.

Details

Motivation: Cities struggle with heat stress due to climate change; traditional models like SOLWEIG are computationally heavy.

Method: GSM-UTCI fuses morphology, land cover, and weather data using FiLM architecture, trained on SOLWEIG-derived UTCI maps.

Result: Achieves R² of 0.9151, MAE of 0.41°C, and reduces inference time to under 5 minutes for city-wide analysis.

Conclusion: GSM-UTCI is a scalable tool for evaluating urban greening strategies, demonstrated in Philadelphia with significant cooling effects.

Abstract: As extreme heat events intensify due to climate change and urbanization, cities face increasing challenges in mitigating outdoor heat stress. While traditional physical models such as SOLWEIG and ENVI-met provide detailed assessments of human-perceived heat exposure, their computational demands limit scalability for city-wide planning. In this study, we propose GSM-UTCI, a multimodal deep learning framework designed to predict daytime average Universal Thermal Climate Index (UTCI) at 1-meter hyperlocal resolution. The model fuses surface morphology (nDSM), high-resolution land cover data, and hourly meteorological conditions using a feature-wise linear modulation (FiLM) architecture that dynamically conditions spatial features on atmospheric context. Trained on SOLWEIG-derived UTCI maps, GSM-UTCI achieves near-physical accuracy, with an R² of 0.9151 and a mean absolute error (MAE) of 0.41°C, while reducing inference time from hours to under five minutes for an entire city. To demonstrate its planning relevance, we apply GSM-UTCI to simulate systematic landscape transformation scenarios in Philadelphia, replacing bare earth, grass, and impervious surfaces with tree canopy. Results show spatially heterogeneous but consistently strong cooling effects, with impervious-to-tree conversion producing the highest aggregated benefit (-4.18°C average change in UTCI across 270.7 km²). Tract-level bivariate analysis further reveals strong alignment between thermal reduction potential and land cover proportions. These findings underscore the utility of GSM-UTCI as a scalable, fine-grained decision support tool for urban climate adaptation, enabling scenario-based evaluation of greening strategies across diverse urban environments.
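
To make the feature-wise linear modulation (FiLM) idea in this abstract concrete, here is a minimal sketch in plain Python. It is not the GSM-UTCI architecture: it only shows the generic FiLM operation, in which a conditioning vector (standing in for weather) yields one scale `gamma` and one shift `beta` per channel, applied at every spatial location. The function names `linear` and `film`, and the single affine map used to produce `gamma` and `beta`, are assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # one channel of spatial features (rows x columns)


def linear(weights: List[List[float]], bias: List[float],
           cond: List[float]) -> List[float]:
    # One affine map from the conditioning vector to one value per channel.
    return [sum(w * c for w, c in zip(row, cond)) + b
            for row, b in zip(weights, bias)]


def film(features: List[Grid], gamma: List[float],
         beta: List[float]) -> List[Grid]:
    # Scale and shift every location of channel k by (gamma[k], beta[k]).
    return [
        [[g * value + b for value in row] for row in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]
```

For example, with two one-row channels `[[[1.0, 2.0]], [[3.0, 4.0]]]`, `gamma = [2.0, 1.0]`, and `beta = [1.0, 2.0]`, `film` returns `[[[3.0, 5.0]], [[5.0, 6.0]]]`. The same spatial features therefore produce different outputs under different conditioning vectors.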

[321] Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi

Main category: cs.LG

TL;DR: The paper argues against using human psychological tests to evaluate LLMs, calling for AI-specific evaluation frameworks.

Details

Motivation: Human tests are theory-driven and calibrated for humans; applying them to LLMs risks mischaracterization and lacks justification.

Method: Critiques the current practice of using human benchmarks for AI, highlighting validity issues like cultural bias and prompt sensitivity.

Result: Current interpretations of AI performance on human tests lack theoretical and empirical support.

Conclusion: Advocates for developing principled, AI-specific evaluation frameworks instead of repurposing human tests.

Abstract: Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as “intelligence”, despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

[322] DynaSwarm: Dynamically Graph Structure Selection for LLM-based Multi-agent System

Hui Yi Leong, Yuqing Wu

Main category: cs.LG

TL;DR: DynaSwarm is a dynamic framework for LLM-based multi-agent systems, using A2C reinforcement learning and dynamic graph selection to improve adaptability and performance.

Details

Motivation: Current MAS frameworks use static collaboration graphs, limiting adaptability and performance.

Method: DynaSwarm employs A2C reinforcement learning for stable graph optimization and a dynamic graph selector for sample-specific routing. It also fine-tunes a demonstration retriever for in-context learning.

Result: Outperforms state-of-the-art single-agent and MAS baselines in QA, math reasoning, and coding tasks.

Conclusion: Sample-aware structural flexibility is crucial for effective LLM-based MAS designs.

Abstract: Current multi-agent systems (MAS) frameworks often rely on manually designed and static collaboration graph structures, limiting adaptability and performance. To address these limitations, we propose DynaSwarm, a dynamic framework that enhances LLM-based MAS through two key innovations: (1) an actor-critic reinforcement learning (A2C) mechanism to optimize graph structures with improved stability over prior RL methods, and (2) a dynamic graph selector that adaptively chooses the optimal graph structure for each input sample via parameter-efficient LLM fine-tuning. DynaSwarm eliminates the need for rigid, one-size-fits-all graph architectures, instead leveraging sample-specific idiosyncrasies to dynamically route queries through specialized agent networks. Additionally, we propose fine-tuning the demonstration retriever to fully exploit the power of in-context learning (ICL). Extensive experiments on question answering, mathematical reasoning, and coding tasks demonstrate that DynaSwarm consistently outperforms state-of-the-art single-agent and MAS baselines across multiple LLM backbones. Our findings highlight the importance of sample-aware structural flexibility in LLM MAS designs.

[323] KLLM: Fast LLM Inference with K-Means Quantization

Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li

Main category: cs.LG

TL;DR: KLLM is a hardware-software co-design framework for efficient LLM inference using K-Means quantization and outlier detection, achieving significant speedups and energy efficiency improvements.

Details

Motivation: Addressing the challenges of weight and activation quantization (WAQ) in LLMs, such as accuracy degradation and activation outliers, to enable efficient low-precision inference.

Method: Proposes KLLM with an index-based computation scheme for K-Means-quantized data and an outlier detection engine (Orizuru) for online inference.

Result: Achieves speedups of 9.67x and 7.03x, and energy efficiency improvements of 229.50x and 150.21x compared to A100 GPU and Atom, respectively.

Conclusion: KLLM effectively leverages K-Means quantization and outlier detection to enhance LLM inference efficiency without significant accuracy loss.

Abstract: Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. However, two key challenges remain in the existing WAQ designs. (1) Traditional WAQ designs rely on uniform integer-based quantization for hardware efficiency, but this often results in significant accuracy degradation at low precision. K-Means-based quantization, a non-uniform quantization technique, achieves higher accuracy by matching the Gaussian-like distributions of weights and activations in LLMs. However, its non-uniform nature prevents direct execution on low-precision compute units, requiring dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers further hinder effective low-precision WAQ. Offline thresholding methods for outlier detection can lead to significant model performance degradation, while existing online detection techniques introduce substantial runtime overhead. To address the aforementioned challenges and fully unleash the potential of WAQ with K-Means quantization for LLM inference, in this paper, we propose KLLM, a hardware-software co-design framework. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a novel outlier detection engine, Orizuru, that efficiently identifies the top-k largest and smallest elements in the activation data stream during online inference. Extensive experiments show that, on average, KLLM achieves speedups of 9.67x and 7.03x and energy efficiency improvements of 229.50x and 150.21x compared to the A100 GPU and Atom, respectively.
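
The following sketch shows, under simplifying assumptions, how K-Means quantization and index-based computation can fit together. It is not KLLM's hardware scheme: it runs a one-dimensional Lloyd's algorithm to build a codebook, stores each value as a centroid index, and evaluates a dot product by looking up precomputed centroid products rather than dequantizing each element. All function names (`kmeans_1d`, `nearest`, `quantize`, `index_dot`) are hypothetical.

```python
from typing import List


def nearest(v: float, centroids: List[float]) -> int:
    # Index of the centroid closest to v (ties go to the lower index).
    return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))


def kmeans_1d(values: List[float], k: int, iters: int = 20) -> List[float]:
    # Lloyd's algorithm in one dimension; centroids start evenly spaced.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v, centroids)].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def quantize(values: List[float], centroids: List[float]) -> List[int]:
    # Replace each value by the index of its nearest centroid.
    return [nearest(v, centroids) for v in values]


def index_dot(w_idx: List[int], a_idx: List[int],
              w_codebook: List[float], a_codebook: List[float]) -> float:
    # Precompute all centroid products once; the dot product then needs
    # only table lookups and additions, with no per-element dequantization.
    table = [[wc * ac for ac in a_codebook] for wc in w_codebook]
    return sum(table[i][j] for i, j in zip(w_idx, a_idx))
```

With `k` centroids per operand the lookup table has only `k * k` entries, so its cost is amortized over long vectors. This is one reason non-uniform codebooks can remain practical despite not mapping onto integer arithmetic directly.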

[324] Linking Actor Behavior to Process Performance Over Time

Aurélie Leribaux, Rafael Oyamada, Johannes De Smedt, Zahra Dasht Bozorgi, Artem Polyvyanyy, Jochen De Weerdt

Main category: cs.LG

TL;DR: The paper integrates actor behavior analysis with Granger causality to study how individual actor actions influence process outcomes, revealing measurable impacts on performance like throughput time.

Details

Motivation: Traditional process mining overlooks temporal and causal dynamics from individual actor behavior, limiting understanding of real-world process complexity.

Method: Combines actor behavior analysis with Granger causality, using Group Lasso for lag selection on real-world event logs to identify influential lags.

Result: Identifies a small set of influential lags showing actor behavior directly impacts process performance, especially throughput time.

Conclusion: Actor-centric, time series-based methods provide nuanced insights into how individual behaviors drive process efficiency.

Abstract: Understanding how actor behavior influences process outcomes is a critical aspect of process mining. Traditional approaches often use aggregate and static process data, overlooking the temporal and causal dynamics that arise from individual actor behavior. This limits the ability to accurately capture the complexity of real-world processes, where individual actor behavior and interactions between actors significantly shape performance. In this work, we address this gap by integrating actor behavior analysis with Granger causality to identify correlating links in time series data. We apply this approach to real-world event logs, constructing time series for actor interactions, i.e., continuation, interruption, and handovers, and process outcomes. Using Group Lasso for lag selection, we identify a small but consistently influential set of lags that capture the majority of causal influence, revealing that actor behavior has direct and measurable impacts on process performance, particularly throughput time. These findings demonstrate the potential of actor-centric, time series-based methods for uncovering the temporal dependencies that drive process outcomes, offering a more nuanced understanding of how individual behaviors impact overall process efficiency.

[325] Prediction of Significant Creatinine Elevation in First ICU Stays with Vancomycin Use: A retrospective study through Catboost

Junyi Fan, Li Sun, Shuheng Chen, Yong Si, Minoo Ahmadi, Greg Placencia, Elham Pishgar, Kamiar Alaei, Maryam Pishgar

Main category: cs.LG

TL;DR: A machine learning model was developed to predict vancomycin-related kidney injury in ICU patients using routine data, achieving strong accuracy and interpretability.

Details

Motivation: Early prediction of vancomycin-induced nephrotoxicity is challenging but crucial for timely interventions in critically ill patients.

Method: Analyzed 10,288 ICU patients from MIMIC-IV, defined kidney injury by KDIGO criteria, selected features, and tested six algorithms with cross-validation. Interpretability was assessed using SHAP, ALE, and Bayesian methods.

Result: CatBoost performed best (AUROC 0.818), with key predictors like phosphate and bilirubin. SHAP and ALE confirmed interpretability, and Bayesian analysis estimated high-risk cases.

Conclusion: The model accurately predicts vancomycin-associated kidney injury, aiding early risk detection and intervention in critical care.

Abstract: Background: Vancomycin, a key antibiotic for severe Gram-positive infections in ICUs, poses a high nephrotoxicity risk. Early prediction of kidney injury in critically ill patients is challenging. This study aimed to develop a machine learning model to predict vancomycin-related creatinine elevation using routine ICU data. Methods: We analyzed 10,288 ICU patients (aged 18-80) from the MIMIC-IV database who received vancomycin. Kidney injury was defined by KDIGO criteria (creatinine rise >=0.3 mg/dL within 48h or >=50% within 7d). Features were selected via SelectKBest (top 30) and Random Forest ranking (final 15). Six algorithms were tested with 5-fold cross-validation. Interpretability was evaluated using SHAP, Accumulated Local Effects (ALE), and Bayesian posterior sampling. Results: Of 10,288 patients, 2,903 (28.2%) developed creatinine elevation. CatBoost performed best (AUROC 0.818 [95% CI: 0.801-0.834], sensitivity 0.800, specificity 0.681, negative predictive value 0.900). Key predictors were phosphate, total bilirubin, magnesium, Charlson index, and APSIII. SHAP confirmed phosphate as a major risk factor. ALE showed dose-response patterns. Bayesian analysis estimated mean risk 60.5% (95% credible interval: 16.8-89.4%) in high-risk cases. Conclusions: This machine learning model predicts vancomycin-associated creatinine elevation from routine ICU data with strong accuracy and interpretability, enabling early risk detection and supporting timely interventions in critical care.
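
The outcome definition quoted in this abstract (creatinine rise >=0.3 mg/dL within 48h or >=50% within 7d) can be sketched as a small labelling function. The abstract does not say how the baseline value is chosen, so this example assumes every earlier measurement may serve as a baseline for every later one; that choice, the function name `creatinine_elevation`, and the `(hours, mg/dL)` input format are assumptions, not the authors' pipeline.

```python
from typing import List, Tuple


def creatinine_elevation(series: List[Tuple[float, float]]) -> bool:
    """Return True if the series meets either threshold from the abstract.

    series: (hours since ICU admission, serum creatinine in mg/dL) pairs.
    Assumption: any earlier measurement can act as the baseline.
    """
    ordered = sorted(series)
    for i, (t0, c0) in enumerate(ordered):
        for t1, c1 in ordered[i + 1:]:
            gap_hours = t1 - t0
            # Absolute criterion: rise of at least 0.3 mg/dL within 48 hours.
            if gap_hours <= 48 and c1 - c0 >= 0.3:
                return True
            # Relative criterion: rise of at least 50% within 7 days (168 h).
            if gap_hours <= 168 and c1 >= 1.5 * c0:
                return True
    return False
```

A rise from 1.0 to 1.4 mg/dL over 24 hours is flagged by the absolute criterion, while the same rise over 72 hours is not flagged by either, since it is outside the 48-hour window and below 50%.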

Tatsuya Mitomi, Fumiyasu Makinoshima, Fumiya Makihara, Eigo Segawa

Main category: cs.LG

TL;DR: The paper introduces a differentiable agent-based simulation for dynamic pricing in bike-sharing systems to balance inventory, outperforming conventional methods in accuracy and speed.

Details

Motivation: Bike-sharing systems face imbalanced inventory due to spatiotemporally varying user demands, requiring optimal dynamic pricing to manage costs.

Method: Develops a differentiable agent-based simulation to design dynamic pricing, validated through numerical experiments and large-scale urban scenarios.

Result: Achieves 73-78% loss reduction, 100x faster convergence, and balanced inventory without manual relocation.

Conclusion: Optimal dynamic pricing can effectively balance bike-sharing inventory while minimizing costs, validated by simulations.

Abstract: Bike-sharing systems are emerging in various cities as a new eco-friendly transportation system. In these systems, spatiotemporally varying user demands lead to imbalanced inventory at bicycle stations, resulting in additional relocation costs. Therefore, it is essential to manage user demand through optimal dynamic pricing for the system. However, optimal pricing design for such a system is challenging because the system involves users with diverse backgrounds and their probabilistic choices. To address this problem, we develop a differentiable agent-based simulation to rapidly design dynamic pricing in bike-sharing systems, achieving balanced bicycle inventory despite spatiotemporally heterogeneous trips and probabilistic user decisions. We first validate our approach against conventional methods through numerical experiments involving 25 bicycle stations and five time slots, yielding 100 parameters. Compared to the conventional methods, our approach obtains a more accurate solution with a 73% to 78% reduction in loss while achieving more than a 100-fold increase in convergence speed. We further validate our approach on a large-scale urban bike-sharing system scenario involving 289 bicycle stations, resulting in a total of 1156 parameters. Through simulations using the obtained pricing policies, we confirm that these policies can naturally induce balanced inventory without any manual relocation. Additionally, we find that the cost of discounts to induce the balanced inventory can be minimized by setting appropriate initial conditions.

[327] Locally Differentially Private Thresholding Bandits

Annalisa Barbara, Joseph Lazzaro, Ciara Pike-Burke

Main category: cs.LG

TL;DR: The paper explores local differential privacy in thresholding bandit problems, proposing methods with strong privacy guarantees and theoretical performance bounds.

Details

Motivation: To address privacy concerns in bandit problems by ensuring local differential privacy while identifying high-reward arms.

Method: Uses a Bernoulli-based differentially private mechanism for private responses in fixed budget and fixed confidence settings.

Result: The proposed algorithms match derived lower bounds up to poly-logarithmic factors, ensuring strong privacy and performance.

Conclusion: The work offers insights into privacy-preserving decision-making in bandit problems, balancing privacy and performance.

Abstract: This work investigates the impact of ensuring local differential privacy in the thresholding bandit problem. We consider both the fixed budget and fixed confidence settings. We propose methods that utilize private responses, obtained through a Bernoulli-based differentially private mechanism, to identify arms with expected rewards exceeding a predefined threshold. We show that this procedure provides strong privacy guarantees and derive theoretical performance bounds on the proposed algorithms. Additionally, we present general lower bounds that characterize the additional loss incurred by any differentially private mechanism, and show that the presented algorithms match these lower bounds up to poly-logarithmic factors. Our results provide valuable insights into privacy-preserving decision-making frameworks in bandit problems.
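
The abstract does not spell out its Bernoulli-based mechanism, so the sketch below uses a standard construction as a stand-in: a reward in [0, 1] is first turned into a single bit with matching mean, and that bit is then released through randomized response, which satisfies epsilon-local differential privacy. The learner debiases the average of the private responses before comparing it with the threshold. The function names and the plain mean-versus-threshold decision rule are assumptions, not the paper's algorithms.

```python
import math
import random
from typing import List


def keep_probability(epsilon: float) -> float:
    # Probability of reporting the true bit under randomized response.
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def private_response(reward: float, epsilon: float, rng: random.Random) -> int:
    # Step 1: turn a reward in [0, 1] into one bit whose mean is the reward.
    bit = 1 if rng.random() < reward else 0
    # Step 2: flip that bit with probability 1 - keep_probability(epsilon).
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def debiased_mean(responses: List[int], epsilon: float) -> float:
    # Invert E[response] = (2p - 1) * mean + (1 - p) to estimate the mean.
    p = keep_probability(epsilon)
    return (sum(responses) / len(responses) - (1.0 - p)) / (2.0 * p - 1.0)


def above_threshold(responses: List[int], epsilon: float,
                    threshold: float) -> bool:
    # Naive decision rule: compare the debiased estimate with the threshold.
    return debiased_mean(responses, epsilon) > threshold
```

Debiasing divides by `2p - 1`, which shrinks toward zero as epsilon decreases, so the variance of the estimate grows under stronger privacy. That is the intuition behind the additional loss that the lower bounds in the abstract characterize.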

[328] A Foundation Model for Material Fracture Prediction

Agnese Marcato, Aleksandra Pachalieva, Ryley G. Hill, Kai Gao, Xiaoyu Wang, Esteban Rougier, Zhou Lei, Vinamra Agrawal, Janel Chua, Qinjun Kang, Jeffrey D. Hyman, Abigail Hunter, Nathan DeBardeleben, Earl Lawrence, Hari Viswanathan, Daniel O’Malley, Javier E. Santos

Main category: cs.LG

TL;DR: A transformer-based foundation model for fracture prediction unifies diverse materials and loading conditions, reducing data needs and improving generalization.

Details

Motivation: Accurate fracture prediction is critical for safety and reliability, but current methods are limited by narrow datasets, lack of robustness, and high computational costs.

Method: A transformer-based architecture combines multimodal inputs (structured/unstructured meshes and text embeddings) to adapt flexibly across simulations without architectural changes.

Result: The model generalizes to unseen materials with minimal data, supports diverse tasks (e.g., time-to-failure estimation), and outperforms simulator-specific workflows.

Conclusion: The foundation model offers a scalable, extensible solution for unified fracture prediction across materials and conditions.

Abstract: Accurately predicting when and how materials fail is critical to designing safe, reliable structures, mechanical systems, and engineered components that operate under stress. Yet, fracture behavior remains difficult to model across the diversity of materials, geometries, and loading conditions in real-world applications. While machine learning (ML) methods show promise, most models are trained on narrow datasets, lack robustness, and struggle to generalize. Meanwhile, physics-based simulators offer high-fidelity predictions but are fragmented across specialized methods and require substantial high-performance computing resources to explore the input space. To address these limitations, we present a data-driven foundation model for fracture prediction, a transformer-based architecture that operates across simulators, a wide range of materials (including plastic-bonded explosives, steel, aluminum, shale, and tungsten), and diverse loading conditions. The model supports both structured and unstructured meshes, combining them with large language model embeddings of textual input decks specifying material properties, boundary conditions, and solver settings. This multimodal input design enables flexible adaptation across simulation scenarios without changes to the model architecture. The trained model can be fine-tuned with minimal data on diverse downstream tasks, including time-to-failure estimation, modeling fracture evolution, and adapting to combined finite-discrete element method simulations. It also generalizes to unseen materials such as titanium and concrete, requiring as few as a single sample, dramatically reducing data needs compared to standard ML. Our results show that fracture prediction can be unified under a single model architecture, offering a scalable, extensible alternative to simulator-specific workflows.

[329] On the Sustainability of AI Inferences in the Edge

Ghazal Sobhani, Md. Monzurul Amin Ifath, Tushar Sharma, Israat Haque

Main category: cs.LG

TL;DR: This study analyzes the performance and energy usage of edge devices (e.g., Raspberry Pi, NVIDIA Jetson nano) for AI inferences, focusing on trade-offs between model accuracy, inference time, power, and memory.

Details

Motivation: The lack of comprehensive studies on edge device performance and energy usage for AI applications motivates this research to aid informed device and model selection.

Method: The study rigorously evaluates traditional, neural network, and large language models on edge devices, analyzing trade-offs in F1 score, inference time, power, and memory usage.

Result: Findings highlight the importance of hardware and framework optimizations, along with parameter tuning, to balance performance and resource usage for practical edge AI deployments.

Conclusion: The study provides insights for optimizing edge AI deployments by considering performance and energy trade-offs, aiding device and model selection.

Abstract: The proliferation of the Internet of Things (IoT) and its cutting-edge AI-enabled applications (e.g., autonomous vehicles and smart industries) combines two paradigms: data-driven systems and their deployment on the edge. Usually, edge devices perform inferences to support latency-critical applications. In addition to the performance of these resource-constrained edge devices, their energy usage is a critical factor in adopting and deploying edge applications. Examples of such devices include Raspberry Pi (RPi), Intel Neural Compute Stick (INCS), NVIDIA Jetson nano (NJn), and Google Coral USB (GCU). Despite their adoption in edge deployment for AI inferences, there is no study on their performance and energy usage for informed decision-making on the device and model selection to meet the demands of applications. This study fills the gap by rigorously characterizing the performance of traditional models, neural networks, and large language models on the above edge devices. Specifically, we analyze trade-offs among model F1 score, inference time, inference power, and memory usage. Hardware and framework optimization, along with external parameter tuning of AI models, can balance model performance and resource usage to realize practical edge AI deployments.

[330] Scalable Generative Modeling of Weighted Graphs

Richard Williams, Eric Nalisnick, Andrew Holbrook

Main category: cs.LG

TL;DR: BiGG-E is an autoregressive model extending BiGG to learn joint distributions over weighted graphs efficiently.

Details

Motivation: Current deep generative models for graphs often ignore edge weights or fail to model joint distributions with topology, limiting their applicability to weighted graphs.

Method: BiGG-E extends BiGG to handle weighted graphs, leveraging sparsity for efficient generation in O((n + m)log n) time.

Result: BiGG-E outperforms benchmarks in capturing weighted graph distributions while remaining scalable and efficient.

Conclusion: BiGG-E addresses limitations of existing models by jointly learning topology and edge weights, offering a scalable solution for weighted graph generation.

Abstract: Weighted graphs are ubiquitous throughout biology, chemistry, and the social sciences, motivating the development of generative models for abstract weighted graph data using deep neural networks. However, most current deep generative models are either designed for unweighted graphs and are not easily extended to weighted topologies or incorporate edge weights without consideration of a joint distribution with topology. Furthermore, learning a distribution over weighted graphs must account for complex nonlocal dependencies between both the edges of the graph and corresponding weights of each edge. We develop an autoregressive model BiGG-E, a nontrivial extension of the BiGG model, that learns a joint distribution over weighted graphs while still exploiting sparsity to generate a weighted graph with $n$ nodes and $m$ edges in $O((n + m)\log n)$ time. Simulation studies and experiments on a variety of benchmark datasets demonstrate that BiGG-E best captures distributions over weighted graphs while remaining scalable and computationally efficient.

[331] FLOSS: Federated Learning with Opt-Out and Straggler Support

David J Goetze, Dahlia J Felten, Jeannie R Albrecht, Rohit Bhattacharya

Main category: cs.LG

TL;DR: FLOSS mitigates bias and performance degradation in federated learning caused by missing data due to user opt-out and stragglers.

Details

Motivation: Addressing the challenge of missing data in federated learning systems, which arises from user opt-out and device heterogeneity, leading to bias and degraded model performance.

Method: Introduces FLOSS, a system designed to mitigate the impacts of missing data in federated learning, tested through simulations.

Result: Empirical simulations demonstrate FLOSS’s effectiveness in improving model performance despite missing data.

Conclusion: FLOSS successfully addresses the issue of missing data in federated learning, enhancing model robustness and fairness.

Abstract: Previous work on data privacy in federated learning systems focuses on privacy-preserving operations for data from users who have agreed to share their data for training. However, modern data privacy agreements also empower users to use the system while opting out of sharing their data as desired. When combined with stragglers that arise from heterogeneous device capabilities, the result is missing data from a variety of sources that introduces bias and degrades model performance. In this paper, we present FLOSS, a system that mitigates the impacts of such missing data on federated learning in the presence of stragglers and user opt-out, and empirically demonstrate its performance in simulations.

[332] Evaluating and Improving the Robustness of Speech Command Recognition Models to Noise and Distribution Shifts

Anaïs Baranger, Lucas Maison

Main category: cs.LG

TL;DR: The paper explores how training conditions and input features affect the robustness and generalization of spoken keyword classifiers under OOD conditions, using fairness and robustness metrics.

Details

Motivation: Prior work in computer vision shows correlations between ID and OOD accuracies, but this relationship is underexplored in audio-based models.

Method: Benchmarking neural architectures across evaluation sets, using fairness (F) and robustness (R) metrics to quantify noise impact.

Result: Noise-aware training improves robustness in some configurations.

Conclusion: The findings highlight the benefits and limitations of noise-based augmentation for generalization in speech models.

Abstract: Although prior work in computer vision has shown strong correlations between in-distribution (ID) and out-of-distribution (OOD) accuracies, such relationships remain underexplored in audio-based models. In this study, we investigate how training conditions and input features affect the robustness and generalization abilities of spoken keyword classifiers under OOD conditions. We benchmark several neural architectures across a variety of evaluation sets. To quantify the impact of noise on generalization, we make use of two metrics: Fairness (F), which measures overall accuracy gains compared to a baseline model, and Robustness (R), which assesses the convergence between ID and OOD performance. Our results suggest that noise-aware training improves robustness in some configurations. These findings shed new light on the benefits and limitations of noise-based augmentation for generalization in speech models.

[333] Observational Multiplicity

Erin George, Deanna Needell, Berk Ustun

Main category: cs.LG

TL;DR: The paper addresses arbitrariness in probabilistic classification due to observational multiplicity, introduces a regret-based measure to evaluate it, and demonstrates its utility for safety and interpretability.

Details

Motivation: To tackle the issue of conflicting predictions from equally performant models, which undermines interpretability and safety in probabilistic classification tasks.

Method: Proposes a regret measure for probabilistic classification, introduces a method to estimate it, and applies it to analyze group-specific arbitrariness and safety measures like abstention and data collection.

Result: Shows that regret is higher for certain groups and highlights its potential for improving safety through abstention and targeted data collection.

Conclusion: Regret estimation can mitigate arbitrariness in probabilistic classification, enhancing interpretability and safety in real-world applications.

Abstract: Many prediction tasks can admit multiple models that perform almost equally well. This phenomenon can undermine interpretability and safety when competing models assign conflicting predictions to individuals. In this work, we study how arbitrariness can arise in probabilistic classification tasks as a result of an effect that we call observational multiplicity. We discuss how this effect arises in a broad class of practical applications where we learn a classifier to predict probabilities $p_i \in [0,1]$ but are given a dataset of observations $y_i \in \{0,1\}$. We propose to evaluate the arbitrariness of individual probability predictions through the lens of regret. We introduce a measure of regret for probabilistic classification tasks, which measures how the predictions of a model could change as a result of changes in the training labels. We present a general-purpose method to estimate the regret in a probabilistic classification task. We use our measure to show that regret is higher for certain groups in the dataset and discuss potential applications of regret. We demonstrate how estimating regret can promote safety in real-world applications through abstention and data collection.

[334] AI paradigm for solving differential equations: first-principles data generation and scale-dilation operator AI solver

Xiangshu Gong, Zhiqiang Xie, Xiaowei Jin, Chen Wang, Yanling Qu, Wangmeng Zuo, Hui Li

Main category: cs.LG

TL;DR: Proposes an AI paradigm for solving DEs with a first-principles data generation method and a scale-dilation operator (SDO) AI solver, addressing high-frequency component approximation and achieving superior accuracy.

Details

Motivation: Existing AI solvers struggle with high-frequency component approximation and data scarcity in solving DEs.

Method: Uses first-principles data generation and a reversible SDO with Fourier transforms to fix high-frequency issues, coupled with a Transformer AI solver.

Result: Demonstrates consistently superior accuracy over state-of-the-art methods on diverse DEs.

Conclusion: Makes AI solvers for DEs practical for broad applications in nature and engineering.

Abstract: Many problems are governed by differential equations (DEs). Artificial intelligence (AI) is a new path for solving DEs. However, data is very scarce and existing AI solvers struggle with approximation of high frequency components (AHFC). We propose an AI paradigm for solving diverse DEs, comprising a DE-ruled first-principles data generation methodology and a scale-dilation operator (SDO) AI solver. Using either prior knowledge or random fields, we generate solutions and then substitute them into the DEs to derive the sources and initial/boundary conditions through balancing DEs, thus producing an arbitrarily large amount of first-principles-consistent training data at extremely low computational cost. We introduce a reversible SDO that leverages the Fourier transform of the multiscale solutions to fix AHFC, and design a spatiotemporally coupled, attention-based Transformer AI solver of DEs with SDO. An upper bound on the Hessian condition number of the loss function is proven to be proportional to the squared 2-norm of the solution gradient, revealing that SDO yields a smoother loss landscape, consequently fixing AHFC with efficient training. Extensive tests on diverse DEs demonstrate that our AI paradigm achieves consistently superior accuracy over state-of-the-art methods. This work makes AI solvers of DEs truly usable across a broad range of natural and engineering fields.
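
Illustration (not from the paper): the data-generation idea in the abstract, choosing a solution first and substituting it into the DE to obtain the source and boundary conditions, can be sketched on a toy problem. Everything specific here is an assumption made for the example: the 1D Poisson equation -u'' = f on [0, 1], the random sine-series form of the solution, and the helper names `sample_solution`, `u`, `source`, and `make_training_pair`.

```python
import math
import random


def sample_solution(num_modes, rng):
    """Draw random coefficients a_k for u(x) = sum_k a_k * sin(k * pi * x)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(num_modes)]


def u(coeffs, x):
    """Evaluate the chosen solution at x."""
    return sum(a * math.sin((k + 1) * math.pi * x) for k, a in enumerate(coeffs))


def source(coeffs, x):
    """Source term f = -u'' obtained by differentiating each sine mode analytically."""
    return sum(
        a * ((k + 1) * math.pi) ** 2 * math.sin((k + 1) * math.pi * x)
        for k, a in enumerate(coeffs)
    )


def make_training_pair(num_modes, num_points, seed):
    """Return grid, source values (input), solution values (label), boundary values."""
    rng = random.Random(seed)
    coeffs = sample_solution(num_modes, rng)
    xs = [i / (num_points - 1) for i in range(num_points)]
    f_vals = [source(coeffs, x) for x in xs]
    u_vals = [u(coeffs, x) for x in xs]
    boundary = (u(coeffs, 0.0), u(coeffs, 1.0))
    return xs, f_vals, u_vals, boundary


xs, f_vals, u_vals, boundary = make_training_pair(num_modes=3, num_points=11, seed=0)
```

No solver is run at any point: each (source, solution) pair costs only a few function evaluations, which is the sense in which such data is cheap to produce in this toy setting.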

[335] FuseTen: A Generative Model for Daily 10 m Land Surface Temperature Estimation from Spatio-Temporal Satellite Observations

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

Main category: cs.LG

TL;DR: FuseTen is a generative framework that fuses satellite data to produce high-resolution daily Land Surface Temperature (LST) estimates, improving accuracy by 32.06% over baselines.

Details

Motivation: Urban heatwaves, droughts, and land degradation require accurate LST data, but satellite trade-offs between spatial and temporal resolution limit current capabilities.

Method: FuseTen combines Sentinel-2, Landsat 8, and Terra MODIS data using a generative architecture with attention, normalization, and PatchGAN for realistic outputs.

Result: FuseTen outperforms linear baselines by 32.06% in metrics and 31.42% in visual fidelity, achieving fine 10 m resolution daily LST.

Conclusion: FuseTen is the first non-linear method to generate daily LST at 10 m resolution, addressing critical gaps in land surface monitoring.

Abstract: Urban heatwaves, droughts, and land degradation are pressing and growing challenges in the context of climate change. A valuable approach to studying them requires accurate spatio-temporal information on land surface conditions. One of the most important variables for assessing and understanding these phenomena is Land Surface Temperature (LST), which is derived from satellites and provides essential information about the thermal state of the Earth’s surface. However, satellite platforms inherently face a trade-off between spatial and temporal resolutions. To bridge this gap, we propose FuseTen, a novel generative framework that produces daily LST observations at a fine 10 m spatial resolution by fusing spatio-temporal observations derived from Sentinel-2, Landsat 8, and Terra MODIS. FuseTen employs a generative architecture trained using an averaging-based supervision strategy grounded in physical principles. It incorporates attention and normalization modules within the fusion process and uses a PatchGAN discriminator to enforce realism. Experiments across multiple dates show that FuseTen outperforms linear baselines, with an average 32.06% improvement in quantitative metrics and 31.42% in visual fidelity. To the best of our knowledge, this is the first non-linear method to generate daily LST estimates at such fine spatial resolution.

[336] BAR Conjecture: the Feasibility of Inference Budget-Constrained LLM Services with Authenticity and Reasoning

Jinan Zhou, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta, Aryan Singhal

Main category: cs.LG

TL;DR: The BAR Theorem framework addresses the trade-off between inference-time budget, factual authenticity, and reasoning capacity in LLM design.

Details

Motivation: Practitioners face challenges in optimizing inference-time budget, factual authenticity, and reasoning capacity simultaneously in LLM services.

Method: Formal proof of the trade-off and proposal of the BAR Theorem framework.

Result: No model can optimize all three properties at once; the BAR Theorem provides a principled design approach.

Conclusion: The BAR Theorem offers a structured solution for balancing trade-offs in LLM-application design.

Abstract: When designing LLM services, practitioners care about three key properties: inference-time budget, factual authenticity, and reasoning capacity. However, our analysis shows that no model can simultaneously optimize for all three. We formally prove this trade-off and propose a principled framework named The BAR Theorem for LLM-application design.

[337] NaN-Propagation: A Novel Method for Sparsity Detection in Black-Box Computational Functions

Peter Sharpe

Main category: cs.LG

TL;DR: The paper introduces NaN-propagation to detect sparsity in black-box functions, eliminating false negatives in gradient calculations and achieving computational speedups.

Details

Motivation: Existing finite-difference methods for sparsity detection produce false negatives due to coincidental zero gradients, leading to corrupted gradient calculations.

Method: NaN-propagation uses IEEE 754 NaN values to trace input-output dependencies, systematically contaminating inputs with NaN to reconstruct conservative sparsity patterns.

Result: The method demonstrated a 1.52x speedup on an aerospace wing weight model, detecting dependencies missed by conventional methods.

Conclusion: NaN-propagation leverages IEEE 754 compliance for cross-language compatibility and offers faster-than-linear time complexity, improving black-box sparsity detection.

Abstract: Sparsity detection in black-box functions enables significant computational speedups in gradient-based optimization through Jacobian compression, but existing finite-difference methods suffer from false negatives due to coincidental zero gradients. These false negatives can silently corrupt gradient calculations, leading to difficult-to-diagnose errors. We introduce NaN-propagation, which exploits the universal contamination property of IEEE 754 Not-a-Number floating-point values to trace input-output dependencies through floating-point numerical computations. By systematically contaminating inputs with NaN and observing which outputs become NaN, the method reconstructs conservative sparsity patterns that eliminate false negatives. We demonstrate the approach on an aerospace wing weight model, achieving a 1.52x speedup while detecting dozens of dependencies missed by conventional methods – a significant improvement since gradient computation is the bottleneck in many optimization workflows. The technique leverages IEEE 754 compliance to work across programming languages and math libraries without modifying existing black-box codes. Advanced strategies including NaN payload encoding enable faster-than-linear time complexity, improving upon existing black-box sparsity detection methods. Practical algorithms are also proposed to mitigate challenges from branching code execution common in engineering applications.
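
Illustration (not from the paper): a minimal sketch of the basic one-input-at-a-time NaN contamination idea described in the abstract, contrasted with a naive finite-difference check. The black box `toy_model`, the evaluation point, and the function names are hypothetical, and the payload-encoding and branching strategies mentioned in the abstract are not shown.

```python
import math


def toy_model(x):
    """Hypothetical black box: 3 inputs, 2 outputs."""
    return [x[0] * x[1], x[2] + 1.0]


def nan_sparsity_pattern(func, x0):
    """pattern[i][j] is True if output i becomes NaN when input j is set to NaN."""
    n_outputs = len(func(list(x0)))
    pattern = [[False] * len(x0) for _ in range(n_outputs)]
    for j in range(len(x0)):
        probe = list(x0)
        probe[j] = float("nan")  # contaminate a single input
        for i, y in enumerate(func(probe)):
            pattern[i][j] = math.isnan(y)
    return pattern


def finite_difference_pattern(func, x0, step=1e-6):
    """pattern[i][j] is True if output i changes when input j is perturbed."""
    base = func(list(x0))
    pattern = [[False] * len(x0) for _ in base]
    for j in range(len(x0)):
        probe = list(x0)
        probe[j] += step
        for i, y in enumerate(func(probe)):
            pattern[i][j] = y != base[i]
    return pattern


x0 = [0.0, 2.0, 3.0]
nan_pattern = nan_sparsity_pattern(toy_model, x0)
fd_pattern = finite_difference_pattern(toy_model, x0)
```

At this evaluation point the first input is zero, so perturbing the second input leaves the product unchanged and the finite-difference check misses that dependency, while the NaN probe still flags it. Code that branches on a comparison involving NaN can break this propagation, which is the kind of challenge the abstract says needs separate handling.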

[338] Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation

Hyeon Seong Jeong, Sangwoo Jo, Byeong Hyun Yoon, Yoonseok Heo, Haedong Jeong, Taehoon Kim

Main category: cs.LG

TL;DR: DocsRay is a training-free system for understanding multimodal documents using pseudo-TOC generation and hierarchical RAG, improving efficiency and accuracy.

Details

Motivation: Challenges in understanding multimodal documents due to structural inconsistencies and limited training data.

Method: Combines semantic structuring with LLMs, zero-shot multimodal analysis, and hierarchical retrieval to process diverse document elements.

Result: Reduced query latency by 45% and achieved 64.7% accuracy on MMLongBench-Doc.

Conclusion: DocsRay effectively processes complex documents without training, outperforming prior methods.

Abstract: Understanding complex multimodal documents remains challenging due to their structural inconsistencies and limited training data availability. We introduce DocsRay, a training-free document understanding system that integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG). Our approach leverages multimodal Large Language Models’ (LLMs) native capabilities to seamlessly process documents containing diverse elements such as text, images, charts, and tables without requiring specialized models or additional training. DocsRay’s framework synergistically combines three key techniques: (1) a semantic structuring module using prompt-based LLM interactions to generate a hierarchical pseudo-TOC, (2) zero-shot multimodal analysis that converts diverse document elements into unified, text-centric representations using the inherent capabilities of multimodal LLMs, and (3) an efficient two-stage hierarchical retrieval system that reduces retrieval complexity from $O(N)$ to $O(S + k_1 \cdot N_s)$. Evaluated on documents averaging 49.4 pages and 20,971 textual tokens, DocsRay reduced query latency from 3.89 to 2.12 seconds, achieving a 45% efficiency improvement. On the MMLongBench-Doc benchmark, DocsRay-Pro attains an accuracy of 64.7%, substantially surpassing previous state-of-the-art results.
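
Illustration (not from the paper): a generic two-stage retrieval sketch showing where a cost of the form $O(S + k_1 \cdot N_s)$ comes from, reading $S$ as the number of sections and $N_s$ as the chunks per section. The data layout, the dot-product scoring, and the names `two_stage_retrieve`, `k1`, and `k2` are assumptions for this example and are not DocsRay's actual interface.

```python
def dot(a, b):
    """Plain dot product used as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(query, sections, k1, k2):
    """sections: list of (section_vector, [(chunk_id, chunk_vector), ...])."""
    # Stage 1: score all S section vectors and keep the k1 best sections.
    ranked = sorted(
        range(len(sections)),
        key=lambda s: dot(query, sections[s][0]),
        reverse=True,
    )[:k1]
    # Stage 2: score only the chunks inside the selected sections.
    candidates = []
    for s in ranked:
        for chunk_id, vec in sections[s][1]:
            candidates.append((dot(query, vec), chunk_id))
    candidates.sort(reverse=True)
    return [chunk_id for _, chunk_id in candidates[:k2]]


sections = [
    ([1.0, 0.0], [("a1", [0.9, 0.1]), ("a2", [0.5, 0.5])]),
    ([0.0, 1.0], [("b1", [0.1, 0.9]), ("b2", [0.0, 1.0])]),
    ([0.7, 0.7], [("c1", [0.8, 0.6]), ("c2", [0.6, 0.8])]),
]
top_chunks = two_stage_retrieve([1.0, 0.0], sections, k1=2, k2=2)
```

Stage 1 costs one score per section and stage 2 one score per chunk in the kept sections, so the saving over scoring every chunk only appears when the document has many more chunks than the toy data above.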

[339] A Single Direction of Truth: An Observer Model’s Linear Residual Probe Exposes and Steers Contextual Hallucinations

Charles O’Neill, Slava Chalnev, Chi Chi Zhao, Max Kirkby, Mudith Jayasekara

Main category: cs.LG

TL;DR: A method using a generator-agnostic observer model detects AI hallucinations via a linear probe on its residual stream, outperforming baselines and enabling causal manipulation of hallucination rates.

Details

Motivation: Addressing the challenge of contextual hallucinations in AI by identifying and mitigating unsupported statements.

Method: Uses a linear probe on the residual stream of an observer model to isolate a transferable direction separating hallucinated from faithful text, with gradient-times-activation analysis.

Result: Outperforms baselines by 5-27 points, shows robust performance across model sizes, and enables causal manipulation of hallucination rates.

Conclusion: Demonstrates internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, offering actionable insights for detection and mitigation.

Abstract: Contextual hallucinations – statements unsupported by given context – remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
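
Illustration (not from the paper): a toy sketch of scoring and steering along a single linear direction. The paper trains a linear probe on an observer model's residual stream; the difference-of-means direction below is a simpler stand-in, and the 2-dimensional "activations", the function names, and the steering strength `alpha` are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_direction(faithful, hallucinated):
    """Unit direction from the faithful mean to the hallucinated mean, plus a bias."""
    mu_f = mean_vector(faithful)
    mu_h = mean_vector(hallucinated)
    diff = [h - f for h, f in zip(mu_h, mu_f)]
    norm = math.sqrt(dot(diff, diff))  # assumes the two class means differ
    direction = [d / norm for d in diff]
    midpoint = [(h + f) / 2.0 for h, f in zip(mu_h, mu_f)]
    return direction, -dot(direction, midpoint)


def score(x, direction, bias):
    """Positive score: closer to the hallucinated side of the direction."""
    return dot(x, direction) + bias


def steer(x, direction, alpha):
    """Shift an activation along the direction by alpha."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


faithful = [[0.0, 1.0], [0.2, 0.8]]
hallucinated = [[1.0, 0.0], [0.8, 0.2]]
direction, bias = fit_direction(faithful, hallucinated)
steered = steer([0.0, 1.0], direction, alpha=2.0)
```

Because the direction has unit length, moving an activation by `alpha` along it changes the score by exactly `alpha`, which is the simplest picture of how intervening on one direction can shift a detector's output.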

[340] Efficient Machine Unlearning via Influence Approximation

Jiawei Liu, Chenwang Wu, Defu Lian, Enhong Chen

Main category: cs.LG

TL;DR: The paper introduces Influence Approximation Unlearning (IAU), an efficient machine unlearning method inspired by cognitive science, linking memorizing (incremental learning) to forgetting (unlearning).

Details

Motivation: Growing privacy concerns necessitate methods for machine learning models to forget specific data. Existing influence-based unlearning is computationally expensive due to Hessian matrix calculations.

Method: The paper connects incremental learning (memorizing) to unlearning (forgetting) and proposes IAU, an algorithm leveraging efficient gradient optimization for unlearning.

Result: IAU achieves a superior balance among removal guarantee, efficiency, and model utility, outperforming state-of-the-art methods.

Conclusion: The study demonstrates the feasibility of efficient unlearning by drawing parallels to incremental learning, offering a practical solution for large-scale models.

Abstract: Due to growing privacy concerns, machine unlearning, which aims at enabling machine learning models to "forget" specific training data, has received increasing attention. Among existing methods, influence-based unlearning has emerged as a prominent approach due to its ability to estimate the impact of individual training samples on model parameters without retraining. However, this approach suffers from prohibitive computational overhead arising from the necessity to compute the Hessian matrix and its inverse across all training samples and parameters, rendering it impractical for large-scale models and scenarios involving frequent data deletion requests. This highlights the difficulty of forgetting. Inspired by cognitive science, which suggests that memorizing is easier than forgetting, this paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). This connection allows machine unlearning to be addressed from the perspective of incremental learning. Unlike the time-consuming Hessian computations in unlearning (forgetting), incremental learning (memorizing) typically relies on more efficient gradient optimization, which supports the aforementioned cognitive theory. Based on this connection, we introduce the Influence Approximation Unlearning (IAU) algorithm for efficient machine unlearning from the incremental perspective. Extensive empirical evaluations demonstrate that IAU achieves a superior balance among removal guarantee, unlearning efficiency, and comparable model utility, while outperforming state-of-the-art methods across diverse datasets and model architectures. Our code is available at https://github.com/Lolo1222/IAU.

[341] Evaluating the Dynamics of Membership Privacy in Deep Learning

Yuetian Chen, Zhiqi Wang, Nathalie Baracaldo, Swanand Ravindra Kadhe, Lei Yu

Main category: cs.LG

TL;DR: The paper introduces a dynamic framework to analyze privacy leakage in deep learning, focusing on how training factors influence sample-level vulnerabilities and revealing early-stage determination of privacy risks.

Details

Motivation: To understand when and how deep learning models encode membership information during training, addressing gaps in current knowledge about privacy risks.

Method: A dynamic analytical framework tracks per-sample vulnerabilities on an FPR-TPR plane, examining impacts of dataset complexity, model architecture, and optimizer choice.

Result: Samples highly vulnerable in the final model show early-stage privacy risk determination, with a strong correlation to intrinsic learning difficulty.

Conclusion: The study enhances understanding of dynamic privacy risk emergence, supporting proactive, privacy-aware training strategies.

Abstract: Membership inference attacks (MIAs) pose a critical threat to the privacy of training data in deep learning. Despite significant progress in attack methodologies, our understanding of when and how models encode membership information during training remains limited. This paper presents a dynamic analytical framework for dissecting and quantifying privacy leakage dynamics at the individual sample level. By tracking per-sample vulnerabilities on an FPR-TPR plane throughout training, our framework systematically measures how factors such as dataset complexity, model architecture, and optimizer choice influence the rate and severity at which samples become vulnerable. Crucially, we discover a robust correlation between a sample’s intrinsic learning difficulty and its privacy risk, and find that the privacy risk of samples highly vulnerable in the final trained model is largely determined early during training. Our results thus provide a deeper understanding of how privacy risks dynamically emerge during training, laying the groundwork for proactive, privacy-aware model training strategies.

[342] An Interpretable Data-Driven Unsupervised Approach for the Prevention of Forgotten Items

Luca Corbucci, Javier Alejandro Borges Legrottaglie, Francesco Spinnato, Anna Monreale, Riccardo Guidotti

Main category: cs.LG

TL;DR: The paper introduces the forgotten item prediction task in Next Basket Prediction (NBP), proposing two interpretable algorithms to identify omitted items and provide clear explanations.

Details

Motivation: Existing NBP methods focus on predicting future purchases but overlook detecting forgotten items due to data scarcity and reliance on opaque models.

Method: Two novel interpretable-by-design algorithms are proposed to identify forgotten items with human-understandable explanations.

Result: Experiments on a real-world dataset show the algorithms outperform state-of-the-art NBP baselines by 10-15% in multiple metrics.

Conclusion: The study successfully addresses the gap in NBP by introducing interpretable methods for forgotten item prediction, demonstrating superior performance.

Abstract: Accurately identifying items forgotten during a supermarket visit and providing clear, interpretable explanations for recommending them remains an underexplored problem within the Next Basket Prediction (NBP) domain. Existing NBP approaches typically only focus on forecasting future purchases, without explicitly addressing the detection of unintentionally omitted items. This gap is partly due to the scarcity of real-world datasets that allow for the reliable estimation of forgotten items. Furthermore, most current NBP methods rely on black-box models, which lack transparency and limit the ability to justify recommendations to end users. In this paper, we formally introduce the forgotten item prediction task and propose two novel interpretable-by-design algorithms. These methods are tailored to identify forgotten items while offering intuitive, human-understandable explanations. Experiments on a real-world retail dataset show our algorithms outperform state-of-the-art NBP baselines by 10-15% across multiple evaluation metrics.

[343] Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, Bing Qin

Main category: cs.LG

TL;DR: The paper introduces process reward models (PRMs) and a novel intrinsic signal-driven evaluation mechanism to improve RL training for large reasoning models (LRMs) in math tasks, achieving higher accuracy with fewer samples.

Details

Motivation: Conventional RL approaches for LRMs rely on sparse outcome-only rewards, leading to inefficient optimization. The study aims to enhance training by leveraging process-level feedback.

Method: Proposes a thought-level evaluation mechanism using intrinsic signals to judge stepwise correctness and aggregate steps into coherent ‘thought’ units. Introduces a capability-adaptive reward mechanism and integrates these into the TP-GRPO algorithm.

Result: Experiments on 1.5B and 7B parameter LRMs show higher problem-solving accuracy with fewer training samples compared to outcome-only reward baselines.

Conclusion: Well-structured process rewards significantly accelerate LRM optimization in math reasoning tasks, validated by improved efficiency and accuracy.

Abstract: Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). But conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in an inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate the RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent ‘thought’ units. These structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step segmentation and alleviating reward hacking. We further introduce a capability-adaptive reward mechanism that dynamically balances exploration and exploitation based on the LRM’s current proficiency, guiding learning without stifling creative trial-and-error. These innovations are integrated into a new off-policy RL algorithm, TP-GRPO, which extends grouped proximal optimization with process-based rewards and improves training efficiency. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines. The results validate that well-structured process rewards can substantially accelerate LRM optimization in math reasoning tasks. Code is available at https://github.com/cs-holder/tp_grpo.

[344] Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions

Qilin Zhou, Haipeng Wang, Zhengyuan Wei, W. K. Chan

Main category: cs.LG

TL;DR: CostCert is a scalable, precise voting-based certified recovery defender for deep learning systems, outperforming PatchGuard by avoiding pairwise comparisons and combinatorial explosion.

Details

Motivation: Existing techniques for patch robustness certification fail to certify the true label within top-k predictions due to attack budget inflation and combinatorial explosion.

Method: CostCert verifies the true label by checking if the attack budget is insufficient to exclude it from top-k predictions, avoiding pairwise comparisons.

Result: CostCert retains up to 57.3% certified accuracy with a patch size of 96, while PatchGuard drops to zero.

Conclusion: CostCert offers a novel, efficient solution for certified recovery in adversarial patch defense, significantly improving over existing methods.

Abstract: Patch robustness certification is an emerging verification approach for defending against adversarial patch attacks with provable guarantees for deep learning systems. Certified recovery techniques guarantee the prediction of the sole true label of a certified sample. However, existing techniques, if applicable to top-k predictions, commonly conduct pairwise comparisons on those votes between labels, failing to certify the sole true label within the top k prediction labels precisely due to the inflation on the number of votes controlled by the attacker (i.e., attack budget); yet enumerating all combinations of vote allocation suffers from the combinatorial explosion problem. We propose CostCert, a novel, scalable, and precise voting-based certified recovery defender. CostCert verifies the true label of a sample within the top k predictions without pairwise comparisons and combinatorial explosion through a novel design: whether the attack budget on the sample is infeasible to cover the smallest total additional votes on top of the votes uncontrollable by the attacker to exclude the true labels from the top k prediction labels. Experiments show that CostCert significantly outperforms the current state-of-the-art defender PatchGuard, such as retaining up to 57.3% in certified accuracy when the patch size is 96, whereas PatchGuard has already dropped to zero.
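
The budget check described in the abstract can be illustrated with a simplified sketch. It is not the authors' implementation: the function name, the vote dictionary, and the tie-breaking rule (a rival must strictly exceed the true label's votes) are assumptions made for illustration. The idea shown is to compute the cheapest set of additional votes that would lift k rival labels above the true label and to certify the sample when the attack budget cannot cover that cost.

```python
from typing import Dict, Hashable


def certify_top_k(votes: Dict[Hashable, int], true_label: Hashable,
                  k: int, attack_budget: int) -> bool:
    """Return True if the attacker cannot push true_label out of the top k.

    votes: votes per label that the attacker cannot control (assumption).
    attack_budget: number of extra votes the attacker may allocate freely.
    Assumption: a rival must strictly exceed the true label's vote count
    to outrank it, i.e. ties favour the true label.
    """
    true_votes = votes[true_label]
    # Extra votes each rival label needs to strictly overtake the true label.
    needs = sorted(
        max(0, true_votes - count + 1)
        for label, count in votes.items()
        if label != true_label
    )
    if len(needs) < k:
        # Fewer than k rivals exist, so the true label is always in the top k.
        return True
    # Cheapest way to lift k rivals above the true label: take the k smallest needs.
    min_cost = sum(needs[:k])
    return attack_budget < min_cost


votes = {"cat": 10, "dog": 7, "fox": 3, "owl": 1}
print(certify_top_k(votes, "cat", k=2, attack_budget=5))  # True: cost 4 + 8 = 12 > 5
print(certify_top_k(votes, "cat", k=1, attack_budget=5))  # False: cost 4 <= 5
```

Sorting the per-label costs and summing the k smallest avoids both pairwise label comparisons and enumeration of vote allocations, which is the property the abstract emphasizes.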

[345] Causal Explanation of Concept Drift – A Truly Actionable Approach

David Komnick, Kathrin Lammers, Barbara Hammer, Valerie Vaquet, Fabian Hinder

Main category: cs.LG

TL;DR: The paper extends model-based drift explanations to causal explanations to improve actionability, demonstrating practical usefulness in isolating causally relevant features for targeted interventions.

Details

Motivation: Understanding and explaining concept drift is critical to prevent model failures and physical system malfunctions in dynamic environments.

Method: The work extends model-based drift explanations to causal explanations and evaluates the framework on various use cases.

Result: The framework successfully isolates causally relevant features affected by concept drift, enabling targeted interventions.

Conclusion: Causal explanations enhance the actionability of drift explanations, proving practical utility in real-world applications.

Abstract: In a world that constantly changes, it is crucial to understand how those changes impact different systems, such as industrial manufacturing or critical infrastructure. Explaining critical changes, referred to as concept drift in the field of machine learning, is the first step towards enabling targeted interventions to avoid or correct model failures, as well as malfunctions and errors in the physical world. Therefore, in this work, we extend model-based drift explanations towards causal explanations, which increases the actionability of the provided explanations. We evaluate our explanation strategy on a number of use cases, demonstrating the practical usefulness of our framework, which isolates the causally relevant features impacted by concept drift and, thus, allows for targeted intervention.

[346] Policy Learning from Large Vision-Language Model Feedback without Reward Modeling

Tung M. Luu, Donghoon Lee, Younghwan Lee, Chang D. Yoo

Main category: cs.LG

TL;DR: PLARE introduces a novel offline RL method using vision-language models (VLMs) to generate preference-based guidance, eliminating manual reward design and achieving competitive performance in robotic tasks.

Details

Motivation: Offline RL avoids costly online interactions but relies on reward-labeled data, which is expensive to design. PLARE aims to bypass this bottleneck by using VLMs for preference-based training.

Method: PLARE queries a VLM for preference labels on visual trajectory segments based on task descriptions, then trains policies using supervised contrastive preference learning without explicit reward models.

Result: PLARE matches or surpasses state-of-the-art VLM-based methods in MetaWorld tasks and proves effective in real-world robotic manipulation.

Conclusion: PLARE offers a practical, scalable solution for offline RL by leveraging VLMs to replace manual reward design, validated in both simulated and real-world settings.

Abstract: Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from MetaWorld, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
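
To make the idea of training a policy directly from preference labels more concrete, here is a minimal sketch of a contrastive preference loss for one labeled pair of segments. The abstract does not give PLARE's exact objective, so the form below (a logistic loss on the difference of discounted, scaled log-probability sums) is an assumption based on the general shape of contrastive preference learning; `alpha`, `gamma`, and the function names are hypothetical.

```python
import math
from typing import Sequence


def segment_score(log_probs: Sequence[float], alpha: float = 0.1,
                  gamma: float = 1.0) -> float:
    """Discounted sum of scaled policy log-probabilities over one segment."""
    return alpha * sum((gamma ** t) * lp for t, lp in enumerate(log_probs))


def contrastive_preference_loss(preferred: Sequence[float],
                                rejected: Sequence[float],
                                alpha: float = 0.1,
                                gamma: float = 1.0) -> float:
    """-log(exp(s_pos) / (exp(s_pos) + exp(s_neg))), computed stably.

    preferred / rejected hold log pi(a_t | s_t) for the segment the VLM
    preferred and the one it rejected (assumed inputs).
    """
    diff = segment_score(preferred, alpha, gamma) - segment_score(rejected, alpha, gamma)
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss drops below log(2) once the policy assigns higher likelihood
# to the actions in the preferred segment.
print(contrastive_preference_loss([-0.1, -0.2], [-2.0, -3.0]))
```

No reward model appears anywhere in the loss: the preference label only decides which segment's log-probabilities go in the numerator.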

[347] A Machine Learning Approach for Honey Adulteration Detection using Mineral Element Profiles

Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh

Main category: cs.LG

TL;DR: A machine learning system using honey mineral profiles detects adulteration with 98.37% accuracy using random forest.

Details

Motivation: To develop an ML-based system for detecting honey adulteration using mineral element profiles.

Method: Two-phase system: preprocessing (missing-value treatment, normalization) and classification (logistic regression, decision tree, random forest).

Result: Random forest outperforms others with 98.37% cross-validation accuracy. Mineral content is effective for detection.

Conclusion: Mineral profiles are robust for honey adulteration detection; random forest is the best classifier.

Abstract: This paper aims to develop a Machine Learning (ML)-based system for detecting honey adulteration utilizing honey mineral element profiles. The proposed system comprises two phases: preprocessing and classification. The preprocessing phase involves the treatment of missing-value attributes and normalization. In the classification phase, we use three supervised ML models: logistic regression, decision tree, and random forest, to discriminate between authentic and adulterated honey. To evaluate the performance of the ML models, we use a public dataset comprising measurements of mineral element content of authentic honey, sugar syrups, and adulterated honey. Experimental findings show that mineral element content in honey provides robust discriminative information for detecting honey adulteration. Results also demonstrate that the random forest-based classifier outperforms other classifiers on this dataset, achieving the highest cross-validation accuracy of 98.37%.

[348] Detection of Adulteration in Coconut Milk using Infrared Spectroscopy and Machine Learning

Mokhtar A. Al-Awadhi, Ratnadeep R. Deshmukh

Main category: cs.LG

TL;DR: A system using infrared spectroscopy and machine learning (LDA and KNN) detects coconut milk adulteration with 93.33% accuracy.

Details

Motivation: To address the issue of adulteration in coconut milk by developing an efficient detection method.

Method: Three-phase system: preprocessing (data cleaning), feature extraction (LDA), and classification (KNN).

Result: Achieved 93.33% cross-validation accuracy in detecting adulterated samples.

Conclusion: The proposed method is effective for detecting adulteration in coconut milk using infrared spectroscopy and machine learning.

Abstract: In this paper, we propose a system for detecting adulteration in coconut milk, utilizing infrared spectroscopy. The machine learning-based proposed system comprises three phases: preprocessing, feature extraction, and classification. The first phase involves removing irrelevant data from coconut milk spectral signals. In the second phase, we employ the Linear Discriminant Analysis (LDA) algorithm for extracting the most discriminating features. In the third phase, we use the K-Nearest Neighbor (KNN) model to classify coconut milk samples into authentic or adulterated. We evaluate the performance of the proposed system using a public dataset comprising Fourier Transform Infrared (FTIR) spectral information of pure and contaminated coconut milk samples. Findings show that the proposed method successfully detects adulteration with a cross-validation accuracy of 93.33%.
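
The final classification phase can be sketched in a few lines. This is an illustrative K-Nearest Neighbor classifier in plain Python, not the authors' code: the toy one-dimensional features stand in for LDA-projected FTIR spectra (the LDA step itself is omitted), and the value of `k` and the sample values are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: List[Sequence[float]], train_y: List[str],
                query: Sequence[float], k: int = 3) -> str:
    """Label a query by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest_labels = [label for _, label in dists[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical features after dimensionality reduction.
train_x = [(0.1,), (0.2,), (0.3,), (1.1,), (1.2,), (1.3,)]
train_y = ["authentic"] * 3 + ["adulterated"] * 3
print(knn_predict(train_x, train_y, (0.25,)))  # authentic
print(knn_predict(train_x, train_y, (1.15,)))  # adulterated
```

Reducing the spectra to a few discriminative components first keeps the distance computation meaningful, since Euclidean distance degrades in very high-dimensional raw spectral space.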

[349] Merging Memory and Space: A Spatiotemporal State Space Neural Operator

Nodens F. Koren, Samuel Lanthaler

Main category: cs.LG

TL;DR: ST-SSM is a compact neural operator for time-dependent PDEs, using factorized spatiotemporal modeling for efficiency and performance.

Details

Motivation: To efficiently learn solution operators for time-dependent PDEs by factorizing spatial and temporal dimensions.

Method: Introduces a novel factorization using structured state-space models for independent temporal and spatial modeling.

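Result: The factorized formulation outperforms alternative schemes (zigzag scanning, parallel independent processing) on several PDE benchmarks, performs competitively with existing baselines while using significantly fewer parameters, and shows improved performance under partial observability.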

Conclusion: ST-SSM offers efficient, generalizable PDE modeling with strong theoretical and empirical support.

Abstract: We propose the Spatiotemporal State Space Neural Operator (ST-SSM), a compact architecture for learning solution operators of time-dependent partial differential equations (PDEs). ST-SSM introduces a novel factorization of the spatial and temporal dimensions, using structured state-space models to independently model temporal evolution and spatial interactions. This design enables parameter efficiency and flexible modeling of long-range spatiotemporal dynamics. A theoretical connection is established between SSMs and neural operators, and a unified universality theorem is proved for the resulting class of architectures. Empirically, we demonstrate that our factorized formulation outperforms alternative schemes such as zigzag scanning and parallel independent processing on several PDE benchmarks, including 1D Burgers’ equation, 1D Kuramoto-Sivashinsky equation, and 2D Navier-Stokes equations under varying physical conditions. Our model performs competitively with existing baselines while using significantly fewer parameters. In addition, our results reinforce previous findings on the benefits of temporal memory by showing improved performance under partial observability. Our results highlight the advantages of dimensionally factorized operator learning for efficient and generalizable PDE modeling, and put this approach on a firm theoretical footing.

[350] Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable DNN Accelerator Design

Yinhui Ma, Tomomasa Yamasaki, Zhehui Wang, Tao Luo, Bo Wang

Main category: cs.LG

TL;DR: Coflex, a novel HW-NAS framework, uses Sparse Gaussian Process and multi-objective Bayesian optimization to efficiently co-optimize neural network performance and hardware energy efficiency, reducing computational costs significantly.

Details

Motivation: The challenges of extensive search space and high computational costs in HW-NAS hinder its practical adoption, necessitating a more efficient solution.

Method: Coflex integrates Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization, leveraging sparse inducing points to reduce GP kernel complexity from cubic to near-linear.

Result: Coflex outperforms state-of-the-art methods in network accuracy and Energy-Delay-Product, achieving computational speed-ups of 1.9x to 9.5x.

Conclusion: Coflex provides a scalable and efficient solution for HW-NAS, balancing performance and computational overhead.

Abstract: Hardware-Aware Neural Architecture Search (HW-NAS) is an efficient approach to automatically co-optimizing neural network performance and hardware energy efficiency, making it particularly useful for the development of Deep Neural Network accelerators on the edge. However, the extensive search space and high computational cost pose significant challenges to its practical adoption. To address these limitations, we propose Coflex, a novel HW-NAS framework that integrates the Sparse Gaussian Process (SGP) with multi-objective Bayesian optimization. By leveraging sparse inducing points, Coflex reduces the GP kernel complexity from cubic to near-linear with respect to the number of training samples, without compromising optimization performance. This enables scalable approximation of large-scale search spaces, substantially decreasing computational overhead while preserving high predictive accuracy. We evaluate the efficacy of Coflex across various benchmarks, focusing on accelerator-specific architecture. Our experimental results show that Coflex outperforms state-of-the-art methods in terms of network accuracy and Energy-Delay-Product, while achieving a computational speed-up ranging from 1.9x to 9.5x.

[351] Manifold-regularised Signature Kernel Large-Margin $\ell_p$-SVDD for Multidimensional Time Series Anomaly Detection

Shervin Rahimzadeh Arashloo

Main category: cs.LG

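TL;DR: The paper generalises the large-margin $\ell_p$-SVDD approach with manifold regularisation and a signature kernel representation for multidimensional time series anomaly detection, and analyses its generalisation performance via Rademacher complexities.

Details

Motivation: To exploit the geometry of the data distribution and capture time series complexities in order to improve anomaly detection performance.

Method: Formulates a manifold-regularised variant of $\ell_p$-SVDD that encourages label smoothness on the underlying manifold, derives an optimisation technique from an existing Representer theorem, and uses a signature kernel to represent time series.

Result: Provides a theoretical study of generalisation performance using Rademacher complexities and an experimental comparison against other methods across various data sets.

Conclusion: Combining manifold regularisation with a signature kernel allows $\ell_p$-SVDD to capture structural information and time series complexities for anomaly detection.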

Abstract: We generalise the recently introduced large-margin $\ell_p$-SVDD approach to exploit the geometry of data distribution via manifold regularising and a signature kernel representation for time series anomaly detection. Specifically, we formulate a manifold-regularised variant of the $\ell_p$-SVDD method to encourage label smoothness on the underlying manifold to capture structural information for improved detection performance. Drawing on an existing Representer theorem, we then provide an effective optimisation technique for the proposed method and show that it can benefit from the signature kernel to capture time series complexities for anomaly detection. We theoretically study the proposed approach using Rademacher complexities to analyse its generalisation performance and also provide an experimental assessment of the proposed method across various data sets to compare its performance against other methods.

[352] Explainable artificial intelligence model predicting the risk of all-cause mortality in patients with type 2 diabetes mellitus

Olga Vershinina, Jacopo Sabbatinelli, Anna Rita Bonfigli, Dalila Colombaretti, Angelica Giuliani, Mikhail Krivonosov, Arseniy Trukhanov, Claudio Franceschi, Mikhail Ivanchenko, Fabiola Olivieri

Main category: cs.LG

TL;DR: A machine learning model (EST) was developed to predict all-cause mortality in T2DM patients, achieving strong performance (AUC up to 0.86) and interpretability via SHAP analysis.

Details

Motivation: Accurate mortality risk estimation in T2DM patients is vital for personalized treatment.

Method: Analyzed 554 T2DM patients over 16.8 years, identified key survival features, and trained ML models (best: EST). SHAP was used for interpretability.

Result: EST model achieved high predictive accuracy (C-statistic 0.776, AUC up to 0.86). SHAP provided model interpretability.

Conclusion: The model offers strong predictive performance and clinical interpretability, aiding in timely treatment optimization for high-risk patients.

Abstract: Objective. Type 2 diabetes mellitus (T2DM) is a highly prevalent non-communicable chronic disease that substantially reduces life expectancy. Accurate estimation of all-cause mortality risk in T2DM patients is crucial for personalizing and optimizing treatment strategies. Research Design and Methods. This study analyzed a cohort of 554 patients (aged 40-87 years) with diagnosed T2DM over a maximum follow-up period of 16.8 years, during which 202 patients (36%) died. Key survival-associated features were identified, and multiple machine learning (ML) models were trained and validated to predict all-cause mortality risk. To improve model interpretability, Shapley additive explanations (SHAP) was applied to the best-performing model. Results. The extra survival trees (EST) model, incorporating ten key features, demonstrated the best predictive performance. The model achieved a C-statistic of 0.776, with the area under the receiver operating characteristic curve (AUC) values of 0.86, 0.80, 0.841, and 0.826 for 5-, 10-, 15-, and 16.8-year all-cause mortality predictions, respectively. The SHAP approach was employed to interpret the model’s individual decision-making processes. Conclusions. The developed model exhibited strong predictive performance for mortality risk assessment. Its clinically interpretable outputs enable potential bedside application, improving the identification of high-risk patients and supporting timely treatment optimization.

[353] Incorporating structural uncertainty in causal decision making

Maurits Kaptein

Main category: cs.LG

TL;DR: The paper examines when structural uncertainty in causal relationships (e.g., $X \rightarrow Y$ vs. $X \leftarrow Y$) is significant enough to require Bayesian model averaging, identifying key conditions for its benefit.

Details

Motivation: Practitioners often ignore structural uncertainty in causal inference, which can lead to suboptimal decisions. This paper aims to address when and how this uncertainty should be accounted for.

Method: The study uses Bayesian model averaging over competing causal structures, focusing on bivariate relationships. It establishes conditions for its benefit and proves optimality under regularity conditions. Simulations with modern causal discovery methods validate the approach.

Result: Model averaging is beneficial when structural uncertainty is moderate to high, causal effects differ between structures, and loss functions are sensitive to effect size. Simulations confirm the feasibility of quantifying this uncertainty.

Conclusion: The framework addresses a typically overlooked source of uncertainty in causal inference, complementing existing robust methods.

Abstract: Practitioners making decisions based on causal effects typically ignore structural uncertainty. We analyze when this uncertainty is consequential enough to warrant methodological solutions (Bayesian model averaging over competing causal structures). Focusing on bivariate relationships ($X \rightarrow Y$ vs. $X \leftarrow Y$), we establish that model averaging is beneficial when: (1) structural uncertainty is moderate to high, (2) causal effects differ substantially between structures, and (3) loss functions are sufficiently sensitive to the size of the causal effect. We prove optimality results of our suggested methodological solution under regularity conditions and demonstrate through simulations that modern causal discovery methods can provide, within limits, the necessary quantification. Our framework complements existing robust causal inference approaches by addressing a distinct source of uncertainty typically overlooked in practice.
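
The three conditions can be illustrated with a toy calculation. The sketch below is not the paper's framework: it assumes a squared-error loss on the estimated effect of intervening on X, a causal effect `effect_forward` under $X \rightarrow Y$, a zero effect under $X \leftarrow Y$, and a posterior probability `p_forward` for the forward structure. All names are hypothetical. It compares committing to the most probable structure against averaging over both.

```python
def expected_loss(action: float, p_forward: float, effect_forward: float,
                  effect_reverse: float = 0.0) -> float:
    """Posterior expected squared loss of acting as if the effect were `action`."""
    return (p_forward * (action - effect_forward) ** 2
            + (1 - p_forward) * (action - effect_reverse) ** 2)


def select_then_act(p_forward: float, effect_forward: float,
                    effect_reverse: float = 0.0) -> float:
    """Ignore structural uncertainty: commit to the more probable structure."""
    return effect_forward if p_forward >= 0.5 else effect_reverse


def model_average(p_forward: float, effect_forward: float,
                  effect_reverse: float = 0.0) -> float:
    """Bayesian model averaging: posterior-mean effect across both structures."""
    return p_forward * effect_forward + (1 - p_forward) * effect_reverse


for p in (0.6, 0.99):
    beta = 2.0
    loss_select = expected_loss(select_then_act(p, beta), p, beta)
    loss_average = expected_loss(model_average(p, beta), p, beta)
    print(p, round(loss_select, 4), round(loss_average, 4))
```

In this toy setting the gap between the two strategies works out to $(1-p)^2\beta^2$ for $p \ge 0.5$, so it shrinks as structural uncertainty vanishes ($p \to 1$) or as the effects under the two structures coincide ($\beta \to 0$), in line with conditions (1) and (2).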

[354] Directional Ensemble Aggregation for Actor-Critics

Nicklas Werge, Yi-Shan Wu, Bahareh Tasdighi, Melih Kandemir

Main category: cs.LG

TL;DR: DEA is a dynamic ensemble aggregation method for Q-value estimates in reinforcement learning, adapting conservatism and exploration based on data-driven directional parameters.

Details

Motivation: Static ensemble aggregation methods in off-policy RL discard valuable information and lack adaptability to task-specific needs or learning phases.

Method: DEA introduces two learnable directional parameters for critic-side conservatism and actor-side exploration, trained using ensemble disagreement-weighted Bellman errors.

Result: DEA outperforms static ensemble strategies across continuous control benchmarks and various learning regimes.

Conclusion: DEA provides a flexible, adaptive solution for Q-value aggregation, improving performance in continuous control tasks.

Abstract: Off-policy reinforcement learning in continuous control tasks depends critically on accurate $Q$-value estimates. Conservative aggregation over ensembles, such as taking the minimum, is commonly used to mitigate overestimation bias. However, these static rules are coarse, discard valuable information from the ensemble, and cannot adapt to task-specific needs or different learning regimes. We propose Directional Ensemble Aggregation (DEA), an aggregation method that adaptively combines $Q$-value estimates in actor-critic frameworks. DEA introduces two fully learnable directional parameters: one that modulates critic-side conservatism and another that guides actor-side policy exploration. Both parameters are learned using ensemble disagreement-weighted Bellman errors, which weight each sample solely by the direction of its Bellman error. This directional learning mechanism allows DEA to adjust conservatism and exploration in a data-driven way, adapting aggregation to both uncertainty levels and the phase of training. We evaluate DEA across continuous control benchmarks and learning regimes - from interactive to sample-efficient - and demonstrate its effectiveness over static ensemble strategies.

[355] A Verifier Hierarchy

Maurits Kaptein

Main category: cs.LG

TL;DR: The paper explores the trade-off between certificate length and verifier runtime, proving a theorem that links reduced verification time to longer certificates. It applies this to complexity class separations and problems like string periodicity, while also offering insights into P vs. NP.

Details

Motivation: To understand the relationship between certificate length and verifier runtime, and to apply this understanding to complexity theory and practical problems.

Method: Proves a Verifier Trade-off Theorem linking verification time reduction to certificate length, and applies it to complexity classes and natural problems.

Result: A hierarchy based on certificate complexity is established, with applications to conjectured separations between complexity classes and problems like string periodicity.

Conclusion: The theorem provides a framework for analyzing verifier trade-offs, with implications for complexity theory and the P vs. NP problem.

Abstract: We investigate the trade-off between certificate length and verifier runtime. We prove a Verifier Trade-off Theorem showing that reducing the inherent verification time of a language from $f(n)$ to $g(n)$, where $f(n) \ge g(n)$, requires certificates of length at least $\Omega(\log(f(n) / g(n)))$. This theorem induces a natural hierarchy based on certificate complexity. We demonstrate its applicability to analyzing conjectured separations between complexity classes (e.g., $\mathsf{NP}$ and $\mathsf{EXPTIME}$) and to studying natural problems such as string periodicity and rotation detection. Additionally, we provide perspectives on the $\mathsf{P}$ vs. $\mathsf{NP}$ problem by relating it to the existence of sub-linear certificates.
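
As a quick illustration of the stated bound, the following plug-in instances show what certificate length it implies for two speed-ups. The choices of $f$ and $g$ are examples picked here, not results taken from the paper, and $c$ denotes an arbitrary positive constant.

```latex
% Quadratic to linear verification time:
\[
f(n) = n^{2},\quad g(n) = n
\;\Longrightarrow\;
\Omega\!\left(\log\frac{n^{2}}{n}\right) = \Omega(\log n).
\]
% Exponential to polynomial verification time:
\[
f(n) = 2^{n},\quad g(n) = n^{c}
\;\Longrightarrow\;
\Omega\!\left(\log\frac{2^{n}}{n^{c}}\right) = \Omega(n - c\log n) = \Omega(n).
\]
```

The second instance shows why the bound connects to sub-linear certificates: an exponential-to-polynomial reduction in verification time already forces certificates of linear length.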

[356] Differentially Private Clipped-SGD: High-Probability Convergence with Arbitrary Clipping Level

Saleh Vatan Khah, Savelii Chezhegov, Shahrokh Farahmand, Samuel Horváth, Eduard Gorbunov

Main category: cs.LG

TL;DR: The paper provides the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, showing faster convergence under heavy-tailed noise while balancing privacy guarantees.

Details

Motivation: Existing analyses require increasing clipping thresholds, incompatible with standard DP mechanisms like Gaussian. This work addresses the gap for fixed clipping levels.

Method: Analyzes DP-Clipped-SGD with fixed clipping for convex and non-convex smooth optimization under heavy-tailed noise (bounded central α-th moment).

Result: Converges to a neighborhood of the optimal solution faster than existing methods, balancing convergence speed and DP noise.

Conclusion: Fixed clipping levels enable efficient convergence under heavy-tailed noise while maintaining DP compatibility, offering a refined trade-off.

Abstract: Gradient clipping is a fundamental tool in Deep Learning, improving the high-probability convergence of stochastic first-order methods like SGD, AdaGrad, and Adam under heavy-tailed noise, which is common in training large language models. It is also a crucial component of Differential Privacy (DP) mechanisms. However, existing high-probability convergence analyses typically require the clipping threshold to increase with the number of optimization steps, which is incompatible with standard DP mechanisms like the Gaussian mechanism. In this work, we close this gap by providing the first high-probability convergence analysis for DP-Clipped-SGD with a fixed clipping level, applicable to both convex and non-convex smooth optimization under heavy-tailed noise, characterized by a bounded central $\alpha$-th moment assumption, $\alpha \in (1,2]$. Our results show that, with a fixed clipping level, the method converges to a neighborhood of the optimal solution with a faster rate than the existing ones. The neighborhood can be balanced against the noise introduced by DP, providing a refined trade-off between convergence speed and privacy guarantees.
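
A single update of the analysed method can be sketched as follows. This is a minimal illustration rather than the paper's algorithm statement: the gradient is clipped to a fixed level that does not grow with the step count, Gaussian noise is added, and a plain SGD step is taken. The function names are hypothetical, and calibrating `noise_std` to a target privacy budget is deliberately left out.

```python
import math
import random
from typing import List


def clip(grad: List[float], level: float) -> List[float]:
    """Rescale grad so that its Euclidean norm is at most `level`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, level / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_clipped_sgd_step(x: List[float], grad: List[float], lr: float,
                        clip_level: float, noise_std: float,
                        rng: random.Random) -> List[float]:
    """One step: clip at a fixed level, add Gaussian noise, then descend."""
    clipped = clip(grad, clip_level)
    # Gaussian mechanism: noise_std would be tied to clip_level and the
    # privacy budget in a real deployment (not modelled here).
    noisy = [g + rng.gauss(0.0, noise_std) for g in clipped]
    return [xi - lr * gi for xi, gi in zip(x, noisy)]


rng = random.Random(0)
x = [3.0, 4.0]
for _ in range(50):
    grad = list(x)  # gradient of 0.5 * ||x||^2, used as a toy objective
    x = dp_clipped_sgd_step(x, grad, lr=0.1, clip_level=1.0, noise_std=0.05, rng=rng)
print(x)
```

Because the clipping level stays fixed, the sensitivity of each update is bounded by a constant, which is what makes the Gaussian mechanism applicable; the price, as the abstract notes, is convergence to a neighborhood of the optimum rather than to the optimum itself.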

[357] Continual Learning with Synthetic Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Main category: cs.LG

TL;DR: The paper proposes Experience Blending, a CL framework using synthetic boundary data (SBD) to mitigate forgetting, outperforming baselines by 6-13% accuracy.

Details

Motivation: Address catastrophic forgetting in continual learning by improving decision boundary stability with synthetic data.

Method: Experience Blending integrates stored key samples and SBD, using DP noise for SBD generation and joint training.

Result: Outperforms nine CL baselines with accuracy gains of 10%, 6%, and 13% on CIFAR-10, CIFAR-100, and Tiny ImageNet.

Conclusion: Synthetic boundary data enhances CL performance, validating the effectiveness of Experience Blending.

Abstract: Continual learning (CL) aims to address catastrophic forgetting in models trained sequentially on multiple tasks. While experience replay has shown promise, its effectiveness is often limited by the sparse distribution of stored key samples, leading to overly simplified decision boundaries. We hypothesize that introducing synthetic data near the decision boundary (Synthetic Boundary Data, or SBD) during training serves as an implicit regularizer, improving boundary stability and mitigating forgetting. To validate this hypothesis, we propose a novel training framework, Experience Blending, which integrates knowledge from both stored key samples and synthetic, boundary-adjacent data. Experience Blending consists of two core components: (1) a multivariate Differential Privacy (DP) noise mechanism that injects batch-wise noise into low-dimensional feature representations, generating SBD; and (2) an end-to-end training strategy that jointly leverages both stored key samples and SBD. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet demonstrate that our method outperforms nine CL baselines, achieving accuracy improvements of 10%, 6%, and 13%, respectively.

[358] Transparent AI: The Case for Interpretability and Explainability

Dhanesh Ramachandram, Himanshu Joshi, Judy Zhu, Dhari Gandhi, Lucas Hartman, Ananya Raval

Main category: cs.LG

TL;DR: The paper discusses the importance of transparency in AI for high-stakes decisions, sharing insights and strategies for integrating interpretability into AI systems.

Details

Motivation: To promote responsible and trustworthy AI by emphasizing transparency and interpretability in AI systems.

Method: Presents lessons learned from practical interpretability applications across diverse domains.

Result: Actionable strategies and implementation guidance for organizations at different AI maturity levels.

Conclusion: Interpretability should be a core design principle in AI systems, not an afterthought.

Abstract: As artificial intelligence systems increasingly inform high-stakes decisions across sectors, transparency has become foundational to responsible and trustworthy AI implementation. Leveraging our role as a leading institute in advancing AI research and enabling industry adoption, we present key insights and lessons learned from practical interpretability applications across diverse domains. This paper offers actionable strategies and implementation guidance tailored to organizations at varying stages of AI maturity, emphasizing the integration of interpretability as a core design principle rather than a retrospective add-on.

[359] From LLMs to Edge: Parameter-Efficient Fine-Tuning on Edge Devices

Georg Slamanig, Francesco Corti, Olga Saukh

Main category: cs.LG

TL;DR: The paper benchmarks PEFT methods (LoRA, DoRA, GaLore) on convolutional architectures for edge devices, revealing trade-offs in memory efficiency and FLOPs reduction compared to LLMs.

Details

Motivation: To explore the under-researched application of PEFT methods in smaller models (e.g., CNNs) for edge devices, addressing computational cost and efficiency.

Method: Evaluates PEFT methods on convolutional architectures using PyTorch profilers, comparing performance and computational costs with traditional fine-tuning.

Result: PEFT methods are less memory-efficient with depthwise-separable CNNs but can reduce FLOPs by up to 95% for edge-optimized architectures.

Conclusion: Provides insights for selecting PEFT methods based on hardware constraints and performance needs, with code available online.

Abstract: Parameter-efficient fine-tuning (PEFT) methods reduce the computational costs of updating deep learning models by minimizing the number of additional parameters used to adapt a model to a downstream task. While extensively researched in large language models (LLMs), their application to smaller models used on edge devices, such as convolutional neural networks, remains underexplored. This paper benchmarks and analyzes popular PEFT methods on convolutional architectures typically deployed in resource-constrained edge environments. We evaluate LoRA, DoRA, and GaLore for updating standard and depthwise convolutional architectures to handle distribution shifts and accommodate unseen classes. We utilize recently proposed PyTorch profilers to compare the updated model performance and computational costs of these PEFT methods with traditional fine-tuning approaches. With resource efficiency in mind, we investigate their update behavior across different rank dimensions. We find that the evaluated PEFT methods are only half as memory-efficient when applied to depthwise-separable convolution architectures, compared to their efficiency with LLMs. Conversely, when targeting convolutional architectures optimized for edge deployment, adapter-based PEFT methods can reduce floating point operations (FLOPs) during model updates by up to 95%. These insights offer valuable guidance for selecting PEFT methods based on hardware constraints, performance requirements, and application needs. Our code is online.

[360] Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions

Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner

Main category: cs.LG

TL;DR: The paper introduces subquadratic-time algorithms for computing matrix-vector products with asymmetric Gaussian kernel matrices, relying on a sparsity-style assumption that is validated experimentally.

Details

Motivation: The need for fast processing of attention matrices, particularly in applications like large language models (LLMs), drives the study of efficient algorithms for matrix-vector products.

Method: The algorithms leverage a modelling assumption that the sum of entries in the Gaussian Kernel matrix scales linearly with n, enabling subquadratic-time computation.

Result: The proposed algorithm runs in subquadratic time for unrestricted vectors; the underlying linear-scaling assumption is validated experimentally on Gaussian kernel matrices from settings such as fast attention computation in LLMs.

Conclusion: The study provides the first subquadratic-time algorithm for such matrices under the linear scaling assumption, with practical relevance in fast attention computation.

Abstract: Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K\in \mathbb{R}^{n\times n}$. $K$’s columns are indexed by a set of $n$ keys $k_1,k_2,\ldots,k_n\in \mathbb{R}^d$, rows by a set of $n$ queries $q_1,q_2,\ldots,q_n\in \mathbb{R}^d$, and its $(i,j)$ entry is $K_{ij} = e^{-\|q_i-k_j\|_2^2/2\sigma^2}$ for some bandwidth parameter $\sigma>0$. Given a vector $x\in \mathbb{R}^n$ and error parameter $\epsilon>0$, our task is to output a $y\in \mathbb{R}^n$ such that $\|Kx-y\|_2\leq \epsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$. Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$, as opposed to worst case quadratic growth. We validate this assumption experimentally, for Gaussian kernel matrices encountered in various settings such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption, for unrestricted vectors.
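
As a point of reference for the problem statement above, the sketch below computes the exact product $Kx$ in quadratic time and measures the sum of the entries of $K$, the quantity the modelling assumption is about. It is not the paper's subquadratic algorithm. The function names, the random data, and the bandwidth value are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(queries, keys, sigma):
    """Dense K with K[i, j] = exp(-||q_i - k_j||^2 / (2 sigma^2))."""
    # Squared distances via ||q||^2 + ||k||^2 - 2 q.k, clipped at 0 for round-off.
    sq = (
        np.sum(queries**2, axis=1)[:, None]
        + np.sum(keys**2, axis=1)[None, :]
        - 2.0 * queries @ keys.T
    )
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def naive_kernel_matvec(queries, keys, x, sigma):
    """Exact y = K x in O(n^2 d) time: the quadratic baseline."""
    return gaussian_kernel_matrix(queries, keys, sigma) @ x

# Hypothetical toy data; real attention keys and queries would go here.
rng = np.random.default_rng(0)
n, d, sigma = 200, 8, 0.5
queries = rng.normal(size=(n, d))
keys = rng.normal(size=(n, d))
x = rng.normal(size=n)

K = gaussian_kernel_matrix(queries, keys, sigma)
y = naive_kernel_matvec(queries, keys, x, sigma)

# Under the paper's assumption, sum(K) grows linearly in n,
# so this ratio should stay roughly constant as n increases.
entry_sum_per_n = K.sum() / n
```

Repeating the last computation for increasing `n` on real data is one way to check whether the linear-growth assumption is plausible for a given workload.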

[361] Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform

Sirine Arfa, Bernhard Vogginger, Christian Mayr

Main category: cs.LG

TL;DR: SNNs enable low-power, low-latency RL on neuromorphic hardware. Quantized SNNs trained with Q-learning achieve up to a 32x energy reduction on SpiNNaker2 compared with a GTX 1650 GPU, with comparable inference latency.

Details

Motivation: To leverage SNNs for energy-efficient RL in robotic tasks, comparing neuromorphic hardware (SpiNNaker2) to GPUs.

Method: Train SNNs with Q-learning, quantize to 8-bit, deploy on SpiNNaker2, and compare latency/power to GPU.

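Result: SpiNNaker2 reduces energy consumption by up to 32x compared with the GPU baseline, keeps inference latency on par with GPU execution (with improvements in some task settings), and shows real-time viability.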

Conclusion: SpiNNaker2 is promising for scalable, low-energy neuromorphic RL, making it compelling for efficient deep Q-learning.

Abstract: Spiking Neural Networks (SNNs) promise orders-of-magnitude lower power consumption and low-latency inference on neuromorphic hardware for a wide range of robotic tasks. In this work, we present an energy-efficient implementation of a reinforcement learning (RL) algorithm using quantized SNNs to solve two classical control tasks. The network is trained using the Q-learning algorithm, then fine-tuned and quantized to low-bit (8-bit) precision for embedded deployment on the SpiNNaker2 neuromorphic chip. To evaluate the comparative advantage of SpiNNaker2 over conventional computing platforms, we analyze inference latency, dynamic power consumption, and energy cost per inference for our SNN models, comparing performance against a GTX 1650 GPU baseline. Our results demonstrate SpiNNaker2’s strong potential for scalable, low-energy neuromorphic computing, achieving up to 32x reduction in energy consumption. Inference latency remains on par with GPU-based execution, with improvements observed in certain task settings, reinforcing SpiNNaker2’s viability for real-time neuromorphic control and making the neuromorphic approach a compelling direction for efficient deep Q-learning.

[362] Optimised Feature Subset Selection via Simulated Annealing

Fernando Martínez-García, Álvaro Rubio-García, Samuel Fernández-Lorenzo, Juan José García-Ripoll, Diego Porras

Main category: cs.LG

TL;DR: SA-FDR is a new algorithm for feature selection using simulated annealing and Fisher discriminant ratio, achieving compact subsets with high accuracy.

Details

Motivation: To address the challenge of selecting minimal yet informative feature subsets in high-dimensional settings, focusing on model sparsity and interpretability.

Method: Uses simulated annealing for global search over feature subsets, guided by the Fisher discriminant ratio as a proxy for model quality.

Result: SA-FDR selects compact feature subsets with high predictive accuracy, capturing inter-feature dependencies better than greedy methods.

Conclusion: SA-FDR offers a flexible and effective solution for interpretable models in high-dimensional scenarios.

Abstract: We introduce SA-FDR, a novel algorithm for $\ell_0$-norm feature selection that considers this task as a combinatorial optimisation problem and solves it by using simulated annealing to perform a global search over the space of feature subsets. The optimisation is guided by the Fisher discriminant ratio, which we use as a computationally efficient proxy for model quality in classification tasks. Our experiments, conducted on datasets with up to hundreds of thousands of samples and hundreds of features, demonstrate that SA-FDR consistently selects more compact feature subsets while achieving a high predictive accuracy. This ability to recover informative yet minimal sets of features stems from its capacity to capture inter-feature dependencies often missed by greedy optimisation approaches. As a result, SA-FDR provides a flexible and effective solution for designing interpretable models in high-dimensional settings, particularly when model sparsity, interpretability, and performance are crucial.
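
The search described above can be sketched as follows. This is a minimal illustration, not the SA-FDR implementation: it assumes a two-class problem, a fixed subset size `k`, single-feature swap moves, a geometric cooling schedule, and the multivariate Fisher criterion as the score. All function names and parameter values are hypothetical.

```python
import numpy as np

def fisher_ratio(X, y, subset):
    """Two-class Fisher criterion (m1 - m0)^T Sw^{-1} (m1 - m0) on the chosen columns."""
    cols = sorted(subset)
    X0, X1 = X[y == 0][:, cols], X[y == 1][:, cols]
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    # Within-class scatter; a small ridge keeps the linear solve well posed.
    Sw = np.atleast_2d(np.cov(X0, rowvar=False)) + np.atleast_2d(np.cov(X1, rowvar=False))
    Sw = Sw + 1e-6 * np.eye(len(cols))
    return float(diff @ np.linalg.solve(Sw, diff))

def anneal_subset(X, y, k, steps=500, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over size-k feature subsets using swap moves."""
    rng = np.random.default_rng(seed)
    all_features = set(range(X.shape[1]))
    current = set(int(f) for f in rng.choice(X.shape[1], size=k, replace=False))
    current_score = fisher_ratio(X, y, current)
    best, best_score = set(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        drop = int(rng.choice(sorted(current)))
        add = int(rng.choice(sorted(all_features - current)))
        candidate = (current - {drop}) | {add}
        delta = fisher_ratio(X, y, candidate) - current_score
        # Always accept improvements; accept worse subsets with Metropolis probability.
        if delta >= 0 or rng.random() < np.exp(delta / temp):
            current, current_score = candidate, current_score + delta
            if current_score > best_score:
                best, best_score = set(current), current_score
    return sorted(best), fisher_ratio(X, y, best)

# Toy data: only features 0 and 1 carry class information.
rng = np.random.default_rng(1)
n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 10))
X[:, 0] += 2.0 * y
X[:, 1] += 2.0 * y

selected, score = anneal_subset(X, y, k=2)
```

Because the score is evaluated on whole subsets, a swap that looks bad for one feature in isolation can still be accepted when it pairs well with the rest, which is the kind of inter-feature dependency a greedy forward search can miss.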

[363] GraphRAG-R1: Graph Retrieval-Augmented Generation with Process-Constrained Reinforcement Learning

Chuanyue Yu, Kuo Zhao, Yuhan Li, Heng Chang, Mingjian Feng, Xiangzhe Jiang, Yufei Sun, Jia Li, Yuzhi Zhang, Jianxin Li, Ziwei Zhang

Main category: cs.LG

TL;DR: GraphRAG-R1 enhances multi-hop reasoning in LLMs using adaptive reinforcement learning and hybrid retrieval, outperforming existing GraphRAG methods.

Details

Motivation: Existing GraphRAG methods struggle with multi-hop reasoning due to pre-defined heuristics and underutilized LLM reasoning potential.

Method: Proposes GraphRAG-R1 with process-constrained RL (GRPO), PRA and CAF rewards, phase-dependent training, and hybrid graph-textual retrieval.

Result: Outperforms state-of-the-art GraphRAG methods on in-domain and out-of-domain datasets, improving reasoning capabilities.

Conclusion: GraphRAG-R1 effectively addresses multi-hop reasoning bottlenecks and is adaptable to various retrieval methods.

Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has shown great effectiveness in enhancing the reasoning abilities of LLMs by leveraging graph structures for knowledge representation and modeling complex real-world relationships. However, existing GraphRAG methods still face significant bottlenecks when handling complex problems that require multi-hop reasoning, as their query and retrieval phases are largely based on pre-defined heuristics and do not fully utilize the reasoning potentials of LLMs. To address this problem, we propose GraphRAG-R1, an adaptive GraphRAG framework by training LLMs with process-constrained outcome-based reinforcement learning (RL) to enhance the multi-hop reasoning ability. Our method can decompose complex problems, autonomously invoke retrieval tools to acquire necessary information, and perform effective reasoning. Specifically, we utilize a modified version of Group Relative Policy Optimization (GRPO) that supports rollout-with-thinking capability. Next, we design two process-constrained reward functions. To handle the shallow retrieval problem, we design a Progressive Retrieval Attenuation (PRA) reward to encourage essential retrievals. Then, to handle the over-thinking problem, we design Cost-Aware F1 (CAF) reward to balance the model performance with computational costs. We further design a phase-dependent training strategy, containing three training stages corresponding to cold start and these two rewards. Lastly, our method adopts a hybrid graph-textual retrieval to improve the reasoning capacity. Extensive experimental results demonstrate that GraphRAG-R1 boosts LLM capabilities in solving complex reasoning problems compared to state-of-the-art GraphRAG methods on both in-domain and out-of-domain datasets. Furthermore, our framework can be flexibly integrated with various existing retrieval methods, consistently delivering performance improvements.

[364] EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution

Yu-Tang Chang, Shih-Fang Chen

Main category: cs.LG

TL;DR: The paper introduces EB-gMCR, an energy-based deep learning solver for signal unmixing, automating component discovery and improving scalability and reliability over classical methods.

Details

Motivation: Classical MCR methods struggle with scalability and reliability as dataset size or component count grows, requiring manual component specification.

Method: EB-gMCR reformulates MCR as a generative process, using a differentiable gating network to automatically select the smallest component set from a large candidate pool.

Result: EB-gMCR achieved high accuracy (R^2 >= 0.98) and near-exact component count estimation, even with noisy data.

Conclusion: EB-gMCR provides a scalable and adaptable solution for large-scale signal unmixing, with potential applications in chemical and biological research.

Abstract: Signal unmixing analysis decomposes data into basic patterns and is widely applied in chemical and biological research. Multivariate curve resolution (MCR), a branch of signal unmixing, separates mixed chemical signals into base patterns (components) and their concentrations, playing a key role in understanding composition. Classical MCR is typically framed as matrix factorization (MF) and requires a user-specified component count, usually unknown in real data. As dataset size or component count increases, the scalability and reliability of MF-based MCR face significant challenges. This study reformulates MCR as a generative process (gMCR), and introduces an energy-based deep learning solver, EB-gMCR, that automatically discovers the smallest component set able to reconstruct the data faithfully. EB-gMCR starts from a large candidate pool (e.g., 1024 spectra) and employs a differentiable gating network to retain only active components while estimating their concentrations. On noisy synthetic datasets containing up to 256 latent sources, EB-gMCR maintained R^2 >= 0.98 and recovered the component count within 5% of the ground truth; at lower noise it achieved R^2 >= 0.99 with near exact component estimation. Additional chemical priors, such as non-negativity or nonlinear mixing, enter as simple plug-in functions, enabling adaptation to other instruments or domains without altering the core learning process. By uniting high-capacity generative modeling and hard component selection, EB-gMCR offers a practical route to large-scale signal unmixing analysis, including chemical library-driven scenarios. The source code is available at https://github.com/b05611038/ebgmcr_solver.

[365] Hierarchical Message-Passing Policies for Multi-Agent Reinforcement Learning

Tommaso Marzi, Cesare Alippi, Andrea Cini

Main category: cs.LG

TL;DR: Proposes a novel method combining feudal HRL and message-passing for scalable MARL, addressing partial observability and non-stationarity.

Details

Motivation: Address challenges in decentralized MARL like partial observability and non-stationarity by integrating coordination and temporal abstraction.

Method: Uses feudal HRL and hierarchical graph structure for planning; lower-level agents receive goals and exchange messages, with a novel reward-assignment method.

Result: Outperforms state-of-the-art methods on benchmarks.

Conclusion: The proposed method effectively combines hierarchical policies and message-passing for improved MARL performance.

Abstract: Decentralized Multi-Agent Reinforcement Learning (MARL) methods allow for learning scalable multi-agent policies, but suffer from partial observability and induced non-stationarity. These challenges can be addressed by introducing mechanisms that facilitate coordination and high-level planning. Specifically, coordination and temporal abstraction can be achieved through communication (e.g., message passing) and Hierarchical Reinforcement Learning (HRL) approaches to decision-making. However, optimization issues limit the applicability of hierarchical policies to multi-agent systems. As such, the combination of these approaches has not been fully explored. To fill this void, we propose a novel and effective methodology for learning multi-agent hierarchies of message-passing policies. We adopt the feudal HRL framework and rely on a hierarchical graph structure for planning and coordination among agents. Agents at lower levels in the hierarchy receive goals from the upper levels and exchange messages with neighboring agents at the same level. To learn hierarchical multi-agent policies, we design a novel reward-assignment method based on training the lower-level policies to maximize the advantage function associated with the upper levels. Results on relevant benchmarks show that our method performs favorably compared to the state of the art.

[366] Deep Learning-based Prediction of Clinical Trial Enrollment with Uncertainty Estimates

Tien Huu Do, Antoine Masquelier, Nae Eoun Lee, Jonathan Crowther

Main category: cs.LG

TL;DR: A deep learning-based method using pre-trained language models and attention mechanisms predicts clinical trial patient enrollment more accurately than baselines.

Details

Motivation: Clinical trials demand significant financial investment and meticulous planning, so accurately predicting patient enrollment, a key factor in trial success, is a primary challenge during the planning phase.

Method: A neural network model combines pre-trained language models for document analysis with tabular features via attention, enhanced by a probabilistic Gamma layer for uncertainty.

Result: The method outperforms baseline models in predicting patient enrollment across multiple sites in real-world trials.

Conclusion: The proposed deep learning approach effectively addresses the challenge of predicting clinical trial enrollment, offering improved accuracy and uncertainty estimation.

Abstract: Clinical trials are a systematic endeavor to assess the safety and efficacy of new drugs or treatments. Conducting such trials typically demands significant financial investment and meticulous planning, highlighting the need for accurate predictions of trial outcomes. Accurately predicting patient enrollment, a key factor in trial success, is one of the primary challenges during the planning phase. In this work, we propose a novel deep learning-based method to address this critical challenge. Our method, implemented as a neural network model, leverages pre-trained language models (PLMs) to capture the complexities and nuances of clinical documents, transforming them into expressive representations. These representations are then combined with encoded tabular features via an attention mechanism. To account for uncertainties in enrollment prediction, we enhance the model with a probabilistic layer based on the Gamma distribution, which enables range estimation. We apply the proposed model to predict clinical trial duration, assuming site-level enrollment follows a Poisson-Gamma process. We carry out extensive experiments on real-world clinical trial data, and show that the proposed method can effectively predict the number of patients enrolled at a number of sites for a given clinical trial, outperforming established baseline models.
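
To make the Poisson-Gamma assumption concrete: if a site's enrollment rate is Gamma-distributed with shape $\alpha$ and rate $\beta$, and enrollment over a horizon $t$ is Poisson given that rate, the count has mean $\alpha t/\beta$ and variance $(\alpha t/\beta)(1 + t/\beta)$. The sketch below checks these closed forms by simulation and derives a prediction range. The parameter values are made up, and treating `alpha` and `beta` as outputs of the model's probabilistic layer is an assumption, not a detail given in the abstract.

```python
import numpy as np

def enrollment_summary(alpha, beta, months, n_draws=200_000, seed=0):
    """Simulate site enrollment under a Poisson-Gamma model and compare with closed forms."""
    rng = np.random.default_rng(seed)
    # NumPy parameterises the Gamma by shape and scale, so scale = 1 / rate.
    rates = rng.gamma(shape=alpha, scale=1.0 / beta, size=n_draws)
    counts = rng.poisson(rates * months)
    low, high = np.percentile(counts, [5, 95])
    mean_cf = alpha * months / beta
    return {
        "mean_closed_form": mean_cf,
        "var_closed_form": mean_cf * (1.0 + months / beta),
        "mean_simulated": float(counts.mean()),
        "var_simulated": float(counts.var()),
        "interval_90": (float(low), float(high)),
    }

# Hypothetical site: Gamma(shape=3, rate=2) patients per month over 6 months.
summary = enrollment_summary(alpha=3.0, beta=2.0, months=6.0)
```

The spread of `interval_90` relative to the mean shows why a point forecast alone can mislead planning: the Gamma uncertainty on the rate inflates the variance well beyond the Poisson level.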

[367] L-GTA: Latent Generative Modeling for Time Series Augmentation

Luis Roque, Carlos Soares, Vitor Cerqueira, Luis Torgo

Main category: cs.LG

TL;DR: L-GTA is a transformer-based generative model for time series augmentation, improving data reliability and predictive accuracy.

Details

Motivation: Data augmentation is crucial for time series tasks, but existing methods lack control and reliability.

Method: Uses a transformer-based variational recurrent autoencoder to apply controlled latent space transformations.

Result: Produces more reliable and controllable augmented data, enhancing predictive accuracy.

Conclusion: L-GTA outperforms direct transformation methods, offering better synthetic data generation.

Abstract: Data augmentation is gaining importance across various aspects of time series analysis, from forecasting to classification and anomaly detection tasks. We introduce the Latent Generative Transformer Augmentation (L-GTA) model, a generative approach using a transformer-based variational recurrent autoencoder. This model uses controlled transformations within the latent space of the model to generate new time series that preserve the intrinsic properties of the original dataset. L-GTA enables the application of diverse transformations, ranging from simple jittering to magnitude warping, and combining these basic transformations to generate more complex synthetic time series datasets. Our evaluation of several real-world datasets demonstrates the ability of L-GTA to produce more reliable, consistent, and controllable augmented data. This translates into significant improvements in predictive accuracy and similarity measures compared to direct transformation methods.

[368] On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Gabriel Mongaras, Eric C. Larson

Main category: cs.LG

TL;DR: The paper explores why softmax attention outperforms linear attention, framing softmax attention as an RNN to analyze its components.

Details

Motivation: To understand the performance gap between softmax attention and linear attention, despite the latter's computational efficiency.

Method: Derives softmax attention’s recurrent form, analyzing its components using RNN terminology.

Result: Identifies the expressive advantages of softmax attention over linear attention.

Conclusion: Softmax attention’s components, when analyzed as an RNN, explain its superior expressiveness compared to linear attention.

Abstract: Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is the quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. Despite these linear forms of attention being derived from the original softmax formulation, they typically lag in terms of downstream accuracy. While strong intuition of the softmax nonlinearity on the query and key inner product suggests that it has desirable properties compared to other nonlinearities, the question of why this discrepancy exists still remains unanswered. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows for the ablation of the components of softmax attention to understand the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts.
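
One standard way to see the relationship between the two attention forms is to expand the exponential in softmax attention as a Taylor series in the query-key inner product: truncating at first order gives a kernel that admits a fixed-size recurrent state, as linear attention does. The sketch below illustrates that idea numerically. Whether it matches the paper's derivation term by term is an assumption, and the function names and toy dimensions are hypothetical.

```python
import math
import numpy as np

def causal_kernel_attention(Q, K, V, sim):
    """out_t = sum_{j<=t} sim(q_t . k_j) v_j / sum_{j<=t} sim(q_t . k_j)."""
    T = Q.shape[0]
    scores = sim(Q @ K.T) * np.tril(np.ones((T, T)))  # causal mask
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def taylor_exp(order):
    """Truncated Taylor series of exp, applied elementwise."""
    return lambda s: sum(s**n / math.factorial(n) for n in range(order + 1))

def recurrent_first_order(Q, K, V):
    """Same output as sim(s) = 1 + s, computed with a fixed-size running state."""
    d, dv = K.shape[1], V.shape[1]
    S = np.zeros((d, dv))   # running sum of k_j v_j^T
    z = np.zeros(d)         # running sum of k_j
    v_sum = np.zeros(dv)    # running sum of v_j
    out = np.zeros((Q.shape[0], dv))
    for t, (q, k, v) in enumerate(zip(Q, K, V)):
        S += np.outer(k, v)
        z += k
        v_sum += v
        out[t] = (v_sum + q @ S) / ((t + 1) + q @ z)
    return out

rng = np.random.default_rng(0)
T, d, dv = 6, 4, 3
Q = 0.3 * rng.normal(size=(T, d))  # small norms keep 1 + q.k positive
K = 0.3 * rng.normal(size=(T, d))
V = rng.normal(size=(T, dv))

softmax_out = causal_kernel_attention(Q, K, V, np.exp)
first_order = causal_kernel_attention(Q, K, V, taylor_exp(1))
third_order = causal_kernel_attention(Q, K, V, taylor_exp(3))

err_first = np.abs(first_order - softmax_out).max()
err_third = np.abs(third_order - softmax_out).max()
recurrent_out = recurrent_first_order(Q, K, V)
```

The recurrent loop reproduces the first-order result with constant memory, while adding higher-order terms closes the gap to softmax at the price of a larger state.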

[369] OptiGradTrust: Byzantine-Robust Federated Learning with Multi-Feature Gradient Analysis and Reinforcement Learning-Based Trust Weighting

Mohammad Karami, Fatemeh Ghassemi, Hamed Kebriaei, Hamid Azadegan

Main category: cs.LG

TL;DR: OptiGradTrust is a defense framework for Federated Learning that uses a six-dimensional fingerprint and hybrid RL-attention module to counter Byzantine attacks and data heterogeneity, achieving superior performance over existing methods.

Details

Motivation: To address vulnerabilities in Federated Learning (FL) to Byzantine attacks and statistical heterogeneity while preserving privacy in medical collaborations.

Method: Introduces OptiGradTrust, evaluating gradients via a six-dimensional fingerprint and a hybrid RL-attention module, and FedBN-Prox for convergence under data heterogeneity.

Result: Outperforms state-of-the-art defenses, achieving up to +1.6 percentage points over FLGuard under non-IID conditions and robustness against diverse attacks.

Conclusion: OptiGradTrust and FedBN-Prox provide effective solutions for secure and efficient FL in medical applications.

Abstract: Federated Learning (FL) enables collaborative model training across distributed medical institutions while preserving patient privacy, but remains vulnerable to Byzantine attacks and statistical heterogeneity. We present OptiGradTrust, a comprehensive defense framework that evaluates gradient updates through a novel six-dimensional fingerprint including VAE reconstruction error, cosine similarity metrics, $L_2$ norm, sign-consistency ratio, and Monte Carlo Shapley value, which drive a hybrid RL-attention module for adaptive trust scoring. To address convergence challenges under data heterogeneity, we develop FedBN-Prox (FedBN-P), combining Federated Batch Normalization with proximal regularization for optimal accuracy-convergence trade-offs. Extensive evaluation across MNIST, CIFAR-10, and Alzheimer’s MRI datasets under various Byzantine attack scenarios demonstrates significant improvements over state-of-the-art defenses, achieving up to +1.6 percentage points over FLGuard under non-IID conditions while maintaining robust performance against diverse attack patterns through our adaptive learning approach.
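
Three of the fingerprint features named above (cosine similarity, $L_2$ norm, and sign-consistency ratio) are cheap to compute per client update. The sketch below shows one plausible way to do so. Using the coordinate-wise median of all updates as the reference is an assumption made here for illustration; the VAE reconstruction error, Shapley value, and RL-attention trust scoring are not covered.

```python
import numpy as np

def gradient_features(update, reference, eps=1e-12):
    """Cheap per-client features comparing a flattened update with a reference direction."""
    norm_u = np.linalg.norm(update)
    norm_r = np.linalg.norm(reference)
    return {
        "cosine": float(update @ reference / (norm_u * norm_r + eps)),
        "l2_norm": float(norm_u),
        # Fraction of coordinates whose sign agrees with the reference.
        "sign_consistency": float(np.mean(np.sign(update) == np.sign(reference))),
    }

# Toy round: four honest clients plus one sign-flipping, scaled attacker.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)
honest = [true_grad + 0.1 * rng.normal(size=1000) for _ in range(4)]
byzantine = -5.0 * true_grad
updates = np.stack(honest + [byzantine])

# Hypothetical reference: coordinate-wise median across clients.
reference = np.median(updates, axis=0)
features = [gradient_features(u, reference) for u in updates]
```

In this toy round the attacker stands out on all three features at once, which is the motivation for scoring trust from several signals instead of a single threshold.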

[370] SHAP-Guided Regularization in Machine Learning Models

Amal Saadallah

Main category: cs.LG

TL;DR: A SHAP-guided regularization framework is proposed to improve model performance and interpretability by incorporating feature importance constraints during training.

Details

Motivation: To enhance predictive performance and interpretability of machine learning models by leveraging SHAP feature attributions for optimization.

Method: Introduces entropy-based penalties to encourage sparse, stable feature attributions, applied to regression and classification tasks, starting with TreeSHAP for tree-based models.

Result: Improves generalization performance and ensures robust, interpretable feature attributions in benchmark datasets.

Conclusion: The framework offers a novel explainability-driven regularization approach, enhancing model accuracy and reliability.

Abstract: Feature attribution methods such as SHapley Additive exPlanations (SHAP) have become instrumental in understanding machine learning models, but their role in guiding model optimization remains underexplored. In this paper, we propose a SHAP-guided regularization framework that incorporates feature importance constraints into model training to enhance both predictive performance and interpretability. Our approach applies entropy-based penalties to encourage sparse, concentrated feature attributions while promoting stability across samples. The framework is applicable to both regression and classification tasks. Our first exploration started with investigating a tree-based model regularization using TreeSHAP. Through extensive experiments on benchmark regression and classification datasets, we demonstrate that our method improves generalization performance while ensuring robust and interpretable feature attributions. The proposed technique offers a novel, explainability-driven regularization approach, making machine learning models both more accurate and more reliable.
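
An entropy-based penalty of the kind described above can be sketched as the Shannon entropy of each sample's normalised absolute attributions: it is near zero when one feature dominates and reaches $\log d$ when all $d$ features contribute equally. The exact penalty used in the paper is not given in the abstract, so the form below, the weight `lam`, and the function names are assumptions. Attributions are passed in as a plain array, so no SHAP library is required.

```python
import numpy as np

def attribution_entropy(attributions, eps=1e-12):
    """Shannon entropy of each row's normalised |attribution| distribution."""
    mags = np.abs(attributions)
    p = mags / (mags.sum(axis=1, keepdims=True) + eps)
    return -np.sum(p * np.log(p + eps), axis=1)

def regularised_loss(base_loss, attributions, lam=0.1):
    """Task loss plus a penalty that favours sparse, concentrated attributions."""
    return base_loss + lam * attribution_entropy(attributions).mean()

# Two hypothetical samples with four features each.
concentrated = np.array([[1.0, 0.0, 0.0, 0.0]])
uniform = np.array([[0.5, -0.5, 0.5, -0.5]])

h_concentrated = attribution_entropy(concentrated)[0]
h_uniform = attribution_entropy(uniform)[0]
loss = regularised_loss(base_loss=0.30, attributions=np.vstack([concentrated, uniform]))
```

For tree ensembles the penalty is not differentiable in the usual sense, so in practice it would more likely steer split or model selection than a gradient step; that integration detail is outside this sketch.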

[371] TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses

Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi

Main category: cs.LG

TL;DR: TweakLLM is a routing architecture using a lightweight LLM to adapt cached responses dynamically, improving cache effectiveness without sacrificing response quality.

Details

Motivation: Efficient response caching in LLMs is challenging due to personalized interactions and semantic similarity limitations.

Method: TweakLLM employs a lightweight LLM to dynamically tweak cached responses for incoming prompts.

Result: TweakLLM maintains response quality comparable to frontier models while enhancing cache effectiveness.

Conclusion: TweakLLM offers a scalable, resource-efficient caching solution for high-volume LLM deployments.

Abstract: Large Language Models (LLMs) process millions of queries daily, making efficient response caching a compelling optimization for reducing cost and latency. However, preserving relevance to user queries using this approach proves difficult due to the personalized nature of chatbot interactions and the limited accuracy of semantic similarity search. To address this, we present TweakLLM, a novel routing architecture that employs a lightweight LLM to dynamically adapt cached responses to incoming prompts. Through comprehensive evaluation, including user studies with side-by-side comparisons, satisfaction voting, as well as multi-agent LLM debates, we demonstrate that TweakLLM maintains response quality comparable to frontier models while significantly improving cache effectiveness. Our results across real-world datasets highlight TweakLLM as a scalable, resource-efficient caching solution for high-volume LLM deployments without compromising user experience.
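
The routing idea can be sketched as a cache lookup followed by one of two model calls. Everything specific below is an assumption made for illustration: the similarity threshold, the bag-of-words embedding, and the `big_model` and `small_model` callables are hypothetical stand-ins, not the paper's components.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: bag-of-words counts (a real system would use a sentence encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TweakRouter:
    def __init__(self, big_model, small_model, threshold=0.6, embed=toy_embed):
        self.big_model = big_model      # expensive model: prompt -> response
        self.small_model = small_model  # cheap model: (prompt, cached_prompt, cached_response) -> response
        self.threshold = threshold
        self.embed = embed
        self.cache = []                 # list of (embedding, prompt, response)

    def answer(self, prompt):
        emb = self.embed(prompt)
        best = max(self.cache, key=lambda entry: cosine(emb, entry[0]), default=None)
        if best is not None and cosine(emb, best[0]) >= self.threshold:
            # Cache hit: let the lightweight model adapt the cached response.
            return self.small_model(prompt, best[1], best[2]), "tweaked"
        # Cache miss: fall back to the large model and store its response.
        response = self.big_model(prompt)
        self.cache.append((emb, prompt, response))
        return response, "generated"

router = TweakRouter(
    big_model=lambda p: "BIG:" + p,
    small_model=lambda p, cached_prompt, cached_response: "SMALL:" + p,
)
first = router.answer("how do I reset my password")
second = router.answer("how do I reset my email password")
third = router.answer("best pizza in naples")
```

The threshold trades cache hit rate against the risk of adapting an answer to a question it does not really fit, which is the relevance problem the abstract attributes to plain semantic caching.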

[372] One-Step Flow Policy Mirror Descent

Tianyi Chen, Haitong Ma, Na Li, Kai Wang, Bo Dai

Main category: cs.LG

TL;DR: FPMD enables 1-step sampling in online RL, improving responsiveness without extra training, matching diffusion policy performance with fewer evaluations.

Details

Motivation: Diffusion policies in RL are expressive but slow due to iterative sampling, limiting responsiveness.

Method: Proposes FPMD, leveraging flow matching models to enable 1-step sampling without distillation or consistency training. Two variants: flow policy and MeanFlow policy.

Result: Outperforms diffusion policies in speed (hundreds of times fewer evaluations) while maintaining comparable performance on MuJoCo benchmarks.

Conclusion: FPMD offers a faster, efficient alternative to diffusion policies in online RL without sacrificing performance.

Abstract: Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on flow policy and MeanFlow policy parametrizations, respectively. Extensive empirical evaluations on MuJoCo benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring hundreds of times fewer function evaluations during inference.
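
A one-dimensional Gaussian example gives some intuition for the link between target variance and single-step error. With a standard normal source, a target $N(a, s^2)$, and straight interpolation, the marginal velocity field has a closed form, and a single Euler step from $t=0$ maps every sample to the mean $a$. The one-step output is therefore exact only as $s \to 0$. This is an illustrative toy under those assumptions, not the FPMD algorithm, and the function names are hypothetical.

```python
import numpy as np

def velocity(x, t, a, s):
    """Exact marginal velocity for source N(0, 1), target N(a, s^2), x_t = (1 - t) x0 + t x1."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return a + (t * s**2 - (1.0 - t)) / var_t * (x - t * a)

def euler_sample(x0, a, s, n_steps):
    """Integrate dx/dt = velocity from t = 0 to t = 1 with n_steps Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt, a, s)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=20_000)
a, s = 2.0, 0.5

one_step = euler_sample(x0, a, s, n_steps=1)      # collapses to the target mean
many_steps = euler_sample(x0, a, s, n_steps=200)  # recovers the target spread

one_step_std = one_step.std()
many_step_std = many_steps.std()
```

In this toy the gap between the one-step and many-step samples is governed entirely by `s`, so a policy whose action distribution has low variance loses little by sampling in a single step.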

[373] DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

Rabeya Tus Sadia, Qiang Cheng

Main category: cs.LG

TL;DR: DepMicroDiff is a new framework for microbiome data imputation that combines diffusion-based modeling with a Dependency-Aware Transformer, outperforming existing methods in accuracy and robustness.

Details

Motivation: Microbiome data's sparsity and noise hinder accurate imputation, limiting downstream tasks like biomarker discovery. Existing methods miss microbial interdependencies and contextual metadata.

Method: DepMicroDiff integrates diffusion-based generative modeling with a Dependency-Aware Transformer (DAT), uses VAE-based pretraining on cancer datasets, and conditions on patient metadata encoded via an LLM.

Result: DepMicroDiff achieves higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE on TCGA datasets, showing robustness across cancer types.

Conclusion: DepMicroDiff is a robust and generalizable solution for microbiome imputation, capturing complex dependencies and leveraging metadata effectively.

Abstract: Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.

[374] Anomalous Samples for Few-Shot Anomaly Detection

Aymane Abdali, Bartosz Boguslawski, Lucas Drumetz, Vincent Gripon

Main category: cs.LG

TL;DR: The paper explores using anomalous samples in few-shot settings for binary anomaly classification, proposing a multi-score method and augmentation-based validation.

Details

Motivation: Anomalous data is often scarce, but even a few samples can significantly impact model performance. The study aims to leverage such samples effectively.

Method: Proposes a multi-score anomaly detection approach combining Zero-Shot and memory-based techniques, with augmentation-based validation for score optimization.

Result: Demonstrates the utility of anomalous samples compared to regular ones, showing effectiveness on industrial datasets.

Conclusion: Anomalous samples can enhance anomaly classification, with the proposed method offering practical benefits in few-shot settings.

Abstract: Several anomaly detection and classification methods rely on large amounts of non-anomalous or “normal” samples under the assumption that anomalous data is typically harder to acquire. This hypothesis becomes questionable in Few-Shot settings, where as little as one annotated sample can make a significant difference. In this paper, we tackle the question of utilizing anomalous samples in training a model for binary anomaly classification. We propose a methodology that incorporates anomalous samples in a multi-score anomaly detection score leveraging recent Zero-Shot and memory-based techniques. We compare the utility of anomalous samples to that of regular samples and study the benefits and limitations of each. In addition, we propose an augmentation-based validation technique to optimize the aggregation of the different anomaly scores and demonstrate its effectiveness on popular industrial anomaly detection datasets.

[375] Improving annotator selection in Active Learning using a mood and fatigue-aware Recommender System

Diana Mortagua

Main category: cs.LG

TL;DR: The paper proposes a Knowledge-Based Recommendation System for Active Learning to select annotators by considering their past accuracy, mood, and fatigue, reducing annotation errors and improving model performance.

Details

Motivation: Active Learning (AL) faces challenges in minimizing annotation errors and optimizing efficiency. Existing strategies overlook internal factors like mood and fatigue, which affect annotator productivity.

Method: The study introduces a Knowledge-Based Recommendation System (RS) that ranks annotators based on past accuracy, mood, fatigue, and query details, simulating realistic annotator behavior.

Result: The approach reduces annotation errors and model uncertainty, improving accuracy and F1-scores, though the latter improvements are modest.

Conclusion: The study addresses human cognitive factors in AL, demonstrating the benefits of incorporating annotator mood and fatigue into selection strategies.

Abstract: This study centers on overcoming the challenge of selecting the best annotators for each query in Active Learning (AL), with the objective of minimizing misclassifications. AL recognizes the challenges related to cost and time when acquiring labeled data, and decreases the number of labeled data needed. Nevertheless, there is still the necessity to reduce annotation errors, aiming to be as efficient as possible, to achieve the expected accuracy faster. Most strategies for query-annotator pairs do not consider internal factors that affect productivity, such as mood, attention, motivation, and fatigue levels. This work addresses this gap in the existing literature, by not only considering how the internal factors influence annotators (mood and fatigue levels) but also presenting a new query-annotator pair strategy, using a Knowledge-Based Recommendation System (RS). The RS ranks the available annotators, allowing to choose one or more to label the queried instance using their past accuracy values, and their mood and fatigue levels, as well as information about the instance queried. This work bases itself on existing literature on mood and fatigue influence on human performance, simulating annotators in a realistic manner, and predicting their performance with the RS. The results show that considering past accuracy values, as well as mood and fatigue levels reduces the number of annotation errors made by the annotators, and the uncertainty of the model through its training, when compared to not using internal factors. Accuracy and F1-score values were also better in the proposed approach, despite not being as substantial as the aforementioned. The methodologies and findings presented in this study begin to explore the open challenge of human cognitive factors affecting AL.

[376] Consensus-Driven Active Model Selection

Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery

Main category: cs.LG

TL;DR: CODA is a method for active model selection that uses consensus and disagreement among candidate models to prioritize labeling, reducing annotation effort by upwards of 70% compared to prior methods.

Details

Motivation: Traditional model selection requires costly validation datasets; CODA aims to minimize this effort by efficiently identifying the best model.

Method: CODA uses a probabilistic framework to model relationships between classifiers, categories, and data points, leveraging consensus and disagreement to guide label acquisition and Bayesian inference.

Result: CODA outperforms existing methods, reducing annotation effort by over 70% on 26 benchmark tasks.

Conclusion: CODA provides an efficient, scalable solution for active model selection, significantly lowering the cost of identifying the best model.

Abstract: The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset – a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 26 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 70% compared to the previous state-of-the-art. Code and data are available at https://github.com/justinkay/coda.

[377] Disparate Conditional Prediction in Multiclass Classifiers

Sivan Sabato, Eran Treister, Elad Yom-Tov

Main category: cs.LG

TL;DR: The paper proposes methods to audit multiclass classifiers for fairness under multiclass equalized odds, extending the Disparate Conditional Prediction (DCP) measure from binary to multiclass classifiers. It introduces local-optimization methods for estimating DCP under two regimes and demonstrates their accuracy.

Details

Motivation: To address fairness in multiclass classifiers by extending the DCP measure and providing practical methods for auditing fairness under different data availability scenarios.

Method: Generalizes DCP to multiclass classifiers and introduces local-optimization methods for estimating DCP under two regimes: known conditional confusion matrices and when these are unavailable.

Result: The proposed methods accurately detect unfair treatment in classifiers, as demonstrated by experiments.

Conclusion: The methods effectively audit multiclass classifiers for fairness, with practical applications and provided code for implementation.

Abstract: We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCP under two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. Code is provided at https://github.com/sivansabato/DCPmulticlass.
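The conditional confusion matrices mentioned in the abstract can be illustrated with a minimal sketch. The code below estimates, for each protected group, the probability of each predicted label given the true label; equalized odds holds when these matrices coincide across groups. The function names and the `largest_gap` summary are assumptions made for illustration. In particular, `largest_gap` is a crude disparity indicator and is not the DCP measure or the estimation method from the paper.

```python
import math
from collections import Counter


def conditional_confusion(y_true, y_pred, groups, labels):
    """Estimate P(pred = yh | true = y, group = g) from labelled samples.

    Rows with no samples for a (group, true label) pair are filled with NaN.
    """
    counts = Counter(zip(groups, y_true, y_pred))
    result = {}
    for g in sorted(set(groups)):
        result[g] = {}
        for y in labels:
            row_total = sum(counts[(g, y, yh)] for yh in labels)
            result[g][y] = {
                yh: (counts[(g, y, yh)] / row_total if row_total else float("nan"))
                for yh in labels
            }
    return result


def largest_gap(confusions, labels):
    """Largest cross-group difference in any conditional probability.

    Hypothetical helper: a rough indicator of deviation from equalized odds,
    NOT the paper's DCP.
    """
    gap = 0.0
    for y in labels:
        for yh in labels:
            vals = [
                confusions[g][y][yh]
                for g in confusions
                if not math.isnan(confusions[g][y][yh])
            ]
            if vals:
                gap = max(gap, max(vals) - min(vals))
    return gap


if __name__ == "__main__":
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 1, 0, 0, 1, 1]
    groups = ["a", "a", "b", "b", "a", "b"]
    conf = conditional_confusion(y_true, y_pred, groups, labels=[0, 1, 2])
    print(conf["a"][0], conf["b"][0])
    print(largest_gap(conf, [0, 1, 2]))
```

A gap of zero over all entries would mean the estimated matrices agree across groups; any practical audit would also need to account for sampling noise, which this sketch ignores.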

[378] Insights into Closed-form IPM-GAN Discriminator Guidance for Diffusion Modeling

Aadithya Srikanth, Siddarth Asokan, Nishanth Shetty, Chandra Sekhar Seelamantula

Main category: cs.LG

TL;DR: The paper proposes a theoretical framework linking GAN discriminators to Langevin-based sampling in diffusion models, and shows that closed-form kernel-based discriminator guidance improves generation quality.

Details

Motivation: To understand and improve the role of GAN discriminators in diffusion models, unifying score-based training and IPM-GAN optimization.

Method: Theoretical analysis of GAN discriminator effects on Langevin sampling, introducing closed-form kernel-based guidance for diffusion models.

Result: Improved generation quality (measured by CLIP-FID and KID metrics) in DDIM and LDM settings across datasets.

Conclusion: The framework unifies score-based and GAN training, enhancing diffusion models with practical improvements in image generation.

Abstract: Diffusion models are a state-of-the-art generative modeling framework that transforms noise to images via Langevin sampling, guided by the score, which is the gradient of the logarithm of the data distribution. Recent works have shown empirically that the generation quality can be improved when guided by a classifier network, which is typically the discriminator trained in a generative adversarial network (GAN) setting. In this paper, we propose a theoretical framework to analyze the effect of the GAN discriminator on Langevin-based sampling, and show that the IPM-GAN optimization can be seen as one of smoothed score-matching, wherein the scores of the data and the generator distributions are convolved with the kernel function associated with the IPM. The proposed approach serves to unify score-based training and optimization of IPM-GANs. Based on these insights, we demonstrate that closed-form kernel-based discriminator guidance results in improvements (in terms of CLIP-FID and KID metrics) when applied atop baseline diffusion models. We demonstrate these results on the denoising diffusion implicit model (DDIM) and latent diffusion model (LDM) settings on various standard datasets. We also show that the proposed approach can be combined with existing accelerated-diffusion techniques to improve latent-space image generation.

[379] Optimal and Near-Optimal Adaptive Vector Quantization

Ran Ben-Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, Shay Vargaftik

Main category: cs.LG

TL;DR: The paper introduces improved algorithms for Adaptive Vector Quantization (AVQ), reducing time and space complexity, enabling broader use in machine learning.

Details

Motivation: Optimal adaptive quantization is often deemed infeasible due to high runtime and memory demands, limiting its practical applications.

Method: The authors propose algorithms with asymptotically better time and space complexity for AVQ, including a near-optimal one for large inputs.

Result: Experiments demonstrate the feasibility of using AVQ more widely in machine learning due to improved efficiency.

Conclusion: The new algorithms make AVQ more practical, potentially expanding its use in machine learning applications.

Abstract: Quantization is a fundamental optimization for many machine-learning use cases, including compressing gradients, model weights and activations, and datasets. The most accurate form of quantization is adaptive, where the error is minimized with respect to a given input, rather than optimizing for the worst case. However, optimal adaptive quantization methods are considered infeasible in terms of both their runtime and memory requirements. We revisit the Adaptive Vector Quantization (AVQ) problem and present algorithms that find optimal solutions with asymptotically improved time and space complexity. We also present an even faster near-optimal algorithm for large inputs. Our experiments show our algorithms may open the door to using AVQ more extensively in a variety of machine learning applications.
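To make the notion of input-adaptive quantization concrete, here is a minimal sketch of a naive dynamic program. It assumes one common formulation that the summary does not spell out: each value is rounded stochastically and without bias to its two neighbouring quantization levels, the objective is the summed rounding variance, and the levels are restricted to values present in the input. It is a slow baseline of the kind the abstract calls infeasible at scale, not one of the paper's algorithms; all names are hypothetical.

```python
def stochastic_variance(x, lo, hi):
    """Variance of unbiased stochastic rounding of x to {lo, hi}, lo <= x <= hi."""
    return (hi - x) * (x - lo)


def naive_avq(values, s):
    """Choose s levels from `values` minimizing the total rounding variance.

    Naive polynomial-time dynamic program (roughly O(s * d^3) as written).
    The smallest and largest inputs are always selected as levels.
    """
    xs = sorted(values)
    d = len(xs)
    if s < 2 or s > d:
        raise ValueError("need 2 <= s <= len(values)")

    def cost(i, j):
        # Variance of the points strictly between adjacent levels xs[i], xs[j].
        return sum(stochastic_variance(xs[k], xs[i], xs[j]) for k in range(i + 1, j))

    inf = float("inf")
    # best[m][j]: minimal cost for xs[0..j] using m levels, the last one at xs[j].
    best = [[inf] * d for _ in range(s + 1)]
    prev = [[-1] * d for _ in range(s + 1)]
    best[1][0] = 0.0
    for m in range(2, s + 1):
        for j in range(1, d):
            for i in range(j):
                if best[m - 1][i] == inf:
                    continue
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j] = candidate
                    prev[m][j] = i

    # Walk the back-pointers from the largest value to recover the levels.
    levels = []
    j = d - 1
    for m in range(s, 0, -1):
        levels.append(xs[j])
        j = prev[m][j]
    return best[s][d - 1], levels[::-1]


if __name__ == "__main__":
    print(naive_avq([0, 1, 2, 10], 3))
```

For the input `[0, 1, 2, 10]` with three levels, placing the middle level at 2 leaves only the value 1 to be rounded between 0 and 2, which is cheaper than rounding 2 between 1 and 10.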

[380] Molecule Graph Networks with Many-body Equivariant Interactions

Zetian Mao, Chuan-Shen Hu, Jiawen Li, Chen Liang, Diptesh Das, Masato Sumita, Kelin Xia, Koji Tsuda

Main category: cs.LG

TL;DR: ENINet introduces many-body equivariant interactions to improve directional information in message passing, enhancing prediction accuracy for quantum chemical properties.

Details

Motivation: Addressing the loss of directional information when two-body bond vectors cancel each other in message passing.

Method: Develops Equivariant N-body Interaction Networks (ENINet) to incorporate l=1 equivariant many-body interactions.

Result: Improved prediction accuracy for scalar and tensorial quantum chemical properties.

Conclusion: Many-body equivariant interactions are essential for capturing directional symmetric information in molecular interactions.

Abstract: Message passing neural networks have demonstrated significant efficacy in predicting molecular interactions. Introducing equivariant vectorial representations augments expressivity by capturing geometric data symmetries, thereby improving model accuracy. However, two-body bond vectors in opposition may cancel each other out during message passing, leading to the loss of directional information on their shared node. In this study, we develop Equivariant N-body Interaction Networks (ENINet) that explicitly integrates l = 1 equivariant many-body interactions to enhance directional symmetric information in the message passing scheme. We provided a mathematical analysis demonstrating the necessity of incorporating many-body equivariant interactions and generalized the formulation to $N$-body interactions. Experiments indicate that integrating many-body equivariant representations enhances prediction accuracy across diverse scalar and tensorial quantum chemical properties.

[381] Deciphering interventional dynamical causality from non-intervention complex systems

Jifan Shi, Yang Li, Juan Zhao, Siyang Leng, Rui Bao, Kazuyuki Aihara, Luonan Chen, Wei Lin

Main category: cs.LG

TL;DR: The paper introduces Interventional Dynamical Causality (IntDC) and Interventional Embedding Entropy (IEE) to measure causality in non-intervention systems using observational data, outperforming traditional methods.

Details

Motivation: Causal studies in non-intervention systems are challenging; the paper aims to address this by proposing a new framework and computational criterion.

Method: Proposes IntDC and IEE, leveraging delay-embedding to measure causality without requiring interventions or dynamical models.

Result: IEE accurately identifies causal edges, handles confounding, and robustly quantifies causal strength, validated through numerical and real-world experiments.

Conclusion: IntDC and IEE provide an efficient, robust approach for causal analysis in non-intervention systems using observational data.

Abstract: Detecting and quantifying causality is a focal topic in the fields of science, engineering, and interdisciplinary studies. However, causal studies on non-intervention systems attract much attention but remain extremely challenging. The delay-embedding technique provides a promising approach. In this study, we propose a framework named Interventional Dynamical Causality (IntDC) in contrast to the traditional Constructive Dynamical Causality (ConDC). ConDC, including Granger causality, transfer entropy and convergence of cross-mapping, measures the causality by constructing a dynamical model without considering interventions. A computational criterion, Interventional Embedding Entropy (IEE), is proposed to measure causal strengths in an interventional manner. IEE is an intervened causal information flow but in the delay-embedding space. Further, the IEE theoretically and numerically enables the deciphering of IntDC solely from observational (non-interventional) time-series data, without requiring any knowledge of dynamical models or real interventions in the considered system. In particular, IEE can be applied to rank causal effects according to their importance and construct causal networks from data. We conducted numerical experiments to demonstrate that IEE can find causal edges accurately, eliminate effects of confounding, and quantify causal strength robustly over traditional indices. We also applied IEE to real-world tasks. IEE performed as an accurate and robust tool for causal analyses solely from the observational data. The IntDC framework and IEE algorithm provide an efficient approach to the study of causality from time series in diverse non-intervention complex systems.

[382] Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

Patricia A. Apellániz, Ana Jiménez, Borja Arroyo Galende, Juan Parras, Santiago Zazo

Main category: cs.LG

TL;DR: A novel method integrates artificial inductive biases into DGMs to improve synthetic data quality in low-data scenarios, with transfer learning techniques outperforming meta-learning.

Details

Motivation: Addressing the challenge of generating high-quality synthetic tabular data when training data is scarce, particularly in domains like healthcare and finance.

Method: Proposes a framework using transfer learning and meta-learning to inject inductive biases into DGMs, evaluating four approaches: pre-training, model averaging, MAML, and DRS.

Result: Transfer learning methods outperform meta-learning, achieving up to 60% gains in Jensen-Shannon divergence, enhancing synthetic data quality.

Conclusion: The model-agnostic methodology effectively improves synthetic data generation in low-data regimes, with significant applications in data-sensitive fields.

Abstract: While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, their effectiveness relies on the availability of substantial training data, often lacking in real-world scenarios. To overcome this limitation, we propose a novel methodology that explicitly integrates artificial inductive biases into the generative process to improve data quality in low-data regimes. Our framework leverages transfer learning and meta-learning techniques to construct and inject informative inductive biases into DGMs. We evaluate four approaches (pre-training, model averaging, Model-Agnostic Meta-Learning (MAML), and Domain Randomized Search (DRS)) and analyze their impact on the quality of the generated text. Experimental results show that incorporating inductive bias substantially improves performance, with transfer learning methods outperforming meta-learning, achieving up to 60% gains in Jensen-Shannon divergence. The methodology is model-agnostic and especially relevant in domains such as healthcare and finance, where high-quality synthetic data are essential, and data availability is often limited.
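The Jensen-Shannon divergence used as the quality metric above can be sketched for two discrete distributions as follows. This is the standard definition with base-2 logarithms, so the value lies in [0, 1]; how the paper builds the distributions from real and synthetic tables is not stated in the summary, so the inputs here are assumed to be already-normalized histograms over the same bins.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions, in bits."""
    # The mixture m is positive wherever p or q is, so both KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    real = [0.5, 0.3, 0.2]
    synthetic = [0.4, 0.4, 0.2]
    print(js_divergence(real, synthetic))
```

Identical histograms give 0 and histograms with disjoint support give 1, so lower values indicate synthetic data whose marginal is closer to the real one.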

[383] Parallel Split Learning with Global Sampling

Mohammad Kohankhaki, Ahmad Ayad, Mahdi Barhoush, Anke Schmeink

Main category: cs.LG

TL;DR: A server-driven sampling strategy is introduced to address scalability and generalization issues in distributed deep learning by dynamically adjusting client-side batch sizes, improving model accuracy and efficiency.

Details

Motivation: Challenges in distributed deep learning include scalability and generalization due to large batch sizes and non-identically distributed client data.

Method: A server-driven sampling strategy dynamically adjusts client-side batch sizes to maintain a fixed global batch size, ensuring better reflection of overall data distribution.

Result: The method provides tighter deviation guarantees, improves model accuracy, training efficiency, and convergence stability.

Conclusion: The proposed solution offers a scalable and efficient approach for learning in resource-constrained environments.

Abstract: Distributed deep learning in resource-constrained environments faces scalability and generalization challenges due to large effective batch sizes and non-identically distributed client data. We introduce a server-driven sampling strategy that maintains a fixed global batch size by dynamically adjusting client-side batch sizes. This decouples the effective batch size from the number of participating devices and ensures that global batches better reflect the overall data distribution. Using standard concentration bounds, we establish tighter deviation guarantees compared to existing approaches. Empirical results on a benchmark dataset confirm that the proposed method improves model accuracy, training efficiency, and convergence stability, offering a scalable solution for learning at the network edge.
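One way to picture a server-driven scheme that keeps the global batch size fixed is sketched below. The server draws a fixed number of sample positions uniformly without replacement from the pooled client data and tells each client how many samples to contribute. This particular sampling rule, along with the function and parameter names, is an assumption for illustration; the summary does not specify the paper's exact procedure.

```python
import bisect
import itertools
import random


def sample_client_batch_sizes(client_sizes, global_batch_size, rng=None):
    """Split a fixed global batch across clients in proportion to their data.

    Positions are drawn without replacement from the pooled dataset, so the
    per-client batch sizes always sum to `global_batch_size`, regardless of
    how many clients participate.
    """
    rng = rng or random.Random()
    total = sum(client_sizes)
    if not 0 <= global_batch_size <= total:
        raise ValueError("global batch size must be between 0 and the pooled size")
    # boundaries[i] is the exclusive end position of client i in the pooled index.
    boundaries = list(itertools.accumulate(client_sizes))
    counts = [0] * len(client_sizes)
    for position in rng.sample(range(total), global_batch_size):
        counts[bisect.bisect_right(boundaries, position)] += 1
    return counts


if __name__ == "__main__":
    sizes = [100, 10, 1000]
    print(sample_client_batch_sizes(sizes, 64, random.Random(0)))
```

Because the draw is over the pooled data, clients holding more samples contribute more on average, which is one way a global batch can reflect the overall data distribution.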

[384] Coarse Graining with Neural Operators for Simulating Chaotic Systems

Chuwei Wang, Julius Berner, Zongyi Li, Di Zhou, Jiayun Wang, Jane Bae, Anima Anandkumar

Main category: cs.LG

TL;DR: The paper proposes a physics-informed neural operator (PINO) to predict chaotic systems’ long-term behavior, overcoming limitations of traditional closure models and achieving significant speed and accuracy improvements.

Details

Motivation: Accurate long-term prediction of chaotic systems is vital for applications like climate modeling, but traditional methods are computationally expensive and suffer from fundamental limitations.

Method: The authors introduce an end-to-end learning approach using PINO, trained on coarse-grid data and fine-tuned with minimal fully-resolved simulations (FRS) and physics-based losses.

Result: PINO achieves a 330x speedup over FRS with ~10% error, outperforming closure models (60x slower, ~186% error).

Conclusion: PINO offers a scalable, accurate alternative to traditional closure models for chaotic system prediction.

Abstract: Accurately predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling. However, achieving such predictions typically requires iterative computations over a dense spatiotemporal grid to account for the unstable nature of chaotic systems, which is expensive and impractical in many real-world situations. An alternative approach to such a fully-resolved simulation is using a coarse grid and then correcting its errors through a closure model, which approximates the overall information from fine scales not captured in the coarse-grid simulation. Recently, ML approaches have been used for closure modeling, but they typically require a large number of training samples from expensive fully-resolved simulations (FRS). In this work, we prove an even more fundamental limitation, i.e., the standard approach to learning closure models suffers from a large approximation error for generic problems, no matter how large the model is, and it stems from the non-uniqueness of the mapping. We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation by not using a closure model or a coarse-grid solver. We first train the PINO model on data from a coarse-grid solver and then fine-tune it with (a small amount of) FRS and physics-based losses on a fine grid. The discretization-free nature of neural operators means that they do not suffer from the restriction of a coarse grid that closure models face, and they can provably approximate the long-term statistics of chaotic systems. In our experiments, our PINO model achieves a 330x speedup compared to FRS with a relative error $\sim 10\%$. In contrast, the closure model coupled with a coarse-grid solver is 60x slower than PINO while having a much higher error $\sim 186\%$ when the closure model is trained on the same FRS dataset.

[385] Accumulator-Aware Post-Training Quantization for Large Language Models

Ian Colbert, Giuseppe Franco, Fabian Grob, Jinjie Zhang, Rayan Saab

Main category: cs.LG

TL;DR: AXE is a new accumulator-aware quantization framework for post-training quantization (PTQ), designed to avoid overflow while maintaining model performance.

Details

Motivation: The rising cost of additions in MAC units and the expense of QAT motivate the need for efficient PTQ solutions.

Method: AXE is implemented on top of GPFQ and OPTQ, supporting multi-stage accumulation for full datapath optimization.

Result: When quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naive bit width manipulation by up to 15%.

Conclusion: AXE bridges the gap in PTQ by providing overflow avoidance and performance efficiency.

Abstract: When quantizing weights and activations to increasingly narrower representations, the cost of additions begins to dominate that of multiplications in multiply-accumulate (MAC) units. Recent studies show that reducing addition costs via low-precision accumulation improves throughput, power, and area across inference platforms, albeit with an increased risk of overflow. Accumulator-aware quantization research has so far only considered the quantization-aware training (QAT) paradigm, in which models are fine-tuned or trained from scratch with quantization in the loop. As models and datasets continue to grow in size, QAT techniques become increasingly more expensive, which has motivated the recent surge in post-training quantization (PTQ) research. To bridge this gap, we introduce AXE, the first accumulator-aware quantization framework explicitly designed to endow overflow avoidance guarantees to PTQ algorithms. We present theoretical motivation for AXE and demonstrate its flexibility by implementing it on top of two existing algorithms: GPFQ and OPTQ. We design AXE to support multi-stage accumulation, opening the door to full datapath optimization for the first time. We evaluate AXE using recent language generation models; when quantizing Llama3 8B for a 16-bit multi-stage accumulation datapath, AXE maintains up to 98% of the FP16 perplexity, surpassing naive bit width manipulation by up to 15%.
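The overflow risk described above can be made concrete with a worst-case bound. For integer weights w and inputs x, |sum(w_i * x_i)| is at most max|x| times the l1 norm of w, so keeping that product within the accumulator range is a sufficient condition for a single-stage accumulation never to overflow. The sketch below checks this generic condition; it is an illustration of the principle, not AXE's actual constraints (which the summary does not give), and the function names are hypothetical.

```python
def max_input_magnitude(input_bits, signed_inputs=False):
    """Largest absolute value of an integer input with the given bit width."""
    # Two's complement signed range is [-2^(N-1), 2^(N-1) - 1]; unsigned is [0, 2^N - 1].
    return 2 ** (input_bits - 1) if signed_inputs else 2 ** input_bits - 1


def is_overflow_safe(weights, acc_bits, input_bits, signed_inputs=False):
    """Sufficient condition for a signed acc_bits-wide accumulator not to overflow.

    Uses the worst case |sum(w_i * x_i)| <= max|x| * ||w||_1 for a single
    dot product accumulated in one stage.
    """
    acc_max = 2 ** (acc_bits - 1) - 1
    l1_norm = sum(abs(w) for w in weights)
    return l1_norm * max_input_magnitude(input_bits, signed_inputs) <= acc_max


if __name__ == "__main__":
    weights = [3, -4, 2]  # integer (already quantized) weights, l1 norm = 9
    print(is_overflow_safe(weights, acc_bits=16, input_bits=8))  # True
    print(is_overflow_safe(weights, acc_bits=12, input_bits=8))  # False
```

With 8-bit unsigned inputs the worst-case magnitude is 9 * 255 = 2295, which fits a 16-bit signed accumulator (limit 32767) but not a 12-bit one (limit 2047).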

[386] On the Approximation of Stationary Processes using the ARMA Model

Anand Ganesh, Babhrubahan Bose, Anand Rajagopalan

Main category: cs.LG

TL;DR: The paper analyzes the approximation error between a true stationary process and an ARMA model using the $L^{\infty}$ norm, showing its validity and structural advantages over other norms.

Details

Motivation: To quantify and improve the approximation error between stationary processes and ARMA models, addressing limitations of existing norms like the cepstral norm and Wiener's $\ell^1$ condition.

Method: Uses transfer function representation and the $L^{\infty}$ norm to analyze stationary processes, focusing on a subspace that includes ARMA models and forms a Banach algebra.

Result: The $L^{\infty}$ norm controls the $\ell^2$ norm of Wold coefficients and generalizes invertibility better than Wiener’s condition. Explicit approximation bounds are derived for continuous transfer functions.

Conclusion: The $L^{\infty}$ norm provides a robust framework for analyzing ARMA models, improving structural properties and invertibility definitions, with practical bounds for approximation errors.

Abstract: We look at a problem related to Autoregressive Moving Average (ARMA) models, on quantifying the approximation error between a true stationary process $X_t$ and an ARMA model $Y_t$. We take the transfer function representation $x(L)$ of a stationary process $X_t$ and show that the $L^{\infty}$ norm of $x$ acts as a valid norm on $X_t$ that controls the $\ell^2$ norm of its Wold coefficients. We then show that a certain subspace of stationary processes, which includes ARMA models, forms a Banach algebra under the $L^{\infty}$ norm that respects the multiplicative structure of $H^{\infty}$ transfer functions and thus improves on the structural properties of the cepstral norm for ARMA models. The natural definition of invertibility in this algebra is consistent with the original definition of ARMA invertibility, and generalizes better to non-ARMA processes than Wiener’s $\ell^1$ condition. Finally, we calculate some explicit approximation bounds in the simpler context of continuous transfer functions, and critique some heuristic ideas on Padé approximations and parsimonious models.
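The claim that the $L^{\infty}$ norm controls the $\ell^2$ norm of the Wold coefficients can be motivated by a standard Parseval argument. The notation below is an assumption, since the summary does not fix it: $\psi_k$ denotes the Wold coefficients, so that $X_t = \sum_{k \ge 0} \psi_k \varepsilon_{t-k}$ and $x(z) = \sum_{k \ge 0} \psi_k z^k$. The paper's own statement and proof may differ in detail.

```latex
$$
\|\psi\|_{\ell^2}^2
  = \sum_{k \ge 0} |\psi_k|^2
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} \bigl| x(e^{i\omega}) \bigr|^2 \, d\omega
  \le \operatorname*{ess\,sup}_{\omega \in [-\pi, \pi]} \bigl| x(e^{i\omega}) \bigr|^2
  = \|x\|_{L^{\infty}}^2 ,
\qquad\text{hence}\qquad
\|\psi\|_{\ell^2} \le \|x\|_{L^{\infty}} .
$$
```

The middle equality is Parseval's identity on the unit circle, and the inequality bounds an average by the essential supremum.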

[387] ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

Yarden As, Bhavya Sukhija, Lenart Treven, Carmelo Sferrazza, Stelian Coros, Andreas Krause

Main category: cs.LG

TL;DR: ActSafe is a model-based RL algorithm ensuring safe and efficient exploration by balancing optimism in learning with pessimism in safety constraints, achieving near-optimal policies safely.

Details

Motivation: Current RL agents require unsafe, extensive interactions, limiting real-world applicability. ActSafe addresses this by ensuring safety during learning.

Method: ActSafe uses a probabilistic model of the system, optimistically plans with epistemic uncertainty, and enforces pessimism for safety constraints.

Result: ActSafe guarantees safety during learning and achieves near-optimal policies in finite time, performing well in high-dimensional tasks like visual control.

Conclusion: ActSafe advances safe RL by combining safety guarantees with efficient exploration, demonstrating state-of-the-art performance in benchmarks.

Abstract: Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics, while enforcing pessimism w.r.t. the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on latest model-based RL advancements and enables safe exploration even in high-dimensional settings such as visual control. We empirically show that ActSafe obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.
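The combination of optimism about the objective and pessimism about the constraints can be illustrated on a toy selection problem. In the sketch below, each candidate plan comes with a predicted mean and an uncertainty estimate for both reward and safety cost; a plan is admissible only if its pessimistic cost stays within budget, and among admissible plans the one with the highest optimistic reward is chosen. The dictionary keys, the `beta` confidence multiplier, and the finite candidate set are all assumptions for illustration and do not reproduce ActSafe's planner.

```python
def select_plan(candidates, cost_budget, beta=2.0):
    """Pick the plan with the best optimistic reward among pessimistically safe plans.

    candidates: list of dicts with keys reward_mean, reward_std, cost_mean, cost_std.
    Returns the index of the chosen plan, or None if no plan is deemed safe.
    """
    best_index, best_score = None, float("-inf")
    for index, plan in enumerate(candidates):
        # Pessimism: assume the cost is at the upper end of its confidence interval.
        pessimistic_cost = plan["cost_mean"] + beta * plan["cost_std"]
        if pessimistic_cost > cost_budget:
            continue
        # Optimism: assume the reward is at the upper end of its confidence interval.
        optimistic_reward = plan["reward_mean"] + beta * plan["reward_std"]
        if optimistic_reward > best_score:
            best_index, best_score = index, optimistic_reward
    return best_index


if __name__ == "__main__":
    plans = [
        {"reward_mean": 1.0, "reward_std": 0.1, "cost_mean": 0.2, "cost_std": 0.0},
        {"reward_mean": 0.8, "reward_std": 0.5, "cost_mean": 0.5, "cost_std": 0.4},
        {"reward_mean": 0.9, "reward_std": 0.3, "cost_mean": 0.4, "cost_std": 0.1},
    ]
    print(select_plan(plans, cost_budget=1.0))
```

In this example the second plan has the highest optimistic reward but is rejected because its pessimistic cost exceeds the budget, so the uncertain-but-safe third plan is preferred over the well-known first one.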

[388] Electricity Price Prediction Using Multi-Kernel Gaussian Process Regression Combined with Kernel-Based Support Vector Regression

Abhinav Das, Stephan Schlüter, Lorenz Schneider

Main category: cs.LG

TL;DR: A hybrid model combining Gaussian Process Regression (GPR) and Support Vector Regression (SVR) is proposed for predicting German electricity prices, outperforming benchmarks like LASSO and deep neural networks.

Details

Motivation: GPR struggles with out-of-sample predictions due to noise and outliers, while SVR handles non-linear processes better. Combining both aims to leverage their strengths.

Method: The model uses GPR with a tailored covariance function for stochastic patterns and SVR for robustness against outliers. Predictions are linearly combined with uniform weights.

Result: The hybrid model outperforms benchmarks (LASSO and deep neural networks) on historic German power price data.

Conclusion: The hybrid GPR-SVR approach effectively improves prediction accuracy for electricity prices by addressing the limitations of individual methods.

Abstract: This paper presents a new hybrid model for predicting German electricity prices. The algorithm is based on a combination of Gaussian Process Regression (GPR) and Support Vector Regression (SVR). Although GPR is a competent model for learning stochastic patterns within data and for interpolation, its performance for out-of-sample data is not very promising. By choosing a suitable data-dependent covariance function, we can enhance the performance of GPR for the German hourly power prices being tested. However, since the out-of-sample prediction is dependent on the training data, the prediction is vulnerable to noise and outliers. To overcome this issue, a separate prediction is calculated using SVR, which applies margin-based optimization. This method is advantageous when dealing with non-linear processes and outliers, since only certain necessary points (support vectors) in the training data are responsible for regression. The individual predictions are then linearly combined using uniform weights. When tested on historic German power prices, this approach outperforms the publicly available benchmarks, namely the LASSO-estimated autoregressive regression model and the deep neural network provided in the recent research by [1].

[389] MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Bhavya Sukhija, Stelian Coros, Andreas Krause, Pieter Abbeel, Carmelo Sferrazza

Main category: cs.LG

TL;DR: MaxInfoRL balances intrinsic and extrinsic rewards in RL by maximizing information gain, achieving sublinear regret and superior performance in hard exploration tasks.

Details

Motivation: Balancing task and intrinsic rewards in RL is challenging and often task-dependent. MaxInfoRL aims to address this by directing exploration towards informative transitions.

Method: MaxInfoRL maximizes intrinsic rewards like information gain about the task, combined with Boltzmann exploration to trade off value function maximization and entropy.

Result: The approach achieves sublinear regret in multi-armed bandits and superior performance in continuous state-action spaces and visual control tasks.

Conclusion: MaxInfoRL effectively balances exploration and exploitation and achieves superior performance on hard exploration problems and complex RL scenarios.

Abstract: Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.

[390] InfAlign: Inference-aware language model alignment

Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Theertha Suresh, Ahmad Beirami

Main category: cs.LG

TL;DR: The paper introduces InfAlign, a framework for inference-aware alignment in language models, addressing the sub-optimality of standard RLHF due to train/test mismatch. It proposes InfAlign-CTRL, a method involving reward calibration and transformation, improving inference-time win rates by 3-8%.

Details

Motivation: Standard RLHF is sub-optimal for modern inference-time decoding methods, necessitating an alignment framework that optimizes for inference-time performance.

Method: Proposes InfAlign-CTRL, which includes reward calibration and KL-regularized reward maximization with transformed rewards. Specific transformations are provided for best-of-N sampling and jailbreaking.

Result: InfAlign-CTRL improves inference-time win rates by 3-8% for best-of-N sampling and jailbreaking. Reward calibration also serves as a strong baseline for standard win rate optimization.

Conclusion: InfAlign offers a practical solution for aligning language models with inference-time methods, outperforming standard RLHF and providing a robust baseline for reward calibration.

Abstract: Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
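
To make the train/test mismatch concrete, the sketch below simulates best-of-N decoding with toy rewards and shows that the decoding procedure alone changes the win rate against a single sample. Everything here is an assumption for illustration: responses are represented only by a uniform random reward, and `empirical_calibration` (a hypothetical helper) reads "reward calibration" as the empirical quantile of a reward among base-policy rewards, which the abstract does not spell out.

```python
import random
from bisect import bisect_right


def best_of_n(candidates, reward_fn):
    """Best-of-N decoding: keep the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


def empirical_calibration(raw_reward, sorted_base_rewards):
    """Assumed calibration: fraction of base-policy rewards <= raw_reward."""
    return bisect_right(sorted_base_rewards, raw_reward) / len(sorted_base_rewards)


def best_of_n_win_rate(n, trials, rng):
    """Estimate how often a best-of-n draw beats one independent draw."""
    wins = 0
    for _ in range(trials):
        # Each stand-in "response" is represented by its reward alone.
        best = best_of_n([rng.random() for _ in range(n)], reward_fn=lambda r: r)
        wins += best > rng.random()
    return wins / trials


rng = random.Random(0)
# For continuous i.i.d. rewards the expected value is n / (n + 1), here 4/5.
win_rate = best_of_n_win_rate(n=4, trials=20000, rng=rng)
base_rewards = sorted(rng.random() for _ in range(1000))
calibrated = empirical_calibration(0.5, base_rewards)  # always lies in [0, 1]
print(win_rate, calibrated)
```

Both sides of the comparison draw from the same toy distribution, so the gap between the simulated win rate and 1/2 comes purely from the decoding procedure. A training objective that ignores the decoder does not account for that gap.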

[391] Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji

Main category: cs.LG

TL;DR: Proposes a new PEFT method combining transform and residual adaptations to improve LoRA’s performance and parameter efficiency.

Details

Motivation: The approximation gap in LoRA limits ultra-parameter-efficiency and performance. The goal is to reduce this gap.

Method: Combines full-rank transform and residual adaptations, using tensor decompositions for efficiency.

Result: Outperforms LoRA and baselines in subject-driven and controllable generation tasks.

Conclusion: The method enhances LoRA’s effectiveness and parameter efficiency.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method achieves better performance and parameter efficiency compared to LoRA and several baselines.
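
The claim that a transform can reduce the rank of the residual can be checked on a toy case. The sketch below is a contrived 2x2 illustration, not the paper's tensor-decomposed parametrization: the matrices, the choice of a rotation as the transform, and all helper names are assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matsub(a, b):
    """Elementwise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        if r == len(m):
            break
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) <= tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


pretrained = [[1.0, 2.0], [3.0, 4.0]]     # hypothetical pre-trained weight
transform = [[0.0, -1.0], [1.0, 0.0]]     # hypothetical full-rank transform (a rotation)
target = matmul(transform, pretrained)    # pretend this is the desired fine-tuned weight

residual_plain = matsub(target, pretrained)                            # residual-only view
residual_transformed = matsub(target, matmul(transform, pretrained))   # transform, then residual

print(rank(residual_plain), rank(residual_transformed))
```

In this constructed case the desired weight is exactly a rotation of the pre-trained one, so the plain residual is full rank while the residual after the transform vanishes. Real fine-tuning targets will not be this clean; the example only shows why aligning the weight first can leave less for the compact residual to approximate.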

[392] Advancing Generative Artificial Intelligence and Large Language Models for Demand Side Management with Internet of Electric Vehicles

Hanwen Zhang, Ruichen Zhang, Wei Zhang, Dusit Niyato, Yonggang Wen, Chunyan Miao

Main category: cs.LG

TL;DR: The paper explores using LLMs for energy optimization in microgrids, proposing a retrieval-augmented generation solution for DSM, demonstrated via an EV charging case study.

Details

Motivation: To leverage LLMs for automating and enhancing energy management and DSM in microgrids, addressing challenges and unlocking new opportunities.

Method: Integration of LLMs with retrieval-augmented generation for problem formulation, code generation, and optimization customization, tested in an EV charging scenario.

Result: The proposed solution improves energy efficiency and user adaptability in EV charging scheduling, showcasing LLMs’ potential for DSM.

Conclusion: LLMs hold significant promise for transforming energy optimization and DSM, paving the way for intelligent solutions in microgrids.

Abstract: Generative artificial intelligence, particularly through large language models (LLMs), is poised to transform energy optimization and demand side management (DSM) within microgrids. This paper explores the integration of LLMs into energy management, emphasizing their roles in automating the optimization of DSM strategies with Internet of electric vehicles. We investigate challenges and solutions associated with DSM and explore the new opportunities presented by leveraging LLMs. Then, we propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization. We present a case study to demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles, highlighting our solution’s significant advancements in energy efficiency and user adaptability. This work underscores the potential of LLMs for energy optimization and fosters a new era of intelligent DSM solutions.

[393] Tailored Forecasting from Short Time Series via Meta-learning

Declan A. Norton, Edward Ott, Andrew Pomerance, Brian Hunt, Michelle Girvan

Main category: cs.LG

TL;DR: METAFORS is a meta-learning method that leverages related time-series data to enable accurate forecasting in systems with limited historical data.

Details

Motivation: Traditional machine learning models require large datasets for forecasting, which is challenging for systems with limited history. METAFORS addresses this by generalizing knowledge from related systems.

Method: METAFORS uses a library of models trained on longer time series from related systems to initialize and tailor a model for short time-series data. It employs reservoir computing and tests on simulated chaotic systems.

Result: METAFORS reliably predicts short-term dynamics and long-term statistics, even when test and related systems differ significantly.

Conclusion: METAFORS is effective for forecasting in data-limited scenarios without needing contextual labels, showcasing its robustness.

Abstract: Machine learning models can effectively forecast dynamical systems from time-series data, but they typically require large amounts of past data, making forecasting particularly challenging for systems with limited history. To overcome this, we introduce Meta-learning for Tailored Forecasting using Related Time Series (METAFORS), which generalizes knowledge across systems to enable forecasting in data-limited scenarios. By learning from a library of models trained on longer time series from potentially related systems, METAFORS builds and initializes a model tailored to short time-series data from the system of interest. Using a reservoir computing implementation and testing on simulated chaotic systems, we demonstrate that METAFORS can reliably predict both short-term dynamics and long-term statistics without requiring contextual labels. We see this even when test and related systems exhibit substantially different behaviors, highlighting METAFORS’ strengths in data-limited scenarios.

[394] Achieving Deep Continual Learning via Evolution

Aojun Lu, Junchao Ke, Chunhui Ding, Jiahao Fan, Jiancheng Lv, Yanan Sun

Main category: cs.LG

TL;DR: ECL introduces a collective evolution framework for continual learning, outperforming single-model methods by evolving diverse neural network populations.

Details

Motivation: Current deep neural networks struggle with continual learning (CL). Inspired by human collective learning, ECL aims to enhance CL by evolving a population of models.

Method: ECL maintains a diverse population of models, evolving optimal architectures for each incremental task. Each task trains a specialized expert model, stored for future use.

Result: ECL outperforms state-of-the-art CL methods, achieving stability through expert isolation and plasticity via task-specific architectures.

Conclusion: ECL shifts focus from individual adaptation to collective evolution, offering a novel approach for AI systems capable of continual learning.

Abstract: Deep neural networks, despite their remarkable success, remain fundamentally limited in their ability to perform Continual Learning (CL). Most current methods aim to enhance the capabilities of a single model. Inspired instead by the collective learning mechanisms of human populations, we introduce Evolving Continual Learning (ECL), a framework that maintains and evolves a diverse population of neural network models. ECL continually searches for an optimal architecture for each introduced incremental task. This tailored model is trained on the corresponding task and archived as a specialized expert, contributing to a growing collection of skills. This approach inherently resolves the core CL challenges: stability is achieved through the isolation of expert models, while plasticity is greatly enhanced by evolving unique, task-specific architectures. Experimental results demonstrate that ECL significantly outperforms state-of-the-art individual-level CL methods. By shifting the focus from individual adaptation to collective evolution, ECL presents a novel path toward AI systems capable of CL.

[395] Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Konstantina Bairaktari, Jiayun Wu, Zhiwei Steven Wu

Main category: cs.LG

TL;DR: Kandinsky conformal prediction extends conditional coverage guarantees beyond rigid group definitions, handling overlapping and fractional memberships for more flexible and equitable coverage.

Details

Motivation: Classical conformal prediction methods provide marginal coverage, but this guarantee may not hold uniformly across subpopulations, leading to disparities in coverage. This work aims to expand conditional coverage guarantees.

Method: The framework flexibly handles overlapping and fractional group memberships on covariates and labels, unifying existing methods like Mondrian conformal prediction.

Result: The algorithm achieves minimax-optimal high-probability conditional coverage and is validated on real-world datasets.

Conclusion: Kandinsky conformal prediction offers a practical and flexible solution for equitable coverage guarantees across diverse subpopulations.

Abstract: Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups – reminiscent of the rigid, structured grids of Piet Mondrian’s art – our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky’s compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.
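
The contrast between marginal and group-conditional coverage can be illustrated with the two special cases the abstract names as baselines: split conformal prediction and Mondrian-style calibration on disjoint groups. The sketch below does not implement the Kandinsky algorithm (overlapping or fractional memberships are out of scope here); the scores, group labels, and function names are hypothetical.

```python
import math
from collections import defaultdict


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def groupwise_thresholds(scores, groups, alpha):
    """Mondrian-style calibration: one threshold per disjoint group."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: conformal_threshold(s, alpha) for g, s in by_group.items()}


def prediction_interval(point_prediction, threshold):
    """Interval for a regression model whose score is the absolute residual."""
    return (point_prediction - threshold, point_prediction + threshold)


# Hypothetical absolute residuals on a calibration set with two disjoint groups.
scores = [0.1, 1.0, 0.2, 0.3, 2.0, 0.4, 0.5, 3.0, 0.6, 0.7]
groups = ["a", "b", "a", "a", "b", "a", "a", "b", "a", "a"]
alpha = 0.25  # target coverage of 75%

marginal = conformal_threshold(scores, alpha)
per_group = groupwise_thresholds(scores, groups, alpha)
print(marginal, per_group)
print(prediction_interval(10.0, per_group["a"]))
```

With these numbers the single marginal threshold is wider than group a needs and narrower than group b needs, which is the kind of coverage disparity that conditional guarantees are meant to remove.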

[396] Decision by Supervised Learning with Deep Ensembles: A Practical Framework for Robust Portfolio Optimization

Juhyeong Kim, Sungyoon Choi, Youngbin Lee, Yejin Kim, Yongmin Choi, Yongjae Lee

Main category: cs.LG

TL;DR: DSL reframes portfolio optimization as a supervised learning problem, using cross-entropy loss and Deep Ensembles for stability, outperforming traditional and ML-based methods.

Details

Motivation: To create a robust and stable portfolio optimization framework using supervised learning, addressing limitations of traditional and existing ML methods.

Method: DSL trains models with a cross-entropy loss to predict optimal portfolio weights, where the target portfolios are constructed by maximizing the Sharpe or Sortino ratio. Deep Ensembles reduce the variance of the resulting allocations.

Result: DSL outperforms traditional strategies and ML methods, with larger ensembles improving median returns and risk-adjusted performance.

Conclusion: DSL is a practical, superior framework for portfolio optimization, with Deep Ensembles enhancing stability and performance.

Abstract: We propose Decision by Supervised Learning (DSL), a practical framework for robust portfolio optimization. DSL reframes portfolio construction as a supervised learning problem: models are trained to predict optimal portfolio weights, using cross-entropy loss and portfolios constructed by maximizing the Sharpe or Sortino ratio. To further enhance stability and reliability, DSL employs Deep Ensemble methods, substantially reducing variance in portfolio allocations. Through comprehensive backtesting across diverse market universes and neural architectures, DSL shows superior performance compared to both traditional strategies and leading machine learning-based methods, including Prediction-Focused Learning and End-to-End Learning. We show that increasing the ensemble size leads to higher median returns and more stable risk-adjusted performance. The code is available at https://github.com/DSLwDE/DSLwDE.
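
A minimal sketch of the two ingredients named in the abstract, cross-entropy against target portfolio weights and ensemble aggregation, might look as follows. It is not taken from the linked repository: the logits, the target weights, and the choice to aggregate by averaging each member's softmax output are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw model outputs into long-only weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_weights, predicted_weights, eps=1e-12):
    """Cross-entropy between a target allocation and a predicted allocation."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_weights, predicted_weights))


def ensemble_weights(member_logits):
    """Assumed aggregation: average the softmax weights of all ensemble members."""
    member_weights = [softmax(logits) for logits in member_logits]
    return [sum(column) / len(member_weights) for column in zip(*member_weights)]


# Hypothetical target allocation (e.g. from a Sharpe-maximizing optimizer) over 3 assets.
target = [0.5, 0.3, 0.2]
# Hypothetical raw outputs of three independently trained ensemble members.
member_logits = [[1.0, 0.2, -0.5], [0.8, 0.4, -0.2], [1.2, 0.0, -0.4]]

allocation = ensemble_weights(member_logits)
print(allocation, cross_entropy(target, allocation))
```

Because each member's softmax output is a valid allocation, their average is one as well, so the ensemble needs no re-normalization step.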

[397] How Can I Publish My LLM Benchmark Without Giving the True Answers Away?

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

Main category: cs.LG

TL;DR: Proposes a method to publish LLM benchmarks without full ground-truth disclosure, using randomized correct answers to detect data contamination.

Details

Motivation: Prevent benchmark contamination in LLMs by avoiding full disclosure of ground-truth answers while enabling open evaluation.

Method: Inject randomness by providing multiple logically correct answers, with only one as the benchmark solution, reducing Bayes accuracy.

Result: Accurately detects data contamination across various benchmarks, models, and training methods.

Conclusion: The approach effectively balances benchmark transparency with contamination prevention and detection.

Abstract: Publishing a large language model (LLM) benchmark on the Internet risks contaminating future LLMs: the benchmark may be unintentionally (or intentionally) used to train or select a model. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers. However, this strategy will require trust in a single organization and still permits test-set overfitting through repeated queries. To overcome this issue, we propose a way to publish benchmarks without completely disclosing the ground-truth answers to the questions, while still maintaining the ability to openly evaluate LLMs. Our main idea is to inject randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. This reduces the best possible accuracy, i.e., Bayes accuracy, of the benchmark. Not only is this helpful to keep us from disclosing the ground truth, but this approach also offers a test for detecting data contamination. In principle, even fully capable models should not surpass the Bayes accuracy. If a model surpasses this ceiling despite this expectation, this is a strong signal of data contamination. We present experimental evidence that our method can detect data contamination accurately on a wide range of benchmarks, models, and training methodologies.
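
The Bayes-accuracy ceiling can be worked through numerically. The sketch below assumes that the published answer for each question is drawn uniformly from its logically correct answers, so a model that cannot see the draw is right with probability at most 1/k on a question with k correct answers. The binomial tail used as a contamination signal is a simple stand-in chosen for illustration, not necessarily the test used in the paper, and all names and numbers are hypothetical.

```python
from math import comb


def bayes_accuracy(num_correct_answers):
    """Best achievable accuracy when one of k valid answers is picked uniformly."""
    return sum(1 / k for k in num_correct_answers) / len(num_correct_answers)


def binomial_tail(n, successes, p):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(successes, n + 1))


# Hypothetical benchmark: 100 questions, each with 2 logically correct answers.
answers_per_question = [2] * 100
ceiling = bayes_accuracy(answers_per_question)

# Hypothetical model that matches the published answer on 70 of 100 questions.
observed_correct = 70
p_value = binomial_tail(len(answers_per_question), observed_correct, ceiling)
print(ceiling, p_value)
```

A very small tail probability means the observed score is hard to explain without knowledge of which answer was published, which is the contamination signal the abstract describes.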

[398] Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Yan Zhu, Jingyang Zhu, Ting Wang, Yuanming Shi, Chunxiao Jiang, Khaled Ben Letaief

Main category: cs.LG

TL;DR: A satellite-ground collaborative federated learning framework is proposed to address computational and communication challenges in fine-tuning large AI models on satellites, reducing training time by ~33%.

Details

Motivation: Privacy concerns and limited bandwidth hinder downloading large remote sensing models for ground fine-tuning, while traditional satellite FL lacks computational capacity for on-board fine-tuning.

Method: The framework decomposes and allocates model components between satellites and ground stations, using tailored communication strategies (parallel intra-orbit, topology-aware satellite-ground, latency-minimization inter-orbit) to optimize space transmission.

Result: Simulations show a ~33% reduction in training time.

Conclusion: The proposed framework effectively addresses computational and communication bottlenecks in satellite FL for large foundation models.

Abstract: Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key to the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

[399] SinBasis Networks: Matrix-Equivalent Feature Extraction for Wave-Like Optical Spectrograms

Yuzhou Zhu, Zheng Zhang, Ruyi Zhang, Liang Zhou

Main category: cs.LG

TL;DR: A unified framework reinterprets convolution and attention as linear transforms, using sinusoidal mappings to enhance sensitivity to periodic structures in wave-like images, improving accuracy and robustness.

Details

Motivation: Conventional feature extractors fail to capture harmonic structures in wave-like images, necessitating a physics-informed approach.

Method: Proposes Sin-Basis Networks by embedding sinusoidal transforms into CNN, ViT, and Capsule architectures, leveraging spectral priors.

Result: Demonstrates improved reconstruction accuracy, translational robustness, and cross-domain transfer on diverse datasets.

Conclusion: Sin-Basis Networks provide a lightweight, effective solution for deep learning in wave-form imaging.

Abstract: Wave-like images, from attosecond streaking spectrograms to optical spectra, audio mel-spectrograms and periodic video frames, encode critical harmonic structures that elude conventional feature extractors. We propose a unified, matrix-equivalent framework that reinterprets convolution and attention as linear transforms on flattened inputs, revealing filter weights as basis vectors spanning latent feature subspaces. To infuse spectral priors we apply elementwise $\sin(\cdot)$ mappings to each weight matrix. Embedding these transforms into CNN, ViT and Capsule architectures yields Sin-Basis Networks with heightened sensitivity to periodic motifs and built-in invariance to spatial shifts. Experiments on a diverse collection of wave-like image datasets (including 80,000 synthetic attosecond streaking spectrograms, thousands of Raman, photoluminescence and FTIR spectra, mel-spectrograms from AudioSet and cycle-pattern frames from Kinetics) demonstrate substantial gains in reconstruction accuracy, translational robustness and zero-shot cross-domain transfer. Theoretical analysis via matrix isomorphism and Mercer-kernel truncation quantifies how sinusoidal reparametrization enriches expressivity while preserving stability in data-scarce regimes. Sin-Basis Networks thus offer a lightweight, physics-informed approach to deep learning across all wave-form imaging modalities.
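
The matrix-equivalent view of convolution and the elementwise sine mapping can be sketched for the 1D case. This is an illustration under assumptions: it uses a stride-1 "valid" cross-correlation, it reads the sine mapping as replacing each weight w by sin(w), and the signal, kernel, and function names are hypothetical rather than taken from the paper.

```python
import math


def conv1d_valid(signal, kernel):
    """Stride-1 'valid' cross-correlation of a 1D signal with a kernel."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv_matrix(kernel, input_len):
    """Banded matrix whose product with the flattened input equals the convolution."""
    k = len(kernel)
    rows = input_len - k + 1
    return [[kernel[c - r] if 0 <= c - r < k else 0.0 for c in range(input_len)]
            for r in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on lists of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sin_reparametrize(matrix):
    """Apply sin elementwise to a weight matrix."""
    return [[math.sin(w) for w in row] for row in matrix]


signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]  # hypothetical wave-like input
kernel = [0.5, -1.0, 2.0]                 # hypothetical filter weights

weights = conv_matrix(kernel, len(signal))
as_conv = conv1d_valid(signal, kernel)
as_matrix = matvec(weights, signal)
sin_output = matvec(sin_reparametrize(weights), signal)
print(as_conv, as_matrix, sin_output)
```

Since sin(0) = 0, the elementwise mapping leaves the zero pattern of the banded matrix untouched, so in this sketch the reparametrized layer is still a convolution, now with kernel sin(w).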

[400] Affect Models Have Weak Generalizability to Atypical Speech

Jaya Narain, Amrit Romana, Vikramjit Mitra, Colin Lea, Shirley Ren

Main category: cs.LG

TL;DR: The paper evaluates how speech atypicalities (intelligibility, monopitch, harshness) impact affect recognition models, showing significant biases (e.g., higher sad predictions for atypical speech). Fine-tuning on pseudo-labeled data improves robustness without harming typical speech performance.

Details

Motivation: To assess the impact of speech atypicalities on paralinguistic affect models and explore ways to improve their robustness for atypical speech.

Method: Evaluated public affect models on atypical speech datasets, comparing results to typical speech. Analyzed distributional trends, comparisons, and correlations. Fine-tuned models on pseudo-labeled atypical speech.

Result: Affect models are significantly biased by atypical speech (e.g., more sad predictions). Fine-tuning improved atypical speech performance without degrading typical speech results.

Conclusion: Broader training datasets and robust modeling approaches are needed for speech emotion models to handle voice and speech differences.

Abstract: Speech and voice conditions can alter the acoustic properties of speech, which could impact the performance of paralinguistic models for affect for people with atypical speech. We evaluate publicly available models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech, comparing results to datasets of typical speech. We investigate three dimensions of speech atypicality: intelligibility, which is related to pronunciation; monopitch, which is related to prosody; and harshness, which is related to voice quality. We look at (1) distributional trends of categorical affect predictions within the dataset, (2) distributional comparisons of categorical affect predictions to similar datasets of typical speech, and (3) correlation strengths between text and speech predictions for spontaneous speech for valence and arousal. We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities. For instance, the percentage of speech predicted as sad is significantly higher for all types and grades of atypical speech when compared to similar typical speech datasets. In a preliminary investigation on improving robustness for atypical speech, we find that fine-tuning models on pseudo-labeled atypical speech data improves performance on atypical speech without impacting performance on typical speech. Our results emphasize the need for broader training and evaluation datasets for speech emotion models, and for modeling approaches that are robust to voice and speech differences.

[401] Recursive Learning-Based Virtual Buffering for Analytical Global Placement

Andrew B. Kahng, Yiting Liu, Zhiang Wang

Main category: cs.LG

TL;DR: MLBuf-RePlAce is a learning-driven buffering-aware global placement framework that improves timing closure without degrading power, outperforming existing methods.

Details

Motivation: Addressing the challenges of traditional buffering approaches (computational expense) and machine learning methods (lack of ERC consideration and integration into design flows).

Method: Uses a recursive learning-based generative buffering approach to predict buffer types and locations, integrated into the OpenROAD infrastructure.

Result: Achieves significant improvements in total negative slack (TNS) (up to 56%) without power degradation in OpenROAD flow, and similar gains in commercial flows.

Conclusion: MLBuf-RePlAce effectively closes timing gaps in physical design flows while maintaining power efficiency.

Abstract: Due to the skewed scaling of interconnect versus cell delay in modern technology nodes, placement with buffer porosity (i.e., cell density) awareness is essential for timing closure in physical synthesis flows. However, existing approaches face two key challenges: (i) traditional van Ginneken-Lillis-style buffering approaches are computationally expensive during global placement; and (ii) machine learning-based approaches, such as BufFormer, lack a thorough consideration of Electrical Rule Check (ERC) violations and fail to “close the loop” back into the physical design flow. In this work, we propose MLBuf-RePlAce, the first open-source learning-driven virtual buffering-aware analytical global placement framework, built on top of the OpenROAD infrastructure. MLBuf-RePlAce adopts an efficient recursive learning-based generative buffering approach to predict buffer types and locations, addressing ERC violations during global placement. We compare MLBuf-RePlAce against the default virtual buffering-based timing-driven global placer in OpenROAD, using open-source testcases from the TILOS MacroPlacement and OpenROAD-flow-scripts repositories. Without degradation of post-route power, MLBuf-RePlAce achieves (maximum, average) improvements of (56%, 31%) in total negative slack (TNS) within the open-source OpenROAD flow. When evaluated by completion in a commercial flow, MLBuf-RePlAce achieves (maximum, average) improvements of (53%, 28%) in TNS with an average of 0.2% improvement in post-route power.

[402] Intersectional Divergence: Measuring Fairness in Regression

Joe Germino, Nuno Moniz, Nitesh V. Chawla

Main category: cs.LG

TL;DR: The paper introduces Intersectional Divergence (ID) to measure fairness in regression tasks, addressing gaps in existing work by considering multiple protected attributes and domain preferences. It also proposes IDLoss, a loss function with convergence guarantees, to improve fairness without compromising predictive performance.

Details

Motivation: Existing fairness research in machine learning focuses on classification, neglecting regression tasks. Current methods also overlook intersectionality (combinations of protected attributes) and domain-specific preferences.

Method: The authors propose Intersectional Divergence (ID) to measure fairness across multiple protected attributes and relevant target ranges. They also introduce IDLoss, an adapted loss function with convergence guarantees, for practical optimization.

Result: Experiments show ID provides unique insights into model fairness and behavior. IDLoss improves fairness for single and intersectional attributes while maintaining predictive performance.

Conclusion: The paper bridges gaps in fairness for regression tasks by introducing ID and IDLoss, offering a practical solution for intersectional fairness without sacrificing model accuracy.

Abstract: Fairness in machine learning research is commonly framed in the context of classification tasks, leaving critical gaps in regression. In this paper, we propose a novel approach to measure intersectional fairness in regression tasks, going beyond the focus on single protected attributes from existing work to consider combinations of all protected attributes. Furthermore, we contend that it is insufficient to measure the average error of groups without regard for imbalanced domain preferences. Accordingly, we propose Intersectional Divergence (ID) as the first fairness measure for regression tasks that 1) describes fair model behavior across multiple protected attributes and 2) differentiates the impact of predictions in target ranges most relevant to users. We extend our proposal demonstrating how ID can be adapted into a loss function, IDLoss, that satisfies convergence guarantees and has piecewise smooth properties that enable practical optimization. Through an extensive experimental evaluation, we demonstrate how ID allows unique insights into model behavior and fairness, and how incorporating IDLoss into optimization can considerably improve single-attribute and intersectional model fairness while maintaining a competitive balance in predictive performance.

[403] Hypergraph Neural Sheaf Diffusion: A Symmetric Simplicial Set Framework for Higher-Order Learning

Seongjin Choi, Gahee Kim, Yong-Geun Oh

Main category: cs.LG

TL;DR: The paper introduces symmetric simplicial lifting to address challenges in constructing sheaf Laplacians for hypergraphs, leading to the development of Hypergraph Neural Sheaf Diffusion (HNSD).

Details

Motivation: Hypergraphs lack intrinsic adjacency and orientation, making sheaf Laplacian construction difficult. The goal is to extend sheaf theory to hypergraphs while preserving structural details.

Method: Symmetric simplicial lifting encodes hyperedge subrelations as ordered tuples, defining adjacency via facet maps. HNSD uses the normalized degree-zero sheaf Laplacian over this structure.

Result: HNSD resolves orientation ambiguity and adjacency sparsity, performing competitively on benchmarks.

Conclusion: The framework successfully extends sheaf theory to hypergraphs, validated by consistency with graph-based theory and empirical results.

Abstract: The absence of intrinsic adjacency relations and orientation systems in hypergraphs creates fundamental challenges for constructing sheaf Laplacians of arbitrary degrees. We resolve these limitations through symmetric simplicial sets derived directly from hypergraphs, called symmetric simplicial lifting, which encode all possible oriented subrelations within each hyperedge as ordered tuples. This construction canonically defines adjacency via facet maps while inherently preserving hyperedge provenance. We establish that the normalized degree zero sheaf Laplacian on our symmetric simplicial lifting reduces exactly to the traditional graph normalized sheaf Laplacian when restricted to graphs, validating its mathematical consistency with prior graph-based sheaf theory. Furthermore, the induced structure preserves all structural information from the original hypergraph, ensuring that every multi-way relational detail is faithfully retained. Leveraging this framework, we introduce Hypergraph Neural Sheaf Diffusion (HNSD), the first principled extension of neural sheaf diffusion to hypergraphs. HNSD operates via the normalized degree zero sheaf Laplacian over the symmetric simplicial lifting, resolving orientation ambiguity and adjacency sparsity inherent to hypergraph learning. Experimental evaluations demonstrate HNSD's competitive performance across established benchmarks.

[404] A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek

Main category: cs.LG

TL;DR: SVERL uses Shapley values to explain reinforcement learning agents’ behavior, outcomes, and predictions, addressing trust and deployment challenges.

Details

Motivation: Reinforcement learning agents lack explainability, limiting their use in safety-critical settings.

Method: Develops a unified framework using Shapley values to derive feature influences for explaining behavior, outcomes, and predictions.

Result: SVERL provides interpretable, mathematically justified explanations, correcting prior issues.

Conclusion: SVERL offers comprehensive, intuitive explanations for reinforcement learning agents, enhancing trust and usability.

Abstract: Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework for explaining these three elements of reinforcement learning agents through the influence of individual features that the agent observes in its environment. We derive feature influences by using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning (SVERL), provides a single theoretical framework to comprehensively and meaningfully explain reinforcement learning agents. It yields explanations with precise semantics that are not only interpretable but also mathematically justified, enabling us to identify and correct conceptual issues in prior explanations. Through illustrative examples, we show how SVERL produces useful, intuitive explanations of agent behaviour, outcomes, and predictions, which are not apparent from observing agent behaviour alone.
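Illustrative sketch (not from the paper): the abstract relies on Shapley values for credit assignment over observed features. The snippet below computes exact Shapley values for a small characteristic function. The function `toy_value` is a hypothetical stand-in; SVERL defines its own characteristic functions for behaviour, outcomes, and predictions, which are not reproduced here.

```python
from itertools import combinations
from math import factorial


def shapley_values(n_features, value):
    """Exact Shapley values for a characteristic function value(frozenset) -> float."""
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight of any coalition of this size that excludes feature i.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(observed):
    """Hypothetical value of observing a set of features (assumption for illustration)."""
    v = 0.0
    if 0 in observed:
        v += 1.0  # feature 0 helps on its own
    if 1 in observed and 2 in observed:
        v += 2.0  # features 1 and 2 only help together
    return v


phi = shapley_values(3, toy_value)
print(phi)
```

The contributions sum to `toy_value({0, 1, 2}) - toy_value({})`, which is the efficiency axiom; features 1 and 2 receive equal credit because they play symmetric roles.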

[405] Diffusion Beats Autoregressive in Data-Constrained Settings

Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak

Main category: cs.LG

TL;DR: Diffusion models outperform autoregressive (AR) models in data-constrained settings due to better data utilization and implicit augmentation.

Details

Motivation: To explore the advantages of diffusion-based language models over AR models, especially in scenarios with limited data but abundant compute.

Method: Systematically study masked diffusion models in data-constrained settings, analyzing their performance, scaling laws, and compute thresholds.

Result: Diffusion models achieve lower validation loss and superior downstream performance by leveraging repeated data and implicit augmentation.

Conclusion: Diffusion models are a compelling alternative to AR models when data is scarce, offering better scalability and performance.

Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
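Illustrative sketch (not the authors' code): the "implicit data augmentation" argument can be seen on a single toy sequence. Repeating the sequence under a left-to-right factorization always yields the same prediction tasks, while random masking yields a different task on most passes. The masking scheme, the `[MASK]` token, and the helper names are assumptions made for illustration only.

```python
import random


def ar_targets(tokens):
    """Left-to-right factorization: predict tokens[t] from the prefix tokens[:t]."""
    return tuple((tuple(tokens[:t]), tokens[t]) for t in range(1, len(tokens)))


def masked_targets(tokens, rng, mask_token="[MASK]"):
    """One masked-prediction example: hide a random subset of positions."""
    ratio = rng.uniform(0.0, 1.0)  # masking ratio drawn per example
    masked = [i for i in range(len(tokens)) if rng.random() < ratio]
    if not masked:  # always hide at least one token
        masked = [rng.randrange(len(tokens))]
    corrupted = tuple(
        mask_token if i in masked else tok for i, tok in enumerate(tokens)
    )
    targets = {i: tokens[i] for i in masked}
    return corrupted, targets


tokens = ["the", "cat", "sat", "down"]
rng = random.Random(0)

# 50 "epochs" over the same sequence: collect the distinct training tasks seen.
ar_tasks = {ar_targets(tokens) for _ in range(50)}
masked_tasks = {masked_targets(tokens, rng)[0] for _ in range(50)}

print(len(ar_tasks))      # the AR task set never changes across epochs
print(len(masked_tasks))  # masking produces many distinct corrupted inputs
```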

[406] A ZeNN architecture to avoid the Gaussian trap

Luís Carvalho, João L. Costa, José Mourão, Gonçalo Oliveira

Main category: cs.LG

TL;DR: ZeNNs address MLP shortcomings by introducing non-learnable weights, scaling factors, and orthogonal activations, enabling pointwise convergence, non-Gaussianity, and high-frequency learning.

Details

Motivation: Overcome MLP limitations like non-parametric behavior, lack of pointwise convergence, loss of non-Gaussian attributes, and poor high-frequency learning.

Method: Introduce ZeNNs with non-learnable weights, scaling factors, and orthogonal activation functions inspired by harmonic analysis.

Result: ZeNNs achieve pointwise convergence, retain non-Gaussianity, perform feature learning, and excel in high-frequency tasks.

Conclusion: ZeNNs provide a robust alternative to MLPs, addressing their key weaknesses while enhancing performance in critical areas.

Abstract: We propose a new simple architecture, Zeta Neural Networks (ZeNNs), in order to overcome several shortcomings of standard multi-layer perceptrons (MLPs). Namely, in the large width limit, MLPs are non-parametric, they do not have a well-defined pointwise limit, they lose non-Gaussian attributes and become unable to perform feature learning; moreover, finite width MLPs perform poorly in learning high frequencies. The new ZeNN architecture is inspired by three simple principles from harmonic analysis: i) Enumerate the perceptrons and introduce a non-learnable weight to enforce convergence; ii) Introduce a scaling (or frequency) factor; iii) Choose activation functions that lead to near orthogonal systems. We will show that these ideas allow us to fix the referred shortcomings of MLPs. In fact, in the infinite width limit, ZeNNs converge pointwise, they exhibit a rich asymptotic structure beyond Gaussianity, and perform feature learning. Moreover, when appropriate activation functions are chosen, (finite width) ZeNNs excel at learning high-frequency features of functions with low dimensional domains.

[407] Adapt before Continual Learning

Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun

Main category: cs.LG

TL;DR: The paper proposes ACL, a framework for Continual Learning (CL) that balances stability and plasticity by adapting pre-trained models (PTMs) before learning new tasks.

Details

Motivation: Existing CL methods struggle to balance stability (retaining knowledge) and plasticity (learning new knowledge), especially when data distributions diverge.

Method: ACL introduces a plug-and-play adaptation phase to refine PTMs by aligning embeddings with original class prototypes and distancing them from irrelevant classes.

Result: ACL achieves a better stability-plasticity trade-off, improving CL performance across benchmarks.

Conclusion: ACL effectively addresses the limitations of current PTM-based CL methods by enhancing adaptability without catastrophic forgetting.

Abstract: Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model’s plasticity, particularly when incoming data distribution diverges largely from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism is shown, theoretically and empirically, to achieve a desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.

[408] GrokAlign: Geometric Characterisation and Acceleration of Grokking

Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Main category: cs.LG

TL;DR: The paper explains grokking in deep networks via Jacobian alignment, introduces GrokAlign for faster grokking, and proposes centroid alignment for interpretable training dynamics.

Details

Motivation: Understanding and accelerating deep network training dynamics, particularly delayed generalization (grokking) and emergent robustness.

Method: Aligning network Jacobians with training data under a low-rank assumption, introducing Jacobian regularization (GrokAlign), and simplifying it with centroid alignment.

Result: GrokAlign induces grokking sooner than conventional methods like weight decay, and centroid alignment effectively tracks training stages.

Conclusion: Jacobian alignment and GrokAlign provide theoretical and practical advancements for optimizing deep networks, with centroid alignment offering interpretability.

Abstract: A key challenge for the machine learning community is to understand and accelerate the training dynamics of deep networks that lead to delayed generalisation and emergent robustness to input perturbations, also known as grokking. Prior work has associated phenomena like delayed generalisation with the transition of a deep network from a linear to a feature learning regime, and emergent robustness with changes to the network’s functional geometry, in particular the arrangement of the so-called linear regions in deep networks employing continuous piecewise affine nonlinearities. Here, we explain how grokking is realised in the Jacobian of a deep network and demonstrate that aligning a network’s Jacobians with the training data (in the sense of cosine similarity) ensures grokking under a low-rank Jacobian assumption. Our results provide a strong theoretical motivation for the use of Jacobian regularisation in optimizing deep networks – a method we introduce as GrokAlign – which we show empirically to induce grokking much sooner than more conventional regularizers like weight decay. Moreover, we introduce centroid alignment as a tractable and interpretable simplification of Jacobian alignment that effectively identifies and tracks the stages of deep network training dynamics. Accompanying webpage (https://thomaswalker1.github.io/blog/grokalign.html) and code (https://github.com/ThomasWalker1/grokalign).

[409] Two-dimensional Parallel Tempering for Constrained Optimization

Corentin Delacour, M Mahmudul Hasan Sajeeb, Joao P. Hespanha, Kerem Y. Camsari

Main category: cs.LG

TL;DR: A 2D parallel tempering algorithm (2D-PT) is introduced to improve mixing in constrained Ising problems, eliminating the need to tune penalty strengths and achieving faster convergence.

Details

Motivation: Practical implementations of Ising machines are hindered by soft constraints that either slow down mixing or fail to enforce feasibility, necessitating a better approach.

Method: The method extends parallel tempering (PT) by adding a second dimension of replicas that interpolate penalty strengths, ensuring constraint satisfaction in low-energy states.

Result: 2D-PT achieves near-ideal mixing with Kullback-Leibler divergence decaying as O(1/t) and provides orders of magnitude speedup over conventional PT in sparsified Wishart instances.

Conclusion: 2D-PT is broadly applicable to constrained Ising problems and can be deployed on existing Ising machines, offering significant performance improvements.

Abstract: Sampling Boltzmann probability distributions plays a key role in machine learning and optimization, motivating the design of hardware accelerators such as Ising machines. While the Ising model can in principle encode arbitrary optimization problems, practical implementations are often hindered by soft constraints that either slow down mixing when too strong, or fail to enforce feasibility when too weak. We introduce a two-dimensional extension of the powerful parallel tempering algorithm (PT) that addresses this challenge by adding a second dimension of replicas interpolating the penalty strengths. This scheme ensures constraint satisfaction in the final replicas, analogous to low-energy states at low temperature. The resulting two-dimensional parallel tempering algorithm (2D-PT) improves mixing in heavily constrained replicas and eliminates the need to explicitly tune the penalty strength. In a representative example of graph sparsification with copy constraints, 2D-PT achieves near-ideal mixing, with Kullback-Leibler divergence decaying as O(1/t). When applied to sparsified Wishart instances, 2D-PT yields orders of magnitude speedup over conventional PT with the same number of replicas. The method applies broadly to constrained Ising problems and can be deployed on existing Ising machines.
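Illustrative sketch (an assumption, not the paper's implementation): the abstract describes a grid of replicas indexed by temperature and by penalty strength. If each replica targets a Boltzmann distribution proportional to exp(-beta * (cost + lam * penalty)), the standard replica-exchange Metropolis criterion gives the swap rule below. The names `beta`, `lam`, `cost`, and `pen` are hypothetical, and the paper's actual exchange schedule is not reproduced.

```python
import math
import random


def swap_log_ratio(beta_a, lam_a, beta_b, lam_b, cost_a, pen_a, cost_b, pen_b):
    """Log Metropolis ratio for exchanging the states held by replicas a and b.

    Replica k is assumed to target exp(-beta_k * (cost + lam_k * penalty)).
    """
    def exponent(beta, lam, cost, pen):
        return beta * (cost + lam * pen)

    before = exponent(beta_a, lam_a, cost_a, pen_a) + exponent(beta_b, lam_b, cost_b, pen_b)
    after = exponent(beta_a, lam_a, cost_b, pen_b) + exponent(beta_b, lam_b, cost_a, pen_a)
    return before - after


def accept_swap(log_ratio, rng):
    """Accept with probability min(1, exp(log_ratio))."""
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


# Swap along the temperature axis (same penalty strength): reduces to ordinary PT,
# log ratio = (beta_a - beta_b) * (E_a - E_b).
temp_swap = swap_log_ratio(1.0, 0.0, 2.0, 0.0, cost_a=1.0, pen_a=0.0, cost_b=3.0, pen_b=0.0)

# Swap along the penalty axis (same temperature):
# log ratio = beta * (lam_a - lam_b) * (pen_a - pen_b).
# Replica a (weak penalty) holds a feasible state, replica b (strong penalty) does not.
pen_swap = swap_log_ratio(2.0, 0.5, 2.0, 2.0, cost_a=1.0, pen_a=0.0, cost_b=1.0, pen_b=3.0)

rng = random.Random(0)
print(temp_swap, pen_swap, accept_swap(pen_swap, rng))
```

In the second case the log ratio is positive, so the feasible state always moves to the strongly penalized replica. Under this assumed rule, that is how feasible states would collect in the high-penalty replicas.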

[410] Compositional Function Networks: A High-Performance Alternative to Deep Neural Networks with Built-in Interpretability

Fang Li

Main category: cs.LG

TL;DR: CFNs introduce interpretable models by composing mathematical functions, achieving competitive performance while maintaining transparency.

Details

Motivation: Address the black-box nature of DNNs in high-stakes domains requiring transparency.

Method: Compose elementary mathematical functions with clear semantics, supporting diverse compositional patterns (sequential, parallel, conditional) and enabling efficient training via gradient descent.

Result: CFNs achieve 96.24% accuracy on CIFAR-10, outperforming interpretable models like Explainable Boosting Machines.

Conclusion: CFNs combine deep learning’s expressiveness with interpretability, offering a powerful framework for performance-critical and accountable applications.

Abstract: Deep Neural Networks (DNNs) deliver impressive performance but their black-box nature limits deployment in high-stakes domains requiring transparency. We introduce Compositional Function Networks (CFNs), a novel framework that builds inherently interpretable models by composing elementary mathematical functions with clear semantics. Unlike existing interpretable approaches that are limited to simple additive structures, CFNs support diverse compositional patterns – sequential, parallel, and conditional – enabling complex feature interactions while maintaining transparency. A key innovation is that CFNs are fully differentiable, allowing efficient training through standard gradient descent. We demonstrate CFNs’ versatility across multiple domains, from symbolic regression to image classification with deep hierarchical networks. Our empirical evaluation shows CFNs achieve competitive performance against black-box models (96.24% accuracy on CIFAR-10) while outperforming state-of-the-art interpretable models like Explainable Boosting Machines. By combining the hierarchical expressiveness and efficient training of deep learning with the intrinsic interpretability of well-defined mathematical functions, CFNs offer a powerful framework for applications where both performance and accountability are paramount.

[411] MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement

Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, Tomas Pfister

Main category: cs.LG

TL;DR: MLE-STAR is a novel approach for building LLM-based MLE agents that combines external knowledge retrieval and iterative refinement of ML models, outperforming existing methods.

Details

Motivation: Existing LLM-based MLE agents rely too much on inherent LLM knowledge and coarse exploration, limiting their ability to select task-specific models and deeply explore components like feature engineering.

Method: MLE-STAR retrieves effective models from the web, forms an initial solution, and iteratively refines it by exploring strategies targeting specific ML components, guided by ablation studies. It also introduces a novel ensembling method.

Result: MLE-STAR achieves medals in 64% of Kaggle competitions on MLE-bench Lite, significantly outperforming alternatives.

Conclusion: MLE-STAR demonstrates superior performance by leveraging external knowledge and targeted exploration, making it a promising approach for MLE agents.

Abstract: Agents based on large language models (LLMs) for machine learning engineering (MLE) can automatically implement ML models via code generation. However, existing approaches to build such agents often rely heavily on inherent LLM knowledge and employ coarse exploration strategies that modify the entire code structure at once. This limits their ability to select effective task-specific models and perform deep exploration within specific components, such as experimenting extensively with feature engineering options. To overcome these limitations, we propose MLE-STAR, a novel approach to build MLE agents. MLE-STAR first leverages external knowledge by using a search engine to retrieve effective models from the web, forming an initial solution, then iteratively refines it by exploring various strategies targeting specific ML components. This exploration is guided by ablation studies analyzing the impact of individual code blocks. Furthermore, we introduce a novel ensembling method using an effective strategy suggested by MLE-STAR. Our experimental results show that MLE-STAR achieves medals in 64% of the Kaggle competitions on MLE-bench Lite, significantly outperforming the best alternative.

[412] Some Theoretical Results on Layerwise Effective Dimension Oscillations in Finite Width ReLU Networks

Darshan Makwana

Main category: cs.LG

TL;DR: The paper analyzes the layerwise effective dimension (rank) in finite-width ReLU networks, deriving closed-form expressions for expected rank and identifying oscillatory rank behavior as a finite-width phenomenon.

Details

Motivation: To understand how random ReLU layers alternately collapse and revive input subspaces, adding nuance to prior work on deep network expressivity.

Method: Derive closed-form expressions for the expected rank of hidden activation matrices using random Gaussian weights and analyze rank behavior under different conditions.

Result: The rank deficit decays geometrically, with revival depths showing local maxima. Oscillatory rank behavior is identified as finite-width specific.

Conclusion: The study provides a precise characterization of rank dynamics in ReLU networks, revealing finite-width effects and contrasting with full-rank behavior under orthogonal initialization or leaky-ReLU.

Abstract: We analyze the layerwise effective dimension (rank of the feature matrix) in fully-connected ReLU networks of finite width. Specifically, for a fixed batch of $m$ inputs and random Gaussian weights, we derive closed-form expressions for the expected rank of the $m\times n$ hidden activation matrices. Our main result shows that $\mathbb{E}[EDim(\ell)]=m[1-(1-2/\pi)^\ell]+O(e^{-c m})$ so that the rank deficit decays geometrically with ratio $1-2 / \pi \approx 0.3634$. We also prove a sub-Gaussian concentration bound, and identify the “revival” depths at which the expected rank attains local maxima. In particular, these peaks occur at depths $\ell_k^*\approx(k+1/2)\pi/\log(1/\rho)$ with height $\approx (1-e^{-\pi/2}) m \approx 0.79m$. We further show that this oscillatory rank behavior is a finite-width phenomenon: under orthogonal weight initialization or strong negative-slope leaky-ReLU, the rank remains (nearly) full. These results provide a precise characterization of how random ReLU layers alternately collapse and partially revive the subspace of input variations, adding nuance to prior work on expressivity of deep networks.
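Worked arithmetic (a sketch using only the constants quoted in the abstract): the snippet evaluates the leading term of the stated expectation, m[1-(1-2/π)^ℓ], and checks the two quoted numerical constants. The exponentially small correction O(e^{-cm}) is dropped, and the function name `expected_edim` is hypothetical.

```python
import math


def expected_edim(m, depth):
    """Leading term m * [1 - (1 - 2/pi)^depth] of the expected effective dimension."""
    ratio = 1.0 - 2.0 / math.pi
    return m * (1.0 - ratio ** depth)


ratio = 1.0 - 2.0 / math.pi                # geometric decay ratio of the rank deficit
peak_height = 1.0 - math.exp(-math.pi / 2)  # quoted revival height, as a fraction of m

m = 100
for depth in (1, 2, 4, 8):
    value = expected_edim(m, depth)
    # The deficit m - value shrinks by the factor `ratio` per additional layer.
    print(depth, round(value, 2), round(m - value, 2))

print(round(ratio, 4), round(peak_height, 4))
```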

[413] MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse

Kaiwen Chen, Xin Tan, Minchen Yu, Hong Xu

Main category: cs.LG

TL;DR: MemShare reduces memory overhead in Large Reasoning Models by reusing similar KV cache blocks, improving throughput by up to 84.79% without accuracy loss.

Details

Motivation: LRMs generate redundant intermediate reasoning steps, causing memory inefficiency. MemShare addresses this by reusing similar KV cache states.

Method: Uses collaborative filtering to identify reusable KV cache blocks and enables zero-copy cache reuse.

Result: Achieves up to 84.79% throughput improvement while maintaining accuracy.

Conclusion: MemShare is an effective KV cache management method for LRMs, balancing memory efficiency and performance.

Abstract: Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse, significantly reducing memory overhead and improving throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79% improvement in throughput while maintaining better accuracy compared to existing KV cache management methods.

[414] TPP-SD: Accelerating Transformer Point Process Sampling with Speculative Decoding

Shukai Gong, Yiyang Fu, Fengyuan Ran, Quyu Kong, Feng Zhou

Main category: cs.LG

TL;DR: TPP-SD accelerates Transformer temporal point process sampling using speculative decoding, achieving 2-6× speedup while maintaining output distribution.

Details

Motivation: Bridging the gap between powerful Transformer TPP models and the need for rapid sequence sampling.

Method: Adapts speculative decoding from language models, using a draft model to generate candidate events verified by a target model in parallel.

Result: Produces samples from the same distribution as standard methods with 2-6× speedup, validated on synthetic and real datasets.

Conclusion: TPP-SD efficiently accelerates sampling without compromising output quality, enhancing practical usability.

Abstract: We propose TPP-SD, a novel approach that accelerates Transformer temporal point process (TPP) sampling by adapting speculative decoding (SD) techniques from language models. By identifying the structural similarities between thinning algorithms for TPPs and speculative decoding for language models, we develop an efficient sampling framework that leverages a smaller draft model to generate multiple candidate events, which are then verified by the larger target model in parallel. TPP-SD maintains the same output distribution as autoregressive sampling while achieving significant acceleration. Experiments on both synthetic and real datasets demonstrate that our approach produces samples from the same distribution as standard methods, but with 2-6$\times$ speedup. Our ablation studies analyze the impact of hyperparameters such as draft length and draft model size on sampling efficiency. TPP-SD bridges the gap between powerful Transformer TPP models and the practical need for rapid sequence sampling.
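Illustrative sketch (not the TPP-SD algorithm itself): the claim that a draft-then-verify scheme preserves the target distribution can be checked on the discrete accept/reject rule used in speculative decoding for language models. A draft proposal x ~ q is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual max(p - q, 0). The toy distributions are assumptions, and the continuous event times handled by TPP-SD are not modeled.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Return (token, accepted) with token distributed exactly as p_target."""
    tokens = list(range(len(p_target)))
    x = rng.choices(tokens, weights=q_draft)[0]  # draft model proposes
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True  # target model accepts the proposal
    # Rejected: resample from the part of p_target not covered by q_draft.
    residual = [max(p - q, 0.0) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False


rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]  # hypothetical target distribution
q_draft = [0.2, 0.5, 0.3]   # hypothetical, deliberately mismatched draft

n = 20000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p_target, q_draft, rng)
    counts[token] += 1
    accepted += ok

freq = [c / n for c in counts]
accept_rate = accepted / n
print(freq, accept_rate)
```

The empirical frequencies approach `p_target` even though proposals come from `q_draft`. The acceptance rate approaches the sum of min(p, q) over tokens (0.6 here), which is why a better draft model yields a larger speedup.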

[415] MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Yaowei Jin, Junjie Wang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi

Main category: cs.LG

TL;DR: Proposes Parameter Interpolation Flow (PIF) for molecular generation, outperforming Bayesian Flow Networks (BFNs) in drug design tasks.

Details

Motivation: BFNs' limitations in flexibility and adaptability for diverse data distributions and tasks, and unexplored potential of simpler parameter-space models.

Method: Introduces PIF with theoretical foundation, training, and inference procedures, and develops MolPIF for drug design.

Result: MolPIF shows superior performance across metrics compared to baselines.

Conclusion: Validates parameter-space-based generative modeling for molecules and offers new design perspectives.

Abstract: Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with detailed theoretical foundation, training, and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.

[416] Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic

Molly Wang, Kin K. Leung

Main category: cs.LG

TL;DR: The paper introduces a spatial-temporal RL (STRL) framework for packet routing to address the limitations of traditional RL methods in handling non-Markovian internet traffic and network topology spatial structure.

Details

Motivation: Traditional RL methods for packet routing rely on the Markov assumption and lack spatial modeling, which doesn't align with real-world non-Markovian internet traffic and network topologies.

Method: The authors design a network environment with non-Markovian traffic and propose a spatial-temporal RL (STRL) framework to better model routing decisions.

Result: The STRL framework outperforms traditional baselines by over 19% during training and 7% during inference, even with changes in network topology.

Conclusion: The STRL framework effectively addresses the limitations of traditional RL methods, offering improved performance in packet routing for non-Markovian and spatially structured networks.

Abstract: Reinforcement Learning (RL) has been widely used for packet routing in communication networks, but traditional RL methods rely on the Markov assumption that the current state contains all necessary information for decision-making. In reality, internet traffic is non-Markovian, and past states do influence routing performance. Moreover, common deep RL approaches use function approximators, such as neural networks, that do not model the spatial structure in network topologies. To address these shortcomings, we design a network environment with non-Markovian traffic and introduce a spatial-temporal RL (STRL) framework for packet routing. Our approach outperforms traditional baselines by more than 19% during training and 7% for inference despite a change in network topology.

[417] A Comprehensive Review of Diffusion Models in Smart Agriculture: Progress, Applications, and Challenges

Xing Hu, Haodong Chen, Qianqian Duan, Danfeng Hong, Ruijiao Li, Huiliang Shang, Linghua Jiang, Haima Yang, Dawei Zhang

Main category: cs.LG

TL;DR: Diffusion models show promise in agriculture for tasks like pest detection and data augmentation, outperforming GANs in stability and quality. Challenges remain, but their role in smart agriculture is growing.

Details

Motivation: Addressing the scarcity of arable land and the need for efficient agricultural practices, AI and deep learning, especially diffusion models, are explored for their potential in smart agriculture.

Method: The paper reviews diffusion models’ applications in agriculture, focusing on pest detection, remote sensing, crop prediction, and resource management.

Result: Diffusion models improve accuracy and robustness in tasks like data augmentation and image generation, even in complex environments.

Conclusion: Despite computational and generalization challenges, diffusion models are poised to significantly impact smart and precision agriculture, aiding global sustainability.

Abstract: With the global population growing and arable land resources becoming increasingly scarce, smart agriculture and precision agriculture have emerged as key directions for the future of agricultural development. Artificial intelligence (AI) technologies, particularly deep learning models, have found widespread applications in areas such as crop monitoring and pest detection. As an emerging generative model, diffusion models have shown significant promise in tasks like agricultural image processing, data augmentation, and remote sensing. Compared to traditional generative adversarial networks (GANs), diffusion models offer superior training stability and generation quality, effectively addressing challenges such as limited agricultural data and imbalanced image samples. This paper reviews the latest advancements in the application of diffusion models in agriculture, focusing on their potential in crop pest and disease detection, remote sensing image enhancement, crop growth prediction, and agricultural resource management. Experimental results demonstrate that diffusion models significantly improve model accuracy and robustness in data augmentation, image generation, and denoising, especially in complex environments. Despite challenges related to computational efficiency and generalization capabilities, diffusion models are expected to play an increasingly important role in smart and precision agriculture as technology advances, providing substantial support for the sustainable development of global agriculture.

[418] GCL-GCN: Graphormer and Contrastive Learning Enhanced Attributed Graph Clustering Network

Binxiong Li, Xu Xiang, Xue Li, Quanzhou Lou, Binyu Zhao, Yujie Liu, Huijie Tang, Benhan Yang

Main category: cs.LG

TL;DR: GCL-GCN is a novel deep graph clustering model that improves clustering quality by capturing local and global dependencies in sparse, heterogeneous graph data using a Graphormer module and contrastive learning.

Details

Motivation: Addressing the challenge of leveraging graph information for clustering due to data complexity and attribute heterogeneity.

Method: Proposes GCL-GCN with a Graphormer module for centrality encoding and spatial relationships, and a contrastive learning module for enhanced feature distinction.

Result: Outperforms 14 advanced methods, with significant improvements on the Cora dataset (ACC: +4.94%, NMI: +13.01%, ARI: +10.97%).

Conclusion: GCL-GCN effectively enhances clustering quality and robustness in attributed graph clustering.

Abstract: Attributed graph clustering holds significant importance in modern data analysis. However, due to the complexity of graph data and the heterogeneity of node attributes, leveraging graph information for clustering remains challenging. To address this, we propose a novel deep graph clustering model, GCL-GCN, specifically designed to address the limitations of existing models in capturing local dependencies and complex structures when dealing with sparse and heterogeneous graph data. GCL-GCN introduces an innovative Graphormer module that combines centrality encoding and spatial relationships, effectively capturing both global and local information between nodes, thereby enhancing the quality of node representations. Additionally, we propose a novel contrastive learning module that significantly enhances the discriminative power of feature representations. In the pre-training phase, this module increases feature distinction through contrastive learning on the original feature matrix, ensuring more identifiable initial representations for subsequent graph convolution and clustering tasks. Extensive experimental results on six datasets demonstrate that GCL-GCN outperforms 14 advanced methods in terms of clustering quality and robustness. Specifically, on the Cora dataset, it improves ACC, NMI, and ARI by 4.94%, 13.01%, and 10.97%, respectively, compared to the primary comparison method MBN.

[419] AdaptHetero: Machine Learning Interpretation-Driven Subgroup Adaptation for EHR-Based Clinical Prediction

Ling Liao, Eva Aagaard

Main category: cs.LG

TL;DR: AdaptHetero is an MLI-driven framework that improves subgroup-specific modeling in EHRs by integrating SHAP-based interpretation and unsupervised clustering, enhancing predictive performance.

Details

Motivation: The complexity and heterogeneity of EHR data limit the effectiveness of MLI in guiding subgroup-specific modeling, necessitating a tailored approach.

Method: The framework uses SHAP-based interpretation and unsupervised clustering to transform interpretability insights into actionable guidance for model training and evaluation across subpopulations.

Result: Evaluated on three EHR datasets, AdaptHetero identifies heterogeneous model behaviors and improves predictive performance for ICU mortality, in-hospital death, and hidden hypoxemia.

Conclusion: AdaptHetero enhances the identification of clinically meaningful subgroup-specific characteristics, optimizing clinical deployment.

Abstract: Machine learning interpretation (MLI) has primarily been leveraged to build clinician trust and uncover actionable insights in EHRs. However, the intrinsic complexity and heterogeneity of EHR data limit its effectiveness in guiding subgroup-specific modeling. We propose AdaptHetero, a novel MLI-driven framework that transforms interpretability insights into actionable guidance for tailoring model training and evaluation across subpopulations within individual hospital systems. Evaluated on three large-scale EHR datasets: GOSSIS-1-eICU, WiDS, and MIMIC-IV, AdaptHetero consistently identifies heterogeneous model behaviors in predicting ICU mortality, in-hospital death, and hidden hypoxemia. By integrating SHAP-based interpretation and unsupervised clustering, the framework enhances the identification of clinically meaningful subgroup-specific characteristics, leading to improved predictive performance and optimized clinical deployment.

[420] CS-SHRED: Enhancing SHRED for Robust Recovery of Spatiotemporal Dynamics

Romulo B. da Silva, Diego Passos, Cássio M. Oishi, J. Nathan Kutz

Main category: cs.LG

TL;DR: CS-SHRED integrates Compressed Sensing with a Shallow Recurrent Decoder to reconstruct spatiotemporal dynamics from incomplete or noisy data, outperforming traditional methods in fidelity and robustness.

Details

Motivation: To address challenges in reconstructing spatiotemporal data from sparse, noisy, or incomplete sensor measurements.

Method: Combines CS techniques with SHRED, using an adaptive loss function (MSE, MAE, SNR regularization) and LSTM for temporal modeling.

Result: Higher reconstruction fidelity (improved SSIM, PSNR, LPIPS) and robustness in various applications like fluid flows and climate data.

Conclusion: CS-SHRED is effective for spatiotemporal data recovery, with broad applications in environmental and scientific analyses.

Abstract: We present CS-SHRED, a novel deep learning architecture that integrates Compressed Sensing (CS) into a Shallow Recurrent Decoder (SHRED) to reconstruct spatiotemporal dynamics from incomplete, compressed, or corrupted data. Our approach introduces two key innovations. First, by incorporating CS techniques into the SHRED architecture, our method leverages a batch-based forward framework with $\ell_1$ regularization to robustly recover signals even in scenarios with sparse sensor placements, noisy measurements, and incomplete sensor acquisitions. Second, an adaptive loss function dynamically combines Mean Squared Error (MSE) and Mean Absolute Error (MAE) terms with a piecewise Signal-to-Noise Ratio (SNR) regularization, which suppresses noise and outliers in low-SNR regions while preserving fine-scale features in high-SNR regions. We validate CS-SHRED on challenging problems including viscoelastic fluid flows, maximum specific humidity fields, sea surface temperature distributions, and rotating turbulent flows. Compared to the traditional SHRED approach, CS-SHRED achieves significantly higher reconstruction fidelity, as demonstrated by improved SSIM and PSNR values, lower normalized errors, and enhanced LPIPS scores, thereby providing superior preservation of small-scale structures and increased robustness against noise and outliers. Our results underscore the advantages of the jointly trained CS and SHRED design architecture, which includes an LSTM sequence model for characterizing the temporal evolution with a shallow decoder network (SDN) for modeling the high-dimensional state space. The SNR-guided adaptive loss function for the spatiotemporal data recovery establishes CS-SHRED as a promising tool for a wide range of applications in environmental, climatic, and scientific data analyses.
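
To make the loss description above more concrete, here is a minimal sketch of a reconstruction loss that mixes MSE and MAE and adds an $\ell_1$ penalty. It is an illustration under stated assumptions, not the authors' implementation: the function name `mixed_loss`, the fixed weights `alpha` and `lam`, and the choice to penalize a vector of sparse coefficients are all hypothetical, and the piecewise SNR regularization and the dynamic weighting mentioned in the abstract are deliberately left out because the abstract does not specify them.

```python
def mixed_loss(pred, target, coeffs, alpha=0.5, lam=0.1):
    """Hypothetical loss: alpha * MSE + (1 - alpha) * MAE + lam * ||coeffs||_1.

    alpha and lam are fixed here; the paper describes an adaptive,
    SNR-guided combination that this sketch does not reproduce.
    """
    errors = [p - t for p, t in zip(pred, target)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    l1 = sum(abs(c) for c in coeffs)  # l1 term encourages sparse coefficients
    return alpha * mse + (1 - alpha) * mae + lam * l1


# Toy values: errors are 0, -1, 2, so MSE = 5/3 and MAE = 1.
loss = mixed_loss([1.0, 2.0, 4.0], [1.0, 3.0, 2.0], coeffs=[0.5, -0.5, 0.0])
print(loss)
```

The MAE term grows linearly with the error while the MSE term grows quadratically, so shifting weight toward MAE is one common way to reduce the influence of outliers.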

[421] H2Tune: Federated Foundation Model Fine-Tuning with Hybrid Heterogeneity

Wei Guo, Siyuan Lu, Yiqi Tong, Zhaojun Hu, Fuzhen Zhuang, Xiao Zhang, Tao Fan, Jin Dong

Main category: cs.LG

TL;DR: H2Tune targets hybrid heterogeneous federated fine-tuning (HHFFT), where clients differ in both model architecture and downstream task, and improves accuracy by up to 15.4% over baselines.

Details

Motivation: Existing FFT methods lack solutions for double heterogeneity in model architectures and downstream tasks.

Method: H2Tune uses sparsified triple matrix decomposition, relation-guided alignment, and task-knowledge disentanglement.

Result: Achieves up to 15.4% accuracy improvement over state-of-the-art baselines, with a proven O(1/√T) convergence rate.

Conclusion: H2Tune effectively handles hybrid heterogeneity in federated fine-tuning, outperforming existing methods.

Abstract: Different from existing federated fine-tuning (FFT) methods for foundation models, hybrid heterogeneous federated fine-tuning (HHFFT) is an under-explored scenario where clients exhibit double heterogeneity in model architectures and downstream tasks. This hybrid heterogeneity introduces two significant challenges: 1) heterogeneous matrix aggregation, where clients adopt different large-scale foundation models based on their task requirements and resource limitations, leading to dimensional mismatches during LoRA parameter aggregation; and 2) multi-task knowledge interference, where local shared parameters, trained with both task-shared and task-specific knowledge, cannot ensure only task-shared knowledge is transferred between clients. To address these challenges, we propose H2Tune, a federated foundation model fine-tuning framework for hybrid heterogeneity. Our framework H2Tune consists of three key components: (i) sparsified triple matrix decomposition to align hidden dimensions across clients through constructing rank-consistent middle matrices, with adaptive sparsification based on client resources; (ii) relation-guided matrix layer alignment to handle heterogeneous layer structures and representation capabilities; and (iii) alternating task-knowledge disentanglement mechanism to decouple shared and specific knowledge of local model parameters through alternating optimization. Theoretical analysis proves a convergence rate of $O(1/\sqrt{T})$. Extensive experiments show our method achieves up to 15.4% accuracy improvement compared to state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/H2Tune-1407.
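
The dimensional-mismatch problem and the idea of a rank-consistent middle matrix can be illustrated with a small sketch. It assumes each client factors its low-rank update as B · M · A, where only the r × r middle matrix M has a shape shared by all clients; the helper names, the plain averaging rule, and the toy values are hypothetical, and the sparsification, layer alignment, and disentanglement steps described in the abstract are not modelled.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def average(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]


def filled(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


# Two clients with different hidden sizes (4 and 3) but the same rank r = 2.
client_a = {"B": filled(4, 2, 0.1), "M": [[1.0, 0.0], [0.0, 1.0]], "A": filled(2, 4, 0.5)}
client_b = {"B": filled(3, 2, 0.2), "M": [[3.0, 2.0], [2.0, 3.0]], "A": filled(2, 3, 0.5)}

# Only the r x r middle matrices share a shape, so only they are aggregated.
shared_M = average([client_a["M"], client_b["M"]])

# Each client rebuilds its own update with its local B and A and the shared M.
delta_a = matmul(matmul(client_a["B"], shared_M), client_a["A"])  # 4 x 4
delta_b = matmul(matmul(client_b["B"], shared_M), client_b["A"])  # 3 x 3
print(shared_M)
```

Averaging the full B · A products directly would fail here because a 4 × 4 and a 3 × 3 matrix cannot be added, which is the mismatch the abstract describes for LoRA parameter aggregation.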

[422] G-Core: A Simple, Scalable and Balanced RLHF Trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao

Main category: cs.LG

TL;DR: G-Core is a scalable RLHF training framework addressing challenges like controller bottlenecks and dynamic workloads, improving efficiency and utilization.

Details

Motivation: Existing RLHF systems struggle with scalability and adaptability in multi-modal workflows, especially under dynamic conditions.

Method: G-Core introduces a parallel controller model and dynamic resource placement to optimize RLHF workflows.

Result: G-Core has trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world use.

Conclusion: G-Core advances RLHF training, offering a scalable solution for large-scale, human-aligned models.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.

cs.MA

[423] Causal-Inspired Multi-Agent Decision-Making via Graph Reinforcement Learning

Jing Wang, Yan Jin, Fei Ding, Chongfeng Wei

Main category: cs.MA

TL;DR: The paper integrates causal learning with reinforcement learning to improve autonomous vehicle decision-making in multi-agent environments, achieving superior performance in metrics like collision rate and cumulative reward.

Details

Motivation: Existing autonomous driving research struggles with seamless multi-vehicle interactions, prompting the need for better decision-making methods in complex traffic scenarios.

Method: The study combines causal disentanglement representation learning (CDRL) with graph neural network-based reinforcement learning to identify and use causal features for decision-making.

Result: The proposed method achieves the highest average reward during training and outperforms other methods in collision rate and cumulative reward during testing.

Conclusion: The approach advances multi-agent autonomous driving systems, making navigation safer and more efficient in complex traffic environments.

Abstract: Autonomous driving technology has made remarkable progress over the last decade. However, most existing research still struggles to address the challenges posed by environments where multiple vehicles have to interact seamlessly. This study aims to integrate causal learning with reinforcement learning-based methods by leveraging causal disentanglement representation learning (CDRL) to identify and extract causal features that influence optimal decision-making in autonomous vehicles. These features are then incorporated into graph neural network-based reinforcement learning algorithms to enhance decision-making in complex traffic scenarios. By using causal features as inputs, the proposed approach enables the optimization of vehicle behavior at an unsignalized intersection. Experimental results demonstrate that the proposed method achieves the highest average reward during training and significantly outperforms other learning-based methods in several key metrics, such as collision rate and average cumulative reward, during testing. This study provides a promising direction for advancing multi-agent autonomous driving systems and making autonomous vehicles’ navigation safer and more efficient in complex traffic environments.

[424] Barriers to Healthcare: Agent-Based Modeling to Mitigate Inequity

Alba Aguilera, Georgina Curto, Nardine Osman

Main category: cs.MA

TL;DR: Agent-based simulations using the capability approach (CA) assess social policies for human well-being, applied to health inequity among Barcelona’s homeless population.

Details

Motivation: To evaluate social policies non-invasively and address human development challenges by modeling agent behavior and inequity criteria.

Method: Integrate CA framework into reinforcement learning for policy simulation, collaborating with stakeholders for a case study on homelessness.

Result: First proof-of-concept simulation aligned with CA to assess policy impacts under parliamentary discussion.

Conclusion: Demonstrates potential of agent-based simulations guided by CA for policy evaluation in real-world contexts.

Abstract: Agent-based simulations have enormous potential as tools to evaluate social policies in a non-invasive way, before these are applied to real-world populations. However, the recommendations that these computational approaches may offer to tackle urgent human development challenges can vary substantially depending on how we model the behaviour of agents (people) and the criteria that we use to measure inequity. In this paper, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to guide the simulation and evaluate the effectiveness of policies. We define a reinforcement learning environment where agents behave to restore their capabilities under the constraints of a specific policy. Working in collaboration with local stakeholders, non-profits and domain experts, we apply our model in a case study to mitigate health inequity among the population experiencing homelessness (PEH) in Barcelona. By doing so, we present the first proof-of-concept simulation, aligned with the CA for human development, to assess the impact of policies under parliamentary discussion.

[425] A survey of multi-agent geosimulation methodologies: from ABM to LLM

Virginia Padilla, Jacinto Dávila

Main category: cs.MA

TL;DR: The paper examines agent-based approaches for multi-agent systems, proposing a framework for geosimulation platforms and showing LLMs can integrate effectively as agents.

Details

Motivation: To formalize principles and linkages in multi-agent systems, simulations, and information systems for geosimulation.

Method: Comprehensive examination and framework development based on two decades of study.

Result: LLMs can be effectively integrated as agent components in geosimulation if they follow a structured architecture.

Conclusion: The proposed framework provides a solid foundation for next-generation geosimulation systems.

Abstract: We provide a comprehensive examination of agent-based approaches that codify the principles and linkages underlying multi-agent systems, simulations, and information systems. Based on two decades of study, this paper confirms a framework intended as a formal specification for geosimulation platforms. Our findings show that large language models (LLMs) can be effectively incorporated as agent components if they follow a structured architecture specific to fundamental agent activities such as perception, memory, planning, and action. This integration is precisely consistent with the architecture that we formalize, providing a solid platform for next-generation geosimulation systems.

cs.MM

[426] Hybrid CNN-Mamba Enhancement Network for Robust Multimodal Sentiment Analysis

Xiang Li, Xianfu Cheng, Xiaoming Zhang, Zhoujun Li

Main category: cs.MM

TL;DR: Proposes HCMEN, a hybrid CNN-Mamba framework for robust multimodal sentiment analysis under missing modalities, outperforming existing methods.

Details

Motivation: Address challenges in aligning and fusing multimodal information effectively under missing modality conditions.

Method: Uses hierarchical unimodal modeling, cross-modal enhancement, and multimodal mix-up fusion with CNN for local details and Mamba for global context.

Result: Outperforms state-of-the-art methods on benchmark datasets in various missing modality scenarios.

Conclusion: HCMEN is effective for robust multimodal sentiment analysis with missing data, with plans to release the code publicly.

Abstract: Multimodal Sentiment Analysis (MSA) with missing modalities has recently attracted increasing attention. Although existing research mainly focuses on designing complex model architectures to handle incomplete data, it still faces significant challenges in effectively aligning and fusing multimodal information. In this paper, we propose a novel framework called the Hybrid CNN-Mamba Enhancement Network (HCMEN) for robust multimodal sentiment analysis under missing modality conditions. HCMEN is designed around three key components: (1) hierarchical unimodal modeling, (2) cross-modal enhancement and alignment, and (3) multimodal mix-up fusion. First, HCMEN integrates the strengths of Convolutional Neural Network (CNN) for capturing local details and the Mamba architecture for modeling global contextual dependencies across different modalities. Furthermore, grounded in the principle of Mutual Information Maximization, we introduce a cross-modal enhancement mechanism that generates proxy modalities from mixed token-level representations and learns fine-grained token-level correspondences between modalities. The enhanced unimodal features are then fused and passed through the CNN-Mamba backbone, enabling local-to-global cross-modal interaction and comprehensive multimodal integration. Extensive experiments on two benchmark MSA datasets demonstrate that HCMEN consistently outperforms existing state-of-the-art methods, achieving superior performance across various missing modality scenarios. The code will be released publicly in the near future.

Jie Qin, Wei Yang, Yan Su, Yiran Zhu, Weizhen Li, Yunyue Pan, Chengchang Pan, Honggang Qi

Main category: cs.MM

TL;DR: A bimodal prediction framework improves HER2 assessment in breast cancer by flexibly supporting single- or dual-modality inputs, enhancing accuracy and accessibility.

Details

Motivation: Clinical constraints and cost hinder acquiring both H&E and IHC images for HER2 assessment, necessitating a flexible solution.

Method: Proposes an adaptive framework with a dynamic branch selector and cross-modal GAN (CM-GAN) for modality completion or joint inference.

Result: Improves H&E-only accuracy to 94.25%, achieves 95.09% with dual-modality, and maintains 90.28% reliability for single-modality.

Conclusion: The framework offers a cost-effective, accessible solution for HER2 assessment, especially in resource-limited regions.

Abstract: In breast cancer HER2 assessment, clinical evaluation relies on combined H&E and IHC images, yet acquiring both modalities is often hindered by clinical constraints and cost. We propose an adaptive bimodal prediction framework that flexibly supports single- or dual-modality inputs through two core innovations: a dynamic branch selector activating modality completion or joint inference based on input availability, and a cross-modal GAN (CM-GAN) enabling feature-space reconstruction of missing modalities. This design dramatically improves H&E-only accuracy from 71.44% to 94.25%, achieves 95.09% with full dual-modality inputs, and maintains 90.28% reliability under single-modality conditions. The “dual-modality preferred, single-modality compatible” architecture delivers near-dual-modality accuracy without mandatory synchronized acquisition, offering a cost-effective solution for resource-limited regions and significantly improving HER2 assessment accessibility.

eess.AS

[428] Exploring Dynamic Parameters for Vietnamese Gender-Independent ASR

Sotheara Leang, Éric Castelli, Dominique Vaufreydaz, Sethserey Sam

Main category: eess.AS

TL;DR: The paper introduces polar parameters derived from Spectral Subband Centroid Frequencies (SSCFs) to capture speech dynamics, combined with MFCCs for Vietnamese ASR, reducing word error rates and improving gender independence.

Details

Motivation: To enhance ASR by capturing dynamic speech characteristics and minimizing spectral variation, especially for tonal languages like Vietnamese.

Method: Characterized acoustic transitions in SSCFs using polar parameters, combined with MFCCs, and used SSCF0 as a pseudo-feature for tonal information.

Result: Significant reduction in word error rates and improved gender independence compared to baseline MFCCs.

Conclusion: The proposed parameters effectively enhance ASR performance, particularly for tonal languages, by leveraging dynamic speech characteristics.

Abstract: The dynamic characteristics of the speech signal provide temporal information and play an important role in enhancing Automatic Speech Recognition (ASR). In this work, we characterized the acoustic transitions in a ratio plane of Spectral Subband Centroid Frequencies (SSCFs) using polar parameters to capture the dynamic characteristics of the speech and minimize spectral variation. These dynamic parameters were combined with Mel-Frequency Cepstral Coefficients (MFCCs) in Vietnamese ASR to capture more detailed spectral information. The SSCF0 was used as a pseudo-feature for the fundamental frequency (F0) to describe the tonal information robustly. The findings showed that the proposed parameters significantly reduce word error rates and exhibit greater gender independence than the baseline MFCCs.
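
As a rough illustration of the two ingredients named above, the sketch below computes a spectral subband centroid as the power-weighted mean frequency of a band, and describes a transition between two points in a plane by its radius and angle. The band edges, the toy spectrum, and the use of a generic 2-D point are assumptions made for illustration; the abstract does not say which SSCF ratios span the plane or how the subbands are defined, so this is not the authors' feature extraction.

```python
import math


def subband_centroids(freqs, power, edges):
    """Power-weighted mean frequency of each subband [edges[i], edges[i+1])."""
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = [(f, p) for f, p in zip(freqs, power) if lo <= f < hi]
        total = sum(p for _, p in band)
        # A silent or empty band has no defined centroid; fall back to its midpoint.
        centroids.append(sum(f * p for f, p in band) / total if total > 0 else (lo + hi) / 2)
    return centroids


def polar_transition(prev_point, next_point):
    """Radius and angle (radians) of the displacement between two 2-D points."""
    dx = next_point[0] - prev_point[0]
    dy = next_point[1] - prev_point[1]
    return math.hypot(dx, dy), math.atan2(dy, dx)


# Toy spectrum with two hypothetical subbands: [0, 300) Hz and [300, 700) Hz.
freqs = [100, 200, 300, 400, 500, 600]
power = [1, 3, 0, 0, 2, 2]
centroids = subband_centroids(freqs, power, edges=[0, 300, 700])
print(centroids)  # [175.0, 550.0]

radius, angle = polar_transition((0.0, 0.0), (3.0, 4.0))
print(radius)  # 5.0
```

Describing a frame-to-frame movement by radius and angle separates how fast the spectrum is changing from the direction in which it changes, which is one way to read the "dynamic parameters" in the abstract.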

[429] Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Hung-yi Lee

Main category: eess.AS

TL;DR: Full-Duplex-Bench v1.5 is introduced as a modular benchmark for evaluating full-duplex speech agents, simulating four overlap scenarios and offering extensible metrics.

Details

Motivation: Overlap management in full-duplex speech agents is under-evaluated, necessitating a comprehensive benchmark.

Method: The benchmark simulates four overlap scenarios (interruption, backchannel, side conversation, ambient speech) and evaluates agents using metrics like latency, prosody, and speech quality.

Result: Benchmarking reveals two strategies: repair-first rapid yielding and continuity-first sustained flow, with scenario-dependent performance trends.

Conclusion: The open-sourced design allows customization, aiding practitioners in evaluating robust full-duplex speech systems.

Abstract: While full-duplex speech agents promise natural, low-latency human–machine interaction by concurrently processing input and output speech, overlap management remains under-evaluated. We introduce Full-Duplex-Bench v1.5, a modular, fully automated benchmark that simulates four overlap scenarios: user interruption, listener backchannel, side conversation, and ambient speech. Our framework supports both open-sourced and commercial models, offering a comprehensive, extensible metric suite – categorical dialogue behaviors, stop and response latency, prosodic adaptation, and perceived speech quality – that can be tailored to application-specific criteria. Benchmarking five state-of-the-art agents reveals two principal strategies: repair-first rapid yielding versus continuity-first sustained flow, and highlights scenario-dependent performance trends. The open-sourced design enables seamless extension with new audio assets, languages, and deployment contexts, empowering practitioners to customize and accelerate the evaluation of robust full-duplex speech systems.

[430] Feature Importance across Domains for Improving Non-Intrusive Speech Intelligibility Prediction in Hearing Aids

Ryandhimas E. Zezario, Sabato M. Siniscalchi, Fei Chen, Hsin-Min Wang, Yu Tsao

Main category: eess.AS

TL;DR: The paper introduces Feature Importance across Domains (FiDo) to improve non-intrusive speech intelligibility assessment in hearing aids, reducing RMSE by 7.62%.

Details

Motivation: Enhance performance of non-intrusive speech intelligibility assessment in hearing aids by focusing on important acoustic features.

Method: Estimates feature importance on spectral and time-domain features, projects features into new spaces, and concatenates them before assessment.

Result: RMSE reduced by 7.62% (26.10 to 24.11) and outperforms the 2023 Clarity Prediction Challenge’s best system by 3.98%.

Conclusion: FiDo effectively enhances neural speech assessment in hearing aids.

Abstract: Given the critical role of non-intrusive speech intelligibility assessment in hearing aids (HA), this paper enhances its performance by introducing Feature Importance across Domains (FiDo). We estimate feature importance on spectral and time-domain acoustic features as well as latent representations of Whisper. Importance weights are calculated per frame, and based on these weights, features are projected into new spaces, allowing the model to focus on important areas early. Next, feature concatenation is performed to combine the features before the assessment module processes them. Experimental results show that when FiDo is incorporated into the improved multi-branched speech intelligibility model MBI-Net+, RMSE can be reduced by 7.62% (from 26.10 to 24.11). MBI-Net+ with FiDo also achieves a relative RMSE reduction of 3.98% compared to the best system in the 2023 Clarity Prediction Challenge. These results validate FiDo’s effectiveness in enhancing neural speech assessment in HA.
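
The 7.62% figure is a relative reduction, which can be checked directly from the two RMSE values quoted in the abstract. The function name below is a hypothetical helper, not something from the paper.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent: (baseline - improved) / baseline * 100."""
    return (baseline - improved) / baseline * 100.0


# RMSE of MBI-Net+ without and with FiDo, as reported in the abstract.
reduction = relative_reduction(26.10, 24.11)
print(f"{reduction:.2f}%")  # 7.62%
```

The absolute drop is 1.99 RMSE points; dividing by the 26.10 baseline gives the relative figure.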

[431] CUHK-EE Systems for the vTAD Challenge at NCMMSC 2025

Aemon Yat Fei Chiu, Jingyu Li, Yusheng Tian, Guangyan Zhang, Tan Lee

Main category: eess.AS

TL;DR: The paper introduces vTAD systems using WavLM-Large embeddings and Diff-Net variants (FFN and SE-ResFFN) for timbre attribute detection, showing trade-offs between model complexity and generalization.

Details

Motivation: To develop robust systems for detecting voice timbre attributes, addressing challenges like speaker identity, annotation subjectivity, and data imbalance.

Method: Uses WavLM-Large embeddings with attentive statistical pooling and two Diff-Net variants (FFN and SE-ResFFN) to compare timbre attributes between utterance pairs.

Result: WavLM-Large+FFN achieves 77.96% accuracy (21.79% EER) for unseen speakers, while WavLM-Large+SE-ResFFN scores 94.42% accuracy (5.49% EER) for seen speakers.

Conclusion: Architectural choices impact fine-grained speaker modeling, with trade-offs between complexity and generalization. Future work should focus on robustness and fairness.

Abstract: This paper presents the Voice Timbre Attribute Detection (vTAD) systems developed by the Digital Signal Processing & Speech Technology Laboratory (DSP&STL) of the Department of Electronic Engineering (EE) at The Chinese University of Hong Kong (CUHK) for the 20th National Conference on Human-Computer Speech Communication (NCMMSC 2025) vTAD Challenge. The proposed systems leverage WavLM-Large embeddings with attentive statistical pooling to extract robust speaker representations, followed by two variants of Diff-Net, i.e., Feed-Forward Neural Network (FFN) and Squeeze-and-Excitation-enhanced Residual FFN (SE-ResFFN), to compare timbre attribute intensities between utterance pairs. Experimental results demonstrate that the WavLM-Large+FFN system generalises better to unseen speakers, achieving 77.96% accuracy and 21.79% EER, while the WavLM-Large+SE-ResFFN model excels in the ‘Seen’ setting with 94.42% accuracy and 5.49% EER. These findings highlight a trade-off between model complexity and generalisation, and underscore the importance of architectural choices in fine-grained speaker modelling. Our analysis also reveals the impact of speaker identity, annotation subjectivity, and data imbalance on system performance, pointing to future directions for improving robustness and fairness in timbre attribute detection.
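
For readers unfamiliar with the EER figures quoted above, the equal error rate is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a coarse, threshold-sweeping approximation on toy scores; the function name and the data are hypothetical, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one as done here.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        far = sum(s >= threshold for s in negatives) / len(negatives)  # false acceptance
        frr = sum(s < threshold for s in positives) / len(positives)   # false rejection
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: higher means "the first utterance shows the attribute more strongly".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer = equal_error_rate(scores, labels)
print(eer)  # 0.25
```

In this toy case the two error rates meet at a threshold of 0.6, where one of four negatives is accepted and one of four positives is rejected.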

[432] MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Yadong Niu, Tianzi Wang, Heinrich Dinkel, Xingwei Sun, Jiahao Zhou, Gang Li, Jizhong Liu, Xunying Liu, Junbo Zhang, Jian Luan

Main category: eess.AS

TL;DR: The paper introduces MECAT, a benchmark for fine-grained audio understanding, and DATE, a novel evaluation metric, to address gaps in current audio-language models.

Details

Motivation: Current benchmarks lack the ability to distinguish between generic and detailed model outputs, limiting nuanced audio comprehension.

Method: MECAT is created using expert models and Chain-of-Thought reasoning, while DATE combines semantic similarity and discriminability for evaluation.

Result: The benchmark and metric provide insights into the capabilities and limitations of state-of-the-art audio models.

Conclusion: MECAT and DATE enhance fine-grained audio understanding and evaluation, with data and code publicly available.

Abstract: While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

[433] Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

Ming Cheng, Ming Li

Main category: eess.AS

TL;DR: The paper introduces MIMO-TSVAD, a novel audio-visual framework for speaker diarization, addressing limitations of audio-only TS-VAD by leveraging both acoustic and visual cues.

Details

Motivation: Traditional TS-VAD relies solely on audio, which struggles with overlapped speech. Visual data can mitigate this but faces issues like occlusion. The paper aims to combine both modalities for robust performance.

Method: Proposes MIMO-TSVAD, a sequence-to-sequence framework that flexibly uses audio, video, or both for speaker diarization, handling modality-missing scenarios.

Result: Achieves state-of-the-art DERs of 4.18%, 10.10%, and 8.15% on VoxConverse, DIHARD-III, and MISP 2022 datasets, respectively, and performs well even with missing visual data.

Conclusion: MIMO-TSVAD effectively integrates audio-visual inputs, outperforming audio-only methods and demonstrating robustness in challenging scenarios.

Abstract: Audio-visual learning has demonstrated promising results in many classical speech tasks (e.g., speech separation, automatic speech recognition, wake-word spotting). We believe that introducing the visual modality will also benefit speaker diarization. To date, Target-Speaker Voice Activity Detection (TS-VAD) plays an important role in highly accurate speaker diarization. However, previous TS-VAD models take audio features and utilize the speaker’s acoustic footprint to distinguish his or her personal speech activities, which is easily affected by overlapped speech in multi-speaker scenarios. Although visual information naturally tolerates overlapped speech, it suffers from spatial occlusion, low resolution, etc. The potential modality-missing problem has hindered the extension of TS-VAD to an audio-visual approach. This paper proposes a novel Multi-Input Multi-Output Target-Speaker Voice Activity Detection (MIMO-TSVAD) framework for speaker diarization. The proposed method can take audio-visual input and leverage the speaker’s acoustic footprint or lip track to flexibly conduct audio-based, video-based, and audio-visual speaker diarization in a unified sequence-to-sequence framework. Experimental results show that the MIMO-TSVAD framework demonstrates state-of-the-art performance on the VoxConverse, DIHARD-III, and MISP 2022 datasets under corresponding evaluation metrics, obtaining the Diarization Error Rates (DERs) of 4.18%, 10.10%, and 8.15%, respectively. In addition, it can perform robustly in heavy lip-missing scenarios.
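
For readers unfamiliar with the Diarization Error Rate quoted above, the following is a minimal frame-level sketch of how such a rate can be computed. It is not the scoring procedure used in the paper: it assumes fixed-length frames, speaker labels that are already mapped between reference and hypothesis, and no forgiveness collar. The function name `frame_der` and the toy labels are illustrative.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER from per-frame sets of active speaker ids.

    Assumption: hypothesis speaker ids are already mapped to reference ids.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    errors = 0
    ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        correct = len(ref & hyp)
        # Covers missed speech, false alarms, and speaker confusion in one term.
        errors += max(len(ref), len(hyp)) - correct
        ref_speech += len(ref)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    return errors / ref_speech


# Toy example: five frames, two speakers, one overlapped frame.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.2%}")
```

In the toy example, frame 2 is a confusion, frame 3 misses one overlapped speaker, and frame 5 is a false alarm, so three error units are divided by five units of reference speech.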

[434] Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion

Yu Zhang, Baotong Tian, Zhiyao Duan

Main category: eess.AS

TL;DR: Conan is a chunkwise online zero-shot voice conversion model that targets semantic fidelity under real-time constraints, natural-sounding output, and adaptation to unseen speakers.

Details

Motivation: Current VC models struggle with real-time constraints, semantic fidelity, and adapting to unseen speaker characteristics.

Method: Conan uses a Stream Content Extractor (Emformer), Adaptive Style Encoder, and Causal Shuffle Vocoder (HiFiGAN with pixel-shuffle).

Result: Conan outperforms baseline models in subjective and objective metrics.

Conclusion: Conan effectively addresses real-time VC challenges with improved performance.

Abstract: Zero-shot online voice conversion (VC) holds significant promise for real-time communications and entertainment. However, current VC models struggle to preserve semantic fidelity under real-time constraints, deliver natural-sounding conversions, and adapt effectively to unseen speaker characteristics. To address these challenges, we introduce Conan, a chunkwise online zero-shot voice conversion model that preserves the content of the source while matching the voice timbre and styles of reference speech. Conan comprises three core components: 1) a Stream Content Extractor that leverages Emformer for low-latency streaming content encoding; 2) an Adaptive Style Encoder that extracts fine-grained stylistic features from reference speech for enhanced style adaptation; 3) a Causal Shuffle Vocoder that implements a fully causal HiFiGAN using a pixel-shuffle mechanism. Experimental evaluations demonstrate that Conan outperforms baseline models in subjective and objective metrics. Audio samples can be found at https://aaronz345.github.io/ConanDemo.
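
The abstract mentions a pixel-shuffle mechanism in the vocoder without giving details. As a rough illustration of the general idea only, the sketch below shows a 1D pixel shuffle that trades channels for time resolution. The 1D layout, the function name `pixel_shuffle_1d`, and the toy values are assumptions, not the paper's implementation.

```python
def pixel_shuffle_1d(x, r):
    """Rearrange a [C*r][T] feature map into [C][T*r].

    Each group of r consecutive channels is interleaved along the time axis,
    so output samples t*r .. t*r + r - 1 depend only on input frame t.
    """
    if len(x) % r:
        raise ValueError("channel count must be divisible by r")
    n_out = len(x) // r
    n_frames = len(x[0])
    out = [[0] * (n_frames * r) for _ in range(n_out)]
    for c in range(n_out):
        for t in range(n_frames):
            for i in range(r):
                out[c][t * r + i] = x[c * r + i][t]
    return out


# Toy input: 4 channels, 3 frames, upsampled by a factor of 2.
features = [[1, 2, 3], [10, 20, 30], [4, 5, 6], [40, 50, 60]]
upsampled = pixel_shuffle_1d(features, r=2)
print(upsampled)
```

Because every output sample is built from a single input frame, this kind of rearrangement adds no look-ahead, which is one reason it can fit into a causal upsampling stack.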

eess.IV

[435] CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography

Murong Xu, Tamaz Amiranashvili, Fernando Navarro, Maksym Fritsak, Ibrahim Ethem Hamamci, Suprosanna Shit, Bastian Wittmann, Sezgin Er, Sebastian M. Christ, Ezequiel de la Rosa, Julian Deseoe, Robert Graf, Hendrik Möller, Anjany Sekuboyina, Jan C. Peeken, Sven Becker, Giulia Baldini, Johannes Haubold, Felix Nensa, René Hosch, Nikhil Mirajkar, Saad Khalid, Stefan Zachow, Marc-André Weber, Georg Langs, Jakob Wasserthal, Mehmet Kemal Ozdemir, Andrey Fedorov, Ron Kikinis, Stephanie Tanadini-Lang, Jan S. Kirschke, Stephanie E. Combs, Bjoern Menze

Main category: eess.IV

TL;DR: The paper introduces CADS, an open-source framework for whole-body CT segmentation, addressing data heterogeneity and anatomical coverage limitations with a large-scale dataset and standardized model.

Details

Motivation: Current AI segmentation models are fragmented and lack comprehensive training data for whole-body CT scans, limiting clinical utility.

Method: CADS integrates and standardizes heterogeneous data sources, using a dataset of 22,022 CT volumes with 167 annotated structures, and develops a segmentation model based on established architectures.

Result: The CADS-model outperforms state-of-the-art approaches in evaluations across 18 public datasets and a real-world hospital cohort, proving clinically useful.

Conclusion: CADS advances robust AI solutions in radiology by providing a scalable, open-source framework for comprehensive anatomical analysis.

Abstract: Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model’s performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.

[436] LesionGen: A Concept-Guided Diffusion Model for Dermatology Image Synthesis

Jamil Fayyad, Nourhan Bayasi, Ziyang Yu, Homayoun Najjaran

Main category: eess.IV

TL;DR: LesionGen, a T2I-DPM framework, generates realistic skin lesion images using structured dermatological captions, achieving classification accuracy comparable to real images.

Details

Motivation: Limited datasets for skin disease classification due to privacy, cost, and demographic gaps; T2I-DPMs are underexplored in dermatology.

Method: Fine-tuning a pretrained diffusion model on expert-annotated and pseudo-generated dermatological captions for image synthesis.

Result: Synthetic dataset-trained models match real-image accuracy, with improved worst-case subgroup performance.

Conclusion: LesionGen offers a viable solution for data scarcity in dermatology, enhancing model robustness.

Abstract: Deep learning models for skin disease classification require large, diverse, and well-annotated datasets. However, such resources are often limited due to privacy concerns, high annotation costs, and insufficient demographic representation. While text-to-image diffusion probabilistic models (T2I-DPMs) offer promise for medical data synthesis, their use in dermatology remains underexplored, largely due to the scarcity of rich textual descriptions in existing skin image datasets. In this work, we introduce LesionGen, a clinically informed T2I-DPM framework for dermatology image synthesis. Unlike prior methods that rely on simplistic disease labels, LesionGen is trained on structured, concept-rich dermatological captions derived from expert annotations and pseudo-generated, concept-guided reports. By fine-tuning a pretrained diffusion model on these high-quality image-caption pairs, we enable the generation of realistic and diverse skin lesion images conditioned on meaningful dermatological descriptions. Our results demonstrate that models trained solely on our synthetic dataset achieve classification accuracy comparable to those trained on real images, with notable gains in worst-case subgroup performance. Code and data are available here.

[437] Diffusion model for gradient preconditioning in hyperspectral imaging inverse problems

Jonathan Monsalve, Kumar Vijay Mishra

Main category: eess.IV

TL;DR: A novel framework uses denoising diffusion models to clean noisy gradients in hyperspectral imaging, improving optimization and reconstruction quality.

Details

Motivation: Hyperspectral imaging faces challenges in recovering high-dimensional data due to limited measurements, leading to noisy gradient estimates in optimization.

Method: The paper reinterprets gradient noise as a diffusion process and proposes a denoising diffusion model to reverse this noise in gradient space, preconditioning the optimization.

Result: The method shows significant improvements in accuracy and stability for hyperspectral recovery tasks compared to traditional approaches.

Conclusion: The framework successfully bridges generative modeling and inverse problem solving, enhancing convergence and reconstruction under aggressive sampling.

Abstract: Recovering high-dimensional statistical structure from limited measurements is a fundamental challenge in hyperspectral imaging, where capturing full-resolution data is often infeasible due to sensor, bandwidth, or acquisition constraints. A common workaround is to partition measurements and estimate local statistics, such as the covariance matrix, using only partial observations. However, this strategy introduces noise in the optimization gradients, especially when each partition contains few samples. In this work, we reinterpret this accumulation of gradient noise as a diffusion process, where successive partitions inject increasing uncertainty into the learning signal. Building on this insight, we propose a novel framework that leverages denoising diffusion models to learn a reverse process in gradient space. The model is trained to map noisy gradient estimates toward clean, well-conditioned updates, effectively preconditioning the optimization. Our approach bridges generative modeling and inverse problem solving, improving convergence and reconstruction quality under aggressive sampling regimes. We validate our method on hyperspectral recovery tasks, demonstrating significant gains in accuracy and stability over traditional optimization pipelines.

[438] MRpro - open PyTorch-based MR reconstruction and processing package

Felix Frederik Zimmermann, Patrick Schuenke, Christoph S. Aigner, Bill A. Bernhardt, Mara Guastini, Johannes Hammacher, Noah Jaitner, Andreas Kofler, Leonid Lunin, Stefan Martin, Catarina Redshaw Kranich, Jakob Schattenfroh, David Schote, Yanglei Wu, Christoph Kolbitsch

Main category: eess.IV

TL;DR: MRpro is an open-source PyTorch-based framework for MR image reconstruction, offering unified data structures, composable operators, and deep learning tools for reproducible research.

Details

Motivation: To provide a versatile, collaborative, and reproducible framework for MR image reconstruction, addressing the need for standardized tools in the field.

Method: MRpro integrates unified data structures, a library of operators and algorithms, and deep learning components, supported by automated quality control.

Result: Demonstrated effectiveness in various applications like Cartesian, radial, and spiral acquisitions, motion correction, and quantitative parameter estimation.

Conclusion: MRpro is an extensible, maintainable framework that supports collaborative development and advances MR imaging research.

Abstract: We introduce MRpro, an open-source image reconstruction package built upon PyTorch and open data formats. The framework comprises three main areas. First, it provides unified data structures for the consistent manipulation of MR datasets and their associated metadata (e.g., k-space trajectories). Second, it offers a library of composable operators, proximable functionals, and optimization algorithms, including a unified Fourier operator for all common trajectories and an extended phase graph simulation for quantitative MR. These components are used to create ready-to-use implementations of key reconstruction algorithms. Third, for deep learning, MRpro includes essential building blocks such as data consistency layers, differentiable optimization layers, and state-of-the-art backbone networks and integrates public datasets to facilitate reproducibility. MRpro is developed as a collaborative project supported by automated quality control. We demonstrate the versatility of MRpro across multiple applications, including Cartesian, radial, and spiral acquisitions; motion-corrected reconstruction; cardiac MR fingerprinting; learned spatially adaptive regularization weights; model-based learned image reconstruction and quantitative parameter estimation. MRpro offers an extensible framework for MR image reconstruction. With reproducibility and maintainability at its core, it facilitates collaborative development and provides a foundation for future MR imaging research.

[439] Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction

Yang Ren, Hai Jiang, Wei Li, Menglong Yang, Heng Zhang, Zehua Sheng, Qingsheng Ye, Shuaicheng Liu

Main category: eess.IV

TL;DR: A wavelet-based framework for RAW image downscaling outperforms existing methods by preserving structural and textural integrity, using a novel dataset and energy-maximization loss.

Details

Motivation: Existing sRGB-based downscaling methods cause blurred details and artifacts; RAW images lack specialized frameworks.

Method: Proposes a wavelet-based recurrent reconstruction framework with LASDM and HFPM modules, plus an energy-maximization loss.

Result: Outperforms state-of-the-art methods quantitatively and visually.

Conclusion: The framework and new dataset (Real-NIRD) advance arbitrary-scale RAW image downscaling.

Abstract: Image downscaling is critical for efficient storage and transmission of high-resolution (HR) images. Existing learning-based methods focus on performing downscaling within the sRGB domain, which typically suffers from blurred details and unexpected artifacts. RAW images, with their unprocessed photonic information, offer greater flexibility but lack specialized downscaling frameworks. In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless property of the wavelet transform to perform arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve the structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. Furthermore, we introduce the Realistic Non-Integer RAW Downscaling (Real-NIRD) dataset, featuring a non-integer downscaling factor of 1.3×, and combine it with publicly available datasets with integer factors (2×, 3×, 4×) for comprehensive benchmarking of arbitrary-scale image downscaling. Extensive experiments demonstrate that our method outperforms existing state-of-the-art competitors both quantitatively and visually. The code and dataset will be released at https://github.com/RenYangSCU/ASRD.
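
The abstract relies on the wavelet transform being information-lossless. A minimal sketch of that property, using one level of an unnormalized 1D Haar transform, is shown below. The choice of Haar, the function names, and the sample row are assumptions for illustration; the abstract does not say which wavelet the paper uses.

```python
def haar_forward(signal):
    """One level of a 1D Haar transform: pairwise averages and differences."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the original samples from the low- and high-frequency bands."""
    signal = []
    for a, d in zip(low, high):
        signal.extend([a + d, a - d])
    return signal


# Toy row of sensor values (hypothetical).
row = [12, 14, 200, 180, 7, 9, 64, 64]
low, high = haar_forward(row)
restored = haar_inverse(low, high)
print(low, high, restored == row)
```

The low band is a half-resolution version of the row, and the high band holds exactly the detail needed to undo the averaging, which is why splitting the work between a low-frequency module and a high-frequency module is a natural fit for downscaling.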

[440] EMORe: Motion-Robust 5D MRI Reconstruction via Expectation-Maximization-Guided Binning Correction and Outlier Rejection

Syed M. Arshad, Lee C. Potter, Yingmin Liu, Christopher Crabtree, Matthew S. Tong, Rizwan Ahmad

Main category: eess.IV

TL;DR: EMORe is an adaptive reconstruction method for 5D cardiac MRI that improves motion robustness by integrating inter-bin correction and outlier rejection within an EM framework, outperforming traditional methods in simulations and in vivo tests.

Details

Motivation: Traditional self-gating-based motion binning in 5D MRI often leads to residual motion artifacts due to inaccuracies in signal extraction and bulk motion, limiting clinical utility.

Method: EMORe uses an EM framework with adaptive inter-bin correction and explicit outlier rejection. The E-step refines probabilistic bin assignments, while the M-step improves image estimates.

Result: Simulated and in vivo tests showed EMORe outperforms compressed sensing in metrics like PSNR, SSIM, edge sharpness, and bin assignment accuracy, especially in motion-heavy scenarios.

Conclusion: EMORe enhances clinical applicability of 5D cardiac MRI by robustly handling motion artifacts, despite a slight computational cost increase.

Abstract: We propose EMORe, an adaptive reconstruction method designed to enhance motion robustness in free-running, free-breathing self-gated 5D cardiac magnetic resonance imaging (MRI). Traditional self-gating-based motion binning for 5D MRI often results in residual motion artifacts due to inaccuracies in cardiac and respiratory signal extraction and sporadic bulk motion, compromising clinical utility. EMORe addresses these issues by integrating adaptive inter-bin correction and explicit outlier rejection within an expectation-maximization (EM) framework, whereby the E-step and M-step are executed alternately until convergence. In the E-step, probabilistic (soft) bin assignments are refined by correcting misassignment of valid data and rejecting motion-corrupted data to a dedicated outlier bin. In the M-step, the image estimate is improved using the refined soft bin assignments. Validation in a simulated 5D MRXCAT phantom demonstrated EMORe’s superior performance compared to standard compressed sensing reconstruction, showing significant improvements in peak signal-to-noise ratio, structural similarity index, edge sharpness, and bin assignment accuracy across varying levels of simulated bulk motion. In vivo validation in 13 volunteers further confirmed EMORe’s robustness, significantly enhancing blood-myocardium edge sharpness and reducing motion artifacts compared to compressed sensing, particularly in scenarios with controlled coughing-induced motion. Although EMORe incurs a modest increase in computational complexity, its adaptability and robust handling of bulk motion artifacts significantly enhance the clinical applicability and diagnostic confidence of 5D cardiac MRI.
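
The alternation between soft bin assignment with an outlier bin (E-step) and re-estimation (M-step) can be illustrated on a toy problem. The sketch below is not EMORe: it clusters scalar samples, the M-step is a weighted mean rather than an image reconstruction, and the Gaussian likelihood, the flat outlier density, and all numeric values are assumptions.

```python
import math


def e_step(samples, centers, sigma, outlier_density):
    """Soft assignment of each sample to every bin plus one trailing outlier bin."""
    norm = sigma * math.sqrt(2 * math.pi)
    resp = []
    for x in samples:
        like = [math.exp(-0.5 * ((x - c) / sigma) ** 2) / norm for c in centers]
        like.append(outlier_density)  # flat likelihood for the outlier bin
        total = sum(like)
        resp.append([v / total for v in like])
    return resp


def m_step(samples, resp, n_bins):
    """Re-estimate each bin center as the responsibility-weighted sample mean."""
    centers = []
    for k in range(n_bins):
        weight = sum(r[k] for r in resp)
        centers.append(sum(r[k] * x for r, x in zip(resp, samples)) / weight)
    return centers


# Two tight groups of samples and one corrupted value (9.0).
samples = [0.9, 1.1, 1.0, 2.9, 3.1, 3.0, 9.0]
centers = [0.5, 3.5]  # deliberately poor initial bin centers
for _ in range(20):
    resp = e_step(samples, centers, sigma=0.3, outlier_density=0.01)
    centers = m_step(samples, resp, n_bins=2)

print([round(c, 3) for c in centers])
print([round(p, 3) for p in resp[-1]])  # last entry is the outlier responsibility
```

The corrupted sample ends up almost entirely in the outlier bin, so it barely influences either center. That is the effect the abstract describes for motion-corrupted data.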

[441] EMedNeXt: An Enhanced Brain Tumor Segmentation Framework for Sub-Saharan Africa using MedNeXt V2 with Deep Supervision

Ahmed Jaheen, Abdelrahman Elsayed, Damir Kim, Daniil Tikhonov, Matheus Scatolin, Mohor Banerjee, Qiankun Ji, Mostafa Salem, Hu Wang, Sarim Hashmi, Mohammad Yaqub

Main category: eess.IV

TL;DR: The paper introduces EMedNeXt, an enhanced brain tumor segmentation framework for low-resource settings in sub-Saharan Africa, addressing challenges like poor MRI quality and limited expertise.

Details

Motivation: Manual MRI segmentation for gliomas is time-consuming and unreliable in under-resourced regions, necessitating automated solutions.

Method: EMedNeXt builds on MedNeXt V2 with deep supervision, optimized post-processing, a larger ROI, improved nnU-Net v2 architecture, and robust ensembling.

Result: Achieved high accuracy (LesionWise DSC 0.897, NSD 0.541/0.84 at 0.5mm/1.0mm tolerance) on validation data.

Conclusion: EMedNeXt offers a promising solution for robust tumor segmentation in resource-limited settings like SSA.

Abstract: Brain cancer affects millions worldwide, and in nearly every clinical setting, doctors rely on magnetic resonance imaging (MRI) to diagnose and monitor gliomas. However, the current standard for tumor quantification through manual segmentation of multi-parametric MRI is time-consuming, requires expert radiologists, and is often infeasible in under-resourced healthcare systems. This problem is especially pronounced in low-income regions, where MRI scanners are of lower quality and radiology expertise is scarce, leading to incorrect segmentation and quantification. In addition, the number of acquired MRI scans in Africa is typically small. To address these challenges, the BraTS-Lighthouse 2025 Challenge focuses on robust tumor segmentation in sub-Saharan Africa (SSA), where resource constraints and image quality degradation introduce significant shifts. In this study, we present EMedNeXt – an enhanced brain tumor segmentation framework based on MedNeXt V2 with deep supervision and optimized post-processing pipelines tailored for SSA. EMedNeXt introduces three key contributions: a larger region of interest, an improved nnU-Net v2-based architectural skeleton, and a robust model ensembling system. Evaluated on the hidden validation set, our solution achieved an average LesionWise DSC of 0.897 with an average LesionWise NSD of 0.541 and 0.84 at a tolerance of 0.5 mm and 1.0 mm, respectively.
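
The scores above are lesion-wise variants of the Dice similarity coefficient (DSC). As a reference point, the sketch below computes the plain voxel-wise Dice score between two binary masks. It does not reproduce the lesion-wise matching used in the challenge, and the flattened toy masks are assumptions.

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: three predicted voxels, three true voxels, two in common.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 1, 1, 0, 1, 0]
score = dice_score(pred, target)
print(round(score, 4))
```

With two overlapping voxels out of three in each mask, the score is 2·2 / (3 + 3) = 2/3.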

[442] Pixel Embedding Method for Tubular Neurite Segmentation

Huayu Fu, Jiamin Li, Haozhi Qu, Xiaolin Hu, Zengcai Guo

Main category: eess.IV

TL;DR: Proposes an improved framework for automatic neuronal topology segmentation using deep learning, addressing challenges like occlusions and introducing a novel evaluation metric.

Details

Motivation: The intricate morphology and occlusions in neuronal branches make deep learning-based segmentation challenging, necessitating a more effective solution.

Method: Introduces a deep network with pixel-level embedding vectors and a tailored loss function, followed by an end-to-end pipeline mapping images to SWC-formatted trees. A novel topological metric is also proposed for evaluation.

Result: Significantly reduces error rates in neuronal topology reconstruction compared to classical methods, as demonstrated on an fMOST imaging dataset.

Conclusion: The proposed framework effectively addresses segmentation challenges and improves accuracy, with the new metric providing better evaluation of segmentation quality.

Abstract: Automatic segmentation of neuronal topology is critical for handling large-scale neuroimaging data, as it can greatly accelerate neuron annotation and analysis. However, the intricate morphology of neuronal branches and the occlusions among fibers pose significant challenges for deep-learning-based segmentation. To address these issues, we propose an improved framework. First, we introduce a deep network that outputs pixel-level embedding vectors and design a corresponding loss function, enabling the learned features to effectively distinguish different neuronal connections within occluded regions. Second, building on this model, we develop an end-to-end pipeline that directly maps raw neuronal images to SWC-formatted neuron structure trees. Finally, recognizing that existing evaluation metrics fail to fully capture segmentation accuracy, we propose a novel topological assessment metric to more appropriately quantify the quality of neuron segmentation and reconstruction. Experiments on our fMOST imaging dataset demonstrate that, compared to several classical methods, our approach significantly reduces the error rate in neuronal topology reconstruction.

[443] Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation

Oliver Bause, Julia Werner, Paul Palomero Bernardo, Oliver Bringmann

Main category: eess.IV

TL;DR: The paper proposes an energy-efficient AI-based image classification method for low-power sensor edge devices, achieving 93.06% accuracy on Bayer images with minimal energy usage.

Details

Motivation: Deep neural networks are often unsuitable for low-power edge devices due to large model sizes and high computational demands. The need for hardware-efficient AI is highlighted, particularly in medical applications like Video Capsule Endoscopy, where battery life is critical.

Method: The approach involves a compact CNN (63,000 parameters) and Viterbi decoding for time-series analysis, applied directly to Bayer images to avoid RGB conversion. A customized PULPissimo System-on-Chip with a RISC-V core and hardware accelerator is used for energy-efficient processing.

Result: Achieves 93.06% accuracy for organ classification at just 5.31 µJ per image, saving an average of 89.9% of energy before the capsule enters the small intestine compared to classic video capsules.

Conclusion: The method successfully addresses the limitations of resource-constrained devices, offering a viable solution for energy-efficient AI in medical and edge applications.

Abstract: For many real-world applications involving low-power sensor edge devices, deep neural networks used for image classification might not be suitable. This is due to their typically large model size and requirement of operations often exceeding the capabilities of such resource-limited devices. Furthermore, camera sensors usually capture images with a Bayer color filter applied, which are subsequently converted to RGB images that are commonly used for neural network training. However, on resource-constrained devices, such conversions demand their share of energy and optimally should be skipped if possible. This work addresses the need for hardware-suitable AI targeting sensor edge devices by means of the Video Capsule Endoscopy, an important medical procedure for the investigation of the small intestine, which is strongly limited by its battery lifetime. Accurate organ classification is performed with a final accuracy of 93.06% evaluated directly on Bayer images involving a CNN with only 63,000 parameters and time-series analysis in the form of Viterbi decoding. Finally, the process of capturing images with a camera and raw image processing is demonstrated with a customized PULPissimo System-on-Chip with a RISC-V core and an ultra-low power hardware accelerator providing an energy-efficient AI-based image classification approach requiring just 5.31 µJ per image. As a result, it is possible to save an average of 89.9% of energy before entering the small intestine compared to classic video capsules.
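
To show how Viterbi decoding can smooth per-frame classifier outputs over time, here is a minimal sketch. The three organ states, the forward-only transition probabilities, and the per-frame scores are all hypothetical; the abstract does not give the paper's actual model parameters.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(log_emissions, log_trans, log_prior):
    """Most likely state sequence given per-frame log scores.

    log_emissions[t][s]: log score of state s at frame t (e.g. CNN output).
    log_trans[i][j]: log probability of moving from state i to state j.
    log_prior[s]: log probability of starting in state s.
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_emissions[1:]:
        new_score, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    # Trace the best path back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical states: 0 = stomach, 1 = small intestine, 2 = colon.
# The capsule only moves forward, so backward transitions get probability 0.
trans = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
prior = [1.0, 0.0, 0.0]
frame_probs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.4, 0.1],  # noisy frame that looks like the stomach again
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]

framewise = [max(range(3), key=lambda s: p[s]) for p in frame_probs]
path = viterbi(
    [[safe_log(v) for v in p] for p in frame_probs],
    [[safe_log(v) for v in row] for row in trans],
    [safe_log(v) for v in prior],
)
print("frame-wise argmax:", framewise)
print("Viterbi path:     ", path)
```

The frame-wise argmax jumps back to the stomach on the noisy fourth frame, while the decoded path stays in the small intestine because the transition model forbids moving backward.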

[444] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

Zheyuan Zhang, Linkai Peng, Wanying Dou, Cuiling Sun, Halil Ertugrul Aktas, Andrea M. Bejar, Elif Keles, Gorkem Durak, Ulas Bagci

Main category: eess.IV

TL;DR: PancreasDG introduces a large-scale multi-center 3D MRI dataset for pancreas segmentation, addressing cross-center and cross-sequence domain shifts in medical imaging. It reveals key insights and proposes a semi-supervised method outperforming existing techniques.

Details

Motivation: Existing benchmarks overlook variability in MR sequences, and pancreas segmentation is challenging due to low contrast and under-representation in datasets, despite its clinical importance.

Method: The dataset includes 563 MRI scans from six institutions with pixel-accurate masks. A semi-supervised approach leverages anatomical invariances for domain generalization.

Result: The proposed method outperforms state-of-the-art domain generalization techniques, reporting 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation.

Conclusion: PancreasDG sets a new benchmark for domain generalization in medical imaging, with insights on sampling variance and cross-sequence shifts.

Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.

[445] JPEG Processing Neural Operator for Backward-Compatible Coding

Woo Kyoung Han, Yongjun Lee, Byeonghun Lee, Sang Hyun Park, Sunghoon Im, Kyong Hwan Jin

Main category: eess.IV

TL;DR: JPNeO is a backward-compatible JPEG algorithm using neural operators to improve chroma preservation and reconstruction fidelity, with reduced memory and parameter usage.

Details

Motivation: Standardizing learning-based lossy compression while maintaining backward compatibility with JPEG is challenging.

Method: Incorporates neural operators in encoding and decoding stages to enhance performance.

Result: Improves chroma preservation, reconstruction fidelity, and reduces memory/parameter usage.

Conclusion: JPNeO offers a high-performance, out-of-the-box image compression solution without altering source coding protocols.

Abstract: Despite significant advances in learning-based lossy compression algorithms, standardizing codecs remains a critical challenge. In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. Our JPNeO improves chroma component preservation and enhances reconstruction fidelity compared to existing artifact removal methods by incorporating neural operators in both the encoding and decoding stages. JPNeO achieves practical benefits in terms of reduced memory usage and parameter count. We further validate our hypothesis about the existence of a space with high mutual information through empirical evidence. In summary, the JPNeO functions as a high-performance out-of-the-box image compression pipeline without changing source coding’s protocol. Our source code is available at https://github.com/WooKyoungHan/JPNeO.

[446] Towards Field-Ready AI-based Malaria Diagnosis: A Continual Learning Approach

Louise Guillon, Soheib Biga, Yendoube E. Kantchire, Mouhamadou Lamine Sane, Grégoire Pasquier, Kossi Yakpa, Stéphane E. Sossou, Marc Thellier, Laurent Bonnardot, Laurence Lachaud, Renaud Piarroux, Ameyo M. Dorkenoo

Main category: eess.IV

TL;DR: The paper explores continual learning (CL) to improve malaria CAD models’ robustness against domain shifts, using a YOLO-based detector and evaluating four CL strategies on multi-site data.

Details

Motivation: Malaria diagnosis in low-resource settings lacks expert microscopy, and existing CAD systems struggle with generalization across varying conditions.

Method: The study frames the problem as domain-incremental learning, testing four CL strategies (two rehearsal-based, two regularization-based) on a multi-site dataset of thin blood smear images.

Result: CL, especially rehearsal-based methods, significantly enhances model performance across domains.

Conclusion: Continual learning shows promise for developing deployable, field-ready malaria CAD tools.

Abstract: Malaria remains a major global health challenge, particularly in low-resource settings where access to expert microscopy may be limited. Deep learning-based computer-aided diagnosis (CAD) systems have been developed and demonstrate promising performance on thin blood smear images. However, their clinical deployment may be hindered by limited generalization across sites with varying conditions, and very few practical solutions have been proposed. In this work, we investigate continual learning (CL) as a strategy to enhance the robustness of malaria CAD models to domain shifts. We frame the problem as a domain-incremental learning scenario, where a YOLO-based object detector must adapt to new acquisition sites while retaining performance on previously seen domains. We evaluate four CL strategies, two rehearsal-based and two regularization-based, under real-life conditions using a multi-site clinical dataset of thin blood smear images. Our results suggest that CL, and rehearsal-based methods in particular, can significantly improve performance. These findings highlight the potential of continual learning to support the development of deployable, field-ready CAD tools for malaria.
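
As a rough illustration of what "rehearsal-based" means in a domain-incremental setting, the sketch below keeps a small fixed-size memory of samples from earlier sites and mixes them into each batch from a new site. The abstract does not say which rehearsal methods were evaluated, so the reservoir-sampling buffer and the names `ReplayBuffer` and `mixed_batch` are assumptions for illustration, not the authors' implementation.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling.

    Hypothetical helper: every sample seen so far has the same chance of
    being kept, regardless of which site it came from.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            # Replace a stored item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        # Draw up to k stored samples without replacement.
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_batch, buffer, replay_size):
    """Combine samples from the current site with replayed older samples."""
    return list(new_batch) + buffer.sample(replay_size)


# Usage: store samples from a first site, then train on a second site
# with two replayed samples added to every batch.
buffer = ReplayBuffer(capacity=5)
for i in range(100):
    buffer.add(("site_A", i))
batch = mixed_batch([("site_B", i) for i in range(4)], buffer, replay_size=2)
```

Training on such mixed batches is one common way to limit forgetting of earlier domains. Regularization-based methods instead penalize changes to weights that mattered for those domains.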

[447] Topology Optimization in Medical Image Segmentation with Fast Euler Characteristic

Liu Li, Qiang Ma, Cheng Ouyang, Johannes C. Paetzold, Daniel Rueckert, Bernhard Kainz

Main category: eess.IV

TL;DR: A novel topology-aware segmentation method using the Euler Characteristic (χ) improves topological correctness in medical images while maintaining pixel-wise accuracy.

Details

Motivation: Conventional metrics like Dice score often fail to ensure clinically acceptable topological accuracy (e.g., continuous boundaries). Existing topology-aware methods using persistent homology are computationally expensive.

Method: Proposes a fast χ computation for 2D/3D data, uses χ error as a metric, identifies topological violations via a map, and refines segmentation with a correction network.

Result: Experiments on 2D/3D datasets show significant improvement in topological correctness without sacrificing pixel-wise accuracy.

Conclusion: The method efficiently addresses topological constraints in segmentation, offering a practical alternative to persistent homology-based approaches.

Abstract: Deep learning-based medical image segmentation techniques have shown promising results when evaluated based on conventional metrics such as the Dice score or Intersection-over-Union. However, these fully automatic methods often fail to meet clinically acceptable accuracy, especially when topological constraints should be observed, e.g., continuous boundaries or closed surfaces. In medical image segmentation, the correctness of a segmentation in terms of the required topological genus is sometimes even more important than the pixel-wise accuracy. Existing topology-aware approaches commonly estimate and constrain the topological structure via the concept of persistent homology (PH). However, these methods are difficult to implement for high dimensional data due to their polynomial computational complexity. To overcome this problem, we propose a novel and fast approach for topology-aware segmentation based on the Euler Characteristic ($\chi$). First, we propose a fast formulation for $\chi$ computation in both 2D and 3D. The scalar $\chi$ error between the prediction and ground-truth serves as the topological evaluation metric. Then we estimate the spatial topology correctness of any segmentation network via a so-called topological violation map, i.e., a detailed map that highlights regions with $\chi$ errors. Finally, the segmentation results from an arbitrary network are refined based on the topological violation maps by a topology-aware correction network. Our experiments are conducted on both 2D and 3D datasets and show that our method can significantly improve topological correctness while preserving pixel-wise segmentation accuracy.
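
To make the $\chi$ error concrete, here is a minimal sketch that computes the Euler characteristic of a 2D binary mask with the textbook cubical-complex count $\chi = V - E + F$, treating each foreground pixel as a closed unit square (which corresponds to 8-connectivity). This is not the fast formulation proposed in the paper, whose details are not given in the abstract; the function names `euler_characteristic_2d` and `chi_error` are assumptions for illustration.

```python
import numpy as np


def euler_characteristic_2d(mask):
    """Euler characteristic of a 2D binary mask via chi = V - E + F.

    Each foreground pixel is a closed unit square, so pixels touching
    only at a corner count as connected (8-connectivity).
    """
    m = np.asarray(mask, dtype=bool)
    p = np.pad(m, 1)  # one-pixel background border
    # Faces: the foreground pixels themselves.
    faces = int(m.sum())
    # Vertices: a grid corner exists if any of its 4 adjacent pixels is set.
    vertices = int((p[:-1, :-1] | p[:-1, 1:] | p[1:, :-1] | p[1:, 1:]).sum())
    # Horizontal edges lie between vertically adjacent pixels.
    edges_h = int((p[:-1, 1:-1] | p[1:, 1:-1]).sum())
    # Vertical edges lie between horizontally adjacent pixels.
    edges_v = int((p[1:-1, :-1] | p[1:-1, 1:]).sum())
    return vertices - (edges_h + edges_v) + faces


def chi_error(pred, target):
    """Absolute difference in Euler characteristic between two masks."""
    return abs(euler_characteristic_2d(pred) - euler_characteristic_2d(target))


# A closed ring has one component and one hole, so chi = 1 - 1 = 0.
ring = np.array([[1, 1, 1],
                 [1, 0, 1],
                 [1, 1, 1]])
# A broken ring has one component and no hole, so chi = 1.
broken = np.array([[1, 1, 1],
                   [1, 0, 0],
                   [1, 1, 1]])
```

In 2D, $\chi$ equals the number of connected components minus the number of holes. A prediction that breaks a closed boundary therefore changes $\chi$ even when only a single pixel differs, which is the kind of error a Dice score barely registers.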

[448] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery

Philip Wootaek Shin, Vishal Gaur, Rahul Ramachandran, Manil Maskey, Jack Sampson, Vijaykrishnan Narayanan, Sujit Roy

Main category: eess.IV

TL;DR: A framework for aligning and harmonizing 30m HLS imagery with 10m HLS reference data to improve super-resolution for heterogeneous satellite sensors.

Details

Motivation: Address challenges in data fusion due to differing spatial resolutions and spectral/temporal characteristics of satellite sensors.

Method: Develop a preliminary framework using HLS10 as reference to align and harmonize HLS30 imagery.

Result: Quantitative and qualitative evaluations show improved super-resolved Landsat imagery.

Conclusion: The study demonstrates feasibility of heterogeneous satellite image super-resolution and suggests future advancements.

Abstract: High-resolution satellite imagery is essential for geospatial analysis, yet differences in spatial resolution across satellite sensors present challenges for data fusion and downstream applications. Super-resolution techniques can help bridge this gap, but existing methods rely on artificially downscaled images rather than real sensor data and are not well suited for heterogeneous satellite sensors with differing spectral and temporal characteristics. In this work, we develop a preliminary framework to align and harmonize Harmonized Landsat Sentinel 30m (HLS30) imagery using Harmonized Landsat Sentinel 10m (HLS10) as a reference from the HLS dataset. Our approach aims to bridge the resolution gap between these sensors and improve the quality of super-resolved Landsat imagery. Quantitative and qualitative evaluations demonstrate the effectiveness of our method, showing its potential for enhancing satellite-based sensing applications. This study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future advancements in the field.

[449] Exploiting Scale-Variant Attention for Segmenting Small Medical Objects

Wei Dai, Rui Liu, Zixuan Wu, Tianyi Wu, Min Wang, Junxian Zhou, Yixuan Yuan, Jun Liu

Main category: eess.IV

TL;DR: The paper proposes SvANet, a scale-variant attention-based network, to improve segmentation of small medical objects by addressing CNN limitations like information loss and compression artifacts.

Details

Motivation: Early detection of small pathological regions is crucial for disease diagnosis, but existing CNN-based methods struggle with small-scale object segmentation due to information loss.

Method: SvANet integrates scale-variant attention, cross-scale guidance, Monte Carlo attention, and vision transformers to enhance small object segmentation.

Result: SvANet achieves high Dice coefficients (72.58%-96.12%) across seven datasets for small medical objects occupying <1% of image areas.

Conclusion: SvANet effectively addresses challenges in small-scale medical object segmentation, demonstrating superior performance over existing methods.

Abstract: Early detection and accurate diagnosis can predict the risk of malignant disease transformation, thereby increasing the probability of effective treatment. Identifying mild syndromes with small pathological regions serves as an ominous warning and is fundamental in the early diagnosis of diseases. While deep learning algorithms, particularly convolutional neural networks (CNNs), have shown promise in segmenting medical objects, analyzing small areas in medical images remains challenging. This difficulty arises due to information losses and compression defects from convolution and pooling operations in CNNs, which become more pronounced as the network deepens, especially for small medical objects. To address these challenges, we propose a novel scale-variant attention-based network (SvANet) for accurately segmenting small-scale objects in medical images. The SvANet consists of scale-variant attention, cross-scale guidance, Monte Carlo attention, and a vision transformer, which together incorporate cross-scale features and alleviate compression artifacts to enhance the discrimination of small medical objects. Quantitative experimental results demonstrate the superior performance of SvANet, achieving 96.12%, 96.11%, 89.79%, 84.15%, 80.25%, 73.05%, and 72.58% in mean Dice coefficient for segmenting kidney tumors, skin lesions, hepatic tumors, polyps, surgical excision cells, retinal vasculatures, and sperm, which occupy less than 1% of the image areas in KiTS23, ISIC 2018, ATLAS, PolypGen, TissueNet, FIVES, and SpermHealth datasets, respectively.
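
For readers unfamiliar with the metric reported above, the following sketch computes the Dice coefficient $2|A \cap B| / (|A| + |B|)$ for two binary masks and shows how strongly it reacts to small localization errors when the object covers well under 1% of the image. The function name `dice_coefficient` and the toy masks are assumptions for illustration and are not taken from the SvANet code.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = int(pred.sum()) + int(target.sum())
    if denom == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0
    intersection = int(np.logical_and(pred, target).sum())
    return 2.0 * intersection / denom


# A 3x3 object in a 100x100 image covers 0.09% of the area.
target = np.zeros((100, 100), dtype=bool)
target[50:53, 50:53] = True

# The same object predicted one pixel too far to the right.
pred = np.zeros((100, 100), dtype=bool)
pred[50:53, 51:54] = True

score = dice_coefficient(pred, target)  # 2 * 6 / (9 + 9)
```

A one-pixel shift already drops the score to about 0.67 here, which helps explain why Dice values on such small objects are hard to push high.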

Vishwesh Ramanathan, Tony Xu, Pushpak Pati, Faruk Ahmed, Maged Goubran, Anne L. Martel

Main category: eess.IV

TL;DR: ModalTune is a fine-tuning framework for digital pathology that integrates new modalities without altering SLFM weights, using LLMs for label encoding, achieving SOTA results in multi-task and multi-modal settings.

Details

Motivation: Current methods in digital pathology under-utilize shared information between tasks and modalities, limiting performance in low-data regimes.

Method: Proposes ModalTune with Modal Adapter for modality integration and LLMs for label encoding, enabling multi-task and multi-modal training.

Result: Achieves SOTA results in survival and subtype prediction across four cancer types, generalizes to OOD datasets.

Conclusion: ModalTune is the first unified framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.

Abstract: Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, current methods under-utilize shared information between tasks and modalities. To overcome this challenge, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large-language models (LLMs) to encode labels as text, capturing semantic relationships across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology.

Table of Contents

cs.CL

[1] Large Language Models in the Travel Domain: An Industrial Experience

[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

[3] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

[6] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

[7] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

[8] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

[9] Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

[10] Reading Between the Timelines: RAG for Answering Diachronic Questions

[11] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

[12] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

[13] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

[14] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

[15] Predicting stock prices with ChatGPT-annotated Reddit sentiment

[16] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

[17] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

[18] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

[19] Multi-Relation Extraction in Entity Pairs using Global Context

[20] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

[21] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

[22] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

[23] Enhancing RAG Efficiency with Adaptive Context Compression

[24] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

[25] Augmented Vision-Language Models: A Systematic Review

[26] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

[27] OAEI-LLM-T: A TBox Benchmark Dataset for Understanding Large Language Model Hallucinations in Ontology Matching

[28] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

[29] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

[30] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

[31] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

[32] PARROT: An Open Multilingual Radiology Reports Dataset

[33] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

[34] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

[35] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

[36] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

[37] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

[38] Math Natural Language Inference: this should be easy!

[39] Exploring In-Context Learning for Frame-Semantic Parsing

[40] Context-aware Rotary Position Embedding

[41] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

[42] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

[43] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

[44] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

[45] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

[46] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

[47] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

[48] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

[49] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

[50] Enabling Few-Shot Alzheimer’s Disease Diagnosis on Tabular Biomarker Data with LLMs

[51] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

[52] Evaluating LLMs’ Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

[53] Text-to-SQL Task-oriented Dialogue Ontology Construction

[54] Unveiling Super Experts in Mixture-of-Experts Large Language Models

[55] What’s Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

[56] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

[57] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

[58] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

[59] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

[60] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

[61] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

[62] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

[63] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

[64] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

[65] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

[66] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

[67] Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

[68] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

[69] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

[70] LiMe: a Latin Corpus of Late Medieval Criminal Sentences

[71] Explaining vague language

[72] Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability

[73] Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

[74] Neutral Residues: Revisiting Adapters for Model Extension

[75] Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

[76] Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette

[77] Inside-Out: Hidden Factual Knowledge in LLMs