Daily arXiv Papers - 2025-07-31

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian

Vanessa Rebecca Wiyono, David Anugraha, Ayu Purwarianti, Genta Indra Winata

Main category: cs.CL

TL;DR: Introducing IndoPref, a human-authored Indonesian preference dataset to evaluate LLM text quality, addressing underrepresentation in multilingual research.

Motivation: Indonesian is underrepresented in LLM preference research, with existing datasets often lacking cultural authenticity due to English translations.

Method: Created IndoPref, a fully human-authored, multi-domain Indonesian dataset, evaluated using Krippendorff’s alpha for annotator agreement.

Result: Strong inter-annotator agreement demonstrated; benchmarked multiple LLMs for output quality.

Conclusion: IndoPref fills a critical gap in Indonesian LLM research, providing authentic, high-quality data for evaluation.

Abstract: Over 200 million people speak Indonesian, yet the language remains significantly underrepresented in preference-based research for large language models (LLMs). Most existing multilingual datasets are derived from English translations, often resulting in content that lacks cultural and linguistic authenticity. To address this gap, we introduce IndoPref, the first fully human-authored and multi-domain Indonesian preference dataset specifically designed to evaluate the naturalness and quality of LLM-generated text. All annotations are natively written in Indonesian and evaluated using Krippendorff’s alpha, demonstrating strong inter-annotator agreement. Additionally, we benchmark the dataset across multiple LLMs and assess the output quality of each model.
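
Krippendorff's alpha is the agreement statistic the authors report; as a quick illustration (not the paper's code), the PyPI `krippendorff` package computes it from a raters-by-items matrix with NaN marking missing judgments. The annotation matrix below is made up, not from IndoPref.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = preference items (0/1 = preferred response);
# np.nan = missing judgment.
ratings = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")  # values near 1 = strong agreement
```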

[2] Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

Main category: cs.CL

TL;DR: The paper explores how diverse writing styles in evaluation prompts affect LLM performance, finding significant impacts and offering a scalable method to improve benchmark validity.

Motivation: Current benchmarks lack writing style diversity, potentially leading to brittle LLM performance with non-standard inputs.

Method: Rewriting evaluation prompts using persona-based LLM prompting to emulate diverse writing styles.

Result: Writing style variations significantly impact LLM performance, with certain styles consistently triggering high or low performance across models.

Conclusion: The study provides a scalable way to enhance benchmarks, improving their validity for assessing LLMs across linguistic variations.

Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with “non-standard” input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.
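
A minimal sketch of the rewriting step, assuming an OpenAI-style chat API as the rewriting model; the prompt wording and example persona are our own illustration, not the paper's.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rewrite_with_persona(question: str, persona: str) -> str:
    """Rewrite an evaluation prompt in a persona's writing style,
    keeping the semantic content fixed."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (f"Rewrite the question below in the voice of {persona}, "
                        f"preserving its exact meaning.\n\nQuestion: {question}"),
        }],
    )
    return resp.choices[0].message.content

variant = rewrite_with_persona(
    "What is the boiling point of water at sea level?",
    "a teenager texting a friend",
)
```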

[3] A Scalable Pipeline for Estimating Verb Frame Frequencies Using Large Language Models

Adam M. Morgan, Adeen Flinker

Main category: cs.CL

TL;DR: An automated pipeline using LLMs estimates Verb Frame Frequencies (VFFs) more efficiently and accurately than existing tools, producing a scalable VFF database.

Motivation: Existing tools for calculating VFFs are limited in scale, accuracy, or accessibility, hindering research in syntax for human and machine language systems.

Method: Utilizes LLMs to generate and analyze a corpus of sentences with 476 English verbs, mimicking expert linguist behavior for syntactic parsing.

Result: Outperforms two widely used syntactic parsers, requires fewer resources than manual parsing, and produces a detailed VFF database.

Conclusion: The pipeline enables rapid, scalable VFF estimation and is customizable for future research, with all code and data released.

Abstract: We present an automated pipeline for estimating Verb Frame Frequencies (VFFs), the frequency with which a verb appears in particular syntactic frames. VFFs provide a powerful window into syntax in both human and machine language systems, but existing tools for calculating them are limited in scale, accuracy, or accessibility. We use large language models (LLMs) to generate a corpus of sentences containing 476 English verbs. Next, by instructing an LLM to behave like an expert linguist, we had it analyze the syntactic structure of the sentences in this corpus. This pipeline outperforms two widely used syntactic parsers across multiple evaluation datasets. Furthermore, it requires far fewer resources than manual parsing (the gold-standard), thereby enabling rapid, scalable VFF estimation. Using the LLM parser, we produce a new VFF database with broader verb coverage, finer-grained syntactic distinctions, and explicit estimates of the relative frequencies of structural alternates commonly studied in psycholinguistics. The pipeline is easily customizable and extensible to new verbs, syntactic frames, and even other languages. We present this work as a proof of concept for automated frame frequency estimation, and release all code and data to support future research.
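
Once the LLM parser has labeled each sentence with a syntactic frame, the frequency estimation itself reduces to per-verb relative frequencies; here is a minimal sketch of that aggregation step (the frame labels and counts are invented).

```python
from collections import Counter, defaultdict

# (verb, frame) labels as an LLM parser might emit them; data is made up.
parses = [("give", "NP-NP"), ("give", "NP-PP"), ("give", "NP-NP"),
          ("donate", "NP-PP")]

counts: dict[str, Counter] = defaultdict(Counter)
for verb, frame in parses:
    counts[verb][frame] += 1

# VFF = relative frequency of each frame, per verb.
vff = {verb: {frame: n / sum(c.values()) for frame, n in c.items()}
       for verb, c in counts.items()}
print(vff["give"])  # {'NP-NP': 0.67, 'NP-PP': 0.33} (approximately)
```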

[4] Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors

Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang

Main category: cs.CL

TL;DR: The paper proposes a novel framework, ‘Traits Run Deep,’ for accurate personality assessment using psychology-informed prompts and a Text-Centric Trait Fusion Network to align cross-modal signals, achieving a 45% reduction in MSE and ranking first in the AVI Challenge 2025.

Motivation: Personality assessment is crucial for emotional intelligence, mental health, and education, but traditional methods struggle with cross-modal understanding and semantic modeling.

Method: The framework uses psychology-informed prompts for LLMs to extract personality-aware semantics and a fusion network (Chunk-Wise Projector, Cross-Modal Connector, Text Feature Enhancer) to align asynchronous signals.

Result: Achieved a 45% MSE reduction on the AVI validation set and ranked first in the AVI Challenge 2025.

Conclusion: The proposed framework effectively improves personality assessment accuracy by leveraging cross-modal fusion and psychology-informed prompts.

Abstract: Accurate and reliable personality assessment plays a vital role in many fields, such as emotional intelligence, mental health diagnostics, and personalized education. Unlike fleeting emotions, personality traits are stable, often subconsciously leaked through language, facial expressions, and body behaviors, with asynchronous patterns across modalities. Traditional superficial features struggle to model personality semantics and make effective cross-modal understanding all but impossible. To address these challenges, we propose a novel personality assessment framework called Traits Run Deep. It employs psychology-informed prompts to elicit high-level personality-relevant semantic representations. In addition, it devises a Text-Centric Trait Fusion Network that anchors rich text semantics to align and integrate asynchronous signals from other modalities. Specifically, the fusion module includes a Chunk-Wise Projector to decrease dimensionality, a Cross-Modal Connector and a Text Feature Enhancer for effective modality fusion, and an ensemble regression head to improve generalization in data-scarce situations. To our knowledge, we are the first to apply personality-specific prompts to guide large language models (LLMs) in extracting personality-aware semantics for improved representation quality. Furthermore, extracting and fusing audio-visual apparent behavior features further improves the accuracy. Experimental results on the AVI validation set have demonstrated the effectiveness of the proposed components, i.e., approximately a 45% reduction in mean squared error (MSE). Final evaluations on the test set of the AVI Challenge 2025 confirm our method’s superiority, ranking first in the Personality Assessment track. The source code will be made available at https://github.com/MSA-LMC/TraitsRunDeep.
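
To make the fusion architecture concrete, here is a minimal PyTorch sketch of a text-centric design in the spirit described; the dimensions, attention layout, and layer choices are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class TextCentricFusion(nn.Module):
    def __init__(self, text_dim=768, av_dim=1024, hidden=256, heads=4):
        super().__init__()
        self.chunk_projector = nn.Linear(av_dim, hidden)   # "Chunk-Wise Projector"
        self.text_proj = nn.Linear(text_dim, hidden)
        self.connector = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.enhancer = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, 5)                   # Big Five trait scores

    def forward(self, text_feats, av_feats):
        q = self.text_proj(text_feats)          # text as the anchoring query
        kv = self.chunk_projector(av_feats)     # compressed audio-visual chunks
        fused, _ = self.connector(q, kv, kv)    # "Cross-Modal Connector"
        return self.head(self.enhancer(fused).mean(dim=1))

scores = TextCentricFusion()(torch.randn(2, 8, 768), torch.randn(2, 20, 1024))
print(scores.shape)  # torch.Size([2, 5])
```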

[5] The role of media memorability in facilitating startups’ access to venture capital funding

L. Toschi, S. Torrisi, A. Fronzetti Colladon

Main category: cs.CL

TL;DR: Media memorability, not just exposure, significantly impacts venture capital investment by influencing investor memory through distinctiveness and semantic network connectivity.

Motivation: To address the narrow focus on general media exposure in prior research and explore how nuanced media content, like memorability, affects funding decisions.

Method: Analyzed data from 197 UK startups in micro and nanotechnology (1995-2004) to measure media memorability’s impact on investment outcomes.

Result: Media memorability significantly influences investment, with venture capitalists relying on cues like startup distinctiveness and semantic network connectivity.

Conclusion: Startups should prioritize targeted, meaningful media coverage to enhance memorability, emphasizing uniqueness and industry relevance for better funding outcomes.

Abstract: Media reputation plays an important role in attracting venture capital investment. However, prior research has focused too narrowly on general media exposure, limiting our understanding of how media truly influences funding decisions. As informed decision-makers, venture capitalists respond to more nuanced aspects of media content. We introduce the concept of media memorability - the media’s ability to imprint a startup’s name in the memory of relevant investors. Using data from 197 UK startups in the micro and nanotechnology sector (funded between 1995 and 2004), we show that media memorability significantly influences investment outcomes. Our findings suggest that venture capitalists rely on detailed cues such as a startup’s distinctiveness and connectivity within news semantic networks. This contributes to research on entrepreneurial finance and media legitimation. In practice, startups should go beyond frequent media mentions to strengthen brand memorability through more targeted, meaningful coverage highlighting their uniqueness and relevance within the broader industry conversation.

[6] How Well Does First-Token Entropy Approximate Word Entropy as a Psycholinguistic Predictor?

Christian Clark, Byung-Doh Oh, William Schuler

Main category: cs.CL

TL;DR: The paper highlights the limitations of using first-token approximations for contextual entropy and proposes Monte Carlo estimates for more accurate word entropy measurement, showing divergent effects in reading times.

Motivation: To address the underestimation and distortion of true word entropy caused by first-token approximations in psycholinguistic studies.

Method: Monte Carlo (MC) estimates are generated to allow words to span a variable number of tokens, improving entropy measurement.

Result: Regression experiments on reading times reveal divergent results between first-token and MC word entropy, indicating the limitations of first-token approximations.

Conclusion: Caution is advised when using first-token approximations for contextual entropy, as MC estimates provide more accurate and reliable results.

Abstract: Contextual entropy is a psycholinguistic measure capturing the anticipated difficulty of processing a word just before it is encountered. Recent studies have tested for entropy-related effects as a potential complement to well-known effects from surprisal. For convenience, entropy is typically estimated based on a language model’s probability distribution over a word’s first subword token. However, this approximation results in underestimation and potential distortion of true word entropy. To address this, we generate Monte Carlo (MC) estimates of word entropy that allow words to span a variable number of tokens. Regression experiments on reading times show divergent results between first-token and MC word entropy, suggesting a need for caution in using first-token approximations of contextual entropy.
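
A toy worked example of why the first-token approximation underestimates word entropy: when several words share a first subword token, the first-token distribution collapses them into a single event. The vocabulary and tokenization below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word distribution; assume "cat" and "cater" share the first token "cat".
p_word = np.array([0.5, 0.3, 0.2])              # P(cat), P(cater), P(dog)

# True word entropy: H(W) = -sum_w p(w) log2 p(w)  ->  ~1.485 bits
h_word = -np.sum(p_word * np.log2(p_word))

# First-token approximation merges cat/cater into one event with p = 0.8:
p_first = np.array([0.8, 0.2])
h_first = -np.sum(p_first * np.log2(p_first))   # ~0.722 bits: an underestimate

# Monte Carlo estimate: sample whole words, average their surprisal.
samples = rng.choice(len(p_word), size=50_000, p=p_word)
h_mc = -np.mean(np.log2(p_word[samples]))       # converges to h_word
print(h_word, h_first, h_mc)
```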

[7] Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment

Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang

Main category: cs.CL

TL;DR: A novel framework for holistic interview performance assessment using multimodal data (video, audio, text) and ensemble learning, achieving top results in AVI Challenge 2025.

Motivation: To ensure fair and comprehensive evaluation of candidates by capturing explicit and implicit cues from multimodal data.

Method: Integrates three modalities, six responses, and five evaluation dimensions. Uses modality-specific feature extractors and a Shared Compression Multilayer Perceptron for fusion. Employs a two-level ensemble learning strategy for robust predictions.

Result: Achieved a multi-dimensional average MSE of 0.1824, securing first place in AVI Challenge 2025.

Conclusion: The framework effectively advances automated, multimodal interview assessment, providing unbiased and comprehensive evaluations.

Abstract: Interview performance assessment is essential for determining candidates’ suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores “365” aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
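
The second ensemble level is simple mean pooling across a candidate's responses; a minimal sketch (the array shapes are illustrative, not taken from the implementation):

```python
import numpy as np

# Level 1: one regression head per response -> scores for 5 dimensions each.
per_response = np.random.rand(6, 5)       # (6 responses, 5 dimensions)

# Level 2: mean-pool across responses for the final per-dimension scores.
final_scores = per_response.mean(axis=0)  # shape (5,)
```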

[8] RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation

Dongyub Jude Lee, Zhenyi Ye, Pengcheng He

Main category: cs.CL

TL;DR: RLfR is a novel framework for machine translation that uses continuous feedback from GPT-4o to improve translation quality, outperforming traditional methods.

Motivation: Existing preference-learning methods rely on static datasets and struggle with generalization, prompting the need for a dynamic feedback-based approach.

Method: RLfR uses a teacher model (GPT-4o) to refine translations iteratively, rewarding alignment with refinements via negative edit distance and COMET scores.

Result: RLfR outperforms baselines on FLORES-200, improving COMET and M-ETA scores across multiple languages.

Conclusion: RLfR offers a scalable and effective alternative to static triplet-based methods, enhancing translation quality through iterative learning.

Abstract: Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher’s refinement. Guided by two complementary signals, (i) negative edit distance, which promotes lexical and structural fidelity, and (ii) COMET score, which ensures semantic adequacy, the actor progressively learns to emulate the teacher, mirroring a human learning process through incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
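
A sketch of the reward shaping, assuming token-level edit distance to the teacher's refinement and a precomputed COMET score (e.g., from Unbabel's comet package); the normalization and the 50/50 weighting are our assumptions, since the abstract does not specify how the two signals are combined.

```python
def levenshtein(a, b):
    """Token-level edit distance via the classic single-row DP."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete from a
                        dp[j - 1] + 1,                  # insert into a
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[-1]

def rlfr_reward(hypothesis, refinement, comet_score,
                w_edit=0.5, w_comet=0.5):  # hypothetical weights
    h, r = hypothesis.split(), refinement.split()
    edit = levenshtein(h, r) / max(len(h), len(r), 1)  # normalize to [0, 1]
    return w_edit * (1.0 - edit) + w_comet * comet_score
```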

[9] Meaning-infused grammar: Gradient Acceptability Shapes the Geometric Representations of Constructions in LLMs

Supantho Rakshit, Adele Goldberg

Main category: cs.CL

TL;DR: The study shows that LLMs like Pythia-1.4B learn graded, meaning-infused representations of constructions, aligning with usage-based constructionist principles.

Motivation: To investigate if LLMs' internal representations reflect the function-infused gradience proposed by the usage-based constructionist approach.

Method: Analyzed neural representations of English dative constructions in Pythia-1.4B using a dataset of 5000 sentence pairs varied for human-rated preference strength. Macro-level geometric analysis measured separability (Energy Distance, Jensen-Shannon Divergence).

Result: Separability between construction representations is modulated by preference strength; prototypical exemplars occupy more distinct regions in activation space.

Conclusion: LLMs learn rich, graded representations of constructions, supporting geometric measures of constructionist principles.

Abstract: The usage-based constructionist (UCx) approach posits that language comprises a network of learned form-meaning pairings (constructions) whose use is largely determined by their meanings or functions, requiring them to be graded and probabilistic. This study investigates whether the internal representations in Large Language Models (LLMs) reflect the proposed function-infused gradience. We analyze the neural representations of the English dative constructions (Double Object and Prepositional Object) in Pythia-1.4B, using a dataset of 5,000 sentence pairs systematically varied for human-rated preference strength. A macro-level geometric analysis finds that the separability between construction representations, as measured by Energy Distance or Jensen-Shannon Divergence, is systematically modulated by gradient preference strength. More prototypical exemplars of each construction occupy more distinct regions in the activation space of LLMs. These results provide strong evidence that LLMs learn rich, meaning-infused, graded representations of constructions and offer support for geometric measures of basic constructionist principles in LLMs.
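
For reference, the multivariate energy distance between two sets of activation vectors has a direct closed form; a minimal NumPy sketch follows (the activation matrices are random placeholders for actual model states).

```python
import numpy as np

def energy_distance(X, Y):
    """E(X, Y) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'|| over sample pairs."""
    def mean_pairwise(A, B):
        return np.mean(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
    return 2 * mean_pairwise(X, Y) - mean_pairwise(X, X) - mean_pairwise(Y, Y)

do_acts = np.random.randn(100, 512)  # stand-ins for Double Object activations
po_acts = np.random.randn(100, 512)  # stand-ins for Prepositional Object ones
print(energy_distance(do_acts, po_acts))  # larger = more separable
```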

[10] Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations

Galo Castillo-López, Gaël de Chalendar, Nasredine Semmar

Main category: cs.CL

TL;DR: A hybrid approach combining BERT and LLMs for intent recognition and OOS detection in TODS, improving performance in zero/few-shot settings.

Motivation: Traditional TODS need large amounts of annotated data; this work aims to reduce that dependency while maintaining reliability.

Method: Combines BERT’s efficiency with LLMs’ generalization power, sharing BERT outputs to LLMs for better performance.

Result: Improved system performance on multi-party conversation corpora.

Conclusion: The hybrid approach effectively enhances intent recognition and OOS detection in low-data scenarios.

Abstract: Intent recognition is a fundamental component in task-oriented dialogue systems (TODS). Determining user intents and detecting whether an intent is Out-of-Scope (OOS) is crucial for TODS to provide reliable responses. However, traditional TODS require large amounts of annotated data. In this work we propose a hybrid approach that combines BERT and LLMs in zero- and few-shot settings to recognize intents and detect OOS utterances. Our approach leverages LLMs’ generalization power and BERT’s computational efficiency in such scenarios. We evaluate our method on multi-party conversation corpora and observe that sharing information from BERT outputs to LLMs leads to system performance improvement.
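
One plausible shape for "sharing BERT outputs with the LLM" is to put the classifier's top-k intents and confidences into the LLM prompt; the format below is our guess, not the paper's.

```python
# Top-k intents from a fine-tuned BERT classifier (values are invented).
top_k = [("book_flight", 0.46), ("cancel_flight", 0.31), ("oos", 0.23)]
utterance = "Can you move my Tuesday reservation to Friday?"

prompt = (
    "Candidate intents from a BERT classifier (label, confidence):\n"
    + "\n".join(f"- {label}: {conf:.2f}" for label, conf in top_k)
    + f'\n\nUtterance: "{utterance}"\n'
    "Answer with the correct intent label, or OOS if none applies."
)
# `prompt` is then sent to the LLM for the final intent / OOS decision.
```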

[11] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas

Main category: cs.CL

TL;DR: The paper studies negation in neural information retrieval and LLM-based models, introducing a taxonomy, benchmark datasets, and a logic-based classification mechanism to improve model performance on negation.

Motivation: Understanding and solving complex reasoning tasks, especially those involving negation, is crucial for addressing user information needs, as current dense neural models underperform on such queries.

Method: The authors (1) introduce a taxonomy of negation, (2) generate benchmark datasets for evaluation and fine-tuning, and (3) propose a logic-based classification mechanism to analyze model performance.

Result: The taxonomy improves data distribution and training setup, leading to faster convergence on the NevIR dataset. The classification schema reveals coverage gaps in existing datasets, aiding generalization insights.

Conclusion: The proposed taxonomy, datasets, and classification mechanism enhance model performance and understanding of negation in retrieval tasks.

Abstract: Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.

[12] PATENTWRITER: A Benchmarking Study for Patent Drafting with LLMs

Homaira Huda Shomee, Suman Kalyan Maity, Sourav Medya

Main category: cs.CL

TL;DR: PATENTWRITER is a benchmarking framework for evaluating LLMs in patent abstract generation, assessing quality, robustness, and downstream applicability.

Motivation: To streamline the tedious patent-filing process by leveraging LLMs for patent abstract generation.

Method: Evaluates six LLMs (e.g., GPT-4, LLaMA-3) using zero-shot, few-shot, and chain-of-thought prompting, with metrics like BLEU, ROUGE, and BERTScore.

Result: Modern LLMs generate high-fidelity, stylistically appropriate patent abstracts, often outperforming domain-specific baselines.

Conclusion: PATENTWRITER demonstrates LLMs’ potential in patent writing, with open-sourced code and dataset for reproducibility.

Abstract: Large language models (LLMs) have emerged as transformative approaches in several important fields. This paper aims for a paradigm shift for patent writing by leveraging LLMs to overcome the tedious patent-filing process. In this work, we present PATENTWRITER, the first unified benchmarking framework for evaluating LLMs in patent abstract generation. Given the first claim of a patent, we evaluate six leading LLMs – including GPT-4 and LLaMA-3 – under a consistent setup spanning zero-shot, few-shot, and chain-of-thought prompting strategies to generate the abstract of the patent. Our benchmark PATENTWRITER goes beyond surface-level evaluation: we systematically assess the output quality using a comprehensive suite of metrics – standard NLP measures (e.g., BLEU, ROUGE, BERTScore), robustness under three types of input perturbations, and applicability in two downstream patent classification and retrieval tasks. We also conduct stylistic analysis to assess length, readability, and tone. Experimental results show that modern LLMs can generate high-fidelity and stylistically appropriate patent abstracts, often surpassing domain-specific baselines. Our code and dataset are open-sourced to support reproducibility and future research.
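
A minimal sketch of the surface-level scoring, wired with the nltk and rouge-score packages; this is our wiring, not the released evaluation code, and BERTScore would be added analogously via the bert-score package.

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

reference = "A hinge assembly for a foldable display device comprising ..."
generated = "A hinge mechanism for foldable display devices that includes ..."

# BLEU over whitespace tokens; ROUGE-L F-measure via Google's rouge-score.
bleu = sentence_bleu([reference.split()], generated.split())
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
    reference, generated)
print(bleu, rouge["rougeL"].fmeasure)
```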

[13] BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition

Paige Tuttösí, Mantaj Dhillon, Luna Sang, Shane Eastwood, Poorvi Bhatia, Quang Minh Dinh, Avni Kapoor, Yewon Jin, Angelica Lim

Main category: cs.CL

TL;DR: The paper introduces the BERSt dataset for evaluating speech recognition tasks like ASR and SER, highlighting challenges in real-world scenarios like distanced speech and emotions.

Motivation: Current ASR systems struggle with complex real-world situations, such as distanced speech and emotional variations, despite nearing human performance in controlled metrics.

Method: The BERSt dataset was created with 4 hours of English speech from 98 actors in diverse acoustic environments, using smartphones placed in 19 positions, including obstructions and different rooms. It includes shouted/spoken utterances and 7 emotion prompts.

Result: ASR performance degrades with increased distance and shout level, and varies by emotion. The dataset proves challenging for ASR and SER tasks.

Conclusion: The BERSt dataset highlights the need for improved robustness in ASR and SER systems for real-world applications.

Abstract: Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors’ homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. The data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and that continued work is needed to improve the robustness of such systems for more accurate real-world use.
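
Benchmarking ASR on such data comes down to word error rate per recording condition; a sketch using the jiwer package (the record fields are hypothetical, not the dataset's actual schema).

```python
import jiwer

utterances = [
    {"ref": "open the window", "hyp": "open the window", "distance": "near"},
    {"ref": "open the window", "hyp": "open a windows",  "distance": "far"},
]

for cond in ("near", "far"):
    subset = [u for u in utterances if u["distance"] == cond]
    wer = jiwer.wer([u["ref"] for u in subset], [u["hyp"] for u in subset])
    print(cond, wer)  # expect WER to rise with distance, per the paper
```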

[14] Question Generation for Assessing Early Literacy Reading Comprehension

Xiaocheng Yang, Sumuk Shashidhar, Dilek Hakkani-Tur

Main category: cs.CL

TL;DR: A novel method for generating diverse comprehension questions for K-2 English learners, evaluated using FairytaleQA, with potential for AI-driven instruction.

Motivation: To enhance reading comprehension assessment by adapting questions to learners' proficiencies and ensuring thorough material coverage.

Method: Proposes a framework for generating diverse question types at varying difficulty levels, tested with language models on the FairytaleQA dataset.

Result: Demonstrates the approach’s effectiveness in creating tailored comprehension questions for young learners.

Conclusion: The method shows promise for integration into autonomous AI-driven English instruction tools.

Abstract: Assessment of reading comprehension through content-based interactions plays an important role in the reading acquisition process. In this paper, we propose a novel approach for generating comprehension questions geared to K-2 English learners. Our method ensures complete coverage of the underlying material and adaptation to the learner’s specific proficiencies, and can generate a large diversity of question types at various difficulty levels to ensure a thorough evaluation. We evaluate the performance of various language models in this framework using the FairytaleQA dataset as the source material. Eventually, the proposed approach has the potential to become an important part of autonomous AI-driven English instructors.

[15] NeedleChain: Measuring Intact Long-Context Reasoning Capability of Large Language Models

Hyeonseok Moon, Heuiseok Lim

Main category: cs.CL

TL;DR: The NIAH benchmark may overestimate LLMs’ long-context understanding. NeedleChain, a new benchmark with fully query-relevant contexts, is introduced, along with ROPE Contraction to improve LLM performance.

Motivation: To address the limitations of the NIAH benchmark in accurately assessing LLMs' long-context understanding.

Method: Introduces NeedleChain, a benchmark with entirely query-relevant contexts, and proposes ROPE Contraction for improving LLM performance.

Result: State-of-the-art LLMs struggle with fully understanding long contexts, even when they are query-relevant.

Conclusion: NeedleChain and ROPE Contraction provide better tools for evaluating and enhancing LLMs’ long-context understanding.

Abstract: The Needle-in-a-Haystack (NIAH) benchmark is widely used to evaluate Large Language Models’ (LLMs) ability to understand long contexts (LC). It evaluates the capability to identify query-relevant context within extensive query-irrelevant passages. Although this method serves as a widely accepted standard for evaluating long-context understanding, our findings suggest it may overestimate the true LC capability of LLMs. We demonstrate that even state-of-the-art models such as GPT-4o struggle to intactly incorporate given contexts made up of only ten query-relevant sentences. In response, we introduce a novel benchmark, NeedleChain, where the context consists entirely of query-relevant information, requiring the LLM to fully grasp the input to answer correctly. Our benchmark allows for flexible context length and reasoning order, offering a more comprehensive analysis of LLM performance. Additionally, we propose an extremely simple yet compelling strategy to improve the LC understanding capability of LLMs: ROPE Contraction. Our experiments with various advanced LLMs reveal a notable disparity between their ability to process large contexts and their capacity to fully understand them. Source code and datasets are available at https://github.com/hyeonseokk/NeedleChain
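
The abstract names ROPE Contraction without detail; one plausible reading, shown purely as a hypothesis, is to compress rotary position indices by a constant factor before computing the embedding angles.

```python
import torch

def rope_angles(positions, dim=64, base=10_000.0, contraction=0.25):
    """Rotary-embedding angles with contracted position ids.
    NOTE: the contraction factor and this whole interpretation are
    assumptions; the paper only names the strategy in its abstract."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled = positions.float() * contraction   # squeeze long inputs inward
    return torch.outer(scaled, inv_freq)       # (seq_len, dim // 2)

angles = rope_angles(torch.arange(4096))
print(angles.shape)  # torch.Size([4096, 32])
```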

[16] AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini

Jill Walker Rettberg, Hermann Wigers

Main category: cs.CL

TL;DR: GPT-4o-mini generates culturally homogenized stories across 236 countries, favoring stability and tradition over diversity and conflict.

Motivation: To assess if a language model trained on Anglo-American texts can produce culturally relevant stories for other nationalities.

Method: Generated 11,800 stories (50 per country) using GPT-4o-mini with a prompt for culturally specific narratives.

Result: Stories conform to a single plot structure, sanitizing conflicts and lacking cultural depth.

Conclusion: AI narrative homogenization is a form of bias, highlighting the need for cultural alignment in generative AI.

Abstract: Can a language model trained largely on Anglo-American texts generate stories that are culturally relevant to other nationalities? To find out, we generated 11,800 stories (50 for each of 236 countries) by sending the prompt “Write a 1500 word potential {demonym} story” to OpenAI’s model gpt-4o-mini. Although the stories do include surface-level national symbols and themes, they overwhelmingly conform to a single narrative plot structure across countries: a protagonist lives in or returns home to a small town and resolves a minor conflict by reconnecting with tradition and organising community events. Real-world conflicts are sanitised, romance is almost absent, and narrative tension is downplayed in favour of nostalgia and reconciliation. The result is a narrative homogenisation: an AI-generated synthetic imaginary that prioritises stability above change and tradition above growth. We argue that the structural homogeneity of AI-generated narratives constitutes a distinct form of AI bias, a narrative standardisation that should be acknowledged alongside the more familiar representational bias. These findings are relevant to literary studies, narratology, critical AI studies, NLP research, and efforts to improve the cultural alignment of generative AI.
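
Since the study fully specifies the prompt template and model, the generation loop is short; a sketch using the OpenAI SDK (the client wiring is our assumption, the prompt string is the paper's).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
demonyms = ["Norwegian", "Kenyan", "Peruvian"]  # the study covers 236 countries

stories = []
for demonym in demonyms:
    for _ in range(50):  # 50 stories per country
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": f"Write a 1500 word potential {demonym} story"}],
        )
        stories.append((demonym, resp.choices[0].message.content))
```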

[17] Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance

Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, Mugariya Farooq, Giulia Campesan, Ruxandra Cojocaru, Yasser Djilali, Shi Hu, Iheb Chaabane, Puneesh Khanna, Mohamed El Amine Seddik, Ngoc Dung Huynh, Phuc Le Khac, Leen AlQadi, Billel Mokeddem, Mohamed Chami, Abdalgader Abubaker, Mikhail Lubinets, Kacper Piskorski, Slim Frikha

Main category: cs.CL

TL;DR: Falcon-H1 introduces hybrid architecture LLMs combining Transformer and State Space Models for high performance and efficiency, outperforming larger models with fewer parameters.

Motivation: To optimize LLM performance and efficiency by integrating Transformer-based attention with State Space Models, addressing long-context memory and computational challenges.

Method: Adopts a parallel hybrid architecture, revisits model design, data strategy, and training dynamics, and releases multiple configurations (0.5B to 34B parameters).

Result: Falcon-H1 models outperform larger models (e.g., Qwen3-32B, Llama3.3-70B) with fewer parameters, excelling in reasoning, math, multilingual tasks, and more.

Conclusion: Falcon-H1 sets a new standard for efficient and high-performing LLMs, released under an open-source license for broad accessibility.

Abstract: In this report, we introduce Falcon-H1, a new series of large language models (LLMs) featuring hybrid architecture designs optimized for both high performance and efficiency across diverse use cases. Unlike earlier Falcon models built solely on Transformer or Mamba architectures, Falcon-H1 adopts a parallel hybrid approach that combines Transformer-based attention with State Space Models (SSMs), known for superior long-context memory and computational efficiency. We systematically revisited model design, data strategy, and training dynamics, challenging conventional practices in the field. Falcon-H1 is released in multiple configurations, including base and instruction-tuned variants at 0.5B, 1.5B, 1.5B-deep, 3B, 7B, and 34B parameters. Quantized instruction-tuned models are also available, totaling over 30 checkpoints on Hugging Face Hub. Falcon-H1 models demonstrate state-of-the-art performance and exceptional parameter and training efficiency. The flagship Falcon-H1-34B matches or outperforms models up to 70B scale, such as Qwen3-32B, Qwen2.5-72B, and Llama3.3-70B, while using fewer parameters and less data. Smaller models show similar trends: the Falcon-H1-1.5B-Deep rivals current leading 7B-10B models, and Falcon-H1-0.5B performs comparably to typical 7B models from 2024. These models excel across reasoning, mathematics, multilingual tasks, instruction following, and scientific knowledge. With support for up to 256K context tokens and 18 languages, Falcon-H1 is suitable for a wide range of applications. All models are released under a permissive open-source license, underscoring our commitment to accessible and impactful AI research.

[18] What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models

Tian Yun, Chen Sun, Ellie Pavlick

Main category: cs.CL

TL;DR: LLMs perform poorly in zero-shot settings but improve significantly with minor parameter tuning, though this doesn’t generalize across datasets. The findings prompt a reevaluation of what defines an ‘abstract reasoner.’

Motivation: To challenge the claim that LLMs lack abstract reasoning by demonstrating their potential with minimal tuning and exploring the implications of their performance.

Method: Revisiting experiments on LLMs, testing zero-shot performance and the impact of tuning a small subset of parameters for input encoding.

Result: LLMs achieve near-perfect performance with minor tuning but fail to generalize this improvement across different datasets.

Conclusion: The results call for a deeper discussion on the definition of ‘abstract reasoning’ and its relevance to LLMs.

Abstract: Recent work has argued that large language models (LLMs) are not “abstract reasoners”, citing their poor zero-shot performance on a variety of challenging tasks as evidence. We revisit these experiments in order to add nuance to the claim. First, we show that while LLMs indeed perform poorly in a zero-shot setting, even tuning a small subset of parameters for input encoding can enable near-perfect performance. However, we also show that this finetuning does not necessarily transfer across datasets. We take this collection of empirical results as an invitation to (re-)open the discussion of what it means to be an “abstract reasoner”, and why it matters whether LLMs fit the bill.
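
"Tuning a small subset of parameters for input encoding" can be made concrete by freezing everything except the input embeddings; a sketch with a stand-in model (the paper's exact parameter subset may differ).

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model

for p in model.parameters():
    p.requires_grad = False            # freeze the whole network ...
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True             # ... except the input embedding table

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```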

[19] IFEvalCode: Controlled Code Generation

Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin

Main category: cs.CL

TL;DR: The paper introduces forward and backward constraints generation to enhance Code LLMs’ adherence to detailed requirements in controlled code generation, alongside a new benchmark, IFEvalCode, for nuanced evaluation.

Motivation: Real-world applications require stricter adherence to coding guidelines beyond correctness, which current Code LLMs struggle with.

Method: The authors propose forward and backward constraints generation and introduce IFEvalCode, a multilingual benchmark with 1.6K test samples across seven languages, evaluating correctness and instruction-following separately.

Result: Experiments on 40+ LLMs show closed-source models outperform open-source ones in controllable code generation, with a notable gap between correctness and instruction-following.

Conclusion: The study highlights the need for improved instruction-following in Code LLMs and provides tools (constraints generation and IFEvalCode) to address this gap.

Abstract: Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed requirements such as coding style, line count, and structural constraints, beyond mere correctness. To address this, the paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs in controlled code generation, ensuring outputs align more closely with human-defined guidelines. The authors further present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages (Python, Java, JavaScript, TypeScript, Shell, C++, and C#), with each sample featuring both Chinese and English queries. Unlike existing benchmarks, IFEvalCode decouples evaluation into two metrics: correctness (Corr.) and instruction-following (Instr.), enabling a more nuanced assessment. Experiments on over 40 LLMs reveal that closed-source models outperform open-source ones in controllable code generation and highlight a significant gap between the models’ ability to generate correct code versus code that precisely follows instructions.
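
The decoupled metrics can be pictured as two independent checks per sample; a toy sketch (the real benchmark's constraint checkers are richer than this line-count example).

```python
def correctness(code: str, test: str) -> bool:
    """Corr.: does the generated code pass its functional tests?"""
    env: dict = {}
    try:
        exec(code, env)
        exec(test, env)   # tests are assert statements
        return True
    except Exception:
        return False

def instruction_following(code: str, max_lines: int) -> bool:
    """Instr.: does it satisfy the stated constraint (here, a line limit)?"""
    return len([ln for ln in code.splitlines() if ln.strip()]) <= max_lines

sample = "def add(a, b):\n    return a + b"
print(correctness(sample, "assert add(2, 3) == 5"),
      instruction_following(sample, max_lines=2))  # True True
```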

[20] SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

Lei Sheng, Shuai-Shuai Xu

Main category: cs.CL

TL;DR: SLMs (0.5B-1.5B parameters) underperform in Text-to-SQL tasks but offer speed and edge deployment advantages. Post-training techniques (fine-tuning, reinforcement learning) on derived datasets improved performance, with models achieving up to 67.08% execution accuracy.

Motivation: To explore the potential of SLMs in Text-to-SQL tasks despite their limited logical reasoning, leveraging their advantages in speed and edge deployment.

Method: Used SynSQL-2.5M to create datasets for SQL generation and merge revision. Applied supervised fine-tuning, reinforcement learning, and corrective self-consistency inference.

Result: Average improvement of 31.4 points on BIRD development set; 0.5B model reached 56.87% EX, 1.5B model achieved 67.08% EX.

Conclusion: Post-training techniques significantly enhance SLM performance in Text-to-SQL, validating the SLM-SQL method’s effectiveness and generalizability.

Abstract: Large language models (LLMs) have demonstrated strong performance in translating natural language questions into SQL queries (Text-to-SQL). In contrast, small language models (SLMs) ranging from 0.5B to 1.5B parameters currently underperform on Text-to-SQL tasks due to their limited logical reasoning capabilities. However, SLMs offer inherent advantages in inference speed and suitability for edge deployment. To explore their potential in Text-to-SQL applications, we leverage recent advancements in post-training techniques. Specifically, we used the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. We then applied supervised fine-tuning and reinforcement learning-based post-training to the SLM, followed by inference using a corrective self-consistency approach. Experimental results validate the effectiveness and generalizability of our method, SLM-SQL. On the BIRD development set, the five evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX. We will release our dataset, model, and code to GitHub: https://github.com/CycloneBoy/slm_sql.
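
The self-consistency step can be sketched as executing every sampled SQL candidate and majority-voting on the execution results; candidate sampling and the merge-revision model are elided, and this is our reading of the idea, not the released code.

```python
import sqlite3

def self_consistent_sql(candidates: list[str], db_path: str) -> str | None:
    """Return the candidate whose execution result is most common."""
    conn = sqlite3.connect(db_path)
    by_result: dict[str, list[str]] = {}
    for sql in candidates:
        try:
            key = str(sorted(conn.execute(sql).fetchall()))
        except sqlite3.Error:
            continue  # failing candidates would go to the merge-revision step
        by_result.setdefault(key, []).append(sql)
    conn.close()
    if not by_result:
        return None
    return max(by_result.values(), key=len)[0]  # most frequent result wins
```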

[21] CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Dongchen Li, Jitao Liang, Wei Li, Xiaoyu Wang, Longbing Cao, Kun Yu

Main category: cs.CL

TL;DR: CliCARE is a framework that grounds LLMs in clinical guidelines to improve decision support for cancer EHRs, addressing challenges like long-range dependencies and clinical hallucination.

Motivation: To enhance clinical decision support and reduce physician burnout by leveraging LLMs for synthesizing complex cancer EHRs, while overcoming challenges like multilingual records and unreliable evaluation metrics.

Method: Transforms unstructured EHRs into Temporal Knowledge Graphs (TKGs) and aligns them with normative guideline knowledge graphs for evidence-grounded decision support.

Result: Outperforms baselines in diverse datasets (Chinese cancer and MIMIC-IV), with high clinical validity confirmed by expert oncologists.

Conclusion: CliCARE effectively addresses key challenges in applying LLMs to oncology, providing reliable, guideline-grounded decision support.

Abstract: Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.

[22] A Benchmark Dataset and Evaluation Framework for Vietnamese Large Language Models in Customer Support

Long S. T. Nguyen, Truong P. Hua, Thanh M. Nguyen, Toan Q. Pham, Nam K. Ngo, An X. Nguyen, Nghi D. M. Pham, Nghia H. Nguyen, Tho T. Quan

Main category: cs.CL

TL;DR: The paper introduces CSConDa, a Vietnamese customer support QA dataset, and evaluates 11 lightweight ViLLMs to address the lack of domain-specific benchmarks for practical applications.

Motivation: Despite the rapid adoption of AI and LLMs in QA systems, domain-specific evaluations and benchmark datasets for Vietnamese customer support are lacking, hindering model selection.

Method: The authors curated CSConDa, a dataset of 9,000 QA pairs from real interactions, and evaluated 11 lightweight ViLLMs using automatic metrics and syntactic analysis.

Result: The study provides insights into model strengths, weaknesses, and linguistic patterns, aiding in performance comparison and identifying improvement areas.

Conclusion: CSConDa and the evaluation framework enable informed model selection for customer service QA and advance Vietnamese LLM research.

Abstract: With the rapid growth of Artificial Intelligence, Large Language Models (LLMs) have become essential for Question Answering (QA) systems, improving efficiency and reducing human workload in customer service. The emergence of Vietnamese LLMs (ViLLMs) highlights lightweight open-source models as a practical choice for their accuracy, efficiency, and privacy benefits. However, domain-specific evaluations remain limited, and the absence of benchmark datasets reflecting real customer interactions makes it difficult for enterprises to select suitable models for support applications. To address this gap, we introduce the Customer Support Conversations Dataset (CSConDa), a curated benchmark of over 9,000 QA pairs drawn from real interactions with human advisors at a large Vietnamese software company. Covering diverse topics such as pricing, product availability, and technical troubleshooting, CSConDa provides a representative basis for evaluating ViLLMs in practical scenarios. We further present a comprehensive evaluation framework, benchmarking 11 lightweight open-source ViLLMs on CSConDa with both automatic metrics and syntactic analysis to reveal model strengths, weaknesses, and linguistic patterns. This study offers insights into model behavior, explains performance differences, and identifies key areas for improvement, supporting the development of next-generation ViLLMs. By establishing a robust benchmark and systematic evaluation, our work enables informed model selection for customer service QA and advances research on Vietnamese LLMs. The dataset is publicly available at https://huggingface.co/datasets/ura-hcmut/Vietnamese-Customer-Support-QA.
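
The dataset is on the Hugging Face Hub, so a quick look takes only a couple of lines (split and column names should be checked against the dataset card).

```python
from datasets import load_dataset

ds = load_dataset("ura-hcmut/Vietnamese-Customer-Support-QA")
print(ds)  # inspect available splits, columns, and sizes
```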

[23] ControlMed: Adding Reasoning Control to Medical Language Model

Sung-Min Lee, Siyoon Lee, Juyeon Kim, Kyungmin Roh

Main category: cs.CL

TL;DR: ControlMed is a medical LLM allowing users to control reasoning length, improving efficiency without sacrificing accuracy.

Motivation: Existing LLMs in medicine produce lengthy reasoning, causing computational overhead and latency, hindering clinical use.

Method: Three-stage training: pre-training on synthetic data, supervised fine-tuning with length-control markers, and reinforcement learning for accuracy.

Result: Matches or outperforms state-of-the-art models, with flexible length control for efficiency.

Conclusion: ControlMed is a practical, adaptable solution for clinical QA and medical analysis.

Abstract: Reasoning Large Language Models (LLMs) with enhanced accuracy and explainability are increasingly being adopted in the medical domain, as the life-critical nature of clinical decision-making demands reliable support. Despite these advancements, existing reasoning LLMs often generate unnecessarily lengthy reasoning processes, leading to significant computational overhead and response latency. These limitations hinder their practical deployment in real-world clinical environments. To address these challenges, we introduce ControlMed, a medical language model that enables users to actively control the length of the reasoning process at inference time through fine-grained control markers. ControlMed is trained through a three-stage pipeline: 1) pre-training on a large-scale synthetic medical instruction dataset covering both direct and reasoning responses; 2) supervised fine-tuning with multi-length reasoning data and explicit length-control markers; and 3) reinforcement learning with model-based reward signals to enhance factual accuracy and response quality. Experimental results on a variety of English and Korean medical benchmarks demonstrate that our model achieves similar or better performance compared to state-of-the-art models. Furthermore, users can flexibly balance reasoning accuracy and computational efficiency by controlling the reasoning length as needed. These findings demonstrate that ControlMed is a practical and adaptable solution for clinical question answering and medical information analysis.

[24] Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Xikang Yang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu

Main category: cs.CL

TL;DR: CognitiveAttack is a novel framework exploiting multi-bias interactions to bypass LLM safety, achieving higher success rates than existing methods.

Motivation: LLM safety mechanisms are vulnerable to adversarial attacks exploiting cognitive biases, which are underexplored.

Method: Combines supervised fine-tuning and reinforcement learning to generate prompts with optimized bias combinations.

Result: Achieves 60.1% attack success rate, outperforming SOTA methods, and exposes vulnerabilities in 30 LLMs.

Conclusion: Multi-bias interactions are a powerful attack vector, bridging cognitive science and AI safety for more robust systems.

Abstract: Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases – systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

[25] Unveiling the Influence of Amplifying Language-Specific Neurons

Inaya Rahmanisa, Lyzander Marciano Andrylie, Krisna Mahardika Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Main category: cs.CL

TL;DR: Amplifying language-specific neurons in LLMs improves performance in the target language but often degrades cross-language results, with limited benefits for cross-lingual transfer.

Motivation: To explore the role of language-specific neurons in multilingual behavior, particularly their amplification effects on model performance across languages, including low-resource ones.

Method: Interventions amplifying language-specific neurons in three models across 18 languages, evaluated using the Language Steering Shift (LSS) score and downstream tasks (commonsense reasoning, knowledge, translation).

Result: Optimal amplification factors effectively steer output to target languages, improving self-language performance but generally degrading cross-language results.

Conclusion: Amplification of language-specific neurons benefits low-resource languages but offers limited advantage for cross-lingual transfer, highlighting their role in multilingual behavior.

Abstract: Language-specific neurons in LLMs that strongly correlate with individual languages have been shown to influence model behavior by deactivating them. However, their role in amplification remains underexplored. This work investigates the effect of amplifying language-specific neurons through interventions across 18 languages, including low-resource ones, using three models primarily trained in different languages. We compare amplification factors by their effectiveness in steering to the target language using a proposed Language Steering Shift (LSS) evaluation score, then evaluate it on downstream tasks: commonsense reasoning (XCOPA, XWinograd), knowledge (Include), and translation (FLORES). The optimal amplification factors effectively steer output toward nearly all tested languages. Intervention using this factor on downstream tasks improves self-language performance in some cases but generally degrades cross-language results. These findings highlight the effect of language-specific neurons in multilingual behavior, where amplification can be beneficial especially for low-resource languages, but provides limited advantage for cross-lingual transfer.
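
Mechanically, the intervention amounts to scaling a set of previously identified neurons during the forward pass; a PyTorch forward-hook sketch (the layer path, neuron ids, and factor below are placeholders, not values from the paper).

```python
import torch

def make_amplifier(neuron_ids, factor):
    """Multiply the given hidden units by `factor` in a module's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., neuron_ids] *= factor
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on a loaded causal LM:
# handle = model.model.layers[20].mlp.register_forward_hook(
#     make_amplifier(neuron_ids=[11, 407, 1532], factor=4.0))
# ... run generation in the steered condition, then: handle.remove()
```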

[26] BALSAM: A Platform for Benchmarking Arabic Large Language Models

Rawan Al-Matham, Kareem Darwish, Raghad Al-Rasheed, Waad Alshammari, Muneera Alhoshan, Amal Almazrua, Asma Al Wazrah, Mais Alheraki, Firoj Alam, Preslav Nakov, Norah Alzahrani, Eman alBilali, Nizar Habash, Abdelrahman El-Sheikh, Muhammad Elmallah, Haonan Li, Hamdy Mubarak, Mohamed Anwar, Zaid Alyafeai, Ahmed Abdelali, Nora Altwairesh, Maram Hasanain, Abdulmohsen Al Thubaity, Shady Shehata, Bashar Alhafni, Injy Hamed, Go Inoue, Khalid Elmadani, Ossama Obeid, Fatima Haouari, Tamer Elsayed, Emad Alghamdi, Khalid Almubarak, Saied Alshahrani, Ola Aljarrah, Safa Alajlan, Areej Alshaqarawi, Maryam Alshihri, Sultana Alghurabi, Atikah Alzeghayer, Afrah Altamimi, Abdullah Alfaifi, Abdulrahman AlOsaimy

Main category: cs.CL

TL;DR: BALSAM is a community-driven benchmark for Arabic LLMs, addressing gaps like data scarcity and poor benchmarks by offering 78 NLP tasks and a transparent evaluation platform.

DetailsMotivation: Arabic LLMs lag behind due to data scarcity, linguistic diversity, and poor benchmarks. BALSAM aims to bridge these gaps.

Method: Introduces BALSAM, a benchmark with 78 NLP tasks, 52K examples, and a blind evaluation platform.

Result: BALSAM provides a unified, transparent platform for evaluating Arabic LLMs, covering diverse tasks and mitigating data contamination.

Conclusion: BALSAM sets standards and fosters collaboration to advance Arabic LLM capabilities.

Abstract: The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

[27] Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann

Main category: cs.CL

TL;DR: The paper investigates language-specific neurons in LLMs, showing their clustering in deeper layers and specialization for non-Latin scripts. Language arithmetics effectively steer models for multilingual tasks, with better results for high-resource languages and typologically similar ones.

DetailsMotivation: To understand the neural mechanisms behind language-specific processing in LLMs and explore methods to manipulate these neurons for improved multilingual task performance.

Method: Analyze language-specific neurons in LLMs using the LAPE method and perform language arithmetics (activation addition/multiplication) to steer model behavior.

Result: Language-specific neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Language arithmetics outperforms simpler methods in multilingual tasks.

Conclusion: Manipulating language-specific neurons enhances multilingual task performance, with effectiveness influenced by language resource availability and typological similarity.

Abstract: Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal “fallback” mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
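
For intuition, the LAPE selection step can be sketched as follows: estimate how often each neuron fires per language, normalize into a distribution over languages, and keep the lowest-entropy neurons. The array shapes and threshold logic below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

# activation_prob[l, n]: fraction of tokens in language l on which neuron n
# fires (activation above a threshold); random values stand in for real stats.
rng = np.random.default_rng(0)
activation_prob = rng.random((21, 4096))  # 21 languages x 4096 neurons

# Normalize each neuron's firing probabilities into a distribution over languages.
p = activation_prob / activation_prob.sum(axis=0, keepdims=True)

# LAPE-style score: entropy over languages. Low entropy means the neuron fires
# predominantly for a few languages, i.e. it is language-specific.
entropy = -(p * np.log(p + 1e-12)).sum(axis=0)
language_specific = np.argsort(entropy)[:100]  # keep the 100 most specific neurons
```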

[28] Multilingual Political Views of Large Language Models: Identification and Steering

Daniil Gurgurov, Katharina Trinley, Ivan Vykopal, Josef van Genabith, Simon Ostermann, Roberto Zamparelli

Main category: cs.CL

TL;DR: The paper investigates political biases in LLMs, revealing a libertarian-left skew, and tests a method to manipulate these biases across languages.

DetailsMotivation: Address gaps in understanding political biases in LLMs, including generalizability and controllability.

Method: Evaluated seven open-source LLMs across 14 languages using the Political Compass Test with paraphrases. Tested bias manipulation via activation intervention.

Result: Larger models lean libertarian-left, with variations by language and model. Bias manipulation was successful.

Conclusion: LLMs exhibit political biases that can be controlled, highlighting the need for awareness and mitigation in deployment.

Abstract: Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases–frequently skewing toward liberal or progressive positions–key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
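
The steering technique named in the abstract is simple enough to sketch: take the mean ("center of mass") of hidden states collected under two opposing stances and add their difference back in at inference. The tensors below are random stand-ins for activations harvested from a real model:

```python
import torch

# Hidden states from one model layer for statements expressing two stances
# (random stand-ins; in practice these come from forward passes on real text).
h_stance_a = torch.randn(200, 4096)
h_stance_b = torch.randn(200, 4096)

# Steering direction: difference between the two activation centers of mass.
steer = h_stance_b.mean(dim=0) - h_stance_a.mean(dim=0)

def intervene(hidden, strength=1.0):
    # Add the steering vector to every token's hidden state at this layer.
    return hidden + strength * steer

steered = intervene(torch.randn(1, 10, 4096))
```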

[29] From Sufficiency to Reflection: Reinforcement-Guided Thinking Quality in Retrieval-Augmented Reasoning for LLMs

Jie He, Victor Gutierrez Basulto, Jeff Z. Pan

Main category: cs.CL

TL;DR: The paper introduces TIRESRAG-R1, a reinforcement learning-based RAG framework that improves reasoning in LLMs by addressing three failure patterns and using multi-dimensional rewards.

DetailsMotivation: Existing RAG methods rely on final-answer rewards, ignoring intermediate reasoning quality, leading to issues like insufficient retrieval, faulty reasoning, and answer-reasoning inconsistency.

Method: TIRESRAG-R1 uses a think-retrieve-reflect process with sufficiency, reasoning quality, and reflection rewards, plus difficulty-aware reweighting and sample filtering.

Result: Experiments on multi-hop QA datasets show TIRESRAG-R1 outperforms prior methods and generalizes to single-hop tasks.

Conclusion: TIRESRAG-R1 enhances reasoning and stability in RAG methods, with code and data publicly available.

Abstract: Reinforcement learning-based retrieval-augmented generation (RAG) methods enhance the reasoning abilities of large language models (LLMs). However, most rely only on final-answer rewards, overlooking intermediate reasoning quality. This paper analyzes existing RAG reasoning models and identifies three main failure patterns: (1) information insufficiency, meaning the model fails to retrieve adequate support; (2) faulty reasoning, where logical or content-level flaws appear despite sufficient information; and (3) answer-reasoning inconsistency, where a valid reasoning chain leads to a mismatched final answer. We propose TIRESRAG-R1, a novel framework using a think-retrieve-reflect process and a multi-dimensional reward system to improve reasoning and stability. TIRESRAG-R1 introduces: (1) a sufficiency reward to encourage thorough retrieval; (2) a reasoning quality reward to assess the rationality and accuracy of the reasoning chain; and (3) a reflection reward to detect and revise errors. It also employs a difficulty-aware reweighting strategy and training sample filtering to boost performance on complex tasks. Experiments on four multi-hop QA datasets show that TIRESRAG-R1 outperforms prior RAG methods and generalizes well to single-hop tasks. The code and data are available at: https://github.com/probe2/TIRESRAG-R1.
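
As a rough illustration of the multi-dimensional reward, the three process signals can be folded into a scalar reward alongside the answer reward, with harder samples up-weighted. The weighting scheme and value ranges here are assumptions for illustration only:

```python
def total_reward(answer_reward, sufficiency, reasoning_quality, reflection,
                 difficulty, weights=(1.0, 0.5, 0.5, 0.3)):
    """Combine a final-answer reward with three process rewards. All inputs are
    assumed to be scores in [0, 1] produced by separate judges (a retrieval-
    sufficiency checker, a reasoning-chain grader, a reflection detector)."""
    wa, ws, wq, wr = weights
    r = (wa * answer_reward + ws * sufficiency
         + wq * reasoning_quality + wr * reflection)
    # Difficulty-aware reweighting: up-weight harder samples so easy questions
    # do not dominate the policy gradient.
    return r * (1.0 + difficulty)

print(total_reward(1.0, 0.8, 0.9, 1.0, difficulty=0.7))
```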

[30] Investigating Hallucination in Conversations for Low Resource Languages

Amit Das, Md. Najib Hasan, Souvika Sarkar, Zheng Zhang, Fatemeh Jamshidi, Tathagata Bhattacharya, Nilanjana Raychawdhury, Dongji Feng, Vinija Jain, Aman Chadha

Main category: cs.CL

TL;DR: The paper investigates hallucination in LLMs across Hindi, Farsi, and Mandarin, finding fewer hallucinations in Mandarin compared to Hindi and Farsi.

DetailsMotivation: To address the issue of hallucination in LLMs, particularly in non-English languages, to improve reliability.

Method: Analysis of conversational data in Hindi, Farsi, and Mandarin using models like GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3.

Result: LLMs produce fewer hallucinations in Mandarin but more in Hindi and Farsi.

Conclusion: Hallucination rates vary by language, highlighting the need for language-specific improvements in LLMs.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as ‘hallucination’. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1 and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

[31] Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning

Benedikt Roth, Stephan Rappensperger, Tianming Qiu, Hamza Imamović, Julian Wörmann, Hao Shen

Main category: cs.CL

TL;DR: The paper explores adaptation strategies for LLMs to improve text embeddings for non-generative tasks, achieving state-of-the-art performance on MTEB’s clustering track.

DetailsMotivation: LLMs' token-level representations lose crucial information when pooled into embeddings, yet many tasks rely on accurate sentence/document embeddings.

Method: Three strategies: (i) token embedding aggregation, (ii) task-specific prompt engineering, (iii) contrastive fine-tuning with synthetic data.

Result: Combined strategies yield top performance on MTEB’s clustering track; fine-tuning shifts focus to semantically relevant words.

Conclusion: LLMs can be effectively adapted for text embeddings via prompt engineering and efficient contrastive fine-tuning.

Abstract: Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP), achieving impressive performance in text generation. Their token-level representations capture rich, human-aligned semantics. However, pooling these vectors into a text embedding discards crucial information. Nevertheless, many non-generative downstream tasks, such as clustering, classification, or retrieval, still depend on accurate and controllable sentence- or document-level embeddings. We explore several adaptation strategies for pre-trained, decoder-only LLMs: (i) various aggregation techniques for token embeddings, (ii) task-specific prompt engineering, and (iii) text-level augmentation via contrastive fine-tuning. Combining these components yields state-of-the-art performance on the English clustering track of the Massive Text Embedding Benchmark (MTEB). An analysis of the attention map further shows that fine-tuning shifts focus from prompt tokens to semantically relevant words, indicating more effective compression of meaning into the final hidden state. Our experiments demonstrate that LLMs can be effectively adapted as text embedding models through a combination of prompt engineering and resource-efficient contrastive fine-tuning on synthetically generated positive pairs.
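
A minimal sketch of two of the three ingredients, mean pooling over token embeddings and contrastive fine-tuning with in-batch negatives (an InfoNCE-style loss). The tensors and the synthetic positive pair are stand-ins; the paper trains on synthetically generated positive pairs:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    # Average token embeddings, ignoring padding positions.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def info_nce(emb_a, emb_b, temperature=0.05):
    # Each anchor's positive sits at the same batch index; every other row in
    # the batch acts as a negative.
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature
    labels = torch.arange(a.size(0))
    return F.cross_entropy(logits, labels)

hidden = torch.randn(8, 32, 768)                 # batch x tokens x dim (stand-in)
mask = torch.ones(8, 32, dtype=torch.long)
anchors = mean_pool(hidden, mask)
positives = anchors + 0.01 * torch.randn_like(anchors)  # synthetic positive pair
loss = info_nce(anchors, positives)
```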

[32] Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index

Praveenkumar Katwe, Rakesh Chandra, Balabantaray Kali, Prasad Vittala

Main category: cs.CL

TL;DR: A reward-driven fine-tuning framework is introduced to reduce entity hallucinations in abstractive summarization by optimizing for the Entity Hallucination Index (EHI).

DetailsMotivation: Reducing hallucinations in abstractive summarization is critical for real-world deployment of language models.

Method: Uses reinforcement learning to fine-tune models, optimizing for EHI scores derived from automatic entity extraction and matching.

Result: Consistent improvements in EHI scores, with reduced entity hallucinations without compromising fluency or informativeness.

Conclusion: The framework enables scalable, annotation-free fine-tuning and is released as a reproducible pipeline for further research.

Abstract: Reducing hallucinations in abstractive summarization remains a critical challenge for deploying language models (LMs) in real-world settings. In this work, we introduce a reward-driven fine-tuning framework that explicitly optimizes for Entity Hallucination Index (EHI), a metric designed to quantify the presence, correctness, and grounding of named entities in generated summaries. Given a corpus of meeting transcripts, we first generate baseline summaries using a pre-trained LM and compute EHI scores via automatic entity extraction and matching. We then apply reinforcement learning to fine-tune the model parameters, using EHI as a reward signal to bias generation toward entity-faithful outputs. Our approach does not rely on human-written factuality annotations, enabling scalable fine-tuning. Experiments demonstrate consistent improvements in EHI across datasets, with qualitative analysis revealing a significant reduction in entity-level hallucinations without degradation in fluency or informativeness. We release a reproducible Colab pipeline, facilitating further research on hallucination-aware model fine-tuning using lightweight hallucination metrics like EHI.
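
To make the reward signal concrete, a toy version of an EHI-style score checks what fraction of the entities in a generated summary are grounded in the source. This simplification ignores the correctness component the paper describes; names and data are illustrative:

```python
def ehi(summary_entities, source_entities):
    """Toy Entity Hallucination Index: fraction of summary entities grounded in
    the source. The paper's metric also scores presence and correctness of
    entity mentions; this sketch only checks grounding."""
    if not summary_entities:
        return 1.0
    grounded = sum(e in source_entities for e in summary_entities)
    return grounded / len(summary_entities)

source = {"Acme Corp", "Alice", "Q3 budget"}
summary = ["Acme Corp", "Alice", "Bob"]  # "Bob" is hallucinated
print(ehi(summary, source))              # 2/3, used as the RL reward signal
```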

[33] CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

Main category: cs.CL

TL;DR: A benchmark for open-ended regional QA with textual and visual modalities is introduced, using LLMs as baselines. The dataset includes multilingual questions grounded in Wikipedia, revealing gaps in LLMs’ regional knowledge and weak correlation between automated metrics and human judgment.

DetailsMotivation: To assess regional knowledge in LLMs, study cross-lingual consistency, and improve evaluation metrics for open-ended QA.

Method: Manually curated multilingual dataset (Czech, Slovak, Ukrainian, English) with textual and visual questions. Evaluated LLMs via prompting and human judgments.

Result: Significant gap in LLMs’ regional knowledge; minimal correlation between automated metrics and human evaluations.

Conclusion: The dataset aids in evaluating LLMs’ regional knowledge, cross-lingual consistency, and refining QA evaluation metrics.

Abstract: We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.

[34] Opportunities and Challenges of LLMs in Education: An NLP Perspective

Sowmya Vajjala, Bashar Alhafni, Stefano Bannò, Kaushal Kumar Maurya, Ekaterina Kochmar

Main category: cs.CL

TL;DR: The paper explores the impact of large language models (LLMs) on education, focusing on assistance and assessment across reading, writing, speaking, and tutoring. It highlights new opportunities and challenges for future NLP-enabled educational applications.

DetailsMotivation: To understand how LLMs can transform education by enhancing teaching, learning, and assessment through NLP.

Method: Examines LLMs in educational NLP, focusing on assistance and assessment across four dimensions: reading, writing, speaking, and tutoring.

Result: Identifies new directions enabled by LLMs and key challenges to address for future applications.

Conclusion: Provides a holistic overview for NLP researchers to explore LLMs in language-focused educational applications.

Abstract: Interest in the role of large language models (LLMs) in education is increasing, considering the new opportunities they offer for teaching, learning, and assessment. In this paper, we examine the impact of LLMs on educational NLP in the context of two main application scenarios: assistance and assessment, grounding them along the four dimensions – reading, writing, speaking, and tutoring. We then present the new directions enabled by LLMs, and the key challenges to address. We envision that this holistic overview would be useful for NLP researchers and practitioners interested in exploring the role of LLMs in developing language-focused and NLP-enabled educational applications of the future.

[35] MASCA: LLM based-Multi Agents System for Credit Assessment

Gautam Jajoo, Pranjal A Chitale, Saksham Agarwal

Main category: cs.CL

TL;DR: MASCA is an LLM-driven multi-agent system for credit assessment, outperforming traditional methods by leveraging hierarchical collaboration and contrastive learning.

DetailsMotivation: Credit assessment is underexplored in LLM applications, with traditional methods relying on rule-based or statistical approaches.

Method: MASCA uses a layered architecture with specialized LLM-based agents and integrates contrastive learning for risk/reward assessment.

Result: Experiments show MASCA outperforms baseline methods in credit scoring.

Conclusion: Hierarchical LLM-based multi-agent systems like MASCA are effective for financial applications, particularly credit assessment.

Abstract: Recent advancements in financial problem-solving have leveraged LLMs and agent-based systems, with a primary focus on trading and financial modeling. However, credit assessment remains an underexplored challenge, traditionally dependent on rule-based methods and statistical models. In this paper, we introduce MASCA, an LLM-driven multi-agent system designed to enhance credit evaluation by mirroring real-world decision-making processes. The framework employs a layered architecture where specialized LLM-based agents collaboratively tackle sub-tasks. Additionally, we integrate contrastive learning for risk and reward assessment to optimize decision-making. We further present a signaling game theory perspective on hierarchical multi-agent systems, offering theoretical insights into their structure and interactions. Our paper also includes a detailed bias analysis in credit assessment, addressing fairness concerns. Experimental results demonstrate that MASCA outperforms baseline approaches, highlighting the effectiveness of hierarchical LLM-based multi-agent systems in financial applications, particularly in credit scoring.

[36] DBLPLink 2.0: An Entity Linker for the DBLP Scholarly Knowledge Graph

Debayan Banerjee, Tilahun Abedissa Taffa, Ricardo Usbeck

Main category: cs.CL

TL;DR: A zero-shot entity linker for DBLP’s 2025 RDF-based Knowledge Graph, using LLMs and a novel re-ranking method based on log-probabilities of the ‘yes’ token.

DetailsMotivation: To improve entity linking in DBLP's updated Knowledge Graph by introducing a zero-shot approach, leveraging LLMs for better performance without extensive training.

Method: Developed a zero-shot entity linker using LLMs, re-ranking candidates based on log-probabilities of the ‘yes’ token at the penultimate layer.

Result: Proposed method avoids the need for training KG-embeddings or re-rankers, simplifying the entity linking process.

Conclusion: The zero-shot approach with LLMs offers a scalable and efficient solution for entity linking in DBLP’s Knowledge Graph.

Abstract: In this work we present an entity linker for DBLP’s 2025 version of RDF-based Knowledge Graph. Compared to the 2022 version, DBLP now considers publication venues as a new entity type called dblp:Stream. In the earlier version of DBLPLink, we trained KG-embeddings and re-rankers on a dataset to produce entity linkings. In contrast, in this work, we develop a zero-shot entity linker using LLMs using a novel method, where we re-rank candidate entities based on the log-probabilities of the “yes” token output at the penultimate layer of the LLM.
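
The re-ranking idea can be sketched directly: ask the LLM a yes/no question about each candidate and rank by the log-probability it assigns to a "yes" continuation. The sketch below reads the final-layer next-token distribution of gpt2 as a stand-in, whereas the paper reads the "yes" token's log-probability at the penultimate layer of a larger LLM; the prompt wording is also assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
yes_id = tok.encode(" yes")[0]

def yes_logprob(mention, candidate):
    prompt = (f'Does the entity "{candidate}" match the mention "{mention}"? '
              "Answer yes or no. Answer:")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # next-token logits
    return torch.log_softmax(logits, dim=-1)[yes_id].item()

candidates = ["ACL 2025 (conference)", "ACL (anatomy)"]
print(sorted(candidates, key=lambda c: yes_logprob("ACL", c), reverse=True))
```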

[37] Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization

Weijia Zhang, Songgaojun Deng, Evangelos Kanoulas

Main category: cs.CL

TL;DR: The paper proposes TaSoF, a structured plan for query-focused table summarization, and SPaGe, a framework that formalizes reasoning into three phases, outperforming prior models in benchmarks.

DetailsMotivation: Natural language plans for table summarization are ambiguous and lack structure, limiting scalability and conversion to executable programs like SQL.

Method: Introduces TaSoF, a structured plan, and SPaGe, a framework with three phases: Structured Planning, Graph-based Execution, and Summary Generation.

Result: SPaGe outperforms prior models in single- and multi-table benchmarks, showing robustness and scalability.

Conclusion: Structured representations like TaSoF and SPaGe enhance reliability and scalability in query-focused table summarization.

Abstract: Query-focused table summarization requires complex reasoning, often approached through step-by-step natural language (NL) plans. However, NL plans are inherently ambiguous and lack structure, limiting their conversion into executable programs like SQL and hindering scalability, especially for multi-table tasks. To address this, we propose a paradigm shift to structured representations. We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases: 1) Structured Planning to generate TaSoF from a query, 2) Graph-based Execution to convert plan steps into SQL and model dependencies via a directed acyclic graph for parallel execution, and 3) Summary Generation to produce query-focused summaries. Our method explicitly captures complex dependencies and improves reliability. Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings, demonstrating the advantages of structured representations for robust and scalable summarization.
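
The execution phase rests on a standard pattern: run each SQL step once its prerequisites have finished, and run independent steps in parallel. A small sketch with a hypothetical plan and a stand-in database call:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical plan: step -> (SQL string, prerequisite steps).
plan = {
    "s1": ("SELECT * FROM sales", []),
    "s2": ("SELECT * FROM regions", []),
    "s3": ("-- join results of s1 and s2", ["s1", "s2"]),
}

def execute(sql):
    return f"result of: {sql}"  # stand-in for a real database call

results, done = {}, set()
with ThreadPoolExecutor() as pool:
    while len(done) < len(plan):
        # Steps whose dependencies are all satisfied can run concurrently.
        ready = [s for s, (_, deps) in plan.items()
                 if s not in done and all(d in done for d in deps)]
        for step, out in zip(ready, pool.map(execute, [plan[s][0] for s in ready])):
            results[step] = out
            done.add(step)
print(results["s3"])
```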

[38] Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning

Kwesi Cobbina, Tianyi Zhou

Main category: cs.CL

TL;DR: The paper identifies a positional bias in in-context learning (ICL) called DEMOS’ POSITION IN PROMPT (DPP) bias, showing that demo placement affects model accuracy and predictions. Placing demos at the start of the prompt yields the best results.

DetailsMotivation: To investigate the unexplored positional bias in ICL, where the placement of demonstrations (demos) in the prompt affects model performance.

Method: A systematic evaluation pipeline is designed to study positional bias across tasks like classification, QA, summarization, and reasoning. Two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, quantify the impact.

Result: Experiments on ten LLMs show that demo placement significantly affects accuracy and predictions. Placing demos at the start improves stability (+6 points), while placing them at the end flips 30% of predictions without accuracy gains. Smaller models are most affected.

Conclusion: The study highlights the importance of demo placement in ICL, with optimal positioning at the prompt’s start for stable and accurate outputs.

Abstract: In-context learning (ICL) is a critical emerging capability of large language models (LLMs), enabling few-shot learning during inference by including a few demonstrations (demos) in the prompt. However, it has been found that ICL’s performance can be sensitive to the choices of demos and their order. This paper investigates an unexplored new positional bias of ICL for the first time: we observe that the predictions and accuracy can drift drastically when the positions of demos, the system prompt, and the user message in LLM input are varied. We refer to this bias as DEMOS’ POSITION IN PROMPT (DPP) bias. We design a systematic evaluation pipeline to study this type of positional bias across classification, question answering, summarization, and reasoning tasks. We introduce two metrics, ACCURACY-CHANGE and PREDICTION-CHANGE, to quantify net gains and output volatility induced by changes in the demos’ position. Extensive experiments on ten LLMs from four open-source model families (QWEN, LLAMA3, MISTRAL, COHERE) verify that the bias significantly affects their accuracy and predictions: placing demos at the start of the prompt yields the most stable and accurate outputs with gains of up to +6 points. In contrast, placing demos at the end of the user message flips over 30% of predictions without improving correctness on QA tasks. Smaller models are most affected by this sensitivity, though even large models remain marginally affected on complex tasks.
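
The two metrics are straightforward to compute from paired runs of the same evaluation with demos in two positions. A small sketch, under the assumption that the metrics match their names (net accuracy delta and fraction of flipped predictions):

```python
def accuracy_change(acc_new, acc_baseline):
    # Net gain or loss in accuracy when demos move from the baseline position
    # (e.g. start of the prompt) to a new position.
    return acc_new - acc_baseline

def prediction_change(preds_new, preds_baseline):
    # Fraction of examples whose prediction flips when the demos move,
    # regardless of whether the flip helps or hurts.
    flips = sum(a != b for a, b in zip(preds_new, preds_baseline))
    return flips / len(preds_baseline)

baseline = ["A", "B", "A", "C"]
moved = ["A", "C", "A", "B"]
print(prediction_change(moved, baseline))  # 0.5: half the predictions flipped
```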

[39] Towards the Law of Capacity Gap in Distilling Language Models

Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu

Main category: cs.CL

TL;DR: The paper introduces the ’law of capacity gap’ to optimize teacher-student LM distillation, showing linear scaling between teacher and student sizes for best performance.

DetailsMotivation: Address the computational inefficiency of finding optimal teacher sizes for LM distillation, especially with large LMs.

Method: Derive the ’law of capacity gap’ from distillation experiments on small-scale LMs and validate it on larger-scale (7B) LMs.

Result: Versatile LLMs outperformed competitors by applying the linear scaling law.

Conclusion: The ’law of capacity gap’ efficiently guides optimal teacher selection for LM distillation, reducing computational costs.

Abstract: Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of a substantial capacity gap between the teacher and student. This issue, often referred to as the “curse of capacity gap”, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the “law of capacity gap”, derived from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher consistently scales linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.
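
The law itself reduces to a one-line rule of thumb: the optimal teacher size grows linearly with the student size. The slope below is a made-up placeholder; the paper fits the actual constant from distillation runs on sub-3B models:

```python
def optimal_teacher_size(student_params: float, k: float = 2.5) -> float:
    # Law of capacity gap (sketch): optimal teacher scale = k * student scale.
    # k = 2.5 is a hypothetical slope, not the paper's fitted value.
    return k * student_params

print(optimal_teacher_size(1.5e9))  # hypothetical: a ~3.75B-parameter teacher
```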

[40] Instruction-tuned Large Language Models for Machine Translation in the Medical Domain

Miguel Rios

Main category: cs.CL

TL;DR: Instruction-tuned LLMs outperform baseline models in medical machine translation, especially with added specialized terminology.

DetailsMotivation: LLMs underperform in specialized domains like medical translation, where terminology consistency is critical.

Method: Compare baseline LLMs with instruction-tuned LLMs and incorporate medical terminology into datasets for fine-tuning.

Result: Instruction-tuned LLMs show significant improvement over baseline models in automatic metrics.

Conclusion: Fine-tuning LLMs with domain-specific instructions and terminology enhances translation performance in specialized domains.

Abstract: Large Language Models (LLMs) have shown promising results on machine translation for high resource language pairs and domains. However, in specialised domains (e.g. medical) LLMs have shown lower performance compared to standard neural machine translation models. The consistency in the machine translation of terminology is crucial for users, researchers, and translators in specialised domains. In this study, we compare the performance between baseline LLMs and instruction-tuned LLMs in the medical domain. In addition, we introduce terminology from specialised medical dictionaries into the instruction formatted datasets for fine-tuning LLMs. The instruction-tuned LLMs significantly outperform the baseline models with automatic metrics.

[41] Past Meets Present: Creating Historical Analogy with Large Language Models

Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao

Main category: cs.CL

TL;DR: The paper explores methods for acquiring historical analogies using LLMs, proposing a self-reflection technique to improve accuracy and reduce biases.

DetailsMotivation: Historical analogies aid decision-making but are hard to find, and prior AI research has neglected this area.

Method: Retrieval and generation methods using LLMs, enhanced by a self-reflection approach to address hallucinations and stereotypes.

Result: LLMs show potential for historical analogies, with performance improving using the self-reflection method.

Conclusion: The study highlights LLMs’ capability for historical analogies and the effectiveness of self-reflection in enhancing results.

Abstract: Historical analogies, which compare known past events with contemporary but unfamiliar events, are an important tool that helps people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies, and previous studies in the AI community have likewise overlooked historical analogies. To fill this gap, in this paper we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have good potential for historical analogies, and that their performance can be further improved by using our self-reflection method.

[42] Neutral Residues: Revisiting Adapters for Model Extension

Franck Signe Talla, Edouard Grave, Hervé Jégou

Main category: cs.CL

TL;DR: The paper introduces ’neutral residues,’ an improved adapter method for extending pretrained LLMs to new domains without degrading original domain performance.

DetailsMotivation: To address the trade-off between adapting to new domains and maintaining performance on the original domain in pretrained LLMs.

Method: Enhances adapters by jointly optimizing data, architecture, and training, ensuring new residual blocks output near-zeros on the original domain.

Result: Outperforms finetuning, LoRA, and vanilla adapters in adapting to a new language while preserving English performance.

Conclusion: Neutral residues offer a superior solution for domain adaptation in LLMs, balancing new domain learning and original domain retention.

Abstract: We address the problem of extending a pretrained large language model to a new domain that was not seen during training. Standard techniques, such as finetuning or low-rank adaptation (LoRA) are successful at domain adaptation, but do not formally add capacity to the model. This often leads to a trade-off, between performing well on the new domain vs. degrading performance on the original domain. Here, we revisit and improve adapters to extend LLMs from three angles: data, architecture and training procedure, which are advantageously considered jointly. The resulting method, called neutral residues, modifies adapters in a way that leads each new residual block to output near-zeros on the original domain. This solution leads to strong results when adapting a state-of-the-art model originally trained on English to a new language. Neutral residues significantly outperform competing approaches such as finetuning, LoRA or vanilla adapters in terms of the trade-off between learning the new language and not forgetting English.
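
The mechanism is easy to picture as a bottleneck adapter added as a residual block, trained with an extra penalty that drives its residual toward zero on original-domain inputs. Dimensions, loss weights, and the placeholder task objective below are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class NeutralAdapter(nn.Module):
    """Bottleneck adapter used as an extra residual block. An auxiliary loss
    pushes its residual toward zero on original-domain inputs, approximately
    recovering the frozen pretrained model there."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def residual(self, h):
        return self.up(torch.relu(self.down(h)))

    def forward(self, h):
        return h + self.residual(h)

adapter = NeutralAdapter()
h_new = torch.randn(4, 10, 768)  # new-domain hidden states (stand-ins)
h_old = torch.randn(4, 10, 768)  # original-domain hidden states (stand-ins)
task_loss = adapter(h_new).pow(2).mean()              # placeholder task objective
neutral_loss = adapter.residual(h_old).pow(2).mean()  # drive residual toward 0
loss = task_loss + 0.1 * neutral_loss
```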

[43] Natural Language Processing for the Legal Domain: A Survey of Tasks, Datasets, Models, and Challenges

Farid Ariai, Joel Mackenzie, Gianluca Demartini

Main category: cs.CL

TL;DR: A survey on NLP in the legal field, reviewing 131 studies, highlighting challenges like complex language and limited datasets, and identifying 16 open research challenges.

DetailsMotivation: To explore NLP's potential in the legal sector, addressing unique challenges and tasks like summarization and bias detection.

Method: Followed the PRISMA framework, reviewing 154 studies and filtering to 131, analyzing legal NLP tasks and language models.

Result: Identified key NLP tasks, legal-oriented models, and 16 open challenges, including bias mitigation and model interpretability.

Conclusion: NLP holds promise for legal applications but requires addressing challenges like bias and explainability to advance the field.

Abstract: Natural Language Processing (NLP) is revolutionising the way both professionals and laypersons operate in the legal field. The considerable potential for NLP in the legal sector, especially in developing computational assistance tools for various legal processes, has captured the interest of researchers for years. This survey follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses framework, reviewing 154 studies, with a final selection of 131 after manual filtering. It explores foundational concepts related to NLP in the legal domain, illustrating the unique aspects and challenges of processing legal texts, such as extensive document lengths, complex language, and limited open legal datasets. We provide an overview of NLP tasks specific to legal text, such as Document Summarisation, Named Entity Recognition, Question Answering, Argument Mining, Text Classification, and Judgement Prediction. Furthermore, we analyse both developed legal-oriented language models, and approaches for adapting general-purpose language models to the legal domain. Additionally, we identify sixteen open research challenges, including the detection and mitigation of bias in artificial intelligence applications, the need for more robust and interpretable models, and improving explainability to handle the complexities of legal language and reasoning.

[44] Yankari: A Monolingual Yoruba Dataset

Maro Akpobi

Main category: cs.CL

TL;DR: Yankari is a large-scale monolingual dataset for Yoruba, addressing the lack of NLP resources for this widely spoken but underrepresented language.

DetailsMotivation: To bridge the gap in NLP resources for Yoruba, a West African language spoken by over 30 million people but lacking adequate digital tools.

Method: The dataset was created through careful source selection, automated quality control, and rigorous data cleaning, totaling 51,407 documents from 13 diverse sources.

Result: Yankari comprises over 30 million tokens and demonstrates high quality through automated evaluations, outperforming existing resources.

Conclusion: Yankari significantly advances Yoruba language resources, enabling better NLP models, linguistic studies, and digital accessibility.

Abstract: This paper presents Yankari, a large-scale monolingual dataset for the Yoruba language, aimed at addressing the critical gap in Natural Language Processing (NLP) resources for this important West African language. Despite being spoken by over 30 million people, Yoruba has been severely underrepresented in NLP research and applications. We detail our methodology for creating this dataset, which includes careful source selection, automated quality control, and rigorous data cleaning processes. The Yankari dataset comprises 51,407 documents from 13 diverse sources, totaling over 30 million tokens. Our approach focuses on ethical data collection practices, avoiding problematic sources and addressing issues prevalent in existing datasets. We provide thorough automated evaluations of the dataset, demonstrating its quality compared to existing resources. The Yankari dataset represents a significant advancement in Yoruba language resources, providing a foundation for developing more accurate NLP models, supporting comparative linguistic studies, and contributing to the digital accessibility of the Yoruba language.

[45] Efficient Continual Learning for Small Language Models with a Discrete Key-Value Bottleneck

Andor Diera, Lukas Galke, Fabian Karl, Ansgar Scherp

Main category: cs.CL

TL;DR: A discrete key-value bottleneck (DKVB) is introduced for NLP continual learning, reducing catastrophic forgetting and computational costs while maintaining performance.

DetailsMotivation: Addressing catastrophic forgetting in NLP continual learning by enabling localized updates.

Method: Introduces DKVB for encoder-only language models, compares bottleneck architectures, and proposes task-independent key initialization.

Result: DKVB alleviates forgetting, achieves competitive performance, and works well in single-head scenarios without task IDs.

Conclusion: DKVB is an efficient and effective solution for continual learning in NLP, even in challenging scenarios.

Abstract: Continual learning remains a challenge across various natural language processing (NLP) tasks, as models updated with new training data often risk catastrophic forgetting of previously acquired knowledge. We introduce a discrete key-value bottleneck (DKVB) for encoder-only language models, enabling efficient continual learning through localized updates. Inspired by a discrete key-value bottleneck in vision, we consider new and NLP-specific challenges. We compare different bottleneck architectures for NLP and introduce a new, task-independent initialization technique for the discrete keys. We evaluate our DKVB for NLP in four continual learning scenarios and show that it alleviates catastrophic forgetting. Our experiments demonstrate that the proposed approach achieves competitive performance compared to popular continual learning methods while incurring lower computational costs. Furthermore, we show that DKVB remains effective even in challenging single-head continual learning scenarios where no task ID is provided.
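
The core of a discrete key-value bottleneck fits in a few lines: snap each encoder feature to its nearest key in a codebook and return that key's learnable value, so gradients touch only the retrieved values. Sizes and initialization below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DiscreteKeyValueBottleneck(nn.Module):
    """Minimal sketch: nearest-key lookup with learnable values. Because only
    the retrieved values receive gradients, updates stay localized, which is
    what limits catastrophic forgetting."""
    def __init__(self, num_pairs=1024, dim=768):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_pairs, dim), requires_grad=False)
        self.values = nn.Parameter(torch.randn(num_pairs, dim))

    def forward(self, features):                  # features: (batch, dim)
        dists = torch.cdist(features, self.keys)  # distance to every key
        idx = dists.argmin(dim=-1)                # nearest key per input
        return self.values[idx]                   # gradients flow into values only

dkvb = DiscreteKeyValueBottleneck()
out = dkvb(torch.randn(8, 768))
```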

[46] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao

Main category: cs.CL

TL;DR: ECG-Byte is a new method for generating text from ECG signals using a single-stage LLM training approach, improving efficiency and interpretability.

DetailsMotivation: Existing methods for ECG-to-text generation are inefficient due to multi-stage training and lack interpretability of encoder features.

Method: Proposes ECG-Byte, a BPE tokenizer pipeline that encodes ECG signals into tokens for direct end-to-end LLM training, combining ECG and text tokens.

Result: Achieves competitive NLG performance, trains 3 times faster, and uses just 48% of the data required by traditional two-stage methods.

Conclusion: ECG-Byte offers a more efficient and interpretable solution for ECG-to-text generation compared to existing approaches.

Abstract: Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding textual prompts. Existing approaches often involve a two-stage process: pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective, followed by finetuning an LLM for natural language generation (NLG) using encoder-derived features. However, these methods face two key limitations: inefficiency due to multi-stage training and challenges in interpreting encoder-generated features. To overcome these issues, we propose ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens, enabling direct end-to-end LLM training by combining ECG and text tokens. This approach enhances interpretability, as ECG tokens can be directly mapped back to the original signals. Leveraging ECG-Byte, we achieve competitive NLG performance while training 3 times faster and using just 48% of the data required by traditional two-stage methods.
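
The tokenization idea can be sketched in two steps: quantize the continuous signal into a small symbolic alphabet, then learn BPE merges over the symbol stream so frequent waveform motifs become single tokens. The signal, bin count, and vocabulary size below are illustrative, not the paper's settings:

```python
import numpy as np
from tokenizers import Tokenizer, models, trainers

# 1) Quantize a continuous ECG-like signal into 26 symbols ("a".."z").
ecg = np.sin(np.linspace(0, 60, 5000)) + 0.05 * np.random.randn(5000)
bins = np.quantile(ecg, np.linspace(0, 1, 27)[1:-1])  # 25 bin edges
symbols = "".join(chr(ord("a") + b) for b in np.digitize(ecg, bins))

# 2) Learn BPE merges over the symbol stream; frequent motifs become single
#    tokens that an LLM can consume alongside ordinary text tokens.
chunks = [symbols[i:i + 64] for i in range(0, len(symbols), 64)]
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=512, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(chunks, trainer=trainer)
print(tokenizer.encode(symbols[:100]).tokens[:10])
```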

[47] Modeling Story Expectations to Understand Engagement: A Generative Framework Using LLMs

Hortense Fong, George Gui

Main category: cs.CL

TL;DR: The paper introduces a framework using large language models to model audience forward-looking beliefs in stories, enhancing engagement prediction by 31%.

DetailsMotivation: Existing methods focus on content features but lack modeling of audience beliefs about future story developments, which are crucial for engagement.

Method: Leverages large language models to generate story continuations and extracts features like expectations, uncertainty, and surprise.

Result: Applied to 30,000 book chapters, the framework increased explanatory power by 31% and showed distinct engagement drivers.

Conclusion: The framework offers a novel way to study audience beliefs’ impact on engagement, aiding content-focused industries.

Abstract: Understanding when and why consumers engage with stories is crucial for content creators and platforms. While existing theories suggest that audience beliefs of what is going to happen should play an important role in engagement decisions, empirical work has mostly focused on developing techniques to directly extract features from actual content, rather than capturing forward-looking beliefs, due to the lack of a principled way to model such beliefs in unstructured narrative data. To complement existing feature extraction techniques, this paper introduces a novel framework that leverages large language models to model audience forward-looking beliefs about how stories might unfold. Our method generates multiple potential continuations for each story and extracts features related to expectations, uncertainty, and surprise using established content analysis techniques. Applying our method to over 30,000 book chapters, we demonstrate that our framework complements existing feature engineering techniques by amplifying their marginal explanatory power on average by 31%. The results reveal that different types of engagement (continuing to read, commenting, and voting) are driven by distinct combinations of current and anticipated content features. Our framework provides a novel way to study and explore how audience forward-looking beliefs shape their engagement with narrative media, with implications for marketing strategy in content-focused industries.
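
The generative step of the framework is simple to outline: sample several continuations of a chapter, score each with content-analysis features, and summarize the spread as expectation and uncertainty signals. The callable llm, the prompt, and the length-based placeholder feature below are all hypothetical:

```python
def expectation_features(llm, chapter, n_continuations=5):
    """Sketch: sample continuations, then reduce them to belief features.
    llm is any callable that maps a prompt to generated text."""
    continuations = [llm(f"{chapter}\n\nContinue the story:")
                     for _ in range(n_continuations)]
    # The paper applies established content-analysis techniques per
    # continuation; word count is a placeholder feature here.
    scores = [len(c.split()) for c in continuations]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return {"expected_feature": mean, "uncertainty": variance}

fake_llm = lambda prompt: "and then the hero returned home safely."
print(expectation_features(fake_llm, "Once upon a time..."))
```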

[48] Rationale-guided Prompting for Knowledge-based Visual Question Answering

Zhongjian Hu, Peng Yang, Bing Li, Fengyuan Liu

Main category: cs.CL

TL;DR: PLRH framework improves knowledge-based VQA by prompting LLMs with rationale heuristics (CoT) for intermediate thought processes, outperforming baselines by over 2.2 and 2.1 on OK-VQA and A-OKVQA.

DetailsMotivation: Prior methods directly prompt LLMs for answers without leveraging intermediate thought processes, underutilizing their capabilities.

Method: PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, which then inspire answer prediction.

Result: PLRH outperforms baselines by more than 2.2 on OK-VQA and 2.1 on A-OKVQA.

Conclusion: Activating LLMs’ intermediate reasoning via rationale heuristics significantly enhances performance in knowledge-based VQA.

Abstract: Recently, Large Language Models (LLMs) have been used for knowledge-based Visual Question Answering (VQA). Despite the encouraging results of previous studies, prior methods prompt LLMs to predict answers directly, neglecting intermediate thought processes. We argue that prior methods do not sufficiently activate the capacities of LLMs. We propose a framework called PLRH that Prompts LLMs with Rationale Heuristics for knowledge-based VQA. The PLRH prompts LLMs with Chain of Thought (CoT) to generate rationale heuristics, i.e., intermediate thought processes, and then leverages the rationale heuristics to inspire LLMs to predict answers. Experiments show that our approach outperforms the existing baselines by more than 2.2 and 2.1 points on OK-VQA and A-OKVQA, respectively.
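
The two-stage prompting pattern is easy to sketch: first elicit a chain-of-thought rationale, then condition the answer prompt on it. The prompt wording below is an illustrative guess, not the paper's templates:

```python
def plrh_prompts(question, caption):
    """Rationale-then-answer prompting in the spirit of PLRH."""
    rationale_prompt = (
        f"Image: {caption}\nQuestion: {question}\n"
        "Think step by step about what knowledge is needed. Rationale:"
    )
    def answer_prompt(rationale):
        return (
            f"Image: {caption}\nQuestion: {question}\n"
            f"Rationale: {rationale}\nGiven this rationale, the answer is:"
        )
    return rationale_prompt, answer_prompt

r_prompt, a_prompt = plrh_prompts("What sport is shown?", "a person on a snowboard")
# rationale = llm(r_prompt); answer = llm(a_prompt(rationale))  # pseudo-calls
```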

[49] FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, Xiaobo Zhang

Main category: cs.CL

TL;DR: FineMedLM-o1 improves medical LLMs with advanced reasoning, using synthetic data and Test-Time Training (TTT), outperforming prior models by 23% and gaining an extra 14% boost from TTT.

DetailsMotivation: Existing medical LLMs lack deep reasoning for complex tasks like differential diagnosis. FineMedLM-o1 aims to bridge this gap.

Method: Uses high-quality synthetic data and long-form reasoning for SFT and DPO, introduces TTT for domain adaptation, and proposes a novel medical dialogue synthesis method.

Result: 23% average improvement over prior models, with an additional 14% boost from TTT. The synthesized dataset surpasses other open-source datasets in quality and complexity.

Conclusion: FineMedLM-o1 advances medical LLMs with superior reasoning and adaptation, supported by high-quality data. Project and data will be open-sourced.

Abstract: Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the deep reasoning required for complex medical problems, such as differential diagnosis and medication recommendations. We propose FineMedLM-o1, which leverages high-quality medical synthetic data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.

[50] GneissWeb: Preparing High Quality Data for LLMs at Scale

Hajar Emami Gohari, Swanand Ravindra Kadhe, Syed Yousaf Shah, Constantin Adam, Abdulhamid Adebayo, Praneet Adusumilli, Farhan Ahmed, Nathalie Baracaldo Angel, Santosh Subhashrao Borse, Yuan-Chi Chang, Xuan-Hong Dang, Nirmit Desai, Revital Eres, Ran Iwamoto, Alexei Karve, Yan Koyfman, Wei-Han Lee, Changchang Liu, Boris Lublinsky, Takuyo Ohko, Pablo Pesce, Maroun Touma, Shiqiang Wang, Shalisha Witherspoon, Herbert Woisetschläger, David Wood, Kun-Lung Wu, Issei Yoshida, Syed Zawad, Petros Zerfos, Yi Zhou, Bishwaranjan Bhattacharjee

Main category: cs.CL

TL;DR: GneissWeb is a large dataset (10 trillion tokens) designed for training LLMs, outperforming existing open datasets like FineWeb-V1.1.0 in benchmarks.

DetailsMotivation: Existing open datasets for LLMs are either small or lack quality, limiting model performance. GneissWeb addresses this gap.

Method: GneissWeb uses sharded exact sub-string deduplication and quality filters to balance data quality and quantity.

Result: Models trained on GneissWeb outperform FineWeb-V1.1.0 by 2.73 percentage points on 11 benchmarks and 1.75 points on 20 benchmarks.

Conclusion: GneissWeb provides a high-quality, large-scale dataset for LLM training, improving model performance over existing open datasets.

Abstract: Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM’s ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. Our GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage points advantage over those trained on FineWeb-V1.1.0.
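
As a toy illustration of exact sub-string deduplication, one can hash fixed-length substrings and drop documents whose substrings were already seen in the shard. Production recipes (e.g. suffix-array-based dedup, sharded across machines) instead remove the duplicated spans; this sketch only conveys the idea:

```python
import hashlib

def dedup_shard(documents, window=50):
    """Drop a document if any length-`window` substring was already seen."""
    seen, kept = set(), []
    for doc in documents:
        hashes = {hashlib.md5(doc[i:i + window].encode()).digest()
                  for i in range(0, max(1, len(doc) - window), window)}
        if hashes & seen:
            continue            # duplicated content: skip this document
        seen |= hashes
        kept.append(doc)
    return kept

docs = ["x" * 120, "x" * 120, "fresh text " * 20]
print(len(dedup_shard(docs)))   # 2: the verbatim copy is dropped
```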

[51] QE4PE: Word-level Quality Estimation for Human Post-Editing

Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza

Main category: cs.CL

TL;DR: The study examines how word-level quality estimation (QE) methods impact human post-editing of machine translations, comparing four error-highlighting methods in a realistic setting with 42 professionals.

DetailsMotivation: To assess the usability and downstream effects of word-level QE on post-editing speed, quality, and editing choices, which are understudied despite extensive accuracy evaluations.

Method: Comparison of four error-span highlight modalities (supervised and uncertainty-based QE methods) in a realistic post-editing scenario with professional editors across two translation directions.

Result: Domain, language, and editor speed significantly influence highlight effectiveness, with minor differences between human and automated QE highlights, revealing a gap between accuracy and usability.

Conclusion: Word-level QE’s impact on post-editing is context-dependent, with automated methods showing potential but requiring further alignment with professional workflows.

Abstract: Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors’ speed are critical factors in determining highlights’ effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.

[52] Cross-Modal State-Space Graph Reasoning for Structured Summarization

Hannah Kim, Sofia Martinez, Jason Lee

Main category: cs.CL

TL;DR: Proposes CSS-GR, a cross-modal summarization framework using state-space models and graph reasoning, improving quality and interpretability while being computationally efficient.

DetailsMotivation: Addressing high computational overheads and limited interpretability in prior cross-modal summarization methods.

Method: Combines state-space models with graph-based message passing to capture inter- and intra-modal relationships for holistic reasoning.

Result: Significantly improves summarization quality and interpretability, validated on standard benchmarks.

Conclusion: CSS-GR offers an efficient and interpretable solution for cross-modal summarization, supported by ablation studies.

Abstract: The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.

[53] Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

Jihyun Janice Ahn, Wenpeng Yin

Main category: cs.CL

TL;DR: The paper introduces Prompt-Reverse Inconsistency (PRIN), a new type of inconsistency in LLMs where models give conflicting answers when asked to identify correct vs. incorrect responses. It explores PRIN’s impact, mitigation, and relationship with other inconsistencies.

DetailsMotivation: To uncover and analyze PRIN, a previously unstudied inconsistency in LLMs, which challenges their reliability as judges and adherence to logical rules.

Method: Conducted experiments to measure PRIN across LLMs, tested mitigation strategies, and compared it with Randomness and Paraphrase Inconsistencies.

Result: PRIN is prevalent and undermines LLM credibility, highlighting a need for improved consistency in model judgments.

Conclusion: The study provides insights into LLM behavior, contributing to the development of more trustworthy AI systems.

Abstract: While the inconsistency of LLMs is not a novel topic, prior research has predominantly addressed two types of generative inconsistencies: i) Randomness Inconsistency: running the same LLM over multiple trials yields varying responses; ii) Paraphrase Inconsistency: paraphrased prompts elicit different responses from the same LLM. Randomness Inconsistency arises from the inherent randomness due to stochastic sampling in generative models, while Paraphrase Inconsistency is a consequence of the language modeling objectives, where paraphrased prompts alter the distribution of vocabulary logits. This research discovers Prompt-Reverse Inconsistency (PRIN), a new form of LLM self-inconsistency: given a question and a couple of LLM-generated answer candidates, the LLM often gives conflicting responses when prompted “Which are correct answers?” and “Which are incorrect answers?”. PRIN poses a serious concern, as it undermines the credibility of LLM-as-a-judge and challenges LLMs’ ability to adhere to basic logical rules. We conduct a series of experiments to investigate PRIN, examining the extent of PRIN across different LLMs, methods to mitigate it, potential applications, and its relationship with Randomness Inconsistency and Paraphrase Inconsistency. As the first study to explore PRIN, our findings offer valuable insights into the inner workings of LLMs and contribute to advancing trustworthy AI.
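
As a rough illustration of how PRIN could be measured, the sketch below queries a model with both the forward and reversed prompts and flags answers whose selections fail to form a consistent partition of the candidates. The `query_llm` helper is a hypothetical stand-in for any chat-completion call, not the authors' implementation:

```python
def query_llm(question: str, candidates: dict[str, str], instruction: str) -> set[str]:
    """Placeholder: prompt an LLM with the question, the labeled answer
    candidates, and the instruction ('Which are correct answers?' or
    'Which are incorrect answers?'), then parse the selected labels."""
    raise NotImplementedError

def prin_conflict(question: str, candidates: dict[str, str]) -> bool:
    correct = query_llm(question, candidates, "Which are correct answers?")
    incorrect = query_llm(question, candidates, "Which are incorrect answers?")
    labels = set(candidates)
    # Logically consistent selections partition the candidates: the two
    # sets must be disjoint and jointly cover every candidate.
    return bool(correct & incorrect) or (correct | incorrect) != labels
```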

[54] Voices of Freelance Professional Writers on AI: Limitations, Expectations, and Fears

Anastasiia Ivanova, Natalia Fedorova, Sergei Tilga, Ekaterina Artemova

Main category: cs.CL

TL;DR: The paper explores AI-driven tools’ impact on professional writing, focusing on language support, ethics, and creativity. Surveys reveal insights on LLM adoption, misinformation, and usability.

DetailsMotivation: To address underexplored aspects of LLM adoption in professional writing, such as multilingual support, ethical concerns, and long-term effects on creativity.

Method: Conducted a questionnaire (N=301) and interactive survey (N=36) with professional writers using AI, analyzing practices across 25+ languages, ethics, and user expectations.

Result: Findings highlight LLM adoption for non-English speakers, misinformation risks, domain/style adaptation, and usability features.

Conclusion: The insights can guide future LLM development to better serve writers and a broader audience.

Abstract: The rapid development of AI-driven tools, particularly large language models (LLMs), is reshaping professional writing. Still, key aspects of their adoption such as language support, ethics, and long-term impact on writers’ voice and creativity remain underexplored. In this work, we conducted a questionnaire (N = 301) and an interactive survey (N = 36) targeting professional writers who regularly use AI. We examined LLM-assisted writing practices across 25+ languages, ethical concerns, and user expectations. The survey findings offer important insights into: LLM adoption among non-English speakers; the degree of misinformation and of domain and style adaptation; and the usability and key features of LLMs. These insights can guide further development, benefiting both writers and a broader user base.

[55] Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation

Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Tom Kocmi

Main category: cs.CL

TL;DR: The paper highlights the lack of comprehensive and rigorous evaluation practices for multilingual large language models (mLLMs) and proposes adopting best practices from machine translation (MT) evaluation to improve mLLM assessment.

DetailsMotivation: Current evaluation methods for mLLMs are inconsistent and lack scientific rigor, hindering their development. The paper aims to address this by leveraging insights from MT evaluation.

Method: The authors conduct experiments across key stages of the generative evaluation pipeline, applying MT evaluation best practices to mLLMs. They also identify components for robust meta-evaluation.

Result: The study demonstrates how MT evaluation practices can enhance understanding of mLLM quality differences and provides a checklist for improving mLLM evaluation.

Conclusion: The paper concludes with actionable recommendations for mLLM research, emphasizing the need for transparent and rigorous evaluation methods.

Abstract: Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for generative abilities of mLLMs are still lacking comprehensiveness, scientific rigor, and consistent adoption across research labs, which undermines their potential to meaningfully guide mLLM development. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards and reliable evaluations for multilingual generative models. Through targeted experiments across key stages of the generative evaluation pipeline, we demonstrate how best practices from MT evaluation can deepen the understanding of quality differences between models. Additionally, we identify essential components for robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are rigorously assessed. We distill these insights into a checklist of actionable recommendations for mLLM research and development.

[56] IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe

Main category: cs.CL

TL;DR: IterKey enhances RAG by using LLMs to iteratively refine keywords for sparse retrieval, balancing accuracy and interpretability.

DetailsMotivation: Address the trade-off between accuracy (dense retrieval) and interpretability (sparse retrieval) in RAG systems.

Method: Three-stage LLM-driven framework: keyword generation, answer generation, and validation, iterating if validation fails.

Result: 5% to 20% accuracy improvement over BM25-based RAG, comparable to dense retrieval methods.

Conclusion: IterKey effectively balances accuracy and interpretability in RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.
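
A minimal sketch of the three-stage loop described above, based on the abstract; `llm` and `bm25_search` are hypothetical stand-ins, and the authors' prompts and stopping criteria are not reproduced here:

```python
def iterkey(question: str, llm, bm25_search, max_iters: int = 3) -> str:
    feedback = ""
    answer = ""
    for _ in range(max_iters):
        # Stage 1: generate sparse-retrieval keywords (refined on retry).
        keywords = llm(f"Generate search keywords for: {question}\n{feedback}")
        docs = bm25_search(keywords)
        # Stage 2: answer from the retrieved documents.
        answer = llm(f"Answer '{question}' using:\n{docs}")
        # Stage 3: self-validate; stop as soon as the answer passes.
        if llm(f"Is '{answer}' a valid answer to '{question}'? yes/no") == "yes":
            break
        feedback = f"Previous keywords '{keywords}' failed; refine them."
    return answer
```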

[57] What Are They Talking About? A Benchmark of Knowledge-Grounded Discussion Summarization

Weixiao Zhou, Junnan Zhu, Gengyao Li, Xianfu Cheng, Xinnian Liang, Feifei Zhai, Zhoujun Li

Main category: cs.CL

TL;DR: The paper introduces Knowledge-Grounded Discussion Summarization (KGDS) to address the limitations of traditional dialogue summarization by providing background context and clarified opinion summaries. It also presents a benchmark and evaluation framework, revealing challenges for advanced LLMs.

DetailsMotivation: Traditional dialogue summarization often fails for discussions with shared background, leading to unclear summaries. KGDS aims to solve this by incorporating background context and explicit references.

Method: The authors propose KGDS, create a benchmark with news-discussion pairs and expert annotations, and introduce a hierarchical evaluation framework with fine-grained metrics. They evaluate 12 LLMs.

Result: Advanced LLMs struggle with KGDS, missing key facts in background summaries and failing to resolve implicit references in opinion summaries.

Conclusion: KGDS is a challenging task for current LLMs, highlighting the need for improved models to handle context-rich discussions effectively.

Abstract: Traditional dialogue summarization primarily focuses on dialogue content, assuming it comprises adequate information for a clear summary. However, this assumption often fails for discussions grounded in shared background, where participants frequently omit context and use implicit references. This results in summaries that are confusing to readers unfamiliar with the background. To address this, we introduce Knowledge-Grounded Discussion Summarization (KGDS), a novel task that produces a supplementary background summary for context and a clear opinion summary with clarified references. To facilitate research, we construct the first KGDS benchmark, featuring news-discussion pairs and expert-created multi-granularity gold annotations for evaluating sub-summaries. We also propose a novel hierarchical evaluation framework with fine-grained and interpretable metrics. Our extensive evaluation of 12 advanced large language models (LLMs) reveals that KGDS remains a significant challenge. The models frequently miss key facts and retain irrelevant ones in background summarization, and often fail to resolve implicit references in opinion summary integration.

[58] Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

Main category: cs.CL

TL;DR: SDCV improves LLM steering by denoising concept vectors using sparse autoencoders, boosting success rates by 4-16%.

DetailsMotivation: Existing methods for steering LLMs with linear concept vectors are hindered by noisy features in diverse datasets, reducing robustness.

Method: SDCV selectively retains discriminative SAE latents, scaling up top-k latents to separate concept signals from noise.

Result: SDCV enhances steering success rates by 4-16% across six challenging concepts while preserving topic relevance.

Conclusion: SDCV effectively denoises concept vectors, improving robustness and success rates in LLM steering.

Abstract: Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keep the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
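
A hedged sketch of the denoising step as the abstract describes it: score SAE latents by how well they separate positive from negative samples, keep and scale up the top-k, and reconstruct the hidden state. The SAE parameter names and the mean-gap scoring rule are illustrative assumptions, not the authors' code:

```python
import numpy as np

def sdcv(h, W_enc, W_dec, pos_acts, neg_acts, k=32, scale=2.0):
    """h: (d,) hidden state; W_enc: (d, m); W_dec: (m, d);
    pos_acts/neg_acts: (n, m) latent activations on labeled samples."""
    z = np.maximum(h @ W_enc, 0.0)            # SAE latents (ReLU)
    # Discriminative score: mean activation gap between the two classes.
    gap = np.abs(pos_acts.mean(0) - neg_acts.mean(0))
    top_k = np.argsort(gap)[-k:]              # most concept-relevant latents
    z_denoised = np.zeros_like(z)
    z_denoised[top_k] = scale * z[top_k]      # amplify signal, drop noise
    return z_denoised @ W_dec                 # denoised hidden representation
```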

[59] MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin

Main category: cs.CL

TL;DR: MiniLongBench is a compressed version of LongBench, reducing evaluation costs to 4.5% while maintaining high correlation (0.97) with original results.

DetailsMotivation: Existing LCU benchmarks are costly due to redundancy, hindering efficient evaluation of LLMs.

Method: Proposed a concise data compression method for long-text data, pruning LongBench to create MiniLongBench with 237 samples.

Result: MiniLongBench reduces evaluation costs significantly (4.5%) and maintains strong correlation (0.97) with LongBench.

Conclusion: MiniLongBench is a low-cost, efficient benchmark for future LCU research in LLMs.

Abstract: Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
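
The headline claim, that a 237-sample subset preserves model rankings, reduces to a rank-correlation check like the one below; the per-model scores are made-up placeholders:

```python
from scipy.stats import spearmanr

# Hypothetical average scores of three models on the full and pruned benchmarks.
full_scores = {"model_a": 41.2, "model_b": 38.7, "model_c": 45.0}
mini_scores = {"model_a": 40.1, "model_b": 37.9, "model_c": 44.2}

models = sorted(full_scores)
rho, _ = spearmanr([full_scores[m] for m in models],
                   [mini_scores[m] for m in models])
print(f"Rank correlation with the full benchmark: {rho:.2f}")
```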

[60] Leveraging Large Language Models for Bengali Math Word Problem Solving with Chain of Thought Reasoning

Bidyarthi Paul, Jalisha Jashim Era, Mirazur Rahman Zim, Tahmid Sattar Aothoi, Faisal Muhammad Shah

Main category: cs.CL

TL;DR: The paper introduces SOMADHAN, a dataset for Bengali Math Word Problems (MWPs), and evaluates LLMs using CoT prompting, achieving 88% accuracy with LLaMA-3.3 70B.

DetailsMotivation: Addressing the lack of human-annotated Bengali MWP datasets and improving mathematical reasoning in low-resource languages.

Method: Created SOMADHAN dataset (8792 problems), evaluated LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA, etc.) with zero-shot/few-shot CoT prompting, and used LoRA for efficient fine-tuning.

Result: CoT improved performance; LLaMA-3.3 70B achieved 88% accuracy with few-shot CoT. LoRA enabled efficient adaptation.

Conclusion: SOMADHAN fills a critical gap, advancing equitable NLP research and enhancing reasoning in low-resource languages.

Abstract: Solving Bengali Math Word Problems (MWPs) remains a major challenge in natural language processing (NLP) due to the language’s low-resource status and the multi-step reasoning required. Existing models struggle with complex Bengali MWPs, largely because no human-annotated Bengali dataset has previously addressed this task. This gap has limited progress in Bengali mathematical reasoning. To address this, we created SOMADHAN, a dataset of 8792 complex Bengali MWPs with manually written, step-by-step solutions. We designed this dataset to support reasoning-focused evaluation and model development in a linguistically underrepresented context. Using SOMADHAN, we evaluated a range of large language models (LLMs) - including GPT-4o, GPT-3.5 Turbo, LLaMA series models, Deepseek, and Qwen - through both zero-shot and few-shot prompting with and without Chain of Thought (CoT) reasoning. CoT prompting consistently improved performance over standard prompting, especially in tasks requiring multi-step logic. LLaMA-3.3 70B achieved the highest accuracy of 88% with few-shot CoT prompting. We also applied Low-Rank Adaptation (LoRA) to fine-tune models efficiently, enabling them to adapt to Bengali MWPs with minimal computational cost. Our work fills a critical gap in Bengali NLP by providing a high-quality reasoning dataset and a scalable framework for solving complex MWPs. We aim to advance equitable research in low-resource languages and enhance reasoning capabilities in educational and language technologies.
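
For the LoRA step, a typical setup with the peft library looks like the sketch below; the base checkpoint, rank, and target modules are placeholders rather than the paper's configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder checkpoint
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights train
```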

[61] MuSciClaims: Multimodal Scientific Claim Verification

Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian

Main category: cs.CL

TL;DR: The paper introduces MuSciClaims, a benchmark for testing claim verification abilities using multimodal data from scientific figures, highlighting poor performance of vision-language models.

DetailsMotivation: To address the lack of multimodal benchmarks for verifying scientific claims using figures, which is crucial for assessing scientific literature.

Method: Automatically extract supported claims from articles, manually perturb them to create contradicted claims, and introduce diagnostic tasks to analyze model failures.

Result: Vision-language models perform poorly (0.3-0.5 F1), with the best model at 0.72 F1, showing bias towards supported claims and struggles with evidence localization and multimodal aggregation.

Conclusion: Current models lack nuanced understanding of scientific claims and multimodal data, necessitating improved benchmarks and model capabilities.

Abstract: Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark MuSciClaims accompanied by diagnostics tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models are poor (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show models are bad at localizing correct evidence within figures, struggle with aggregating information across modalities, and often fail to understand basic components of the figure.

[62] LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

Main category: cs.CL

TL;DR: The paper introduces LLM-as-a-qualitative-judge, an approach using LLMs to generate structured reports of common issues in NLG outputs, aiding developers in system improvements.

DetailsMotivation: Current LLM-as-a-judge methods focus on numerical scores, lacking qualitative insights for NLG system improvement.

Method: The approach involves open-ended per-instance issue analysis and clustering using a cumulative algorithm, evaluated with ~300 annotations from 12 NLG datasets.

Result: The method correctly identifies instance-specific issues in 2/3 cases and produces error reports similar to human annotators.

Conclusion: LLM-as-a-qualitative-judge effectively provides actionable insights for NLG system enhancement, with code and data publicly available.

Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
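
The "intuitive cumulative algorithm" can be pictured as a one-pass loop in which each discovered issue either joins an existing issue type or seeds a new one; `same_type` stands in for an LLM membership query and is our assumption, not the released code:

```python
def same_type(issue: str, cluster_name: str) -> bool:
    """Placeholder: ask an LLM whether `issue` is an instance of the
    issue type summarized by `cluster_name`."""
    raise NotImplementedError

def cluster_issues(issues: list[str]) -> dict[str, list[str]]:
    clusters: dict[str, list[str]] = {}
    for issue in issues:
        for name in clusters:
            if same_type(issue, name):
                clusters[name].append(issue)
                break
        else:
            clusters[issue] = [issue]  # new issue type seeded by this instance
    return clusters
```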

[63] MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations

Jackson Trager, Diego Alves, Matteo Guida, Mikel K. Ngueajio, Ameeta Agrawal, Flor Plaza-del-Arco, Yalda Daryanai, Farzan Karimi-Malekabadi, Francielle Vargas

Main category: cs.CL

TL;DR: MFTCXplain is a multilingual benchmark for evaluating LLMs’ moral reasoning using hate speech explanations, revealing misalignment between LLMs and human moral judgments.

DetailsMotivation: Addressing the lack of transparent, multilingual benchmarks for assessing LLMs' moral reasoning in diverse cultural contexts.

Method: Introduces MFTCXplain, a dataset of 3,000 tweets in four languages, annotated with hate speech labels, moral categories, and rationales, evaluated using MFT.

Result: LLMs perform well in hate speech detection (F1 up to 0.836) but poorly in moral sentiment prediction (F1 < 0.35), with limited rationale alignment in underrepresented languages.

Conclusion: Current LLMs struggle to internalize human moral reasoning, highlighting the need for improved benchmarks and model capabilities.

Abstract: Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited, particularly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.

[64] Probing Information Distribution in Transformer Architectures through Entropy Analysis

Amedeo Buonanno, Alessandro Rivetti, Francesco A. N. Palmieri, Giovanni Di Gennaro, Gianmarco Romano

Main category: cs.CL

TL;DR: The paper uses entropy analysis to study information distribution in Transformer models, focusing on token-level uncertainty and processing stages, with GPT as a case study.

DetailsMotivation: To understand how information is managed and transformed in Transformer-based architectures.

Method: Entropy analysis is applied to quantify token-level uncertainty and examine entropy patterns across processing stages, using a GPT-based model as a case study.

Result: The methodology reveals insights into model behavior and internal representations.

Conclusion: This approach could enhance interpretability and evaluation frameworks for Transformer models.

Abstract: This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may contribute to the development of interpretability and evaluation frameworks for transformer-based models.
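
One concrete way to realize such a probe is a logit-lens-style readout: project each layer's hidden states through the unembedding matrix and compute the entropy of the induced next-token distribution. This is an assumed setup for illustration, not necessarily the paper's exact procedure:

```python
import torch

def layer_entropies(hidden_states, unembed):
    """hidden_states: list of (seq, d) tensors, one per layer;
    unembed: (d, vocab) unembedding matrix.
    Returns the mean next-token entropy (in nats) per layer."""
    entropies = []
    for h in hidden_states:
        probs = torch.softmax(h @ unembed, dim=-1)          # (seq, vocab)
        ent = -(probs * torch.log(probs + 1e-12)).sum(-1)   # (seq,)
        entropies.append(ent.mean().item())
    return entropies
```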

[65] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: EviOmni improves RAG by denoising retrieval inputs through explicit reasoning and conscious extraction of key cues, enhancing LLM accuracy.

DetailsMotivation: Retrieval noises degrade LLM generation quality, and existing methods lack explicit reasoning, risking key clue omission and poor generalization.

Method: EviOmni integrates evidence reasoning and extraction into a unified framework, uses knowledge token masks, and employs verifiable rewards for training.

Result: EviOmni outperforms on benchmarks, providing high-quality evidence and boosting downstream task accuracy.

Conclusion: EviOmni effectively denoises RAG inputs, improving LLM performance and practical application in online systems.

Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose EviOmni, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of EviOmni, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.
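
The abstract names three verifiable reward signals (answer, length, format). A minimal combination might look like the sketch below; the tag format and weights are invented for illustration:

```python
def reward(pred: str, gold: str, max_words: int = 256) -> float:
    r_answer = 1.0 if gold.lower() in pred.lower() else 0.0    # answer correctness
    r_length = 1.0 if len(pred.split()) <= max_words else 0.0  # compact evidence
    r_format = 1.0 if "<evidence>" in pred and "<answer>" in pred else 0.0
    return 0.6 * r_answer + 0.2 * r_length + 0.2 * r_format
```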

[66] Reservoir Computing as a Language Model

Felix Köster, Atsushi Uchida

Main category: cs.CL

TL;DR: The paper compares reservoir computing and transformer-based models for text processing, highlighting transformers’ superior prediction quality and reservoir computing’s efficiency.

DetailsMotivation: Address the energy and speed bottlenecks of LLMs by exploring reservoir computing as an alternative for efficient text processing.

Method: Compare three approaches: two reservoir computing methods (static linear readout and attention-enhanced) and transformer-based models, using a consistent pipeline with equal trainable parameters.

Result: Transformers outperform in prediction quality, while reservoir computing is faster and more efficient in training and inference.

Conclusion: Reservoir computing offers a resource-efficient alternative to transformers, with trade-offs between performance and efficiency.

Abstract: Large Language Models (LLMs) have dominated the science and media landscape due to their impressive performance in processing large chunks of data and producing human-like text. Nevertheless, their huge energy demand and slow processing remain a bottleneck to further increasing quality while also making the models accessible to everyone. To address this bottleneck, we investigate how reservoir computing performs on natural text processing, which could enable fast and energy-efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches to character-level language modeling: two reservoir computing approaches, in which only an output layer is trainable, and the well-known transformer-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost, and prediction accuracy of both paradigms by equally varying the number of trainable parameters across all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient, reducing training and inference time. Furthermore, we investigate two types of reservoir computing: a traditional reservoir with a static linear readout, and an attention-enhanced reservoir that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines for balancing resource constraints with performance.
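
A toy echo-state reservoir of the "static linear readout" kind the paper compares; only a linear map fit on the collected states (e.g., ridge regression onto next-character targets) would be trained. Sizes and spectral radius are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 128, 512                  # one-hot characters, reservoir units
W_in = rng.normal(0.0, 0.5, (n_res, n_in))
W = rng.normal(0.0, 1.0, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # echo-state property

def run_reservoir(onehots: np.ndarray) -> np.ndarray:
    """onehots: (T, n_in) one-hot character sequence.
    Returns the (T, n_res) reservoir state trajectory."""
    x = np.zeros(n_res)
    states = []
    for u in onehots:
        x = np.tanh(W_in @ u + W @ x)   # fixed, untrained dynamics
        states.append(x.copy())
    return np.stack(states)
```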

[67] Basic Reading Distillation

Zhi Zhou, Sirui Miao, Xiangyu Duan, Hao Yang, Min Zhang

Main category: cs.CL

TL;DR: The paper introduces Basic Reading Distillation (BRD), a method to train small models by imitating LLMs’ basic reading behaviors, improving performance on various tasks.

DetailsMotivation: Large language models (LLMs) require high computation resources, limiting real-world deployment. Existing distillation methods neglect generic text education for small models.

Method: Proposes BRD, which trains small models to imitate LLMs’ basic reading behaviors (e.g., named entity recognition, Q&A) on generic texts.

Result: The small model outperforms or matches 20x larger LLMs on tasks like language inference and BIG-bench.

Conclusion: BRD effectively influences small models’ probability distribution and complements existing distillation techniques.

Abstract: Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources, which limits their deployment in real-world settings. Distillation is one technique to solve this problem, through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD), which educates a small model to imitate LLMs’ basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model to various tasks including language inference benchmarks and BIG-bench tasks. It shows that the small model can outperform or perform comparably to LLMs over 20x larger. Analysis reveals that BRD effectively influences the probability distribution of the small model, and is orthogonal to both knowledge distillation and task distillation.

[68] FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models

Likun Tan, Kuan-Wei Huang, Kevin Wu

Main category: cs.CL

TL;DR: The paper introduces a method to detect and edit factual inaccuracies in large language model responses, focusing on financial domains, and demonstrates improved performance with fine-tuned models.

DetailsMotivation: Address hallucinations in large language models to ensure factual reliability, especially in high-stakes domains like finance.

Method: Constructs a synthetic dataset with tagged errors, fine-tunes four models (Phi-4, Phi-4-mini, Qwen3-4B, Qwen3-14B) for detection and editing.

Result: Fine-tuned Phi-4 improves binary F1 by 8% and overall detection by 30%; Phi-4-mini remains competitive with minimal performance drop.

Conclusion: Provides a practical solution for factual inconsistency detection and editing, with a generalizable framework for enhancing model trustworthiness.

Abstract: Hallucinations in large language models pose a critical challenge for applications requiring factual reliability, particularly in high-stakes domains such as finance. This work presents an effective approach for detecting and editing factually incorrect content in model-generated responses based on the provided context. Given a user-defined domain-specific error taxonomy, we construct a synthetic dataset by inserting tagged errors into financial question-answering corpora and then fine-tune four language models, Phi-4, Phi-4-mini, Qwen3-4B, and Qwen3-14B, to detect and edit these factual inaccuracies. Our best-performing model, fine-tuned Phi-4, achieves an 8% improvement in binary F1 score and a 30% gain in overall detection performance compared to OpenAI-o3. Notably, our fine-tuned Phi-4-mini model, despite having only 4 billion parameters, maintains competitive performance with just a 2% drop in binary detection and a 0.1% decline in overall detection compared to OpenAI-o3. Our work provides a practical solution for detecting and editing factual inconsistencies in financial text generation while introducing a generalizable framework that can enhance the trustworthiness and alignment of large language models across diverse applications beyond finance. Our code and data are available at https://github.com/pegasi-ai/shield.
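
The synthetic-data construction can be pictured as swapping a grounded fact for a wrong one and tagging the span with its error type; the tag syntax and error taxonomy below are invented for illustration:

```python
def insert_error(answer: str, fact: str, wrong_fact: str,
                 error_type: str = "numeric") -> str:
    """Replace a correct fact with a tagged incorrect one."""
    tagged = f"<error type='{error_type}'>{wrong_fact}</error>"
    return answer.replace(fact, tagged)

print(insert_error("Revenue grew 12% in Q3.", "12%", "21%"))
# Revenue grew <error type='numeric'>21%</error> in Q3.
```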

[69] Training language models to be warm and empathetic makes them less reliable and more sycophantic

Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher

Main category: cs.CL

TL;DR: Optimizing AI language models for warmth and empathy reduces their reliability, especially in safety-critical tasks, despite preserved performance on standard benchmarks.

DetailsMotivation: To investigate the trade-off between warmth and reliability in AI language models, particularly when users express vulnerability.

Method: Controlled experiments on five language models of varying sizes and architectures, training them for warmer responses and evaluating on safety-critical tasks.

Result: Warm models showed higher error rates, promoted misinformation, and validated incorrect beliefs, especially with vulnerable users.

Conclusion: Current AI development and oversight practices need reevaluation to address these risks as human-like AI systems scale.

Abstract: Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.

[70] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router

Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng

Main category: cs.CL

TL;DR: DeepSieve is an agentic RAG framework that improves reasoning and retrieval by decomposing queries and routing sub-questions to suitable knowledge sources, filtering noise through multi-stage distillation.

DetailsMotivation: LLMs struggle with knowledge-intensive tasks due to lack of dynamic access to up-to-date or domain-specific information. Existing RAG methods lack fine-grained control, leading to noisy retrieval and shallow reasoning.

Method: DeepSieve decomposes queries into structured sub-questions, routes them to optimal knowledge sources, and filters irrelevant information via multi-stage distillation.

Result: Experiments show DeepSieve outperforms conventional RAG in reasoning depth, retrieval precision, and interpretability on multi-hop QA tasks.

Conclusion: DeepSieve offers a modular, transparent, and adaptable solution for enhancing LLM performance in knowledge-intensive tasks.

Abstract: Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches. Our codes are available at https://github.com/MinghoKwok/DeepSieve.
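
In the spirit of the description above, an LLM-as-a-knowledge-router loop might look like this sketch; `llm` and the source registry are hypothetical stand-ins, and the recursive routing and multi-stage distillation of the actual framework are elided:

```python
def deepsieve(question: str, llm, sources: dict) -> str:
    """sources: maps a source name to a retrieval function."""
    subqs = llm(f"Decompose into sub-questions, one per line: {question}").splitlines()
    notes = []
    for sq in subqs:
        src = llm(f"Pick one source from {list(sources)} for: {sq}").strip()
        snippets = sources[src](sq)                                        # retrieve
        kept = llm(f"Keep only snippets relevant to '{sq}':\n{snippets}")  # sieve
        notes.append(kept)
    return llm(f"Answer '{question}' using:\n" + "\n".join(notes))
```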

cs.CV

[71] Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?

Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu

Main category: cs.CV

TL;DR: The paper presents a large-scale study on physics failures in Physics Engines (PEs), analyzing their manifestations, detection methods, and developer insights to improve reliability.

DetailsMotivation: PEs are critical but prone to physics failures, which can harm reliability and user experience. Current testing methods are inadequate, focusing on crashes rather than complex failures.

Method: The study investigates three research questions: manifestations of physics failures, detection techniques’ effectiveness, and developer perceptions. It includes a taxonomy, evaluation of detection methods (deep learning, prompt-based techniques, multimodal models), and insights from developers.

Result: Key contributions are a taxonomy of physics failures, evaluation of detection methods, and actionable insights from developers. The PhysiXFails dataset and materials are released for future research.

Conclusion: The study highlights the need for better detection approaches for physics failures in PEs and provides resources to advance research in this area.

Abstract: Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. Current testing approaches for PE-based software are inadequate, typically requiring white-box access and focusing on crash detection rather than semantically complex physics failures. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software. We investigate three research questions addressing the manifestations of physics failures, the effectiveness of detection techniques, and developer perceptions of current detection practices. Our contributions include: (1) a taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods including deep learning, prompt-based techniques, and large multimodal models; and (3) actionable insights from developer experiences for improving detection approaches. To support future research, we release PhysiXFails, code, and other materials at https://sites.google.com/view/physics-failure-detection.

[72] Trade-offs in Image Generation: How Do Different Dimensions Interact?

Sicheng Zhang, Binzhu Xie, Zhonghao Yan, Yuli Zhang, Donghao Zhou, Xiaofei Chen, Shi Qiu, Jiaqi Liu, Guoyang Xie, Zhichao Lu

Main category: cs.CV

TL;DR: TRIG-Bench introduces a dataset and metric (TRIGScore) to evaluate trade-offs in image generation across 10 dimensions, analyzing 14 models and proposing a Dimension Trade-off Map (DTM) for improvement.

DetailsMotivation: Current models lack fine-grained evaluation of trade-offs in image generation due to limited datasets and single-metric approaches.

Method: Developed TRIG-Bench (40,200 samples, 10 dimensions) and TRIGScore (VLM-as-judge metric), evaluated 14 models, and proposed DTM for visualizing trade-offs.

Result: DTM provides comprehensive insights into model trade-offs, and fine-tuning based on DTM improves performance.

Conclusion: TRIG-Bench and DTM offer a systematic way to analyze and enhance image generation models by addressing dimension-specific weaknesses.

Abstract: Model performance in text-to-image (T2I) and image-to-image (I2I) generation often depends on multiple aspects, including quality, alignment, diversity, and robustness. However, models’ complex trade-offs among these dimensions have rarely been explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) the use of a single metric for multiple dimensions. To bridge this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity, and Bias), contains 40,200 samples, and covers 132 pairwise dimensional subsets. Furthermore, we develop TRIGScore, a VLM-as-judge metric that automatically adapts to various dimensions. Based on TRIG-Bench and TRIGScore, we evaluate 14 models across T2I and I2I tasks. In addition, we propose the Relation Recognition System to generate the Dimension Trade-off Map (DTM) that visualizes the trade-offs among model-specific capabilities. Our experiments demonstrate that DTM consistently provides a comprehensive understanding of the trade-offs between dimensions for each type of generative model. Notably, we show that the model’s dimension-specific weaknesses can be mitigated through fine-tuning on DTM to enhance overall performance. Code is available at: https://github.com/fesvhtr/TRIG

[73] AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock

Umair Nawaz, Muhammad Zaigham Zaheer, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer

Main category: cs.CV

TL;DR: A survey reviewing over 200 AI research works in agriculture, covering machine learning, deep learning, and vision-language models for tasks like crop disease detection and livestock health, while addressing challenges like data variability and suggesting future research directions.

DetailsMotivation: To address challenges in global food production (climate variability, resource limitations, sustainability) using AI technologies.

Method: Systematic review of 200+ studies on AI in agriculture, including conventional ML, deep learning (e.g., vision transformers), and vision-language models (e.g., CLIP).

Result: Identifies key tasks (e.g., disease detection, livestock monitoring) and challenges (data variability, deployment).

Conclusion: Highlights future research needs: multimodal data integration, edge-device deployment, and adaptable AI models for diverse farming environments.

Abstract: Crops, fisheries and livestock form the backbone of global food production, essential to feed the ever-growing global population. However, these sectors face considerable challenges, including climate variability, resource limitations, and the need for sustainable management. Addressing these issues requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey presents a systematic and thorough review of more than 200 research works covering conventional machine learning approaches, advanced deep learning techniques (e.g., vision transformers), and recent vision-language foundation models (e.g., CLIP) in the agriculture domain, focusing on diverse tasks such as crop disease detection, livestock health management, and aquatic species monitoring. We further cover major implementation challenges such as data variability and experimental aspects: datasets, performance evaluation metrics, and geographical focus. We finish the survey by discussing potential open research directions emphasizing the need for multimodal data integration, efficient edge-device deployment, and domain-adaptable AI models for diverse farming environments. Rapid growth of evolving developments in this field can be actively tracked on our project page: https://github.com/umair1221/AI-in-Agriculture

[74] Color as the Impetus: Transforming Few-Shot Learner

Chaofei Qi, Zhitai Liu, Jianbin Qiu

Main category: cs.CV

TL;DR: The paper introduces ColorSense Learner, a bio-inspired meta-learning framework simulating human color perception for few-shot learning, and ColorSense Distiller, a meta-distiller enhancing learning via knowledge distillation.

DetailsMotivation: Humans' innate meta-learning capabilities, linked to color perception, inspired the development of a framework to improve few-shot learning by leveraging color-channel interactions.

Method: The proposed ColorSense Learner uses inter-channel feature extraction and interactive learning, while ColorSense Distiller incorporates teacher knowledge for enhanced meta-learning.

Result: Experiments on eleven benchmarks show strong generalization, robustness, and transferability, validating the framework’s effectiveness in few-shot classification.

Conclusion: The framework bridges the gap in meta-learning by integrating color perception, achieving superior performance in few-shot tasks.

Abstract: Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we pioneer an innovative viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network’s meta-learning capacity. We have conducted comprehensive coarse/fine-grained and cross-domain experiments on eleven few-shot benchmarks for validation. Numerous experiments reveal that our methods have extremely strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.

[75] Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Sara Sarto, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Main category: cs.CV

TL;DR: PAC-S++ is a learnable metric for caption evaluation and generation, leveraging CLIP and improved pre-training, enhancing caption quality and fine-tuning.

DetailsMotivation: Existing caption evaluation metrics lack precision due to reliance on non-specific references or noisy data, necessitating a better metric for evaluation and generation.

Method: Proposes PAC-S++, using CLIP with curated pre-training and regularization, applied in SCST for fine-tuning captioning models.

Result: PAC-S++ outperforms popular metrics, reduces object hallucinations, and improves caption quality with fewer errors.

Conclusion: PAC-S++ enhances caption evaluation and generation, with demonstrated efficacy in fine-tuning and out-of-domain benchmarks.

Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for caption evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
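
At its core, a CLIP-based captioning metric scores the cosine similarity between image and caption embeddings, as in the sketch below; PAC-S++ adds curated pre-training and positive-augmented regularization on top, which this plain CLIP score does not include:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())  # cosine similarity
```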

[76] Enhancing efficiency in paediatric brain tumour segmentation using a pathologically diverse single-center clinical dataset

A. Piffer, J. A. Buchner, A. G. Gennari, P. Grehten, S. Sirin, E. Ross, I. Ezhov, M. Rosier, J. C. Peeken, M. Piraud, B. Menze, A. Guerreiro Stücklin, A. Jakab, F. Kofler

Main category: cs.CV

TL;DR: DL-based segmentation using nnU-Net shows promise for paediatric brain tumour delineation, with robust performance for whole tumour and T2-hyperintensity, but challenges remain for enhancing tumour and cystic component segmentation.

DetailsMotivation: To evaluate the feasibility and performance of deep learning (DL) segmentation across diverse paediatric brain tumour (PBT) subtypes and MRI protocols, addressing diagnostic and therapeutic challenges.

Method: A retrospective study of 174 paediatric patients with various PBTs used MRI sequences (T1, T1-C, T2, FLAIR). A 3D nnU-Net model was trained and tested (121/53 split), assessing segmentation performance via Dice similarity coefficient (DSC) and comparing it to human variability.

Result: The model performed well for whole tumour (WT) and T2-hyperintensity (T2H) (mean DSC: 0.85), matching human annotator variability (mean DSC: 0.86). Enhancing tumour (ET) was moderately accurate (mean DSC: 0.75), while cystic component (CC) was poor. Performance varied by tumour type, MRI sequence, and location.

Conclusion: DL is feasible for PBT segmentation, especially for WT and T2H, but requires refinement for ET and CC. Findings suggest potential for protocol simplification and workflow automation in paediatric neuro-oncology.

Abstract: Background: Brain tumours are the most common solid malignancies in children, encompassing diverse histological, molecular subtypes and imaging features and outcomes. Paediatric brain tumours (PBTs), including high- and low-grade gliomas (HGG, LGG), medulloblastomas (MB), ependymomas, and rarer forms, pose diagnostic and therapeutic challenges. Deep learning (DL)-based segmentation offers promising tools for tumour delineation, yet its performance across heterogeneous PBT subtypes and MRI protocols remains uncertain. Methods: A retrospective single-centre cohort of 174 paediatric patients with HGG, LGG, MB, ependymomas, and other rarer subtypes was used. MRI sequences included T1, T1 post-contrast (T1-C), T2, and FLAIR. Manual annotations were provided for four tumour subregions: whole tumour (WT), T2-hyperintensity (T2H), enhancing tumour (ET), and cystic component (CC). A 3D nnU-Net model was trained and tested (121/53 split), with segmentation performance assessed using the Dice similarity coefficient (DSC) and compared against intra- and inter-rater variability. Results: The model achieved robust performance for WT and T2H (mean DSC: 0.85), comparable to human annotator variability (mean DSC: 0.86). ET segmentation was moderately accurate (mean DSC: 0.75), while CC performance was poor. Segmentation accuracy varied by tumour type, MRI sequence combination, and location. Notably, T1, T1-C, and T2 alone produced results nearly equivalent to the full protocol. Conclusions: DL is feasible for PBTs, particularly for T2H and WT. Challenges remain for ET and CC segmentation, highlighting the need for further refinement. These findings support the potential for protocol simplification and automation to enhance volumetric assessment and streamline paediatric neuro-oncology workflows.
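
For reference, the Dice similarity coefficient used to score segmentation overlap is the standard measure DSC = 2|A ∩ B| / (|A| + |B|) on binary masks:

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0
```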

[77] Temporally Consistent Unsupervised Segmentation for Mobile Robot Perception

Christian Ellis, Maggie Wigness, Craig Lennon, Lance Fiondella

Main category: cs.CV

TL;DR: Frontier-Seg is a method for temporally consistent unsupervised terrain segmentation in mobile robot video streams, addressing the limitations of costly labeled data and frame-by-frame approaches.

DetailsMotivation: Current supervised semantic segmentation methods rely on expensive labeled data and struggle in unstructured environments. Zero-shot approaches lack temporal consistency, which is crucial for robust perception.

Method: Frontier-Seg clusters superpixel-level features from DINOv2 foundation models and enforces temporal consistency to identify terrain boundaries without human supervision.

Result: Evaluated on RUGD and RELLIS-3D datasets, Frontier-Seg demonstrates effective unsupervised segmentation in unstructured off-road environments.

Conclusion: Frontier-Seg provides a scalable, unsupervised solution for terrain segmentation, overcoming the need for labeled data and ensuring temporal consistency.

Abstract: Rapid progress in terrain-aware autonomous ground navigation has been driven by advances in supervised semantic segmentation. However, these methods rely on costly data collection and labor-intensive ground truth labeling to train deep models. Furthermore, autonomous systems are increasingly deployed in unrehearsed, unstructured environments where no labeled data exists and semantic categories may be ambiguous or domain-specific. Recent zero-shot approaches to unsupervised segmentation have shown promise in such settings but typically operate on individual frames, lacking temporal consistency, a critical property for robust perception in unstructured environments. To address this gap, we introduce Frontier-Seg, a method for temporally consistent unsupervised segmentation of terrain from mobile robot video streams. Frontier-Seg clusters superpixel-level features extracted from foundation model backbones (specifically DINOv2) and enforces temporal consistency across frames to identify persistent terrain boundaries, or frontiers, without human supervision. We evaluate Frontier-Seg on a diverse set of benchmark datasets, including RUGD and RELLIS-3D, demonstrating its ability to perform unsupervised segmentation across unstructured off-road environments.
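
A rough sketch of the unsupervised core, clustering DINOv2 patch features into terrain groups; the superpixel pooling and the cross-frame temporal consistency that define Frontier-Seg are not reproduced here:

```python
import torch
from sklearn.cluster import KMeans

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")

def segment_frame(img: torch.Tensor, n_clusters: int = 4):
    """img: (1, 3, H, W) with H, W multiples of 14.
    Returns a cluster label per 14x14 patch."""
    with torch.no_grad():
        feats = dinov2.forward_features(img)["x_norm_patchtokens"]
    patches = feats[0].cpu().numpy()   # (n_patches, d)
    return KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(patches)
```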

[78] Anti-Inpainting: A Proactive Defense Approach against Malicious Diffusion-based Inpainters under Unknown Conditions

Yimao Guo, Zuomin Qu, Wei Lu, Xiangyang Luo

Main category: cs.CV

TL;DR: Anti-Inpainting is a proactive defense method against diffusion-based image manipulation, using multi-level feature extraction, semantic-preserving augmentation, and distribution deviation optimization.

Motivation: Existing defenses fail under unknown conditions for diffusion-based image tampering, necessitating a robust solution.

Method: Proposes three modules: multi-level deep feature extraction, multi-scale semantic-preserving augmentation, and selection-based distribution deviation optimization.

Result: Effective defense against diffusion-based inpainters under unknown conditions, with robustness against purification methods and transferability across model versions.

Conclusion: Anti-Inpainting provides a reliable defense against unknown diffusion-based manipulations, validated by extensive experiments.

Abstract: With the increasing prevalence of diffusion-based malicious image manipulation, existing proactive defense methods struggle to safeguard images against tampering under unknown conditions. To address this, we propose Anti-Inpainting, a proactive defense approach that achieves protection through three novel modules. First, we introduce a multi-level deep feature extractor to obtain intricate features from the diffusion denoising process, enhancing protective effectiveness. Second, we design a multi-scale, semantic-preserving data augmentation technique to enhance the transferability of adversarial perturbations across unknown conditions. Finally, we propose a selection-based distribution deviation optimization strategy to bolster protection against manipulations guided by diverse random seeds. Extensive experiments on InpaintGuardBench and CelebA-HQ demonstrate that Anti-Inpainting effectively defends against diffusion-based inpainters under unknown conditions. Additionally, our approach demonstrates robustness against various image purification methods and transferability across different diffusion model versions.
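
The first two modules combine ideas that can be illustrated with a generic PGD-style loop: maximize multi-level feature drift of a surrogate network under an L-infinity budget, averaged over random semantic-preserving augmentations. The surrogate network, budget, and augmentations below are placeholders; the actual method optimizes against the diffusion denoising process itself.

```python
import torch
import torch.nn.functional as F
import torchvision.models as tvm
import torchvision.transforms as T

surrogate = tvm.resnet18(weights=None).eval()  # stand-in feature extractor
stages = [surrogate.layer1, surrogate.layer2, surrogate.layer3]

def multi_level_feats(x):
    feats = []
    h = surrogate.maxpool(surrogate.relu(surrogate.bn1(surrogate.conv1(x))))
    for stage in stages:
        h = stage(h)
        feats.append(h)
    return feats

def protect(image, eps=8 / 255, steps=40, alpha=1 / 255):
    aug = T.Compose([T.RandomResizedCrop(image.shape[-1], scale=(0.8, 1.0)),
                     T.RandomHorizontalFlip()])
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        # Same random augmentation applied to protected and clean copies.
        both = aug(torch.cat([image + delta, image], dim=0).clamp(0, 1))
        adv_feats = multi_level_feats(both[:1])
        clean_feats = [f.detach() for f in multi_level_feats(both[1:])]
        loss = sum(F.mse_loss(a, c) for a, c in zip(adv_feats, clean_feats))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient *ascent* on drift
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

protected = protect(torch.rand(1, 3, 224, 224))
```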

[79] SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang

Main category: cs.CV

TL;DR: The paper addresses CLIP’s limitations in handling misaligned and entangled image-text data by proposing a novel framework for flexible, granular alignment and disentanglement of representations.

Motivation: CLIP struggles with misaligned and entangled image-text data, limiting its generalization. The paper aims to improve alignment and disentanglement for better performance.

Method: The authors introduce a theoretical framework for flexible alignment and disentanglement, proposing a modular approach (SmartCLIP) to align relevant visual and textual representations.

Result: The proposed method outperforms existing approaches, demonstrating improved handling of misalignment and disentanglement.

Conclusion: The framework successfully addresses CLIP’s limitations, offering a scalable solution for better cross-modal alignment and generalization.

Abstract: Contrastive Language-Image Pre-training (CLIP) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts – ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.

[80] HOG-CNN: Integrating Histogram of Oriented Gradients with Convolutional Neural Networks for Retinal Image Classification

Faisal Ahmed

Main category: cs.CV

TL;DR: The paper proposes HOG-CNN, a hybrid model combining HOG and CNN features for automated retinal disease diagnosis, achieving high accuracy on public datasets.

Motivation: Manual diagnosis of retinal diseases is time- and resource-intensive; automation is needed for efficiency and scalability.

Method: A hybrid feature extraction model (HOG-CNN) integrates handcrafted HOG features with deep CNN representations for retinal image analysis.

Result: HOG-CNN achieves high performance: 98.5% accuracy for binary DR, 94.2 AUC for five-class DR, 92.8% accuracy for AMD, and 83.9% accuracy for Glaucoma.

Conclusion: HOG-CNN is a robust, interpretable, and scalable tool for automated retinal disease screening, suitable for resource-constrained settings.

Abstract: The analysis of fundus images is critical for the early detection and diagnosis of retinal diseases such as Diabetic Retinopathy (DR), Glaucoma, and Age-related Macular Degeneration (AMD). Traditional diagnostic workflows, however, often depend on manual interpretation and are both time- and resource-intensive. To address these limitations, we propose an automated and interpretable clinical decision support framework based on a hybrid feature extraction model called HOG-CNN. Our key contribution lies in the integration of handcrafted Histogram of Oriented Gradients (HOG) features with deep convolutional neural network (CNN) representations. This fusion enables our model to capture both local texture patterns and high-level semantic features from retinal fundus images. We evaluated our model on three public benchmark datasets: APTOS 2019 (for binary and multiclass DR classification), ORIGA (for Glaucoma detection), and IC-AMD (for AMD diagnosis); HOG-CNN demonstrates consistently high performance. It achieves 98.5% accuracy and 99.2 AUC for binary DR classification, and 94.2 AUC for five-class DR classification. On the IC-AMD dataset, it attains 92.8% accuracy, 94.8% precision, and 94.5 AUC, outperforming several state-of-the-art models. For Glaucoma detection on ORIGA, our model achieves 83.9% accuracy and 87.2 AUC, showing competitive performance despite dataset limitations. We show, through comprehensive appendix studies, the complementary strength of combining HOG and CNN features. The model’s lightweight and interpretable design makes it particularly suitable for deployment in resource-constrained clinical environments. These results position HOG-CNN as a robust and scalable tool for automated retinal disease screening.
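
The core fusion idea is simple to demonstrate: compute a HOG descriptor and a CNN embedding for the same image, concatenate them, and feed the result to a small classifier head. The backbone, HOG parameters, and head below are illustrative guesses, not the paper's exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as tvm
from skimage.color import rgb2gray
from skimage.feature import hog

cnn = tvm.resnet18(weights=None)
cnn.fc = nn.Identity()  # expose the 512-d penultimate embedding
cnn.eval()

def hybrid_features(image: np.ndarray) -> torch.Tensor:
    """image: HxWx3 float array in [0, 1], e.g. a resized fundus photo."""
    hog_vec = hog(rgb2gray(image), orientations=9,
                  pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    x = torch.from_numpy(image).permute(2, 0, 1)[None].float()
    with torch.no_grad():
        deep_vec = cnn(x)[0]
    return torch.cat([torch.from_numpy(hog_vec).float(), deep_vec])

feat = hybrid_features(np.random.rand(224, 224, 3))
head = nn.Sequential(nn.Linear(feat.numel(), 128), nn.ReLU(), nn.Linear(128, 2))
logits = head(feat)  # a binary DR decision would be a softmax over these
```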

[81] AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

Christopher F. Brown, Michal R. Kazmierski, Valerie J. Pasquarella, William J. Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, Noel Gorelick, Lihui Lydia Zhang, Sophia Alj, Emily Schechter, Sean Askay, Oliver Guinan, Rebecca Moore, Alexis Boukouvalas, Pushmeet Kohli

Main category: cs.CV

TL;DR: AlphaEarth Foundations introduces a geospatial embedding model that outperforms existing methods in mapping tasks without retraining, leveraging sparse labels for global-scale applications.

Motivation: High-quality labels for Earth observation data are scarce due to the effort required for physical measurements, necessitating efficient modeling to translate sparse labels into maps.

Method: AlphaEarth Foundations uses an embedding field model to assimilate spatial, temporal, and measurement contexts across multiple data sources.

Result: The model consistently outperforms previous featurization methods in diverse mapping evaluations; the authors will release global embedding field layers from 2017 through 2024.

Conclusion: AlphaEarth Foundations provides a scalable and accurate solution for geospatial mapping, addressing the challenge of sparse labels in Earth observation.

Abstract: Unprecedented volumes of Earth observation data are continually collected around the world, but high-quality labels remain scarce given the effort required to make physical measurements and observations. This has led to considerable investment in bespoke modeling efforts translating sparse labels into maps. Here we introduce AlphaEarth Foundations, an embedding field model yielding a highly general, geospatial representation that assimilates spatial, temporal, and measurement contexts across multiple sources, enabling accurate and efficient production of maps and monitoring systems from local to global scales. The embeddings generated by AlphaEarth Foundations are the only ones to consistently outperform all previous featurization approaches tested on a diverse set of mapping evaluations without re-training. We will release a dataset of global, annual, analysis-ready embedding field layers from 2017 through 2024.

[82] Moiré Zero: An Efficient and High-Performance Neural Architecture for Moiré Removal

Seungryong Lee, Woojeong Baek, Younghyun Kim, Eunwoo Kim, Haru Moon, Donggon Yoo, Eunbyung Park

Main category: cs.CV

TL;DR: MZNet, a U-shaped network, effectively removes moiré patterns using multi-scale and multi-shape components, achieving state-of-the-art performance with low computational cost.

Motivation: Moiré patterns hinder applications like photography and defect inspection; existing CNN-based methods struggle due to limited receptive fields.

Method: MZNet integrates Multi-Scale Dual Attention Block, Multi-Shape Large Kernel Convolution Block, and Feature Fusion-Based Skip Connection for moiré removal.

Result: MZNet outperforms on high-resolution datasets and is competitive on lower-resolution ones, with low computational cost.

Conclusion: MZNet is an efficient, practical solution for moiré pattern removal in real-world applications.

Abstract: Moiré patterns, caused by frequency aliasing between fine repetitive structures and a camera sensor’s sampling process, have been a significant obstacle in various real-world applications, such as consumer photography and industrial defect inspection. With the advancements in deep learning algorithms, numerous studies, predominantly based on convolutional neural networks, have suggested various solutions to address this issue. Despite these efforts, existing approaches still struggle to effectively eliminate artifacts due to the diverse scales, orientations, and color shifts of moiré patterns, primarily because the constrained receptive field of CNN-based architectures limits their ability to capture the complex characteristics of moiré patterns. In this paper, we propose MZNet, a U-shaped network designed to bring images closer to a ‘Moiré-Zero’ state by effectively removing moiré patterns. It integrates three specialized components: Multi-Scale Dual Attention Block (MSDAB) for extracting and refining multi-scale features, Multi-Shape Large Kernel Convolution Block (MSLKB) for capturing diverse moiré structures, and Feature Fusion-Based Skip Connection for enhancing information flow. Together, these components enhance local texture restoration and large-scale artifact suppression. Experiments on benchmark datasets demonstrate that MZNet achieves state-of-the-art performance on high-resolution datasets and delivers competitive results on lower-resolution datasets, while maintaining a low computational cost, suggesting that it is an efficient and practical solution for real-world applications. Project page: https://sngryonglee.github.io/MoireZero
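
Of the three components, the Multi-Shape Large Kernel Convolution Block is the easiest to sketch: parallel depthwise convolutions with square, horizontal-strip, and vertical-strip kernels, fused pointwise, so oriented moiré stripes fall inside at least one receptive field. Kernel size and the residual fusion rule here are assumptions, not MZNet's published design.

```python
import torch
import torch.nn as nn

class MultiShapeLargeKernelBlock(nn.Module):
    def __init__(self, channels: int, k: int = 15):
        super().__init__()
        pad = k // 2
        # Depthwise convs: square, horizontal strip, and vertical strip.
        self.square = nn.Conv2d(channels, channels, k, padding=pad, groups=channels)
        self.horiz = nn.Conv2d(channels, channels, (1, k), padding=(0, pad), groups=channels)
        self.vert = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0), groups=channels)
        self.fuse = nn.Conv2d(channels, channels, 1)  # pointwise channel mixing

    def forward(self, x):
        return x + self.fuse(self.square(x) + self.horiz(x) + self.vert(x))

block = MultiShapeLargeKernelBlock(32)
print(block(torch.rand(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```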

[83] LAMA-Net: A Convergent Network Architecture for Dual-Domain Reconstruction

Chi Ding, Qingchao Zhang, Ge Wang, Xiaojing Ye, Yunmei Chen

Main category: cs.CV

TL;DR: The paper introduces LAMA, a learned alternating minimization algorithm for image reconstruction, and extends it to LAMA-Net/iLAMA-Net, proving convergence and demonstrating improved stability and performance.

Motivation: To address nonconvex and nonsmooth optimization in image reconstruction by leveraging complementary information from image and measurement domains.

Method: Proposes LAMA, incorporating residual learning in a proximal alternating framework, and extends it to LAMA-Net/iLAMA-Net with a network for initial generation.

Result: Convergence proof for LAMA, showing accumulation points are Clarke stationary; LAMA-Net/iLAMA-Net outperforms state-of-the-art methods in Sparse-View CT.

Conclusion: LAMA-Net/iLAMA-Net is robust, stable, and superior in performance, validated by experiments on benchmark datasets.

Abstract: We propose a learnable variational model that learns the features and leverages complementary information from both image and measurement domains for image reconstruction. In particular, we introduce a learned alternating minimization algorithm (LAMA) from our prior work, which tackles two-block nonconvex and nonsmooth optimization problems by incorporating a residual learning architecture in a proximal alternating framework. In this work, our goal is to provide a complete and rigorous convergence proof of LAMA and show that all accumulation points of a specified subsequence of LAMA must be Clarke stationary points of the problem. LAMA directly yields a highly interpretable neural network architecture called LAMA-Net. Notably, in addition to the results shown in our prior work, we demonstrate that the convergence property of LAMA yields outstanding stability and robustness of LAMA-Net in this work. We also show that the performance of LAMA-Net can be further improved by integrating a properly designed network that generates suitable initials, which we call iLAMA-Net. To evaluate LAMA-Net/iLAMA-Net, we conduct several experiments and compare them with several state-of-the-art methods on popular benchmark datasets for Sparse-View Computed Tomography.

[84] TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation

Jiuming Liu, Zheng Huang, Mengmeng Liu, Tianchen Deng, Francesco Nex, Hao Cheng, Hesheng Wang

Main category: cs.CV

TL;DR: TopoLiDM integrates GNNs with diffusion models under topological regularization for high-fidelity LiDAR scene generation, outperforming existing methods in realism and consistency.

Motivation: Mitigating LiDAR data collection costs and enhancing perception task robustness in autonomous driving by addressing geometric realism and global topological consistency issues in existing methods.

Method: Combines a topological-preserving VAE for latent graph extraction with latent diffusion models, using 0-dimensional persistent homology constraints to ensure real-world topological adherence.

Result: Achieves 22.6% lower FRID and 9.2% lower MMD on KITTI-360, with fast generation speed (1.68 samples/s).

Conclusion: TopoLiDM offers scalable, high-fidelity LiDAR generation, advancing autonomous driving perception tasks.

Abstract: LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their interpretable ability to model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topology-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM’s superiority over state-of-the-art methods, achieving improvements of 22.6% lower Fréchet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation with an average throughput of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
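
The 0-dimensional persistent homology constraint has a concrete, elementary reading: in a Vietoris-Rips filtration over a point set, every component is born at radius 0 and dies at a merge radius, and those merge radii are exactly the minimum-spanning-tree edge weights. The union-find sketch below computes such a diagram for a toy point set; how TopoLiDM turns these values into a differentiable loss is not shown here.

```python
import itertools
import numpy as np

def zero_dim_persistence(points):
    """Death times of 0-dim homology classes for a Rips filtration (O(n^2))."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # one component dies at this merge radius
    return deaths  # n-1 finite deaths; one component persists forever

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)
print(zero_dim_persistence(pts))  # two short-lived merges, one long bridge
```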

[85] Learning from Heterogeneous Structural MRI via Collaborative Domain Adaptation for Late-Life Depression Assessment

Yuzhen Gao, Qianqian Wang, Yongheng Sun, Cui Wang, Yongquan Liang, Mingxia Liu

Main category: cs.CV

TL;DR: A Collaborative Domain Adaptation (CDA) framework using Vision Transformer (ViT) and CNN improves late-life depression (LLD) detection from T1-weighted MRIs by addressing domain heterogeneity and limited sample sizes.

Motivation: Existing methods for LLD detection face challenges due to small sample sizes and domain heterogeneity, limiting model reliability and generalization.

Method: The CDA framework combines ViT for global context and CNN for local features, involving supervised source training, self-supervised target adaptation, and collaborative training with pseudo-labels and augmentation.

Result: CDA outperforms state-of-the-art unsupervised domain adaptation methods in multi-site T1-weighted MRI experiments.

Conclusion: The proposed CDA framework effectively addresses domain heterogeneity and improves LLD detection, demonstrating superior performance over existing methods.

Abstract: Accurate identification of late-life depression (LLD) using structural brain MRI is essential for monitoring disease progression and facilitating timely intervention. However, existing learning-based approaches for LLD detection are often constrained by limited sample sizes (e.g., tens), which poses significant challenges for reliable model training and generalization. Although incorporating auxiliary datasets can expand the training set, substantial domain heterogeneity, such as differences in imaging protocols, scanner hardware, and population demographics, often undermines cross-domain transferability. To address this issue, we propose a Collaborative Domain Adaptation (CDA) framework for LLD detection using T1-weighted MRIs. The CDA leverages a Vision Transformer (ViT) to capture global anatomical context and a Convolutional Neural Network (CNN) to extract local structural features, with each branch comprising an encoder and a classifier. The CDA framework consists of three stages: (a) supervised training on labeled source data, (b) self-supervised target feature adaptation and (c) collaborative training on unlabeled target data. We first train ViT and CNN on source data, followed by self-supervised target feature adaptation by minimizing the discrepancy between classifier outputs from two branches to make the categorical boundary clearer. The collaborative training stage employs pseudo-labeled and augmented target-domain MRIs, enforcing prediction consistency under strong and weak augmentation to enhance domain robustness and generalization. Extensive experiments conducted on multi-site T1-weighted MRI data demonstrate that the CDA consistently outperforms state-of-the-art unsupervised domain adaptation methods.
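
Stage (b), minimizing the discrepancy between the two branch classifiers on unlabeled target data, reduces to a few lines. The L1 discrepancy, the toy stand-in branches, and the optimizer below are assumptions for illustration; the paper's exact losses may differ.

```python
import torch
import torch.nn.functional as F

def discrepancy(p1, p2):
    return (p1 - p2).abs().mean()  # L1 gap between softmax outputs

def adapt_step(vit_branch, cnn_branch, target_batch, optimizer):
    p_vit = F.softmax(vit_branch(target_batch), dim=1)
    p_cnn = F.softmax(cnn_branch(target_batch), dim=1)
    loss = discrepancy(p_vit, p_cnn)  # agreement sharpens the boundary
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with toy stand-in branches (each maps an image to 2 class logits):
vit = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
cnn = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
opt = torch.optim.Adam(list(vit.parameters()) + list(cnn.parameters()), lr=1e-4)
print(adapt_step(vit, cnn, torch.rand(8, 3, 32, 32), opt))
```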

[86] DACA-Net: A Degradation-Aware Conditional Diffusion Network for Underwater Image Enhancement

Chang Huang, Jiahang Cao, Jun Ma, Kieren Yu, Cong Li, Huayong Yang, Kaishun Wu

Main category: cs.CV

TL;DR: A degradation-aware conditional diffusion model is proposed to enhance underwater images by predicting degradation levels and using adaptive noise scheduling and feature refinement.

Motivation: Underwater images suffer from colour distortions, low visibility, and structural clarity issues due to scattering and absorption, limiting their usability. Existing methods fail to adapt to diverse degradation conditions or leverage underwater-specific priors effectively.

Method: A lightweight dual-stream network predicts degradation levels, guiding a conditional diffusion-based restoration network with a Swin UNet backbone. It includes adaptive feature fusion and a hybrid loss function.

Result: The method restores underwater images with superior colour fidelity, perceptual quality, and structural details, outperforming state-of-the-art approaches.

Conclusion: The proposed framework effectively addresses underwater image degradation, achieving significant improvements in both quantitative and qualitative assessments.

Abstract: Underwater images typically suffer from severe colour distortions, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adaptively handle diverse degradation conditions and fail to leverage underwater-specific physical priors effectively. In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. Given a degraded underwater image as input, we first predict its degradation level using a lightweight dual-stream convolutional network, generating a continuous degradation score as semantic guidance. Based on this score, we introduce a novel conditional diffusion-based restoration network with a Swin UNet backbone, enabling adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss function that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with SOTA approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments.

[87] UFV-Splatter: Pose-Free Feed-Forward 3D Gaussian Splatting Adapted to Unfavorable Views

Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, Yasuhiro Mukaigawa

Main category: cs.CV

TL;DR: A pose-free, feed-forward 3D Gaussian Splatting (3DGS) framework is introduced to handle unfavorable input views by leveraging pretrained models and novel adaptation techniques.

Motivation: Existing feed-forward 3DGS models are limited to favorable camera views, restricting real-world applicability. This work aims to overcome this by enabling models to handle varying and unknown camera poses.

Method: The framework uses pretrained models with LoRA layers, a Gaussian adapter module for geometric consistency, and Gaussian alignment for accurate rendering. Training leverages an off-the-shelf dataset of favorable images.

Result: Experiments on synthetic (Google Scanned Objects) and real (OmniObject3D) datasets confirm the method’s effectiveness in handling unfavorable views.

Conclusion: The proposed framework successfully extends 3DGS models to unfavorable views, enhancing their real-world utility.

Abstract: This paper presents a pose-free, feed-forward 3D Gaussian Splatting (3DGS) framework designed to handle unfavorable input views. A common rendering setup for training feed-forward approaches places a 3D object at the world origin and renders it from cameras pointed toward the origin – i.e., from favorable views, limiting the applicability of these models to real-world scenarios involving varying and unknown camera poses. To overcome this limitation, we introduce a novel adaptation framework that enables pretrained pose-free feed-forward 3DGS models to handle unfavorable views. We leverage priors learned from favorable images by feeding recentered images into a pretrained model augmented with low-rank adaptation (LoRA) layers. We further propose a Gaussian adapter module to enhance the geometric consistency of the Gaussians derived from the recentered inputs, along with a Gaussian alignment method to render accurate target views for training. Additionally, we introduce a new training strategy that utilizes an off-the-shelf dataset composed solely of favorable images. Experimental results on both synthetic images from the Google Scanned Objects dataset and real images from the OmniObject3D dataset validate the effectiveness of our method in handling unfavorable input views.
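
The LoRA augmentation mentioned here is a standard construction: a frozen linear layer plus a trainable low-rank update B @ A scaled by alpha/r. A compact sketch follows, with the rank, scaling, and choice of which layers receive adapters all assumed for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(256, 256))
y = layer(torch.rand(4, 256))  # identical to the frozen base until B is trained
```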

[88] DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception

Pei Deng, Wenqian Zhou, Hanlin Wu

Main category: cs.CV

TL;DR: The paper introduces RSICA, a new paradigm combining change detection and visual question answering for interactive analysis of bi-temporal remote sensing images, supported by the ChangeChat-105k dataset and the DeltaVLM model.

Motivation: Existing methods for land-cover change analysis are limited to static outputs, lacking interactive, query-driven capabilities.

Method: Proposes DeltaVLM, an end-to-end architecture with a fine-tuned bi-temporal vision encoder, visual difference perception module, and instruction-guided Q-former, trained on the ChangeChat-105k dataset.

Result: DeltaVLM achieves state-of-the-art performance in single-turn captioning and multi-turn interactive change analysis.

Conclusion: The RSICA paradigm and DeltaVLM model effectively enable interactive, instruction-guided exploration of changes in remote sensing images.

Abstract: Accurate interpretation of land-cover changes in multi-temporal satellite imagery is critical for real-world scenarios. However, existing methods typically provide only one-shot change masks or static captions, limiting their ability to support interactive, query-driven analysis. In this work, we introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering to enable multi-turn, instruction-guided exploration of changes in bi-temporal remote sensing images. To support this task, we construct ChangeChat-105k, a large-scale instruction-following dataset, generated through a hybrid rule-based and GPT-assisted process, covering six interaction types: change captioning, classification, quantification, localization, open-ended question answering, and multi-turn dialogues. Building on this dataset, we propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring (CSRM) mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information from visual changes, aligning them with textual instructions. We train DeltaVLM on ChangeChat-105k using a frozen large language model, adapting only the vision and alignment modules to optimize efficiency. Extensive experiments and ablation studies demonstrate that DeltaVLM achieves state-of-the-art performance on both single-turn captioning and multi-turn interactive change analysis, outperforming existing multimodal large language models and remote sensing vision-language models. Code, dataset and pre-trained weights are available at https://github.com/hanlinwu/DeltaVLM.

[89] FaceGCD: Generalized Face Discovery via Dynamic Prefix Generation

Yunseok Oh, Dong-Wan Choi

Main category: cs.CV

TL;DR: The paper introduces Generalized Face Discovery (GFD), a novel open-world face recognition task combining traditional identification with generalized category discovery (GCD). It proposes FaceGCD, a dynamic method using lightweight prefixes for feature extraction, outperforming existing GCD methods and ArcFace.

Motivation: To address the challenge of recognizing both labeled and unlabeled known identities while discovering new ones in face recognition, advancing toward artificial general intelligence (AGI).

Method: FaceGCD dynamically constructs instance-specific feature extractors using lightweight, layer-wise prefixes generated by a HyperNetwork conditioned on input images.

Result: FaceGCD significantly outperforms existing GCD methods and ArcFace, achieving state-of-the-art results on GFD.

Conclusion: FaceGCD advances open-world face recognition by effectively handling high cardinality and fine-grained face IDs, setting a new benchmark for the GFD task.

Abstract: Recognizing and differentiating among both familiar and unfamiliar faces is a critical capability for face recognition systems and a key step toward artificial general intelligence (AGI). Motivated by this ability, this paper introduces generalized face discovery (GFD), a novel open-world face recognition task that unifies traditional face identification with generalized category discovery (GCD). GFD requires recognizing both labeled and unlabeled known identities (IDs) while simultaneously discovering new, previously unseen IDs. Unlike typical GCD settings, GFD poses unique challenges due to the high cardinality and fine-grained nature of face IDs, rendering existing GCD approaches ineffective. To tackle this problem, we propose FaceGCD, a method that dynamically constructs instance-specific feature extractors using lightweight, layer-wise prefixes. These prefixes are generated on the fly by a HyperNetwork, which adaptively outputs a set of prefix generators conditioned on each input image. This dynamic design enables FaceGCD to capture subtle identity-specific cues without relying on high-capacity static models. Extensive experiments demonstrate that FaceGCD significantly outperforms existing GCD methods and a strong face recognition baseline, ArcFace, achieving state-of-the-art results on the GFD task and advancing toward open-world face recognition.
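
The prefix-generation mechanism can be sketched schematically: a hypernetwork encodes the input image and emits one small prefix tensor per backbone layer, which would be prepended to that layer's token sequence to yield an instance-specific extractor. All dimensions and the stand-in encoder below are illustrative, not FaceGCD's actual architecture.

```python
import torch
import torch.nn as nn

class PrefixHyperNetwork(nn.Module):
    def __init__(self, embed_dim=384, n_layers=12, prefix_len=4):
        super().__init__()
        self.encoder = nn.Sequential(  # tiny stand-in image encoder
            nn.Conv2d(3, 32, 7, stride=4), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, 256), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(256, prefix_len * embed_dim) for _ in range(n_layers))
        self.prefix_len, self.embed_dim = prefix_len, embed_dim

    def forward(self, images):
        h = self.encoder(images)
        return [head(h).view(-1, self.prefix_len, self.embed_dim)
                for head in self.heads]  # one prefix per backbone layer

hyper = PrefixHyperNetwork()
prefixes = hyper(torch.rand(2, 3, 112, 112))
# Inside layer i of a frozen transformer backbone one would then do:
#   tokens = torch.cat([prefixes[i], tokens], dim=1)
print(len(prefixes), prefixes[0].shape)  # 12, torch.Size([2, 4, 384])
```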

[90] GVD: Guiding Video Diffusion Model for Scalable Video Distillation

Kunyang Li, Jeffrey A Chan Santiago, Sarinda Dhanesh Samarasinghe, Gaowen Liu, Mubarak Shah

Main category: cs.CV

TL;DR: GVD (Guiding Video Diffusion) is a diffusion-based video distillation method that efficiently captures spatial and temporal features, outperforming previous methods on MiniUCF and HMDB51 datasets with minimal computational cost.

Motivation: To reduce computation and storage demands of large video datasets while maintaining performance.

Method: Jointly distills spatial and temporal features using diffusion for high-fidelity video generation.

Result: Achieves 78.29% performance with 1.98% frames in MiniUCF and 73.83% with 3.30% frames in HMDB51.

Conclusion: GVD sets a new benchmark for video dataset distillation, offering high performance and efficiency.

Abstract: To address the heavy computation and storage requirements associated with large video datasets, video dataset distillation aims to capture spatial and temporal information in a significantly smaller dataset, such that training on the distilled data has comparable performance to training on all of the data. We propose GVD: Guiding Video Diffusion, the first diffusion-based video distillation method. GVD jointly distills spatial and temporal features, ensuring high-fidelity video generation across diverse actions while capturing essential motion information. Our method’s diverse yet representative distillations significantly outperform previous state-of-the-art approaches on the MiniUCF and HMDB51 datasets across 5, 10, and 20 Instances Per Class (IPC). Specifically, our method achieves 78.29 percent of the original dataset’s performance using only 1.98 percent of the total number of frames in MiniUCF. Additionally, it reaches 73.83 percent of the performance with just 3.30 percent of the frames in HMDB51. Experimental results across benchmark video datasets demonstrate that GVD not only achieves state-of-the-art performance but can also generate higher resolution videos and higher IPC without significantly increasing computational cost.

[91] Object Recognition Datasets and Challenges: A Review

Aria Salari, Abtin Djavadifar, Xiangrui Liu, Homayoun Najjaran

Main category: cs.CV

TL;DR: A survey analyzing over 160 object recognition datasets, their characteristics, benchmarks, and evaluation metrics, highlighting their importance in deep learning and computer vision.

Motivation: To understand the role of datasets in advancing object recognition research, especially with the rise of deep learning, and to provide a comprehensive resource for researchers.

Method: Detailed statistical analysis and descriptions of datasets, along with an overview of benchmarks, competitions, and evaluation metrics.

Result: Identification of key datasets and benchmarks, with insights into their utility and impact on object recognition research.

Conclusion: Datasets are crucial for benchmarking and advancing object recognition, and this survey serves as a valuable resource for researchers in the field.

Abstract: Object recognition is among the fundamental tasks in computer vision applications, paving the way for all other image understanding operations. In every stage of progress in object recognition research, efforts have been made to collect and annotate new datasets to match the capacity of the state-of-the-art algorithms. In recent years, the importance of the size and quality of datasets has intensified as the utility of the emerging deep network techniques heavily relies on training data. Furthermore, datasets provide a fair means of benchmarking for competitions and have proved instrumental to the advancement of object recognition research by providing quantifiable benchmarks for the developed models. Taking a closer look at the characteristics of commonly-used public datasets seems to be an important first step for data-driven and machine learning researchers. In this survey, we provide a detailed analysis of datasets in the highly investigated object recognition areas. More than 160 datasets have been scrutinized through statistics and descriptions. Additionally, we present an overview of the prominent object recognition benchmarks and competitions, along with a description of the metrics widely adopted for evaluation purposes in the computer vision community. All introduced datasets and challenges can be found online at github.com/AbtinDjavadifar/ORDC.

[92] Exploring the Application of Visual Question Answering (VQA) for Classroom Activity Monitoring

Sinh Trong Vu, Hieu Trung Pham, Dung Manh Nguyen, Hieu Minh Hoang, Nhu Hoang Le, Thu Ha Pham, Tai Tan Mai

Main category: cs.CV

TL;DR: The paper explores using state-of-the-art VQA models (LLaMA2, LLaMA3, QWEN3, NVILA) for classroom behavior analysis, introducing the BAV-Classroom-VQA dataset for evaluation. Results show promising performance for future classroom analytics.

Motivation: Classroom behavior monitoring is crucial for student engagement and learning outcomes. VQA models provide automated tools for analyzing classroom interactions from videos.

Method: Evaluated open-source VQA models on the BAV-Classroom-VQA dataset, detailing data collection, annotation, and benchmarking.

Result: All four VQA models showed promising performance in answering behavior-related visual questions.

Conclusion: VQA models have potential for future classroom analytics and intervention systems.

Abstract: Classroom behavior monitoring is a critical aspect of educational research, with significant implications for student engagement and learning outcomes. Recent advancements in Visual Question Answering (VQA) models offer promising tools for automatically analyzing complex classroom interactions from video recordings. In this paper, we investigate the applicability of several state-of-the-art open-source VQA models, including LLaMA2, LLaMA3, QWEN3, and NVILA, in the context of classroom behavior analysis. To facilitate rigorous evaluation, we introduce our BAV-Classroom-VQA dataset derived from real-world classroom video recordings at the Banking Academy of Vietnam. We present the methodology for data collection, annotation, and benchmark the performance of the selected VQA models on this dataset. Our initial experimental results demonstrate that all four models achieve promising performance levels in answering behavior-related visual questions, showcasing their potential in future classroom analytics and intervention systems.

[93] Gems: Group Emotion Profiling Through Multimodal Situational Understanding

Anubhav Kataria, Surbhi Madan, Shreya Ghosh, Tom Gedeon, Abhinav Dhall

Main category: cs.CV

TL;DR: GEMS introduces a multimodal framework for predicting fine-grained individual, group, and event-level emotions using a swin-transformer and S3Attention architecture, validated on the extended VGAF-GEMS dataset.

Motivation: Understanding multi-person social situations requires analyzing emotions at individual, group, and event levels, which existing benchmarks lack.

Method: GEMS uses a multimodal swin-transformer and S3Attention architecture to process scenes, group members, and context for joint emotion predictions.

Result: GEMS outperforms adapted state-of-the-art models on the VGAF-GEMS benchmark, providing holistic emotion analysis.

Conclusion: GEMS advances multi-person emotion analysis and encourages further research, with code and data publicly available.

Abstract: Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting emotions from fine-grained individual level through to coarse-grained group and event level. We introduce GEMS, which leverages a multimodal Swin-transformer and S3Attention-based architecture that processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine-grained and holistic analysis on top of the existing group-level annotation of the VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way for further research. The code and data are available at: https://github.com/katariaak579/GEMS

[94] On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations

Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian

Main category: cs.CV

TL;DR: The paper exposes vulnerabilities in Vision-Language Models (VLMs) under subtle frequency-domain perturbations, affecting DeepFake detection and image captioning.

Motivation: To investigate and highlight the fragility of VLMs when exposed to structured frequency-domain perturbations, challenging their reliability in real-world applications.

Method: Design targeted frequency-domain image transformations to perturb VLMs and evaluate their impact on authenticity detection and captioning tasks across five state-of-the-art models.

Result: VLMs are sensitive to frequency-based cues, and their outputs can be systematically altered by imperceptible perturbations, undermining reliability.

Conclusion: The findings emphasize the need for more robust multimodal perception systems due to VLMs’ vulnerability to frequency-domain attacks.

Abstract: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including through captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs when exposed to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations, operating in the frequency domain to systematically adjust VLM outputs when exposed to frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs which includes different-parameter Qwen2/2.5 and BLIP models. Experimenting across ten real and generated image datasets reveals that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually-imperceptible spatial frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection tasks. Our findings under realistic, black-box constraints challenge the reliability of VLMs, underscoring the need for robust multimodal perception systems.
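
A minimal example of the class of transformation studied, scaling the amplitudes in one spatial-frequency band and inverting the FFT, is given below. The band limits and gain are arbitrary illustrations; the paper designs targeted versions of such perturbations.

```python
import numpy as np

def band_perturb(image: np.ndarray, lo=0.25, hi=0.5, gain=1.15) -> np.ndarray:
    """image: HxW grayscale in [0, 1]; perturbs a mid-frequency annulus."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    mask = (radius >= lo) & (radius < hi)
    spec[mask] *= gain  # amplify the chosen band; phase is left untouched
    out = np.real(np.fft.ifft2(np.fft.ifftshift(spec)))
    return np.clip(out, 0.0, 1.0)

img = np.random.rand(224, 224)
pert = band_perturb(img)
print(np.abs(pert - img).max())  # small in pixel space, structured in frequency
```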

[95] MINR: Implicit Neural Representations with Masked Image Modelling

Sua Lee, Joonhun Lee, Myungjoo Kang

Main category: cs.CV

TL;DR: MINR combines implicit neural representations with masked image modeling for robust, generalizable image reconstructions, outperforming MAE in in-domain and out-of-distribution settings.

Motivation: Address limitations of MAE, such as dependency on masking strategies and degraded performance on out-of-distribution data.

Method: Introduces MINR, a framework integrating implicit neural representations with masked image modeling to learn continuous image functions.

Result: MINR outperforms MAE in both in-domain and out-of-distribution scenarios while reducing model complexity.

Conclusion: MINR is a versatile, robust, and efficient alternative for self-supervised learning applications.

Abstract: Self-supervised learning methods like masked autoencoders (MAE) have shown significant promise in learning robust feature representations, particularly in image reconstruction-based pretraining task. However, their performance is often strongly dependent on the masking strategies used during training and can degrade when applied to out-of-distribution data. To address these limitations, we introduce the masked implicit neural representations (MINR) framework that synergizes implicit neural representations with masked image modeling. MINR learns a continuous function to represent images, enabling more robust and generalizable reconstructions irrespective of masking strategies. Our experiments demonstrate that MINR not only outperforms MAE in in-domain scenarios but also in out-of-distribution settings, while reducing model complexity. The versatility of MINR extends to various self-supervised learning applications, confirming its utility as a robust and efficient alternative to existing frameworks.
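
The central idea, fitting a continuous function to only the unmasked pixels and then reading it out everywhere, fits in a toy script. The Fourier-feature encoding and MLP below are common INR choices assumed for illustration, not MINR's exact architecture.

```python
import torch
import torch.nn as nn

def fourier(coords, n=16):  # coords in [-1, 1], shape (P, 2)
    freqs = 2.0 ** torch.arange(n)
    ang = coords[:, None, :] * freqs[None, :, None] * torch.pi
    return torch.cat([ang.sin(), ang.cos()], dim=1).flatten(1)  # (P, 4n)

h = w = 32
img = torch.rand(h, w)                                 # stand-in target image
ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                        torch.linspace(-1, 1, w), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
visible = torch.rand(h * w) > 0.75                     # keep ~25% of pixels

mlp = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
feats = fourier(coords)
for _ in range(500):  # fit the INR on unmasked pixels only
    pred = mlp(feats[visible]).squeeze(-1)
    loss = ((pred - img.reshape(-1)[visible]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

recon = mlp(feats).reshape(h, w)  # continuous function evaluated everywhere
```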

[96] OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing

Xiang Xiang, Zhuo Xu, Yao Deng, Qinhao Zhou, Yifan Liang, Ke Chen, Qingfang Zheng, Yaowei Wang, Xilin Chen, Wen Gao

Main category: cs.CV

TL;DR: The paper introduces OpenEarthSensing (OES), a large-scale benchmark for open-world remote sensing tasks, addressing semantic and covariate shifts in diverse data domains.

Motivation: Remote sensing models face challenges adapting to new data with shifts from training data, lacking large-scale benchmarks for evaluation.

Method: OES includes 189 categories and five data domains (two RGB satellite, one RGB aerial, one multispectral RGB, and one infrared) to test generalization.

Result: OES proves challenging for existing methods, serving as a robust benchmark for open-world remote sensing.

Conclusion: OES fills the gap in evaluating open-world tasks, offering a comprehensive testbed for future research.

Abstract: The advancement of remote sensing, including satellite systems, facilitates the continuous acquisition of remote sensing imagery globally, introducing novel challenges for achieving open-world tasks. Deployed models need to continuously adjust to a constant influx of new data, which frequently exhibits diverse shifts from the data encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update their parameters without forgetting learned knowledge, as has been considered in works on a variety of open-world tasks. However, existing studies are typically conducted within a single dataset to simulate realistic conditions, with a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce OpenEarthSensing (OES), a large-scale fine-grained benchmark for open-world remote sensing. OES includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, to provide a more comprehensive testbed for evaluating the generalization performance, OES encompasses five data domains with significant covariate shifts, including two RGB satellite domains, one RGB aerial domain, one multispectral RGB domain, and one infrared domain. We evaluate the baselines and existing methods for diverse tasks on OES, demonstrating that it serves as a meaningful and challenging benchmark for open-world remote sensing. The proposed dataset OES is available at https://haiv-lab.github.io/OES.

[97] UAVScenes: A Multi-Modal Dataset for UAVs

Sijie Wang, Siqi Li, Yawei Zhang, Shangshu Yu, Shenghai Yuan, Rui She, Quanjiang Guo, JinXuan Zheng, Ong Kang Howe, Leonrich Chandra, Shrivarshann Srijeyan, Aditya Sivadas, Toshan Aggarwal, Heyuan Liu, Hongming Zhang, Chujie Chen, Junyu Jiang, Lihua Xie, Wee Peng Tay

Main category: cs.CV

TL;DR: UAVScenes is a new multi-modal UAV dataset with frame-wise annotations for images and LiDAR, enabling high-level scene understanding tasks like segmentation and localization.

Motivation: Existing UAV datasets lack frame-wise annotations, limiting their use for advanced perception tasks.

Method: Enhanced the MARS-LVIG dataset by adding manual semantic labels for images and LiDAR, plus 6-DoF poses.

Result: UAVScenes supports tasks like segmentation, depth estimation, localization, place recognition, and novel view synthesis.

Conclusion: UAVScenes fills a gap in UAV datasets, facilitating broader multi-modal perception research.

Abstract: Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs’ surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes

[98] The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?

Dinh Nam Pham, Eleftherios Avramidis

Main category: cs.CV

TL;DR: The paper investigates the role of non-manual facial features in automatic sign language recognition (ASLR), finding the mouth to be the most impactful feature.

Motivation: Non-manual facial features are crucial in sign language but underexplored in ASLR. Prior work lacks systematic analysis of distinct facial regions.

Method: Uses two deep learning models (CNN and transformer) on an SLR dataset to evaluate contributions of eyes, mouth, and full face.

Result: Mouth is the most important non-manual feature, significantly boosting recognition accuracy.

Conclusion: Facial features, especially the mouth, are essential for improving ASLR systems.

Abstract: Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.

[99] Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching

Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran

Main category: cs.CV

TL;DR: The paper proposes a method using conditional flow matching to quantify aleatoric uncertainty in medical image segmentation, outperforming current diffusion-based approaches in accuracy and reliability.

Motivation: Aleatoric uncertainty in medical image segmentation reflects natural variability among expert annotators, but current methods, including diffusion-based approaches, have limitations in accurately capturing this uncertainty.

Method: The proposed method leverages conditional flow matching, a simulation-free flow-based generative model, to learn exact densities and produce accurate segmentation samples. It samples multiple data points to reflect pixel-wise variance and inter-annotator differences.

Result: The method achieves competitive segmentation accuracy and generates reliable uncertainty maps, particularly in regions with ambiguous boundaries.

Conclusion: The approach effectively quantifies aleatoric uncertainty, offering deeper insights into segmentation reliability, and the code is publicly available.

Abstract: Quantifying aleatoric uncertainty in medical image segmentation is critical since it reflects the natural variability observed among expert annotators. A conventional approach is to model the segmentation distribution with a generative model, but current methods limit the expressive ability of such models. While current diffusion-based approaches have demonstrated impressive performance in approximating the data distribution, their inherent stochastic sampling process and inability to model exact densities limit their effectiveness in accurately capturing uncertainty. In contrast, our proposed method leverages conditional flow matching, a simulation-free flow-based generative model that learns an exact density, to produce highly accurate segmentation results. By guiding the flow model on the input image and sampling multiple data points, our approach synthesizes segmentation samples whose pixel-wise variance reliably reflects the underlying data distribution. This sampling strategy captures uncertainties in regions with ambiguous boundaries, offering robust quantification that mirrors inter-annotator differences. Experimental results demonstrate that our method not only achieves competitive segmentation accuracy but also generates uncertainty maps that provide deeper insights into the reliability of the segmentation outcomes. The code for this paper is freely available at https://github.com/huynhspm/Data-Uncertainty
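
The conditional flow matching objective underpinning the method is compact: regress a velocity network toward the straight-line displacement between a noise sample and a ground-truth mask along a linear interpolation path. The toy network and flattened shapes below are stand-ins for the paper's model.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, mask, image_feat):
    x0 = torch.randn_like(mask)            # noise sample
    t = torch.rand(mask.shape[0], 1)       # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * mask          # linear interpolation path
    target_v = mask - x0                   # its constant velocity
    return ((model(x_t, t, image_feat) - target_v) ** 2).mean()

dim = 64  # flattened toy mask / conditioning size
model = VelocityNet(dim)
loss = cfm_loss(model, torch.rand(8, dim), torch.rand(8, dim))
loss.backward()
# At inference, integrating dx/dt = v from t=0 to 1 over several random x0
# draws yields multiple segmentation samples; their pixel-wise variance
# serves as the aleatoric uncertainty map.
```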

[100] Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking

Shahla John

Main category: cs.CV

TL;DR: A unified framework for real-time video analysis using spatial-temporal modeling and hierarchical attention achieves state-of-the-art performance with faster inference.

Motivation: Balancing accuracy and speed in real-time video analysis is challenging, especially in resource-constrained environments.

Method: Leverages parallel sequence modeling and introduces a hierarchical attention mechanism for adaptive spatial-temporal focus.

Result: Improves action recognition by 3.2% and tracking precision by 2.8%, with 40% faster inference on UCF-101, HMDB-51, and MOT17 datasets.

Conclusion: The proposed framework effectively balances accuracy and speed, advancing real-time video analysis.

Abstract: Real-time video analysis remains a challenging problem in computer vision, requiring efficient processing of both spatial and temporal information while maintaining computational efficiency. Existing approaches often struggle to balance accuracy and speed, particularly in resource-constrained environments. In this work, we present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our approach builds upon recent advances in parallel sequence modeling and introduces a novel hierarchical attention mechanism that adaptively focuses on relevant spatial regions across temporal sequences. We demonstrate that our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds. Extensive experiments on UCF-101, HMDB-51, and MOT17 datasets show improvements of 3.2% in action recognition accuracy and 2.8% in tracking precision compared to existing methods, with 40% faster inference time.

[101] HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

Zhixiang Wei, Guangting Wang, Xiaoxiao Ma, Ke Mei, Huaian Chen, Yi Jin, Fengyun Rao

Main category: cs.CV

TL;DR: The paper introduces an LVLM-driven data refinement pipeline to enhance image-text pair data, resulting in improved CLIP models like HQ-CLIP, which outperforms standard CLIP on benchmarks.

Motivation: To explore whether LVLMs can reciprocally improve image-text data quality, creating a self-reinforcing cycle for continuous enhancement.

Method: Leverages LVLMs to generate multi-grained textual annotations (positive/negative descriptions and tags) from raw data, refining datasets like DFN-Large into VLM-150M. A training paradigm extends contrastive learning with these annotations.

Result: HQ-CLIP achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained tasks, even surpassing CLIP trained on 10x larger datasets.

Conclusion: The proposed pipeline and training paradigm effectively enhance data quality and model performance, demonstrating the potential for iterative improvement in vision-language models.

Abstract: Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on the DFN-2B dataset, which contains 10× more training data than ours. All code, data, and models are available at https://zxwei.site/hqclip.
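
One way the negative descriptions can enter training, consistent with (though not necessarily identical to) the paper's paradigm, is as extra columns of the contrastive similarity matrix that the image must not match. Shapes and loss weighting below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_negatives(img_emb, txt_emb, neg_txt_emb, temp=0.07):
    """img_emb, txt_emb: (B, D); neg_txt_emb: (B, D) negative caption per image."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)
    pos_sim = img @ txt.t() / temp                      # (B, B) in-batch pairs
    neg_sim = (img * neg).sum(-1, keepdim=True) / temp  # (B, 1) own hard negative
    logits = torch.cat([pos_sim, neg_sim], dim=1)       # (B, B+1)
    labels = torch.arange(img.size(0))
    # Image-to-text side sees the extra negative; text side is standard InfoNCE.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(pos_sim.t(), labels))

loss = clip_loss_with_negatives(torch.randn(8, 512), torch.randn(8, 512),
                                torch.randn(8, 512))
```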

[102] From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras

Youngho Kim, Hoonhee Cho, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: A novel domain adaptation approach for human pose estimation uses event cameras and a student-teacher framework to handle motion blur, outperforming conventional methods without needing target domain annotations.

DetailsMotivation: Motion blur in rapid motion or low-light conditions degrades pose estimation, but most datasets lack blurred images, creating a domain gap.

Method: Leverages event cameras for high-resolution motion data and introduces event-based augmentation and a student-teacher framework with mutual uncertainty masking.

Result: Outperforms traditional domain-adaptive methods, enabling robust pose estimation in blurred environments without target domain annotations.

Conclusion: Event cameras and the proposed framework offer a scalable solution for domain adaptation in real-world motion blur scenarios.

Abstract: Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. Our project codes are available at https://github.com/kmax2001/EvSharp2Blur.
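
A rough sketch of how mutual uncertainty masking might filter pseudo-labels: a joint is kept only when both the teacher and the student are confident about it. The heatmap representation and the threshold are assumptions for illustration, not the authors' exact design.

```python
import torch

def mutual_uncertainty_mask(teacher_heatmaps, student_heatmaps, tau=0.5):
    """Keep a pseudo-labeled joint only when BOTH networks are confident
    (a sketch of mutual uncertainty masking; tau is an assumed threshold).

    *_heatmaps: (B, J, H, W) per-joint confidence maps.
    Returns teacher-argmax pseudo-label coordinates and a (B, J) validity mask.
    """
    B, J, H, W = teacher_heatmaps.shape
    t_conf, t_idx = teacher_heatmaps.flatten(2).max(dim=2)   # (B, J)
    s_conf, _ = student_heatmaps.flatten(2).max(dim=2)
    mask = (t_conf > tau) & (s_conf > tau)                   # mutual confidence
    coords = torch.stack((t_idx % W, t_idx // W), dim=-1)    # (B, J, 2) x,y labels
    return coords, mask
```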

[103] Exploiting Diffusion Prior for Task-driven Image Restoration

Jaeha Kim, Junghun Oh, Kyoung Mu Lee

Main category: cs.CV

TL;DR: EDTR leverages diffusion prior to restore task-relevant details in images degraded by multiple factors, improving task performance and visual quality.

DetailsMotivation: Address performance drops in high-level vision tasks due to low-quality inputs and the challenge of restoring images degraded by multiple complex factors.

Method: Proposes EDTR, which exploits the diffusion prior by generating from pixel-error-based pre-restored images with mild noise added, and employs a few denoising steps to avoid generating redundant details.

Result: Effectively utilizes diffusion prior for TDIR, enhancing task performance and visual quality across diverse tasks with complex degradations.

Conclusion: EDTR successfully harnesses diffusion prior to restore task-relevant details, outperforming previous methods in handling complex degradations.

Abstract: Task-driven image restoration (TDIR) has recently emerged to address performance drops in high-level vision tasks caused by low-quality (LQ) inputs. Previous TDIR methods struggle to handle practical scenarios in which images are degraded by multiple complex factors, leaving minimal clues for restoration. This motivates us to leverage the diffusion prior, one of the most powerful natural image priors. However, while the diffusion prior can help generate visually plausible results, using it to restore task-relevant details remains challenging, even when combined with recent TDIR methods. To address this, we propose EDTR, which effectively harnesses the power of diffusion prior to restore task-relevant details. Specifically, we propose directly leveraging useful clues from LQ images in the diffusion process by generating from pixel-error-based pre-restored LQ images with mild noise added. Moreover, we employ a small number of denoising steps to prevent the generation of redundant details that dilute crucial task-related information. We demonstrate that our method effectively utilizes diffusion prior for TDIR, significantly enhancing task performance and visual quality across diverse tasks with multiple complex degradations.
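
The core trick is to start the diffusion process from the pre-restored image rather than pure noise, and to run only a few steps. Here is a generic DDIM-like sketch of that idea; the denoiser interface, starting timestep, and step count are assumptions, not EDTR's actual configuration.

```python
import torch

@torch.no_grad()
def edtr_style_restore(denoiser, pre_restored, alphas_cumprod, t_start=200, n_steps=5):
    """Diffusion-prior restoration sketch: inject mild noise into the
    pixel-error-based pre-restored image, then take only a few deterministic
    denoising steps (t_start and n_steps are illustrative assumptions)."""
    ts = torch.linspace(t_start, 0, n_steps + 1).long()
    a = alphas_cumprod[ts[0]]
    x = a.sqrt() * pre_restored + (1 - a).sqrt() * torch.randn_like(pre_restored)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        eps = denoiser(x, t)                              # predicted noise
        a_t, a_p = alphas_cumprod[t], alphas_cumprod[t_prev]
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # predicted clean image
        x = a_p.sqrt() * x0 + (1 - a_p).sqrt() * eps      # deterministic DDIM step
    return x
```

Keeping the chain short both saves compute and, as the paper argues, prevents the prior from hallucinating details that dilute task-relevant information.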

[104] Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation

Zheng Xiangyu, He Songcheng, Li Wanyun, Li Xiaoqiang, Zhang Wei

Main category: cs.CV

TL;DR: The paper introduces HMHI-Net, a hierarchical memory architecture for UVOS, addressing the over-reliance on high-level semantic features by incorporating both shallow- and high-level features, achieving state-of-the-art results.

DetailsMotivation: Existing UVOS methods rely too much on high-level semantic features, lacking fine-grained information, leading to marginal performance gains despite sophisticated memory designs.

Method: Proposes a hierarchical memory architecture with heterogeneous interaction (PLAM and SGIM modules) to integrate pixel and semantic information.

Result: HMHI-Net achieves state-of-the-art performance in UVOS and video saliency detection benchmarks, demonstrating robustness across different backbones.

Conclusion: The proposed hierarchical memory and interaction mechanism effectively addresses the limitations of existing UVOS methods, improving performance and robustness.

Abstract: Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently suffers from a lack of fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory designs relying solely on high-level features, which predominantly capture abstract semantic cues, are insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture that incorporates both shallow- and high-level features for memory, leveraging the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of the Pixel-guided Local Alignment Module (PLAM) and the Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net.
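
To make the two-level memory concrete, here is a minimal sketch that reads a shallow (pixel-detail) bank and a deep (semantic) bank with attention and concatenates the cues; the PLAM/SGIM internals are not specified in the abstract, so this plain attention read is an assumption.

```python
import torch
import torch.nn.functional as F

def hierarchical_memory_read(query_feat, shallow_mem, deep_mem):
    """Read both memory levels with attention, then fuse (a minimal sketch of
    the hierarchical-memory idea; not the actual PLAM/SGIM modules).

    query_feat: (N, D) current-frame features
    shallow_mem / deep_mem: (M, D) memorized shallow / high-level features
    """
    def attend(q, mem):
        attn = F.softmax(q @ mem.t() / q.size(-1) ** 0.5, dim=-1)  # (N, M)
        return attn @ mem                                          # (N, D)

    pixel_cue = attend(query_feat, shallow_mem)    # fine-grained detail cues
    semantic_cue = attend(query_feat, deep_mem)    # abstract semantic cues
    return torch.cat([query_feat, pixel_cue, semantic_cue], dim=-1)
```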

[105] Visual Language Models as Zero-Shot Deepfake Detectors

Viacheslav Pirogov

Main category: cs.CV

TL;DR: A novel VLM-based approach for deepfake detection outperforms traditional methods, leveraging zero-shot capabilities and a high-quality dataset.

DetailsMotivation: Deepfakes pose a growing threat, but existing detection methods lack robustness by focusing solely on image classification without auxiliary tasks.

Method: Proposes a VLM-based approach, tested on a 60,000-image dataset and compared with traditional methods on DFDC-P in zero-shot and fine-tuning scenarios.

Result: VLMs, especially InstructBLIP, show superior performance over traditional classifiers in deepfake detection.

Conclusion: VLMs offer a robust and effective solution for deepfake detection, surpassing conventional methods.

Abstract: The contemporary phenomenon of deepfakes, utilizing GAN or diffusion models for face swapping, presents a substantial and evolving threat in digital media, identity verification, and a multitude of other systems. The majority of existing methods for detecting deepfakes rely on training specialized classifiers to distinguish between genuine and manipulated images, focusing only on the image domain without incorporating any auxiliary tasks that could enhance robustness. In this paper, inspired by the zero-shot capabilities of Vision Language Models, we propose a novel VLM-based approach to image classification and then evaluate it for deepfake detection. Specifically, we utilize a new high-quality deepfake dataset comprising 60,000 images, on which our zero-shot models demonstrate superior performance to almost all existing methods. Subsequently, we compare the performance of the best-performing architecture, InstructBLIP, on the popular deepfake dataset DFDC-P against traditional methods in two scenarios: zero-shot and in-domain fine-tuning. Our results demonstrate the superiority of VLMs over traditional classifiers.
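
For readers who want to try the zero-shot setup, a minimal InstructBLIP query via Hugging Face transformers looks like the sketch below. The prompt wording and the keyword-based decision rule are assumptions; the paper does not publish its exact prompt here.

```python
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

# Zero-shot deepfake query with InstructBLIP (sketch; prompt and decision
# rule are illustrative assumptions, not the authors' exact setup).
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", device_map="auto")

image = Image.open("face.jpg").convert("RGB")
prompt = "Is this photograph of a face real or a deepfake? Answer with one word."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=5)
answer = processor.batch_decode(out, skip_special_tokens=True)[0].strip().lower()
is_fake = "fake" in answer  # crude keyword rule for the binary decision
```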

[106] LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks

Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mengfei Shi, Xia Xie, Shengyong Chen

Main category: cs.CV

TL;DR: Proposes LIDAR, a lightweight network for efficient multimodal crack segmentation, outperforming SOTA methods with minimal computational cost.

DetailsMotivation: Addressing the challenge of achieving pixel-level crack segmentation with low computational cost and adaptive cross-modal feature fusion.

Method: Introduces LIDAR with LacaVSS (adaptive cue modeling) and LD3CF (cross-modal fusion), using LDMK for efficient computation.

Result: Achieves 0.8204 F1 and 0.8465 mIoU on the light-field depth dataset with a model of only 5.35M parameters.

Conclusion: LIDAR is effective for crack segmentation, offering high performance with low computational overhead.

Abstract: Achieving pixel-level segmentation with low computational cost using multimodal data remains a key challenge in crack segmentation tasks. Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. Specifically, LIDAR is composed of a Lightweight Adaptive Cue-Aware Visual State Space module (LacaVSS) and a Lightweight Dual Domain Dynamic Collaborative Fusion module (LD3CF). LacaVSS adaptively models crack cues through the proposed mask-guided Efficient Dynamic Guided Scanning Strategy (EDG-SS), while LD3CF leverages an Adaptive Frequency Domain Perceptron (AFDP) and a dual-pooling fusion strategy to effectively capture spatial and frequency-domain cues across modalities. Moreover, we design a Lightweight Dynamically Modulated Multi-Kernel convolution (LDMK) to perceive complex morphological structures with minimal computational overhead, replacing most convolutional operations in LIDAR. Experiments on three datasets demonstrate that our method outperforms other state-of-the-art (SOTA) methods. On the light-field depth dataset, our method achieves 0.8204 in F1 and 0.8465 in mIoU with only 5.35M parameters. Code and datasets are available at https://github.com/Karl1109/LIDAR-Mamba.
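
The LDMK idea of cheap, input-adaptive multi-kernel perception can be sketched as a few depthwise branches blended by a learned gate. Kernel sizes and the gating network below are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiKernelConv(nn.Module):
    """Dynamically modulated multi-kernel convolution (a sketch of the LDMK
    idea; kernel sizes and gate design are assumptions)."""

    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes)  # depthwise branches keep the cost low
        self.gate = nn.Sequential(  # input-conditioned weights over branches
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(kernel_sizes)), nn.Softmax(dim=1))

    def forward(self, x):
        w = self.gate(x)                                           # (B, K)
        outs = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```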

[107] Estimating 2D Camera Motion with Hybrid Motion Basis

Haipeng Li, Tianhao Zhou, Zhanglei Yang, Yi Wu, Yan Chen, Zijing Mao, Shen Cheng, Bing Zeng, Shuaicheng Liu

Main category: cs.CV

TL;DR: CamFlow introduces a hybrid motion basis framework for 2D camera motion estimation, outperforming existing methods by combining physical and stochastic bases and using a robust probabilistic loss.

DetailsMotivation: Current methods for 2D camera motion estimation are limited by planar scene assumptions or struggle with complex transformations. CamFlow addresses these limitations by leveraging hybrid motion patterns.

Method: CamFlow uses hybrid motion bases (physical and stochastic) and a Laplace-based probabilistic loss for robust training. A new benchmark isolates pure camera motion by masking dynamic objects.

Result: CamFlow outperforms state-of-the-art methods in diverse scenarios, showing superior robustness and generalization in zero-shot settings.

Conclusion: CamFlow provides a novel, effective solution for 2D camera motion estimation, validated by a new benchmark and superior performance.

Abstract: Estimating 2D camera motion is a fundamental computer vision task that models the projection of 3D camera movements onto the 2D image plane. Current methods rely on either homography-based approaches, limited to planar scenes, or meshflow techniques that use grid-based local homographies but struggle with complex non-linear transformations. A key insight of our work is that combining flow fields from different homographies creates motion patterns that cannot be represented by any single homography. We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. Our approach includes a hybrid probabilistic loss function based on the Laplace distribution that enhances training robustness. For evaluation, we create a new benchmark by masking dynamic objects in existing optical flow datasets to isolate pure camera motion. Experiments show CamFlow outperforms state-of-the-art methods across diverse scenarios, demonstrating superior robustness and generalization in zero-shot settings. Code and datasets are available at our project page: https://lhaippp.github.io/CamFlow/.
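
The key representational idea — a flow field expressed as a linear combination of basis flow fields — can be illustrated with a plain least-squares fit. CamFlow's actual bases and robust (Laplace-based) loss are not reproduced here; this is only the underlying algebra.

```python
import numpy as np

def fit_motion_basis(flow, bases):
    """Express an observed 2D motion field as a linear combination of
    motion-basis fields (a sketch; CamFlow's hybrid bases and probabilistic
    loss are not reproduced).

    flow:  (H, W, 2) observed camera-motion field
    bases: (K, H, W, 2) basis fields (e.g. physical + stochastic)
    """
    A = bases.reshape(len(bases), -1).T        # (H*W*2, K) design matrix
    b = flow.reshape(-1)                       # (H*W*2,)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    reconstruction = (A @ coeffs).reshape(flow.shape)
    return coeffs, reconstruction
```

Because different homographies' flow fields mix into patterns no single homography can express, a learned basis of this form captures motions that homography- or meshflow-based models miss.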

[108] Robust Adverse Weather Removal via Spectral-based Spatial Grouping

Yuhwan Jeong, Yunseo Yang, Youngjo Yoon, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: Proposes SSGformer, a spectral-based spatial grouping transformer for multi-weather image restoration, addressing localized distortions via spectral decomposition and group-wise attention.

DetailsMotivation: Adverse weather conditions cause complex degradations, and existing All-in-One models struggle with localized distortions.

Method: Decomposes images into high/low-frequency features, uses multi-head linear attention, and introduces a grouping-mask with group-wise attention for restoration.

Result: Superior performance in handling diverse adverse weather degradations.

Conclusion: SSGformer effectively addresses varied weather distortions through spectral decomposition and attention mechanisms.

Abstract: Adverse weather conditions cause diverse and complex degradation patterns, driving the development of All-in-One (AiO) models. However, recent AiO solutions still struggle to capture diverse degradations, since global filtering methods like direct operations on the frequency domain fail to handle highly variable and localized distortions. To address these issues, we propose the Spectral-based Spatial Grouping Transformer (SSGformer), a novel approach that leverages spectral decomposition and group-wise attention for multi-weather image restoration. SSGformer decomposes images into high-frequency edge features using conventional edge detection and low-frequency information via Singular Value Decomposition. We utilize multi-head linear attention to effectively model the relationship between these features. The fused features are integrated with the input to generate a grouping-mask that clusters regions based on spatial similarity and image texture. To fully leverage this mask, we introduce a group-wise attention mechanism, enabling robust adverse weather removal and ensuring consistent performance across diverse weather conditions. We also propose a Spatial Grouping Transformer Block that uses both channel attention and spatial attention, effectively balancing feature-wise relationships and spatial dependencies. Extensive experiments show the superiority of our approach, validating its effectiveness in handling varied and intricate adverse weather degradations.
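
The spectral decomposition itself is simple to reproduce: a truncated SVD keeps the low-frequency structure, while an edge detector supplies the high-frequency component. The rank and the choice of Sobel below are assumptions; the paper says only "conventional edge detection" and SVD.

```python
import numpy as np
from scipy import ndimage

def spectral_decompose(gray, rank=10):
    """Split an image into low-frequency structure (truncated SVD) and
    high-frequency edges (Sobel), mirroring SSGformer's decomposition
    (rank and edge detector choice are assumptions)."""
    U, S, Vt = np.linalg.svd(gray, full_matrices=False)
    low_freq = (U[:, :rank] * S[:rank]) @ Vt[:rank]   # rank-r reconstruction
    high_freq = np.hypot(ndimage.sobel(gray, 0), ndimage.sobel(gray, 1))
    return low_freq, high_freq
```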

[109] AlphaDent: A dataset for automated tooth pathology detection

Evgeniy I. Sosnin, Yuriy L. Vasilev, Roman A. Solovyev, Aleksandr L. Stempkovskiy, Dmitry V. Telpukhov, Artem A. Vasilev, Aleksandr A. Amerikanov, Aleksandr Y. Romanov

Main category: cs.CV

TL;DR: A new dental dataset, AlphaDent, with 1200+ images from 295 patients, labeled for instance segmentation into 9 classes, is introduced. Experiments show high-quality predictions, and all resources are open-source.

DetailsMotivation: To provide a unique, labeled dataset for dental research, facilitating instance segmentation tasks in dentistry.

Method: The dataset was created using DSLR camera photographs, labeled for instance segmentation, and used to train neural networks.

Result: High-quality predictions were achieved in instance segmentation experiments.

Conclusion: AlphaDent is a valuable open-source resource for dental research, with proven effectiveness in instance segmentation tasks.

Abstract: In this article, we present a new unique dataset for dental research - AlphaDent. This dataset is based on DSLR camera photographs of the teeth of 295 patients and contains over 1200 images. The dataset is labeled for solving the instance segmentation problem and is divided into 9 classes. The article provides a detailed description of the dataset and the labeling format, as well as the details of an experiment on neural network training for the instance segmentation problem using this dataset. The results obtained show high prediction quality. The dataset is published under an open license, and the training/inference code and model weights are also available under open licenses.

[110] Recognizing Actions from Robotic View for Natural Human-Robot Interaction

Ziyi Wang, Peiming Li, Hong Liu, Zhichao Deng, Can Wang, Jun Liu, Junsong Yuan, Mengyuan Liu

Main category: cs.CV

TL;DR: The paper introduces ACTIVE, a dataset for natural human-robot interaction (N-HRI), addressing gaps in existing benchmarks. It also proposes ACTIVE-PC, a method for accurate long-distance action recognition.

DetailsMotivation: Existing benchmarks lack the complexity needed for N-HRI, such as varied distances, modalities, and environments.

Method: ACTIVE dataset includes 30 action categories, 80 participants, and 46,868 videos with RGB and point cloud data. ACTIVE-PC uses Multilevel Neighborhood Sampling and other techniques for long-distance recognition.

Result: ACTIVE-PC effectively recognizes human actions at long distances.

Conclusion: ACTIVE and ACTIVE-PC advance N-HRI research by providing a comprehensive benchmark and robust method for action recognition.

Abstract: Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. This setup is more flexible and practical than conventional human action recognition tasks. However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. ACTIVE comprises 30 composite action categories, 80 participants, and 46,868 annotated video instances, covering both RGB and point cloud modalities. Participants performed various human actions in diverse environments at distances ranging from 3m to 50m, while the camera platform was also mobile, simulating real-world scenarios of robot perception with varying camera heights due to uneven ground. This comprehensive and challenging benchmark aims to advance action and attribute recognition research in N-HRI. Furthermore, we propose ACTIVE-PC, a method that accurately perceives human actions at long distances using Multilevel Neighborhood Sampling, Layered Recognizers, Elastic Ellipse Query, and precise decoupling of kinematic interference from human actions. Experimental results demonstrate the effectiveness of ACTIVE-PC. Our code is available at: https://github.com/wangzy01/ACTIVE-Action-from-Robotic-View.

[111] HRVVS: A High-resolution Video Vasculature Segmentation Network via Hierarchical Autoregressive Residual Priors

Xincheng Yao, Yijun Yang, Kangwei Guo, Ruiqiang Xiao, Haipeng Zhou, Haisu Tao, Jian Yang, Lei Zhu

Main category: cs.CV

TL;DR: The paper introduces a high-quality annotated dataset for hepatic vasculature segmentation in surgical videos and proposes HRVVS, a novel segmentation network that outperforms state-of-the-art methods.

DetailsMotivation: The lack of a suitable dataset and the complexity of hepatic vasculature segmentation in surgical videos motivated the creation of a new dataset and method.

Method: The authors introduce a high-resolution video vasculature segmentation network (HRVVS) with a pretrained VAR model embedded in the encoder and a dynamic memory decoder for multi-view segmentation.

Result: HRVVS significantly outperforms existing methods on surgical video datasets.

Conclusion: The proposed HRVVS and dataset advance hepatic vasculature segmentation, with code and data made publicly available.

Abstract: The segmentation of the hepatic vasculature in surgical videos holds substantial clinical significance in the context of hepatectomy procedures. However, owing to the dearth of an appropriate dataset and the inherently complex task characteristics, few studies have been reported in this domain. To address this issue, we first introduce a high-quality, frame-by-frame annotated hepatic vasculature dataset containing 35 long hepatectomy videos and 11442 high-resolution frames. On this basis, we propose a novel high-resolution video vasculature segmentation network, dubbed HRVVS. We innovatively embed a pretrained visual autoregressive modeling (VAR) model into different layers of the hierarchical encoder as prior information to reduce the information degradation generated during the downsampling process. In addition, we design a dynamic memory decoder on a multi-view segmentation network to minimize the transmission of redundant information while preserving more details between frames. Extensive experiments on surgical video datasets demonstrate that our proposed HRVVS significantly outperforms the state-of-the-art methods. The source code and dataset will be publicly available at https://github.com/scott-yjyang/HRVVS.

[112] RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning

Kiseong Hong, Gyeong-hyeon Kim, Eunwoo Kim

Main category: cs.CV

TL;DR: A prompt-evolving mechanism is proposed to enhance continual learning by adaptively aggregating task-specific prompts, improving performance over existing methods.

DetailsMotivation: Existing prompt-based continual learning methods lack representational diversity due to fixed or entangled prompts, limiting task adaptation.

Method: Introduces a prompt-evolving mechanism to transform and align task-specific prompts, with a learnable probabilistic gate for layer activation.

Result: Achieves average gains of 9.07% and 7.40% in image classification and video action recognition tasks.

Conclusion: The proposed method effectively integrates task-specific knowledge, outperforming existing approaches in continual learning.

Abstract: Prompt-based continual learning provides a rehearsal-free solution by tuning small sets of parameters while keeping pre-trained models frozen. To meet the complex demands of sequential tasks, it is crucial to integrate task-specific knowledge within prompts effectively. However, existing works rely on either fixed learned prompts (i.e., prompts whose representations remain unchanged during new task learning) or on prompts generated from an entangled task-shared space, limiting the representational diversity of the integrated prompt. To address this issue, we propose a novel prompt-evolving mechanism to adaptively aggregate base prompts (i.e., task-specific prompts) into a unified prompt while ensuring diversity. By transforming and aligning base prompts, both previously learned and newly introduced, our approach continuously evolves accumulated knowledge to facilitate learning new tasks. We further introduce a learnable probabilistic gate that adaptively determines which layers to activate during the evolution process. We validate our method on image classification and video action recognition tasks in class-incremental learning, achieving average gains of 9.07% and 7.40% over existing methods across all scenarios.
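
The learnable probabilistic gate can be sketched as a relaxed Bernoulli over layers, sampled with reparameterization during training and thresholded at evaluation. The abstract does not specify the gate's parameterization, so the Gumbel-sigmoid form below is an assumption.

```python
import torch
import torch.nn as nn

class LayerGate(nn.Module):
    """Learnable probabilistic gate over prompt layers (a sketch; the paper's
    exact gate parameterization is not specified, so a relaxed Bernoulli
    via Gumbel-sigmoid is assumed)."""

    def __init__(self, num_layers, temperature=1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature

    def forward(self):
        if self.training:  # reparameterized Bernoulli sample (Gumbel-sigmoid)
            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)   # logistic noise
            return torch.sigmoid((self.logits + noise) / self.temperature)
        return (torch.sigmoid(self.logits) > 0.5).float()  # hard gate at eval
```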

[113] Subtyping Breast Lesions via Generative Augmentation based Long-tailed Recognition in Ultrasound

Shijing Chen, Xinrui Zhou, Yuhao Wang, Yuhao Huang, Ao Chang, Dong Ni, Ruobing Huang

Main category: cs.CV

TL;DR: A dual-phase framework for long-tailed breast lesion classification uses generative augmentation and reinforcement learning to balance data distribution and improve recognition accuracy.

DetailsMotivation: Addressing the skewed long-tailed distribution of breast lesion subtypes in ultrasound imaging to enhance automated recognition for personalized treatment.

Method: Proposes a dual-phase framework with reinforcement learning-driven adaptive sampling and a class-controllable synthetic network leveraging anatomical priors.

Result: Achieves promising performance on long-tailed and imbalanced breast ultrasound datasets compared to state-of-the-art methods.

Conclusion: The framework effectively mitigates distributional bias and maintains discriminative capability, offering a robust solution for long-tailed classification in medical imaging.

Abstract: Accurate identification of breast lesion subtypes can facilitate personalized treatment and interventions. Ultrasound (US), as a safe and accessible imaging modality, is extensively employed in breast abnormality screening and diagnosis. However, the incidence of different subtypes exhibits a skewed long-tailed distribution, posing significant challenges for automated recognition. Generative augmentation provides a promising solution to rectify data distribution. Inspired by this, we propose a dual-phase framework for long-tailed classification that mitigates distributional bias through high-fidelity data synthesis while avoiding overuse that corrupts holistic performance. The framework incorporates a reinforcement learning-driven adaptive sampler, dynamically calibrating synthetic-real data ratios by training a strategic multi-agent to compensate for scarcities of real data while ensuring stable discriminative capability. Furthermore, our class-controllable synthetic network integrates a sketch-grounded perception branch that harnesses anatomical priors to maintain distinctive class features while enabling annotation-free inference. Extensive experiments on an in-house long-tailed dataset and a public imbalanced breast US dataset demonstrate that our method achieves promising performance compared to state-of-the-art approaches. More synthetic images can be found at https://github.com/Stinalalala/Breast-LT-GenAug.

[114] VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning

Ruifeng Yuan, Chenghao Xiao, Sicong Leng, Jianyu Wang, Long Li, Weiwen Xu, Hou Pong Chan, Deli Zhao, Tingyang Xu, Zhongyu Wei, Hao Zhang, Yu Rong

Main category: cs.CV

TL;DR: VL-Cogito, a multimodal reasoning model, uses Progressive Curriculum Reinforcement Learning (PCuRL) to improve performance across diverse tasks by dynamically adjusting training difficulty and reasoning path length.

DetailsMotivation: Existing models struggle with unstable performance in multimodal tasks due to their complexity and diversity.

Method: VL-Cogito employs PCuRL with two innovations: an online difficulty soft weighting mechanism and a dynamic length reward mechanism.

Result: VL-Cogito matches or outperforms existing models in multimodal benchmarks.

Conclusion: The PCuRL framework effectively enhances multimodal reasoning, balancing efficiency and correctness.

Abstract: Reinforcement learning has proven its effectiveness in enhancing the reasoning capabilities of large language models. Recent research efforts have progressively extended this paradigm to multimodal reasoning tasks. Due to the inherent complexity and diversity of multimodal tasks, especially in semantic content and problem formulations, existing models often exhibit unstable performance across various domains and difficulty levels. To address these limitations, we propose VL-Cogito, an advanced multimodal reasoning model trained via a novel multi-stage Progressive Curriculum Reinforcement Learning (PCuRL) framework. PCuRL systematically guides the model through tasks of gradually increasing difficulty, substantially improving its reasoning abilities across diverse multimodal contexts. The framework introduces two key innovations: (1) an online difficulty soft weighting mechanism, dynamically adjusting training difficulty across successive RL training stages; and (2) a dynamic length reward mechanism, which encourages the model to adaptively regulate its reasoning path length according to task complexity, thus balancing reasoning efficiency with correctness. Experimental evaluations demonstrate that VL-Cogito consistently matches or surpasses existing reasoning-oriented models across mainstream multimodal benchmarks spanning mathematics, science, logic, and general understanding, validating the effectiveness of our approach.
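
The dynamic length reward can be illustrated with a small shaping function: correctness earns the bulk of the reward, and the remainder scales with how close the reasoning length is to a difficulty-dependent target. PCuRL's exact shaping is not given in the abstract, so this multiplicative form is an assumption.

```python
def dynamic_length_reward(answer_correct, n_reasoning_tokens, target_len):
    """Dynamic length reward (sketch): pay the full reward only when the
    reasoning length matches a difficulty-dependent target; the precise
    shaping used in PCuRL is an assumption here."""
    if not answer_correct:
        return 0.0
    closeness = (min(n_reasoning_tokens, target_len)
                 / max(n_reasoning_tokens, target_len, 1))
    return 0.5 + 0.5 * closeness  # in [0.5, 1.0]: correctness dominates
```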

[115] COOkeD: Ensemble-based OOD detection in the era of zero-shot CLIP

Galadrielle Humblot-Renaux, Gianni Franchi, Sergio Escalera, Thomas B. Moeslund

Main category: cs.CV

TL;DR: COOkeD introduces a heterogeneous ensemble method combining closed-world, zero-shot CLIP, and linear probe classifiers for superior OOD detection, outperforming single-classifier approaches.

DetailsMotivation: OOD detection is constrained by single-classifier limitations; COOkeD leverages diverse classifiers to improve robustness and performance.

Method: Combines a closed-world classifier, zero-shot CLIP classifier, and linear probe classifier trained on CLIP features.

Result: Achieves state-of-the-art OOD detection performance on CIFAR100 and ImageNet, with robustness in challenging settings.

Conclusion: COOkeD’s modular, post-hoc ensemble approach significantly enhances OOD detection without heavy overhead.

Abstract: Out-of-distribution (OOD) detection is an important building block in trustworthy image recognition systems as unknown classes may arise at test-time. OOD detection methods typically revolve around a single classifier, leading to a split in the research field between the classical supervised setting (e.g. ResNet18 classifier trained on CIFAR100) vs. the zero-shot setting (class names fed as prompts to CLIP). In both cases, an overarching challenge is that the OOD detection performance is implicitly constrained by the classifier’s capabilities on in-distribution (ID) data. In this work, we show that given a little open-mindedness from both ends, remarkable OOD detection can be achieved by instead creating a heterogeneous ensemble - COOkeD combines the predictions of a closed-world classifier trained end-to-end on a specific dataset, a zero-shot CLIP classifier, and a linear probe classifier trained on CLIP image features. While bulky at first sight, this approach is modular, post-hoc and leverages the availability of pre-trained VLMs, thus introducing little overhead compared to training a single standard classifier. We evaluate COOkeD on popular CIFAR100 and ImageNet benchmarks, but also consider more challenging, realistic settings ranging from training-time label noise, to test-time covariate shift, to zero-shot shift which has been previously overlooked. Despite its simplicity, COOkeD achieves state-of-the-art performance and greater robustness compared to both classical and CLIP-based OOD detection methods. Code is available at https://github.com/glhr/COOkeD.
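
Since the ensemble is post-hoc, the scoring step is tiny. Below is a minimal sketch that averages the three classifiers' class posteriors and uses maximum softmax probability as the ID score; the paper's precise combination rule may differ.

```python
import torch

def cooked_style_score(p_closed, p_zeroshot, p_probe):
    """Heterogeneous-ensemble OOD score (sketch): average the three
    classifiers' posteriors, then use maximum softmax probability.

    Each input: (B, C) softmax probabilities over the ID classes.
    Returns a score where larger = more likely OOD.
    """
    p_ens = (p_closed + p_zeroshot + p_probe) / 3.0
    msp = p_ens.max(dim=1).values   # high MSP -> confidently in-distribution
    return -msp
```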

[116] Robust Deepfake Detection for Electronic Know Your Customer Systems Using Registered Images

Takuma Amada, Kazuya Kakizaki, Taiki Miyagawa, Akinori F. Ebihara, Kaede Shiohara, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: A deepfake detection algorithm for eKYC systems detects face swapping and reenactment by analyzing temporal inconsistencies and identity discrepancies, improving accuracy with a larger dataset-trained feature extractor.

DetailsMotivation: To enhance eKYC system reliability against deepfake attacks by detecting face swapping and reenactment robustly, even with degraded images.

Method: Uses temporal inconsistencies in identity vectors, compares input video with a registered genuine image, and employs a robust face feature extractor.

Result: Accurately detects deepfakes and is robust against unseen image degradation.

Conclusion: The proposed method effectively safeguards eKYC systems from deepfake threats.

Abstract: In this paper, we present a deepfake detection algorithm specifically designed for electronic Know Your Customer (eKYC) systems. To ensure the reliability of eKYC systems against deepfake attacks, it is essential to develop a robust deepfake detector capable of identifying both face swapping and face reenactment, while also being robust to image degradation. We address these challenges through three key contributions: (1) Our approach evaluates the video’s authenticity by detecting temporal inconsistencies in identity vectors extracted by face recognition models, leading to comprehensive detection of both face swapping and face reenactment. (2) In addition to processing video input, the algorithm utilizes a registered image (assumed to be genuine) to calculate identity discrepancies between the input video and the registered image, significantly improving detection accuracy. (3) We find that employing a face feature extractor trained on a larger dataset enhances both detection performance and robustness against image degradation. Our experimental results show that our proposed method accurately detects both face swapping and face reenactment and is robust against various forms of unseen image degradation. Our source code is publicly available at https://github.com/TaikiMiyagawa/DeepfakeDetection4eKYC.
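
Contributions (1) and (2) combine naturally into a single score, sketched below: identity vectors that jitter over time suggest manipulation, and a large gap to the registered image suggests the wrong identity. The equal weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def ekyc_deepfake_score(frame_embs, registered_emb):
    """Deepfake score sketch for eKYC: temporal identity inconsistency plus
    discrepancy to the registered genuine image (equal weights assumed).

    frame_embs:     (T, D) identity vectors from a face recognition model
    registered_emb: (D,)   identity vector of the registered image
    """
    frame_embs = F.normalize(frame_embs, dim=-1)
    registered_emb = F.normalize(registered_emb, dim=-1)
    sims = frame_embs @ registered_emb          # (T,) per-frame identity match
    temporal_inconsistency = sims.std()         # swaps/reenactment jitter identity
    registered_discrepancy = 1.0 - sims.mean()  # distance to the claimed identity
    return temporal_inconsistency + registered_discrepancy
```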

[117] ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning

Xiefan Guo, Miaomiao Cui, Liefeng Bo, Di Huang

Main category: cs.CV

TL;DR: ShortFT improves diffusion model alignment with reward functions by using a shorter denoising chain, enhancing efficiency and performance.

DetailsMotivation: Existing backpropagation-based methods face computational costs and gradient explosion risks due to lengthy denoising chains, leading to suboptimal results.

Method: ShortFT employs a trajectory-preserving few-step diffusion model to create a shorter denoising chain, optimizing fine-tuning efficiency.

Result: The method significantly improves alignment performance and outperforms state-of-the-art alternatives across various reward functions.

Conclusion: ShortFT offers a practical and effective solution for fine-tuning diffusion models, addressing computational and gradient challenges.

Abstract: Backpropagation-based approaches aim to align diffusion models with reward functions through end-to-end backpropagation of the reward gradient within the denoising chain, offering a promising perspective. However, due to the computational costs and the risk of gradient explosion associated with the lengthy denoising chain, existing approaches struggle to achieve complete gradient backpropagation, leading to suboptimal results. In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. More specifically, we employ the recently researched trajectory-preserving few-step diffusion model, which enables a shortcut over the original denoising chain, and construct a shortcut-based denoising chain of shorter length. The optimization on this chain notably enhances the efficiency and effectiveness of fine-tuning the foundational model. Our method has been rigorously tested and can be effectively applied to various reward functions, significantly improving alignment performance and surpassing state-of-the-art alternatives.
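
In code form, the point of the shortcut is that reward backpropagation through the whole chain becomes tractable. The sketch below uses hypothetical `timesteps`/`denoise` helpers standing in for a trajectory-preserving few-step model; the latent shape and step count are also assumptions.

```python
import torch

def shortft_update(few_step_model, reward_fn, prompt_emb, optimizer, n_steps=4):
    """One ShortFT-style update (sketch): run the SHORT denoising chain with
    gradients enabled and backpropagate the reward end-to-end.
    `few_step_model.timesteps` / `.denoise` are hypothetical helpers."""
    x = torch.randn(1, 4, 64, 64, device=prompt_emb.device)   # initial latent
    for t in few_step_model.timesteps(n_steps):    # e.g. 4 steps, not ~1000
        x = few_step_model.denoise(x, t, prompt_emb)  # stays in autograd graph
    loss = -reward_fn(x)            # ascend the reward
    optimizer.zero_grad()
    loss.backward()                 # full backprop is cheap on the short chain
    optimizer.step()
    return loss.item()
```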

[118] Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model

Daehee Park, Monu Surana, Pranav Desai, Ashish Mehta, Reuben MV John, Kuk-Jin Yoon

Main category: cs.CV

TL;DR: GALTraj introduces generative active learning to improve trajectory prediction by augmenting rare tail samples during training, enhancing performance on both tail and head samples.

DetailsMotivation: Address the challenge of rarely observed long-tail scenarios in trajectory prediction without modifying model architectures.

Method: Uses generative active learning (GALTraj) to identify and augment rare tail samples with a controllable diffusion model, ensuring diversity and realism.

Result: Significantly improves performance on tail samples and enhances accuracy on head samples across multiple datasets (WOMD, Argoverse2).

Conclusion: GALTraj successfully demonstrates the benefits of simulator-driven augmentation for long-tail learning in trajectory prediction.

Abstract: While data-driven trajectory prediction has enhanced the reliability of autonomous driving systems, it still struggles with rarely observed long-tail scenarios. Prior works addressed this by modifying model architectures, such as using hypernetworks. In contrast, we propose refining the training process to unlock each model’s potential without altering its structure. We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. It actively identifies rare tail samples where the model fails and augments these samples with a controllable diffusion model during training. In our framework, generating scenarios that are diverse, realistic, and preserve tail-case characteristics is paramount. Accordingly, we design a tail-aware generation method that applies tailored diffusion guidance to generate trajectories that both capture rare behaviors and respect traffic rules. Unlike prior simulation methods focused solely on scenario diversity, GALTraj is the first to show how simulator-driven augmentation benefits long-tail learning in trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.

[119] Bridging the Gap in Missing Modalities: Leveraging Knowledge Distillation and Style Matching for Brain Tumor Segmentation

Shenghao Zhu, Yifei Chen, Weihong Chen, Yuanhan Wang, Chang Liu, Shuo Jiang, Feiwei Qin, Changmiao Wang

Main category: cs.CV

TL;DR: MST-KDNet improves brain tumor segmentation, especially with missing modalities, using multi-scale transformers, dual-mode logit distillation, and global style matching.

DetailsMotivation: Addressing unresolved challenges in tumor boundary segmentation and feature transfer with missing imaging modalities.

Method: Multi-Scale Transformer Knowledge Distillation, Dual-Mode Logit Distillation, and Global Style Matching Module.

Result: Outperforms leading methods on BraTS and FeTS 2024 datasets in Dice and HD95 scores, especially with modality loss.

Conclusion: MST-KDNet is robust and generalizable, suitable for real-world clinical use.

Abstract: Accurate and reliable brain tumor segmentation, particularly when dealing with missing modalities, remains a critical challenge in medical image analysis. Previous studies have not fully resolved the challenges of tumor boundary segmentation insensitivity and feature transfer in the absence of key imaging modalities. In this study, we introduce MST-KDNet, aimed at addressing these critical issues. Our model features Multi-Scale Transformer Knowledge Distillation to effectively capture attention weights at various resolutions, Dual-Mode Logit Distillation to improve the transfer of knowledge, and a Global Style Matching Module that integrates feature matching with adversarial learning. Comprehensive experiments conducted on the BraTS and FeTS 2024 datasets demonstrate that MST-KDNet surpasses current leading methods in both Dice and HD95 scores, particularly in conditions with substantial modality loss. Our approach shows exceptional robustness and generalization potential, making it a promising candidate for real-world clinical applications. Our source code is available at https://github.com/Quanato607/MST-KDNet.
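
For reference, the logit-distillation building block that MST-KDNet extends is the standard softened-KL loss below; the paper's Dual-Mode variant adds a second matching mode not reproduced here.

```python
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Standard soft-logit knowledge distillation (a sketch; MST-KDNet's
    dual-mode variant is not reproduced)."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    # KL between softened distributions; T^2 keeps gradient magnitudes comparable
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```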

[120] LOTS of Fashion! Multi-Conditioning for Image Generation via Sketch-Text Pairing

Federico Girella, Davide Talon, Ziyue Liu, Zanxi Ruan, Yiming Wang, Marco Cristani

Main category: cs.CV

TL;DR: LOTS introduces a method for generating fashion images using localized sketches and text, achieving state-of-the-art results.

DetailsMotivation: Fashion design combines visual and textual elements, but existing methods lack fine-grained control. LOTS addresses this by integrating localized sketch-text pairs for detailed customization.

Method: LOTS uses a Modularized Pair-Centric representation for sketches and text, followed by Diffusion Pair Guidance to merge local and global features in a diffusion model.

Result: LOTS outperforms existing methods in both global and localized metrics, validated by quantitative results and human evaluation.

Conclusion: LOTS enables unprecedented design customization in fashion image generation, demonstrated by its performance and the new Sketchy dataset.

Abstract: Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.

[121] SpectraSentinel: LightWeight Dual-Stream Real-Time Drone Detection, Tracking and Payload Identification

Shahriar Kabir, Istiak Ahmmed Rifti, H. M. Shadman Tabib, Mushfiqur Rahman, Sadatul Islam Sadi, Hasnaen Adil, Ahmed Mahir Sultan Rumi, Ch Md Rakin Haider

Main category: cs.CV

TL;DR: A dual-stream drone monitoring framework using YOLOv11n on RGB and IR data streams achieves high accuracy and real-time performance for drone detection, tracking, and payload identification.

DetailsMotivation: Addressing security concerns from drone proliferation by improving real-time surveillance for small aerial objects in diverse conditions.

Method: Independent YOLOv11n detectors on RGB and IR streams, with domain-specific preprocessing, augmentation, and hyperparameter tuning.

Result: Lightweight models achieve high accuracy in distinguishing drones from birds and classifying payloads, maintaining real-time performance.

Conclusion: The dual-modality design and specialized training enable efficient and accurate drone surveillance across RGB and IR channels.

Abstract: The proliferation of drones in civilian airspace has raised urgent security concerns, necessitating robust real-time surveillance systems. In response to the 2025 VIP Cup challenge tasks - drone detection, tracking, and payload identification - we propose a dual-stream drone monitoring framework. Our approach deploys independent You Only Look Once v11-nano (YOLOv11n) object detectors on parallel infrared (thermal) and visible (RGB) data streams, deliberately avoiding early fusion. This separation allows each model to be specifically optimized for the distinct characteristics of its input modality, addressing the unique challenges posed by small aerial objects in diverse environmental conditions. We customize data preprocessing and augmentation strategies per domain - such as limiting color jitter for IR imagery - and fine-tune training hyperparameters to enhance detection performance under conditions of heavy noise, low light, and motion blur. The resulting lightweight YOLOv11n models demonstrate high accuracy in distinguishing drones from birds and in classifying payload types, all while maintaining real-time performance. This report details the rationale for a dual-modality design, the specialized training pipelines, and the architectural optimizations that collectively enable efficient and accurate drone surveillance across RGB and IR channels.
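
The per-modality training setup is straightforward with the Ultralytics API, as sketched below; the dataset YAMLs and hyperparameter values are placeholders, not the authors' exact settings. The IR model caps color jitter (thermal imagery has no meaningful hue/saturation), while the RGB model keeps it.

```python
from ultralytics import YOLO

# IR stream: nearly no color jitter, since thermal frames carry no color cues.
ir_model = YOLO("yolo11n.pt")
ir_model.train(data="drones_ir.yaml", epochs=100, imgsz=640,
               hsv_h=0.0, hsv_s=0.0, hsv_v=0.2)

# RGB stream: default-style color augmentation is kept.
rgb_model = YOLO("yolo11n.pt")
rgb_model.train(data="drones_rgb.yaml", epochs=100, imgsz=640,
                hsv_h=0.015, hsv_s=0.7, hsv_v=0.4)
```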

[122] Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation

Hongbin Lin, Yifan Jiang, Juangui Xu, Jesse Jiaxi Xu, Yi Lu, Zhengyu Hu, Ying-Cong Chen, Hao Wang

Main category: cs.CV

TL;DR: A graph-guided data augmentation framework for 3D point cloud segmentation improves scene synthesis by incorporating global structural dependencies and dual-level constraints.

DetailsMotivation: Existing augmentation methods for 3D point cloud segmentation lack global structural considerations, limiting their effectiveness.

Method: Proposes a framework using guiding graphs derived from real-world data, with local (geometric/semantic) and global (topological) constraints for scene generation.

Result: Generates diverse, high-quality augmented scenes, improving segmentation performance on indoor and outdoor datasets.

Conclusion: The framework enhances 3D segmentation by addressing global structural dependencies, outperforming traditional augmentation methods.

Abstract: 3D point cloud segmentation aims to assign semantic labels to individual points in a scene for fine-grained spatial understanding. Existing methods typically adopt data augmentation to alleviate the burden of large-scale annotation. However, most augmentation strategies only focus on local transformations or semantic recomposition, lacking the consideration of global structural dependencies within scenes. To address this limitation, we propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis. Our method learns object relationship statistics from real-world data to construct guiding graphs for scene generation. Local-level constraints enforce geometric plausibility and semantic consistency between objects, while global-level constraints maintain the topological structure of the scene by aligning the generated layout with the guiding graph. Extensive experiments on indoor and outdoor datasets demonstrate that our framework generates diverse and high-quality augmented scenes, leading to consistent improvements in point cloud segmentation performance across various models.

[123] MergeSAM: Unsupervised change detection of remote sensing images based on the Segment Anything Model

Meiqi Hu, Lingzhi Lu, Chengxi Han, Xiaoping Liu

Main category: cs.CV

TL;DR: MergeSAM introduces an unsupervised change detection method for remote sensing imagery using SAM, with MaskMatching and MaskSplitting strategies to handle complex changes.

DetailsMotivation: To enhance unsupervised change detection by leveraging SAM's segmentation capabilities for complex real-world scenarios.

Method: Uses SAM for multitemporal mask construction, with MaskMatching and MaskSplitting to address object splitting and merging.

Result: Effectively captures intricate changes in high-resolution remote sensing imagery.

Conclusion: MergeSAM improves practical applicability of change detection by integrating SAM’s segmentation with novel strategies.

Abstract: Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning-driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high-resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real-world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM’s object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.
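
A MaskMatching-style pairing step can be sketched as IoU matching between the two dates' SAM masks: masks that find no partner are candidate changes. The IoU threshold is an assumption, and MaskSplitting (for split/merge cases) is not reproduced.

```python
import numpy as np

def mask_matching(masks_t1, masks_t2, iou_thresh=0.5):
    """Pair SAM masks across two dates by IoU (sketch of MaskMatching);
    each mask is an (H, W) boolean array. Unmatched masks = candidate changes."""
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    pairs, used = [], set()
    for i, m1 in enumerate(masks_t1):
        ious = [iou(m1, m2) for m2 in masks_t2]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thresh and j not in used:
            pairs.append((i, j))
            used.add(j)
    unmatched_t1 = [i for i in range(len(masks_t1)) if i not in {p[0] for p in pairs}]
    unmatched_t2 = [j for j in range(len(masks_t2)) if j not in used]
    return pairs, unmatched_t1, unmatched_t2
```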

[124] Hydra-Bench: A Benchmark for Multi-Modal Leaf Wetness Sensing

Yimeng Liu, Maolin Gan, Yidong Ren, Gen Li, Jingkai Lin, Younsuk Dong, Zhichao Cao

Main category: cs.CV

TL;DR: A new multi-modal dataset for leaf wetness detection is introduced, combining mmWave, SAR, and RGB data, with benchmarks using the Hydra model.

DetailsMotivation: Existing leaf wetness sensing systems lack robustness and accuracy in real-world conditions, necessitating a better dataset for machine learning evaluation.

Method: A multi-modal dataset with synchronized mmWave, SAR, and RGB images was collected over six months from diverse plant species in controlled and outdoor environments. Benchmarks were performed using the Hydra model.

Result: The dataset provides benchmarks for leaf wetness detection, comparing single-modality baselines and fusion strategies, and evaluates performance under varying scan distances.

Conclusion: The dataset advances leaf wetness detection and serves as a benchmark for SAR imaging algorithm optimization under diverse conditions.

Abstract: Leaf wetness detection is a crucial task in agricultural monitoring, as it directly impacts the prediction and protection of plant diseases. However, existing sensing systems suffer from limitations in robustness, accuracy, and environmental resilience when applied to natural leaves under dynamic real-world conditions. To address these challenges, we introduce a new multi-modal dataset specifically designed for evaluating and advancing machine learning algorithms in leaf wetness detection. Our dataset comprises synchronized mmWave raw data, Synthetic Aperture Radar (SAR) images, and RGB images collected over six months from five diverse plant species in both controlled and outdoor field environments. We provide detailed benchmarks using the Hydra model, including comparisons against single modality baselines and multiple fusion strategies, as well as performance under varying scan distances. Additionally, our dataset can serve as a benchmark for future SAR imaging algorithm optimization, enabling a systematic evaluation of detection accuracy under diverse conditions.

[125] Zero-Shot Image Anomaly Detection Using Generative Foundation Models

Lemar Abdi, Amaan Valiuddin, Francisco Caetano, Christiaan Viviers, Fons van der Sommen

Main category: cs.CV

TL;DR: The paper proposes using diffusion models as perceptual templates for out-of-distribution (OOD) detection, leveraging denoising trajectories and Stein score errors for anomaly identification without dataset-specific retraining.

DetailsMotivation: Deploying safe vision systems in open-world environments requires robust OOD detection. The study explores diffusion models for this purpose, moving beyond their traditional generative role.

Method: Uses Denoising Diffusion Models (DDMs) to analyze Stein score errors, amplified by SSIM, for detecting anomalies. Trains a single model on CelebA as a base distribution.

Result: Achieves near-perfect performance on some benchmarks, outperforming methods using datasets like ImageNet, with room for improvement on others.

Conclusion: Generative foundation models, particularly diffusion models, show strong potential for anomaly detection, offering a scalable and effective approach.

Abstract: Detecting out-of-distribution (OOD) inputs is pivotal for deploying safe vision systems in open-world environments. We revisit diffusion models, not as generators, but as universal perceptual templates for OOD detection. This research explores the use of score-based generative models as foundational tools for semantic anomaly detection across unseen datasets. Specifically, we leverage the denoising trajectories of Denoising Diffusion Models (DDMs) as a rich source of texture and semantic information. By analyzing Stein score errors, amplified through the Structural Similarity Index Metric (SSIM), we introduce a novel method for identifying anomalous samples without requiring re-training on each target dataset. Our approach improves over state-of-the-art and relies on training a single model on one dataset – CelebA – which we find to be an effective base distribution, even outperforming more commonly used datasets like ImageNet in several settings. Experimental results show near-perfect performance on some benchmarks, with notable headroom on others, highlighting both the strength and future potential of generative foundation models in anomaly detection.
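
A single-step version of the scoring idea is sketched below: noise the input to one timestep, denoise with a DDM trained on the base distribution, and amplify the reconstruction (score) error with SSIM. The paper analyzes whole denoising trajectories; collapsing this to one step, and the grayscale assumption, are simplifications.

```python
import torch
from skimage.metrics import structural_similarity as ssim

@torch.no_grad()
def ood_score(ddm, x, alphas_cumprod, t=100):
    """SSIM-amplified score-error anomaly score (single-step sketch of the
    trajectory-based method; t is an assumed timestep, input assumed grayscale)."""
    a = alphas_cumprod[t]
    x_t = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)
    eps = ddm(x_t, t)                                 # predicted noise (score proxy)
    x0_hat = (x_t - (1 - a).sqrt() * eps) / a.sqrt()  # one-step denoised estimate
    img, rec = x.squeeze().cpu().numpy(), x0_hat.squeeze().cpu().numpy()
    # Low structural similarity between input and reconstruction -> likely OOD.
    return 1.0 - ssim(img, rec, data_range=img.max() - img.min())
```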

[126] Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints

Thuy Tran, Ruochen Chen, Shaifali Parashar

Main category: cs.CV

TL;DR: The paper proposes an unsupervised Shape-from-Template (SfT) method that reconstructs 3D shapes from images using color features, gradients, silhouettes, and mesh inextensibility, achieving 400x faster performance and better handling of occlusions and details.

DetailsMotivation: Traditional SfT methods rely on point correspondences and degrade with occlusions, while modern methods require large supervised datasets. This work aims to overcome these limitations with an unsupervised approach.

Method: The method uses image observations (color, gradients, silhouettes) and a mesh inextensibility constraint to reconstruct 3D shapes without correspondences or supervision.

Result: The approach is 400x faster than the best unsupervised SfT and outperforms existing methods in handling occlusions and fine details.

Conclusion: The proposed unsupervised SfT method is efficient, robust to occlusions, and superior in detail reconstruction, with publicly available code.

Abstract: Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades under severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform the 3D template to match input images. In this paper, we propose an unsupervised SfT which uses only image observations: color features, gradients and silhouettes, along with a mesh inextensibility constraint, to reconstruct at a 400× faster pace than the best-performing unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code is available at https://github.com/dvttran/nsft.
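
The mesh inextensibility constraint has a simple loss form: deformed edge lengths should match the template's rest lengths. A minimal sketch (used alongside the photometric terms, whose weighting is not specified here):

```python
import torch

def inextensibility_loss(vertices, edges, rest_lengths):
    """Mesh inextensibility as a loss (sketch): penalize deviation of deformed
    edge lengths from the undeformed template's rest lengths.

    vertices:     (V, 3) current vertex positions
    edges:        (E, 2) vertex-index pairs of template edges
    rest_lengths: (E,)   edge lengths on the undeformed template
    """
    v0, v1 = vertices[edges[:, 0]], vertices[edges[:, 1]]
    lengths = (v0 - v1).norm(dim=-1)
    return ((lengths - rest_lengths) ** 2).mean()
```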

[127] A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

Hang Su, Yunlong Feng, Daniel Gehrig, Panfeng Jiang, Ling Gao, Xavier Lagorce, Laurent Kneip

Main category: cs.CV

TL;DR: A unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, applicable to various camera types.

DetailsMotivation: Existing algorithms like the 5-point or 8-point methods are limited to synchronized views, failing for asynchronous data from rolling shutter or event cameras.

Method: Formulates the problem using first-order dynamics and a constant velocity motion model, deriving a linear point incidence relation for efficient recovery of velocity and 3D points.

Result: Validated on simulated and real-world data, showing consistent improvement over recent approaches across all modalities.

Conclusion: The method enables efficient structure and motion estimation from asynchronous data, with potential applications in diverse sensing modalities.

Abstract: Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the 5-point and 8-point algorithms. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views, each representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data. Code can be found at https://github.com/suhang99/AsyncTrack-Motion-Solver.
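
The constant-velocity model admits a compact linear reading. As a hedged illustration reconstructed from the abstract alone (the paper's exact incidence relation may differ), consider a 3D point $\mathbf{X}$ observed along a bearing vector $\mathbf{f}_i$ at timestamp $t_i$ while the camera translates with constant velocity $\mathbf{v}$. The bearing must stay parallel to the point expressed in the moving frame, giving

\[ \mathbf{f}_i \times \left( \mathbf{X} - t_i\,\mathbf{v} \right) = \mathbf{0}, \]

which is linear in the six unknowns $(\mathbf{X}, \mathbf{v})$. Stacking such constraints over $N$ asynchronously timestamped observations yields a linear system that recovers structure and velocity jointly, which is what lets rolling-shutter and event-camera correspondences be handled in one framework.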

[128] Social-Pose: Enhancing Trajectory Prediction with Human Body Pose

Yang Gao, Saeed Saadatnejad, Alexandre Alahi

Main category: cs.CV

TL;DR: The paper introduces ‘Social-pose’, an attention-based pose encoder for human trajectory prediction, improving accuracy by leveraging body poses and social relations. It outperforms existing models across multiple datasets.

DetailsMotivation: Existing trajectory prediction models often miss visual cues from human body poses, which are crucial for accurate and safe autonomous driving.

Method: Proposes ‘Social-pose’, an attention-based pose encoder that captures human poses and social relations, integrable into various architectures (LSTM, GAN, MLP, Transformer).

Result: Shows improvements over state-of-the-art models on synthetic and real datasets (Joint Track Auto, Human3.6M, Pedestrians and Cyclists in Road Traffic, JRDB). Also explores 2D vs. 3D poses and noise effects.

Conclusion: Using body poses enhances trajectory prediction, with ‘Social-pose’ proving effective across diverse scenarios, including robot navigation.

Abstract: Accurate human trajectory prediction is one of the most crucial tasks for autonomous driving, ensuring its safety. Yet, existing models often fail to fully leverage the visual cues that humans subconsciously communicate when navigating the space. In this work, we study the benefits of predicting human trajectories using human body poses instead of solely their Cartesian space locations in time. We propose 'Social-pose', an attention-based pose encoder that effectively captures the poses of all humans in a scene and their social relations. Our method can be integrated into various trajectory prediction architectures. We have conducted extensive experiments on state-of-the-art models (based on LSTM, GAN, MLP, and Transformer), and showed improvements over all of them on synthetic (Joint Track Auto) and real (Human3.6M, Pedestrians and Cyclists in Road Traffic, and JRDB) datasets. We also explored the advantages of using 2D versus 3D poses, as well as the effect of noisy poses and the application of our pose-based predictor in robot navigation scenarios.
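
An attention-based pose encoder of this kind has a natural PyTorch skeleton. The sketch below is a hypothetical illustration of the general pattern (embed each agent's body pose, then let agents attend to one another); dimensions, joint count, and layer choices are assumptions, not Social-pose's actual architecture.

```python
import torch
import torch.nn as nn

class PoseSocialEncoder(nn.Module):
    """Illustrative social pose encoder: per-agent pose embeddings refined by
    attention across agents in the scene."""

    def __init__(self, n_joints: int = 17, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.pose_embed = nn.Linear(n_joints * 3, d_model)  # flatten 3D joints
        self.social_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, poses: torch.Tensor) -> torch.Tensor:
        # poses: (batch, n_agents, n_joints, 3)
        b, n, j, c = poses.shape
        tokens = self.pose_embed(poses.reshape(b, n, j * c))  # (b, n, d_model)
        out, _ = self.social_attn(tokens, tokens, tokens)     # agents attend to agents
        return out  # per-agent social-pose features for a downstream predictor
```

The output features would then be concatenated with (or replace) the usual Cartesian-location inputs of an LSTM, GAN, MLP, or Transformer trajectory head.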

[129] HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training

Xuecheng Wu, Danlei Huang, Heli Sun, Xinyi Yin, Yifan Wang, Hao Wang, Jia Zhang, Fei Wang, Peihao Guo, Suyu Xing, Junxiao Xue, Liang He

Main category: cs.CV

TL;DR: HOLA is a novel framework for video-level deepfake detection, leveraging large-scale pre-training and innovative modules like cross-modal learning and hierarchical modeling, achieving top performance in the 2025 challenge.

DetailsMotivation: Current deepfake detection techniques struggle with advanced generative AI, necessitating more robust solutions.

Method: HOLA uses audio-visual self-supervised pre-training, iterative cross-modal learning, hierarchical contextual modeling, and a pyramid refiner, enhanced by pseudo-supervised signal injection.

Result: HOLA outperforms competitors, ranking 1st with a 0.0476 AUC lead on the TestA set.

Conclusion: HOLA’s innovative design and pre-training approach significantly advance deepfake detection, setting a new benchmark.

Abstract: Advances in Generative AI have made video-level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training for multimodal video-level deepfake detection, leveraging our self-built dataset of 1.81M samples and thereby arriving at a unified two-stage framework. To be specific, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under the local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. Moreover, we propose a pseudo-supervised signal injection strategy to further boost model performance. Extensive experiments across expert models and MLLMs impressively demonstrate the effectiveness of our proposed HOLA. We also conduct a series of ablation studies to explore the crucial design factors of our introduced components. Remarkably, our HOLA ranks 1st, outperforming the second by 0.0476 AUC on the TestA set.

[130] Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

Main category: cs.CV

TL;DR: A survey on feature matching in computer vision, comparing traditional handcrafted methods with modern deep learning approaches across various modalities.

DetailsMotivation: Feature matching is critical for tasks like image retrieval, 3D reconstruction, and SLAM, but traditional methods struggle with modality gaps.

Method: Reviews handcrafted methods (e.g., Harris, SIFT, ORB) and deep learning approaches (e.g., SuperPoint, LoFTR) for RGB, depth, 3D point clouds, LiDAR, medical images, and vision-language tasks.

Result: Deep learning methods outperform traditional ones in robustness and adaptability, especially for cross-modal tasks.

Conclusion: Feature matching has evolved significantly with deep learning, enabling better handling of diverse modalities and applications.

Abstract: Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.
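
The survey's classical baseline is easy to reproduce with stock OpenCV. The snippet below is standard library usage (not code from the survey): ORB keypoints matched with brute-force Hamming distance and cross-checking. The image file names are placeholders.

```python
import cv2

# Classical intra-modality baseline: ORB detector/descriptor + brute-force
# Hamming matching with cross-check filtering.
img1 = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} cross-checked matches")
```

Detector-free deep methods such as SuperPoint or LoFTR replace the detect-describe-match stages above with learned networks, which is exactly the robustness gap the survey documents across modalities.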

[131] Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings

Dongli He, Hu Wang, Mohammad Yaqub

Main category: cs.CV

TL;DR: FetalCLIP, a vision-language model, is adapted for automated fetal ultrasound image quality assessment (IQA) using LoRA, achieving high F1 scores and improving prenatal care in resource-limited settings.

DetailsMotivation: The scarcity of trained sonographers in low-income countries makes high-quality fetal ultrasound images hard to obtain, necessitating automated solutions.

Method: FetalCLIP is adapted for IQA using LoRA, and a segmentation model is repurposed for classification. Evaluated on ACOUSLIC-AI dataset against CNN and Transformer baselines.

Result: FetalCLIP$_{CLS}$ achieves an F1 score of 0.757; repurposed segmentation model improves it to 0.771.

Conclusion: Parameter-efficient fine-tuning of fetal ultrasound models enables task-specific adaptations, advancing prenatal care in resource-limited settings.

Abstract: Accurate fetal biometric measurements, such as abdominal circumference, play a vital role in prenatal care. However, obtaining high-quality ultrasound images for these measurements heavily depends on the expertise of sonographers, posing a significant challenge in low-income countries due to the scarcity of trained personnel. To address this issue, we leverage FetalCLIP, a vision-language model pretrained on a curated dataset of over 210,000 fetal ultrasound image-caption pairs, to perform automated fetal ultrasound image quality assessment (IQA) on blind-sweep ultrasound data. We introduce FetalCLIP$_{CLS}$, an IQA model adapted from FetalCLIP using Low-Rank Adaptation (LoRA), and evaluate it on the ACOUSLIC-AI dataset against six CNN and Transformer baselines. FetalCLIP$_{CLS}$ achieves the highest F1 score of 0.757. Moreover, we show that an adapted segmentation model, when repurposed for classification, further improves performance, achieving an F1 score of 0.771. Our work demonstrates how parameter-efficient fine-tuning of fetal ultrasound foundation models can enable task-specific adaptations, advancing prenatal care in resource-limited settings. The experimental code is available at: https://github.com/donglihe-hub/FetalCLIP-IQA.
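
As an illustration of the parameter-efficient recipe, here is a hedged sketch using the Hugging Face peft library on a generic CLIP vision tower. The checkpoint name, target modules, and the binary quality head are placeholder assumptions; FetalCLIP itself is not assumed to be available under this API.

```python
# Sketch: wrap a CLIP vision encoder with LoRA adapters so only low-rank
# matrices train, in the spirit of the paper's FetalCLIP adaptation.
import torch.nn as nn
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                    target_modules=["q_proj", "v_proj"])  # attention projections
model = get_peft_model(backbone, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# Hypothetical classification head for image quality (acceptable / not).
head = nn.Linear(backbone.config.hidden_size, 2)
```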

[132] Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future

Guoping Xu, Jayaram K. Udupa, Yajun Yu, Hua-Chieh Shao, Songlin Zhao, Wei Liu, You Zhang

Main category: cs.CV

TL;DR: This survey reviews SAM/SAM2-based methods for Video Object Segmentation and Tracking (VOST), focusing on past, present, and future temporal dimensions. It highlights advancements like motion-aware memory and trajectory-guided prompting, while addressing challenges like memory redundancy and prompt inefficiency.

DetailsMotivation: The complexity of VOST, including domain generalization and temporal consistency, motivates leveraging foundation models like SAM/SAM2 for robust solutions.

Method: The survey structures its review along three temporal dimensions: past (historical information), present (current frame features), and future (object dynamics prediction). It examines memory-based architectures, real-time segmentation, and innovations like motion-aware memory.

Result: The survey highlights advancements in VOST, such as improved accuracy and efficiency through SAM/SAM2, but notes challenges like error accumulation and memory redundancy.

Conclusion: The paper provides a structured overview of VOST advancements using foundation models, identifies key challenges, and suggests future research directions to further improve the field.

Abstract: Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt-driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2-based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion prediction and trajectory estimation mechanisms for anticipating object dynamics in subsequent frames (future). In doing so, we highlight the evolution from early memory-based architectures to the streaming memory and real-time segmentation capabilities of SAM2. We also discuss recent innovations such as motion-aware memory selection and trajectory-guided prompting, which aim to enhance both accuracy and efficiency. Finally, we identify remaining challenges including memory redundancy, error accumulation, and prompt inefficiency, and suggest promising directions for future research. This survey offers a timely and structured overview of the field, aiming to guide researchers and practitioners in advancing the state of VOST through the lens of foundation models.
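
The past/present/future organization maps onto a simple propagation loop. The sketch below is purely schematic; every function in it is hypothetical, and real systems such as SAM2 expose different interfaces. It is included only to show where memory retention (past), current-frame segmentation (present), and motion anticipation (future) sit in the pipeline.

```python
def track_video(frames, init_mask, encode, segment, predict_motion, max_mem=8):
    """Schematic memory-based VOST loop; encode/segment/predict_motion are
    hypothetical stand-ins for a real tracker's components."""
    memory = [encode(frames[0], init_mask)]      # past: retained history
    masks = [init_mask]
    for t in range(1, len(frames)):
        prior = predict_motion(masks[-1])        # future: anticipate object motion
        mask = segment(frames[t], memory, prior) # present: current-frame features
        memory.append(encode(frames[t], mask))
        memory = memory[-max_mem:]               # bound memory to curb redundancy
        masks.append(mask)
    return masks
```

The fixed-size memory window in the sketch hints at the redundancy and error-accumulation challenges the survey highlights: every retained frame costs compute, and every wrong mask written into memory propagates forward.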

[133] MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Yuqi Pang, Bowen Yang, Yun Cao, Fan Rong, Xiaoyu Li, Chen He

Main category: cs.CV

TL;DR: MoCHA is a novel VLLM framework integrating multiple vision backbones and a sparse Mixture of Experts Connectors (MoECs) to efficiently handle visual details and reduce costs, outperforming state-of-the-art models.

DetailsMotivation: Addressing high training/inference costs and challenges in extracting visual details and bridging modalities in VLLMs.

Method: Integrates CLIP, SigLIP, DINOv2, and ConvNeXt backbones with MoECs and Hierarchical Group Attention (HGA) for dynamic feature selection and efficient visual encoding.

Result: Outperforms open-weight models, e.g., 3.25% improvement in POPE and 153-point rise on MME.

Conclusion: MoCHA’s MoECs and HGA enhance performance and robustness, validated by ablation studies.

Abstract: Vision large language models (VLLMs) focus primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches face high training and inference costs, as well as challenges in extracting visual details and effectively bridging modalities. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2 and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture of Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (e.g., Phi2-2.7B and Vicuna-7B) and evaluate its performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) presents outstanding abilities to mitigate hallucination by showing improvements of 3.25% in POPE and to follow visual instructions by raising 153 points on MME. Finally, ablation studies further confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
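
A sparse mixture-of-experts connector follows a well-known top-k routing pattern. The sketch below shows that generic pattern, not MoCHA's exact MoECs module; expert width, count, and the dense dispatch loop are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEConnector(nn.Module):
    """Illustrative top-k MoE connector: each visual token is routed to k
    expert MLPs, and their outputs are combined with softmax router weights."""

    def __init__(self, dim: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) visual features from the backbones.
        weights, idx = self.router(tokens).topk(self.k, dim=-1)  # pick k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        # Dense loop for clarity; production MoEs dispatch tokens sparsely.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(tokens)
        return out
```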

[134] DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

Hossein Mirzaei, Zeinab Taghavi, Sepehr Rezaee, Masoud Hadi, Moein Madadi, Mackenzie W. Mathis

Main category: cs.CV

TL;DR: Proposes DISTIL, a data-free, zero-shot trigger-inversion method for detecting Trojan attacks in deep neural networks, outperforming existing methods.

DetailsMotivation: Address vulnerabilities of deep neural networks to Trojan attacks, ensuring safety in critical applications.

Method: Uses a diffusion-based generator guided by the target classifier to iteratively produce candidate triggers.

Result: Achieves up to 7.1% higher accuracy on BackdoorBench and 9.4% improvement in trojaned model scanning.

Conclusion: DISTIL offers a reliable, assumption-free defense against backdoor attacks without needing extensive data.

Abstract: Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to Trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion – reconstructing malicious “shortcut” patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus Trojaned models. DISTIL surpasses alternative methods by large margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers. The code is available at https://github.com/AdaptiveMotorControlLab/DISTIL.
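
The core mechanic, a generator steered by the suspect classifier, can be sketched as a plain gradient loop. This is a hedged simplification: `generator` is an assumed latent-to-image decoder, and DISTIL's diffusion guidance is more involved than the vanilla optimization shown here.

```python
import torch

def invert_trigger(generator, classifier, target_cls: int,
                   steps: int = 200, lr: float = 0.05):
    """Illustrative generator-guided trigger inversion: optimize a latent so the
    decoded image drives the suspect classifier toward a target class."""
    z = torch.randn(1, 128, requires_grad=True)   # latent size is an assumption
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        logits = classifier(generator(z))         # candidate trigger-bearing image
        loss = -logits[0, target_cls]             # maximize the suspected target logit
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()
```

Restricting the search to a generator's latent space, rather than the full pixel space, is what keeps candidate triggers image-like instead of degenerating into arbitrary adversarial noise.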

[135] CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models

Kedong Xiu, Saiqian Zhang

Main category: cs.CV

TL;DR: CapRecover is a framework to recover high-level semantic content (e.g., labels, captions) from intermediate features in split-DNN configurations, addressing privacy risks without image reconstruction.

DetailsMotivation: Privacy risks from semantic information leakage in split-DNN configurations, where existing methods produce blurry images, necessitate direct semantic recovery.

Method: CapRecover uses a cross-modality inversion framework to extract semantics from intermediate features, avoiding image reconstruction. It also introduces noise addition for protection.

Result: Achieves 92.71% Top-1 label accuracy on CIFAR-10 and fluent captions on COCO2017 (ROUGE-L 0.52). Deeper layers encode more semantics. Noise addition effectively prevents leakage.

Conclusion: CapRecover successfully recovers semantics from features and offers a practical protection method, balancing privacy and performance.

Abstract: As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations–with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud–there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs.
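
The described defense (add noise to each layer's intermediate features, remove it at the next layer) is simple enough to sketch directly. The bookkeeping below, where the next stage knows the previous stage's noise, is an illustrative reading of the abstract, not the authors' exact protocol.

```python
import torch

def protected_forward(layers, x: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Sketch of noise-add / noise-remove protection for split-DNN features."""
    noise = None
    for layer in layers:
        if noise is not None:
            x = x - noise          # next layer strips the previous layer's noise
        x = layer(x)
        noise = sigma * torch.randn_like(x)
        x = x + noise              # the transmitted feature is always noised
    return x - noise               # the final consumer removes the last noise
```

The appeal of this scheme, as the abstract notes, is that it requires no retraining: a feature-inversion attacker only ever observes noised activations, while the legitimate pipeline cancels the noise exactly.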

[136] Wall Shear Stress Estimation in Abdominal Aortic Aneurysms: Towards Generalisable Neural Surrogate Models

Patryk Rygiel, Julian Suk, Christoph Brune, Kak Khee Yeung, Jelmer M. Wolterink

Main category: cs.CV

TL;DR: A geometric deep learning model is proposed to estimate hemodynamic parameters in AAA patients, showing strong generalization across real-world variations like geometry remodeling and boundary condition changes.

DetailsMotivation: Traditional CFD simulations for AAA hemodynamics are computationally expensive, prompting the need for faster, accurate alternatives like geometric deep learning.

Method: An E(3)-equivariant deep learning model using robust geometrical descriptors and projective geometric algebra is trained on CT scans of 100 AAA patients with reference CFD data.

Result: The model generalizes well within and outside the training distribution, handling geometry changes, boundary conditions, and even new artery branches during inference.

Conclusion: The proposed model offers accurate, efficient hemodynamic estimation, with potential clinical applications due to its robustness and generalizability.

Abstract: Abdominal aortic aneurysms (AAAs) are pathologic dilatations of the abdominal aorta posing a high fatality risk upon rupture. Studying AAA progression and rupture risk often involves in-silico blood flow modelling with computational fluid dynamics (CFD) and extraction of hemodynamic factors like time-averaged wall shear stress (TAWSS) or oscillatory shear index (OSI). However, CFD simulations are known to be computationally demanding. Hence, in recent years, geometric deep learning methods, operating directly on 3D shapes, have been proposed as compelling surrogates, estimating hemodynamic parameters in just a few seconds. In this work, we propose a geometric deep learning approach to estimating hemodynamics in AAA patients, and study its generalisability to common factors of real-world variation. We propose an E(3)-equivariant deep learning model utilising novel robust geometrical descriptors and projective geometric algebra. Our model is trained to estimate transient WSS using a dataset of CT scans of 100 AAA patients, from which lumen geometries are extracted and reference CFD simulations with varying boundary conditions are obtained. Results show that the model generalizes well within the distribution, as well as to the external test set. Moreover, the model can accurately estimate hemodynamics across geometry remodelling and changes in boundary conditions. Furthermore, we find that a trained model can be applied to different artery tree topologies, where new and unseen branches are added during inference. Finally, we find that the model is to a large extent agnostic to mesh resolution. These results show the accuracy and generalisation of the proposed model, and highlight its potential to contribute to hemodynamic parameter estimation in clinical practice.

[137] Bi-Level Optimization for Self-Supervised AI-Generated Face Detection

Mian Zou, Nan Zhong, Baosheng Yu, Yibing Zhan, Kede Ma

Main category: cs.CV

TL;DR: A self-supervised method using bi-level optimization improves AI-generated face detection by pretraining on photographic images and optimizing pretext tasks for better generalization.

DetailsMotivation: Existing AI-generated face detectors lack generalization to new generative techniques due to reliance on specific synthesized images.

Method: Bi-level optimization: inner loop pretrains a vision encoder on photographic images with weighted pretext tasks; outer loop optimizes task weights for manipulated face detection.

Result: Detectors outperform existing methods in one-class and binary classification, generalizing well to unseen generators.

Conclusion: The self-supervised approach enhances AI-generated face detection by aligning pretext tasks with the ultimate goal, improving generalization.

Abstract: AI-generated face detectors trained via supervised learning typically rely on synthesized images from specific generators, limiting their generalization to emerging generative techniques. To overcome this limitation, we introduce a self-supervised method based on bi-level optimization. In the inner loop, we pretrain a vision encoder only on photographic face images using a set of linearly weighted pretext tasks: classification of categorical exchangeable image file format (EXIF) tags, ranking of ordinal EXIF tags, and detection of artificial face manipulations. The outer loop then optimizes the relative weights of these pretext tasks to enhance the coarse-grained detection of manipulated faces, serving as a proxy task for identifying AI-generated faces. In doing so, it aligns self-supervised learning more closely with the ultimate goal of AI-generated face detection. Once pretrained, the encoder remains fixed, and AI-generated faces are detected either as anomalies under a Gaussian mixture model fitted to photographic face features or by a lightweight two-layer perceptron serving as a binary classifier. Extensive experiments demonstrate that our detectors significantly outperform existing approaches in both one-class and binary classification settings, exhibiting strong generalization to unseen generators.
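
The bi-level structure can be illustrated with a first-order, one-step-lookahead sketch. This is a generic simplification of the pattern in the abstract, not the authors' exact algorithm: the inner step trains the encoder on a weighted sum of pretext losses, and the outer step updates the task weights through the looked-ahead parameters.

```python
import torch

def bilevel_step(encoder, pretext_losses, task_logits, proxy_loss, lr_inner=1e-3):
    """pretext_losses: callables encoder -> scalar loss (EXIF classification,
    EXIF ranking, manipulation detection). proxy_loss: callable taking the
    updated parameter list and scoring manipulated-face detection. task_logits:
    learnable weights over the pretext tasks. All interfaces are assumptions."""
    w = torch.softmax(task_logits, dim=0)          # positive, normalized weights
    inner = sum(wi * f(encoder) for wi, f in zip(w, pretext_losses))

    # Differentiable one-step update of encoder params; create_graph keeps the
    # dependence of the updated params on the task weights.
    params = [p for p in encoder.parameters() if p.requires_grad]
    grads = torch.autograd.grad(inner, params, create_graph=True)
    updated = [p - lr_inner * g for p, g in zip(params, grads)]

    # Outer objective on the looked-ahead encoder; its gradient w.r.t. the task
    # logits is the hypergradient used to re-weight the pretext tasks.
    outer = proxy_loss(updated)
    hyper_grad, = torch.autograd.grad(outer, [task_logits])
    with torch.no_grad():
        task_logits -= 0.1 * hyper_grad            # illustrative outer step size
```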

[138] DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion

Qingcheng Zhao, Xiang Zhang, Haiyang Xu, Zeyuan Chen, Jianwen Xie, Yuan Gao, Zhuowen Tu

Main category: cs.CV

TL;DR: DepR is a depth-guided framework for single-view scene reconstruction, using instance-level diffusion and depth throughout training and inference for improved performance.

DetailsMotivation: Existing methods underutilize depth information, limiting reconstruction quality. DepR aims to fully exploit depth for better scene reconstruction.

Method: DepR integrates depth-guided conditioning into diffusion models, using depth for DDIM sampling and layout optimization during inference.

Result: DepR achieves state-of-the-art performance and strong generalization on synthetic and real-world datasets.

Conclusion: Depth-guided conditioning and compositional reconstruction in DepR significantly improve single-view scene reconstruction.

Abstract: We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. Instead of reconstructing the entire scene holistically, DepR generates individual objects and subsequently composes them into a coherent 3D layout. Unlike previous methods that use depth solely for object layout estimation during inference and therefore fail to fully exploit its rich geometric information, DepR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into diffusion models. During inference, depth further guides DDIM sampling and layout optimization, enhancing alignment between the reconstruction and the input image. Despite being trained on limited synthetic data, DepR achieves state-of-the-art performance and demonstrates strong generalization in single-view scene reconstruction, as shown through evaluations on both synthetic and real-world datasets.

[139] ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Main category: cs.CV

TL;DR: A modular multi-agent framework for UI-to-code generation improves robustness and interpretability over black-box methods, achieving state-of-the-art performance.

DetailsMotivation: Automating UI-to-code transformation can accelerate development and democratize design workflows, but existing LLM-based approaches lack spatial and visual design capture.

Method: A three-stage framework: grounding (detects UI components), planning (constructs hierarchical layout), and generation (produces HTML/CSS code). Extended into a scalable data engine for synthetic image-code pairs.

Result: State-of-the-art performance in layout accuracy, structural coherence, and code correctness. Fine-tuned open-source VLM shows improved UI understanding and code quality.

Conclusion: The framework enhances UI-to-code generation by combining modularity, interpretability, and scalability, with publicly available code.

Abstract: Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While recent large language models (LLMs) have demonstrated progress in text-to-code generation, many existing approaches rely solely on natural language prompts, limiting their effectiveness in capturing spatial layout and visual design intent. In contrast, UI development in practice is inherently multimodal, often starting from visual sketches or mockups. To address this gap, we introduce a modular multi-agent framework that performs UI-to-code generation in three interpretable stages: grounding, planning, and generation. The grounding agent uses a vision-language model to detect and label UI components, the planning agent constructs a hierarchical layout using front-end engineering priors, and the generation agent produces HTML/CSS code via adaptive prompt-based synthesis. This design improves robustness, interpretability, and fidelity over end-to-end black-box methods. Furthermore, we extend the framework into a scalable data engine that automatically produces large-scale image-code pairs. Using these synthetic examples, we fine-tune and reinforce an open-source VLM, yielding notable gains in UI understanding and code quality. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.

[140] TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

Siqi Luo, Haoran Yang, Yi Xin, Mingyang Yi, Guangyang Wu, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: TR-PTS is a task-driven framework for efficient fine-tuning of large models by selecting task-relevant parameters and tokens, improving both performance and computational efficiency.

DetailsMotivation: Large pre-trained models are costly to fine-tune, and existing PEFT methods are task-agnostic, leading to suboptimal efficiency and performance.

Method: TR-PTS uses Fisher Information Matrix for layer-wise parameter selection and dynamic token selection to focus on task-discriminative information.

Result: TR-PTS outperforms full fine-tuning by 3.40% on FGVC and 10.35% on VTAB-1k, achieving state-of-the-art performance.

Conclusion: TR-PTS effectively balances efficiency and accuracy by optimizing task-relevant parameters and tokens, setting a new benchmark for PEFT methods.

Abstract: Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmarks including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively. The code is available at https://github.com/synbol/TR-PTS.
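
Fisher-based parameter selection has a standard empirical recipe: the FIM diagonal is approximated by accumulated squared gradients, and only the highest-scoring parameter groups stay trainable. The sketch below shows that generic recipe (not TR-PTS's exact procedure); the layer-wise averaging and keep fraction are illustrative choices.

```python
import torch

def fisher_scores(model, loss_fn, loader, n_batches: int = 10):
    """Accumulate squared gradients as an empirical Fisher-diagonal estimate."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, batch in enumerate(loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += p.grad.detach() ** 2   # empirical Fisher diagonal
    return scores

def freeze_all_but_top(model, scores, keep_frac: float = 0.1):
    """Keep the top-scoring parameter tensors trainable, freeze the rest."""
    per_param = {n: s.mean().item() for n, s in scores.items()}  # layer-wise score
    cutoff = sorted(per_param.values(), reverse=True)[int(keep_frac * len(per_param))]
    for n, p in model.named_parameters():
        p.requires_grad = per_param[n] >= cutoff
```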

[141] LCS: An AI-based Low-Complexity Scaler for Power-Efficient Super-Resolution of Game Content

Simon Pochinda, Momen K. Tageldeen, Mark Thompson, Tony Rinaldi, Troy Giorshev, Keith Lee, Jie Zhou, Frederick Walls

Main category: cs.CV

TL;DR: Proposes an AI-based low-complexity scaler (LCS) to offload GPU workload to NPUs, achieving better perceptual quality than existing methods.

DetailsMotivation: Address the growing GPU workload in modern games by leveraging AI for efficient upscaling.

Method: Train LCS on GameIR image pairs using adversarial training, reparameterization, and quantization for reduced complexity.

Result: LCS outperforms AMD EASF and FSR1 in perceptual quality, showing promise for resource-constrained devices.

Conclusion: ESR models like LCS are viable for efficient upscaling, reducing GPU dependency.

Abstract: The increasing complexity of content rendering in modern games has led to a problematic growth in the workload of the GPU. In this paper, we propose an AI-based low-complexity scaler (LCS) inspired by state-of-the-art efficient super-resolution (ESR) models which could offload the workload on the GPU to a low-power device such as a neural processing unit (NPU). The LCS is trained on GameIR image pairs natively rendered at low and high resolution. We utilize adversarial training to encourage reconstruction of perceptually important details, and apply reparameterization and quantization techniques to reduce model complexity and size. In our comparative analysis we evaluate the LCS alongside the publicly available AMD hardware-based Edge Adaptive Scaling Function (EASF) and AMD FidelityFX Super Resolution 1 (FSR1) on five different metrics, and find that the LCS achieves better perceptual quality, demonstrating the potential of ESR models for upscaling on resource-constrained devices.

[142] Viser: Imperative, Web-based 3D Visualization in Python

Brent Yi, Chung Min Kim, Justin Kerr, Gina Wu, Rebecca Feng, Anthony Zhang, Jonas Kulhanek, Hongsuk Choi, Yi Ma, Matthew Tancik, Angjoo Kanazawa

Main category: cs.CV

TL;DR: Viser is a 3D visualization library for Python, designed for computer vision and robotics, offering easy-to-use and extensible 3D scene and 2D GUI primitives.

DetailsMotivation: To simplify and enhance 3D visualization in Python for computer vision and robotics applications.

Method: Provides an imperative-style API and a web-based viewer, along with a comprehensive set of 3D scene and 2D GUI primitives.

Result: A flexible and compatible library that supports modern programming patterns and workflows.

Conclusion: Viser successfully bridges the gap between ease of use and extensibility in 3D visualization for Python.

Abstract: We present Viser, a 3D visualization library for computer vision and robotics. Viser aims to bring easy and extensible 3D visualization to Python: we provide a comprehensive set of 3D scene and 2D GUI primitives, which can be used independently with minimal setup or composed to build specialized interfaces. This technical report describes Viser’s features, interface, and implementation. Key design choices include an imperative-style API and a web-based viewer, which improve compatibility with modern programming patterns and workflows.
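
Since Viser's selling points are its imperative API and web viewer, a short usage sketch is instructive. Method names below follow recent documented versions of the library (`viser.ViserServer`, `server.scene.add_point_cloud`, `server.gui.add_slider`) and may differ in older releases.

```python
# Minimal Viser usage sketch: serve a random point cloud plus a GUI slider.
import time
import numpy as np
import viser

server = viser.ViserServer()  # launches a web viewer, typically at localhost:8080

points = np.random.uniform(-1, 1, size=(5000, 3)).astype(np.float32)
colors = ((points + 1) / 2 * 255).astype(np.uint8)
server.scene.add_point_cloud("/cloud", points=points, colors=colors,
                             point_size=0.01)

server.gui.add_slider("point size", min=0.001, max=0.05, step=0.001,
                      initial_value=0.01)

while True:           # keep the server alive; GUI callbacks run in the background
    time.sleep(1.0)
```

The imperative style is visible here: scene elements and GUI widgets are added with plain function calls, with no scene-graph boilerplate or callback registration required up front.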

[143] Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

Kaining Ying, Henghui Ding, Guanquan Jie, Yu-Gang Jiang

Main category: cs.CV

TL;DR: OmniAVS introduces a new dataset and OISA, a method for multimodal reasoning in audio-visual segmentation, outperforming existing methods.

DetailsMotivation: To address challenges in integrating multimodal information and reasoning about audiovisual content in RAVS.

Method: Proposes OmniAVS dataset with diverse multimodal expressions and OISA, leveraging MLLM for reasoning-based segmentation.

Result: OISA outperforms existing methods on OmniAVS and achieves competitive results on related tasks.

Conclusion: OmniAVS and OISA advance RAVS by enabling deeper multimodal understanding and reasoning.

Abstract: Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting its presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce the Omnimodal Instructed Segmentation Assistant (OISA) to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses an MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.

[144] Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation

Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag

Main category: cs.CV

TL;DR: CLoRA is a training-free method for multi-concept image generation using LoRA models, addressing attention overlap issues by updating attention maps and fusing latent representations.

DetailsMotivation: Existing LoRA methods struggle with multi-concept generation due to overlapping attention mechanisms, leading to ignored or incorrectly combined concepts.

Method: CLoRA updates attention maps of multiple LoRA models at test-time and uses semantic masks to fuse latent representations.

Result: CLoRA outperforms existing methods in generating accurate multi-concept images, as shown in qualitative and quantitative evaluations.

Conclusion: CLoRA effectively addresses the limitations of current LoRA-based methods, enabling precise multi-concept image generation.

Abstract: Low-Rank Adaptation (LoRA) has emerged as a powerful and popular technique for personalization, enabling efficient adaptation of pre-trained image generation models for specific tasks without comprehensive retraining. While employing individual pre-trained LoRA models excels at representing single concepts, such as those representing a specific dog or a cat, utilizing multiple LoRA models to capture a variety of concepts in a single image still poses a significant challenge. Existing methods often fall short, primarily because the attention mechanisms within different LoRA models overlap, leading to scenarios where one concept may be completely ignored (e.g., omitting the dog) or where concepts are incorrectly combined (e.g., producing an image of two cats instead of one cat and one dog). We introduce CLoRA, a training-free approach that addresses these limitations by updating the attention maps of multiple LoRA models at test-time, and leveraging the attention maps to create semantic masks for fusing latent representations. This enables the generation of composite images that accurately reflect the characteristics of each LoRA. Our comprehensive qualitative and quantitative evaluations demonstrate that CLoRA significantly outperforms existing methods in multi-concept image generation using LoRAs.

[145] Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha

Main category: cs.CV

TL;DR: A survey paper on Vision-Language Models (VLMs) categorizes them into three types, analyzes their architectures, training data, and performance, and suggests future research directions.

DetailsMotivation: To address the limitation of Large Language Models (LLMs) in processing only textual data by integrating visual capabilities, leading to the development of VLMs for complex tasks like image captioning and visual question answering.

Method: Classifies VLMs into three categories based on capabilities, analyzes their architectures, training data, strengths, limitations, and benchmark performance.

Result: Provides a detailed understanding of VLM components and their performance, highlighting their diverse functionalities.

Conclusion: The survey offers insights into VLMs’ advancements, their limitations, and potential future research directions for further breakthroughs.

Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyze the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.

[146] Metric Convolutions: A Unifying Theory to Adaptive Image Convolutions

Thomas Dagès, Michael Lindenbaum, Alfred M. Bruckstein

Main category: cs.CV

TL;DR: The paper introduces metric convolutions, a novel approach unifying kernel deformations under a metric framework, offering better adaptability and generalization with fewer parameters.

DetailsMotivation: Standard convolutions lack adaptability due to fixed kernels, and existing deformation strategies lack a unified theoretical framework.

Method: The paper proposes metric convolutions, sampling unit balls from explicit signal-dependent metrics, providing interpretable operators with geometric regularization.

Result: Metric convolutions show competitive performance in denoising and classification tasks, requiring fewer parameters and offering better generalization.

Conclusion: The metric convolution framework unifies kernel deformations, enhances interpretability, and improves performance while being compatible with gradient-based optimization.

Abstract: Standard convolutions are prevalent in image processing and deep learning, but their fixed kernels limit adaptability. Several deformation strategies of the reference kernel grid have been proposed, yet they lack a unified theoretical framework. By returning to a metric perspective for images, now seen as two-dimensional manifolds equipped with notions of local and geodesic distances, either symmetric (Riemannian) or not (Finsler), we provide a unifying principle: the kernel positions are samples of unit balls of implicit metrics. With this new perspective, we also propose metric convolutions, a novel approach that samples unit balls from explicit signal-dependent metrics, providing interpretable operators with geometric regularisation. This framework, compatible with gradient-based optimisation, can directly replace existing convolutions applied to either input images or deep features of neural networks. Metric convolutions typically require fewer parameters and provide better generalisation. Our approach shows competitive performance in standard denoising and classification tasks.
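
To make the unifying principle concrete, here is a hedged formalization of the abstract's central idea (one plausible reading, not the paper's exact definitions). In the Riemannian case with a signal-dependent local metric tensor $G(x)$, the unit ball at pixel $x$ is

\[ B_x = \{\, v \in \mathbb{R}^2 : v^\top G(x)\, v \le 1 \,\}, \]

and a metric convolution evaluates the kernel at sample offsets $v_1, \dots, v_k$ drawn from $B_x$:

\[ (f \ast_G k)(x) = \sum_{i=1}^{k} k(v_i)\, f(x + v_i). \]

A standard convolution is recovered when $G(x)$ is the identity everywhere, and Finsler metrics drop the symmetry requirement on the ball, allowing direction-dependent neighborhoods.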

[147] The Cooperative Network Architecture: Learning Structured Networks as Representation of Sensory Patterns

Pascal J. Sager, Jan M. Deriu, Benjamin F. Grewe, Thilo Stadelmann, Christoph von der Malsburg

Main category: cs.CV

TL;DR: The paper introduces the Cooperative Network Architecture (CNA), a model using recurrently connected neuron networks (’nets’) for robust sensory signal representation, addressing noise and out-of-distribution challenges in vision systems.

DetailsMotivation: To address robustness issues like noise, deformation, and out-of-distribution data in current vision systems by proposing a novel neural representation model.

Method: CNA dynamically assembles nets from learned net fragments based on sensory input regularities, enabling unsupervised learning and flexible recombination for novel patterns.

Result: Demonstrates that net fragments can be learned unsupervised, enabling figure completion and noise resilience, integrating local and global processing.

Conclusion: CNA is a promising paradigm for invariant object recognition, combining local feature processing with global structure formation.

Abstract: We introduce the Cooperative Network Architecture (CNA), a model that represents sensory signals using structured, recurrently connected networks of neurons, termed “nets.” Nets are dynamically assembled from overlapping net fragments, which are learned based on statistical regularities in sensory input. This architecture offers robustness to noise, deformation, and out-of-distribution data, addressing challenges in current vision systems from a novel perspective. We demonstrate that net fragments can be learned without supervision and flexibly recombined to encode novel patterns, enabling figure completion and resilience to noise. Our findings establish CNA as a promising paradigm for developing neural representations that integrate local feature processing with global structure formation, providing a foundation for future research on invariant object recognition.

[148] SMAFormer: Synergistic Multi-Attention Transformer for Medical Image Segmentation

Fuchen Zheng, Xuhang Chen, Weihuang Liu, Haolun Li, Yingtie Lei, Jiahui He, Chi-Man Pun, Shounjun Zhou

Main category: cs.CV

TL;DR: SMAFormer is a Transformer-based architecture combining multiple attention mechanisms for improved segmentation of small, irregular tumors in medical images.

DetailsMotivation: Existing models struggle with small, irregularly shaped tumors, prompting the need for a more effective solution.

Method: SMAFormer integrates Synergistic Multi-Attention (SMA) Transformer blocks and a Feature Fusion Modulator to enhance feature capture and fusion.

Result: Achieves state-of-the-art performance in tasks like multi-organ, liver tumor, and bladder tumor segmentation.

Conclusion: SMAFormer effectively addresses challenges in medical image segmentation, particularly for small tumors, with superior results.

Abstract: In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: https://github.com/CXH-Research/SMAFormer.

[149] StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification

Yichen He, Yuan Lin, Jianchao Wu, Hanchong Zhang, Yuchen Zhang, Ruicheng Le

Main category: cs.CV

TL;DR: StoryTeller improves long video descriptions by integrating audio-visual character identification and multimodal inputs, outperforming baselines like Gemini-1.5-pro and GPT-4o.

DetailsMotivation: Existing LVLMs struggle with long videos, lacking consistency in character identification and plot-level descriptions.

Method: Proposes StoryTeller, a multimodal system using audio-visual character identification and a LVLM for enhanced video descriptions.

Result: Outperforms baselines by 9.5% in accuracy on StoryQA and improves other models by 5.5-13.0%.

Conclusion: StoryTeller effectively addresses long video description challenges, enhancing performance and consistency.

Abstract: Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle with generating coherent descriptions for extended videos spanning minutes or more. Long video description introduces new challenges, such as consistent character identification and plot-level descriptions incorporating both visual and audio information. To address these, we identify audio-visual character identification, i.e., matching character names to each line of dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos, incorporating both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of video descriptions. We validate our approach on movie description tasks and introduce MovieStory101, a dataset with dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create StoryQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess descriptions by inputting them into GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open and closed-source baselines on StoryQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating audio-visual character identification from StoryTeller improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative improvement of 5.5% and 13.0%, respectively, in accuracy on StoryQA.

[150] FastTrackTr:Towards Fast Multi-Object Tracking with Transformers

Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Wenhui Zhao, Dingwen Zhang

Main category: cs.CV

TL;DR: The paper introduces FastTrackTr, a fast and novel Joint Detection and Tracking (JDT) framework for multi-object tracking (MOT) using Transformer-based methods, addressing slow inference speeds while maintaining accuracy.

DetailsMotivation: Transformer-based MOT methods are popular but suffer from slow inference speeds. The paper revisits JDT to improve efficiency without compromising tracking accuracy.

Method: Integrates original JDT with advanced theories, using efficient information transfer between frames on DETR to reduce query numbers and avoid excessive network complexity.

Result: Achieves potential real-time tracking with competitive accuracy across multiple datasets.

Conclusion: FastTrackTr offers a simpler, faster solution for MOT while maintaining high tracking performance.

Abstract: Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisit the Joint Detection and Tracking (JDT) method by looking back at past approaches. By integrating the original JDT approach with recent advances, this paper employs an efficient inter-frame information transfer method on DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids introducing excessive network structure, ensuring model simplicity. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.

[151] Language Driven Occupancy Prediction

Zhu Yu, Bowen Pang, Lizhe Liu, Runmin Zhang, Qiang Li, Si-Yuan Cao, Maochun Luo, Mingxia Chen, Sheng Yang, Hui-Liang Shen

Main category: cs.CV

TL;DR: LOcc is a framework for open-vocabulary occupancy prediction, using a semantic transitive labeling pipeline to generate accurate 3D language ground truth, outperforming existing methods.

DetailsMotivation: To address inaccurate supervision in previous approaches for open-vocabulary occupancy prediction by generating dense, fine-grained 3D language ground truth.

Method: Proposes a semantic transitive labeling pipeline to transfer text labels from images to LiDAR point clouds and voxels, replacing prediction heads with geometry and language heads.

Result: Produces more accurate pseudo-labeled ground truth, reducing human annotation efforts, and outperforms state-of-the-art methods on the Occ3D-nuScenes dataset.

Conclusion: LOcc effectively improves open-vocabulary occupancy prediction with accurate supervision and generalizes well across architectures.

Abstract: We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our transitive semantic labeling pipeline can produce more accurate pseudo-labeled ground truth, reducing the need for labor-intensive human annotation. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset.
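
The image-to-LiDAR-to-voxel label transfer follows a standard projection recipe. The sketch below shows that generic recipe (not LOcc's full pipeline): project LiDAR points into the image, inherit per-pixel text-label ids, then majority-vote labels per voxel. Calibration matrices and the voxel size are assumed inputs.

```python
import numpy as np

def transfer_labels(points_lidar, pixel_labels, K, T_cam_lidar, voxel_size=0.4):
    """points_lidar: (N, 3). pixel_labels: (H, W) int label ids from an
    image-level open-vocabulary model. K: (3, 3) intrinsics. T_cam_lidar:
    (4, 4) LiDAR-to-camera extrinsics. Returns {voxel index: label id}."""
    # LiDAR -> camera -> pixel coordinates.
    pts_h = np.c_[points_lidar, np.ones(len(points_lidar))]        # (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    valid = pts_cam[:, 2] > 0                                      # in front of camera
    uv = (K @ pts_cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)

    h, w = pixel_labels.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    pts = points_lidar[valid][inside]
    labels = pixel_labels[uv[inside, 1], uv[inside, 0]]            # inherit text labels

    # Voxelize and take a per-voxel majority vote.
    voxels = {}
    for p, l in zip(np.floor(pts / voxel_size).astype(int), labels):
        voxels.setdefault(tuple(p), []).append(int(l))
    return {v: max(set(ls), key=ls.count) for v, ls in voxels.items()}
```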

[152] I2VControl: Disentangled and Unified Video Motion Synthesis Control

Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He

Main category: cs.CV

TL;DR: I2VControl is a disentangled and unified framework for video synthesis that resolves logical conflicts in combining multiple motion control types.

Motivation: Existing methods for video synthesis are limited to single control types, and combining them often leads to logical conflicts.

Method: The framework reformulates tasks into consistent point trajectory representations, uses a spatial partitioning strategy, and includes an adapter for pre-trained models.

Result: The method achieves excellent performance on various control tasks and enables user-driven creative combinations.

Conclusion: I2VControl enhances innovation and creativity in video synthesis by dynamically orchestrating diverse control types without conflicts.

Abstract: Motion controllability is crucial in video synthesis. However, most previous methods are limited to single control types, and combining them often results in logical conflicts. In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. We rethink camera control, object dragging, and motion brush, reformulating all tasks into a consistent representation based on point trajectories, each managed by a dedicated formulation. Accordingly, we propose a spatial partitioning strategy, where each unit is assigned to a concomitant control category, enabling diverse control types to be dynamically orchestrated within a single synthesis pipeline without conflicts. Furthermore, we design an adapter structure that functions as a plug-in for pre-trained models and is agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. Project page: https://wanquanf.github.io/I2VControl .

[153] Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection

Md. Mithun Hossain, Md. Shakil Hossain, Sudipto Chaki, M. F. Mridha

Main category: cs.CV

TL;DR: Co-AttenDWG enhances multi-modal learning by combining co-attention, dimension-wise gating, and expert fusion for better cross-modal interactions and performance.

Motivation: Existing multi-modal approaches suffer from insufficient cross-modal interactions and rigid fusion strategies, limiting their ability to leverage the complementary strengths of different modalities.

Method: Projects textual and visual features into a shared space, uses co-attention and dimension-wise gating, refines features with dual-path encoders, and aggregates via expert fusion.

Result: Achieves state-of-the-art performance on MIMIC and SemEval Memotion 1.0 datasets, demonstrating superior cross-modal alignment.

Conclusion: Co-AttenDWG is effective for diverse multi-modal applications, offering robust unified representations.

Abstract: Multi-modal learning has emerged as a crucial research direction, as integrating textual and visual information can substantially enhance performance in tasks such as classification, retrieval, and scene understanding. Despite advances with large pre-trained models, existing approaches often suffer from insufficient cross-modal interactions and rigid fusion strategies, failing to fully harness the complementary strengths of different modalities. To address these limitations, we propose Co-AttenDWG, an architecture that combines co-attention with dimension-wise gating and expert fusion. Our approach first projects textual and visual features into a shared embedding space, where a dedicated co-attention mechanism enables simultaneous, fine-grained interactions between modalities. This is further strengthened by a dimension-wise gating network, which adaptively modulates feature contributions at the channel level to emphasize salient information. In parallel, dual-path encoders independently refine modality-specific representations, while an additional cross-attention layer aligns the modalities further. The resulting features are aggregated via an expert fusion module that integrates learned gating and self-attention, yielding a robust unified representation. Experimental results on the MIMIC and SemEval Memotion 1.0 datasets show that Co-AttenDWG achieves state-of-the-art performance and superior cross-modal alignment, highlighting its effectiveness for diverse multi-modal applications.

[154] Counting Stacked Objects

Corentin Dumery, Noa Etté, Aoxiang Fan, Ren Li, Jingyi Xu, Hieu Le, Pascal Fua

Main category: cs.CV

TL;DR: A novel 3D counting method for stacked objects using multi-view images, combining geometric reconstruction and deep learning.

Motivation: Existing methods fail to count stacked 3D objects where most are hidden, limiting applications like biomedicine and traffic monitoring.

Method: Decomposes counting into 3D geometry estimation and occupancy ratio analysis from multi-view images, integrating geometric reconstruction and deep learning.

Result: Accurately counts identical objects in containers, even with irregular stacking, validated on real-world and synthetic datasets.

Conclusion: The proposed 3D counting pipeline addresses a critical gap; the authors will release their datasets to support future research.

Abstract: Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.

[155] FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image

Qiao Feng, Yuanwang Yang, Yebin Liu, Yu-Kun Lai, Jingyu Yang, Kun Li

Main category: cs.CV

TL;DR: FOF-X introduces an efficient 3D representation (Fourier Occupancy Field) for real-time human geometry reconstruction from a single image, balancing speed and quality.

Motivation: Existing 3D representations are computationally demanding, hindering real-time performance. FOF-X aims to bridge the gap between 3D and 2D domains for robust, high-quality reconstruction.

Method: FOF-X uses Fourier Occupancy Field (FOF), a factorized 3D representation compatible with 2D CNNs, and integrates human parametric models. It includes Laplacian constraints and discontinuity matchers for quality.

Result: FOF-X achieves state-of-the-art results on various datasets and real-captured data, handling domain gaps and improving robustness.

Conclusion: FOF-X offers a novel, efficient solution for real-time human geometry reconstruction, validated by superior performance and released code.

Abstract: We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image. Balancing real-time speed against high-quality results is a persistent challenge, mainly due to the high computational demands of existing 3D representations. To address this, we propose Fourier Occupancy Field (FOF), an efficient 3D representation by learning the Fourier series. The core of FOF is to factorize a 3D occupancy field into a 2D vector field, retaining topology and spatial relationships within the 3D domain while facilitating compatibility with 2D convolutional neural networks. Such a representation bridges the gap between 3D and 2D domains, enabling the integration of human parametric models as priors and enhancing the reconstruction robustness. Based on FOF, we design a new reconstruction framework, FOF-X, to avoid the performance degradation caused by texture and lighting. This enables our real-time reconstruction system to better handle the domain gap between training images and real images. Additionally, in FOF-X, we enhance the inter-conversion algorithms between FOF and mesh representations with a Laplacian constraint and an automaton-based discontinuity matcher, improving both quality and robustness. We validate the strengths of our approach on different datasets and real-captured data, where FOF-X achieves new state-of-the-art results. The code has already been released for research purposes at https://cic.tju.edu.cn/faculty/likun/projects/FOFX/index.html.

[156] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang

Main category: cs.CV

TL;DR: SpatialViz-Bench is introduced as a benchmark to evaluate spatial visualization in MLLMs, revealing performance gaps and counter-intuitive behaviors.

Motivation: Existing evaluations for spatial visualization in MLLMs are inadequate, often overlapping with training data, leading to unreliable assessments.

Method: A multi-modal benchmark (SpatialViz-Bench) with 12 tasks across 4 sub-abilities and 1,180 problems is created and tested on 33 MLLMs.

Result: Performance varies widely; models show misaligned difficulty perception, 2D-to-3D cliffs, formulaic defaults, and degraded performance with Chain-of-Thought prompting.

Conclusion: MLLMs still struggle with spatial visualization, highlighting the need for better benchmarks like SpatialViz-Bench.

Abstract: Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models show difficulty perception misaligned with human intuition, exhibit dramatic 2D-to-3D performance cliffs, default to formulaic derivation over visualization, and, among open-source models, paradoxically suffer performance degradation from Chain-of-Thought prompting. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.

[157] RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations

Mingshu Zhao, Yi Luo, Yong Ouyang

Main category: cs.CV

TL;DR: RecConv introduces a recursive decomposition strategy that efficiently expands the effective receptive field using small-kernel convolutions, avoiding the quadratic parameter growth of large kernels while maintaining computational efficiency.

Motivation: The quadratic scaling of parameter count and FLOPs with kernel size in large-kernel convolutions poses efficiency and optimization challenges.

Method: RecConv uses recursive decomposition with small-kernel convolutions to linearly scale parameters and maintain constant FLOPs while expanding the effective receptive field.

Result: RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 AP on COCO with similar FLOPs, demonstrating efficiency and performance gains.

Conclusion: RecConv offers a scalable and efficient solution for designing compact networks, applicable across various modalities.

Abstract: Recent advances in vision transformers (ViTs) have demonstrated the advantage of global modeling capabilities, prompting widespread integration of large-kernel convolutions for enlarging the effective receptive field (ERF). However, the quadratic scaling of parameter count and computational complexity (FLOPs) with respect to kernel size poses significant efficiency and optimization challenges. This paper introduces RecConv, a recursive decomposition strategy that efficiently constructs multi-frequency representations using small-kernel convolutions. RecConv establishes a linear relationship between parameter growth and the number of decomposition levels, which determines the effective receptive field $k\times 2^\ell$ for a base kernel $k$ and $\ell$ levels of decomposition, while maintaining constant FLOPs regardless of the ERF expansion. Specifically, RecConv achieves a parameter expansion of only $\ell+2$ times and a maximum FLOPs increase of $5/3$ times, compared to the exponential growth ($4^\ell$) of standard and depthwise convolutions. RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 $AP^{box}$ on COCO with similar FLOPs. This innovation provides a promising avenue towards designing efficient and compact networks across various modalities. Codes and models can be found at https://github.com/suous/RecNeXt.
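
To make the abstract's scaling claims concrete: with base kernel $k$ and $\ell$ decomposition levels, the stated relationships are

    $\mathrm{ERF} = k \times 2^{\ell}, \qquad \mathrm{Params_{RecConv}} / \mathrm{Params_{base}} = \ell + 2, \qquad \mathrm{FLOPs_{RecConv}} / \mathrm{FLOPs_{base}} \le 5/3,$

whereas a standard or depthwise convolution reaching the same ERF scales its parameters by $(2^{\ell})^{2} = 4^{\ell}$. For example, $k=3$ and $\ell=3$ give a $24 \times 24$ effective receptive field at $5\times$ the base parameter count, versus $64\times$ for a dense $24 \times 24$ kernel.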

[158] Scaling RL to Long Videos

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Main category: cs.CV

TL;DR: A framework for scaling vision-language models (VLMs) to long videos using reinforcement learning, featuring a dataset, training pipeline, and efficient infrastructure.

Motivation: Addressing the challenges of reasoning in long videos by integrating large-scale datasets, advanced training methods, and optimized infrastructure.

Method: Combines a dataset (LongVideo-Reason), a two-stage training pipeline (CoT-SFT and RL), and infrastructure (MR-SP) for efficient long video processing.

Result: LongVILA-R1-7B reaches 65.1% and 71.1% accuracy on VideoMME (without and with subtitles, respectively), supports up to 8,192 frames per video, and MR-SP provides up to a 2.1x training speedup.

Conclusion: The framework advances long video reasoning, offering public tools for RL training across modalities and models.

Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).

[159] SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis

Wenkun He, Yun Liu, Ruitao Liu, Li Yi

Main category: cs.CV

TL;DR: SyncDiff is a novel method for synthesizing multi-body human-object interactions using synchronized motion diffusion, outperforming existing methods.

Motivation: Addressing the complexity of synchronizing motions in multi-body interactions involving humans, hands, and objects, which introduces high correlations and mutual influences.

Method: SyncDiff uses a single diffusion model to capture joint motion distributions, employs frequency-domain decomposition, and introduces alignment scores for synchronization.

Result: Outperforms state-of-the-art methods across four datasets with diverse multi-body configurations.

Conclusion: SyncDiff effectively synthesizes realistic multi-body interactions by jointly optimizing motion and alignment likelihoods.

Abstract: Synthesizing realistic human-object interaction motions is a critical problem in VR/AR and human animation. Unlike the commonly studied scenarios involving a single human or hand interacting with one object, we address a more generic multi-body setting with arbitrary numbers of humans, hands, and objects. This complexity introduces significant challenges in synchronizing motions due to the high correlations and mutual influences among bodies. To address these challenges, we introduce SyncDiff, a novel method for multi-body interaction synthesis using a synchronized motion diffusion strategy. SyncDiff employs a single diffusion model to capture the joint distribution of multi-body motions. To enhance motion fidelity, we propose a frequency-domain motion decomposition scheme. Additionally, we introduce a new set of alignment scores to emphasize the synchronization of different body motions. SyncDiff jointly optimizes both data sample likelihood and alignment likelihood through an explicit synchronization strategy. Extensive experiments across four datasets with various multi-body configurations demonstrate the superiority of SyncDiff over existing state-of-the-art motion synthesis methods.

[160] Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving

Haoxiang Gao, Li Zhang, Yu Zhao, Zhou Yang, Jinghan Cao

Main category: cs.CV

TL;DR: A knowledge distillation method transfers vision-language model (VLM) knowledge to efficient vision networks for pedestrian behavior prediction and scene understanding, improving autonomous driving tasks.

Motivation: Addressing the gap in applying VLMs to complex pedestrian interactions and efficient vehicle deployment in autonomous driving.

Method: Proposes knowledge distillation from large-scale VLMs to efficient vision networks, using pre-trained models and ensemble techniques.

Result: Generates more diverse and comprehensive semantic attributes and achieves significant metric improvements in open-vocabulary perception and trajectory prediction.

Conclusion: The method enhances autonomous driving performance by improving perception and prediction tasks.

Abstract: Vision-language models (VLMs) have become a promising approach to enhancing perception and decision-making in autonomous driving. A gap remains, however, in applying VLMs to complex scenarios involving pedestrian interactions and in deploying them efficiently on vehicles. In this paper, we propose a knowledge distillation method that transfers knowledge from large-scale vision-language foundation models to efficient vision networks, and we apply it to pedestrian behavior prediction and scene understanding tasks, achieving promising results in generating more diverse and comprehensive semantic attributes. We also utilize multiple pre-trained models and ensemble techniques to boost the model’s performance. We further examined the effectiveness of the model after knowledge distillation; the results show significant metric improvements in open-vocabulary perception and trajectory prediction tasks, which can potentially enhance the end-to-end performance of autonomous driving.

[161] Calibrated Multi-Preference Optimization for Aligning Diffusion Models

Kyungmin Lee, Xiaohang Li, Qifei Wang, Junfeng He, Junjie Ke, Ming-Hsuan Yang, Irfan Essa, Jinwoo Shin, Feng Yang, Yinxiao Li

Main category: cs.CV

TL;DR: CaPO aligns T2I diffusion models using multi-reward models without human data, outperforming DPO in benchmarks.

Motivation: Manual preference data collection is costly; existing methods lack generalization and reward consistency.

Method: CaPO uses reward calibration and frontier-based pair selection to align models.

Result: CaPO outperforms DPO in single and multi-reward settings on benchmarks.

Conclusion: CaPO effectively aligns T2I models without human data, improving scalability and performance.

Abstract: Aligning text-to-image (T2I) diffusion models with preference optimization typically relies on human-annotated datasets, but the heavy cost of manual data collection limits scalability. Using reward models offers an alternative; however, current preference optimization methods fall short in exploiting the rich information, as they only consider pairwise preference distribution. Furthermore, they lack generalization to multi-preference scenarios and struggle to handle inconsistencies between rewards. To address this, we present Calibrated Preference Optimization (CaPO), a novel method to align T2I diffusion models by incorporating the general preference from multiple reward models without human annotated data. The core of our approach involves a reward calibration method to approximate the general preference by computing the expected win-rate against the samples generated by the pretrained models. Additionally, we propose a frontier-based pair selection method that effectively manages the multi-preference distribution by selecting pairs from Pareto frontiers. Finally, we use regression loss to fine-tune diffusion models to match the difference between calibrated rewards of a selected pair. Experimental results show that CaPO consistently outperforms prior methods, such as Direct Preference Optimization (DPO), in both single and multi-reward settings validated by evaluation on T2I benchmarks, including GenEval and T2I-Compbench.
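
The calibration step, which scores a candidate by its expected win-rate against samples generated by the pretrained model, admits a compact sketch. The Bradley-Terry win probability below is an assumption standing in for the comparison each reward model defines, and all names are illustrative.

    import numpy as np

    def calibrated_reward(r_candidate, r_pretrained):
        # r_candidate: scalar reward of the candidate image under one reward model
        # r_pretrained: (M,) rewards of M samples drawn from the pretrained model
        # expected win-rate under a Bradley-Terry comparison (an assumption)
        return float(np.mean(1.0 / (1.0 + np.exp(r_pretrained - r_candidate))))

    # With K reward models, each candidate receives K calibrated scores; pairs are
    # then selected from the Pareto frontier of these score vectors, and the model
    # is fine-tuned to match the difference of the pair's calibrated rewards.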

[162] Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen

Main category: cs.CV

TL;DR: LLaVA-Reward is a reward model for evaluating text-to-image generations using MLLMs, improving efficiency and accuracy with a SkipCA module and diverse preference data.

Motivation: Existing MLLM-based methods are time-consuming and hard to train, requiring instruction-following data. LLaVA-Reward aims to simplify and enhance evaluation.

Method: LLaVA-Reward uses hidden states of MLLMs for text-image pairs, introduces SkipCA for better interaction, and supports various preference data types for fine-tuning.

Result: LLaVA-Reward outperforms conventional and MLLM-based methods in human-aligned evaluations and inference-time scaling.

Conclusion: LLaVA-Reward offers an efficient, accurate solution for automatic evaluation of text-to-image generations.

Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and makes training difficult. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
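
A skip-connection cross-attention block in this spirit, where later-layer hidden states attend to early-layer visual features and keep a residual path, might look like the sketch below; the module name, dimensions, and wiring are assumptions rather than the released implementation.

    import torch
    import torch.nn as nn

    class SkipCrossAttention(nn.Module):
        # hypothetical SkipCA-style block: late hidden states query early visual features
        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, hidden, early_visual):
            # hidden: (B, T, D) later-layer hidden states of the MLLM decoder
            # early_visual: (B, V, D) early-layer visual features (keys/values)
            attended, _ = self.attn(self.norm(hidden), early_visual, early_visual)
            return hidden + attended  # skip connection preserves the original states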

[163] AstroLoc: Robust Space to Ground Image Localizer

Gabriele Berton, Alex Stoken, Carlo Masone

Main category: cs.CV

TL;DR: AstroLoc is a new APL pipeline leveraging astronaut photos for training, achieving 35% better recall@1 than previous methods.

Motivation: Manual localization of astronaut photos is inefficient; existing APL methods don't use astronaut photos for training.

Method: AstroLoc uses weakly labeled astronaut photos and two losses: pairwise matching and unsupervised clustering.

Result: 35% improvement in recall@1; recall@100 consistently over 99%.

Conclusion: AstroLoc excels in APL and related tasks without fine-tuning.

Abstract: Astronauts take thousands of photos of Earth per day from the International Space Station, which, once localized on Earth’s surface, are used for a multitude of tasks, ranging from climate change research to disaster management. The localization process, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth’s surface features through two losses: astronaut photos paired with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography via unsupervised mining. We find that AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, pushing the limits of existing datasets with a recall@100 consistently over 99%. Finally, we note that AstroLoc, without any fine-tuning, provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
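
The first of the two losses pairs each astronaut photo with its matching satellite view, which in spirit is a symmetric contrastive (InfoNCE-style) objective. The sketch below assumes L2-normalized embeddings and is not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def pairwise_loss(astro_emb, sat_emb, temperature=0.07):
        # astro_emb, sat_emb: (B, D) L2-normalized embeddings; row i of each is a pair
        logits = astro_emb @ sat_emb.T / temperature  # (B, B) similarity matrix
        targets = torch.arange(astro_emb.size(0), device=astro_emb.device)
        # each photo should retrieve its satellite match, and vice versa
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))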

[164] Differential Contrastive Training for Gaze Estimation

Lin Zhang, Yi Tian, XiYun Wang, Wanru Xu, Yi Jin, Yaping Huang

Main category: cs.CV

TL;DR: The paper introduces DCGaze, a gaze estimation method leveraging CLIP’s capabilities through Differential Contrastive Training, improving performance with visual and semantic branches.

Motivation: Addressing the need for precise and generalizable gaze estimation in complex scenarios by exploiting CLIP's untapped potential.

Method: Proposes DCGaze with two branches: Visual Appearance-aware (with AFU and DGR) and Semantic Differential-aware (using CLIP’s text encoder).

Result: Demonstrates effectiveness on four datasets for within and cross-domain tasks.

Conclusion: DCGaze successfully integrates CLIP for improved gaze estimation, validated by extensive experiments.

Abstract: Complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potential has not been fully exploited in gaze estimation. In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of the CLIP. Accordingly, a Differential Contrastive Gaze Estimation network (DCGaze) composed of a Visual Appearance-aware branch and a Semantic Differential-aware branch is introduced. The Visual Appearance-aware branch is essentially a primary gaze estimation network and it incorporates an Adaptive Feature-refinement Unit (AFU) and a Double-head Gaze Regressor (DGR), which both help the primary network to extract informative and gaze-related appearance features. Moreover, the Semantic Differential-aware branch is designed on the basis of the CLIP’s text encoder to reveal the semantic difference of gazes. This branch could further empower the Visual Appearance-aware branch with the capability of characterizing the gaze-related semantic information. Extensive experimental results on four challenging datasets over within and cross-domain tasks demonstrate the effectiveness of our DCGaze. The code is available at https://github.com/LinZhang-bjtu/DCGaze.

[165] ComicsPAP: understanding comic strips by picking the correct panel

Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Souibgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas

Main category: cs.CV

TL;DR: ComicsPAP is a new benchmark for comic strip understanding, revealing limitations in current LMMs and showing improved performance with adapted models.

Motivation: Current LMMs struggle with temporal and spatial cues in comics, necessitating a dedicated benchmark for evaluation and improvement.

Method: Introduced ComicsPAP, a large-scale benchmark with 100k+ samples and 5 subtasks under a Pick-a-Panel framework, evaluated under multi-image and single-image protocols.

Result: State-of-the-art LMMs performed near chance, while adapted LMMs outperformed larger models.

Conclusion: ComicsPAP is a valuable resource for advancing multimodal comic comprehension research.

Abstract: Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, ComicsPAP requires models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapted LMMs for comic strip understanding, obtaining better results on ComicsPAP than 10x bigger models, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.

[166] Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization

Huiyi Chen, Jiawei Peng, Kaihua Tang, Xin Geng, Xu Yang

Main category: cs.CV

TL;DR: KeCO is a framework for efficient in-context learning in LVLMs by optimizing coreset selection with visual features, reducing costs and improving performance.

Motivation: High computational and memory costs in selecting informative demonstrations for LVLMs, along with information loss and inefficiency in existing methods, especially for image classification.

Method: Proposes Key-based Coreset Optimization (KeCO), using visual features as keys to update coreset samples from untapped data, evolving the coreset efficiently.

Result: Achieves over 20% average improvement in image classification benchmarks and performs well in simulated online scenarios.

Conclusion: KeCO effectively enhances ICL performance for LVLMs with low computational cost, making it practical for resource-constrained settings.

Abstract: In-context learning (ICL) enables Large Vision-Language Models (LVLMs) to adapt to new tasks without parameter updates, using a few demonstrations from a large support set. However, selecting informative demonstrations leads to high computational and memory costs. While some methods explore selecting a small and representative coreset in text classification, evaluating all support set samples remains costly, and discarded samples lead to unnecessary information loss. These methods may also be less effective for image classification due to differences in feature spaces. Given these limitations, we propose Key-based Coreset Optimization (KeCO), a novel framework that leverages untapped data to construct a compact and informative coreset. We introduce visual features as keys within the coreset, which serve as the anchor for identifying samples to be updated through different selection strategies. By leveraging untapped samples from the support set, we update the keys of selected coreset samples, enabling the randomly initialized coreset to evolve into a more informative coreset under low computational cost. Through extensive experiments on coarse-grained and fine-grained image classification benchmarks, we demonstrate that KeCO effectively enhances ICL performance for the image classification task, achieving an average improvement of more than 20%. Notably, we evaluate KeCO under a simulated online scenario, and the strong performance in this scenario highlights the practical value of our framework for resource-constrained real-world scenarios.
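
The core update, in which visual features act as keys and untapped support samples refine the keys of the coreset entries they are assigned to, can be sketched with a simple moving-average rule. The assignment and update below are assumptions; the paper explores several selection strategies.

    import numpy as np

    def keco_update(coreset_keys, untapped_feats, lr=0.1):
        # coreset_keys: (K, D) visual-feature keys of the current coreset
        # untapped_feats: (N, D) features of support samples outside the coreset
        # both assumed L2-normalized, so the dot product is cosine similarity
        for f in untapped_feats:
            i = int(np.argmax(coreset_keys @ f))           # nearest coreset key
            updated = (1 - lr) * coreset_keys[i] + lr * f  # drift key toward sample
            coreset_keys[i] = updated / np.linalg.norm(updated)
        return coreset_keys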

[167] ViM-VQ: Efficient Post-Training Vector Quantization for Visual Mamba

Juncan Deng, Shuaiting Li, Zeyu Wang, Kedong Xu, Hong Gu, Kejie Huang

Main category: cs.CV

TL;DR: ViM-VQ is an efficient post-training vector quantization method for Visual Mamba networks (ViMs), addressing challenges like outliers and memory issues to achieve state-of-the-art low-bit quantization performance.

Motivation: Existing VQ methods perform poorly on ViMs due to outliers and inefficiencies, prompting the need for a tailored solution.

Method: ViM-VQ introduces a fast convex combination optimization algorithm and an incremental vector quantization strategy to optimize codeword search and reduce errors.

Result: ViM-VQ outperforms existing methods in low-bit quantization for ViMs across visual tasks.

Conclusion: ViM-VQ effectively addresses ViM-specific challenges, enabling efficient deployment on edge devices with high accuracy.

Abstract: Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks and demonstrate significant potential. As a promising compression technique, vector quantization (VQ) decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency, thereby enabling the deployment of ViMs on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
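
The first component can be read schematically as fitting each weight subvector with a convex combination over the codebook, optimized by gradient descent, after which the dominant codeword is confirmed. This is a sketch of the idea under assumed shapes, not the released algorithm.

    import torch

    def convex_codeword_search(w, codebook, steps=200, lr=1e-2):
        # w: (N, D) weight subvectors to quantize; codebook: (C, D) codewords
        logits = torch.zeros(w.size(0), codebook.size(0), requires_grad=True)
        opt = torch.optim.Adam([logits], lr=lr)
        for _ in range(steps):
            alpha = logits.softmax(dim=-1)  # convex combination weights
            recon = alpha @ codebook        # stays inside the codewords' convex hull
            loss = (recon - w).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # incremental confirmation: commit each subvector to its dominant codeword
        return logits.argmax(dim=-1)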

[168] MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Wenhan Luo

Main category: cs.CV

TL;DR: MaterialMVP is a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, using Reference Attention and Consistency-Regularized Training for stable, high-quality results.

Motivation: Addressing challenges in multi-view material synthesis for realistic PBR texture generation in 3D scenes.

Method: Leverages Reference Attention, Consistency-Regularized Training, and Dual-Channel Material Generation with Multi-Channel Aligned Attention.

Result: Generates illumination-invariant, geometrically consistent PBR textures, outperforming existing methods.

Conclusion: MaterialMVP offers scalable, high-quality 3D asset creation with realistic material behavior.

Abstract: Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.

[169] SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

Byeongjun Park, Hyojun Go, Hyelin Nam, Byung-Hoon Kim, Hyungjin Chung, Changick Kim

Main category: cs.CV

TL;DR: SteerX is a zero-shot inference-time steering method unifying scene reconstruction into generation for better geometric alignment in 3D/4D scenes.

Motivation: Existing methods address alignment separately at each stage, leading to subtle misalignments. SteerX aims to unify and improve alignment.

Method: Introduces two geometric reward functions using pose-free feed-forward scene reconstruction models.

Result: Demonstrates effectiveness in improving 3D/4D scene generation through experiments.

Conclusion: SteerX successfully unifies scene reconstruction with generation, enhancing geometric alignment.

Abstract: Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.

[170] Gaussian On-the-Fly Splatting: A Progressive Framework for Robust Near Real-Time 3DGS Optimization

Yiwei Xu, Yifei Yu, Wentian Gan, Tengfei Wang, Zongqian Zhan, Hao Cheng, Xin Wang

Main category: cs.CV

TL;DR: On-the-Fly GS enables near real-time 3D Gaussian Splatting optimization during image capture, reducing training time significantly with minimal rendering loss.

Motivation: Existing 3DGS methods require offline training after full SfM processing, limiting real-time applicability. This work aims to enable progressive optimization during image capture.

Method: Progressive Local & Semi-Global optimization prioritizes new images and neighbors based on overlapping relationships. An adaptive learning rate schedule stabilizes training.

Result: On-the-Fly GS optimizes each new image in seconds with minimal rendering loss, significantly reducing training time.

Conclusion: The framework offers a practical step toward rapid, progressive 3DGS reconstruction, enabling near real-time performance.

Abstract: 3D Gaussian Splatting (3DGS) achieves high-fidelity rendering with fast real-time performance, but existing methods rely on offline training after full Structure-from-Motion (SfM) processing. In contrast, this work introduces Gaussian on-the-fly Splatting (abbreviated as On-the-Fly GS), a progressive framework enabling near real-time 3DGS optimization during image capture. As each image arrives, its pose and sparse points are updated via On-the-Fly SfM, and newly optimized Gaussians are immediately integrated into the 3DGS field. To achieve this, we propose a progressive Local & Semi-Global optimization to prioritize the new image and its neighbors by their corresponding overlapping relationship, allowing the new image and its overlapping images to get more training. To further stabilize training across previous and new images, an adaptive learning rate schedule balances the iterations and the learning rate. Extensive experiments on multiple benchmarks show that our On-the-Fly GS reduces training time significantly, optimizing each new image in seconds with minimal rendering loss, offering one of the first practical steps toward rapid, progressive 3DGS reconstruction.

[171] R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception

Jonas Mirlach, Lei Wan, Andreas Wiedholz, Hannan Ejaz Keen, Andreas Eich

Main category: cs.CV

TL;DR: R-LiViT is a novel dataset combining LiDAR, RGB, and thermal imaging from a roadside perspective, focusing on VRU detection in diverse conditions.

Motivation: Addressing the underrepresentation of thermal imaging in datasets for VRU detection, especially in extreme lighting conditions.

Method: Created R-LiViT dataset with 10,000 LiDAR frames and 2,400 aligned RGB/thermal images from three intersections, day and night.

Result: Provides a comprehensive resource with 7 and 8 annotated classes, respectively, for tasks like object detection and tracking.

Conclusion: R-LiViT fills a gap in multimodal datasets and is publicly available for research.

Abstract: In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of Vulnerable Road Users(VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across 150 traffic scenarios, with 7 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available.

[172] Equivariant Flow Matching for Point Cloud Assembly

Ziming Wang, Nan Xue, Rebecka Jörnsten

Main category: cs.CV

TL;DR: A novel equivariant solver, Eda, for point cloud assembly uses flow matching models to align pieces efficiently, even with non-overlapping inputs.

Motivation: To reconstruct complete 3D shapes by aligning multiple point cloud pieces, addressing challenges like non-overlapping inputs.

Method: Proposes Eda, an equivariant diffusion assembly model, learning vector fields conditioned on input pieces and constructing an equivariant path for efficient training.

Result: Eda performs competitively on practical datasets and handles non-overlapping input pieces effectively.

Conclusion: Eda is a robust and efficient solution for point cloud assembly, excelling in challenging scenarios.

Abstract: The goal of point cloud assembly is to reconstruct a complete 3D shape by aligning multiple point cloud pieces. This work presents a novel equivariant solver for assembly tasks based on flow matching models. We first theoretically show that the key to learning equivariant distributions via flow matching is to learn related vector fields. Based on this result, we propose an assembly model, called equivariant diffusion assembly (Eda), which learns related vector fields conditioned on the input pieces. We further construct an equivariant path for Eda, which guarantees high data efficiency of the training process. Our numerical results show that Eda is highly competitive on practical datasets, and it can even handle the challenging situation where the input pieces are non-overlapping.
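
For context, the generic conditional flow-matching objective that such a solver builds on trains a vector field to transport noise to data along straight paths; Eda's contribution is making this field equivariant and conditioned on the input pieces. In the standard form (the paper's exact parameterization may differ):

    $x_t = (1-t)\,x_0 + t\,x_1, \qquad \mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big[\lVert v_\theta(x_t, t \mid c) - (x_1 - x_0) \rVert^2\big],$

where $c$ denotes the input pieces, and equivariance requires $v_\theta(g \cdot x_t, t \mid g \cdot c) = g \cdot v_\theta(x_t, t \mid c)$ for every rigid motion $g$.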

[173] Exploring Textual Semantics Diversity for Image Transmission in Semantic Communication Systems using Visual Language Model

Peishan Huang, Dong Li

Main category: cs.CV

TL;DR: A Multi-SC system using VLM and LLaVA improves image reconstruction accuracy by extracting diverse text semantics alongside segmentation tags.

Motivation: Traditional semantic communication systems struggle with low reconstruction accuracy due to insufficient semantic feature extraction.

Method: The proposed Multi-SC system divides images into blocks, extracts multiple text features using LLaVA, and combines them with segmentation tags for recovery.

Result: Simulations show the Multi-SC system significantly outperforms existing methods in reconstruction accuracy.

Conclusion: The Multi-SC system effectively addresses the challenge of low reconstruction accuracy in semantic communication for image transmission.

Abstract: In recent years, the rapid development of machine learning has brought reforms and challenges to traditional communication systems. Semantic communication has emerged as an effective strategy to extract relevant semantic signals, such as semantic segmentation labels and image features, for image transmission. However, an insufficient number of extracted semantic features can result in low reconstruction accuracy, which hinders practical application and remains a challenging open problem. In order to fill this gap, this letter proposes a multi-text transmission semantic communication (Multi-SC) system, which uses the visual language model (VLM) to assist in the transmission of image semantic signals. Unlike previous image transmission semantic communication systems, the proposed system divides the image into multiple blocks and extracts multiple pieces of textual information from the image using a modified large language and visual assistant (LLaVA), and combines semantic segmentation tags with semantic text for image recovery. Simulation results show that the proposed text semantics diversity scheme can significantly improve the reconstruction accuracy compared with related works.

[174] Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, Jae-Pil Heo

Main category: cs.CV

TL;DR: The paper introduces a VAR-based method for subject-driven image generation, addressing computational overhead and language drift with selective layer tuning and prior distillation, and improving subject focus with scale-wise weighted tuning.

Motivation: To overcome the computational inefficiency of diffusion-based models and the limitations of naive VAR fine-tuning for subject-driven generation.

Method: Proposes selective layer tuning, prior distillation, and scale-wise weighted tuning to enhance VAR models for subject-driven generation.

Result: Outperforms diffusion-based baselines in various metrics and demonstrates practical applicability.

Conclusion: The VAR-based approach with proposed enhancements is efficient and effective for subject-driven image generation.

Abstract: Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naively fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of the subject than the later stages, which merely synthesize minor details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions, encouraging the model to focus on subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.

[175] STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints

Xiaohang Yang, Qing Wang, Jiahao Yang, Gregory Slabaugh, Shanxin Yuan

Main category: cs.CV

TL;DR: STaR is a novel sequence-to-sequence model for motion retargeting, balancing geometric plausibility and temporal consistency to reduce interpenetration and motion jitter.

Motivation: Existing motion retargeting methods often neglect either geometric plausibility (causing interpenetration) or temporal consistency (leading to motion jitter). STaR aims to address both issues.

Method: STaR combines a spatial module (with dense shape representation and limb penetration constraint) and a temporal module (using a temporal transformer and consistency constraint) to ensure geometric and temporal coherence.

Result: Experiments on Mixamo and ScanRet datasets show STaR produces plausible, coherent motions with reduced interpenetration rates.

Conclusion: STaR effectively balances motion semantics, geometric plausibility, and temporal consistency, outperforming existing methods.

Abstract: Motion retargeting seeks to faithfully replicate the spatio-temporal motion characteristics of a source character onto a target character with a different body shape. Apart from motion semantics preservation, ensuring geometric plausibility and maintaining temporal consistency are also crucial for effective motion retargeting. However, many existing methods prioritize either geometric plausibility or temporal consistency. Neglecting geometric plausibility results in interpenetration while neglecting temporal consistency leads to motion jitter. In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a novel temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness. The seamless combination of the two modules helps us achieve a good balance between the semantic, geometric, and temporal targets. Extensive experiments on the Mixamo and ScanRet datasets demonstrate that our method produces plausible and coherent motions while significantly reducing interpenetration rates compared with other approaches. Code page: https://github.com/XiaohangYang829/STaR.

[176] Can GPT-4o mini and Gemini 2.0 Flash Predict Fine-Grained Fashion Product Attributes? A Zero-Shot Analysis

Shubham Shukla, Kunal Sonalkar

Main category: cs.CV

TL;DR: The paper evaluates zero-shot performance of LLMs (GPT-4o-mini and Gemini 2.0 Flash) on fine-grained fashion attribute recognition using the DeepFashion-MultiModal dataset. Gemini 2.0 Flash outperforms GPT-4o-mini, scoring 56.79% vs. 43.28% macro F1.

Motivation: To assess LLMs' capabilities in fashion attribute recognition, which directly impacts e-commerce product discovery and catalog organization.

Method: Zero-shot evaluation of LLMs using the DeepFashion-MultiModal dataset, focusing on 18 fashion attributes with image-only input.

Result: Gemini 2.0 Flash achieved a macro F1 score of 56.79%, outperforming GPT-4o-mini (43.28%).

Conclusion: LLMs show promise for fashion attribute recognition but require domain-specific fine-tuning for practical e-commerce deployment.

Abstract: The fashion retail business is centered around the capacity to comprehend products. Product attribution helps in comprehending products in ways that depend on the business process. Quality attribution improves the customer experience as customers navigate through millions of products offered by a retail website. It leads to well-organized product catalogs. In the end, product attribution directly impacts the ‘discovery experience’ of the customer. Although large language models (LLMs) have shown remarkable capabilities in understanding multimodal data, their performance on fine-grained fashion attribute recognition remains under-explored. This paper presents a zero-shot evaluation of state-of-the-art LLMs that balance performance with speed and cost efficiency, mainly GPT-4o-mini and Gemini 2.0 Flash. We have used the dataset DeepFashion-MultiModal (https://github.com/yumingj/DeepFashion-MultiModal) to evaluate these models on fashion product attribution tasks. Our study evaluates these models across 18 categories of fashion attributes, using images as the sole input for product information to create a constrained environment. Our analysis shows that Gemini 2.0 Flash demonstrates the strongest overall performance with a macro F1 score of 56.79% across all attributes, while GPT-4o-mini scored a macro F1 score of 43.28%. Through detailed error analysis, our findings provide practical insights for deploying these LLMs in production e-commerce product attribution-related tasks and highlight the need for domain-specific fine-tuning approaches. This work also lays the groundwork for future research in fashion AI and multimodal attribute extraction.
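
Macro F1, the metric reported here, averages per-class F1 scores without frequency weighting, so rare attribute values count as much as common ones. A minimal computation for a single attribute category (the labels are illustrative):

    from sklearn.metrics import f1_score

    y_true = ["crew_neck", "v_neck", "crew_neck", "turtleneck"]
    y_pred = ["crew_neck", "crew_neck", "crew_neck", "turtleneck"]

    # macro average: unweighted mean of per-class F1 scores
    print(f1_score(y_true, y_pred, average="macro"))
    # the reported figures average such scores over 18 attribute categories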

[177] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

Yasser Benigmim, Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

Main category: cs.CV

TL;DR: FLOSS introduces class-experts for Open-Vocabulary Semantic Segmentation, outperforming averaged classifiers without extra training or labels.

Motivation: Challenge the conventional use of averaged class-wise text embeddings in OVSS by identifying superior single-template classifiers (class-experts).

Method: Propose FLOSS: a method to identify class-experts via entropy-based selection and fuse their outputs without labeled data or training.

Result: FLOSS consistently improves state-of-the-art OVSS models, generalizes across datasets, and excels in low-data scenarios.

Conclusion: FLOSS is a plug-and-play solution that enhances OVSS performance without additional labels or training.

Abstract: In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., “a photo of a {class}”, “a sketch of a {class}”). We investigate the impact of templates for OVSS, and find that for each class, there exist single-template classifiers–which we refer to as class-experts–that significantly outperform the conventional averaged classifier. First, to identify these class-experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class-wise prediction entropy of single-template classifiers, we select those yielding the lowest entropy as the most reliable class-experts. Second, we combine the outputs of class-experts in a new fusion process. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state-of-the-art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low-data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS.
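
The selection rule is simple enough to sketch end to end: score every single-template classifier by the entropy of its predictions on unlabeled images, and keep, per class, the template whose predictions for that class are most confident. Names are illustrative, and FLOSS's actual fusion step is more involved than the selection shown here.

    import numpy as np

    def select_class_experts(probs):
        # probs: (T, N, C) softmax outputs of T single-template classifiers
        # over N unlabeled images and C classes
        T, N, C = probs.shape
        entropy = -(probs * np.log(probs + 1e-12)).sum(-1)  # (T, N) per-image entropy
        preds = probs.argmax(-1)                            # (T, N) predicted classes
        experts = np.zeros(C, dtype=int)
        for c in range(C):
            # mean entropy of each template over images it assigns to class c
            scores = [entropy[t][preds[t] == c].mean() if (preds[t] == c).any()
                      else np.inf for t in range(T)]
            experts[c] = int(np.argmin(scores))             # lowest entropy wins
        return experts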

[178] $S^2M^2$: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min, Youngpil Jeon, Jimin Kim, Minyong Choi

Main category: cs.CV

TL;DR: $S^2M^2$ introduces a global stereo matching architecture using multi-resolution transformers and a novel loss function, achieving state-of-the-art accuracy and efficiency without fine-tuning.

DetailsMotivation: Addressing the trade-off between local search methods (limited global consistency) and global architectures (historically impractical due to computational costs) in stereo matching.

Method: Uses a multi-resolution transformer for long-range correspondence and a novel loss function focusing on feasible matches, avoiding cost volume filtering or deep refinement.

Result: Achieves state-of-the-art accuracy on Middlebury v3 and ETH3D benchmarks, outperforming prior methods in most metrics with competitive efficiency.

Conclusion: $S^2M^2$ resolves the generalization dilemma in stereo matching, offering robust disparity, occlusion, and confidence estimation with high efficiency.

Abstract: The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.
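
The “loss function that concentrates probability on feasible matches” suggests something like the following sketch, which maximizes the probability mass a disparity distribution places within a small radius of the ground truth; this is an assumption for illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def feasible_match_loss(match_logits, gt_disp, radius=1.0):
    """match_logits: [B, D] logits over D disparity hypotheses; gt_disp: [B].
    Penalize probability mass outside a +/- radius band around the ground
    truth (sketch; the paper's loss may be shaped differently)."""
    D = match_logits.shape[-1]
    disp = torch.arange(D, device=match_logits.device, dtype=gt_disp.dtype)
    feasible = (disp[None, :] - gt_disp[:, None]).abs() <= radius   # [B, D]
    log_p = F.log_softmax(match_logits, dim=-1)
    # -log of the total probability on the feasible set (use -1e9, not -inf,
    # so samples with an empty feasible set do not produce infinities).
    loss = -torch.logsumexp(log_p.masked_fill(~feasible, -1e9), dim=-1)
    return loss.mean()
```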

[179] VistaDepth: Frequency Modulation with Bias Reweighting for Enhanced Far-range Depth Estimation

Mingxia Zhan, Li Zhang, Xiaomeng Chu, Beibei Wang, Yanyong Zhang

Main category: cs.CV

TL;DR: VistaDepth improves monocular depth estimation by addressing challenges in far-range depth reconstruction using adaptive frequency-domain processing and loss-balancing.

DetailsMotivation: Standard diffusion models struggle with far-range depth due to uniform diffusion objectives and long-tail depth distributions, biasing toward near-range regions.

Method: Introduces VistaDepth with Latent Frequency Modulation for spectral refinement and BiasMap for adaptive loss-balancing in latent space.

Result: Achieves state-of-the-art performance, excelling in far-range depth accuracy and detail preservation.

Conclusion: VistaDepth effectively addresses limitations in diffusion-based MDE, enhancing depth perception across all ranges.

Abstract: Monocular depth estimation predicts per-pixel depth from a single RGB image. While recent methods have shown promise by leveraging diffusion models, they often struggle to accurately reconstruct far-range regions. This difficulty stems from two compounding factors. First, the standard spatially uniform diffusion objective fails to adapt to the varying frequency content across a depth map. Second, the long-tail depth distribution heavily biases models toward near-range regions. To address these limitations, we introduce VistaDepth, a novel framework named for its ability to accurately reconstruct far-range vistas, which integrates adaptive frequency-domain feature processing with an adaptive loss-balancing mechanism into the diffusion pipeline. Central to our approach is the Latent Frequency Modulation module, which dynamically refines spectral responses in the latent feature space, effectively preserving structural detail. Additionally, we introduce BiasMap, a mechanism that applies adaptive weights directly to the diffusion loss in the latent space, focusing supervision on under-represented far-range regions. These innovations collectively achieve superior depth perception performance across near- and far-range depths while preserving fine detail. Experiments show that VistaDepth achieves state-of-the-art performance for diffusion-based MDE, particularly excelling in reconstructing detailed and accurate depth in far-range regions.
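
BiasMap is described only at a high level; one plausible reading is inverse-frequency reweighting over depth bins, sketched below with illustrative names, so that rare far-range depths receive more supervision.

```python
import torch

def biasmap_weights(depth, num_bins=64, eps=1e-6):
    """Inverse-frequency reweighting over depth bins (an illustrative reading
    of BiasMap): pixels in rare, typically far-range bins get larger weights."""
    d = depth.flatten()
    lo, hi = float(d.min()), float(d.max()) + eps
    edges = torch.linspace(lo, hi, num_bins + 1, device=depth.device)
    bins = (torch.bucketize(d, edges) - 1).clamp(0, num_bins - 1)
    freq = torch.bincount(bins, minlength=num_bins).float() / d.numel()
    w = 1.0 / (freq[bins] + eps)            # rare depth ranges -> larger weight
    return (w / w.mean()).view_as(depth)    # normalize to keep the loss scale

# Example use inside a training step (names hypothetical):
# loss = (biasmap_weights(gt_depth) * (pred_depth - gt_depth).abs()).mean()
```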

[180] CLIP-IT: CLIP-based Pairing for Histology Images Classification

Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger

Main category: cs.CV

TL;DR: CLIP-IT is a framework leveraging unpaired text reports to enhance medical image analysis without needing paired datasets, improving classification accuracy efficiently.

DetailsMotivation: Multimodal learning in medical imaging faces challenges like high annotation costs, privacy issues, and computational demands. CLIP-IT addresses these by utilizing unpaired text reports.

Method: CLIP-IT uses a pre-trained CLIP model to retrieve relevant unpaired text reports for images, creating pseudo-pairs. Knowledge is distilled into the vision model using LoRA-based adaptation.

Result: CLIP-IT improves classification accuracy over unimodal and multimodal baselines without paired data training or inference-time complexity.

Conclusion: CLIP-IT offers a practical, efficient solution for multimodal medical image analysis by leveraging unpaired text, reducing reliance on costly paired datasets.

Abstract: Multimodal learning has shown promise in medical image analysis, combining complementary modalities like histology images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, freely available unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports, eliminating the paired-data requirement. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the target unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference time, only the improved vision model is used, with minimal computational overhead, enabling efficient pairing-free multimodal deployment. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of paired data training or inference-time complexity.
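
The retrieval step lends itself to a short sketch: embed images and unpaired reports with the pre-trained CLIP model and take the most similar report as each image’s pseudo-pair. Function names follow the common CLIP API but are assumptions here, not CLIP-IT’s code.

```python
import torch

@torch.no_grad()
def retrieve_pseudo_pairs(clip_model, tokenizer, images, reports):
    """Pair each unimodal histology image with its most similar unpaired report
    via CLIP cosine similarity (sketch). images: preprocessed image batch;
    reports: list[str] of pathology reports."""
    img = clip_model.encode_image(images)              # [N, d] image embeddings
    txt = clip_model.encode_text(tokenizer(reports))   # [M, d] text embeddings
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    best = (img @ txt.T).argmax(dim=-1)                # most similar report per image
    return [(i, int(j)) for i, j in enumerate(best)]   # image -> report pseudo-pairs
```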

[181] Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention

Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin

Main category: cs.CV

TL;DR: Ultra3D is an efficient 3D generation framework using VecSet and Part Attention to speed up sparse voxel modeling without quality loss.

DetailsMotivation: Existing 3D generation frameworks are computationally inefficient due to quadratic complexity in attention mechanisms.

Method: Uses VecSet for coarse layout and Part Attention for localized feature refinement, reducing token count and global attention.

Result: Achieves 6.7x speed-up in latent generation and supports high-resolution 3D modeling at 1024 resolution.

Conclusion: Ultra3D outperforms in visual fidelity and user preference, offering efficient high-quality 3D generation.

Abstract: Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
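
Part Attention restricts attention to semantically consistent part regions; a naive masked-attention sketch (assuming per-token part labels from the annotation pipeline) is shown below. A real implementation would gather tokens per part rather than build the dense mask, which is where the speed-up comes from.

```python
import torch
import torch.nn.functional as F

def part_attention(q, k, v, part_ids):
    """Naive sketch of Part Attention: a token attends only to tokens sharing
    its part label. q, k, v: [N, d]; part_ids: [N] integer labels. Shown with
    a dense N x N mask for clarity, not efficiency."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    same_part = part_ids[:, None] == part_ids[None, :]     # [N, N] boolean mask
    scores = scores.masked_fill(~same_part, float("-inf")) # block cross-part attention
    return F.softmax(scores, dim=-1) @ v
```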

[182] Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection

Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo

Main category: cs.CV

TL;DR: DiffAIM is a diffusion-based method to generate natural adversarial faces for privacy protection against FR systems, outperforming existing methods in transferability and visual quality.

DetailsMotivation: Address privacy concerns in face recognition by creating natural adversarial faces to prevent unauthorized surveillance and tracking.

Method: Manipulate facial identity in a diffusion model’s latent space using gradient-based adversarial guidance during reverse diffusion, with structure-preserving regularization.

Result: DiffAIM achieves stronger black-box attack transferability and superior visual quality, validated on face verification/identification tasks and commercial APIs.

Conclusion: DiffAIM effectively enhances privacy by generating natural adversarial faces, proving robust against state-of-the-art FR systems.

Abstract: The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.
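
The gradient-based identity guidance could look roughly like the following per-step update; the face-recognition feature extractor, the loss weighting, and whether guidance acts on latents or decoded images are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def identity_guidance_step(x_t, fr_model, target_emb, source_emb, scale=1.0):
    """Per-step sketch of adversarial identity guidance: move the sample so its
    face-recognition embedding converges to the target identity and diverges
    from the source. fr_model is a differentiable identity encoder."""
    x = x_t.detach().requires_grad_(True)
    emb = F.normalize(fr_model(x), dim=-1)                     # identity embedding
    loss = -F.cosine_similarity(emb, target_emb, dim=-1).mean() \
           + F.cosine_similarity(emb, source_emb, dim=-1).mean()
    grad, = torch.autograd.grad(loss, x)
    return x_t - scale * grad                                  # descend the identity loss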

[183] TextSAM-EUS: Text Prompt Learning for SAM to Accurately Segment Pancreatic Tumor in Endoscopic Ultrasound

Pascal Spiegler, Taha Koleilat, Arash Harirpoush, Corey S. Miller, Hassan Rivaz, Marta Kersten-Oertel, Yiming Xiao

Main category: cs.CV

TL;DR: TextSAM-EUS is a lightweight, text-driven adaptation of SAM for automatic pancreatic tumor segmentation in EUS, outperforming SOTA models with minimal parameter tuning.

DetailsMotivation: EUS images have speckle noise and low contrast, making supervised DL models error-prone and annotation-dependent.

Method: Uses BiomedCLIP text encoder and LoRA-based SAM adaptation for text prompt learning, requiring no manual prompts at inference.

Result: Achieves 82.69% Dice and 85.28% NSD with automatic prompts, surpassing SOTA models.

Conclusion: TextSAM-EUS is efficient and robust for EUS segmentation, pioneering prompt learning in SAM-based medical imaging.

Abstract: Pancreatic cancer carries a poor prognosis and relies on endoscopic ultrasound (EUS) for targeted biopsy and radiotherapy. However, the speckle noise, low contrast, and unintuitive appearance of EUS make segmentation of pancreatic tumors with fully supervised deep learning (DL) models both error-prone and dependent on large, expert-curated annotation datasets. To address these challenges, we present TextSAM-EUS, a novel, lightweight, text-driven adaptation of the Segment Anything Model (SAM) that requires no manual geometric prompts at inference. Our approach leverages text prompt learning (context optimization) through the BiomedCLIP text encoder in conjunction with a LoRA-based adaptation of SAM’s architecture to enable automatic pancreatic tumor segmentation in EUS, tuning only 0.86% of the total parameters. On the public Endoscopic Ultrasound Database of the Pancreas, TextSAM-EUS with automatic prompts attains 82.69% Dice and 85.28% normalized surface distance (NSD), and with manual geometric prompts reaches 83.10% Dice and 85.70% NSD, outperforming both existing state-of-the-art (SOTA) supervised DL models and foundation models (e.g., SAM and its variants). As the first attempt to incorporate prompt learning in SAM-based medical image segmentation, TextSAM-EUS offers a practical option for efficient and robust automatic EUS segmentation. Code is available at https://github.com/HealthX-Lab/TextSAM-EUS .

[184] PolyPose: Localizing Deformable Anatomy in 3D from Sparse 2D X-ray Images using Polyrigid Transforms

Vivek Gopalakrishnan, Neel Dey, Polina Golland

Main category: cs.CV

TL;DR: PolyPose is a method for 2D/3D deformable registration in medical imaging, using rigid transforms to model bone movement, enabling accurate 3D guidance from sparse X-ray images.

DetailsMotivation: Integrate volumetric guidance into intraoperative procedures where only 2D X-ray images are available, overcoming limitations of existing methods.

Method: Parameterizes 3D deformation fields as a composition of rigid transforms, respecting the piecewise rigid nature of human movement.

Result: Successfully aligns preoperative volumes to as few as two X-ray images, outperforming current methods in sparse-view and limited-angle settings.

Conclusion: PolyPose provides robust 3D guidance in challenging scenarios, eliminating the need for complex deformation regularizers.

Abstract: Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient’s preoperative volume to as few as two X-ray images, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail.
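
A polyrigid field blends per-bone rigid transforms with spatially varying weights; the sketch below uses plain linear blending for brevity, whereas a faithful polyrigid formulation blends in the SE(3) log domain. Shapes and names are illustrative.

```python
import numpy as np

def polyrigid_warp(points, bone_transforms, weights):
    """Blend per-bone rigid transforms with per-point weights (sketch of a
    polyrigid field). points: [N, 3]; bone_transforms: list of (R [3, 3], t [3]);
    weights: [N, B] with rows summing to 1, so bones stay rigid while soft
    tissue interpolates between them."""
    warped = np.zeros_like(points)
    for b, (R, t) in enumerate(bone_transforms):
        warped += weights[:, b:b + 1] * (points @ R.T + t)  # weighted rigid motion
    return warped
```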

[185] RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection

Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Zhengzhuang Zhang, Hui-liang Shen

Main category: cs.CV

TL;DR: RaGS is a novel framework using 3D Gaussian Splatting to fuse 4D radar and monocular images for 3D object detection, outperforming existing methods.

DetailsMotivation: Current fusion approaches lack holistic scene understanding or are constrained by rigid grid structures, limiting 3D object detection performance.

Method: RaGS employs a cascaded pipeline: Frustum-based Localization Initiation (FLI), Iterative Multimodal Aggregation (IMA), and Multi-level Gaussian Fusion (MGF) to dynamically refine and render Gaussians for detection.

Result: RaGS achieves state-of-the-art performance on benchmarks like View-of-Delft, TJ4DRadSet, and OmniHD-Scenes.

Conclusion: RaGS provides a flexible, resource-efficient solution for 3D object detection by focusing on sparse objects while maintaining scene perception.

Abstract: 4D millimeter-wave radar has emerged as a promising sensor for autonomous driving, but effective 3D object detection from both 4D radar and monocular images remains a challenge. Existing fusion approaches typically rely on either instance-based proposals or dense BEV grids, which either lack holistic scene understanding or are limited by rigid grid structures. To address these, we propose RaGS, the first framework to leverage 3D Gaussian Splatting (GS) as representation for fusing 4D radar and monocular cues in 3D object detection. 3D GS naturally suits 3D object detection by modeling the scene as a field of Gaussians, dynamically allocating resources on foreground objects and providing a flexible, resource-efficient solution. RaGS uses a cascaded pipeline to construct and refine the Gaussian field. It starts with the Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse 3D Gaussian positions. Then, the Iterative Multimodal Aggregation (IMA) fuses semantics and geometry, refining the limited Gaussians to the regions of interest. Finally, the Multi-level Gaussian Fusion (MGF) renders the Gaussians into multi-level BEV features for 3D object detection. By dynamically focusing on sparse objects within scenes, RaGS concentrates on objects while offering comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks demonstrate its state-of-the-art performance. Code will be released.

[186] Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion

Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe

Main category: cs.CV

TL;DR: Proposes a seed selection method for diffusion-based image coding to improve quality without extra bitrate by choosing the optimal seed from multiple candidates.

DetailsMotivation: Current methods either require extra bitrate for scalability or use a single random seed, which may degrade image quality.

Method: Selects the optimal seed from multiple candidates based on intermediate outputs from early steps of the reverse diffusion process to reduce computational cost.

Result: Outperforms the baseline (single random seed) in multiple evaluation metrics without increasing bitrate.

Conclusion: The proposed seed selection method enhances image quality efficiently in diffusion-based scalable image coding.

Abstract: Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based approach avoids this by generating human-oriented images from machine-oriented images without extra bitrate. However, it utilizes a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce the computational cost, selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our proposed method outperforms the baseline, which uses a single random seed without selection, across multiple evaluation metrics.
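
The selection procedure maps naturally onto a loop like the one below: run only the first few reverse-diffusion steps per candidate seed, score the intermediate output, and keep the best. `init_noise`, `sample_step`, `score_fn`, and the probe-step count are illustrative placeholders, not the paper’s API.

```python
import torch

@torch.no_grad()
def select_seed(seeds, init_noise, sample_step, score_fn, probe_steps=5):
    """Pick the seed whose early reverse-diffusion trajectory scores best
    (sketch). init_noise(seed) draws the initial latent, sample_step(x, t)
    runs one reverse step, score_fn(x) rates an intermediate output."""
    best_seed, best_score = None, float("-inf")
    for seed in seeds:
        x = init_noise(seed)
        for t in range(probe_steps):      # only the first few (cheap) steps
            x = sample_step(x, t)
        score = score_fn(x)               # quality proxy on the intermediate output
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed                      # then run the full sampler with this seed
```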

[187] Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention

Drandreb Earl O. Juanico, Rowel O. Atienza, Jeffrey Kenneth Go

Main category: cs.CV

TL;DR: RCA enhances object localization in vision-language transformers by reweighting attention, improving performance without retraining.

DetailsMotivation: To improve object localization in vision-language transformers by addressing extreme attention values and amplifying mid-level activations.

Method: Reweights final-layer attention by suppressing extremes and amplifying mid-level activations. Evaluated on OV-RefOD using FitAP metric.

Result: Improves FitAP in 11 out of 15 VLMs, with gains up to +26.6%. Effective for late-fusion models and some others like DeepSeek-VL2.

Conclusion: RCA provides interpretability and performance gains for multimodal transformers, with potential applications in various models.

Abstract: We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to +26.6%. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like DeepSeek-VL2 also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers. Codes and dataset are available from https://github.com/earl-juanico/rca
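
The abstract describes RCA as suppressing extremes and amplifying mid-level activations; one plausible reading is an inverted-V reweighting of the normalized attention map, sketched here. The paper’s exact transform may differ.

```python
import torch

def reverse_contrast_reweight(attn, eps=1e-8):
    """One plausible reading of RCA: normalize the final-layer attention map,
    then apply an inverted-V weight that is largest for mid-level values, so
    extremes are suppressed and subdued-but-relevant tokens gain influence."""
    a = (attn - attn.min()) / (attn.max() - attn.min() + eps)  # scale to [0, 1]
    w = 1.0 - (2.0 * a - 1.0).abs()     # peaks at a = 0.5, zero at both extremes
    out = attn * w
    return out / (out.sum(dim=-1, keepdim=True) + eps)         # renormalize rows
```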

[188] CLIP-HandID: Vision-Language Model for Hand-Based Person Identification

Nathanael L. Baisa, Babu Pallam, Amudhavel Jayavel

Main category: cs.CV

TL;DR: CLIP-HandID uses CLIP and textual prompts for person identification from hand images, excelling in criminal investigations where hands are the only evidence.

DetailsMotivation: Hand images are often the sole evidence in serious crimes like sexual abuse, necessitating robust identification methods.

Method: CLIP-HandID leverages CLIP’s vision-language model, using textual prompts and pseudo-tokens for feature learning from hand images.

Result: Outperforms existing methods on large, multi-ethnic hand datasets.

Conclusion: CLIP-HandID is a highly effective solution for person identification in forensic contexts.

Abstract: This paper introduces a novel approach to person identification using hand images, designed specifically for criminal investigations. The method is particularly valuable in serious crimes such as sexual abuse, where hand images are often the only identifiable evidence available. Our proposed method, CLIP-HandID, leverages a pre-trained foundational vision-language model, CLIP, to efficiently learn discriminative deep feature representations from hand images (input to CLIP’s image encoder) using textual prompts as semantic guidance. Since hand images are labeled with indexes rather than text descriptions, we employ a textual inversion network to learn pseudo-tokens that encode specific visual contexts or appearance attributes. These learned pseudo-tokens are then incorporated into textual prompts, which are fed into CLIP’s text encoder to leverage its multi-modal reasoning and enhance generalization for identification. Through extensive evaluations on two large, publicly available hand datasets with multi-ethnic representation, we demonstrate that our method significantly outperforms existing approaches.

[189] FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models

Xinting Liao, Weiming Liu, Jiaming Qian, Pengyang Zhou, Jiahe Xu, Wenjie Wang, Chaochao Chen, Xiaolin Zheng, Tat-Seng Chua

Main category: cs.CV

TL;DR: FOCoOp is a federated learning framework that improves robustness and performance in vision-language models by using ID and OOD prompts, addressing data heterogeneity and OOD shifts.

DetailsMotivation: Existing federated prompt learning (FPL) approaches struggle with balancing performance and robustness, especially in OOD scenarios, due to data heterogeneity among clients.

Method: FOCoOp uses ID global prompts, local prompts, and OOD prompts to create class- and distribution-level separations, optimized via bi-level distributionally robust optimization and semi-unbalanced optimal transport.

Result: Experiments show FOCoOp effectively handles decentralized heterogeneous distributions and improves robustness against OOD shifts.

Conclusion: FOCoOp successfully addresses the trade-off in FPL, enhancing reliability in real-world scenarios.

Abstract: Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly under out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly-OOD prompts, and OOD prompts via semi-unbalanced optimal transport. Extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts. The project is available at GitHub.

[190] Learning to See in the Extremely Dark

Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu

Main category: cs.CV

TL;DR: The paper introduces a dataset (SIED) for extremely low-light RAW image enhancement and proposes a diffusion-based framework with an Adaptive Illumination Correction Module (AICM) and color consistency loss for improved results.

DetailsMotivation: Existing datasets lack extremely low-light conditions (as low as 0.0001 lux), limiting the exploration of learning-based methods in such scenarios.

Method: A paired-to-paired data synthesis pipeline creates the SIED dataset, and a diffusion-based framework with AICM and color consistency loss is proposed for enhancement.

Result: The method effectively restores visually pleasing results from extremely low-SNR RAW inputs, validated on SIED and public benchmarks.

Conclusion: The SIED dataset and proposed framework advance low-light RAW image enhancement, especially in extremely dark conditions.

Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains unexplored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.

[191] ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions

Donglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin

Main category: cs.CV

TL;DR: The paper introduces a multimodal approach for chart editing, combining natural language and visual indicators, and presents the ChartM3 benchmark to evaluate and improve MLLMs for this task.

DetailsMotivation: Existing chart editing methods rely on ambiguous natural language instructions, limiting fine-grained editing. A multimodal approach is proposed to address this.

Method: The authors introduce ChartM3, a benchmark with 1,000 samples of varying difficulty, and ChartM3-Train, a 24,000-sample training set for fine-tuning MLLMs.

Result: Current MLLMs, including GPT-4o, struggle with visual indicators. Fine-tuning on ChartM3-Train significantly improves performance.

Conclusion: Multimodal supervision is crucial for practical chart editing systems, and the ChartM3 benchmark facilitates progress in this area.

Abstract: Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM$^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM$^3$ contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM$^3$ provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM$^3$-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.

[192] StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

Chuxin Wang, Yixin Zha, Wenfei Yang, Tianzhu Zhang

Main category: cs.CV

TL;DR: StruMamba3D improves Mamba-based point cloud learning by preserving spatial dependencies and enhancing SSM with state-wise updates, achieving SOTA results.

DetailsMotivation: Existing Mamba-based methods disrupt point adjacency and struggle with long-sequence memory in downstream tasks.

Method: Proposes StruMamba3D with spatial states, state-wise SSM updates, and a length-adaptive strategy.

Result: Achieves 95.1% accuracy on ModelNet40 and 92.75% on ScanObjectNN without voting.

Conclusion: StruMamba3D effectively addresses SSM limitations and outperforms existing methods.

Abstract: Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Model (SSM) with the efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: Destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40 and 92.75% accuracy on the most challenging split of ScanObjectNN without voting strategy.

[193] RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Yuheng Du, Sheng Yang, Lingxuan Wang, Zhenghua Hou, Chengying Cai, Zhitao Tan, Mingxia Chen, Shi-Sheng Huang, Qiang Li

Main category: cs.CV

TL;DR: RTMap enhances single-traversal HD mapping by crowdsourcing a multi-traversal HD map, addressing uncertainty, localization, and real-time road changes.

DetailsMotivation: To overcome perceptual inaccuracies, occlusion, and lack of multi-agent fusion in existing online HD mapping methods.

Method: RTMap uses uncertainty-aware positional modeling, probabilistic-aware localization, and real-time detection for road changes in an end-to-end fashion.

Result: Demonstrates improved map quality and localization accuracy on public datasets, benefiting downstream tasks.

Conclusion: RTMap effectively improves map accuracy and freshness while robustly supporting prediction and planning.

Abstract: While recent online HD mapping methods relieve the burden on offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance in both prior-aided map quality and localization accuracy, showing that RTMap robustly serves downstream prediction and planning modules while asynchronously improving the accuracy and freshness of the crowdsourced prior-map. Our source code will be made publicly available at https://github.com/CN-ADLab/RTMap.

[194] TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

Shaocheng Yan, Pengcheng Shi, Zhenjun Zhao, Kaixin Wang, Kuang Cao, Ji Wu, Jiayuan Li

Main category: cs.CV

TL;DR: TurboReg introduces a fast, robust estimator for point cloud registration using TurboClique and Pivot-Guided Search, achieving high recall and speed.

DetailsMotivation: Existing methods for robust estimation in point cloud registration are slow due to exponential time complexity, limiting real-time applications.

Method: TurboReg uses TurboClique (a 3-clique in a constrained compatibility graph) and Pivot-Guided Search (PGS) for efficient, parallelizable robust estimation.

Result: TurboReg outperforms state-of-the-art methods, e.g., 208.22× faster than 3DMAC on 3DMatch+FCGF, with higher recall.

Conclusion: TurboReg offers a scalable, efficient solution for robust point cloud registration, suitable for time-sensitive applications.

Abstract: Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) operates 208.22× faster than 3DMAC while also achieving higher recall. Our code is accessible at https://github.com/Laka-3DV/TurboReg.
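
Pivot-Guided Search is easy to picture as a toy sketch: rank correspondences by SC$^2$ score, use the top ones as pivots, and close triangles in the compatibility graph to obtain 3-cliques. The real implementation is parallelized and more selective; this only illustrates the idea.

```python
import numpy as np

def pivot_guided_3cliques(compat, sc2_scores, num_pivots=100):
    """Toy version of Pivot-Guided Search: take the correspondences with the
    highest SC^2 scores as pivots, then close triangles in the compatibility
    graph to obtain TurboCliques (3-cliques). compat: [N, N] boolean matrix."""
    pivots = np.argsort(-sc2_scores)[:num_pivots]
    cliques = []
    for p in pivots:
        neigh = np.flatnonzero(compat[p])            # nodes compatible with the pivot
        ii, jj = np.triu_indices(len(neigh), k=1)    # candidate pairs among neighbors
        for a, b in zip(neigh[ii], neigh[jj]):
            if compat[a, b]:                         # all three edges present
                cliques.append((int(p), int(a), int(b)))
    return cliques
```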

[195] Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, Qing Li

Main category: cs.CV

TL;DR: MTU3D is a framework integrating active perception with 3D vision-language learning, enabling embodied agents to explore and understand environments without explicit 3D reconstruction. It outperforms existing methods in benchmarks.

DetailsMotivation: Existing 3D-VL models lack active perception and exploration capabilities, limiting embodied scene understanding.

Method: MTU3D uses online query-based representation learning, a unified grounding-exploration objective, and end-to-end trajectory learning.

Result: MTU3D outperforms state-of-the-art methods by 14-23% in benchmarks like HM3D-OVON and GOAT-Bench.

Conclusion: Bridging visual grounding and exploration is crucial for embodied intelligence, as demonstrated by MTU3D’s success.

Abstract: Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D’s versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.

[196] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Hanspeter Pfister, Shijian Lu, Fangneng Zhan

Main category: cs.CV

TL;DR: A survey on feed-forward deep learning techniques for 3D reconstruction and view synthesis, covering representations like NeRF and 3DGS, applications, datasets, and future challenges.

DetailsMotivation: Traditional methods for 3D reconstruction and view synthesis are computationally intensive, limiting real-world use. Feed-forward deep learning approaches offer faster, generalizable solutions.

Method: The paper reviews feed-forward techniques, categorizing them by representation architectures (e.g., point clouds, NeRF, 3DGS) and tasks like pose-free and dynamic reconstruction.

Result: The survey highlights advancements in speed and generalization, applications in AR/VR, robotics, and digital humans, and provides dataset and evaluation insights.

Conclusion: Feed-forward approaches show promise for advancing 3D vision, but challenges remain, pointing to future research directions.

Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.

[197] PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image

Hyeongjin Nam, Donghwan Kim, Gyeongsik Moon, Kyoung Mu Lee

Main category: cs.CV

TL;DR: PARTE improves 3D human reconstruction by using part segmentation priors to align textures, avoiding blending issues.

DetailsMotivation: Existing methods misalign textures across human parts; PARTE leverages part segmentation for better texture coherence.

Method: Uses a PartSegmenter for 3D part segmentation and a PartTexturer for part-guided texture reconstruction.

Result: Achieves state-of-the-art quality in 3D human reconstruction.

Conclusion: PARTE effectively addresses texture misalignment by integrating part segmentation priors.

Abstract: The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at https://hygenie1228.github.io/PARTE/.

[198] GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting

Baijun Ye, Minghui Qin, Saining Zhang, Moonjun Gong, Shaoting Zhu, Zebang Shen, Luan Zhang, Lu Zhang, Hao Zhao, Hang Zhao

Main category: cs.CV

TL;DR: GS-Occ3D is a vision-only framework for scalable occupancy reconstruction in autonomous driving, overcoming challenges like sparse viewpoints and occlusions with an Octree-based Gaussian Surfel method.

DetailsMotivation: Existing LiDAR-based occupancy methods limit scalability and exclude crowdsourced data. Vision-only approaches face challenges like incomplete geometry and post-processing needs.

Method: GS-Occ3D uses an Octree-based Gaussian Surfel formulation for explicit occupancy representation, decomposing scenes into static background, ground, and dynamic objects for tailored modeling.

Result: Achieves state-of-the-art geometry reconstruction on Waymo dataset and shows superior zero-shot generalization on Occ3D-nuScenes.

Conclusion: Demonstrates the potential of vision-based occupancy reconstruction for scalable auto-labeling in autonomous driving.

Abstract: Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and require additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. It highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for scalable auto-labeling. Project Page: https://gs-occ3d.github.io/

[199] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes

Rishav Kumar, D. Santhosh Reddy, P. Rajalakshmi

Main category: cs.CV

TL;DR: DriveIndia is a large-scale dataset for object detection in Indian traffic, featuring 66,986 images across 24 categories, diverse conditions, and baseline results (78.7% mAP50).

DetailsMotivation: To address the complexity and unpredictability of Indian traffic environments for autonomous driving research.

Method: Dataset creation with 66,986 high-resolution images annotated in YOLO format, covering varied conditions and locations. Baseline results using YOLO models.

Result: Top-performing YOLO variant achieved 78.7% mAP50.

Conclusion: DriveIndia serves as a benchmark for robust object detection in uncertain road conditions and will be publicly available.

Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns, collected over 120+ hours and 3,400+ kilometers of urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving a mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository (https://tihan.iith.ac.in/tiand-datasets/).
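
Since the annotations ship in YOLO format (one `class cx cy w h` line per object, normalized to image size), loading them takes only a few lines; the helper below is a generic parser, not code from the dataset release.

```python
def load_yolo_labels(label_path, img_w, img_h):
    """Parse one YOLO-format annotation file ('class cx cy w h' per line,
    normalized to image size) into pixel-space (cls, x1, y1, x2, y2) boxes."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, cx, cy, w, h = line.split()
            cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
            x1, y1 = (cx - w / 2) * img_w, (cy - h / 2) * img_h  # top-left corner
            x2, y2 = (cx + w / 2) * img_w, (cy + h / 2) * img_h  # bottom-right corner
            boxes.append((int(cls), x1, y1, x2, y2))
    return boxes
```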

[200] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Main category: cs.CV

TL;DR: A survey on multimodal long context token compression, categorizing methods by modality (image, video, audio) and mechanism (transformation, similarity, attention, query-based), aiming to consolidate progress and inspire future research.

DetailsMotivation: Address computational challenges in MLLMs caused by quadratic complexity of self-attention with long contexts, by exploring token compression methods.

Method: Systematic survey and categorization of token compression approaches by modality (image, video, audio) and underlying mechanisms (transformation, similarity, attention, query-based).

Result: Comprehensive overview of existing methods, highlighting modality-specific redundancies and compression strategies.

Conclusion: The survey consolidates progress, identifies challenges, and aims to inspire future research in multimodal token compression, with a maintained public repository for updates.

Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
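
To make the “similarity-based” category concrete, here is a toy token-merging step that repeatedly averages the most cosine-similar pair of tokens; practical methods (e.g., ToMe-style bipartite matching) are far more efficient than this quadratic loop.

```python
import torch

def merge_most_similar(tokens, num_merge):
    """Toy similarity-based compression: repeatedly fuse the most
    cosine-similar pair of tokens, shrinking the sequence by one per step.
    tokens: [N, d]; returns [N - num_merge, d]."""
    x = tokens.clone()
    for _ in range(num_merge):
        n = x.shape[0]
        normed = x / x.norm(dim=-1, keepdim=True)
        sim = normed @ normed.T
        sim.fill_diagonal_(-1.0)                   # exclude self-similarity
        i, j = divmod(int(sim.argmax()), n)        # indices of the closest pair
        merged = (x[i] + x[j]) / 2                 # average the pair into one token
        keep = [k for k in range(n) if k not in (i, j)]
        x = torch.cat([x[keep], merged[None]], dim=0)
    return x
```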

[201] Harnessing Diffusion-Yielded Score Priors for Image Restoration

Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong

Main category: cs.CV

TL;DR: HYPIR is a novel image restoration method combining pre-trained diffusion models with adversarial training, achieving high-quality results efficiently.

DetailsMotivation: Existing methods (MSE-based, GAN-based, diffusion-based) struggle to balance restoration quality, fidelity, and speed.

Method: Initializes with a pre-trained diffusion model, fine-tuned via adversarial training, avoiding diffusion loss and iterative sampling.

Result: HYPIR improves stability, avoids mode collapse, accelerates convergence, and outperforms state-of-the-art methods.

Conclusion: HYPIR offers efficient, high-quality restoration with user control, faster than diffusion-based approaches.

Abstract: Deep image restoration models aim to learn a mapping from degraded image space to natural image space. However, they face several critical challenges: removing degradation, generating realistic details, and ensuring pixel-level consistency. Over time, three major classes of methods have emerged, including MSE-based, GAN-based, and diffusion-based methods. However, they fail to achieve a good balance between restoration quality, fidelity, and speed. We propose a novel method, HYPIR, to address these challenges. Our solution pipeline is straightforward: it involves initializing the image restoration model with a pre-trained diffusion model and then fine-tuning it with adversarial training. This approach does not rely on diffusion loss, iterative sampling, or additional adapters. We theoretically demonstrate that initializing adversarial training from a pre-trained diffusion model positions the initial restoration model very close to the natural image distribution. Consequently, this initialization improves numerical stability, avoids mode collapse, and substantially accelerates the convergence of adversarial training. Moreover, HYPIR inherits the capabilities of diffusion models with rich user control, enabling text-guided restoration and adjustable texture richness. Requiring only a single forward pass, it achieves faster convergence and inference speed than diffusion-based methods. Extensive experiments show that HYPIR outperforms previous state-of-the-art methods, achieving efficient and high-quality image restoration.

[202] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao

Main category: cs.CV

TL;DR: RRVF framework reduces reliance on curated image-text supervision for MLLMs by using raw images and reinforcement learning, outperforming existing methods.

DetailsMotivation: Address the bottleneck of MLLMs' heavy reliance on curated image-text supervision for deep visual reasoning.

Method: Introduces RRVF, a framework using the “Asymmetry of Verification” principle and reinforcement learning for self-correction via reasoning, rendering, and visual feedback.

Result: RRVF-trained model outperforms open-source MLLMs and supervised baselines, showing superior generalization.

Conclusion: RRVF offers a self-improvement paradigm for robust, generalizable models without explicit supervision.

Abstract: Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed “Reasoning-Rendering-Visual-Feedback” (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the “Asymmetry of Verification” principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing reliance on image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions, while this pipeline can be optimized end-to-end by the GRPO algorithm. Extensive evaluations are conducted on image-to-code generation across two diverse domains: data charts and web interfaces. The RRVF-trained model not only outperforms existing open-source MLLMs and supervised fine-tuning baselines but also exhibits superior generalization to unseen datasets. Critically, the model’s performance surpasses that of the more advanced MLLM used to provide the feedback signal during training. This work establishes a self-improvement paradigm that offers a viable path to robust, generalizable models without reliance on explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
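
The “verification is easier than generation” principle boils down to a simple reward shape. A hypothetical sketch, where render_fn and similarity_fn stand in for the paper's actual renderer and visual-feedback scorer:

```python
def rrvf_reward(source_image, generated_code, render_fn, similarity_fn):
    """Verification-as-reward: render the model's code, then score the
    result against the source image. The score can drive RL training,
    e.g., as the reward inside GRPO.
    """
    try:
        rendered = render_fn(generated_code)        # e.g., headless chart/web renderer
    except Exception:
        return 0.0                                  # unrenderable code earns no reward
    return similarity_fn(rendered, source_image)    # visual similarity in [0, 1]
```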

[203] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc

Main category: cs.CV

TL;DR: CoP introduces a multi-task learning framework for BEV 3D object detection, leveraging spatial occupancy to improve environmental context perception and outperforming existing methods.

DetailsMotivation: Existing BEV methods lack intrinsic environmental context, hindering comprehensive perception. CoP addresses this by integrating spatial occupancy as auxiliary information.

Method: CoP uses a pipeline for dense occupancy ground truths (LDO), voxel-height-guided sampling (VHS), and a global-local feature fusion (CFF) module.

Result: CoP achieves 49.5% mAP and 59.2% NDS on nuScenes, outperforming vision-based frameworks.

Conclusion: CoP enhances BEV representations by combining 3D object detection and occupancy prediction, demonstrating superior performance.

Abstract: Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at https://github.com/jichengyuan/Collaborative-Perceiver.

[204] Predict Patient Self-reported Race from Skin Histological Images

Shengjia Chen, Ruchika Verma, Kevin Clare, Jannes Jegminat, Eugenia Alleva, Kuan-lin Huang, Brandon Veremis, Thomas Fuchs, Gabriele Campanella

Main category: cs.CV

TL;DR: AI in computational pathology can predict race from dermatopathology slides, revealing unintended biases and the need for careful data curation.

DetailsMotivation: To investigate if deep learning models can predict self-reported race from pathology slides and identify potential biases.

Method: Used an attention-based mechanism on a racially diverse dataset, evaluated three curation strategies, and analyzed morphological features.

Result: Models predicted race with high AUC for White and Black groups (0.799, 0.762), but overall performance dropped to 0.663. Epidermis was a key feature.

Conclusion: Highlights the need for bias mitigation and equitable AI deployment in pathology.

Abstract: Artificial Intelligence (AI) has demonstrated success in computational pathology (CPath) for disease detection, biomarker classification, and prognosis prediction. However, its potential to learn unintended demographic biases, particularly those related to social determinants of health, remains understudied. This study investigates whether deep learning models can predict self-reported race from digitized dermatopathology slides and identifies potential morphological shortcuts. Using a multisite dataset with a racially diverse population, we apply an attention-based mechanism to uncover race-associated morphological features. After evaluating three dataset curation strategies to control for confounding factors, the final experiment showed that White and Black demographic groups retained high prediction performance (AUC: 0.799, 0.762), while overall performance dropped to 0.663. Attention analysis revealed the epidermis as a key predictive feature, with significant performance declines when these regions were removed. These findings highlight the need for careful data curation and bias mitigation to ensure equitable AI deployment in pathology. Code available at: https://github.com/sinai-computational-pathology/CPath_SAIF.

[205] See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs

Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li

Main category: cs.CV

TL;DR: ViHallu is a vision-centric framework to mitigate hallucinations in LVLMs by improving visual-semantic alignment through visual variation images and instructions.

DetailsMotivation: LVLMs often hallucinate, generating text inconsistent with visual content, especially in fine-grained scenarios. Existing text-centric methods are limited.

Method: ViHallu uses visual variation images and constructed visual instructions to enhance fine-tuning, improving visual-semantic alignment.

Result: ViHallu reduces hallucinations and improves fine-grained visual understanding in LVLMs, validated by benchmarks.

Conclusion: ViHallu effectively addresses hallucination in LVLMs through visual-centric methods and releases a dataset for further research.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as generated textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric, and the challenges of visual-semantic alignment significantly limit their effectiveness, especially when confronted with fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models’ fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.

cs.AI

[206] When Truthful Representations Flip Under Deceptive Instructions?

Xianxuan Long, Yao Fu, Runchao Li, Mu Sheng, Haotian Yu, Xiaotian Han, Pan Li

Main category: cs.AI

TL;DR: The paper investigates how deceptive instructions alter the internal representations of LLMs compared to truthful ones, revealing predictable shifts and distinct subspaces for deception.

DetailsMotivation: Understanding how LLMs process deceptive versus truthful instructions to improve detection and control of dishonest outputs.

Method: Analyzed internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct using linear probes and Sparse Autoencoders (SAEs) on a factual verification task.

Result: Deceptive instructions cause significant representational shifts, detectable in early-to-mid layers, with distinct truthful/deceptive subspaces identified.

Conclusion: The study provides layer- and feature-level insights into LLM deception, aiding detection and mitigation of instructed dishonesty.

Abstract: Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLMs compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations “flip”, such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find the model’s instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are similar), concentrated in early-to-mid layers and detectable even on complex datasets. We also identify specific SAE features highly sensitive to deceptive instructions and use targeted visualizations to confirm distinct truthful/deceptive representational subspaces. Our findings expose feature- and layer-level signatures of deception, offering new insights for detecting and mitigating instructed dishonesty in LLMs.
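
The probing setup can be pictured with a short sketch. The data layout (final-token hidden states from one layer) and the scikit-learn logistic probe are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_layer_probe(hidden_states: np.ndarray, labels: np.ndarray):
    """Fit a linear probe at one layer.

    hidden_states: (n_examples, d_model) final-token activations.
    labels: the instructed True/False outputs. High held-out accuracy
    means the output is linearly decodable from this layer.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)   # probe and held-out accuracy
```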

[207] Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence

Matthieu Queloz

Main category: cs.AI

TL;DR: The paper redefines systematicity in AI beyond explainability, distinguishing four senses of ‘systematicity of thought’ to reconcile it with connectionism. It argues for a dynamic, rationale-driven approach to systematization in AI.

DetailsMotivation: To address the broader ideal of systematicity in AI, moving beyond the narrow focus on explainability and resolving tensions between systematicity and connectionism.

Method: Proposes a conceptual framework distinguishing four senses of ‘systematicity of thought’ and applies five rationales for systematization to AI models.

Result: Reconciles systematicity with connectionism and identifies the ‘hard systematicity challenge,’ advocating for a dynamic, rationale-based approach to systematization in AI.

Conclusion: Systematicity in AI should be guided by specific rationales, determining how and when models need to be systematic, rather than adhering to a rigid ideal.

Abstract: This paper argues that explainability is only one facet of a broader ideal that shapes our expectations towards artificial intelligence (AI). Fundamentally, the issue is to what extent AI exhibits systematicity–not merely in being sensitive to how thoughts are composed of recombinable constituents, but in striving towards an integrated body of thought that is consistent, coherent, comprehensive, and parsimoniously principled. This richer conception of systematicity has been obscured by the long shadow of the “systematicity challenge” to connectionism, according to which network architectures are fundamentally at odds with what Fodor and colleagues termed “the systematicity of thought.” I offer a conceptual framework for thinking about “the systematicity of thought” that distinguishes four senses of the phrase. I use these distinctions to defuse the perceived tension between systematicity and connectionism and show that the conception of systematicity that historically shaped our sense of what makes thought rational, authoritative, and scientific is more demanding than the Fodorian notion. To determine whether we have reason to hold AI models to this ideal of systematicity, I then argue, we must look to the rationales for systematization and explore to what extent they transfer to AI models. I identify five such rationales and apply them to AI. This brings into view the “hard systematicity challenge.” However, the demand for systematization itself needs to be regulated by the rationales for systematization. This yields a dynamic understanding of the need to systematize thought, which tells us how systematic we need AI models to be and when.

[208] CoEx – Co-evolving World-model and Exploration

Minsoo Kim, Seung-won Hwang

Main category: cs.AI

TL;DR: CoEx introduces a hierarchical agent architecture to dynamically update LLM-based world models, improving planning accuracy by integrating observations into a neurosymbolic belief state.

DetailsMotivation: Existing LLM agents rely on static world models, leading to misalignment with the true state and erroneous plans.

Method: CoEx uses hierarchical state abstraction and LLM reasoning to orchestrate dynamic plans, updating a neurosymbolic belief state with subgoal experiences.

Result: CoEx outperforms existing agents in planning and exploration across diverse scenarios like ALFWorld, PDDL, and Jericho.

Conclusion: CoEx’s dynamic world model update mechanism enhances LLM agent planning, reducing errors and improving adaptability.

Abstract: Planning in modern LLM agents relies on the utilization of LLM as an internal world model, acquired during pretraining. However, existing agent designs fail to effectively assimilate new observations into dynamic updates of the world model. This reliance on the LLM’s static internal world model is progressively prone to misalignment with the underlying true state of the world, leading to the generation of divergent and erroneous plans. We introduce a hierarchical agent architecture, CoEx, in which hierarchical state abstraction allows LLM planning to co-evolve with a dynamically updated model of the world. CoEx plans and interacts with the world by using LLM reasoning to orchestrate dynamic plans consisting of subgoals, and its learning mechanism continuously incorporates these subgoal experiences into a persistent world model in the form of a neurosymbolic belief state, comprising textual inferences and code-based symbolic memory. We evaluate our agent across a diverse set of agent scenarios involving rich environments and complex tasks including ALFWorld, PDDL, and Jericho. Our experiments show that CoEx outperforms existing agent paradigms in planning and exploration.
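
A toy rendering of what a neurosymbolic belief state might look like; the class and method names here are assumptions for illustration, not CoEx's API:

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    """Persistent world model: textual inferences from the LLM plus
    code-based symbolic memory, updated as subgoals complete."""
    textual_inferences: list[str] = field(default_factory=list)
    symbolic_memory: dict[str, object] = field(default_factory=dict)

    def incorporate_subgoal(self, inference: str, facts: dict[str, object]) -> None:
        # Append the LLM's natural-language conclusion, merge symbolic facts.
        self.textual_inferences.append(inference)
        self.symbolic_memory.update(facts)
```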

[209] An Explainable Emotion Alignment Framework for LLM-Empowered Agent in Metaverse Service Ecosystem

Qun Ma, Xiao Xue, Ming Zhang, Yifan Shen, Zihan Zhao

Main category: cs.AI

TL;DR: The paper proposes an explainable emotion alignment framework for LLM-based agents in Metaverse service ecosystems to address challenges like data fusion, knowledge association, and ethical safety.

DetailsMotivation: To bridge the gap between virtual and real-world services in Metaverse ecosystems, focusing on improving LLM-based agents' decision-making with factual factors.

Method: An explainable emotion alignment framework is introduced, integrating factual factors into LLM-based agents’ decision-making.

Result: A simulation in an O2O food delivery scenario shows the framework’s effectiveness in achieving realistic social emergence.

Conclusion: The framework enhances the alignment of LLM-based agents with relational facts, improving their role in Metaverse service ecosystems.

Abstract: Metaverse service is a product of the convergence between the Metaverse and service systems, designed to address service-related challenges concerning digital avatars, digital twins, and digital natives within the Metaverse. With the rise of large language models (LLMs), agents now play a pivotal role in the Metaverse service ecosystem, serving dual functions: as digital avatars representing users in the virtual realm and as service assistants (or NPCs) providing personalized support. However, during the modeling of Metaverse service ecosystems, existing LLM-based agents face significant challenges in bridging virtual-world services with real-world services, particularly regarding issues such as character data fusion, character knowledge association, and ethical safety concerns. This paper proposes an explainable emotion alignment framework for LLM-based agents in the Metaverse service ecosystem. It aims to integrate factual factors into the decision-making loop of LLM-based agents, systematically demonstrating how to achieve stronger alignment with relational facts for these agents. Finally, a simulation experiment in the Online-to-Offline (O2O) food delivery scenario is conducted to evaluate the effectiveness of this framework, obtaining more realistic social emergence.

[210] Magentic-UI: Towards Human-in-the-loop Agentic Systems

Hussein Mozannar, Gagan Bansal, Cheng Tan, Adam Fourney, Victor Dibia, Jingya Chen, Jack Gerrits, Tyler Payne, Matheus Kunzler Maldaner, Madeleine Grunde-McLaughlin, Eric Zhu, Griffin Bassman, Jacob Alber, Peter Chang, Ricky Loynd, Friederike Niedtner, Ece Kamar, Maya Murad, Rafah Hosn, Saleema Amershi

Main category: cs.AI

TL;DR: AI agents with human oversight (Magentic-UI) improve safety and efficiency in multi-step tasks, outperforming fully autonomous systems.

DetailsMotivation: AI agents lack human-level performance and pose safety risks; human-in-the-loop systems can enhance productivity and safety.

Method: Developed Magentic-UI, a web interface for human-agent interaction, featuring flexible tools and six interaction mechanisms (e.g., co-planning, action guards).

Result: Evaluated across benchmarks, user testing, and safety assessments, showing improved task completion and safety.

Conclusion: Magentic-UI advances safe and efficient human-AI collaboration, balancing autonomy and oversight.

Abstract: AI agents powered by large language models are increasingly capable of autonomously completing complex, multi-step tasks using external tools. Yet, they still fall short of human-level performance in most domains, including computer use, software development, and research. Their growing autonomy and ability to interact with the outside world also introduce safety and security risks, including potentially misaligned actions and adversarial manipulation. We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems. We introduce Magentic-UI, an open-source web interface for developing and studying human-agent interaction. Built on a flexible multi-agent architecture, Magentic-UI supports web browsing, code execution, and file manipulation, and can be extended with diverse tools via Model Context Protocol (MCP). Moreover, Magentic-UI presents six interaction mechanisms for enabling effective, low-cost human involvement, including co-planning, co-tasking, multi-tasking, action guards, and long-term memory. We evaluate Magentic-UI across four dimensions: autonomous task completion on agentic benchmarks, simulated user testing of its interaction capabilities, qualitative studies with real users, and targeted safety assessments. Our findings highlight Magentic-UI’s potential to advance safe and efficient human-agent collaboration.

[211] LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Kai Chen, Xiaofeng Wang, Baosheng Wang

Main category: cs.AI

TL;DR: A novel benchmark-free evaluation method, LLM-Crowdsourced, is proposed to address issues in evaluating large language models (LLMs) by leveraging LLMs to generate, answer, and evaluate questions, ensuring dynamic, transparent, objective, and professional criteria.

DetailsMotivation: Existing evaluation methods for LLMs face challenges like data contamination, black-box operation, and subjective preference, hindering comprehensive assessment of their true capabilities.

Method: The LLM-Crowdsourced paradigm uses LLMs to generate questions, answer them independently, and evaluate each other, incorporating dynamic, transparent, objective, and professional criteria.

Result: Experiments on eight LLMs in mathematics and programming show the method’s effectiveness in distinguishing performance, revealing novel insights like Gemini’s superior question-design capabilities and memorization-based answering in some models.

Conclusion: The proposed method offers a robust, comprehensive evaluation of LLMs, uncovering findings traditional methods miss, and demonstrates high consistency in results.

Abstract: Although large language models (LLMs) demonstrate remarkable capabilities across various tasks, evaluating their capabilities remains a challenging task. Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference. These issues make it difficult to evaluate the LLMs’ true capabilities comprehensively. To tackle these challenges, we propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced. It utilizes LLMs to generate questions, answer independently, and evaluate mutually. This method integrates four key evaluation criteria: dynamic, transparent, objective, and professional, which existing evaluation methods cannot satisfy simultaneously. Experiments on eight mainstream LLMs across mathematics and programming verify the advantages of our method in distinguishing LLM performance. Furthermore, our study reveals several novel findings that are difficult for traditional methods to detect, including but not limited to: (1) Gemini demonstrates the highest original and professional question-design capabilities among others; (2) Some LLMs exhibit “memorization-based answering” by misrecognizing questions as familiar ones with a similar structure; (3) LLM evaluation results demonstrate high consistency (robustness).
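
The generate-answer-judge loop fits in a few lines; every interface below is an assumption standing in for the paper's actual prompts and scoring:

```python
def crowdsourced_round(models, make_question, answer, judge):
    """One mutual-evaluation round: each model poses a question, all
    models answer independently, and every model scores every other
    model's answers (no self-grading)."""
    scores = {m: 0.0 for m in models}
    for asker in models:
        question = make_question(asker)
        answers = {m: answer(m, question) for m in models}
        for grader in models:
            for m, a in answers.items():
                if m is not grader:
                    scores[m] += judge(grader, question, a)
    return scores
```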

[212] Beyond Accuracy: How AI Metacognitive Sensitivity improves AI-assisted Decision Making

ZhaoBin Li, Mark Steyvers

Main category: cs.AI

TL;DR: AI’s metacognitive sensitivity (accurate confidence scoring) can improve human decision-making, sometimes outperforming higher-accuracy AI with lower sensitivity.

DetailsMotivation: To understand how AI's predictive accuracy and confidence reliability jointly impact human decision quality.

Method: Theoretical framework and behavioral experiment to assess AI’s metacognitive sensitivity and its effect on decision-making.

Result: AI with lower accuracy but higher metacognitive sensitivity can enhance human decision performance.

Conclusion: AI assistance should be evaluated and optimized for both accuracy and metacognitive sensitivity to improve decision outcomes.

Abstract: In settings where human decision-making relies on AI input, both the predictive accuracy of the AI system and the reliability of its confidence estimates influence decision quality. We highlight the role of AI metacognitive sensitivity – its ability to assign confidence scores that accurately distinguish correct from incorrect predictions – and introduce a theoretical framework for assessing the joint impact of AI’s predictive accuracy and metacognitive sensitivity in hybrid decision-making settings. Our analysis identifies conditions under which an AI with lower predictive accuracy but higher metacognitive sensitivity can enhance the overall accuracy of human decision making. Finally, a behavioral experiment confirms that greater AI metacognitive sensitivity improves human decision performance. Together, these findings underscore the importance of evaluating AI assistance not only by accuracy but also by metacognitive sensitivity, and of optimizing both to achieve superior decision outcomes.
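
One standard way to operationalize metacognitive sensitivity, used here as an illustrative stand-in rather than the paper's exact estimator, is the AUC of the AI's confidence at separating its correct from incorrect predictions (0.5 = uninformative confidence, 1.0 = perfect):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def metacognitive_sensitivity(confidences, correct):
    """AUC of confidence scores at discriminating correct from
    incorrect predictions."""
    return roc_auc_score(np.asarray(correct, dtype=int), confidences)

# A toy AI whose confidence tracks correctness well:
conf = [0.9, 0.8, 0.85, 0.6, 0.3, 0.4, 0.35]
hit  = [1,   1,   1,    1,   0,   0,   0]
print(metacognitive_sensitivity(conf, hit))   # 1.0: confidence fully separates hits from misses
```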

[213] Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence

Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang

Main category: cs.AI

TL;DR: The paper introduces Embodied Web Agents, a new AI paradigm integrating physical and digital intelligence, and provides a benchmark for assessing cross-domain tasks.

DetailsMotivation: Current AI agents are siloed, limiting their ability to solve tasks requiring both physical and digital intelligence.

Method: Developed a simulation platform integrating 3D environments with web interfaces and created a benchmark for diverse tasks.

Result: Revealed performance gaps between AI systems and humans in cross-domain tasks.

Conclusion: The work highlights challenges and opportunities in merging embodied cognition with web-scale knowledge, with resources made publicly available.

Abstract: AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online, or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.

[214] On the Definition of Intelligence

Kei-Sing Ng

Main category: cs.AI

TL;DR: The paper proposes a general criterion for intelligence, defining it as the ability to generate samples from a category given samples of it, formalized as ε-category intelligence.

DetailsMotivation: To create a species-agnostic and evaluable definition of intelligence that encompasses diverse paradigms like reinforcement learning and analogical reasoning.

Method: Introduces ε-category intelligence, where intelligence is measured by the inability of a distinguisher to separate generated from original samples beyond a tolerance ε.

Result: A formal framework for evaluating intelligence, with outlined empirical protocols and implications for safety and generalization.

Conclusion: The proposed criterion provides a measurable and generalizable foundation for engineering AGI, with potential impacts on evaluation and safety.

Abstract: To engineer AGI, we should first capture the essence of intelligence in a species-agnostic form that can be evaluated, while being sufficiently general to encompass diverse paradigms of intelligent behavior, including reinforcement learning, generative models, classification, analogical reasoning, and goal-directed decision-making. We propose a general criterion based on sample fidelity: intelligence is the ability, given sample(s) from a category, to generate sample(s) from the same category. We formalise this intuition as ε-category intelligence: a system is ε-intelligent with respect to a category if no chosen admissible distinguisher can separate its generated samples from the original samples beyond tolerance ε. We present the formal framework, outline empirical protocols, and discuss implications for evaluation, safety, and generalization.
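
Read literally, the criterion admits a compact formalization. The following is one reading of the stated definition, with 𝒟 the chosen class of admissible distinguishers (the notation is this digest's, not necessarily the paper's):

```latex
\[
G \text{ is } \varepsilon\text{-intelligent w.r.t. } \mathcal{C}
\;\iff\;
\sup_{D \in \mathcal{D}}
\Big|\, \Pr_{x \sim G}[D(x) = 1] \;-\; \Pr_{x \sim \mathcal{C}}[D(x) = 1] \,\Big|
\;\le\; \varepsilon .
\]
```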

[215]

Zhe Yu, Yiwei Lu, Burkhard Schafer, Zhe Lin

Main category: cs.AI

TL;DR: The paper addresses legal compliance challenges for autonomous vehicles in transnational contexts, offering a reasoning system to aid designers in adapting designs and understanding legal implications.

DetailsMotivation: To support designers in navigating legal complexities for autonomous vehicles across borders by integrating legal reasoning into the design process.

Method: Uses argumentation theory and partially ordered sets of natural numbers to model normative reasoning and priority, applied to case analysis of legal texts.

Result: Demonstrates a reasoning system that enhances flexibility in design adaptation and clarifies legal implications for cross-border applications.

Conclusion: The proposed system effectively aids designers in aligning autonomous vehicle designs with transnational legal requirements.

Abstract: This paper focuses on the legal compliance challenges of autonomous vehicles in a transnational context. We choose the perspective of designers and try to provide supporting legal reasoning in the design process. Based on argumentation theory, we introduce a logic to represent the basic properties of argument-based practical (normative) reasoning, combined with partially ordered sets of natural numbers to express priority. Finally, through case analysis of legal texts, we show how the reasoning system we provide can help designers to adapt their design solutions more flexibly in the cross-border application of autonomous vehicles and to more easily understand the legal implications of their decisions.

[216] Nearest-Better Network for Visualizing and Analyzing Combinatorial Optimization Problems: A Unified Tool

Yiya Diao, Changhe Li, Sanyou Zeng, Xinye Cai, Wenjian Luo, Shengxiang Yang, Carlos A. Coello Coello

Main category: cs.AI

TL;DR: The paper improves the Nearest-Better Network (NBN) method for visualizing optimization landscapes, introduces an efficient computation method, and applies it to reveal new insights about OneMax and TSP problems.

DetailsMotivation: The NBN method is time-consuming and challenging to extend to combinatorial problems, limiting its utility in analyzing algorithm behavior.

Method: The paper provides a theoretical derivation of NBN as a maximum probability transition network and introduces an efficient computation method with log-linear time complexity.

Result: Application to OneMax and TSP reveals landscape features like neutrality, ruggedness, and modality, and identifies limitations in state-of-the-art TSP algorithms (EAX and LKH).

Conclusion: The efficient NBN method enables deeper analysis of optimization landscapes, uncovering critical challenges and algorithm limitations in combinatorial problems.

Abstract: The Nearest-Better Network (NBN) is a powerful method to visualize sampled data for continuous optimization problems while preserving multiple landscape features. However, the calculation of NBN is very time-consuming, and the extension of the method to combinatorial optimization problems is challenging but very important for analyzing the algorithm’s behavior. This paper provides a straightforward theoretical derivation showing that NBN essentially functions as the maximum probability transition network for algorithms. This paper also presents an efficient NBN computation method with log-linear time complexity to address the time-consuming issue. By applying this efficient NBN algorithm to the OneMax problem and the Traveling Salesman Problem (TSP), we have made several remarkable discoveries for the first time: The fitness landscape of OneMax exhibits neutrality, ruggedness, and modality features. The primary challenges of TSP are ruggedness, modality, and deception. Two state-of-the-art TSP algorithms (i.e., EAX and LKH) have limitations when addressing challenges related to modality and deception, respectively. LKH, based on local search operators, fails when there are deceptive solutions near global optima. EAX, which is based on a single population, can efficiently maintain diversity. However, when multiple attraction basins exist, EAX retains individuals within multiple basins simultaneously, reducing inter-basin interaction efficiency and leading to stagnation.
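
The basic construction is easy to state, shown below in its naive O(n²) form (the paper's contribution is the log-linear algorithm): each sampled solution links to its nearest strictly-better solution.

```python
import numpy as np

def nearest_better_network(points: np.ndarray, fitness: np.ndarray) -> dict:
    """Naive NBN: points (n, d) sampled solutions, fitness (n,) values
    (maximization assumed). Returns node -> nearest better neighbor;
    the global best has no outgoing edge.
    """
    edges = {}
    for i in range(len(points)):
        better = np.flatnonzero(fitness > fitness[i])
        if better.size == 0:
            continue                                   # global best
        dists = np.linalg.norm(points[better] - points[i], axis=1)
        edges[i] = int(better[np.argmin(dists)])
    return edges
```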

[217] Collaborative Medical Triage under Uncertainty: A Multi-Agent Dynamic Matching Approach

Hongyan Cheng, Chengzhang Yu, Yanshu Shi, Chiyue Wang, Cong Liu, Zhanpeng Jin

Main category: cs.AI

TL;DR: An AI-driven multi-agent system improves emergency department triage by addressing medical specialization gaps, heterogeneous structures, and inefficient questioning, achieving high accuracy in department classification.

DetailsMotivation: The post-pandemic healthcare demand surge and nursing shortages necessitate innovative AI solutions for efficient and accurate triage.

Method: The system uses three specialized agents (RecipientAgent, InquirerAgent, DepartmentAgent) collaborating via structured inquiry and department-specific rules, evaluated on a comprehensive Chinese medical triage dataset.

Result: Achieves 89.2% primary and 73.9% secondary department classification accuracy after four patient interactions.

Conclusion: Provides a scalable framework for AI-assisted triage adaptable to diverse healthcare institutions while ensuring clinical accuracy.

Abstract: The post-pandemic surge in healthcare demand, coupled with critical nursing shortages, has placed unprecedented pressure on emergency department triage systems, necessitating innovative AI-driven solutions. We present a multi-agent interactive intelligent system for medical triage that addresses three fundamental challenges in current AI-based triage systems: insufficient medical specialization leading to hallucination-induced misclassifications, heterogeneous department structures across healthcare institutions, and inefficient detail-oriented questioning that impedes rapid triage decisions. Our system employs three specialized agents - RecipientAgent, InquirerAgent, and DepartmentAgent - that collaborate through structured inquiry mechanisms and department-specific guidance rules to transform unstructured patient symptoms into accurate department recommendations. To ensure robust evaluation, we constructed a comprehensive Chinese medical triage dataset from a medical website, comprising 3,360 real-world cases spanning 9 primary departments and 62 secondary departments. Through systematic data imputation using large language models, we address the prevalent issue of incomplete medical records in real-world data. Experimental results demonstrate that our multi-agent system achieves 89.2% accuracy in primary department classification and 73.9% accuracy in secondary department classification after four rounds of patient interaction. The system’s pattern-matching-based guidance mechanisms enable efficient adaptation to diverse hospital configurations while maintaining high triage accuracy. Our work provides a scalable framework for deploying AI-assisted triage systems that can accommodate the organizational heterogeneity of healthcare institutions while ensuring clinically sound decision-making.

[218] MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines

Yaolun Zhang, Xiaogeng Liu, Chaowei Xiao

Main category: cs.AI

TL;DR: MetaAgent is a framework for automatically generating multi-agent systems using finite state machines, outperforming existing auto-designed methods and matching human-designed systems.

DetailsMotivation: Existing multi-agent frameworks are limited to pre-defined scenarios, and automated methods lack flexibility, tool integration, and rely on external data.

Method: MetaAgent uses finite state machines to design and optimize multi-agent systems from task descriptions, controlling agent actions and state transitions.

Result: Experiments show MetaAgent surpasses auto-designed methods and matches human-designed systems in performance.

Conclusion: MetaAgent provides a flexible, automated solution for multi-agent system design, addressing limitations of current methods.

Abstract: Large Language Models (LLMs) have demonstrated the ability to solve a wide range of practical tasks within multi-agent systems. However, existing human-designed multi-agent frameworks are typically limited to a small set of pre-defined scenarios, while current automated design methods suffer from several limitations, such as the lack of tool integration, dependence on external training data, and rigid communication structures. In this paper, we propose MetaAgent, a finite state machine based framework that can automatically generate a multi-agent system. Given a task description, MetaAgent will design a multi-agent system and polish it through an optimization algorithm. When the multi-agent system is deployed, the finite state machine controls the agents’ actions and the state transitions. To evaluate our framework, we conduct experiments on both text-based tasks and practical tasks. The results indicate that the generated multi-agent system surpasses other auto-designed methods and can achieve a comparable performance with human-designed multi-agent systems, which are optimized for those specific tasks.
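
A toy FSM controller in this spirit; every name below is an assumption for illustration rather than MetaAgent's interface:

```python
def run_fsm(states, transitions, start, task, max_steps=50):
    """states: state name -> agent callable returning (outcome, context).
    transitions: (state, outcome) -> next state; a missing entry is terminal.
    """
    state, context = start, {"task": task}
    for _ in range(max_steps):
        outcome, context = states[state](context)   # agent acts, labels its outcome
        nxt = transitions.get((state, outcome))
        if nxt is None:
            break                                    # terminal state reached
        state = nxt
    return context
```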

[219] Enhancing Manufacturing Knowledge Access with LLMs and Context-aware Prompting

Sebastian Monka, Irlan Grangel-González, Stefan Schmid, Lavdim Halilaj, Marc Rickart, Oliver Rudolph, Rui Dias

Main category: cs.AI

TL;DR: LLMs can translate natural language queries into SPARQL for KGs, improving with context-aware prompting, especially in manufacturing.

DetailsMotivation: Simplify KG data retrieval for non-experts by automating SPARQL query generation using LLMs.

Method: Evaluate strategies for feeding KG context to LLMs, testing on manufacturing KGs (Bosch Line, I40 Core).

Result: LLMs perform better with adequate KG schema context, reducing errors and improving query accuracy.

Conclusion: Context-aware LLMs can democratize KG access, aiding decision-making in manufacturing.

Abstract: Knowledge graphs (KGs) have transformed data management within the manufacturing industry, offering effective means for integrating disparate data sources through shared and structured conceptual schemas. However, harnessing the power of KGs can be daunting for non-experts, as it often requires formulating complex SPARQL queries to retrieve specific information. With the advent of Large Language Models (LLMs), there is a growing potential to automatically translate natural language queries into the SPARQL format, thus bridging the gap between user-friendly interfaces and the sophisticated architecture of KGs. The challenge remains in adequately informing LLMs about the relevant context and structure of domain-specific KGs, e.g., in manufacturing, to improve the accuracy of generated queries. In this paper, we evaluate multiple strategies that use LLMs as mediators to facilitate information retrieval from KGs. We focus on the manufacturing domain, particularly on the Bosch Line Information System KG and the I40 Core Information Model. In our evaluation, we compare various approaches for feeding relevant context from the KG to the LLM and analyze their proficiency in transforming real-world questions into SPARQL queries. Our findings show that LLMs can significantly improve their performance on generating correct and complete queries when provided with adequate context from the KG schema. Such context-aware prompting techniques help LLMs to focus on the relevant parts of the ontology and reduce the risk of hallucination. We anticipate that the proposed techniques will help LLMs democratize access to complex data repositories and empower informed decision-making in manufacturing settings.
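
A hypothetical shape for such a context-aware prompt; the wording is illustrative, not the paper's template. The key move is supplying only the relevant slice of the KG schema:

```python
def build_sparql_prompt(question: str, schema_snippet: str) -> str:
    """Compose an NL-to-SPARQL prompt carrying a relevant schema excerpt."""
    return (
        "You translate natural-language questions into SPARQL.\n"
        "Use only the classes and properties listed below.\n\n"
        f"### KG schema (relevant excerpt)\n{schema_snippet}\n\n"
        f"### Question\n{question}\n\n"
        "### SPARQL query\n"
    )
```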

[220] ASP-FZN: A Translation-based Constraint Answer Set Solver

Thomas Eiter, Tobias Geibinger, Tobias Kaminski, Nysret Musliu, Johannes Oetsch

Main category: cs.AI

TL;DR: The paper introduces asp-fzn, a solver for Constraint Answer Set Programming (CASP), which translates CASP programs into FlatZinc for use with backend solvers. It competes with state-of-the-art ASP solvers and outperforms clingcon in some CASP benchmarks.

DetailsMotivation: To extend Answer Set Programming (ASP) with linear constraints and provide a solver-independent approach using FlatZinc for broader solver compatibility.

Method: Translates CASP programs into the FlatZinc language, leveraging backend solvers for Constraint Programming and Integer Programming. Supports rich linear constraints and global constraints.

Result: asp-fzn is competitive with top ASP solvers and outperforms clingcon in some CASP benchmarks.

Conclusion: asp-fzn is a promising solver for CASP, offering compatibility with ASP solvers and superior performance in certain CASP scenarios.

Abstract: We present the solver asp-fzn for Constraint Answer Set Programming (CASP), which extends ASP with linear constraints. Our approach is based on translating CASP programs into the solver-independent FlatZinc language that supports several Constraint Programming and Integer Programming backend solvers. Our solver supports a rich language of linear constraints, including some common global constraints. As for evaluation, we show that asp-fzn is competitive with state-of-the-art ASP solvers on benchmarks taken from past ASP competitions. Furthermore, we evaluate it on several CASP problems from the literature and compare its performance with clingcon, which is a prominent CASP solver that supports most of the asp-fzn language. The performance of asp-fzn is very promising as it is already competitive on plain ASP and even outperforms clingcon on some CASP benchmarks.

[221] Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Hugo Garrido-Lestache, Jeremy Kedziora

Main category: cs.AI

TL;DR: TAAC is a reinforcement learning algorithm for multi-agent collaboration, using attention mechanisms and a penalized loss function to improve teamwork. It outperforms benchmarks in simulated soccer.

DetailsMotivation: Enhancing multi-agent collaboration in cooperative environments by addressing the challenges of joint-action spaces and role diversity.

Method: Uses Centralized Training/Centralized Execution with multi-headed attention in actor and critic, plus a penalized loss function for role diversity.

Result: Superior performance in simulated soccer, with better win rates, goal differentials, and collaborative behaviors.

Conclusion: TAAC effectively improves multi-agent collaboration through dynamic communication and role diversity, outperforming existing methods.

Abstract: This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).
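
The abstract does not give the penalized loss in closed form, so the following is an illustrative stand-in rather than TAAC's formulation: penalizing the mean pairwise overlap of the agents' policy distributions pushes agents toward distinct, complementary roles.

```python
import torch

def role_diversity_penalty(action_logits: torch.Tensor) -> torch.Tensor:
    """action_logits: (n_agents, n_actions). Returns the mean pairwise
    overlap between agents' action distributions; adding this term to
    the actor loss discourages agents from collapsing into one role.
    Assumes n_agents > 1.
    """
    probs = torch.softmax(action_logits, dim=-1)
    overlap = probs @ probs.T                      # (n_agents, n_agents)
    n = probs.shape[0]
    off_diag = overlap.sum() - overlap.diagonal().sum()
    return off_diag / (n * (n - 1))                # mean off-diagonal overlap
```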

[222] The Incomplete Bridge: How AI Research (Mis)Engages with Psychology

Han Jiang, Pengda Wang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

Main category: cs.AI

TL;DR: This study maps interdisciplinary integration between AI and psychology by analyzing 1,006 AI papers and 2,544 psychology citations, identifying key patterns, operationalization methods, and misapplications to enhance collaboration.

DetailsMotivation: To explore the synergy between AI and psychology, leveraging social science insights for AI design and understanding.

Method: Analysis of 1,006 LLM-related AI papers (2023-2025) and their 2,544 cited psychology publications to identify integration patterns and operationalization.

Result: Identified key interdisciplinary patterns, frequently referenced psychology domains, and common misapplications of theories.

Conclusion: Provides a roadmap for effective interdisciplinary collaboration, advancing AI systems through deeper integration of psychology.

Abstract: Social sciences have accumulated a rich body of theories and methodologies for investigating the human mind and behaviors, while offering valuable insights into the design and understanding of Artificial Intelligence (AI) systems. Focusing on psychology as a prominent case, this study explores the interdisciplinary synergy between AI and the field by analyzing 1,006 LLM-related papers published in premier AI venues between 2023 and 2025, along with the 2,544 psychology publications they cite. Through our analysis, we identify key patterns of interdisciplinary integration, locate the psychology domains most frequently referenced, and highlight areas that remain underexplored. We further examine how psychology theories/frameworks are operationalized and interpreted, identify common types of misapplication, and offer guidance for more effective incorporation. Our work provides a comprehensive map of interdisciplinary engagement between AI and psychology, thereby facilitating deeper collaboration and advancing AI systems.

[223] Automatically discovering heuristics in a complex SAT solver with large language models

Yiwen Sun, Furong Ye, Zhihan Chen, Ke Wei, Shaowei Cai

Main category: cs.AI

TL;DR: AutoModSAT uses LLMs to optimize SAT solvers, achieving 50% performance improvement over baselines and 30% over SOTA solvers, with a 20% speedup.

DetailsMotivation: Modern SAT solvers are hard to optimize due to complex architectures, and existing configuration frameworks offer limited gains.

Method: Develops AutoModSAT with LLM-friendly solver guidelines, automatic prompt optimization, and an efficient search strategy (a presearch step plus an evolutionary algorithm).

Result: 50% improvement over baseline, 30% over SOTA solvers, and 20% speedup.

Conclusion: Bridges AI-driven heuristics with system optimization, offering methodological and empirical advancements for solver development.

Abstract: The satisfiability problem (SAT) is a cornerstone of computational complexity with broad industrial applications, and it remains challenging to optimize modern SAT solvers in real-world settings due to their intricate architectures. While automatic configuration frameworks have been developed, they rely on manually constrained search spaces and yield limited performance gains. This work introduces a novel paradigm which effectively optimizes complex SAT solvers via Large Language Models (LLMs), and a tool called AutoModSAT is developed. Three fundamental challenges are addressed in order to achieve superior performance: (1) LLM-friendly solver: Systematic guidelines are proposed for developing a modularized solver to meet LLMs’ compatibility, emphasizing code simplification, information sharing, and bug reduction; (2) Automatic prompt optimization: An unsupervised automatic prompt optimization method is introduced to increase the diversity of LLMs’ output; (3) Efficient search strategy: We design a presearch strategy and an evolutionary algorithm (EA) for the final efficient and effective discovery of heuristics. Extensive experiments across a wide range of datasets demonstrate that AutoModSAT achieves 50% performance improvement over the baseline solver and achieves 30% superiority against the state-of-the-art (SOTA) solvers. Moreover, AutoModSAT attains a 20% speedup on average compared to parameter-tuned alternatives of the SOTA solvers, showcasing the enhanced capability in handling complex problem instances. This work bridges the gap between AI-driven heuristics discovery and mission-critical system optimization, and provides both methodological advancements and empirically validated results for next-generation complex solver development.
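
The outer loop pairing LLM code mutation with solver benchmarking can be sketched schematically; all interfaces below are assumptions:

```python
def evolve_heuristics(seed_heuristics, llm_mutate, evaluate,
                      generations=10, population_size=8):
    """LLM-driven evolutionary search over solver heuristics: the LLM
    proposes code mutations, candidates are scored by running the
    modified solver on benchmarks, and the best survive.
    """
    population = list(seed_heuristics)
    for _ in range(generations):
        children = [llm_mutate(h) for h in population]
        ranked = sorted(population + children, key=evaluate, reverse=True)
        population = ranked[:population_size]        # truncation selection
    return population[0]                             # best heuristic found
```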

[224] KIX: A Knowledge and Interaction-Centric Metacognitive Framework for Task Generalization

Arun Kumar, Paul Schrater

Main category: cs.AI

TL;DR: The paper introduces KIX, a metacognitive reasoning framework, to bridge the gap between human-like general intelligence and specialized artificial agents by leveraging structured knowledge representations.

DetailsMotivation: To address the lack of generalist behaviors in artificial agents compared to humans, who flexibly reuse high-level knowledge.

Method: Proposes the Knowledge-Interaction-eXecution (KIX) framework, which uses interactions with objects via a type space to learn transferable concepts.

Result: KIX facilitates generalization and offers a principled way to integrate knowledge into reinforcement learning.

Conclusion: KIX holds promise for enabling generalist behaviors in AI, robotics, and autonomous systems.

Abstract: People aptly exhibit general intelligence behaviors through flexible problem-solving and the ability to adapt to novel situations by reusing and applying high-level knowledge acquired over time. In contrast, artificial agents tend to be specialists, lacking such generalist behaviors. To bridge this gap, artificial agents will require understanding and exploiting critical structured knowledge representations. We introduce a metacognitive reasoning framework, Knowledge-Interaction-eXecution (KIX), and argue that interactions with objects, by leveraging a type space, facilitate the learning of transferable interaction concepts and promote generalization. This framework offers a principled approach for integrating knowledge into reinforcement learning and holds promise as an enabler for generalist behaviors in artificial intelligence, robotics, and autonomous systems.

[225] Learning Neural Strategy-Proof Matching Mechanism from Examples

Ryota Maruo, Koh Takeuchi, Hisashi Kashima

Main category: cs.AI

TL;DR: The paper introduces NeuralSD, a neural network-based matching mechanism that ensures strategy-proofness, handles varying agent numbers, and incorporates contextual information, outperforming baselines.

DetailsMotivation: Existing matching mechanisms lack guarantees for strategy-proofness, flexibility in agent numbers, and contextual information integration, limiting real-world applicability.

Method: Proposes NeuralSD, an attention-based neural network architecture using tensor serial dictatorship (TSD) for differentiable relaxation, enabling end-to-end training while maintaining strategy-proofness.

Result: NeuralSD outperforms baselines in predicting matchings and achieves better metrics for matching outcomes.

Conclusion: NeuralSD provides a robust, flexible, and strategy-proof solution for two-sided matching, addressing key limitations of prior work.

Abstract: Designing two-sided matching mechanisms is challenging when practical demands for matching outcomes are difficult to formalize and the designed mechanism must satisfy theoretical conditions. To address this, prior work has proposed a framework that learns a matching mechanism from examples, using a parameterized family that satisfies properties such as stability. However, despite its usefulness, this framework does not guarantee strategy-proofness (SP), and cannot handle varying numbers of agents or incorporate publicly available contextual information about agents, both of which are crucial in real-world applications. In this paper, we propose a new parametrized family of matching mechanisms that always satisfy strategy-proofness, are applicable for an arbitrary number of agents, and deal with public contextual information of agents, based on the serial dictatorship (SD). This family is represented by NeuralSD, a novel neural network architecture based on SD, where agent rankings in SD are treated as learnable parameters computed from agents’ contexts using an attention-based sub-network. To enable learning, we introduce tensor serial dictatorship (TSD), a differentiable relaxation of SD using tensor operations. This allows NeuralSD to be trained end-to-end from example matchings while satisfying SP. We conducted experiments to learn a matching mechanism from matching examples while satisfying SP. We demonstrated that our method outperformed baselines in predicting matchings and on several metrics for goodness of matching outcomes.
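
Plain serial dictatorship, the strategy-proof mechanism NeuralSD relaxes, fits in a few lines; in NeuralSD the agent order would be predicted from agent contexts by the attention sub-network rather than fixed in advance.

```python
def serial_dictatorship(order, preferences, capacity):
    """order: agents in picking order; preferences: agent -> ranked list
    of institutions (best first); capacity: institution -> seats.
    Each agent takes their most-preferred institution with a seat left.
    """
    seats = dict(capacity)
    matching = {}
    for agent in order:
        for inst in preferences[agent]:
            if seats.get(inst, 0) > 0:
                matching[agent] = inst
                seats[inst] -= 1
                break
    return matching
```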

[226] Can adversarial attacks by large language models be attributed?

Manuel Cebrian, Andres Abeliuk, Jan Arne Telle

Main category: cs.AI

TL;DR: The paper explores challenges in attributing outputs from Large Language Models (LLMs) in adversarial settings, combining theoretical and empirical approaches. It identifies non-identifiability for certain LLM classes and highlights rapid growth in plausible model origins, making exhaustive attribution impractical.

DetailsMotivation: Addressing the growing importance of attributing LLM outputs in adversarial contexts like cyberattacks and disinformation campaigns.

Method: Uses formal language theory (identification in the limit) and empirical analysis of the LLM ecosystem to model LLM outputs as formal languages.

Result: Shows non-identifiability for infinite deterministic/probabilistic LLM classes, identifiability for finite deterministic ones, and provides a new counterexample for finite probabilistic LLMs. Also quantifies rapid growth in plausible model origins.

Conclusion: Exhaustive attribution is infeasible due to non-identifiability and combinatorial growth of model origins, posing significant challenges for practical applications.

Abstract: Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation campaigns, presents significant challenges that are likely to grow in importance. We approach this attribution problem from both a theoretical and an empirical perspective, drawing on formal language theory (identification in the limit) and data-driven analysis of the expanding LLM ecosystem. By modeling an LLM’s set of possible outputs as a formal language, we analyze whether finite samples of text can uniquely pinpoint the originating model. Our results show that, under mild assumptions of overlapping capabilities among models, certain classes of LLMs are fundamentally non-identifiable from their outputs alone. We delineate four regimes of theoretical identifiability: (1) an infinite class of deterministic (discrete) LLM languages is not identifiable (Gold’s classical result from 1967); (2) an infinite class of probabilistic LLMs is also not identifiable (by extension of the deterministic case); (3) a finite class of deterministic LLMs is identifiable (consistent with Angluin’s tell-tale criterion); and (4) even a finite class of probabilistic LLMs can be non-identifiable (we provide a new counterexample establishing this negative result). Complementing these theoretical insights, we quantify the explosion in the number of plausible model origins (hypothesis space) for a given output in recent years. Even under conservative assumptions (each open-source model fine-tuned on at most one new dataset), the count of distinct candidate models doubles approximately every 0.5 years, and allowing multi-dataset fine-tuning combinations yields doubling times as short as 0.28 years. This combinatorial growth, alongside the extraordinary computational cost of brute-force likelihood attribution across all models and potential users, renders exhaustive attribution infeasible in practice.
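
The quoted doubling times correspond to a simple exponential growth law for the hypothesis space of candidate models (a restatement of the abstract's numbers, not an additional result):

```latex
\[
|\mathcal{H}(t)| \;\approx\; |\mathcal{H}(0)| \cdot 2^{\,t / T_d},
\qquad
T_d \approx 0.5\ \text{yr (one fine-tuning dataset per model)},
\quad
T_d \approx 0.28\ \text{yr (multi-dataset combinations)}.
\]
```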

[227] A Survey on Large Language Model Acceleration based on KV Cache Management

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

Main category: cs.AI

TL;DR: A survey on KV cache management strategies for accelerating LLM inference, covering token-level, model-level, and system-level optimizations.

DetailsMotivation: Address the computational and memory challenges of LLMs during inference to enable real-world, long-context, and real-time applications.

Method: Categorizes KV cache management into token-level (e.g., selection, quantization), model-level (e.g., architectural innovations), and system-level (e.g., memory management) optimizations.

Result: Provides taxonomies, comparative analyses, and benchmarks to evaluate KV cache strategies.

Conclusion: Offers insights for developing efficient KV cache techniques to support practical LLM deployment.

Abstract: Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is available at https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.
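As a concrete instance of the token-level strategies the survey categorizes, here is a minimal numpy sketch of score-based KV cache eviction; the scoring rule and budget are illustrative, not a specific method from the survey:

```python
import numpy as np

def evict_kv(keys, values, attn_mass, budget):
    """Keep only the `budget` cached tokens with the highest accumulated
    attention mass and drop the rest. keys/values: (seq, dim) arrays,
    attn_mass: (seq,) attention each cached token has received so far."""
    if keys.shape[0] <= budget:
        return keys, values
    keep = np.argsort(attn_mass)[-budget:]
    keep.sort()                        # preserve original token order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
K2, V2 = evict_kv(K, V, rng.random(16), budget=4)
print(K2.shape)   # (4, 8) -- memory shrinks, salient tokens survive
```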

[228] CollabLLM: From Passive Responders to Active Collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao

Main category: cs.AI

TL;DR: CollabLLM enhances multiturn human-LLM collaboration by using Multiturn-aware Rewards, improving task performance and user satisfaction.

DetailsMotivation: Current LLMs lack long-term interaction optimization, leading to passive responses and inefficient conversations.

Method: CollabLLM employs collaborative simulation and reinforcement fine-tuning with Multiturn-aware Rewards.

Result: CollabLLM outperforms baselines with 18.5% higher task performance and 46.3% improved interactivity, increasing user satisfaction by 17.6%.

Conclusion: CollabLLM advances human-centered AI by actively uncovering user intent and improving interaction efficiency.

Abstract: Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning on these rewards, CollabLLM goes beyond responding to user requests and actively uncovers user intent and offers insightful suggestions, a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks, such as document creation. CollabLLM significantly outperforms our baselines, with averages of 18.5% higher task performance and 46.3% improved interactivity as rated by LLM judges. Finally, we conduct a large user study with 201 judges, in which CollabLLM increases user satisfaction by 17.6% and reduces the time users spend by 10.4%.

[229] Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Main category: cs.AI

TL;DR: The paper introduces a method to transform coding benchmarks into scoring datasets for evaluating synthetic verifiers, proposing metrics and releasing four new benchmarks. It shows reasoning improves test case generation and scaling test cases boosts verification accuracy.

DetailsMotivation: To enhance the coding and reasoning capabilities of LLMs by improving synthetic verification techniques beyond predefined tests.

Method: Proposes transforming coding benchmarks into scoring datasets, introduces metrics, and evaluates synthetic verifiers using standard, reasoning-based, and reward-based LLMs.

Result: Reasoning improves test case generation, and scaling test cases enhances verification accuracy. Four new benchmarks (HE-R, HE-R+, MBPP-R, MBPP-R+) are released.

Conclusion: The approach effectively evaluates synthetic verifiers, with reasoning and test case scaling proving beneficial for verification accuracy.

Abstract: Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
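The core transformation, scoring candidate solutions against synthetic test cases, can be sketched in a few lines; the task, candidates, and pass/fail scoring below are hypothetical stand-ins for the paper's benchmarks and metrics:

```python
def rank_candidates(candidates, test_cases):
    """Rank candidate solutions by the fraction of synthetic test cases
    they pass -- the kind of signal a verifier benchmark evaluates."""
    def passes(fn, case):
        args, expected = case
        try:
            return fn(*args) == expected
        except Exception:
            return False
    scored = [(sum(passes(fn, c) for c in test_cases) / len(test_cases), name)
              for name, fn in candidates.items()]
    return sorted(scored, reverse=True)

# Hypothetical task: absolute value. One correct and one buggy candidate.
tests = [((3,), 3), ((-3,), 3), ((0,), 0)]
cands = {"good": abs, "buggy": lambda x: x}
print(rank_candidates(cands, tests))   # [(1.0, 'good'), (0.67, 'buggy')] approx.
```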

[230] AGITB: A Signal-Level Benchmark for Evaluating Artificial General Intelligence

Matej Šprogar

Main category: cs.AI

TL;DR: AGITB is a new benchmarking suite for evaluating low-level cognitive precursors in AI, designed to test core computational invariants without pretraining or symbolic manipulation.

DetailsMotivation: Current AI lacks human-like general intelligence, and existing evaluation frameworks fail to measure true generality. AGITB aims to address this gap.

Method: AGITB includes twelve automatable tests focused on binary signal prediction, isolating core computational invariants like determinism and generalization.

Result: No current AI system has fully passed AGITB, highlighting its rigor and potential as a benchmark for general intelligence.

Conclusion: AGITB is a promising, interpretable, and actionable tool for guiding progress toward artificial general intelligence.

Abstract: Despite major advances in machine learning, current artificial intelligence systems continue to fall short of human-like general intelligence. While large language and reasoning models can generate fluent and coherent outputs, they lack the deep understanding and adaptive reasoning that characterize truly general intelligence. Existing evaluation frameworks, which are centered on broad language or perception tasks, fail to capture generality at its core and offer no guidance. The artificial general intelligence testbed (AGITB) is a novel and freely available benchmarking suite comprising twelve fully automatable tests designed to evaluate low-level cognitive precursors through binary signal prediction. AGITB requires models to forecast temporal sequences without pretraining, symbolic manipulation, or semantic grounding. The framework isolates core computational invariants - such as determinism, sensitivity, and generalization - that align with principles of biological information processing. Engineered to resist brute-force and memorization-based approaches, AGITB presumes no prior knowledge and demands learning from first principles. While humans pass all tests, no current AI system has met the full AGITB criteria, underscoring its potential as a rigorous, interpretable, and actionable benchmark for guiding and evaluating progress toward artificial general intelligence. A reference implementation of AGITB is available on GitHub.

[231] Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG

Roie Kazoom, Raz Lapid, Moshe Sipper, Ofer Hadar

Main category: cs.AI

TL;DR: A training-free VRAG framework uses VLMs for adversarial patch detection, achieving high accuracy without retraining.

DetailsMotivation: Adversarial patch attacks threaten vision systems, and traditional defenses are impractical due to retraining needs.

Method: VRAG integrates VLMs to retrieve and reason about adversarial patches using a growing database, avoiding additional training.

Result: Open-source UI-TARS-72B-DPO achieves 95% accuracy, while closed-source Gemini-2.0 reaches 98%.

Conclusion: VRAG offers a robust, practical defense against adversarial patches with minimal human input.

Abstract: Adversarial patch attacks pose a major threat to vision systems by embedding localized perturbations that mislead deep models. Traditional defense methods often require retraining or fine-tuning, making them impractical for real-world deployment. We propose a training-free Visual Retrieval-Augmented Generation (VRAG) framework that integrates Vision-Language Models (VLMs) for adversarial patch detection. By retrieving visually similar patches and images that resemble stored attacks in a continuously expanding database, VRAG performs generative reasoning to identify diverse attack types, all without additional training or fine-tuning. We extensively evaluate open-source large-scale VLMs, including Qwen-VL-Plus, Qwen2.5-VL-72B, and UI-TARS-72B-DPO, alongside Gemini-2.0, a closed-source model. Notably, the open-source UI-TARS-72B-DPO model achieves up to 95 percent classification accuracy, setting a new state-of-the-art for open-source adversarial patch detection. Gemini-2.0 attains the highest overall accuracy, 98 percent, but remains closed-source. Experimental results demonstrate VRAG’s effectiveness in identifying a variety of adversarial patches with minimal human annotation, paving the way for robust, practical defenses against evolving adversarial patch attacks.
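The retrieval step at the heart of VRAG can be sketched with plain cosine similarity; the embeddings and database below are synthetic placeholders (the paper retrieves visually similar patches and images and passes them to a VLM for generative reasoning):

```python
import numpy as np

def retrieve_similar(query_emb, db_embs, db_labels, k=3):
    """Return the k stored attack examples most similar (cosine) to the
    query embedding, to be used as retrieval context for the VLM."""
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(sims)[-k:][::-1]          # best matches first
    return [(db_labels[i], float(sims[i])) for i in top]

rng = np.random.default_rng(1)
db = rng.normal(size=(100, 32))                # stored patch embeddings
labels = [f"attack_{i}" for i in range(100)]
print(retrieve_similar(rng.normal(size=32), db, labels, k=3))
```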

[232] Subgoal-Guided Policy Heuristic Search with Learned Subgoals

Jake Tuero, Michael Buro, Levi H. S. Lelis

Main category: cs.AI

TL;DR: The paper introduces a method to improve the sample efficiency of learning subgoal-based policies for policy tree search by leveraging search trees from both successful and failed attempts.

DetailsMotivation: Current policy tree search algorithms require costly complete solution trajectories for training, especially for hard instances, leading to wasted samples in failed attempts.

Method: The proposed method learns subgoals and subgoal-conditioned policies from expanded search trees, including those from failed attempts.

Result: Empirical results show improved sample efficiency in learning policies and heuristic functions.

Conclusion: The novel approach enhances training efficiency by utilizing data from all search attempts, reducing waste in learning.

Abstract: Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, the process in which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.

[233] Clustering via Self-Supervised Diffusion

Roy Uziel, Irit Chelly, Oren Freifeld, Ari Pakman

Main category: cs.AI

TL;DR: CLUDI is a self-supervised clustering framework using diffusion models and Vision Transformer features, achieving state-of-the-art performance.

DetailsMotivation: Diffusion models have not been applied to clustering despite their success in generative tasks. CLUDI aims to bridge this gap.

Method: CLUDI uses a teacher-student paradigm with stochastic diffusion-based sampling for diverse cluster assignments, refined by the student.

Result: CLUDI sets new benchmarks in clustering robustness and adaptability, outperforming existing methods.

Conclusion: CLUDI demonstrates the potential of diffusion models in clustering, offering a novel and effective approach.

Abstract: Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions. Our code is available at https://github.com/BGU-CS-VIL/CLUDI.

[234] The wall confronting large language models

Peter V. Coveney, Sauro Succi

Main category: cs.AI

TL;DR: Large language models (LLMs) face inherent limitations in improving prediction uncertainty due to scaling laws, making them unreliable for scientific standards. Their learning mechanism may cause error pileup and degenerative behavior, exacerbated by spurious correlations in large datasets. Avoiding this requires prioritizing insight and problem understanding.

DetailsMotivation: To highlight the limitations of LLMs in achieving reliable predictions for scientific inquiry due to scaling laws and inherent learning mechanisms.

Method: Analyzes the relationship between LLM scaling laws, prediction uncertainty, and error pileup, alongside spurious correlations in large datasets.

Result: LLMs’ learning mechanisms and scaling laws severely limit their reliability, leading to potential degenerative AI behavior.

Conclusion: Avoiding degenerative AI pathways in LLMs requires a greater focus on insight and understanding of problem structures.

Abstract: We show that the scaling laws which determine the performance of large language models (LLMs) severely limit their ability to improve the uncertainty of their predictions. As a result, raising their reliability to meet the standards of scientific inquiry is intractable by any reasonable measure. We argue that the very mechanism which fuels much of the learning power of LLMs, namely the ability to generate non-Gaussian output distributions from Gaussian input ones, might well be at the roots of their propensity to produce error pileup, ensuing information catastrophes and degenerative AI behaviour. This tension between learning and accuracy is a likely candidate mechanism underlying the observed low values of the scaling components. It is substantially compounded by the deluge of spurious correlations pointed out by Calude and Longo which rapidly increase in any data set merely as a function of its size, regardless of its nature. The fact that a degenerative AI pathway is a very probable feature of the LLM landscape does not mean that it must inevitably arise in all future AI research. Its avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.

[235] HypKG: Hypergraph-based Knowledge Graph Contextualization for Precision Healthcare

Yuzhang Xie, Xu Han, Ran Xu, Xiao Hu, Jiaying Lu, Carl Yang

Main category: cs.AI

TL;DR: HypKG integrates EHR data with KGs for contextualized healthcare predictions using hypergraph transformers, improving accuracy and KG utility.

DetailsMotivation: General KGs lack patient-specific contexts crucial for precision healthcare, while EHRs provide rich personal data. Integrating these can enhance healthcare predictions.

Method: HypKG uses entity-linking to connect KGs with EHRs, then employs hypergraph transformers to learn contextualized representations for both.

Result: HypKG significantly improves healthcare predictions in experiments with a biomedical KG and real-world EHR datasets.

Conclusion: HypKG enhances KG utility by integrating patient contexts, improving prediction accuracy and knowledge quality.

Abstract: Knowledge graphs (KGs) are important products of the semantic web, which are widely used in various application domains. Healthcare is one such domain where KGs are intensively used, due to the high requirements for knowledge accuracy and the interconnected nature of healthcare data. However, KGs storing general factual information often lack the ability to account for important contexts of the knowledge, such as the status of specific patients, which are crucial in precision healthcare. Meanwhile, electronic health records (EHRs) provide rich personal data, including various diagnoses and medications, which provide natural contexts for general KGs. In this paper, we propose HypKG, a framework that integrates patient information from EHRs into KGs to generate contextualized knowledge representations for accurate healthcare predictions. Using advanced entity-linking techniques, we connect relevant knowledge from general KGs with patient information from EHRs, and then utilize a hypergraph model to “contextualize” the knowledge with the patient information. Finally, we employ hypergraph transformers guided by downstream prediction tasks to jointly learn proper contextualized representations for both KGs and patients, fully leveraging existing knowledge in KGs and patient contexts in EHRs. In experiments using a large biomedical KG and two real-world EHR datasets, HypKG demonstrates significant improvements in healthcare prediction tasks across multiple evaluation metrics. Additionally, by integrating external contexts, HypKG can learn to adjust the representations of entities and relations in KG, potentially improving the quality and real-world utility of knowledge.
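To make the hypergraph framing concrete, here is a tiny sketch of the incidence structure such a model contextualizes, where each patient acts as a hyperedge linking several KG entities at once (the entities and patients are made up):

```python
import numpy as np

entities = ["diabetes", "metformin", "hypertension", "lisinopril"]
patients = {"p1": ["diabetes", "metformin"],
            "p2": ["hypertension", "lisinopril", "diabetes"]}

# Incidence matrix H (entities x patient hyperedges): H[i, j] = 1 iff
# entity i appears in patient j's record.
H = np.zeros((len(entities), len(patients)), dtype=int)
for j, codes in enumerate(patients.values()):
    for code in codes:
        H[entities.index(code), j] = 1
print(H)
# [[1 1]
#  [1 0]
#  [0 1]
#  [0 1]]
```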

[236] Multi-Agent Reinforcement Learning for Dynamic Mobility Resource Allocation with Hierarchical Adaptive Grouping

Farshid Nooshi, Suining He

Main category: cs.AI

TL;DR: Proposes HAG-PS, a multi-agent reinforcement learning method for dynamic mobility resource allocation, addressing policy sharing and memory efficiency in urban settings.

DetailsMotivation: To rebalance mobility demand and supply by dynamically allocating resources like bikes/e-scooters and ride-sharing vehicles in urban areas.

Method: Uses hierarchical adaptive grouping for parameter sharing, agent grouping based on trajectory closeness, and learnable ID embeddings for specialization.

Result: Demonstrated superior performance (e.g., improved bike availability) using NYC bike-sharing data (1.2M+ trips).

Conclusion: HAG-PS effectively addresses mobility resource allocation challenges, outperforming baseline methods.

Abstract: Allocating mobility resources (e.g., shared bikes/e-scooters, ride-sharing vehicles) is crucial for rebalancing mobility demand and supply in urban environments. We propose in this work a novel multi-agent reinforcement learning method named Hierarchical Adaptive Grouping-based Parameter Sharing (HAG-PS) for dynamic mobility resource allocation. HAG-PS aims to address two important research challenges regarding multi-agent reinforcement learning for mobility resource allocation: (1) how to dynamically and adaptively share the mobility resource allocation policy (i.e., how to distribute mobility resources) across agents (i.e., representing the regional coordinators of mobility resources); and (2) how to achieve memory-efficient parameter sharing in an urban-scale setting. To address the above challenges, we have provided the following novel designs within HAG-PS. To enable dynamic and adaptive parameter sharing, we have designed a hierarchical approach that consists of global and local information of the mobility resource states (e.g., distribution of mobility resources). We have developed an adaptive agent grouping approach in order to split or merge the groups of agents based on the relative closeness of their encoded trajectories (i.e., states, actions, and rewards). We have designed learnable identity (ID) embeddings to enable agent specialization beyond simple parameter copying. We have performed extensive experimental studies based on real-world NYC bike sharing data (a total of more than 1.2 million trips), and demonstrated the superior performance (e.g., improved bike availability) of HAG-PS compared with other baseline approaches.
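A minimal sketch of the grouping intuition: agents whose encoded trajectories are close share one policy, others get their own. The greedy pass, similarity threshold, and embeddings are placeholders; the paper's hierarchical split/merge over global and local state information is richer than this:

```python
import numpy as np

def group_agents(traj_embs, merge_thresh=0.5):
    """Greedily assign each agent to an existing group whose centroid is
    close (cosine) to its trajectory embedding, else start a new group.
    Agents in one group would share policy parameters."""
    groups, centroids = [], []
    for i, e in enumerate(traj_embs):
        e = e / np.linalg.norm(e)
        for g, c in zip(groups, centroids):
            if e @ (c / np.linalg.norm(c)) > merge_thresh:
                g.append(i)
                c += e                 # update running centroid in place
                break
        else:                          # no close group found
            groups.append([i])
            centroids.append(e.copy())
    return groups

rng = np.random.default_rng(2)
print(group_agents(rng.normal(size=(8, 16))))
```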

[237] A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenghailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang

Main category: cs.AI

TL;DR: The paper reviews self-evolving agents for LLMs, focusing on what, when, and how to evolve, and discusses applications, challenges, and future directions.

DetailsMotivation: LLMs are static and struggle with adaptation in dynamic environments, necessitating self-evolving agents for real-time reasoning and learning.

Method: The survey systematically categorizes evolutionary mechanisms, adaptation methods, and designs, alongside evaluating metrics and benchmarks.

Result: A structured framework is provided for designing self-evolving agents, with applications in coding, education, and healthcare.

Conclusion: The survey outlines a roadmap for advancing adaptive agentic systems, aiming for Artificial Super Intelligence (ASI).

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift – from scaling static models to developing self-evolving agents – has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organized around three foundational dimensions – what to evolve, when to evolve, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing adaptive agentic systems in both research and real-world deployments, ultimately shedding light to pave the way for the realization of Artificial Super Intelligence (ASI), where agents evolve autonomously, performing at or beyond human-level intelligence across a wide array of tasks.

[238] ST-GDance: Long-Term and Collision-Free Group Choreography from Music

Jing Xu, Weiqiang Wang, Cunjian Chen, Jun Liu, Qiuhong Ke

Main category: cs.AI

TL;DR: ST-GDance is a framework for generating synchronized group dance sequences from music, addressing scalability and collision issues by decoupling spatial and temporal dependencies.

DetailsMotivation: Group dance generation is challenging due to the need for synchronization and spatial coordination, especially with increasing dancers and sequence length, leading to computational complexity and motion collisions.

Method: ST-GDance uses lightweight graph convolutions for spatial modeling and accelerated sparse attention for temporal modeling to optimize choreography.

Result: The framework outperforms state-of-the-art methods on the AIOZ-GDance dataset, especially in generating long, coherent, and collision-free sequences.

Conclusion: ST-GDance effectively addresses scalability and collision challenges in group dance generation, offering a practical solution for applications in film, gaming, and animation.

Abstract: Group dance generation from music has broad applications in film, gaming, and animation production. However, it requires synchronizing multiple dancers while maintaining spatial coordination. As the number of dancers and sequence length increase, this task faces higher computational complexity and a greater risk of motion collisions. Existing methods often struggle to model dense spatial-temporal interactions, leading to scalability issues and multi-dancer collisions. To address these challenges, we propose ST-GDance, a novel framework that decouples spatial and temporal dependencies to optimize long-term and collision-free group choreography. We employ lightweight graph convolutions for distance-aware spatial modeling and accelerated sparse attention for efficient temporal modeling. This design significantly reduces computational costs while ensuring smooth and collision-free interactions. Experiments on the AIOZ-GDance dataset demonstrate that ST-GDance outperforms state-of-the-art baselines, particularly in generating long and coherent group dance sequences. Project page: https://yilliajing.github.io/ST-GDance-Website/.

[239] DualSG: A Dual-Stream Explicit Semantic-Guided Multivariate Time Series Forecasting Framework

Kuiye Ding, Fanda Fan, Yao Wang, Ruijie Jian, Xiaorui Wang, Luqi Gong, Yishan Jiang, Chunjie Luo, Jianfeng Zhan

Main category: cs.AI

TL;DR: DualSG is a dual-stream framework using LLMs as semantic guides to refine traditional predictions, avoiding alignment issues and improving accuracy.

DetailsMotivation: To address the loss of numerical precision and alignment difficulties in using LLMs for multivariate time series forecasting.

Method: Proposes DualSG, a dual-stream framework with explicit semantic guidance via Time Series Caption and a caption-guided fusion module.

Result: Outperforms 15 state-of-the-art baselines across diverse real-world datasets.

Conclusion: Explicitly combining numerical forecasting with semantic guidance enhances performance and interpretability.

Abstract: Multivariate Time Series Forecasting (MTSF) plays a key role in many applications. Recent works have explored using Large Language Models (LLMs) for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance.

[240] MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors

Shouyi Lu, Zihan Lin, Chao Lu, Huanran Wang, Guirong Zhuo, Lianqing Zheng

Main category: cs.AI

TL;DR: MultiEditor is a dual-branch latent diffusion framework for editing images and LiDAR point clouds in driving scenarios, improving cross-modality consistency and rare-category vehicle detection.

DetailsMotivation: Addressing the long-tailed distribution of real-world data in autonomous driving, which hinders generalization for rare but safety-critical vehicle categories.

Method: Uses 3D Gaussian Splatting (3DGS) as a prior, with multi-level appearance control and depth-guided deformable cross-modality conditioning.

Result: Achieves high visual/geometric fidelity, editing controllability, and cross-modality consistency, enhancing detection accuracy for rare classes.

Conclusion: MultiEditor effectively improves multimodal data editing and rare-category vehicle detection in autonomous driving systems.

Abstract: Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds in driving scenarios jointly. At the core of our approach is introducing 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism, comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement, to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.

[241] Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper introduces Tiny-BioMoE, a lightweight pretrained model for biosignal analysis, aimed at improving automatic pain assessment through multimodal physiological signals.

DetailsMotivation: Accurate pain assessment is crucial for patient care and management. Automatic systems using physiological signals can provide objective insights and enhance monitoring.

Method: The study proposes Tiny-BioMoE, a pretrained embedding model trained on 4.4 million biosignal images with 7.3 million parameters, tested on diverse physiological signals.

Result: The model demonstrates effectiveness in automatic pain recognition across multiple modalities, including electrodermal activity and blood volume pulse.

Conclusion: Tiny-BioMoE offers a lightweight, efficient solution for biosignal analysis in pain assessment, with potential for clinical applications.

Abstract: Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person’s state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed approach introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis. Trained on 4.4 million biosignal image representations and consisting of only 7.3 million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model’s effectiveness across diverse modalities in automatic pain recognition tasks. The model’s architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.

[242] Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper proposes a pipeline using electrodermal activity signals for automatic pain assessment, demonstrating its effectiveness through experiments and comparing it favorably to traditional methods.

DetailsMotivation: Reliable pain assessment is crucial for effective management and reducing distress. Automated systems using physiological signals can provide objective insights.

Method: The method involves creating and visualizing multiple representations of electrodermal activity signals in a multi-representation diagram, incorporating various processing and filtering techniques.

Result: The approach yields comparable or superior results to traditional fusion methods, proving its robustness.

Conclusion: The proposed pipeline is a robust alternative for integrating different signal representations or modalities in pain-assessment systems.

Abstract: Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual’s condition. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.

[243] Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: A study proposes a pain assessment method using respiration signals, a cross-attention transformer, and multi-windowing, showing strong performance with efficient models.

DetailsMotivation: Accurate pain assessment is crucial for effective management and reducing distress; automatic systems can support continuous monitoring.

Method: The method uses respiration signals, a cross-attention transformer, and a multi-windowing strategy to capture short-term, long-term, and global features.

Result: Respiration proves valuable for pain assessment; compact, optimized models outperform larger ones. The multi-window approach enhances feature representation.

Conclusion: The proposed pipeline is effective for pain assessment, demonstrating the potential of efficient models and respiration signals.

Abstract: Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model’s representational capacity.

[244] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li

Main category: cs.AI

TL;DR: UI-AGILE is a framework enhancing GUI agents with improved training (Continuous Reward, Simple Thinking reward, Cropping-based Resampling) and inference (Decomposed Grounding with Selection) methods, achieving state-of-the-art performance.

DetailsMotivation: Existing GUI agent techniques struggle with reasoning designs, ineffective rewards, and visual noise, limiting their effectiveness.

Method: Proposes training enhancements (Continuous Reward, Simple Thinking reward, Cropping-based Resampling) and inference method (Decomposed Grounding with Selection) to improve GUI agents.

Result: Achieves 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro, setting state-of-the-art performance.

Conclusion: UI-AGILE effectively addresses key challenges in GUI agent training and inference, significantly improving performance.

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from dilemmas in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: (1) a Continuous Reward function to incentivize high-precision grounding; (2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and (3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.
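The inference-time idea, Decomposed Grounding with Selection, can be sketched as "crop, ground, pick the best"; the grid, the stub grounding call, and its confidence scores below are assumptions for illustration:

```python
def decomposed_grounding(image_w, image_h, ground_fn, grid=(2, 2)):
    """Split a high-resolution screen into grid crops, run grounding on
    each, and keep the highest-confidence hit, mapped back to full-image
    coordinates. `ground_fn`: crop box -> (confidence, x, y) in the crop."""
    cw, ch = image_w // grid[0], image_h // grid[1]
    best = None
    for gx in range(grid[0]):
        for gy in range(grid[1]):
            ox, oy = gx * cw, gy * ch
            conf, x, y = ground_fn((ox, oy, ox + cw, oy + ch))
            if best is None or conf > best[0]:
                best = (conf, ox + x, oy + y)   # back to global coordinates
    return best

# Hypothetical stub: the target lives in the top-left crop at (10, 20).
stub = lambda box: (0.9 if box[:2] == (0, 0) else 0.2, 10, 20)
print(decomposed_grounding(2560, 1440, stub))   # (0.9, 10, 20)
```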

cs.SD

[245] Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics

Shreyansh Pathak, Sonu Shreshtha, Richa Singh, Mayank Vatsa

Main category: cs.SD

TL;DR: QPAudioEraser, a quantum-inspired framework, effectively erases voice signatures from biometric models while preserving utility, outperforming existing methods.

DetailsMotivation: Address privacy vulnerabilities in voice-enabled systems by enabling efficient erasure of individual-specific voice data to comply with regulations like GDPR and DPDP Act.

Method: Uses a quantum-inspired approach: weight initialization via destructive interference, superposition-based label transformations, quantum loss function, and entanglement-inspired weight mixing.

Result: Achieves 0% Forget Accuracy (complete erasure) with minimal performance degradation (0.05%) on retained data, outperforming baselines.

Conclusion: QPAudioEraser is a robust solution for privacy-preserving erasure in audio biometric systems.

Abstract: The widespread adoption of voice-enabled authentication and audio biometric systems has significantly increased privacy vulnerabilities associated with sensitive speech data. Compliance with privacy regulations such as GDPR’s right to be forgotten and India’s DPDP Act necessitates targeted and efficient erasure of individual-specific voice signatures from already-trained biometric models. Existing unlearning methods designed for visual data inadequately handle the sequential, temporal, and high-dimensional nature of audio signals, leading to ineffective or incomplete speaker and accent erasure. To address this, we introduce QPAudioEraser, a quantum-inspired audio unlearning framework. Our four-phase approach involves: (1) weight initialization using destructive interference to nullify target features, (2) superposition-based label transformations that obscure class identity, (3) an uncertainty-maximizing quantum loss function, and (4) entanglement-inspired mixing of correlated weights to retain model knowledge. Comprehensive evaluations with ResNet18, ViT, and CNN architectures across AudioMNIST, Speech Commands, LibriSpeech, and Speech Accent Archive datasets validate QPAudioEraser’s superior performance. The framework achieves complete erasure of target data (0% Forget Accuracy) while incurring minimal impact on model utility, with a performance degradation on retained data as low as 0.05%. QPAudioEraser consistently surpasses conventional baselines across single-class, multi-class, sequential, and accent-level erasure scenarios, establishing the proposed approach as a robust privacy-preserving solution.

[246] A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection

Hogeon Yu

Main category: cs.SD

TL;DR: The paper proposes a two-step learning framework for Sound Event Localization and Detection (SELD) to address limitations of single- and dual-branch architectures, improving performance through task-specific feature learning and effective fusion.

DetailsMotivation: Existing SELD methods face optimization conflicts (single-branch) or limited information exchange (dual-branch), necessitating a better approach.

Method: A two-step framework: (1) track-wise reordering for temporal consistency, (2) separate SED and DoA training to prevent interference, followed by feature fusion.

Result: Validated on the 2023 DCASE challenge dataset, the framework outperforms single- and dual-branch methods in event classification and localization.

Conclusion: The proposed framework effectively addresses SELD challenges, enhancing spatial and event representation.

Abstract: Sound Event Localization and Detection (SELD) is crucial in spatial audio processing, enabling systems to detect sound events and estimate their 3D directions. Existing SELD methods use single- or dual-branch architectures: single-branch models share SED and DoA representations, causing optimization conflicts, while dual-branch models separate tasks but limit information exchange. To address this, we propose a two-step learning framework. First, we introduce a track-wise reordering format to maintain temporal consistency, preventing event reassignments across tracks. Next, we train the SED and DoA networks separately to prevent interference and ensure task-specific feature learning. Finally, we effectively fuse DoA and SED features to enhance SELD performance with better spatial and event representation. Experiments on the 2023 DCASE challenge Task 3 dataset validate our framework, showing its ability to overcome single- and dual-branch limitations and improve event classification and localization.

[247] Adaptive Duration Model for Text Speech Alignment

Junjie Cao

Main category: cs.SD

TL;DR: A novel duration prediction framework improves phoneme-level alignment in TTS models, enhancing accuracy and robustness, especially for zero-shot TTS.

DetailsMotivation: Autoregressive TTS models struggle with brittle alignments, while non-autoregressive models rely on external duration sources. A better solution is needed for accurate and adaptable alignment.

Method: Proposes a duration prediction framework that provides phoneme-level duration distributions, improving alignment precision and adaptability.

Result: The model achieves an 11.3% improvement in alignment accuracy and enhances zero-shot TTS robustness against prompt-input mismatch.

Conclusion: The proposed framework offers a significant advancement in TTS alignment, addressing key limitations of existing methods.

Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources, using additional duration models for alignment. In this paper, we propose a novel duration prediction framework that can produce a promising phoneme-level duration distribution for given text. In our experiments, the proposed duration model shows more precise prediction and better condition-adaptation ability compared to previous baseline models. Numerically, it achieves roughly an 11.3 percent improvement in alignment accuracy, and it makes the performance of zero-shot TTS models more robust to the mismatch between prompt audio and input audio.
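As a generic illustration of distributional duration modeling (not the paper's exact parameterization), a duration head can predict a mean and spread per phoneme and be trained with a Gaussian negative log-likelihood:

```python
import numpy as np

def duration_nll(mu, log_sigma, d):
    """Gaussian NLL (up to a constant) for phoneme durations: the model
    predicts a distribution (mu, sigma) per phoneme instead of a single
    point estimate, which is what enables condition adaptation."""
    return np.mean(0.5 * ((d - mu) / np.exp(log_sigma)) ** 2 + log_sigma)

mu = np.array([5.0, 3.0, 8.0])            # predicted mean frames per phoneme
log_sigma = np.log(np.array([1.0, 0.5, 2.0]))
d = np.array([6.0, 3.0, 7.0])             # reference durations
print(round(duration_nll(mu, log_sigma, d), 3))   # ~0.208; lower is better
```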

[248] Next Tokens Denoising for Speech Synthesis

Yanqing Liu, Ruiqing Xue, Chong Zhang, Yufei Liu, Gang Wang, Bohan Li, Yao Qian, Lei He, Shujie Liu, Sheng Zhao

Main category: cs.SD

TL;DR: Dragon-FM unifies AR and flow-matching for TTS, addressing limitations of AR and diffusion models, enabling efficient high-quality audio generation.

DetailsMotivation: AR models lack future context and are slow; diffusion models struggle with KV caching. Dragon-FM aims to overcome these issues.

Method: Combines AR modeling across chunks for coherence and parallel flow-matching within chunks for fast denoising, using 48 kHz audio codec tokens.

Result: Efficiently generates high-quality zero-shot podcasts, leveraging KV-cache and future context.

Conclusion: Dragon-FM bridges AR and diffusion models, offering a scalable solution for extended content generation.

Abstract: While diffusion and autoregressive (AR) models have significantly advanced generative modeling, they each present distinct limitations. AR models, which rely on causal attention, cannot exploit future context and suffer from slow generation speeds. Conversely, diffusion models struggle with key-value (KV) caching. To overcome these challenges, we introduce Dragon-FM, a novel text-to-speech (TTS) design that unifies AR and flow-matching. This model processes 48 kHz audio codec tokens in chunks at a compact rate of 12.5 tokens per second. This design enables AR modeling across chunks, ensuring global coherence, while parallel flow-matching within chunks facilitates fast iterative denoising. Consequently, the proposed model can utilize the KV cache across chunks and incorporate future context within each chunk. Furthermore, it bridges continuous and discrete feature modeling, demonstrating that continuous AR flow-matching can predict discrete tokens with finite scalar quantizers. This efficient codec and fast chunk-autoregressive architecture also makes the proposed model particularly effective for generating extended content. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
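The chunk-autoregressive structure can be sketched as a loop: chunks are generated left-to-right (so a KV cache applies across chunks), while tokens inside a chunk are refined jointly over a few denoising steps. The toy "denoiser" and shapes below are placeholders, not the paper's model:

```python
import numpy as np

def generate_chunked(n_chunks, chunk_len, n_steps, denoise_fn, rng):
    """AR across chunks, parallel iterative denoising within each chunk."""
    history = []                                  # stands in for the KV cache
    for _ in range(n_chunks):
        x = rng.normal(size=chunk_len)            # start the chunk from noise
        for step in range(n_steps):               # joint refinement of chunk
            x = denoise_fn(x, history, step)
        history.append(x)
    return np.concatenate(history)

smooth = lambda x, hist, step: 0.5 * (x + np.roll(x, 1))   # toy denoiser
out = generate_chunked(n_chunks=4, chunk_len=8, n_steps=3,
                       denoise_fn=smooth, rng=np.random.default_rng(6))
print(out.shape)   # (32,)
```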

[249] Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim

Main category: cs.SD

TL;DR: The study explores how acoustic context effects (pitch, rate, timbre) influence vowel perception in L1 and L2 speakers, revealing similar prosodic profiles despite conflicting proximal and distal effects.

DetailsMotivation: To clarify how language background interacts with acoustic context effects in speech perception, particularly for L2 speakers.

Method: A reverse-correlation approach was used to vary pitch and speech rate around vowel pairs for English and French L2 speakers, reconstructing prosodic profiles.

Result: Vowel perception is influenced by conflicting proximal (congruent) and distal (contrastive) effects, with L1 and L2 speakers showing similar prosodic profiles.

Conclusion: The study introduces a novel method to analyze acoustic context effects across stimuli, timescales, and domains, highlighting shared perceptual strategies between L1 and L2 speakers.

Abstract: Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.
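The reverse-correlation logic itself is compact: randomly perturb the context, record responses, and average the perturbation profiles per response; their difference is the prosodic "kernel" that biases perception. The simulated listener below is a made-up stand-in that happens to reproduce a congruent-proximal, contrastive-distal pattern:

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_points = 2000, 7            # 7 pitch samples before the vowel
profiles = rng.normal(size=(n_trials, n_points))

# Hypothetical listener: hears /i/ when late (proximal) pitch is high and
# early (distal) pitch is low -- congruent proximal, contrastive distal.
responses = (profiles[:, -1] - 0.5 * profiles[:, 0]) > 0

kernel = profiles[responses].mean(0) - profiles[~responses].mean(0)
print(np.round(kernel, 2))   # positive at the end, negative at the start
```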

[250] Text-Driven Voice Conversion via Latent State-Space Modeling

Wen Li, Sofia Martinez, Priyanka Shah

Main category: cs.SD

TL;DR: The paper introduces LSS-VC, a Latent State-Space method for text-driven voice conversion, enabling fine-grained control over speaker characteristics and prosody through textual descriptions.

DetailsMotivation: Existing methods rely on direct text-to-speech training, limiting flexibility in controlling nuanced style or timbral features.

Method: The proposed LSS-VC treats utterances as dynamical systems in a latent space, using a state-space model for voice style transformation. It employs an adaptive cross-modal fusion mechanism for style injection.

Result: LSS-VC outperforms baselines in quality metrics, offering smoother style transitions, fewer artifacts, and precise text-based control.

Conclusion: The LSS-VC method provides interpretable and fine-grained control over voice conversion, improving flexibility and quality over existing approaches.

Abstract: Text-driven voice conversion allows customization of speaker characteristics and prosodic elements using textual descriptions. However, most existing methods rely heavily on direct text-to-speech training, limiting their flexibility in controlling nuanced style elements or timbral features. In this paper, we propose a novel Latent State-Space approach for text-driven voice conversion (LSS-VC). Our method treats each utterance as an evolving dynamical system in a continuous latent space. Drawing inspiration from Mamba, which introduced a state-space model for efficient text-driven image style transfer, we adapt a loosely related methodology for voice style transformation. Specifically, we learn a voice latent manifold where style and content can be manipulated independently by textual style prompts. We propose an adaptive cross-modal fusion mechanism to inject style information into the voice latent representation, enabling interpretable and fine-grained control over speaker identity, speaking rate, and emphasis. Extensive experiments show that our approach significantly outperforms recent baselines in both subjective and objective quality metrics, while offering smoother transitions between styles, reduced artifacts, and more precise text-based style control.
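The underlying abstraction, an utterance evolving as a dynamical system in latent space with style injected into the dynamics, reduces to a simple recurrence. This is a toy linear stand-in for the paper's learned, cross-modally fused model; all matrices and the style vector are placeholders:

```python
import numpy as np

def latent_step(x, u, A, B, style):
    """One latent state-space step, x' = A x + B u + style, where `style`
    is a text-derived bias steering the voice trajectory."""
    return A @ x + B @ u + style

d, du = 8, 4
rng = np.random.default_rng(4)
A = 0.9 * np.eye(d)                      # stable toy dynamics
B = rng.normal(scale=0.1, size=(d, du))
style = rng.normal(scale=0.05, size=d)   # e.g. an embedded "slow, emphatic"
x = np.zeros(d)
for _ in range(10):                      # roll the utterance state forward
    x = latent_step(x, rng.normal(size=du), A, B, style)
print(np.round(x, 2))
```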

[251] BENYO-S2ST-Corpus-1: A Bilingual English-to-Yoruba Direct Speech-to-Speech Translation Corpus

Emmanuel Adetiba, Abdultaofeek Abayomi, Raymond J. Kala, Ayodele H. Ifijeh, Oluwatobi E. Dare, Olabode Idowu-Bismark, Gabriel O. Sobola, Joy N. Adetiba, Monsurat Adepeju Lateef, Heather Cole-Lewis

Main category: cs.SD

TL;DR: The study introduces BENYO-S2ST-Corpus-1, a bilingual English-to-Yoruba speech-to-speech translation dataset, addressing the shortage for high-to-low resource language pairs. It uses a hybrid approach combining existing Yoruba data with AI-generated English audios and an audio augmentation algorithm.

DetailsMotivation: To address the lack of S2ST datasets for English-to-Yoruba and similar high-to-low resource language pairs, which hinders translation model development.

Method: Leveraged existing Yoruba audio data (YORULECT Corpus) and generated English audios using AI (Facebook MMS). Developed an audio augmentation algorithm (AcoustAug) to expand the dataset.

Result: Created BENYO-S2ST-Corpus-1 with 24,064 samples (12,032 per language) totaling 41.20 hours. Also built a Yoruba TTS model (YoruTTS-1.5) with moderate pitch similarity (F0 RMSE: 63.54).

Conclusion: The corpus and method can aid in curating datasets for other high-to-low resource African languages, bridging digital divides. The resources are publicly available.

Abstract: There is a major shortage of Speech-to-Speech Translation (S2ST) datasets for high resource-to-low resource language pairs such as English-to-Yoruba. Thus, in this study, we curated the Bilingual English-to-Yoruba Speech-to-Speech Translation Corpus Version 1 (BENYO-S2ST-Corpus-1). The corpus is based on a hybrid architecture we developed for large-scale direct S2ST corpus creation at reduced cost. To achieve this, we leveraged existing (non-S2ST) Standard Yoruba (SY) real-time audios and transcripts in the YORULECT Corpus as well as the corresponding Standard English (SE) transcripts. The YORULECT Corpus is small-scale (1,504 samples) and does not have paired English audios. Therefore, we generated the SE audios using pre-trained AI models (i.e., Facebook MMS). We also developed an audio augmentation algorithm named AcoustAug, based on three latent acoustic features, to generate augmented audios from the raw audios of the two languages. BENYO-S2ST-Corpus-1 has 12,032 audio samples per language, which gives a total of 24,064 samples. The total audio duration for the two languages is 41.20 hours. This size is quite significant. Beyond building S2ST models, BENYO-S2ST-Corpus-1 can be used to build pretrained models or improve existing ones. The created corpus and the Coqui framework were used to build a pretrained Yoruba TTS model (named YoruTTS-1.5) as a proof of concept. YoruTTS-1.5 gave an F0 RMSE value of 63.54 after 1,000 epochs, which indicates moderate fundamental pitch similarity with the reference real-time audio. Ultimately, the corpus architecture in this study can be leveraged by researchers and developers to curate datasets for multilingual high-resource-to-low-resource African languages. This will help bridge the huge digital divide in translation between high- and low-resource language pairs. BENYO-S2ST-Corpus-1 and YoruTTS-1.5 are publicly available at (https://bit.ly/40bGMwi).
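The abstract does not specify AcoustAug's three latent acoustic features, so the sketch below shows only the generic shape of waveform augmentation for corpus expansion (speed, gain, additive noise), not the paper's algorithm:

```python
import numpy as np

def augment(y, rate=1.1, gain_db=3.0, snr_db=30.0, seed=0):
    """Generic waveform augmentation: resample for a speed change, scale
    the gain, then add Gaussian noise at a target SNR."""
    t_new = np.arange(0, len(y), rate)            # speed change
    y2 = np.interp(t_new, np.arange(len(y)), y)
    y2 *= 10 ** (gain_db / 20)                    # gain in dB
    noise = np.random.default_rng(seed).normal(size=len(y2))
    sig_pow = np.mean(y2 ** 2)                    # scale noise to target SNR
    noise *= np.sqrt(sig_pow / (10 ** (snr_db / 10)) / np.mean(noise ** 2))
    return y2 + noise

tone = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
print(augment(tone).shape)   # shorter than the input: sped up by ~1.1x
```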

cs.LG

[252] CIMR: Contextualized Iterative Multimodal Reasoning for Robust Instruction Following in LVLMs

Yangshu Yuan, Heng Chen, Xinyi Jiang, Christian Ng, Kexin Qiu

Main category: cs.LG

TL;DR: CIMR, a new framework for multi-modal reasoning, outperforms state-of-the-art models by integrating iterative self-correction and dynamic feedback.

DetailsMotivation: Addressing the limitations of LLMs and LVLMs in handling complex multi-step, multi-modal tasks requiring logical reasoning and iterative refinement.

Method: Proposes CIMR with two stages: initial reasoning and iterative refinement using multi-modal feedback, enhanced by a dynamic fusion module.

Result: Achieves 91.5% accuracy on the MAP dataset, surpassing GPT-4V (89.2%) and other models.

Conclusion: CIMR’s iterative reasoning and self-correction significantly improve performance in complex multi-modal tasks.

Abstract: The rapid advancement of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) has enhanced our ability to process and generate human language and visual information. However, these models often struggle with complex, multi-step multi-modal instructions that require logical reasoning, dynamic feedback integration, and iterative self-correction. To address this, we propose CIMR: Contextualized Iterative Multimodal Reasoning, a novel framework that introduces a context-aware iterative reasoning and self-correction module. CIMR operates in two stages: initial reasoning and response generation, followed by iterative refinement using parsed multi-modal feedback. A dynamic fusion module deeply integrates textual, visual, and contextual features at each step. We fine-tune LLaVA-1.5-7B on the Visual Instruction Tuning (VIT) dataset and evaluate CIMR on the newly introduced Multi-modal Action Planning (MAP) dataset. CIMR achieves 91.5% accuracy, outperforming state-of-the-art models such as GPT-4V (89.2%), LLaVA-1.5 (78.5%), MiniGPT-4 (75.3%), and InstructBLIP (72.8%), demonstrating the efficacy of its iterative reasoning and self-correction capabilities in complex tasks.
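
The two-stage loop is straightforward to sketch. Below is a hypothetical rendering of the generate-critique-revise cycle with stub callables standing in for the fine-tuned LVLM; the feedback format and prompt template are assumptions, not the authors' code.

```python
def cimr_answer(generate, critique, image, instruction, max_rounds=3):
    """Stage 1: initial reasoning; Stage 2: iterative refinement from feedback."""
    response = generate(image, instruction)
    for _ in range(max_rounds):
        feedback = critique(image, instruction, response)  # parsed multi-modal feedback
        if feedback["consistent"]:
            break  # self-check passed; stop refining
        # Fuse instruction, prior response, and feedback for the next attempt.
        revised = (f"{instruction}\nPrevious answer: {response}\n"
                   f"Feedback: {feedback['text']}\nRevise the answer.")
        response = generate(image, revised)
    return response

# Stub model interfaces, standing in for a fine-tuned LVLM such as LLaVA-1.5.
gen = lambda img, prompt: "plan v2" if "Feedback" in prompt else "plan v1"
crit = lambda img, prompt, resp: {"consistent": resp.endswith("v2"),
                                  "text": "missing the second step"}
print(cimr_answer(gen, crit, image=None, instruction="Plan the task"))  # plan v2
```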

[253] Prototype-Guided Pseudo-Labeling with Neighborhood-Aware Consistency for Unsupervised Adaptation

Eman Ali, Chetan Arora, Muhammad Haris Khan

Main category: cs.LG

TL;DR: Proposes an adaptive pseudo-labeling framework for CLIP, improving unsupervised adaptation by integrating prototype and neighborhood consistency, outperforming existing methods.

DetailsMotivation: Pseudo-labels from zero-shot predictions in CLIP are noisy under domain shifts or complex visuals, and fixed-threshold filtering is unreliable.

Method: Introduces PICS (assessing pseudo-label accuracy via feature compactness/separation) and NALR (refining labels via semantic neighborhood similarities), with adaptive weighting.

Result: Achieves state-of-the-art performance on 11 benchmark datasets, providing more accurate pseudo-labels efficiently.

Conclusion: The framework enhances CLIP’s adaptation by dynamically refining pseudo-labels, proving effective in unsupervised settings.

Abstract: In unsupervised adaptation for vision-language models such as CLIP, pseudo-labels derived from zero-shot predictions often exhibit significant noise, particularly under domain shifts or in visually complex scenarios. Conventional pseudo-label filtering approaches, which rely on fixed confidence thresholds, tend to be unreliable in fully unsupervised settings. In this work, we propose a novel adaptive pseudo-labeling framework that enhances CLIP’s adaptation performance by integrating prototype consistency and neighborhood-based consistency. The proposed method comprises two key components: PICS, which assesses pseudo-label accuracy based on in-class feature compactness and cross-class feature separation; and NALR, which exploits semantic similarities among neighboring samples to refine pseudo-labels dynamically. Additionally, we introduce an adaptive weighting mechanism that adjusts the influence of pseudo-labeled samples during training according to their estimated correctness. Extensive experiments on 11 benchmark datasets demonstrate that our method achieves state-of-the-art performance in unsupervised adaptation scenarios, delivering more accurate pseudo-labels while maintaining computational efficiency.
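
To make the neighborhood-refinement idea concrete, here is a small sketch in the spirit of NALR: a sample's zero-shot class probabilities are blended with those of its nearest neighbors in feature space. The mixing rule and the parameter `alpha` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def refine_pseudo_labels(features, probs, k=5, alpha=0.5):
    # features: (n, d) L2-normalized embeddings; probs: (n, c) zero-shot probabilities
    sims = features @ features.T                 # cosine similarities
    np.fill_diagonal(sims, -np.inf)              # exclude self-matches
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # k nearest neighbors per sample
    neighbor_probs = probs[nn_idx].mean(axis=1)  # average neighbor predictions
    refined = alpha * probs + (1 - alpha) * neighbor_probs  # assumed mixing rule
    return refined.argmax(axis=1), refined.max(axis=1)      # labels + confidence

feats = np.random.randn(100, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
p = np.random.dirichlet(np.ones(10), size=100)
labels, conf = refine_pseudo_labels(feats, p)
```

The confidence output would then feed an adaptive weighting of pseudo-labeled samples during training, as the abstract describes.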

[254] Towards Interpretable Renal Health Decline Forecasting via Multi-LMM Collaborative Reasoning Framework

Peng-Yi Wu, Pei-Cing Huang, Ting-Yu Chen, Chantung Ku, Ming-Yen Lin, Yihuang Kang

Main category: cs.LG

TL;DR: A collaborative framework enhances open-source Large Multimodal Models (LMMs) for eGFR prediction, improving accuracy and interpretability while addressing deployment challenges.

DetailsMotivation: Accurate and interpretable eGFR prediction is crucial for CKD management, but existing LMMs face deployment, privacy, and reliability issues.

Method: The framework uses visual knowledge transfer, abductive reasoning, and short-term memory to boost LMM performance and interpretability.

Result: The method matches proprietary models in predictive performance and interpretability, offering clinically meaningful explanations.

Conclusion: The framework advances healthcare AI by balancing accuracy with interpretability, aiding clinical decision-making.

Abstract: Accurate and interpretable prediction of estimated glomerular filtration rate (eGFR) is essential for managing chronic kidney disease (CKD) and supporting clinical decisions. Recent advances in Large Multimodal Models (LMMs) have shown strong potential in clinical prediction tasks due to their ability to process visual and textual information. However, challenges related to deployment cost, data privacy, and model reliability hinder their adoption. In this study, we propose a collaborative framework that enhances the performance of open-source LMMs for eGFR forecasting while generating clinically meaningful explanations. The framework incorporates visual knowledge transfer, abductive reasoning, and a short-term memory mechanism to enhance prediction accuracy and interpretability. Experimental results show that the proposed framework achieves predictive performance and interpretability comparable to proprietary models. It also provides plausible clinical reasoning processes behind each prediction. Our method sheds new light on building AI systems for healthcare that combine predictive accuracy with clinically grounded interpretability.

[255] Test-time Prompt Refinement for Text-to-Image Models

Mohammad Abdul Hafeez Khan, Yash Jain, Siddhartha Bhattacharyya, Vibhav Vineet

Main category: cs.LG

TL;DR: A closed-loop framework (TIR) refines prompts iteratively using a pretrained MLLM to improve text-to-image generation alignment without retraining the T2I model.

DetailsMotivation: Address prompt sensitivity in T2I models, where minor wording changes cause inconsistent outputs.

Method: Uses a pretrained MLLM to analyze images and prompts, refining prompts iteratively for better alignment.

Result: Improves alignment and visual coherence across benchmark datasets, maintaining plug-and-play compatibility.

Conclusion: TIR effectively mirrors human iterative refinement, enhancing T2I model outputs without additional training.

Abstract: Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model, termed TIR. In our approach, each generation step is followed by a refinement step, where a pretrained multimodal large language model (MLLM) analyzes the output image and the user’s prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined and physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
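
The closed loop reduces to a short routine: generate, check alignment with the MLLM, re-prompt. The sketch below uses stub callables for the black-box T2I model and the alignment checker; the report fields are assumptions, not the paper's interfaces.

```python
def tir_generate(t2i, check, user_prompt, max_rounds=3):
    """Iteratively refine the prompt until the MLLM judges the image aligned."""
    prompt, image = user_prompt, None
    for _ in range(max_rounds):
        image = t2i(prompt)
        report = check(image, user_prompt)   # detects missing objects/attributes
        if report["aligned"]:
            break
        prompt = report["refined_prompt"]    # grounded rewrite for the next round
    return image, prompt

# Stub black-box T2I model and MLLM alignment checker.
t2i = lambda p: f"<image for: {p}>"
check = lambda img, p: {"aligned": "red" in img,
                        "refined_prompt": p + ", with the red ball clearly visible"}
img, final_prompt = tir_generate(t2i, check, "a dog next to a ball")
print(final_prompt)
```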

[256] Multi-fidelity Bayesian Data-Driven Design of Energy Absorbing Spinodoid Cellular Structures

Leo Guo, Hirak Kansara, Siamak F. Khosroshahi, GuoQi Zhang, Wei Tan

Main category: cs.LG

TL;DR: The paper compares Bayesian optimization (BO) and multi-fidelity BO (MFBO) for optimizing energy absorption in spinodoid structures, showing MFBO outperforms BO by up to 11%. It also addresses sampling quality and design sensitivity analysis.

DetailsMotivation: To reconcile the increasing computational cost of FE simulations with data-driven design demands, and to compare BO and MFBO in a real-world engineering problem.

Method: Uses Sobol’ samples and variance-based sensitivity analysis to reduce design complexity, then applies and compares BO and MFBO for optimizing spinodoid structures.

Result: MFBO outperforms BO by up to 11% in maximizing energy absorption, demonstrating its effectiveness for expensive data-driven design problems.

Conclusion: MFBO is a superior method for optimizing expensive objectives like energy absorption, and the open-source results support broader use of multi-fidelity techniques.

Abstract: Finite element (FE) simulations of structures and materials are getting increasingly more accurate, but also more computationally expensive as a collateral result. This development happens in parallel with a growing demand for data-driven design. To reconcile the two, a robust and data-efficient optimization method called Bayesian optimization (BO) has been previously established as a technique to optimize expensive objective functions. In parallel, the mesh width of an FE model can be exploited to evaluate an objective at a lower or higher fidelity (cost & accuracy) level. The multi-fidelity setting applied to BO, called multi-fidelity BO (MFBO), has also seen previous success. However, BO and MFBO have not seen a direct comparison on a real-life engineering problem, such as metamaterial design for deformation and absorption qualities. Moreover, sampling quality and design parameter sensitivity assessment are often underrepresented parts of data-driven design. This paper aims to address these shortcomings by employing Sobol' samples with variance-based sensitivity analysis in order to reduce design problem complexity. Furthermore, this work describes, implements, applies, and compares the performance of BO with that of MFBO when maximizing the energy absorption (EA) of spinodoid cellular structures. The findings show that MFBO is an effective way to maximize the EA of a spinodoid structure and is able to outperform BO by up to 11% across various hyperparameter settings. The results, which are made open-source, serve to support the utility of multi-fidelity techniques across expensive data-driven design problems.
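
The screening step the paper describes, Sobol' sampling plus variance-based sensitivity analysis, can be sketched with SALib. The parameter names, bounds, and toy objective below are illustrative stand-ins for the spinodoid design variables and the FE evaluation.

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["rel_density", "theta_1", "theta_2"],   # assumed design parameters
    "bounds": [[0.3, 0.7], [0.0, 90.0], [0.0, 90.0]],
}

X = saltelli.sample(problem, 1024)   # Sobol'-based sample matrix

def energy_absorption(x):            # stand-in for an expensive FE evaluation
    return x[0] ** 2 + 0.1 * np.sin(np.radians(x[1])) + 0.01 * x[2]

Y = np.apply_along_axis(energy_absorption, 1, X)
Si = sobol.analyze(problem, Y)
print(Si["S1"])   # first-order indices: drop parameters with negligible S1
```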

[257] The challenge of hidden gifts in multi-agent reinforcement learning

Dane Malenfant, Blake A. Richards

Main category: cs.LG

TL;DR: The paper explores ‘hidden gifts’ in MARL, where agents unknowingly benefit from others’ actions. A grid-world task reveals state-of-the-art MARL algorithms struggle with collective rewards due to hidden actions, but independent agents succeed with action history and a derived correction term.

DetailsMotivation: To understand how 'hidden gifts'—unobserved beneficial actions by others—impact credit assignment in MARL and test if algorithms can learn collective rewards despite this challenge.

Method: A grid-world task where agents must unlock doors with a shared key, requiring them to drop it for others (a hidden gift). Tests various RL and MARL algorithms, including independent agents with action history and a derived correction term.

Result: MARL algorithms fail to achieve collective rewards, but independent agents with action history succeed. A correction term further improves their reliability.

Conclusion: Hidden gifts complicate credit assignment in MARL. Independent agents with learning awareness can overcome this, suggesting potential improvements for MARL approaches.

Abstract: Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These “hidden gifts” represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a very simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus the act of dropping the key for others is a “hidden gift”. We show that several different state-of-the-art RL algorithms, including MARL algorithms, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that independent model-free policy gradient agents can solve the task when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for these independent agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of “hidden gifts”, and demonstrate that learning awareness in independent agents can benefit these settings.

[258] Prediction of acoustic field in 1-D uniform duct with varying mean flow and temperature using neural networks

D. Veerababu, Prasanta K. Ghosh

Main category: cs.LG

TL;DR: Neural networks solve sound propagation in ducts with heterogeneous media, validated against traditional methods, and explore temperature gradient effects.

DetailsMotivation: To develop a neural network-based numerical tool for solving acoustic problems constrained by physical laws.

Method: Convert the governing equation into an unconstrained optimization problem solved using neural networks, predicting acoustic variables and validating with Runge-Kutta.

Result: Both acoustic pressure and particle velocity are accurately predicted, and the impact of temperature gradients is analyzed.

Conclusion: Neural networks, aided by transfer learning and automatic differentiation, are effective for acoustic applications.

Abstract: Neural networks constrained by physical laws have emerged as an alternative numerical tool. In this paper, the governing equation that represents the propagation of sound inside a one-dimensional duct carrying a heterogeneous medium is derived. The problem is converted into an unconstrained optimization problem and solved using neural networks. Both acoustic state variables, acoustic pressure and particle velocity, are predicted and validated against a traditional Runge-Kutta solver. The effect of the temperature gradient on the acoustic field is studied. The utilization of machine learning techniques such as transfer learning and automatic differentiation for acoustic applications is demonstrated.

[259] Shape Invariant 3D-Variational Autoencoder: Super Resolution in Turbulence flow

Anuraj Maurya

Main category: cs.LG

TL;DR: A review of deep learning and classical methods for turbulence modeling, focusing on multiscale integration and super-resolution reconstruction.

DetailsMotivation: To leverage deep learning for extracting insights from high-dimensional turbulence data and address challenges in fluid dynamics.

Method: Overview of classical and deep learning approaches, with focus on multiscale turbulence models and deep generative models for super-resolution.

Result: Deep learning enhances turbulence modeling by integrating multiscale data and improving reconstruction accuracy.

Conclusion: Deep learning offers promising tools for advancing turbulence modeling, particularly in multiscale integration and super-resolution applications.

Abstract: Deep learning provides a versatile suite of methods for extracting structured information from complex datasets, enabling deeper understanding of underlying fluid dynamic phenomena. The field of turbulence modeling, in particular, benefits from the growing availability of high-dimensional data obtained through experiments, field observations, and large-scale simulations spanning multiple spatio-temporal scales. This report presents a concise overview of both classical and deep learning-based approaches to turbulence modeling. It further investigates two specific challenges at the intersection of fluid dynamics and machine learning: the integration of multiscale turbulence models with deep learning architectures, and the application of deep generative models for super-resolution reconstruction.

[260] Principled Curriculum Learning using Parameter Continuation Methods

Harsh Nilesh Pathak, Randy Paffenroth

Main category: cs.LG

TL;DR: A parameter continuation method for neural network optimization, linking to homotopies and curriculum learning, shows better generalization than ADAM in supervised and unsupervised tasks.

DetailsMotivation: To improve neural network optimization by leveraging connections between parameter continuation, homotopies, and curriculum learning.

Method: Proposes a theoretically justified parameter continuation method for optimizing neural networks.

Result: Outperforms state-of-the-art techniques like ADAM in generalization for supervised and unsupervised learning.

Conclusion: The method is effective and theoretically sound, offering superior performance in deep learning tasks.

Abstract: In this work, we propose a parameter continuation method for the optimization of neural networks. There is a close connection between parameter continuation, homotopies, and curriculum learning. The methods we propose here are theoretically justified and practically effective for several problems in deep neural networks. In particular, we demonstrate better generalization performance than state-of-the-art optimization techniques such as ADAM for supervised and unsupervised learning tasks.

[261] Hybrid activation functions for deep neural networks: S3 and S4 – a novel approach to gradient flow optimization

Sergii Kavun

Main category: cs.LG

TL;DR: The paper introduces hybrid activation functions S3 and S4, with S4 outperforming baselines in accuracy, convergence, and gradient stability.

DetailsMotivation: Address limitations of traditional activation functions (e.g., dead neurons in ReLU, vanishing gradients in sigmoid/tanh).

Method: Propose S3 (Sigmoid-Softsign hybrid) and S4 (smoothed S3 with tunable parameter k). Tested on classification and regression tasks.

Result: S4 achieved 97.4% accuracy (MNIST), 96.0% (Iris), and 18.7 MSE (Boston Housing), with faster convergence and stable gradients.

Conclusion: Hybrid activation functions like S4, with tunable parameters, improve neural network training and performance.

Abstract: Activation functions are critical components in deep neural networks, directly influencing gradient flow, training stability, and model performance. Traditional functions like ReLU suffer from dead neuron problems, while sigmoid and tanh exhibit vanishing gradient issues. We introduce two novel hybrid activation functions: S3 (Sigmoid-Softsign) and its improved version S4 (smoothed S3). S3 combines sigmoid for negative inputs with softsign for positive inputs, while S4 employs a smooth transition mechanism controlled by a steepness parameter k. We conducted comprehensive experiments across binary classification, multi-class classification, and regression tasks using three different neural network architectures. S4 demonstrated superior performance compared to nine baseline activation functions, achieving 97.4% accuracy on MNIST, 96.0% on Iris classification, and 18.7 MSE on Boston Housing regression. The function exhibited faster convergence than ReLU and maintained stable gradient flow across network depths. Comparative analysis revealed a bounded S4 gradient range of [0.24, 0.59], whereas ReLU produced 18% dead neurons in deep networks. The S4 activation function addresses key limitations of existing functions through its hybrid design and smooth transition mechanism. The tunable parameter k allows adaptation to different tasks and network depths, making S4 a versatile choice for deep learning applications. These findings suggest that hybrid activation functions represent a promising direction for improving neural network training dynamics.
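
A hedged sketch of the two activations as the abstract describes them: S3 hard-switches between sigmoid (negative inputs) and softsign (positive inputs), while S4 replaces the switch with a smooth blend controlled by the steepness k. The specific sigmoid-gated blend below is an assumed reading, not the paper's formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    return x / (1.0 + np.abs(x))

def s3(x):
    # Hard switch at x = 0, as the abstract describes (note the jump: 0.5 vs 0).
    return np.where(x < 0, sigmoid(x), softsign(x))

def s4(x, k=5.0):
    # Assumed smoothing: a sigmoid gate interpolates between the two branches.
    gate = sigmoid(k * x)            # ~0 for x << 0, ~1 for x >> 0
    return (1.0 - gate) * sigmoid(x) + gate * softsign(x)

x = np.linspace(-4, 4, 9)
print(s3(x))
print(s4(x))   # smooth near x = 0, unlike s3
```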

[262] Spatial-Temporal Reinforcement Learning for Network Routing with Non-Markovian Traffic

Molly Wang

Main category: cs.LG

TL;DR: A spatial-temporal RL approach using GNNs and RNNs improves routing in dynamic networks by capturing spatial and temporal dynamics, outperforming traditional RL methods.

DetailsMotivation: Standard RL and MDP frameworks fail in non-Markovian scenarios and lack spatial awareness, limiting optimal routing in complex networks.

Method: Integrates GNNs for spatial topology and RNNs for temporal traffic patterns to enhance RL-based routing decisions.

Result: Outperforms traditional RL techniques and shows robustness to topology changes.

Conclusion: The proposed spatial-temporal RL method effectively addresses limitations of traditional RL in dynamic network routing.

Abstract: Reinforcement Learning (RL) has become a well-established approach for optimizing packet routing in communication networks. Standard RL algorithms are typically based on the Markov Decision Process (MDP), which assumes that the current state of the environment provides all the necessary information for system evolution and decision-making. However, this Markovian assumption is invalid in many practical scenarios, making the MDP and RL frameworks inadequate to produce the optimal solutions. Additionally, traditional RL algorithms often employ function approximations (e.g., by neural networks) that do not explicitly capture the spatial relationships inherent in environments with complex network topologies. Communication networks are characterized by dynamic traffic patterns and arbitrary numbers of nodes and links, which further complicate the decision-making process. To address these challenges, we propose a spatial-temporal RL approach that integrates Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) to adequately capture the spatial dynamics regarding network topology and temporal traffic patterns, respectively, to enhance routing decisions. Our evaluation demonstrates that the proposed method outperforms traditional RL techniques and is more robust to changes in the network topology.

[263] Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu

Main category: cs.LG

TL;DR: SimVQ addresses representation collapse in Vector Quantization by reparameterizing code vectors via a learnable linear transformation, improving codebook usage and scalability.

DetailsMotivation: Existing VQ methods suffer from representation collapse, limiting scalability and codebook utilization due to disjoint codebook optimization.

Method: SimVQ reparameterizes code vectors using a learnable linear transformation layer over a latent basis, optimizing the entire linear space instead of individual code vectors.

Result: SimVQ improves codebook usage, is easy to implement, and generalizes well across image and audio tasks.

Conclusion: SimVQ effectively prevents collapse and enhances VQ performance without compromising model capacity.

Abstract: Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose \textbf{Sim}ple\textbf{VQ}, which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the \textit{entire linear space} rather than nearest \textit{individual code vectors}. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures.
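
The reparameterization is compact enough to sketch directly: keep a latent basis fixed and learn a single linear layer on top of it, so gradients move the codebook's entire linear space rather than individual code vectors. Freezing the basis and the straight-through estimator below are assumptions about details the abstract leaves open.

```python
import torch
import torch.nn as nn

class SimVQSketch(nn.Module):
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        # Frozen latent basis (one plausible reading); only the projection learns.
        self.basis = nn.Parameter(torch.randn(num_codes, dim), requires_grad=False)
        self.proj = nn.Linear(dim, dim, bias=False)   # the one learnable linear layer

    def codebook(self):
        return self.proj(self.basis)                  # reparameterized code vectors

    def forward(self, z):                             # z: (batch, dim) encoder output
        codes = self.codebook()
        idx = torch.cdist(z, codes).argmin(dim=1)     # nearest-code lookup
        z_q = codes[idx]
        # Straight-through estimator so the encoder still receives gradients.
        return z + (z_q - z).detach(), idx

vq = SimVQSketch(num_codes=1024, dim=64)
z_q, idx = vq(torch.randn(8, 64))
print(z_q.shape, idx.shape)   # torch.Size([8, 64]) torch.Size([8])
```

Because every code is `proj(basis_row)`, a gradient step on `proj` updates all codes at once, which is the mechanism the paper credits with preventing collapse.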

[264] SourceSplice: Source Selection for Machine Learning Tasks

Ambarish Singh, Romila Pradhan

Main category: cs.LG

TL;DR: The paper introduces SourceGrasp and SourceSplice, frameworks for selecting optimal data source subsets to enhance ML model performance, addressing gaps in prior data discovery methods.

DetailsMotivation: Existing data discovery methods neglect source quality for ML tasks, leading to suboptimal model performance. This work aims to improve ML outcomes by selecting high-quality data sources.

Method: Proposes SourceGrasp (metaheuristic with greediness and randomization) and SourceSplice (inspired by gene splicing) for efficient source selection. Evaluated on real-world and synthetic datasets.

Result: SourceSplice outperforms with fewer subset explorations, identifying high-utility data sources. Sensitivity studies validate its robustness.

Conclusion: The frameworks effectively address data source selection for ML tasks, with SourceSplice showing superior performance and adaptability.

Abstract: Data quality plays a pivotal role in the predictive performance of machine learning (ML) tasks - a challenge amplified by the deluge of data sources available in modern organizations. Prior work in data discovery largely focuses on metadata matching, semantic similarity, or identifying tables that should be joined to answer a particular query, but does not consider source quality for high performance of the downstream ML task. This paper addresses the problem of determining the best subset of data sources that must be combined to construct the underlying training dataset for a given ML task. We propose SourceGrasp and SourceSplice, frameworks designed to efficiently select a suitable subset of sources that maximizes the utility of the downstream ML model. Both algorithms rely on the core idea that sources (or their combinations) contribute differently to the task utility, and must be judiciously chosen. While SourceGrasp utilizes a metaheuristic based on a greediness criterion and randomization, the SourceSplice framework presents a source selection mechanism inspired by gene splicing - a core concept used in protein synthesis. We empirically evaluate our algorithms on three real-world datasets and synthetic datasets and show that, with significantly fewer subset explorations, SourceSplice effectively identifies subsets of data sources leading to high task utility. We also conduct studies reporting the sensitivity of SourceSplice to the decision choices under several settings.

[265] Measuring Time-Series Dataset Similarity using Wasserstein Distance

Hongjie Chen, Akshay Mehra, Josh Kimball, Ryan A. Rossi

Main category: cs.LG

TL;DR: A distribution-based method using Wasserstein distance to measure time-series dataset similarity, showing effectiveness in model selection and performance estimation.

DetailsMotivation: The need to measure time-series dataset similarity for tasks like model selection, finetuning, and visualization.

Method: Proposes using Wasserstein distance between multivariate normal distributions representing time-series datasets.

Result: High correlation (>0.60) between the proposed measure and inference loss, aiding in identifying similar datasets and performance estimation.

Conclusion: The Wasserstein distance-based method effectively measures time-series dataset similarity and supports foundation model evaluation.

Abstract: The emergence of time-series foundation model research elevates the growing need to measure the (dis)similarity of time-series datasets. A time-series dataset similarity measure aids research in multiple ways, including model selection, finetuning, and visualization. In this paper, we propose a distribution-based method to measure time-series dataset similarity by leveraging the Wasserstein distance. We consider a time-series dataset an empirical instantiation of an underlying multivariate normal distribution (MVN). The similarity between two time-series datasets is thus computed as the Wasserstein distance between their corresponding MVNs. Comprehensive experiments and visualization show the effectiveness of our approach. Specifically, we show how the Wasserstein distance helps identify similar time-series datasets and facilitates inference performance estimation of foundation models in both out-of-distribution and transfer learning evaluation, with high correlations between our proposed measure and the inference loss (>0.60).
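
Since the 2-Wasserstein distance between Gaussians has a closed form, the proposed measure is easy to sketch. Fitting an MVN to each dataset is exactly what the abstract describes; the windowing of each time series into an (n, d) sample matrix is an assumed preprocessing choice.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, cov1, mu2, cov2):
    # W2^2 = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1^{1/2} cov2 cov1^{1/2})^{1/2})
    root1 = sqrtm(cov1)
    cross = sqrtm(root1 @ cov2 @ root1).real   # discard floating-point imaginary dust
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    return np.sqrt(max(d2, 0.0))

def dataset_similarity(X, Y):
    # X, Y: (num_windows, num_features) matrices built from the two datasets
    return gaussian_w2(X.mean(0), np.cov(X, rowvar=False),
                       Y.mean(0), np.cov(Y, rowvar=False))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(500, 8))
B = rng.normal(0.5, 1.2, size=(500, 8))
print(dataset_similarity(A, B))
```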

[266] CTG-Insight: A Multi-Agent Interpretable LLM Framework for Cardiotocography Analysis and Classification

Black Sun, Die, Hu

Main category: cs.LG

TL;DR: CTG-Insight is a multi-agent LLM system for interpretable fetal monitoring, achieving high accuracy (96.4%) and F1-score (97.8%) by decomposing CTG data into medically defined features.

DetailsMotivation: Current remote fetal monitoring systems lack interpretability, making raw CTG data hard for expectant parents to understand.

Method: CTG-Insight uses a multi-agent LLM system to analyze CTG traces, breaking them into five medical features (baseline, variability, accelerations, decelerations, sinusoidal pattern) and synthesizing results with a natural language explanation.

Result: Achieves state-of-the-art accuracy (96.4%) and F1-score (97.8%) on the NeuroFetalNet Dataset, outperforming deep learning and single-agent LLM baselines.

Conclusion: CTG-Insight provides an interpretable and extensible framework for CTG analysis, improving transparency in fetal health monitoring.

Abstract: Remote fetal monitoring technologies are becoming increasingly common. Yet, most current systems offer limited interpretability, leaving expectant parents with raw cardiotocography (CTG) data that is difficult to understand. In this work, we present CTG-Insight, a multi-agent LLM system that provides structured interpretations of fetal heart rate (FHR) and uterine contraction (UC) signals. Drawing from established medical guidelines, CTG-Insight decomposes each CTG trace into five medically defined features: baseline, variability, accelerations, decelerations, and sinusoidal pattern, each analyzed by a dedicated agent. A final aggregation agent synthesizes the outputs to deliver a holistic classification of fetal health, accompanied by a natural language explanation. We evaluate CTG-Insight on the NeuroFetalNet Dataset and compare it against deep learning models and the single-agent LLM baseline. Results show that CTG-Insight achieves state-of-the-art accuracy (96.4%) and F1-score (97.8%) while producing transparent and interpretable outputs. This work contributes an interpretable and extensible CTG analysis framework.

[267] Explainability-Driven Feature Engineering for Mid-Term Electricity Load Forecasting in ERCOT’s SCENT Region

Abhiram Bhupatiraju, Sung Bum Ahn

Main category: cs.LG

TL;DR: Comparative analysis of machine learning models (Linear Regression, XGBoost, LightGBM, LSTM) for midterm electricity load forecasting, emphasizing SHAP for explainability.

DetailsMotivation: Accurate load forecasting is critical for power system operations, maintenance, and financial planning due to weather and temporal dynamics.

Method: Evaluates Linear Regression, XGBoost, LightGBM, and LSTM for forecasting, using SHAP for feature contribution analysis.

Result: Highlights the importance of SHAP for improving model transparency and accuracy in midterm load forecasting.

Conclusion: Machine learning models, combined with SHAP, enhance forecasting accuracy and explainability for power system planning.

Abstract: Accurate load forecasting is essential to the operation of modern electric power systems. Given the sensitivity of electricity demand to weather variability and temporal dynamics, capturing non-linear patterns is essential for long-term planning. This paper presents a comparative analysis of machine learning models, Linear Regression, XGBoost, LightGBM, and Long Short-Term Memory (LSTM), for forecasting system-wide electricity load up to one year in advance. Midterm forecasting has shown to be crucial for maintenance scheduling, resource allocation, financial forecasting, and market participation. The paper places a focus on the use of a method called “Shapley Additive Explanations” (SHAP) to improve model explainability. SHAP enables the quantification of feature contributions, guiding informed feature engineering and improving both model transparency and forecasting accuracy.
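
The SHAP workflow the paper emphasizes looks roughly as follows with a tree model: fit, explain, and rank features by mean absolute SHAP value. The feature names and synthetic frame are placeholders for the actual ERCOT SCENT inputs.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "temperature_f": rng.normal(75, 15, 1000),   # placeholder weather feature
    "hour_of_day": rng.integers(0, 24, 1000),
    "day_of_week": rng.integers(0, 7, 1000),
})
# Toy load: cooling demand kicks in above ~65F, plus noise.
y = 30 + 0.8 * np.maximum(X["temperature_f"] - 65, 0) + rng.normal(0, 2, 1000)

model = xgb.XGBRegressor(n_estimators=200, max_depth=4).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Mean absolute SHAP value per feature ~ its contribution to the forecast.
print(dict(zip(X.columns, np.abs(shap_values).mean(axis=0))))
```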

[268] TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

Stéphane d’Ascoli, Jérémy Rapin, Yohann Benchetrit, Hubert Banville, Jean-Rémi King

Main category: cs.LG

TL;DR: TRIBE is a multimodal deep neural network that predicts brain responses across modalities, cortical areas, and individuals, outperforming unimodal models and winning the Algonauts 2025 competition.

DetailsMotivation: Neuroscience's fragmented approach hinders unified cognition models. TRIBE aims to integrate multimodal brain responses for a cohesive understanding.

Method: Combines pretrained text, audio, and video models with a transformer to predict fMRI responses to videos.

Result: Achieved top performance in Algonauts 2025, excelling in high-level associative cortices.

Conclusion: TRIBE advances integrative brain modeling, with potential applications in perception and comprehension.

Abstract: Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://github.com/facebookresearch/algonauts-2025.

[269] Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

Oleksiy Ostapenko, Charles Guille-Escuret, Luke Kumar, Max Tian, Denis Kocetkov, Gopeshh Subbaraj, Raymond Li, Joel Lamy-Poirier, Sebastien Paquet, Torsten Scholak

Main category: cs.LG

TL;DR: A framework for optimizing domain-specific dataset construction in foundation model training by estimating data source quality and resource allocation efficiently.

DetailsMotivation: To address the limitation of point estimates in prior work, which can mislead data scaling decisions due to lack of rank invariance across compute scales.

Method: Extends point estimate approaches by performing multiple annealing runs to estimate scaling laws, analyzing performance gains relative to acquisition costs.

Result: Validated on a 7B-parameter model, showing efficient estimation of scaling behaviors for data sources, leading to better resource allocation.

Conclusion: The approach enables data-driven decision-making for selecting and optimizing data sources, improving cost efficiency.

Abstract: We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web data, etc.) in order to make optimal decisions about resource allocation for data sourcing from these sources for the stage two pre-training phase, aka annealing, with the goal of specializing a generalist pre-trained model to specific domains. Our approach extends the usual point estimate approaches, aka micro-annealing, to estimating scaling laws by performing multiple annealing runs of varying compute spent on data curation and training. This addresses a key limitation in prior work, where reliance on point estimates for data scaling decisions can be misleading due to the lack of rank invariance across compute scales – a phenomenon we confirm in our experiments. By systematically analyzing performance gains relative to acquisition costs, we find that scaling curves can be estimated for different data sources. Such scaling laws can inform cost effective resource allocation across different data acquisition methods (e.g. synthetic data), data sources (e.g. user or web data) and available compute resources. We validate our approach through experiments on a pre-trained model with 7 billion parameters. We adapt it to: a domain well-represented in the pre-training data – the medical domain, and a domain underrepresented in the pretraining corpora – the math domain. We show that one can efficiently estimate the scaling behaviors of a data source by running multiple annealing runs, which can lead to different conclusions, had one used point estimates using the usual micro-annealing technique instead. This enables data-driven decision-making for selecting and optimizing data sources.
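
The core mechanic, fitting a scaling curve per data source from a handful of annealing runs rather than a single point estimate, can be sketched in a few lines. The power-law form and all numbers below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    return a * x ** (-b) + c   # assumed form: loss decays with compute toward c

# Relative compute spent per annealing run, and the eval loss for one source.
x = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss = np.array([2.10, 1.95, 1.82, 1.74, 1.69])   # illustrative numbers

(a, b, c), _ = curve_fit(power_law, x, loss, p0=(0.5, 0.3, 1.6))
# Extrapolate to the target budget: sources are ranked at scale, not at a point,
# which is what guards against the rank flips the paper observes.
print(power_law(1000.0, a, b, c))
```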

[270] Agent-centric learning: from external reward maximization to internal knowledge curation

Hanqi Zhou, Fryderyk Mantiuk, David G. Nagy, Charley M. Wu

Main category: cs.LG

TL;DR: The paper introduces representational empowerment, shifting focus from external control to internal knowledge adaptability for general intelligence.

DetailsMotivation: Traditional AI focuses on external objectives, leading to specialized but inflexible agents. The paper aims to address this by emphasizing internal knowledge control.

Method: Proposes representational empowerment, measuring an agent’s ability to maintain and diversify its internal knowledge structures.

Result: Suggests that internal representation control enhances adaptability and preparedness, distinct from direct environmental influence.

Conclusion: Representational empowerment offers a new design lens for creating more adaptable and intelligent systems.

Abstract: The pursuit of general intelligence has traditionally centered on external objectives: an agent’s control over its environments or mastery of specific tasks. This external focus, however, can produce specialized agents that lack adaptability. We propose representational empowerment, a new perspective towards a truly agent-centric learning paradigm by moving the locus of control inward. This objective measures an agent’s ability to controllably maintain and diversify its own knowledge structures. We posit that this capacity to shape one’s own understanding is an element for achieving better “preparedness”, distinct from direct environmental influence. Focusing on internal representations as the main substrate for computing empowerment offers a new lens through which to design adaptable intelligent systems.

[271] Weighted Conditional Flow Matching

Sergio Calvo-Ordonez, Matthieu Meunier, Alvaro Cartea, Christoph Reisinger, Yarin Gal, Jose Miguel Hernandez-Lobato

Main category: cs.LG

TL;DR: W-CFM improves CFM by weighting training pairs with a Gibbs kernel, aligning paths closer to straight-line interpolations, enhancing efficiency and accuracy without costly OT.

DetailsMotivation: Standard CFM paths deviate from straight-line interpolations, slowing generation and reducing accuracy. W-CFM addresses this without expensive OT.

Method: W-CFM modifies CFM loss by weighting training pairs with a Gibbs kernel, recovering entropic OT coupling with minimal marginal bias.

Result: W-CFM matches or outperforms baselines in sample quality, fidelity, and diversity while maintaining computational efficiency.

Conclusion: W-CFM offers a computationally efficient alternative to OT-based CFM enhancements, improving path straightness and generation performance.

Abstract: Conditional flow matching (CFM) has emerged as a powerful framework for training continuous normalizing flows due to its computational efficiency and effectiveness. However, standard CFM often produces paths that deviate significantly from straight-line interpolations between prior and target distributions, making generation slower and less accurate due to the need for fine discretization at inference. Recent methods enhance CFM performance by inducing shorter and straighter trajectories but typically rely on computationally expensive mini-batch optimal transport (OT). Drawing insights from entropic optimal transport (EOT), we propose Weighted Conditional Flow Matching (W-CFM), a novel approach that modifies the classical CFM loss by weighting each training pair $(x, y)$ with a Gibbs kernel. We show that this weighting recovers the entropic OT coupling up to some bias in the marginals, and we provide the conditions under which the marginals remain nearly unchanged. Moreover, we establish an equivalence between W-CFM and the minibatch OT method in the large-batch limit, showing how our method overcomes computational and performance bottlenecks linked to batch size. Empirically, we test our method on unconditional generation on various synthetic and real datasets, confirming that W-CFM achieves comparable or superior sample quality, fidelity, and diversity to other alternative baselines while maintaining the computational efficiency of vanilla CFM.
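
The weighting itself is a one-line change to the CFM objective. Below is a hedged sketch: each pair's squared velocity error is scaled by the Gibbs kernel exp(-||x - y||^2 / eps), as the abstract states; the batch normalization of the weights and the velocity-net interface are assumptions.

```python
import torch

def wcfm_loss(vel_net, x, y, eps=1.0):
    # x: prior samples, y: data samples, both (batch, dim)
    t = torch.rand(x.size(0), 1)
    x_t = (1 - t) * x + t * y                       # straight-line interpolant
    target = y - x                                  # conditional velocity
    w = torch.exp(-((x - y) ** 2).sum(dim=1) / eps) # Gibbs kernel weight per pair
    w = w / w.mean()                                # assumed: keep loss scale stable
    per_pair = ((vel_net(x_t, t) - target) ** 2).sum(dim=1)
    return (w * per_pair).mean()

vel_net = lambda x_t, t: torch.zeros_like(x_t)      # stand-in velocity network
loss = wcfm_loss(vel_net, torch.randn(64, 2), torch.randn(64, 2))
print(loss)
```

Intuitively, distant (x, y) pairs get exponentially small weight, which is what steers the learned flow toward the shorter, straighter entropic-OT-like paths.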

[272] Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation

Afonso Martini Spezia, Mariana Recamonde-Mendoza

Main category: cs.LG

TL;DR: The paper investigates cluster-based cross-validation strategies, proposing a new technique combining Mini Batch K-Means with class stratification. It evaluates performance on balanced and imbalanced datasets, finding the new method effective for balanced data but traditional stratified cross-validation better for imbalanced data.

DetailsMotivation: To address the potential bias in cross-validation due to unrepresentative data folds and improve model evaluation strategies.

Method: Proposes a new cross-validation technique combining Mini Batch K-Means with class stratification. Experiments compare this and other strategies on 20 datasets using four supervised learning algorithms, analyzing bias, variance, and computational cost.

Result: Mini Batch K-Means with class stratification outperformed others on balanced datasets but not on imbalanced ones, where traditional stratified cross-validation was superior. No single clustering algorithm consistently excelled.

Conclusion: The work enhances understanding of cluster-based cross-validation, reaffirms the value of stratified cross-validation for imbalanced data, and suggests directions for more robust model evaluation.

Abstract: Cross-validation plays a fundamental role in Machine Learning, enabling robust evaluation of model performance and preventing overestimation on training and validation data. However, one of its drawbacks is the potential to create data subsets (folds) that do not adequately represent the diversity of the original dataset, which can lead to biased performance estimates. The objective of this work is to deepen the investigation of cluster-based cross-validation strategies by analyzing the performance of different clustering algorithms through experimental comparison. Additionally, a new cross-validation technique that combines Mini Batch K-Means with class stratification is proposed. Experiments were conducted on 20 datasets (both balanced and imbalanced) using four supervised learning algorithms, comparing cross-validation strategies in terms of bias, variance, and computational cost. The technique that uses Mini Batch K-Means with class stratification outperformed others in terms of bias and variance on balanced datasets, though it did not significantly reduce computational cost. On imbalanced datasets, traditional stratified cross-validation consistently performed better, showing lower bias, variance, and computational cost, making it a safe choice for performance evaluation in scenarios with class imbalance. In the comparison of different clustering algorithms, no single algorithm consistently stood out as superior. Overall, this work contributes to improving predictive model evaluation strategies by providing a deeper understanding of the potential of cluster-based data splitting techniques and reaffirming the effectiveness of well-established strategies like stratified cross-validation. Moreover, it highlights perspectives for increasing the robustness and reliability of model evaluations, especially in datasets with clustering characteristics.
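
One plausible reading of the proposed splitter is sketched below: cluster with Mini Batch K-Means, then stratify folds on the joint (class, cluster) label so every fold reflects both class balance and cluster structure. The joint-label construction is an assumption about the method's details.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.model_selection import StratifiedKFold

def cluster_stratified_folds(X, y, n_clusters=4, n_splits=5, seed=0):
    cluster_ids = MiniBatchKMeans(n_clusters=n_clusters,
                                  random_state=seed).fit_predict(X)
    joint = y * n_clusters + cluster_ids   # one stratum per (class, cluster) pair
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return skf.split(X, joint)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = rng.integers(0, 2, 600)
for train_idx, test_idx in cluster_stratified_folds(X, y):
    pass   # fit and evaluate a model on each fold here
```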

[273] Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities

Hanchen Yang, Wengen Li, Shuyu Wang, Hui Li, Jihong Guan, Shuigeng Zhou, Jiannong Cao

Main category: cs.LG

TL;DR: This paper surveys spatial-temporal data mining (STDM) studies for ocean science, covering datasets, data quality enhancement, and four task types, while highlighting research opportunities.

DetailsMotivation: The complexity and unique characteristics of ST ocean data hinder model design and training, and a lack of comprehensive surveys impedes progress in ocean data mining.

Method: The paper reviews ST ocean datasets, explores data quality enhancement techniques, and classifies STDM studies into prediction, event detection, pattern mining, and anomaly detection tasks.

Result: The survey provides a structured overview of STDM techniques in ocean science, aiding interdisciplinary understanding.

Conclusion: The paper identifies open challenges and opportunities, facilitating collaboration between computer and ocean scientists.

Abstract: With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated but with unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists to apply advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.

[274] CS-SHRED: Enhancing SHRED for Robust Recovery of Spatiotemporal Dynamics

Romulo B. da Silva, Cássio M. Oishi, Diego Passos, J. Nathan Kutz

Main category: cs.LG

TL;DR: CS-SHRED integrates Compressed Sensing with a Shallow Recurrent Decoder for robust spatiotemporal data reconstruction, outperforming traditional methods with higher fidelity and noise resilience.

DetailsMotivation: To address challenges in reconstructing spatiotemporal dynamics from incomplete, compressed, or corrupted data, especially in noisy or sparse sensor scenarios.

Method: Combines CS techniques with SHRED, using an adaptive loss function (MSE, MAE, SNR regularization) and LSTM for temporal modeling.

Result: Achieves higher reconstruction fidelity (improved SSIM, PSNR, lower errors) in diverse applications like fluid flows and climate data.

Conclusion: CS-SHRED is a versatile tool for environmental and scientific data analysis, offering superior performance and robustness.

Abstract: We present $\textbf{CS-SHRED}$, a novel deep learning architecture that integrates Compressed Sensing (CS) into a Shallow Recurrent Decoder ($\textbf{SHRED}$) to reconstruct spatiotemporal dynamics from incomplete, compressed, or corrupted data. Our approach introduces two key innovations. First, by incorporating CS techniques into the $\textbf{SHRED}$ architecture, our method leverages a batch-based forward framework with $\ell_1$ regularization to robustly recover signals even in scenarios with sparse sensor placements, noisy measurements, and incomplete sensor acquisitions. Second, an adaptive loss function dynamically combines Mean Squared Error (MSE) and Mean Absolute Error (MAE) terms with a piecewise Signal-to-Noise Ratio (SNR) regularization, which suppresses noise and outliers in low-SNR regions while preserving fine-scale features in high-SNR regions. We validate $\textbf{CS-SHRED}$ on challenging problems including viscoelastic fluid flows, maximum specific humidity fields, sea surface temperature distributions, and rotating turbulent flows. Compared to the traditional $\textbf{SHRED}$ approach, $\textbf{CS-SHRED}$ achieves significantly higher reconstruction fidelity - as demonstrated by improved SSIM and PSNR values, lower normalized errors, and enhanced LPIPS scores - thereby providing superior preservation of small-scale structures and increased robustness against noise and outliers. Our results underscore the advantages of the jointly trained CS and SHRED design architecture, which includes an LSTM sequence model for characterizing the temporal evolution with a shallow decoder network (SDN) for modeling the high-dimensional state space. The SNR-guided adaptive loss function for the spatiotemporal data recovery establishes $\textbf{CS-SHRED}$ as a promising tool for a wide range of applications in environmental, climatic, and scientific data analyses.
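
A hedged sketch of the adaptive loss described above: MSE and MAE terms are blended, with a piecewise weight driven by a local SNR estimate so low-SNR regions lean on the outlier-robust MAE. The exact weighting rule is an assumption, not the paper's formula.

```python
import torch

def adaptive_loss(pred, target, snr_db, snr_threshold=10.0, alpha=0.5):
    err = pred - target
    mse, mae = err ** 2, err.abs()
    # Assumed piecewise rule: low-SNR elements use MAE (suppresses outliers);
    # high-SNR elements use an MSE/MAE blend (preserves fine-scale features).
    low_snr = (snr_db < snr_threshold).float()
    per_elem = low_snr * mae + (1 - low_snr) * (alpha * mse + (1 - alpha) * mae)
    return per_elem.mean()

pred = torch.randn(4, 64, 64)
target = torch.randn(4, 64, 64)
snr = torch.full_like(pred, 12.0)   # stand-in per-pixel SNR estimate, in dB
print(adaptive_loss(pred, target, snr))
```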

[275] Hypernetworks for Model-Heterogeneous Personalized Federated Learning

Chen Zhang, Husheng Li, Xiang Liu, Linshan Jiang, Danxin Wang

Main category: cs.LG

TL;DR: MH-pFedHN and MH-pFedHNGD are hypernetwork-based frameworks for personalized federated learning, addressing model heterogeneity without external data or client architecture disclosure.

DetailsMotivation: Existing methods for personalized federated learning often rely on external data or partial strategies, limiting practicality and scalability.

Method: Proposes MH-pFedHN, using a server-side hypernetwork with client-specific embeddings, and MH-pFedHNGD, integrating an optional global model for better generalization.

Result: Achieves competitive accuracy and strong generalization across benchmarks, serving as a robust baseline.

Conclusion: The frameworks enhance privacy, flexibility, and scalability in model-heterogeneous personalized federated learning.

Abstract: Recent advances in personalized federated learning have focused on addressing client model heterogeneity. However, most existing methods still require external data, rely on model decoupling, or adopt partial learning strategies, which can limit their practicality and scalability. In this paper, we revisit hypernetwork-based methods and leverage their strong generalization capabilities to design a simple yet effective framework for heterogeneous personalized federated learning. Specifically, we propose MH-pFedHN, which leverages a server-side hypernetwork that takes client-specific embedding vectors as input and outputs personalized parameters tailored to each client’s heterogeneous model. To promote knowledge sharing and reduce computation, we introduce a multi-head structure within the hypernetwork, allowing clients with similar model sizes to share heads. Furthermore, we propose MH-pFedHNGD, which integrates an optional lightweight global model to improve generalization. Our framework does not rely on external datasets and does not require disclosure of client model architectures, thereby offering enhanced privacy and flexibility. Extensive experiments on multiple benchmarks and model settings demonstrate that our approach achieves competitive accuracy, strong generalization, and serves as a robust baseline for future research in model-heterogeneous personalized federated learning.
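
The server-side mechanism reduces to a small module: a learnable per-client embedding is mapped to a flat parameter vector for that client's model. The sketch below simplifies away the multi-head sharing; sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClientHyperNet(nn.Module):
    """Server-side hypernetwork: client embedding -> personalized parameters."""
    def __init__(self, num_clients, emb_dim, target_param_count):
        super().__init__()
        self.embeddings = nn.Embedding(num_clients, emb_dim)
        self.body = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU())
        # In MH-pFedHN, clients with similar model sizes would share such a head.
        self.head = nn.Linear(128, target_param_count)

    def forward(self, client_id):
        h = self.body(self.embeddings(client_id))
        return self.head(h)   # flat vector, reshaped into the client's weights

# 7850 = parameter count of a 784->10 linear classifier (weights + biases).
hn = ClientHyperNet(num_clients=10, emb_dim=32, target_param_count=7850)
theta = hn(torch.tensor([3]))   # personalized parameters for client 3
print(theta.shape)              # torch.Size([1, 7850])
```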

[276] Parametrized Multi-Agent Routing via Deep Attention Models

Salar Basiri, Dhananjay Tiwari, Srinivasa M. Salapaka

Main category: cs.LG

TL;DR: A scalable deep learning framework (ParaSDM) is proposed for parametrized sequential decision-making, focusing on Facility-Location and Path Optimization (FLPO). It integrates the Maximum Entropy Principle (MEP) with a neural policy model (SPN), achieving significant speedups and cost reductions compared to baselines.

DetailsMotivation: FLPO problems are NP-hard due to mixed discrete-continuous structures and non-convex objectives. Existing methods struggle with scalability and efficiency, motivating a deep learning approach.

Method: The framework combines MEP with the Shortest Path Network (SPN), a permutation-invariant encoder-decoder, enabling efficient gradient-based optimization over shared parameters.

Result: SPN achieves up to 100x speedup in policy inference and 6% optimality gap. FLPO approach reduces costs by 10x compared to metaheuristics and matches Gurobi’s optimality at 1500x speedup.

Conclusion: The framework sets a new state of the art for ParaSDM, demonstrating the effectiveness of structured deep models for large-scale mixed-integer optimization.

Abstract: We propose a scalable deep learning framework for parametrized sequential decision-making (ParaSDM), where multiple agents jointly optimize discrete action policies and shared continuous parameters. A key subclass of this setting arises in Facility-Location and Path Optimization (FLPO), where multi-agent systems must simultaneously determine optimal routes and facility locations, aiming to minimize the cumulative transportation cost within the network. FLPO problems are NP-hard due to their mixed discrete-continuous structure and highly non-convex objective. To address this, we integrate the Maximum Entropy Principle (MEP) with a neural policy model called the Shortest Path Network (SPN)-a permutation-invariant encoder-decoder that approximates the MEP solution while enabling efficient gradient-based optimization over shared parameters. The SPN achieves up to 100$\times$ speedup in policy inference and gradient computation compared to MEP baselines, with an average optimality gap of approximately 6% across a wide range of problem sizes. Our FLPO approach yields over 10$\times$ lower cost than metaheuristic baselines while running significantly faster, and matches Gurobi’s optimal cost with annealing at a 1500$\times$ speedup, establishing a new state of the art for ParaSDM problems. These results highlight the power of structured deep models for solving large-scale mixed-integer optimization tasks.

[277] MSQ: Memory-Efficient Bit Sparsification Quantization

Seokho Han, Seoyeon Yoon, Jinhee Kim, Dongwei Wang, Kang Eun Jeon, Huanrui Yang, Jong Hwan Ko

Main category: cs.LG

TL;DR: MSQ is a memory-efficient bit sparsification quantization method for DNNs, reducing parameters and training time while maintaining accuracy.

DetailsMotivation: Optimizing DNN efficiency on mobile/edge devices, addressing challenges in mixed-precision quantization and bit-level sparsity methods.

Method: Uses a round-clamp quantizer for differentiable LSB computation, regularization for sparsity, and Hessian information for pruning multiple LSBs.

Result: Achieves up to 8.00x reduction in trainable parameters and up to 86% reduction in training time, while maintaining competitive accuracy and compression rates.

Conclusion: MSQ is practical for training efficient DNNs on resource-constrained devices.

Abstract: As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies utilizing bit-level sparsity have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer to enable differentiable computation of the least significant bits (LSBs) from model weights. It further employs regularization to induce sparsity in these LSBs, enabling effective precision reduction without explicit bit-level parameter splitting. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ achieves up to 8.00x reduction in trainable parameters and up to 86% reduction in training time compared to previous bit-level quantization, while maintaining competitive accuracy and compression rates. This makes it a practical solution for training efficient DNNs on resource-constrained devices.
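
The round-clamp quantizer and LSB regularization are only named at summary level; the following is a minimal sketch assuming a straight-through estimator for rounding and an L1 penalty on the least significant bit of each quantized weight. The scale and bit-width arguments are hypothetical knobs, not the paper's settings.

```python
import torch

def lsb_sparsity_loss(w: torch.Tensor, scale: float, bits: int = 8) -> torch.Tensor:
    """Illustrative stand-in for MSQ's idea: expose the least significant bit of
    quantized weights and push it toward zero so precision can be reduced.

    The exact round-clamp quantizer in the paper may differ; round() uses a
    straight-through estimator so gradients flow to w.
    """
    qmax = 2 ** (bits - 1) - 1
    z = w / scale
    q = (z.round() - z).detach() + z          # straight-through round
    q = q.clamp(-qmax - 1, qmax)
    lsb = q - 2.0 * (q / 2.0).floor()         # q mod 2, in {0, 1} for integer q
    return lsb.abs().mean()                   # L1 push toward even codes (droppable LSB)

w = torch.randn(64, 64, requires_grad=True)
loss = lsb_sparsity_loss(w, scale=0.05)
loss.backward()                               # gradients flow via the STE
```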

[278] Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

Songsheng Wang, Rucheng Yu, Zhihang Yuan, Chao Yu, Feng Gao, Yu Wang, Derek F. Wong

Main category: cs.LG

TL;DR: Spec-VLA accelerates Vision-Language-Action models using speculative decoding, achieving 1.42x speedup without compromising success rates.

DetailsMotivation: Current VLA models face computational inefficiencies due to large parameter sizes and autoregressive decoding. Speculative decoding, effective for LLMs, is unexplored for VLAs.

Method: Introduces Spec-VLA, a framework using speculative decoding with a relaxed acceptance mechanism based on action token distances.

Result: Achieves 44% longer acceptance length and 1.42x speedup over OpenVLA baseline, maintaining success rates.

Conclusion: Spec-VLA demonstrates speculative execution’s potential for VLA models, offering significant speed improvements.

Abstract: Vision-Language-Action (VLA) models have made substantial progress by leveraging the robust capabilities of Visual Language Models (VLMs). However, VLMs’ significant parameter size and autoregressive (AR) decoding nature impose considerable computational demands on VLA models. While Speculative Decoding (SD) has shown efficacy in accelerating Large Language Models (LLMs) by incorporating efficient drafting and parallel verification, allowing multiple tokens to be generated in one forward pass, its application to VLA models remains unexplored. This work introduces Spec-VLA, an SD framework designed to accelerate VLA models. Due to the difficulty of the action prediction task and the greedy decoding mechanism of the VLA models, the direct application of the advanced SD framework to the VLA prediction task yields only a minor speed improvement. To boost the generation speed, we propose an effective mechanism to relax acceptance utilizing the relative distances represented by the action tokens of the VLA model. Empirical results across diverse test scenarios affirm the effectiveness of the Spec-VLA framework, and further analysis substantiates the impact of our proposed strategies, which enhance the acceptance length by 44%, achieving a 1.42x speedup compared with the OpenVLA baseline, without compromising the success rate. The success of the Spec-VLA framework highlights the potential for broader application of speculative execution in VLA prediction scenarios.
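
A hedged sketch of the relaxed-acceptance idea: accept a drafted action token when it decodes to an action close enough to the target model's choice. The tolerance, greedy target rule, and bin layout are illustrative assumptions, not the paper's calibrated mechanism.

```python
import numpy as np

def relaxed_accept(draft_token: int, target_probs: np.ndarray,
                   token_to_action: np.ndarray, tol: float = 0.02) -> bool:
    """Instead of requiring an exact token match, accept the draft if the action
    it decodes to is within tol of the action the target model prefers.
    token_to_action maps discretized action tokens to continuous bin centers.
    """
    target_token = int(target_probs.argmax())          # greedy target choice
    dist = abs(token_to_action[draft_token] - token_to_action[target_token])
    return dist <= tol

# 256 action bins uniformly covering [-1, 1], as in common VLA tokenizations.
bins = np.linspace(-1.0, 1.0, 256)
probs = np.random.dirichlet(np.ones(256))
print(relaxed_accept(draft_token=100, target_probs=probs, token_to_action=bins))
```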

[279] Multimodal Late Fusion Model for Problem-Solving Strategy Classification in a Machine Learning Game

Clemens Witt, Thiemo Leonhardt, Nadine Bergner, Mareen Grillenberger

Main category: cs.LG

TL;DR: A multimodal late fusion model combining visual and action data improves classification of students’ problem-solving strategies in educational games, outperforming unimodal models by 15%.

DetailsMotivation: Existing methods using abstracted gameplay logs may miss subtle behavioral cues, limiting accurate assessment of cognitive strategies.

Method: Proposes a multimodal late fusion model integrating screencast visual data and structured in-game actions.

Result: The fusion model increased classification accuracy by over 15% compared to unimodal baselines in a pilot study with 149 students.

Conclusion: Multimodal ML enhances strategy-sensitive assessment and adaptive support in interactive learning environments.

Abstract: Machine learning models are widely used to support stealth assessment in digital learning environments. Existing approaches typically rely on abstracted gameplay log data, which may overlook subtle behavioral cues linked to learners’ cognitive strategies. This paper proposes a multimodal late fusion model that integrates screencast-based visual data and structured in-game action sequences to classify students’ problem-solving strategies. In a pilot study with secondary school students (N=149) playing a multitouch educational game, the fusion model outperformed unimodal baseline models, increasing classification accuracy by over 15%. Results highlight the potential of multimodal ML for strategy-sensitive assessment and adaptive support in interactive learning contexts.
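
A minimal sketch of decision-level (late) fusion, assuming precomputed features per modality and a learned scalar fusion weight; the dimensions and the weighted-sum rule are illustrative, not the paper's exact head.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Each modality is classified independently; the per-modality logits are
    combined at the decision level with a learned weight.
    """
    def __init__(self, vis_dim: int, act_dim: int, n_classes: int):
        super().__init__()
        self.vis_head = nn.Linear(vis_dim, n_classes)   # screencast features
        self.act_head = nn.Linear(act_dim, n_classes)   # in-game action features
        self.alpha = nn.Parameter(torch.tensor(0.5))    # learned fusion weight

    def forward(self, vis: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        return a * self.vis_head(vis) + (1 - a) * self.act_head(act)

model = LateFusion(vis_dim=512, act_dim=64, n_classes=4)
logits = model(torch.randn(8, 512), torch.randn(8, 64))   # batch of 8 students
```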

[280] Theoretical Analysis of Relative Errors in Gradient Computations for Adversarial Attacks with CE Loss

Yunrui Yu, Hang Su, Cheng-zhong Xu, Zhizhong Su, Jun Zhu

Main category: cs.LG

TL;DR: The paper analyzes floating-point arithmetic errors in gradient-based adversarial attacks using CE loss, proposes T-MIFPE loss to mitigate these errors, and shows its superiority over existing methods.

DetailsMotivation: To address overestimation in gradient-based attacks due to floating-point errors and improve attack accuracy.

Method: Theoretical analysis of floating-point errors in four attack scenarios, leading to the T-MIFPE loss with an optimal scaling factor.

Result: T-MIFPE outperforms CE, C&W, DLR, and MIFPE in attack potency and robustness evaluation on MNIST, CIFAR-10, and CIFAR-100.

Conclusion: T-MIFPE effectively reduces floating-point errors, enhancing adversarial attack accuracy and robustness assessment.

Abstract: Gradient-based adversarial attacks using the Cross-Entropy (CE) loss often suffer from overestimation due to relative errors in gradient computation induced by floating-point arithmetic. This paper provides a rigorous theoretical analysis of these errors, conducting the first comprehensive study of floating-point computation errors in gradient-based attacks across four distinct scenarios: (i) unsuccessful untargeted attacks, (ii) successful untargeted attacks, (iii) unsuccessful targeted attacks, and (iv) successful targeted attacks. We establish theoretical foundations characterizing the behavior of relative numerical errors under different attack conditions, revealing previously unknown patterns in gradient computation instability, and identify floating-point underflow and rounding as key contributors. Building on this insight, we propose the Theoretical MIFPE (T-MIFPE) loss function, which incorporates an optimal scaling factor $T = t^*$ to minimize the impact of floating-point errors, thereby enhancing the accuracy of gradient computation in adversarial attacks. Extensive experiments on the MNIST, CIFAR-10, and CIFAR-100 datasets demonstrate that T-MIFPE outperforms existing loss functions, including CE, C&W, DLR, and MIFPE, in terms of attack potency and robustness evaluation accuracy.
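
The abstract specifies only that T-MIFPE inserts a scaling factor $T = t^*$ into the loss. Below is a minimal sketch of that scaling idea, with T as a plain user-supplied constant rather than the paper's derived optimum.

```python
import torch
import torch.nn.functional as F

def scaled_ce_loss(logits: torch.Tensor, labels: torch.Tensor, T: float) -> torch.Tensor:
    """Scale the logits by T before the cross-entropy so that gradient
    computation stays in a numerically favorable floating-point range.
    The paper derives an optimal T = t*; here T is just a constant.
    """
    return F.cross_entropy(T * logits, labels)

logits = torch.randn(16, 10, requires_grad=True)
labels = torch.randint(0, 10, (16,))
loss = scaled_ce_loss(logits, labels, T=2.0)
loss.backward()   # attack directions would be built from logits.grad
```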

[281] RANA: Robust Active Learning for Noisy Network Alignment

Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao, Yanan Cao

Main category: cs.LG

TL;DR: RANA is a robust active learning framework for noisy network alignment, addressing structural and labeling noise while improving alignment accuracy.

DetailsMotivation: Existing network alignment methods overlook noise issues, which degrade performance. RANA aims to tackle structural and labeling noise alongside label sparsity.

Method: RANA introduces a Noise-aware Selection Module for structural noise and a Label Denoising Module for labeling noise, using cleanliness scores and multi-source fusion denoising.

Result: RANA outperforms state-of-the-art active learning-based methods in alignment accuracy on three real-world datasets.

Conclusion: RANA effectively improves robustness in network alignment by addressing noise and sparsity, demonstrating superior performance.

Abstract: Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structural noise and labeling noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model labeling and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at https://github.com/YXNan0110/RANA.

[282] RCR-AF: Enhancing Model Generalization via Rademacher Complexity Reduction Activation Function

Yunrui Yu, Kafeng Wang, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: The paper introduces RCR-AF, a novel activation function combining GELU and ReLU advantages, to enhance neural network robustness against adversarial attacks.

DetailsMotivation: Deep neural networks are vulnerable to adversarial attacks, especially in safety-sensitive applications, prompting the need for improved activation functions to boost robustness.

Method: Proposes RCR-AF, which blends GELU’s smoothness and ReLU’s monotonicity, with hyperparameters α and γ to control sparsity and capacity, theoretically analyzed via Rademacher complexity.

Result: RCR-AF outperforms ReLU, GELU, and Swish in clean accuracy and adversarial robustness, validated through empirical evaluations.

Conclusion: RCR-AF offers a principled, effective solution for improving neural network robustness, with theoretical and empirical support.

Abstract: Despite their widespread success, deep neural networks remain critically vulnerable to adversarial attacks, posing significant risks in safety-sensitive applications. This paper investigates activation functions as a crucial yet underexplored component for enhancing model robustness. We propose a Rademacher Complexity Reduction Activation Function (RCR-AF), a novel activation function designed to improve both generalization and adversarial resilience. RCR-AF uniquely combines the advantages of GELU (including smoothness, gradient stability, and negative information retention) with ReLU’s desirable monotonicity, while simultaneously controlling both model sparsity and capacity through built-in clipping mechanisms governed by two hyperparameters, $\alpha$ and $\gamma$. Our theoretical analysis, grounded in Rademacher complexity, demonstrates that these parameters directly modulate the model’s Rademacher complexity, offering a principled approach to enhance robustness. Comprehensive empirical evaluations show that RCR-AF consistently outperforms widely-used alternatives (ReLU, GELU, and Swish) in both clean accuracy under standard training and in adversarial robustness within adversarial training paradigms.

[283] Proto-EVFL: Enhanced Vertical Federated Learning via Dual Prototype with Extremely Unaligned Data

Wei Guo, Yiyang Duan, Zhaojun Hu, Yiqi Tong, Fuzhen Zhuang, Xiao Zhang, Jin Dong, Ruofan Wu, Tengfei Liu, Yifan Sun

Main category: cs.LG

TL;DR: Proto-EVFL is a dual-prototype framework for vertical federated learning (VFL) that addresses class imbalance issues by dynamically selecting unaligned samples and aggregating features adaptively.

DetailsMotivation: Class imbalance in VFL leads to insufficient feature representation and model bias, hindering collaborative learning.

Method: Uses dual prototypes for class relationships, probabilistic sample selection via optimal transport, and adaptive feature aggregation.

Result: Achieves superior performance, outperforming baselines by at least 6.97% even in zero-shot scenarios.

Conclusion: Proto-EVFL effectively mitigates class imbalance and feature inconsistency in VFL, validated by theoretical and empirical results.

Abstract: In vertical federated learning (VFL), multiple enterprises address aligned sample scarcity by leveraging massive locally unaligned samples to facilitate collaborative learning. However, unaligned samples across different parties in VFL can be extremely class-imbalanced, leading to insufficient feature representation and limited model prediction space. Specifically, class-imbalanced problems consist of intra-party class imbalance and inter-party class imbalance, which can further cause local model bias and feature contribution inconsistency issues, respectively. To address the above challenges, we propose Proto-EVFL, an enhanced VFL framework via dual prototypes. We first introduce class prototypes for each party to learn relationships between classes in the latent space, allowing the active party to predict unseen classes. We further design a probabilistic dual prototype learning scheme to dynamically select unaligned samples by conditional optimal transport cost with class prior probability. Moreover, a mixed prior guided module guides this selection process by combining local and global class prior probabilities. Finally, we adopt an adaptive gated feature aggregation strategy to mitigate feature contribution inconsistency by dynamically weighting and aggregating local features across different parties. We prove that Proto-EVFL, as the first bi-level optimization framework in VFL, has a convergence rate of $O(1/\sqrt{T})$. Extensive experiments on various datasets validate the superiority of our Proto-EVFL. Even in a zero-shot scenario with one unseen class, it outperforms baselines by at least 6.97%.

[284] LoReUn: Data Itself Implicitly Provides Cues to Improve Machine Unlearning

Xiang Li, Qianli Shen, Haonan Wang, Kenji Kawaguchi

Main category: cs.LG

TL;DR: LoReUn introduces a dynamic reweighting strategy for machine unlearning, improving effectiveness by focusing on harder-to-unlearn data.

DetailsMotivation: Existing machine unlearning methods treat all data equally, failing to address varying difficulty in unlearning certain data.

Method: LoReUn dynamically reweights data during unlearning based on loss, with minimal computational overhead.

Result: LoReUn reduces the gap between existing methods and exact unlearning, enhancing prevention of harmful content in generative models.

Conclusion: LoReUn is a simple, effective plug-and-play solution for improving machine unlearning, particularly in preventing harmful content.

Abstract: Recent generative models face significant risks of producing harmful content, which has underscored the importance of machine unlearning (MU) as a critical technique for eliminating the influence of undesired data. However, existing MU methods typically assign the same weight to all data to be forgotten, which makes it difficult to effectively forget certain data that is harder to unlearn than others. In this paper, we empirically demonstrate that the loss of data itself can implicitly reflect its varying difficulty. Building on this insight, we introduce Loss-based Reweighting Unlearning (LoReUn), a simple yet effective plug-and-play strategy that dynamically reweights data during the unlearning process with minimal additional computational overhead. Our approach significantly reduces the gap between existing MU methods and exact unlearning in both image classification and generation tasks, effectively enhancing the prevention of harmful content generation in text-to-image diffusion models.
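
As a rough illustration of loss-based reweighting, here is a sketch under the assumption that low-loss forget samples (still well fitted, hence harder to unlearn) should receive larger weight; the softmax form and temperature are illustrative choices, not the paper's exact scheme.

```python
import torch

def loss_based_weights(forget_losses: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Map per-sample losses on the forget set to weights: lower loss -> larger
    weight. Weights are normalized to average 1 and would typically be
    detached from the graph in practice.
    """
    w = torch.softmax(-forget_losses / tau, dim=0)
    return w * forget_losses.numel()

losses = torch.tensor([0.1, 0.5, 2.0])     # per-sample losses on the forget set
weights = loss_based_weights(losses)
weighted_unlearn_loss = -(weights * losses).mean()   # e.g., weighted gradient ascent
```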

[285] Geometry of nonlinear forecast reconciliation

Lorenzo Nespoli, Anubhab Biswas, Vasco Medici

Main category: cs.LG

TL;DR: This paper fills a gap in forecast reconciliation by proving theorems for nonlinear hypersurfaces and vector-valued functions, extending probabilistic guarantees and releasing a JAX-based Python package.

DetailsMotivation: To address the lack of formal theorems for error reduction in nonlinear forecast reconciliation, especially for probabilistic settings.

Method: Derives theorems for nonlinear hypersurfaces (constant-sign curvature) and vector-valued functions, with probabilistic guarantees.

Result: Established theorems analogous to prior work, with broader applicability, and released a practical Python package.

Conclusion: The paper advances forecast reconciliation by providing theoretical foundations and tools for nonlinear contexts.

Abstract: Forecast reconciliation, an ex-post technique applied to forecasts that must satisfy constraints, has been a prominent topic in the forecasting literature over the past two decades. Recently, several efforts have sought to extend reconciliation methods to probabilistic settings. Nevertheless, formal theorems demonstrating error reduction in nonlinear contexts, analogous to those presented in Panagiotelis et al. (2021), are still lacking. This paper addresses that gap by establishing such theorems for various classes of nonlinear hypersurfaces and vector-valued functions. Specifically, we derive an exact analog of Theorem 3.1 from Panagiotelis et al. (2021) for hypersurfaces with constant-sign curvature. Additionally, we provide probabilistic guarantees for the broader case of hypersurfaces with non-constant-sign curvature and for general vector-valued functions. To support reproducibility and practical adoption, we release a JAX-based Python package (to be released upon publication) implementing the presented theorems and reconciliation procedures.
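
For intuition, a reconciliation step in the nonlinear setting can be sketched as a Euclidean projection of the base forecast onto the constraint surface; the quadratic objective, SLSQP solver, and toy product constraint below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from scipy.optimize import minimize

def reconcile(y_hat: np.ndarray, g) -> np.ndarray:
    """Project a base forecast onto the constraint surface {y : g(y) = 0},
    i.e., the closest constrained forecast in Euclidean distance. This is
    the basic step whose error-reduction properties the paper analyzes for
    curved (nonlinear) surfaces.
    """
    res = minimize(lambda y: np.sum((y - y_hat) ** 2), x0=y_hat,
                   constraints={"type": "eq", "fun": g}, method="SLSQP")
    return res.x

# Nonlinear aggregation constraint: y = (a, b, total) with total = a * b
# (a hypothetical example of a curved constraint surface).
g = lambda y: y[2] - y[0] * y[1]
print(reconcile(np.array([2.1, 2.9, 6.5]), g))
```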

[286] SmilesT5: Domain-specific pretraining for molecular language models

Philip Spence, Brooks Paige, Anne Osbourn

Main category: cs.LG

TL;DR: The paper introduces domain-specific pretraining tasks for molecular property prediction using SMILES strings, improving performance and efficiency over traditional methods.

DetailsMotivation: Molecular property prediction is crucial in drug discovery, and leveraging NLP advancements can enhance learning from molecular representations like SMILES strings.

Method: Proposes novel domain-specific text-to-text pretraining tasks for transformer models, evaluated on six classification benchmarks.

Result: Outperforms traditional likelihood-based training and prior fine-tuning tasks, with improved data and computational efficiency.

Conclusion: Pretrained embeddings offer comparable performance to fine-tuning but with lower computational cost, making them practical for downstream applications.

Abstract: Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show that data and computational efficiency can be improved by using these domain-specific pretraining tasks. Finally, the pretrained embeddings from the model can be used as fixed inputs into a downstream machine learning classifier and yield comparable performance to fine-tuning but with much lower computational overhead.

[287] HGCN(O): A Self-Tuning GCN HyperModel Toolkit for Outcome Prediction in Event-Sequence Data

Fang Wang, Paolo Ceravolo, Ernesto Damiani

Main category: cs.LG

TL;DR: HGCN(O) is a self-tuning toolkit using GCN models for event sequence prediction, outperforming traditional methods in accuracy and stability.

DetailsMotivation: To improve prediction accuracy and stability for event sequences, especially in unbalanced datasets, by leveraging GCN architectures.

Method: Uses four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) with varied node- and graph-level attributes and temporal dependencies via edge weights.

Result: GCNConv models perform best on unbalanced data, while all models are consistent on balanced data. HGCN(O) outperforms traditional methods.

Conclusion: HGCN(O) is effective for event sequence prediction, particularly in PBPM, demonstrating superior performance over conventional approaches.

Abstract: We propose HGCN(O), a self-tuning toolkit using Graph Convolutional Network (GCN) models for event sequence prediction. Featuring four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) across the GCNConv and GraphConv layers, our toolkit integrates multiple graph representations of event sequences with different choices of node- and graph-level attributes, and encodes temporal dependencies via edge weights, optimising prediction accuracy and stability for balanced and unbalanced datasets. Extensive experiments show that GCNConv models excel on unbalanced data, while all models perform consistently on balanced data. Experiments also confirm the superior performance of HGCN(O) over traditional approaches. Applications include Predictive Business Process Monitoring (PBPM), which predicts future events or states of a business process based on event logs.

[288] FGFP: A Fractional Gaussian Filter and Pruning for Deep Neural Networks Compression

Kuan-Ting Tu, Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

Main category: cs.LG

TL;DR: The paper proposes the FGFP framework, combining fractional Gaussian filters and pruning to compress DNNs for edge devices, achieving high accuracy with significant model size reduction.

DetailsMotivation: Heavy DNN loads on edge devices necessitate efficient compression techniques.

Method: FGFP integrates fractional-order calculus and Gaussian functions to create fractional Gaussian filters (FGFs) and uses Adaptive Unstructured Pruning (AUP) for higher compression.

Result: ResNet-20 on CIFAR-10 saw an 85.2% model size reduction with only a 1.52% accuracy drop; ResNet-50 on ImageNet2012 achieved a 69.1% reduction with a 1.63% drop.

Conclusion: FGFP effectively compresses DNNs for edge deployment, outperforming existing methods in accuracy and compression.

Abstract: Network compression techniques have become increasingly important in recent years because the loads of Deep Neural Networks (DNNs) are heavy for edge devices in real-world applications. While many methods compress neural network parameters, deploying these models on edge devices remains challenging. To address this, we propose the fractional Gaussian filter and pruning (FGFP) framework, which integrates fractional-order differential calculus and Gaussian function to construct fractional Gaussian filters (FGFs). To reduce the computational complexity of fractional-order differential operations, we introduce Grünwald-Letnikov fractional derivatives to approximate the fractional-order differential equation. The number of parameters for each kernel in FGF is minimized to only seven. Beyond the architecture of Fractional Gaussian Filters, our FGFP framework also incorporates Adaptive Unstructured Pruning (AUP) to achieve higher compression ratios. Experiments on various architectures and benchmarks show that our FGFP framework outperforms recent methods in accuracy and compression. On CIFAR-10, ResNet-20 achieves only a 1.52% drop in accuracy while reducing the model size by 85.2%. On ImageNet2012, ResNet-50 achieves only a 1.63% drop in accuracy while reducing the model size by 69.1%.
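
The Grünwald-Letnikov approximation has a simple recursive form, sketched below on a uniform grid; the test function and step size are arbitrary, but the coefficient recurrence is the standard one.

```python
import numpy as np

def gl_weights(alpha: float, n: int) -> np.ndarray:
    """Grünwald-Letnikov coefficients w_k = (-1)^k * binom(alpha, k), via the
    standard recurrence w_k = w_{k-1} * (1 - (alpha + 1) / k).
    """
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

def gl_fractional_derivative(f: np.ndarray, alpha: float, h: float = 1.0) -> np.ndarray:
    """Approximate D^alpha f on a uniform grid:
    D^alpha f(x) ~ h^(-alpha) * sum_k w_k f(x - k h)."""
    w = gl_weights(alpha, len(f))
    out = np.array([np.dot(w[: i + 1], f[i::-1]) for i in range(len(f))])
    return out / h ** alpha

x = np.linspace(0, 1, 64)
print(gl_fractional_derivative(x**2, alpha=0.5, h=x[1] - x[0])[:4])
```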

[289] Accident-Driven Congestion Prediction and Simulation: An Explainable Framework Using Advanced Clustering and Bayesian Networks

Kranthi Kumar Talluri, Galia Weidl, Vaishnavi Kasuluru

Main category: cs.LG

TL;DR: A robust framework using AutoML-enhanced DEC and Bayesian Networks predicts traffic congestion from accidents with 95.6% accuracy, validated via SUMO simulations.

DetailsMotivation: Address traffic congestion caused by accidents, which leads to delays, emissions, and safety issues.

Method: AutoML-enhanced DEC for clustering accident data and Bayesian Networks for congestion prediction, validated with SUMO simulations.

Result: AutoML-enhanced DEC outperforms traditional methods; BN achieves 95.6% accuracy and matches SUMO simulation results.

Conclusion: The proposed framework reliably predicts congestion from accidents, improving urban mobility.

Abstract: Traffic congestion due to uncertainties, such as accidents, is a significant issue in urban areas, as the ripple effect of accidents causes longer delays, increased emissions, and safety concerns. To address this issue, we propose a robust framework for predicting the impact of accidents on congestion. We implement Automated Machine Learning (AutoML)-enhanced Deep Embedding Clustering (DEC) to assign congestion labels to accident data and predict congestion probability using a Bayesian Network (BN). The Simulation of Urban Mobility (SUMO) simulation is utilized to evaluate the correctness of BN predictions using evidence-based scenarios. Results demonstrate that the AutoML-enhanced DEC has outperformed traditional clustering approaches. The performance of the proposed BN model achieved an overall accuracy of 95.6%, indicating its ability to understand the complex relationship of accidents causing congestion. Validation in SUMO with evidence-based scenarios demonstrated that the BN model’s prediction of congestion states closely matches those of SUMO, indicating the high reliability of the proposed BN model in ensuring smooth urban mobility.

[290] Pre-trained Models Perform the Best When Token Distributions Follow Zipf’s Law

Yanjin He, Qingkai Zeng, Meng Jiang

Main category: cs.LG

TL;DR: The paper proposes using Zipf’s law to determine optimal vocabulary size in tokenization, showing improved model performance when token distributions align with power-law behavior.

DetailsMotivation: Current methods for selecting vocabulary size in tokenization rely on heuristics or dataset-specific choices, lacking a principled approach.

Method: Analyze token frequency distributions using Zipf’s law to align vocabulary size with power-law behavior.

Result: Models achieve peak performance when token distributions closely follow Zipf’s law, validated across NLP, genomics, and chemistry.

Conclusion: Zipfian alignment is a robust and generalizable criterion for selecting vocabulary size, enhancing model efficiency and effectiveness.

Abstract: Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal vocabulary size remains underexplored, typically relying on heuristics or dataset-specific choices. In this work, we propose a principled method for determining the vocabulary size by analyzing token frequency distributions through Zipf’s law. We show that downstream task performance correlates with how closely token distributions follow power-law behavior, and that aligning with Zipfian scaling improves both model efficiency and effectiveness. Extensive experiments across NLP, genomics, and chemistry demonstrate that models consistently achieve peak performance when the token distribution closely adheres to Zipf’s law, establishing Zipfian alignment as a robust and generalizable criterion for vocabulary size selection.
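
A minimal sketch of the selection criterion the paper implies: fit a power law to the rank-frequency curve of a candidate vocabulary and score how Zipf-like it is. The synthetic counts and the R² scoring below are illustrative choices.

```python
import numpy as np

def zipf_fit_r2(token_counts: np.ndarray) -> tuple[float, float]:
    """Fit log(frequency) vs log(rank) by least squares; return the slope
    (Zipf exponent, ideally near -1) and the R^2 of the linear fit. A
    vocabulary whose token distribution tracks this power law closely is,
    per the paper, the one to prefer.
    """
    counts = np.sort(token_counts)[::-1].astype(float)
    counts = counts[counts > 0]
    ranks = np.arange(1, len(counts) + 1)
    x, y = np.log(ranks), np.log(counts)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return slope, 1.0 - resid.var() / y.var()

# Hypothetical sweep: retokenize a corpus at several vocab sizes and keep the
# one with the most Zipf-like (highest R^2) token distribution.
ranks = np.arange(1, 5001)
counts = (1e6 / ranks**1.05) * np.exp(0.05 * np.random.randn(ranks.size))
print(zipf_fit_r2(counts))
```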

[291] Thermodynamics-Inspired Computing with Oscillatory Neural Networks for Inverse Matrix Computation

George Tsormpatzoglou, Filip Sabo, Aida Todri-Sanial

Main category: cs.LG

TL;DR: A thermodynamic-inspired computing paradigm using oscillatory neural networks (ONNs) is proposed for solving linear algebra problems, specifically inverse matrices.

DetailsMotivation: To explore the feasibility of ONNs, traditionally used for combinatorial optimization, in solving linear algebra problems like inverse matrices.

Method: Analytical demonstration using the linear approximation of the coupled Kuramoto oscillator model, grounded in thermodynamics. Numerical simulations validate the approach.

Result: The method successfully computes inverse matrices, with numerical simulations identifying parameter regimes for highest accuracy.

Conclusion: ONNs, inspired by thermodynamics, can effectively solve linear algebra problems, expanding their applicability beyond combinatorial optimization.

Abstract: We describe a thermodynamic-inspired computing paradigm based on oscillatory neural networks (ONNs). While ONNs have been widely studied as Ising machines for tackling complex combinatorial optimization problems, this work investigates their feasibility in solving linear algebra problems, specifically the inverse matrix. Grounded in thermodynamic principles, we analytically demonstrate that the linear approximation of the coupled Kuramoto oscillator model leads to the inverse matrix solution. Numerical simulations validate the theoretical framework, and we examine the parameter regimes in which the computation achieves the highest accuracy.
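
For intuition, the standard small-angle linearization sketches how a fixed point of the coupled dynamics encodes a linear solve; the exact mapping from couplings to the effective matrix $A$ in the paper may differ.

```latex
\dot{\theta}_i = \omega_i + \sum_{j} K_{ij}\,\sin(\theta_j - \theta_i)
\;\xrightarrow{\;\sin x \approx x\;}\;
\dot{\boldsymbol{\theta}} = \boldsymbol{\omega} - A\boldsymbol{\theta},
\qquad
A\boldsymbol{\theta}^{*} = \boldsymbol{\omega} \ \text{at a fixed point.}
```

Driving the network with each basis vector $\boldsymbol{\omega} = e_k$ and reading out the settled phases $\boldsymbol{\theta}^{*}$ then yields the $k$-th column of $A^{-1}$, provided the couplings (including any self-coupling terms shifting the diagonal) realize an invertible $A$.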

[292] Efficient Differentially Private Fine-Tuning of LLMs via Reinforcement Learning

Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani, Gilbert Fridgen

Main category: cs.LG

TL;DR: RLDP is a reinforcement learning framework for differentially private optimization, improving model utility and efficiency while maintaining privacy.

DetailsMotivation: Address the trade-off between data privacy and model utility in LLMs by dynamically optimizing DP-SGD parameters.

Method: Uses deep reinforcement learning (SAC) to adaptively adjust gradient-clipping thresholds and noise injection during training.

Result: Achieves perplexity reductions of 1.3-30.5% and 5.6% utility gain, with 71% faster convergence, while maintaining privacy.

Conclusion: RLDP effectively balances privacy and utility, outperforming static DP-SGD methods.

Abstract: The tension between data privacy and model utility has become the defining bottleneck for the practical deployment of large language models (LLMs) trained on sensitive corpora including healthcare. Differentially private stochastic gradient descent (DP-SGD) guarantees formal privacy, yet it does so at a pronounced cost: gradients are forcibly clipped and perturbed with noise, degrading sample efficiency and final accuracy. Numerous variants have been proposed to soften this trade-off, but they all share a handicap: their control knobs are hard-coded, global, and oblivious to the evolving optimization landscape. Consequently, practitioners are forced either to over-spend privacy budget in pursuit of utility, or to accept mediocre models in order to stay within privacy constraints. We present RLDP, the first framework to cast DP optimization itself as a closed-loop control problem amenable to modern deep reinforcement learning (RL). RLDP continuously senses rich statistics of the learning dynamics and acts by selecting fine-grained per-parameter gradient-clipping thresholds as well as the magnitude of injected Gaussian noise. A soft actor-critic (SAC) hyper-policy is trained online during language model fine-tuning; it learns, from scratch, how to allocate the privacy budget where it matters and when it matters. Across more than 1,600 ablation experiments on GPT2-small, Llama-1B, Llama-3B, and Mistral-7B, RLDP delivers perplexity reductions of 1.3-30.5% (mean 5.4%) and an average 5.6% downstream utility gain. RLDP reaches each baseline’s final utility after only 13-43% of the gradient-update budget (mean speed-up 71%), all while honoring the same ($\epsilon$, $\delta$)-DP contract and exhibiting equal or lower susceptibility to membership-inference and canary-extraction attacks.
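
A minimal sketch of the DP-SGD primitive that RLDP controls: clip per-sample gradients and add Gaussian noise, with the clipping threshold and noise multiplier as the knobs the SAC hyper-policy would choose. The RL controller itself is stubbed out here; in RLDP these values vary per parameter group and per step.

```python
import torch

def dp_step(per_sample_grads: torch.Tensor, clip: float, sigma: float) -> torch.Tensor:
    """One DP-SGD aggregation step: clip each per-sample gradient to norm
    `clip`, average, and add Gaussian noise of std sigma * clip / batch_size.
    """
    norms = per_sample_grads.flatten(1).norm(dim=1).clamp(min=1e-12)
    scale = (clip / norms).clamp(max=1.0)
    clipped = per_sample_grads * scale.view(-1, *([1] * (per_sample_grads.dim() - 1)))
    noisy = clipped.mean(0) + torch.randn_like(clipped[0]) * sigma * clip / len(clipped)
    return noisy

grads = torch.randn(32, 128, 64)     # 32 per-sample grads for one weight matrix
update = dp_step(grads, clip=1.0, sigma=0.8)
```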

[293] DeepC4: Deep Conditional Census-Constrained Clustering for Large-scale Multitask Spatial Disaggregation of Urban Morphology

Joshua Dimasaka, Christian Geiß, Emily So

Main category: cs.LG

TL;DR: DeepC4, a deep learning-based spatial disaggregation method, improves urban morphology mapping by integrating census statistics and satellite imagery, outperforming GEM and METEOR in Rwanda.

DetailsMotivation: Address discrepancies and uncertainties in coarse-to-fine-grained urban mapping, especially in developing economies, for sustainable development and disaster risk reduction.

Method: Deep Conditional Census-Constrained Clustering (DeepC4) combines census statistics as constraints with multitask learning of satellite imagery patterns.

Result: Enhanced mapping of urban morphology in Rwanda, particularly for building exposure and vulnerability, at a detailed administrative level.

Conclusion: DeepC4 offers a scalable, deep learning-based solution for auditing coarse-grained spatial data, supporting global sustainability goals.

Abstract: To understand our global progress for sustainable development and disaster risk reduction in many developing economies, two recent major initiatives - the Uniform African Exposure Dataset of the Global Earthquake Model (GEM) Foundation and the Modelling Exposure through Earth Observation Routines (METEOR) Project - implemented classical spatial disaggregation techniques to generate large-scale mapping of urban morphology using the information from various satellite imagery and its derivatives, geospatial datasets of the built environment, and subnational census statistics. However, the local discrepancy with well-validated census statistics and the propagated model uncertainties remain a challenge in such coarse-to-fine-grained mapping problems, specifically constrained by weak and conditional label supervision. Therefore, we present Deep Conditional Census-Constrained Clustering (DeepC4), a novel deep learning-based spatial disaggregation approach that incorporates local census statistics as cluster-level constraints while considering multiple conditional label relationships in a joint multitask learning of the patterns of satellite imagery. As a demonstration, compared to GEM and METEOR, we enhanced the quality of Rwandan maps of urban morphology, specifically building exposure and physical vulnerability, at the third-level administrative unit using the 2022 census. As the world approaches the conclusion of our global frameworks in 2030, our work offers a new deep learning-based mapping technique for spatially auditing existing coarse-grained derived information at large scales.

[294] VAR: Visual Analysis for Rashomon Set of Machine Learning Models’ Performance

Yuanzhe Jin

Main category: cs.LG

TL;DR: The paper introduces VAR, a visualization tool for comparing ML models in the Rashomon set using heatmaps and scatter plots.

DetailsMotivation: There's a lack of effective visualization methods for horizontally comparing multiple ML models with similar accuracies but different structures.

Method: The proposed VAR solution combines heatmaps and scatter plots to visualize and compare models in the Rashomon set.

Result: VAR helps developers identify optimal models under specific conditions and understand the Rashomon set’s characteristics.

Conclusion: VAR provides a practical and effective way to analyze and compare closely matched ML models.

Abstract: Evaluating the performance of closely matched machine learning (ML) models under specific conditions has long been a focus of researchers in the field of machine learning. The Rashomon set is a collection of closely matched ML models, encompassing a wide range of models with similar accuracies but different structures. Traditionally, the analysis of these sets has focused on vertical structural analysis, which involves comparing the corresponding features at various levels within the ML models. However, there has been a lack of effective visualization methods for horizontally comparing multiple models with specific features. We propose the VAR visualization solution. VAR uses visualization to perform comparisons of ML models within the Rashomon set. This solution combines heatmaps and scatter plots to facilitate the comparison. With the help of VAR, ML model developers can identify the optimal model under specific conditions and better understand the Rashomon set’s overall characteristics.

[295] Explaining Deep Network Classification of Matrices: A Case Study on Monotonicity

Leandro Farina, Sergey Korotov

Main category: cs.LG

TL;DR: A deep learning and XAI approach identifies human-interpretable rules for classifying monotone matrices, achieving 95% accuracy using two characteristic polynomial coefficients.

DetailsMotivation: To derive practical criteria for classifying monotone matrices, which lack easy characterizations despite their simple definition.

Method: Combines deep neural networks with XAI techniques on a dataset of randomly generated matrices, using saliency methods to identify key features.

Result: Identifies two matrix parameters (coefficients of the characteristic polynomial) that classify matrices with 95% accuracy, revealing a simple bound for monotone matrices.

Conclusion: The approach successfully distills complex learned strategies into interpretable rules, offering a practical criterion for monotone matrix classification.

Abstract: This work demonstrates a methodology for using deep learning to discover simple, practical criteria for classifying matrices based on abstract algebraic properties. By combining a high-performance neural network with explainable AI (XAI) techniques, we can distill a model’s learned strategy into human-interpretable rules. We apply this approach to the challenging case of monotone matrices, defined by the condition that their inverses are entrywise nonnegative. Despite their simple definition, an easy characterization in terms of the matrix elements or the derived parameters is not known. Here, we present, to the best of our knowledge, the first systematic machine-learning approach for deriving a practical criterion that distinguishes monotone from non-monotone matrices. After establishing a labelled dataset of randomly generated monotone and non-monotone matrices with entries drawn uniformly on $(-1,1)$, we employ deep neural network algorithms for classifying the matrices as monotone or non-monotone, using both their entries and a comprehensive set of matrix features. Using saliency methods, such as integrated gradients, we identify, among all features, two matrix parameters which alone provide sufficient information for the matrix classification, with $95\%$ accuracy, namely the absolute values of the two lowest-order coefficients, $c_0$ and $c_1$, of the matrix’s characteristic polynomial. A data-driven study of 18,000 random $7\times7$ matrices shows that the monotone class obeys $\lvert c_{0}/c_{1}\rvert\le0.18$ with probability $>99.98\%$; because $\lvert c_{0}/c_{1}\rvert = 1/\mathrm{tr}(A^{-1})$ for monotone $A$, this is equivalent to the simple bound $\mathrm{tr}(A^{-1})\ge5.7$.
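
The quoted identity $\lvert c_0/c_1\rvert = 1/\mathrm{tr}(A^{-1})$ is easy to check numerically; the sketch below verifies it on a diagonally dominant M-matrix, a standard example of a monotone matrix.

```python
import numpy as np

def char_ratio(A: np.ndarray) -> float:
    """|c0 / c1| from the characteristic polynomial det(tI - A); np.poly
    returns coefficients highest-order first, so c0 and c1 are the last two."""
    coeffs = np.poly(A)
    return abs(coeffs[-1] / coeffs[-2])

def is_monotone(A: np.ndarray, tol: float = 1e-10) -> bool:
    """Monotone matrix: the inverse exists and is entrywise nonnegative."""
    try:
        return bool((np.linalg.inv(A) >= -tol).all())
    except np.linalg.LinAlgError:
        return False

# A strictly diagonally dominant Z-matrix is a nonsingular M-matrix, hence monotone.
A = 8 * np.eye(7) - np.random.rand(7, 7)
assert is_monotone(A)
print(char_ratio(A), 1 / np.trace(np.linalg.inv(A)))   # should match closely
```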

[296] Deep learning of geometrical cell division rules

Alexandre Durrmeyer, Jean-Christophe Palauqui, Philippe Andrey

Main category: cs.LG

TL;DR: The paper introduces a data-driven approach using deep neural networks to predict cell division plane positioning from mother cell geometry, outperforming traditional hypothesis-driven geometrical rules.

DetailsMotivation: To overcome the limitations of hypothesis-driven geometrical rules in predicting cell division planes by leveraging deep learning to learn complex relationships between cell geometry and division patterns.

Method: A modified UNet architecture is used to learn and predict division patterns from mother cell geometry, validated on synthetic data and A. thaliana embryo cells.

Result: The model successfully predicts division patterns that were previously irreconcilable with existing geometrical rules.

Conclusion: Deep networks offer a powerful tool for understanding cell division patterns and generating new hypotheses about division control.

Abstract: The positioning of new cellular walls during cell division plays a key role in shaping plant tissue organization. The influence of cell geometry on the positioning of division planes has been previously captured in various geometrical rules. Accordingly, linking cell shape to division orientation has relied on the comparison between observed division patterns and predictions under specific rules. The need to define a priori the tested rules is a fundamental limitation of this hypothesis-driven approach. As an alternative, we introduce a data-based approach to investigate the relation between cell geometry and division plane positioning, exploiting the ability of deep neural networks to learn complex relationships across multidimensional spaces. Adopting an image-based cell representation, we show how division patterns can be learned and predicted from mother cell geometry using a UNet architecture modified to operate on cell masks. Using synthetic data and A. thaliana embryo cells, we evaluate the model's performance on a wide range of diverse cell shapes and division patterns. We find that the trained model accounted for embryo division patterns that were previously irreconcilable under existing geometrical rules. Our work shows the potential of deep networks to understand cell division patterns and to generate new hypotheses on the control of cell division positioning.

[297] H2Tune: Federated Foundation Model Fine-Tuning with Hybrid Heterogeneity

Wei Guo, Siyuan Lu, Yiqi Tong, Zhaojun Hu, Fuzhen Zhuang, Xiao Zhang, Tao Fan, Jin Dong

Main category: cs.LG

TL;DR: H2Tune addresses hybrid heterogeneous federated fine-tuning (HHFFT), where clients differ in both model architecture and task, improving accuracy by up to 15.4%.

DetailsMotivation: Existing FFT methods don't handle double heterogeneity in model architectures and tasks, leading to challenges in aggregation and knowledge interference.

Method: H2Tune uses sparsified triple matrix decomposition, relation-guided alignment, and task-knowledge disentanglement.

Result: Achieves up to 15.4% accuracy improvement over baselines, with a proven O(1/√T) convergence rate.

Conclusion: H2Tune effectively tackles hybrid heterogeneity in federated fine-tuning, outperforming existing methods.

Abstract: Different from existing federated fine-tuning (FFT) methods for foundation models, hybrid heterogeneous federated fine-tuning (HHFFT) is an under-explored scenario where clients exhibit double heterogeneity in model architectures and downstream tasks. This hybrid heterogeneity introduces two significant challenges: 1) heterogeneous matrix aggregation, where clients adopt different large-scale foundation models based on their task requirements and resource limitations, leading to dimensional mismatches during LoRA parameter aggregation; and 2) multi-task knowledge interference, where local shared parameters, trained with both task-shared and task-specific knowledge, cannot ensure only task-shared knowledge is transferred between clients. To address these challenges, we propose H2Tune, a federated foundation model fine-tuning framework for hybrid heterogeneity. Our framework H2Tune consists of three key components: (i) sparsified triple matrix decomposition to align hidden dimensions across clients through constructing rank-consistent middle matrices, with adaptive sparsification based on client resources; (ii) relation-guided matrix layer alignment to handle heterogeneous layer structures and representation capabilities; and (iii) alternating task-knowledge disentanglement mechanism to decouple shared and specific knowledge of local model parameters through alternating optimization. Theoretical analysis proves a convergence rate of $O(1/\sqrt{T})$. Extensive experiments show our method achieves up to 15.4% accuracy improvement compared to state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/H2Tune-1407.

[298] Transductive Model Selection under Prior Probability Shift

Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani

Main category: cs.LG

TL;DR: A method for hyperparameter optimization in transductive learning under prior probability shift, optimizing directly on unlabelled data instead of traditional cross-validation.

DetailsMotivation: Addressing dataset shift in transductive learning, particularly prior probability shift, to improve model selection.

Method: Proposes optimizing hyperparameters directly on unlabelled target data, bypassing traditional cross-validation on labelled training data.

Result: Experimental results demonstrate the benefits of the proposed method.

Conclusion: The method effectively handles prior probability shift in transductive learning, offering a practical alternative to traditional model selection.

Abstract: Transductive learning is a supervised machine learning task in which, unlike in traditional inductive learning, the unlabelled data that require labelling are a finite set and are available at training time. Similarly to inductive learning contexts, transductive learning contexts may be affected by dataset shift, i.e., may be such that the IID assumption does not hold. We here propose a method, tailored to transductive classification contexts, for performing model selection (i.e., hyperparameter optimisation) when the data exhibit prior probability shift, an important type of dataset shift typical of anti-causal learning problems. In our proposed method the hyperparameters can be optimised directly on the unlabelled data to which the trained classifier must be applied; this is unlike traditional model selection methods, that are based on performing cross-validation on the labelled training data. We provide experimental results that show the benefits brought about by our method.

[299] Cluster-Based Random Forest Visualization and Interpretation

Max Sondag, Christofer Meinecke, Dennis Collaris, Tatiana von Landesberger, Stef van den Elzen

Main category: cs.LG

TL;DR: The paper introduces a visualization method to improve the interpretability of random forests by clustering similar trees and using new distance metrics and visualization techniques.

DetailsMotivation: Random forests, while powerful, are hard to interpret due to their complexity. This paper aims to make them more understandable without oversimplifying.

Method: The authors propose clustering similar trees using a new distance metric and introduce two visualization methods: Feature Plot and Rule Plot.

Result: The approach is tested on the ‘Glass’ dataset and a user study, demonstrating its effectiveness in improving interpretability.

Conclusion: The proposed method enhances the interpretability of random forests, making them more accessible for users.

Abstract: Random forests are a machine learning method used to automatically classify datasets and consist of a multitude of decision trees. While these random forests often have higher performance and generalize better than a single decision tree, they are also harder to interpret. This paper presents a visualization method and system to increase interpretability of random forests. We cluster similar trees which enables users to interpret how the model performs in general without needing to analyze each individual decision tree in detail, or interpret an oversimplified summary of the full forest. To meaningfully cluster the decision trees, we introduce a new distance metric that takes into account both the decision rules as well as the predictions of a pair of decision trees. We also propose two new visualization methods that visualize both clustered and individual decision trees: (1) The Feature Plot, which visualizes the topological position of features in the decision trees, and (2) the Rule Plot, which visualizes the decision rules of the decision trees. We demonstrate the efficacy of our approach through a case study on the “Glass” dataset, which is a relatively complex standard machine learning dataset, as well as a small user study.

[300] Enhanced Prediction of CAR T-Cell Cytotoxicity with Quantum-Kernel Methods

Filippo Utro, Meltem Tolunay, Kahn Rhrissorrakrai, Tanvi P. Gujarati, Jie Shi, Sara Capponi, Mirko Amico, Nate Earnest-Noble, Laxmi Parida

Main category: cs.LG

TL;DR: A quantum approach using Projected Quantum Kernel (PQK) improves CAR T-cell cytotoxicity prediction, outperforming classical methods, especially in data-constrained scenarios.

DetailsMotivation: The vast combinatorial space of CAR T-cell co-stimulatory domains makes experimental testing challenging, requiring innovative computational solutions.

Method: PQK embeds classical data into a high-dimensional Hilbert space and uses kernel methods for similarity measurement, tested on a 61-qubit quantum computer.

Result: PQK enhances classification performance for CAR T cytotoxicity prediction, particularly for signaling domains with limited data.

Conclusion: Quantum computing shows promise for addressing data-constrained problems in CAR T-cell engineering.

Abstract: Chimeric antigen receptor (CAR) T-cells are T-cells engineered to recognize and kill specific tumor cells. Through their extracellular domains, CAR T-cells bind tumor cell antigens which triggers CAR T activation and proliferation. These processes are regulated by co-stimulatory domains present in the intracellular region of the CAR T-cell. Through integrating novel signaling components into the co-stimulatory domains, it is possible to modify CAR T-cell phenotype. Identifying and experimentally testing new CAR constructs based on libraries of co-stimulatory domains is nontrivial given the vast combinatorial space defined by such libraries. This leads to a highly data-constrained, poorly explored combinatorial problem, where the experiments undersample all possible combinations. We propose a quantum approach using a Projected Quantum Kernel (PQK) to address this challenge. PQK operates by embedding classical data into a high dimensional Hilbert space and employs a kernel method to measure sample similarity. Using 61 qubits on a gate-based quantum computer, we demonstrate the largest PQK application to date and an enhancement in the classification performance over purely classical machine learning methods for CAR T cytotoxicity prediction. Importantly, we show improved learning for specific signaling domains and domain positions, particularly where there was less information, highlighting the potential of quantum computing for data-constrained problems.

[301] Bayesian Optimization of Process Parameters of a Sensor-Based Sorting System using Gaussian Processes as Surrogate Models

Felix Kronenwett, Georg Maier, Thomas Laengle

Main category: cs.LG

TL;DR: The paper presents a Bayesian Optimization-based method for optimizing and adjusting process parameters in sensor-based sorting systems, minimizing experiments while addressing uncertainties.

DetailsMotivation: Continuous verification and re-adjustment of process parameters are needed due to changing material stream compositions and requirements.

Method: Uses Gaussian process regression models as surrogate models within Bayesian Optimization to optimize parameters, considering uncertainties and dual optimization targets.

Result: Evaluated with three example process parameters, the method efficiently meets system behavior requirements.

Conclusion: The approach effectively optimizes and monitors sorting system parameters, reducing experimental needs while handling uncertainties.

Abstract: Sensor-based sorting systems enable the physical separation of a material stream into two fractions. The sorting decision is based on the image data evaluation of the sensors used and is carried out using actuators. Various process parameters must be set depending on the properties of the material stream, the dimensioning of the system, and the required sorting accuracy. However, continuous verification and re-adjustment are necessary due to changing requirements and material stream compositions. In this paper, we introduce an approach for optimizing, recurrently monitoring and adjusting the process parameters of a sensor-based sorting system. Based on Bayesian Optimization, Gaussian process regression models are used as surrogate models to achieve specific requirements for system behavior with the uncertainties contained therein. This method minimizes the number of necessary experiments while simultaneously considering two possible optimization targets based on the requirements for both material output streams. In addition, uncertainties are considered when determining sorting accuracies in the model calculation. We evaluated the method with three example process parameters.
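
A minimal sketch of the Bayesian Optimization loop with a GP surrogate and expected improvement, on a one-parameter toy objective; the real system optimizes several process parameters against dual targets with measurement uncertainty.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in objective: sorting accuracy as a function of one process
# parameter (e.g., a valve delay); the real system would run an experiment.
def run_experiment(x):
    return -(x - 0.6) ** 2 + 0.02 * np.random.randn()

X = np.array([[0.1], [0.5], [0.9]])          # initial experiments
y = np.array([run_experiment(x[0]) for x in X])

for _ in range(10):                          # BO loop: fit GP, maximize EI, evaluate
    gp = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-4, normalize_y=True).fit(X, y)
    cand = np.linspace(0, 1, 200).reshape(-1, 1)
    mu, sd = gp.predict(cand, return_std=True)
    imp = mu - y.max()
    ei = imp * norm.cdf(imp / (sd + 1e-9)) + sd * norm.pdf(imp / (sd + 1e-9))
    x_next = cand[ei.argmax()]
    X = np.vstack([X, x_next])
    y = np.append(y, run_experiment(x_next[0]))

print("best parameter:", X[y.argmax()][0])
```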

[302] Teaching the Teacher: Improving Neural Network Distillability for Symbolic Regression via Jacobian Regularization

Soumyadeep Dhar, Kei Sen Fong, Mehul Motani

Main category: cs.LG

TL;DR: A novel Jacobian-based regularizer improves symbolic distillation of neural networks, enhancing fidelity by 120% while maintaining accuracy.

DetailsMotivation: Distilling complex neural networks into interpretable symbolic formulas is challenging due to their brittleness and low-fidelity results.

Method: Introduces a Jacobian-based regularizer to encourage smoother, more distillable functions in the teacher network.

Result: Achieves a 120% relative improvement in R² score of distilled symbolic models without compromising teacher accuracy.

Conclusion: The method offers a practical and principled way to enhance interpretable model fidelity from neural networks.

Abstract: Distilling large neural networks into simple, human-readable symbolic formulas is a promising path toward trustworthy and interpretable AI. However, this process is often brittle, as the complex functions learned by standard networks are poor targets for symbolic discovery, resulting in low-fidelity student models. In this work, we propose a novel training paradigm to address this challenge. Instead of passively distilling a pre-trained network, we introduce a Jacobian-based regularizer that actively encourages the “teacher” network to learn functions that are not only accurate but also inherently smoother and more amenable to distillation. We demonstrate through extensive experiments on a suite of real-world regression benchmarks that our method is highly effective. By optimizing the regularization strength for each problem, we improve the $R^2$ score of the final distilled symbolic model by an average of 120% (relative) compared to the standard distillation pipeline, all while maintaining the teacher’s predictive accuracy. Our work presents a practical and principled method for significantly improving the fidelity of interpretable models extracted from complex neural networks.
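
The core mechanism, penalizing the teacher's input-output Jacobian during training, can be sketched as follows for a scalar-output regressor; the architecture and λ below are placeholders, not the paper's tuned values.

```python
import torch
import torch.nn as nn

def jacobian_penalty(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Squared norm of dy/dx, encouraging a smoother teacher that is easier to
    distill symbolically. For a scalar output, summing over the batch lets a
    single autograd call return all per-sample gradients at once.
    """
    x = x.detach().requires_grad_(True)
    y = model(x).sum()                       # each output depends only on its own input row
    (jac,) = torch.autograd.grad(y, x, create_graph=True)
    return jac.pow(2).sum(dim=1).mean()

teacher = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
x, target = torch.randn(32, 4), torch.randn(32, 1)
lam = 1e-2                                   # regularization strength (tuned per problem)
loss = nn.functional.mse_loss(teacher(x), target) + lam * jacobian_penalty(teacher, x)
loss.backward()
```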

[303] Label-free estimation of clinically relevant performance metrics under distribution shifts

Tim Flühmann, Alceu Bissoto, Trung-Dung Hoang, Lisa M. Koch

Main category: cs.LG

TL;DR: The paper introduces methods to estimate the full confusion matrix for performance monitoring of image classification models in clinical settings, addressing limitations of current accuracy-focused techniques under distribution shifts.

DetailsMotivation: Ground-truth labels are often unavailable in target datasets, making direct performance assessment infeasible. Existing methods focus on accuracy estimation and lack evaluation in clinical domains with class imbalances and dataset shifts.

Method: Generalizations of existing performance prediction methods are introduced to estimate the full confusion matrix. These are benchmarked on chest x-ray data under real-world and simulated distribution shifts.

Result: The proposed methods reliably predicted clinically relevant metrics under shifts but revealed failure modes of current techniques in simulated scenarios.

Conclusion: Better understanding of real-world deployment contexts is needed for effective postmarket surveillance of medical AI models.

Abstract: Performance monitoring is essential for safe clinical deployment of image classification models. However, because ground-truth labels are typically unavailable in the target dataset, direct assessment of real-world model performance is infeasible. State-of-the-art performance estimation methods address this by leveraging confidence scores to estimate the target accuracy. Despite being a promising direction, the established methods mainly estimate the model’s accuracy and are rarely evaluated in a clinical domain, where strong class imbalances and dataset shifts are common. Our contributions are twofold: First, we introduce generalisations of existing performance prediction methods that directly estimate the full confusion matrix. Then, we benchmark their performance on chest x-ray data in real-world distribution shifts as well as simulated covariate and prevalence shifts. The proposed confusion matrix estimation methods reliably predicted clinically relevant counting metrics on medical images under distribution shifts. However, our simulated shift scenarios exposed important failure modes of current performance estimation techniques, calling for a better understanding of real-world deployment contexts when implementing these performance monitoring techniques for postmarket surveillance of medical AI models.
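
One simple member of the estimator family the paper generalizes, sketched under a calibration assumption (the estimator choice and layout are illustrative, not the authors' exact method): sum predicted class probabilities within each predicted-label bucket to get expected confusion-matrix counts without ground-truth labels.

```python
import numpy as np

def expected_confusion_matrix(probs):
    """Label-free confusion-matrix estimate from softmax outputs.

    probs: (n_samples, n_classes) predicted probabilities on the
    unlabeled target set. If the model is well calibrated, the expected
    number of true-class-i samples among those predicted as class j is
    the sum of p(i | x) over samples whose argmax is j.
    Rows index true classes, columns predicted classes.
    """
    n_classes = probs.shape[1]
    preds = probs.argmax(axis=1)
    cm = np.zeros((n_classes, n_classes))
    for j in range(n_classes):
        cm[:, j] = probs[preds == j].sum(axis=0)
    return cm  # counting metrics, e.g. sensitivity cm[i, i] / cm[i].sum()
```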

[304] DO-EM: Density Operator Expectation Maximization

Adit Vishnu, Abhay Shastry, Dhruva Kashyap, Chiranjib Bhattacharyya

Main category: cs.LG

TL;DR: The paper introduces a scalable Expectation-Maximization (EM) framework for training latent variable models using density operators (DOMs) on classical hardware, addressing scalability issues in quantum generative modeling.

DetailsMotivation: Existing quantum training algorithms, like those for Quantum Boltzmann Machines, fail to scale to real-world data (e.g., MNIST). The EM algorithm's success in classical probabilistic models motivates its adaptation for DOMs.

Method: The authors reformulate the EM algorithm’s Expectation step as a quantum information projection (QIP) problem, solved using the Petz Recovery Map. They introduce the DO-EM algorithm, a Minorant-Maximization procedure optimizing a quantum evidence lower bound.

Result: The DO-EM algorithm ensures non-decreasing log-likelihood and is applied to train Quantum Interleaved Deep Boltzmann Machines (QiDBMs), which outperform classical DBMs on MNIST, reducing Fréchet Inception Distance by 40–60%.

Conclusion: The proposed DO-EM framework enables scalable training of DOMs on classical hardware, bridging the gap between quantum and classical generative modeling while achieving superior performance.

Abstract: Density operators, quantum generalizations of probability distributions, are gaining prominence in machine learning due to their foundational role in quantum computing. Generative modeling based on density operator models (DOMs) is an emerging field, but existing training algorithms – such as those for the Quantum Boltzmann Machine – do not scale to real-world data, such as the MNIST dataset. The Expectation-Maximization algorithm has played a fundamental role in enabling scalable training of probabilistic latent variable models on real-world datasets. In this paper, we develop an Expectation-Maximization framework to learn latent variable models defined through DOMs on classical hardware, with resources comparable to those used for probabilistic models, while scaling to real-world data. However, designing such an algorithm is nontrivial due to the absence of a well-defined quantum analogue to conditional probability, which complicates the Expectation step. To overcome this, we reformulate the Expectation step as a quantum information projection (QIP) problem and show that the Petz Recovery Map provides a solution under sufficient conditions. Using this formulation, we introduce the Density Operator Expectation Maximization (DO-EM) algorithm – an iterative Minorant-Maximization procedure that optimizes a quantum evidence lower bound. We show that the DO-EM algorithm ensures non-decreasing log-likelihood across iterations for a broad class of models. Finally, we present Quantum Interleaved Deep Boltzmann Machines (QiDBMs), a DOM that can be trained with the same resources as a DBM. When trained with DO-EM under Contrastive Divergence, a QiDBM outperforms larger classical DBMs in image generation on the MNIST dataset, achieving a 40–60% reduction in the Fréchet Inception Distance.

[305] G-Core: A Simple, Scalable and Balanced RLHF Trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao

Main category: cs.LG

TL;DR: G-Core is a scalable RLHF training framework addressing challenges like controller bottlenecks and dynamic workloads, improving efficiency and utilization.

DetailsMotivation: Existing RLHF systems struggle with scalability and adaptability in multi-modal workflows and dynamic workloads.

Method: Introduces a parallel controller model and dynamic placement schema for efficient orchestration and resource use.

Result: Successfully trained models for WeChat, showing improved scalability and robustness.

Conclusion: G-Core advances RLHF training, supporting future large-scale human-aligned models.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present G-Core, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.

[306] Quantifying surprise in clinical care: Detecting highly informative events in electronic health records with foundation models

Michael C. Burkhart, Bashar Ramadan, Luke Solo, William F. Parker, Brett K. Beaulieu-Jones

Main category: cs.LG

TL;DR: A foundation model-derived method identifies informative tokens and events in EHRs, improving anomaly detection and prediction of patient outcomes while aiding model interpretability.

DetailsMotivation: To enhance the identification of significant events in EHRs beyond rule-based methods, focusing on context and improving predictive accuracy.

Method: Uses a foundation model to analyze EHR data in the full context of a patient’s hospitalization, flagging anomalies and assessing event informativeness.

Result: Identifies significant events for predicting outcomes and allows safe dropping of less informative data; aids in interpreting prognostic models.

Conclusion: The method improves EHR analysis by leveraging context and informativeness for better predictions and interpretability.

Abstract: We present a foundation model-derived method to identify highly informative tokens and events in electronic health records. Our approach considers incoming data in the entire context of a patient’s hospitalization and so can flag anomalous events that rule-based approaches would consider within a normal range. We demonstrate that the events our model flags are significant for predicting downstream patient outcomes and that a fraction of events identified as carrying little information can safely be dropped. Additionally, we show how informativeness can help interpret the predictions of prognostic models trained on foundation model-derived representations.
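
A hedged sketch of the underlying quantity: per-token surprisal under a causal foundation model over a patient's event stream. The paper's tokenization and flagging rules are not reproduced; this only shows how informativeness scores could be computed from model log-probabilities.

```python
import torch

def token_surprisal(logits, targets):
    """Per-token surprisal (negative log-likelihood in nats).

    logits: (seq_len, vocab_size) next-token predictions from a causal
    model over a patient's event sequence; targets: (seq_len,) the events
    that actually occurred. High-surprisal events are candidates for
    'highly informative'; the threshold is left to the caller, and
    low-surprisal events are the ones that could safely be dropped.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    return -log_probs[torch.arange(targets.numel()), targets]
```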

[307] Tapping into the Black Box: Uncovering Aligned Representations in Pretrained Neural Networks

Maciej Satkiewicz

Main category: cs.LG

TL;DR: ReLU networks implicitly learn a linear model, and its decision boundary can be approximated and visualized using modified backward passes, revealing interpretable patterns.

DetailsMotivation: To uncover and interpret the implicit linear models learned by ReLU networks, enhancing understanding of neural network decision-making.

Method: Formally describe the implicit model and modify the backward pass to pull its decision boundary back to input space, generating excitation pullbacks.

Result: Excitation pullbacks reveal high-resolution, interpretable features aligned with perceptual patterns in ImageNet-pretrained architectures.

Conclusion: Neural networks rely on interpretable patterns, recoverable post-training, with implications for knowledge discovery and dependable AI systems.

Abstract: In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into. We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with a certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment on a number of popular ImageNet-pretrained deep architectures. This strongly suggests that neural networks do, in fact, rely on learned interpretable patterns that can be recovered after training. Thus, our findings may have profound implications for knowledge discovery and the development of dependable artificial systems.
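
A sketch of the general mechanism only: the abstract does not spell out the paper's exact backward-pass modification, so a guided-backprop-style gate stands in for it. The pattern is to swap the ReLU backward rule via a custom autograd function, then take the gradient of a class logit with respect to the input.

```python
import torch

class GatedReLU(torch.autograd.Function):
    """ReLU whose backward rule is replaced by a hand-chosen gate.

    Forward is a standard ReLU; backward additionally gates by the sign
    of the incoming gradient (as in guided backprop) -- an illustrative
    stand-in for the paper's excitation rule.
    """
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x > 0) * (grad_out > 0)

def pullback(model, x, class_idx):
    # Gradient of one logit w.r.t. the input, assuming the model's ReLUs
    # have been swapped for GatedReLU.apply beforehand.
    x = x.clone().requires_grad_(True)
    model(x)[0, class_idx].backward()
    return x.grad
```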

[308] PAF-Net: Phase-Aligned Frequency Decoupling Network for Multi-Process Manufacturing Quality Prediction

Yang Luo, Haoyang Luan, Haoyun Pan, Yongquan Jia, Xiaofeng Gao, Guihai Chen

Main category: cs.LG

TL;DR: PAF-Net is a frequency decoupled time series prediction framework addressing challenges in multi-process manufacturing, outperforming baselines with lower MSE and MAE.

DetailsMotivation: Accurate quality prediction in multi-process manufacturing is hindered by time-lagged interactions, overlapping operations, and inter-process dependencies.

Method: PAF-Net uses phase-correlation alignment, frequency independent patch attention with DCT, and frequency decoupled cross attention to address these challenges.

Result: PAF-Net achieves 7.06% lower MSE and 3.88% lower MAE than 10 baselines on 4 real-world datasets.

Conclusion: PAF-Net effectively resolves manufacturing prediction challenges and demonstrates superior performance.

Abstract: Accurate quality prediction in multi-process manufacturing is critical for industrial efficiency but hindered by three core challenges: time-lagged process interactions, overlapping operations with mixed periodicity, and inter-process dependencies in shared frequency bands. To address these, we propose PAF-Net, a frequency decoupled time series prediction framework with three key innovations: (1) A phase-correlation alignment method guided by frequency domain energy to synchronize time-lagged quality series, resolving temporal misalignment. (2) A frequency independent patch attention mechanism paired with Discrete Cosine Transform (DCT) decomposition to capture heterogeneous operational features within individual series. (3) A frequency decoupled cross attention module that suppresses noise from irrelevant frequencies, focusing exclusively on meaningful dependencies within shared bands. Experiments on 4 real-world datasets demonstrate PAF-Net’s superiority. It outperforms 10 well-established baselines, achieving 7.06% lower MSE and 3.88% lower MAE. Our code is available at https://github.com/StevenLuan904/PAF-Net-Official.
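
A minimal NumPy sketch of plain phase-correlation alignment, the first of the three components; the paper's frequency-domain energy guidance is omitted, so treat this as an illustration of the mechanism rather than the method.

```python
import numpy as np

def phase_align(reference, series):
    """Estimate and undo the time lag of `series` relative to `reference`.

    Phase correlation: the inverse FFT of the normalized cross-power
    spectrum peaks at the dominant circular shift between the two
    quality series; rolling by that lag synchronizes them.
    """
    F_ref, F_ser = np.fft.fft(reference), np.fft.fft(series)
    cross = F_ref * np.conj(F_ser)
    corr = np.fft.ifft(cross / (np.abs(cross) + 1e-12)).real
    lag = int(np.argmax(corr))
    return np.roll(series, lag)
```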

[309] RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Zijing Zhang, Ziyang Chen, Mingxiao Li, Zhaopeng Tu, Xiaolong Li

Main category: cs.LG

TL;DR: RLVMR integrates process-level supervision into RL, improving reasoning and efficiency in autonomous agents, achieving state-of-the-art results.

DetailsMotivation: Address the inefficiency and brittleness of RL methods that optimize only for final task success, reinforcing flawed reasoning paths.

Method: Introduces RLVMR, a framework with dense, process-level supervision, rewarding verifiable meta-reasoning behaviors and combining process-centric rewards with final outcomes.

Result: Achieves 83.6% success rate on ALFWorld and ScienceWorld benchmarks, with improved reasoning quality and reduced redundant actions.

Conclusion: RLVMR leads to more robust, efficient, and interpretable agents by enhancing reasoning and error recovery.

Abstract: The development of autonomous agents for complex, long-horizon tasks is a central goal in AI. However, dominant training paradigms face a critical limitation: reinforcement learning (RL) methods that optimize solely for final task success often reinforce flawed or inefficient reasoning paths, a problem we term inefficient exploration. This leads to agents that are brittle and fail to generalize, as they learn to find solutions without learning how to reason coherently. To address this, we introduce RLVMR, a novel framework that integrates dense, process-level supervision into end-to-end RL by rewarding verifiable, meta-reasoning behaviors. RLVMR equips an agent to explicitly tag its cognitive steps, such as planning, exploration, and reflection, and provides programmatic, rule-based rewards for actions that contribute to effective problem-solving. These process-centric rewards are combined with the final outcome signal and optimized using a critic-free policy gradient method. On the challenging ALFWorld and ScienceWorld benchmarks, RLVMR achieves new state-of-the-art results, with our 7B model reaching an 83.6% success rate on the most difficult unseen task split. Our analysis confirms these gains stem from improved reasoning quality, including significant reductions in redundant actions and enhanced error recovery, leading to more robust, efficient, and interpretable agents.
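
An illustrative (not the paper's) way to blend rule-checked meta-reasoning rewards with the outcome signal; the tag names, verification flag, and mixing weight below are assumptions.

```python
def shaped_return(outcome, steps, w_process=0.3):
    """Blend verifiable process rewards with the final task outcome.

    steps: list of dicts with a 'tag' in {'plan', 'explore', 'reflect'}
    and a 'verified' flag produced by programmatic rule checks. The
    averaging and the mixing weight are stand-ins for RLVMR's exact
    process-centric reward design; the result would feed a critic-free
    policy-gradient update.
    """
    if not steps:
        return outcome
    process = sum(1.0 if s["verified"] else -0.5 for s in steps) / len(steps)
    return (1 - w_process) * outcome + w_process * process
```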

[310] Decentralized Differentially Private Power Method

Andrew Campbell, Anna Scaglione, Sean Peisert

Main category: cs.LG

TL;DR: A decentralized, differentially private method (D-DP-PM) for PCA in multi-agent networks, ensuring privacy and collaborative eigenvector estimation without a central aggregator.

DetailsMotivation: Addresses the challenge of decentralized PCA where agents observe only subsets of data dimensions, ensuring privacy and collaboration.

Method: Agents share local embeddings with Gaussian noise, leveraging random initialization for privacy. The method guarantees $(\epsilon,\delta)$-DP and analyzes network topology impact.

Result: Proven privacy guarantees and convergence rates. Experiments show superior privacy-utility tradeoffs, especially for $\epsilon\in[2,5]$.

Conclusion: D-DP-PM offers efficient, privacy-preserving PCA in decentralized settings, balancing privacy and utility effectively.

Abstract: We propose a novel Decentralized Differentially Private Power Method (D-DP-PM) for performing Principal Component Analysis (PCA) in networked multi-agent settings. Unlike conventional decentralized PCA approaches where each agent accesses the full n-dimensional sample space, we address the challenging scenario where each agent observes only a subset of dimensions through row-wise data partitioning. Our method ensures $(\epsilon,\delta)$-Differential Privacy (DP) while enabling collaborative estimation of global eigenvectors across the network without requiring a central aggregator. We achieve this by having agents share only local embeddings of the current eigenvector iterate, leveraging both the inherent privacy from random initialization and carefully calibrated Gaussian noise additions. We prove that our algorithm satisfies the prescribed $(\epsilon,\delta)$-DP guarantee and establish convergence rates that explicitly characterize the impact of the network topology. Our theoretical analysis, based on linear dynamics and high-dimensional probability theory, provides tight bounds on both privacy and utility. Experiments on real-world datasets demonstrate that D-DP-PM achieves superior privacy-utility tradeoffs compared to naive local DP approaches, with particularly strong performance in moderate privacy regimes ($\epsilon\in[2, 5]$). The method converges rapidly, allowing practitioners to trade iterations for enhanced privacy while maintaining competitive utility.
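
A centralized sketch of the private core; the decentralization across row-partitioned agents and the exact noise calibration from the $(\epsilon,\delta)$ budget are left out.

```python
import numpy as np

def dp_power_method(A, n_iters, sigma, seed=0):
    """Noisy power iteration for the top eigenvector.

    Gaussian noise of scale sigma (calibrated offline from the privacy
    budget) is added to every matrix-vector product. D-DP-PM further
    splits A row-wise across agents and replaces this central product
    with local embeddings exchanged over the network.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        w = A @ v + sigma * rng.standard_normal(A.shape[0])
        v = w / np.linalg.norm(w)
    return v
```

The random initialization itself contributes privacy here, matching the abstract's observation; more iterations improve utility at the cost of budget.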

[311] A Bit of Freedom Goes a Long Way: Classical and Quantum Algorithms for Reinforcement Learning under a Generative Model

Andris Ambainis, Joao F. Doriguello, Debbie Lim

Main category: cs.LG

TL;DR: Novel classical and quantum online algorithms for learning MDPs, leveraging hybrid exploration-generative RL and avoiding traditional RL paradigms, yielding improved regret bounds.

DetailsMotivation: To improve regret bounds in RL for MDPs by avoiding traditional paradigms like optimism and posterior sampling, and leveraging quantum advantages.

Method: Hybrid exploration-generative RL model with classical and quantum algorithms for approximating optimal policies under a generative model.

Result: Quantum algorithms achieve logarithmic regret in finite-horizon MDPs and poly-logarithmic regret in infinite-horizon MDPs, outperforming classical bounds.

Conclusion: The proposed algorithms offer significant improvements in regret bounds, especially for quantum approaches, and generalize to compact state spaces.

Abstract: We propose novel classical and quantum online algorithms for learning finite-horizon and infinite-horizon average-reward Markov Decision Processes (MDPs). Our algorithms are based on a hybrid exploration-generative reinforcement learning (RL) model wherein the agent can, from time to time, freely interact with the environment in a generative sampling fashion, i.e., by having access to a “simulator”. By employing known classical and new quantum algorithms for approximating optimal policies under a generative model within our learning algorithms, we show that it is possible to avoid several paradigms from RL like “optimism in the face of uncertainty” and “posterior sampling” and instead compute and use optimal policies directly, which yields better regret bounds compared to previous works. For finite-horizon MDPs, our quantum algorithms obtain regret bounds which only depend logarithmically on the number of time steps $T$, thus breaking the $O(\sqrt{T})$ classical barrier. This matches the time dependence of the prior quantum works of Ganguly et al. (arXiv'23) and Zhong et al. (ICML'24), but with improved dependence on other parameters like state space size $S$ and action space size $A$. For infinite-horizon MDPs, our classical and quantum bounds still maintain the $O(\sqrt{T})$ dependence but with better $S$ and $A$ factors. Nonetheless, we propose a novel measure of regret for infinite-horizon MDPs with respect to which our quantum algorithms have $\operatorname{poly}\log{T}$ regret, exponentially better compared to classical algorithms. Finally, we generalise all of our results to compact state spaces.

[312] OWLViz: An Open-World Benchmark for Visual Question Answering

Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, Viet Dac Lai

Main category: cs.LG

TL;DR: A benchmark for Open World Visual Question Answering (OWLViz) is introduced, highlighting the gap between human and AI performance.

DetailsMotivation: To evaluate AI's ability in complex multimodal tasks like visual understanding, web exploration, and tool usage.

Method: OWLViz benchmark with concise, unambiguous queries tested on humans and AI models like Gemini 2.0.

Result: Humans scored 69.2% accuracy, while Gemini 2.0 achieved only 26.6%, showing AI’s limitations.

Conclusion: The performance gap underscores the need for advancing multimodal AI systems in tool selection and reasoning.

Abstract: We present a challenging benchmark for the Open WorLd VISual question answering (OWLViz) task. OWLViz presents concise, unambiguous queries that require integrating multiple capabilities, including visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems’ ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.

[313] Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining

Deyu Cao, Samin Aref

Main category: cs.LG

TL;DR: The paper addresses resource efficiency in large language models by proposing an ultra-low-bit quantization method that improves upon ApiQ, reducing accuracy degradation without full retraining.

DetailsMotivation: Concerns about the environmental and economic impact of large language models' resource usage during inference drive the need for efficient compression techniques like quantization.

Method: The study combines quantization-aware training with ApiQ’s partial training, identifies limitations, and introduces a saliency-aware regularization term for ultra-low-bit quantization.

Result: The proposed method reduces ApiQ’s accuracy degradation by 10.85% (LLaMA 7B) and 7.54% (LLaMA 13B).

Conclusion: The novel method enhances ApiQ’s performance without full retraining, offering a resource-efficient solution for large language models.

Abstract: The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods compress model size by replacing high-precision parameters with quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ’s level. First, we look into combining existing quantization-aware training techniques with ApiQ’s partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be feasible through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces ApiQ’s accuracy degradation by 10.85% and 7.54%, respectively. A Python implementation of the proposed quantization method is publicly available on GitHub https://github.com/TokuyuSou/ULB-SAPR.
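
A hedged sketch of what a saliency-aware regularization term can look like; the concrete saliency scores and how the term enters ApiQ-style partial training are assumptions.

```python
import torch

def saliency_weighted_penalty(weight, weight_q, saliency, lam=1.0):
    """Quantization-error penalty that prioritizes impactful parameters.

    weight: full-precision tensor; weight_q: its current quantized
    version; saliency: per-parameter importance scores (e.g., squared
    gradients on calibration data -- the concrete score is an assumption
    here). High-saliency parameters incur larger penalties, steering the
    partial training toward preserving them.
    """
    return lam * (saliency * (weight - weight_q) ** 2).sum()
```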

[314] Hyperbolic Graph Learning: A Comprehensive Review

Menglin Yang, Min Zhou, Tong Zhang, Jiahong Liu, Zhihao Li, Lujia Pan, Hui Xiong, Irwin King

Main category: cs.LG

TL;DR: A survey on Hyperbolic Graph Learning (HGL) reviews methods, applications, and challenges, highlighting its potential for capturing complex relational structures in non-Euclidean data.

DetailsMotivation: Traditional Euclidean graph representation struggles with hierarchical and non-Euclidean structures, prompting exploration of hyperbolic geometry for richer representations.

Method: Categorizes HGL into hyperbolic graph embeddings, graph neural networks, and emerging paradigms, with applications in recommender systems, bioinformatics, and more.

Result: Demonstrates hyperbolic geometry’s effectiveness in real-world graph learning tasks and identifies key challenges for future research.

Conclusion: HGL offers promising interdisciplinary opportunities, with ongoing challenges in scalability, integration with foundation models, and handling complex data.

Abstract: Graph representation learning in Euclidean space, despite its widespread adoption and proven utility in many domains, often struggles to effectively capture the inherent hierarchical and complex relational structures prevalent in real-world data, particularly for datasets exhibiting a highly non-Euclidean latent anatomy or power-law distributions. Hyperbolic geometry, with its constant negative curvature and exponential growth property, naturally accommodates such structures, offering a promising alternative for learning rich graph representations. This survey paper provides a comprehensive review of the rapidly evolving field of Hyperbolic Graph Learning (HGL). We systematically categorize and analyze existing methods broadly dividing them into (1) hyperbolic graph embedding-based techniques, (2) graph neural network-based hyperbolic models, and (3) emerging paradigms. Beyond methodologies, we extensively discuss diverse applications of HGL across multiple domains, including recommender systems, knowledge graphs, bioinformatics, and other relevant scenarios, demonstrating the broad applicability and effectiveness of hyperbolic geometry in real-world graph learning tasks. Most importantly, we identify several key challenges that serve as directions for advancing HGL, including handling complex data structures, developing geometry-aware learning objectives, ensuring trustworthy and scalable implementations, and integrating with foundation models, e.g., large language models. We highlight promising research opportunities in this exciting interdisciplinary area. A comprehensive repository can be found at https://github.com/digailab/awesome-hyperbolic-graph-learning.

[315] An Introduction to Modern Statistical Learning

Joseph G. Makin

Main category: cs.LG

TL;DR: A unified introduction to statistical learning, connecting classical models (GMM, HMM) to modern neural networks (VAE, diffusion models) under a single framework.

DetailsMotivation: To bridge the gap between isolated explanations of ML algorithms and their connections to classical statistical models, while providing a consistent notation for beginners.

Method: Assimilates various models into a single framework for inference and learning, showing transformations between models with minimal alterations.

Result: Aims to offer a straight-line path from basics to modern models, complementing comprehensive texts like Bishop’s.

Conclusion: The work seeks to enhance understanding by unifying classical and modern ML, making advanced concepts more accessible.

Abstract: This work in progress aims to provide a unified introduction to statistical learning, building up slowly from classical models like the GMM and HMM to modern neural networks like the VAE and diffusion models. There are today many internet resources that explain this or that new machine-learning algorithm in isolation, but they do not (and cannot, in so brief a space) connect these algorithms with each other or with the classical literature on statistical models, out of which the modern algorithms emerged. Also conspicuously lacking is a single notational system which, although unfazing to those already familiar with the material (like the authors of these posts), raises a significant barrier to the novice’s entry. Likewise, I have aimed to assimilate the various models, wherever possible, to a single framework for inference and learning, showing how (and why) to change one model into another with minimal alteration (some of them novel, others from the literature). Some background is of course necessary. I have assumed the reader is familiar with basic multivariable calculus, probability and statistics, and linear algebra. The goal of this book is certainly not completeness, but rather to draw a more or less straight-line path from the basics to the extremely powerful new models of the last decade. The goal then is to complement, not replace, such comprehensive texts as Bishop’s Pattern Recognition and Machine Learning, which is now 15 years old.

[316] Bridging Privacy and Robustness for Trustworthy Machine Learning

Xiaojin Zhang, Wei Chen

Main category: cs.LG

TL;DR: The paper explores the relationships between Local Differential Privacy (LDP), Maximum Bayesian Privacy (MBP), and Average Bayesian Privacy (ABP), linking them to algorithmic robustness in PAC learning. It shows that privacy mechanisms inherently provide robustness and quantifies how privacy leakage affects input robustness.

DetailsMotivation: To address the need for robust privacy protection and algorithmic resilience in machine learning, especially against sophisticated adversaries with prior knowledge.

Method: Systematic theoretical investigation of LDP, MBP, and ABP, including formalizing their relationships and proving PAC robustness from MBP.

Result: Key findings include formalized LDP-MBP relationships, novel MBP-ABP bounds, and proof of PAC robustness from MBP. Privacy leakage’s impact on input robustness is also quantified.

Conclusion: The work unifies privacy and robustness theory, offering a framework for optimizing their trade-off, enabling more secure and resilient machine learning systems.

Abstract: The widespread adoption of machine learning necessitates robust privacy protection alongside algorithmic resilience. While Local Differential Privacy (LDP) provides foundational guarantees, sophisticated adversaries with prior knowledge demand more nuanced Bayesian privacy notions, such as Maximum Bayesian Privacy (MBP) and Average Bayesian Privacy (ABP), first introduced by Zhang et al. (2022). Concurrently, machine learning systems require inherent robustness against data perturbations and adversarial manipulations. This paper systematically investigates the intricate theoretical relationships among LDP, MBP, and ABP. Crucially, we bridge these privacy concepts with algorithmic robustness, particularly within the Probably Approximately Correct (PAC) learning framework. Our work demonstrates that privacy-preserving mechanisms inherently confer PAC robustness. We present key theoretical results, including the formalization of the established LDP-MBP relationship, novel bounds between MBP and ABP, and a proof demonstrating PAC robustness from MBP. Furthermore, we establish a novel theoretical relationship quantifying how privacy leakage directly influences an algorithm’s input robustness. These results provide a unified theoretical framework for understanding and optimizing the privacy-robustness trade-off, paving the way for the development of more secure, trustworthy, and resilient machine learning systems.

[317] Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Bokun Wang, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou

Main category: cs.LG

TL;DR: FP8 training in federated learning reduces computational and communication costs while maintaining model accuracy.

DetailsMotivation: To leverage FP8 for efficient on-device training and reduce client-server communication in federated learning.

Method: Combines FP8 client training with a global FP32 server model, supported by convergence analysis.

Result: Achieves at least 2.9x communication reduction compared to FP32 baseline, with same accuracy.

Conclusion: FP8 is effective for federated learning, offering significant efficiency gains.

Abstract: Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational cost compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This approach brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline to achieve the same trained model accuracy.
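
A software simulation of FP8-style rounding for studying the weight-compression effect (real deployments use hardware float8 kernels; subnormals and exact saturation behavior are skipped here).

```python
import torch

def simulate_fp8(t, exp_bits=4, man_bits=3):
    """Round a tensor onto an FP8-like grid (E4M3-shaped by default).

    Decomposes each value into sign, power-of-two exponent, and mantissa,
    then quantizes the mantissa to man_bits fractional bits and clamps
    the exponent to the representable range. Useful for emulating
    client-to-server weight compression in experiments.
    """
    sign = torch.sign(t)
    mag = t.abs().clamp(min=1e-30)
    lo, hi = -(2 ** (exp_bits - 1) - 2), 2 ** (exp_bits - 1)
    exp = torch.floor(torch.log2(mag)).clamp(lo, hi)
    mantissa = torch.round(mag / 2**exp * 2**man_bits) / 2**man_bits
    return sign * mantissa * 2**exp
```

In the paper's setup the server would keep an FP32 master model and aggregate such compressed client updates; that aggregation logic is not shown.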

[318] The Geometry of Queries: Query-Based Innovations in Retrieval-Augmented Generation for Healthcare QA

Eric Yang, Jonathan Amar, Jong Ha Lee, Bhawesh Kumar, Yugang Jia

Main category: cs.LG

TL;DR: QB-RAG enhances RAG systems in healthcare QA by pre-aligning queries with a curated database, ensuring accuracy and reliability.

DetailsMotivation: To improve the accuracy and reliability of LLMs in healthcare question-answering by addressing the challenge of aligning user queries with answerable questions.

Method: Introduces QB-RAG, which includes an LLM-based filtering mechanism to curate a database of relevant questions and a framework for evaluating retrieval and response quality.

Result: QB-RAG outperforms existing retrieval methods in healthcare QA, demonstrating superior performance in faithfulness, relevance, and guideline adherence.

Conclusion: QB-RAG is a practical and effective solution for building trustworthy digital health applications.

Abstract: Deploying Large Language Models (LLMs) for healthcare question answering requires robust methods to ensure accuracy and reliability. This work introduces Query-Based Retrieval Augmented Generation (QB-RAG), a framework for enhancing Retrieval-Augmented Generation (RAG) systems in healthcare question-answering by pre-aligning user queries with a database of curated, answerable questions derived from healthcare content. A key component of QB-RAG is an LLM-based filtering mechanism that ensures that only relevant and answerable questions are included in the database, enabling reliable reference query generation at scale. We provide theoretical motivation for QB-RAG, conduct a comparative analysis of existing retrieval enhancement techniques, and introduce a generalizable, comprehensive evaluation framework that assesses both the retrieval effectiveness and the quality of the generated response based on faithfulness, relevance, and adherence to the guideline. Our empirical evaluation on a healthcare data set demonstrates the superior performance of QB-RAG compared to existing retrieval methods, highlighting its practical value in building trustworthy digital health applications for health question-answering.
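
A minimal sketch of the retrieval step under the query-to-question alignment idea: embed the incoming query and match it against the curated, answerable questions. The embedding source, cosine-similarity rule, and data layout are illustrative assumptions.

```python
import numpy as np

def qb_retrieve(query_vec, question_vecs, answers, k=3):
    """Match a user query against a bank of pre-vetted questions.

    query_vec: (d,) embedding of the incoming query; question_vecs:
    (n, d) embeddings of curated, answerable questions (already filtered
    by an LLM in the paper's pipeline); answers: content keyed to each
    question. Returns the top-k (answer, similarity) pairs.
    """
    q = query_vec / np.linalg.norm(query_vec)
    Q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    sims = Q @ q
    top = np.argsort(-sims)[:k]
    return [(answers[i], float(sims[i])) for i in top]
```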

[319] BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun

Main category: cs.LG

TL;DR: BlockFFN introduces a novel MoE architecture with improved routing and sparsity for efficient LLM acceleration, achieving high token- and chunk-level sparsity and significant speedup on end-side devices.

DetailsMotivation: Address the computational inefficiency and inflexibility of vanilla MoE architectures, particularly for low-resource conditions and mainstream acceleration techniques.

Method: Proposes BlockFFN with ReLU-RMSNorm routing, CLS-aware training objectives, and efficient acceleration kernels combining sparsity and speculative decoding.

Result: Achieves over 80% token-level sparsity, 70% 8-token chunk-level sparsity, and up to 3.67x speedup on end-side devices.

Conclusion: BlockFFN outperforms MoE baselines, offering a practical solution for efficient LLM deployment.

Abstract: To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to a 3.67x speedup over dense models on real end-side devices. All codes and checkpoints are available publicly (https://github.com/thunlp/BlockFFN).
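
A sketch of the routing idea as described in the abstract, with simplifications (no learned gain parameters, arbitrary dimensions): linear expert scores pass through ReLU for differentiable sparsity, then an RMSNorm for scale stability. Chunk-level sparsity can then be measured as the fraction of experts that stay inactive across a run of consecutive tokens.

```python
import torch

def rms_norm(x, eps=1e-6):
    # Parameter-free RMS normalization over the last dimension.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class ReluRouter(torch.nn.Module):
    """Differentiable sparse router: linear scores -> ReLU -> RMSNorm.

    ReLU zeroes out experts instead of a hard top-k, keeping routing
    differentiable and flexible; RMSNorm stabilizes the scale of the
    surviving activations. Details beyond the abstract are assumptions.
    """
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.score = torch.nn.Linear(d_model, n_experts)

    def forward(self, x):
        weights = rms_norm(torch.relu(self.score(x)))
        return weights  # zero entries = inactive experts for this token
```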

[320] SDBA: A Stealthy and Long-Lasting Durable Backdoor Attack in Federated Learning

Minyeong Choe, Cheolhee Park, Changho Seo, Hyunil Kim

Main category: cs.LG

TL;DR: The paper introduces SDBA, a backdoor attack for NLP tasks in federated learning, showing high durability and stealth by targeting vulnerable layers in models like LSTM, GPT-2, and T5.

DetailsMotivation: Addressing the vulnerability of federated learning to backdoor attacks in NLP tasks, where research is limited.

Method: SDBA uses layer-wise and top-k% gradient masking to inject backdoors into vulnerable layers of LSTM, GPT-2, and T5 models.

Result: SDBA outperforms existing backdoors in durability and bypasses defenses, especially in transformer models like GPT-2.

Conclusion: The findings emphasize the need for stronger defenses in NLP-based federated learning systems.

Abstract: Federated learning is a promising approach for training machine learning models while preserving data privacy. However, its distributed nature makes it vulnerable to backdoor attacks, particularly in NLP tasks, where related research remains limited. This paper introduces SDBA, a novel backdoor attack mechanism designed for NLP tasks in federated learning environments. Through a systematic analysis across LSTM and GPT-2 models, we identify the most vulnerable layers for backdoor injection and achieve both stealth and long-lasting durability by applying layer-wise gradient masking and top-k% gradient masking. Also, to evaluate the task generalizability of SDBA, we additionally conduct experiments on the T5 model. Experiments on next-token prediction, sentiment analysis, and question answering tasks show that SDBA outperforms existing backdoors in terms of durability and effectively bypasses representative defense mechanisms, demonstrating notable performance in transformer-based models such as GPT-2. These results highlight the urgent need for robust defense strategies in NLP-based federated learning systems.
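
A sketch of top-k% gradient masking, one of the two masking schemes named in the abstract (applied globally across layers here for brevity; SDBA also masks layer-wise to target the most vulnerable layers).

```python
import torch

def topk_gradient_mask(grads, ratio=0.01):
    """Keep only the largest-magnitude fraction of gradient entries.

    grads: list of per-layer gradient tensors from the malicious
    client's backdoor objective. Confining the update to the top
    ratio of coordinates is what makes the injected backdoor both
    stealthy and durable in the paper's threat model.
    """
    flat = torch.cat([g.flatten() for g in grads])
    k = max(1, int(ratio * flat.numel()))
    thresh = flat.abs().topk(k).values.min()
    return [g * (g.abs() >= thresh) for g in grads]
```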

[321] Inferring biological processes with intrinsic noise from cross-sectional data

Suryanarayana Maddu, Victor Chardès, Michael J. Shelley

Main category: cs.LG

TL;DR: The paper introduces Probability Flow Inference (PFI), a method to infer dynamical models from cross-sectional omics data, addressing noise and outperforming existing approaches.

DetailsMotivation: The challenge of inferring accurate dynamical models from noisy, stochastic biological data, where existing methods often compromise accuracy for simplicity.

Method: PFI infers the phase-space probability flow matching the underlying stochastic process, separating force from intrinsic noise while maintaining ODE-like inference ease.

Result: PFI provides unique solutions for Ornstein-Uhlenbeck processes and excels in high-dimensional stochastic reaction networks and cell differentiation dynamics.

Conclusion: PFI effectively balances accuracy and computational feasibility, outperforming state-of-the-art methods in noisy biological data inference.

Abstract: Inferring dynamical models from data continues to be a significant challenge in computational biology, especially given the stochastic nature of many biological processes. We explore a common scenario in omics, where statistically independent cross-sectional samples are available at a few time points, and the goal is to infer the underlying diffusion process that generated the data. Existing inference approaches often simplify or ignore noise intrinsic to the system, compromising accuracy for the sake of optimization ease. We circumvent this compromise by inferring the phase-space probability flow that shares the same time-dependent marginal distributions as the underlying stochastic process. Our approach, probability flow inference (PFI), disentangles force from intrinsic stochasticity while retaining the algorithmic ease of ODE inference. Analytically, we prove that for Ornstein-Uhlenbeck processes the regularized PFI formalism yields a unique solution in the limit of well-sampled distributions. In practical applications, we show that PFI enables accurate parameter and force estimation in high-dimensional stochastic reaction networks, and that it allows inference of cell differentiation dynamics with molecular noise, outperforming state-of-the-art approaches.
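
The standard identity that makes this possible, stated for a process with constant diffusion $D$ (the paper's regularized, data-driven estimator is built on top of it):

```latex
% Both dynamics share the same time-dependent marginals p_t
% (they satisfy the same Fokker--Planck equation):
\mathrm{d}x = f(x,t)\,\mathrm{d}t + \sqrt{2D}\,\mathrm{d}W_t
\qquad\Longleftrightarrow\qquad
\dot{x} = f(x,t) - D\,\nabla_x \log p_t(x)
```

Fitting the deterministic flow on the right from cross-sectional samples retains ODE-style inference ease while keeping the force $f$ and the intrinsic noise $D$ separable.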

[322] Wavelet Meets Adam: Compressing Gradients for Memory-Efficient Training

Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li

Main category: cs.LG

TL;DR: Proposes Gradient Wavelet Transform (GWT) to reduce memory usage in LLM training without performance loss.

DetailsMotivation: Address memory challenges in LLM training caused by large parameters and memory-intensive optimizers like Adam.

Method: Applies wavelet transforms to gradients to reduce optimizer state memory requirements.

Result: GWT achieves state-of-the-art performance in memory usage and training efficiency.

Conclusion: GWT enables efficient LLM training without compromising performance.

Abstract: Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
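
A minimal sketch with a one-level Haar transform; the paper's wavelet family, transform depth, and how optimizer states map back are not pinned down here, and an even trailing dimension is assumed.

```python
import torch

def haar_step(g):
    """One level of a Haar wavelet transform over the last dimension.

    Splits a gradient into low-frequency averages and high-frequency
    details; optimizer states (e.g., Adam moments) can then be stored
    on the much smaller low-frequency part.
    """
    g = g.reshape(*g.shape[:-1], -1, 2)
    low = (g[..., 0] + g[..., 1]) / 2**0.5
    high = (g[..., 0] - g[..., 1]) / 2**0.5
    return low, high

def inverse_haar_step(low, high):
    # Exact reconstruction (the transform is orthonormal), used to map
    # compressed updates back to parameter space.
    a = (low + high) / 2**0.5
    b = (low - high) / 2**0.5
    return torch.stack([a, b], dim=-1).flatten(-2)
```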

[323] Unsupervised Learning in Echo State Networks for Input Reconstruction

Taiki Yamada, Yuichi Katori, Kantaro Fujiwara

Main category: cs.LG

TL;DR: Echo state networks (ESNs) can achieve input reconstruction (IR) via unsupervised learning (UL) without supervised targets, leveraging prior knowledge of ESN parameters. This reduces reliance on supervision and enables applications like dynamical system replication and noise filtering.

DetailsMotivation: The study aims to explore unsupervised learning for input reconstruction in ESNs, reducing dependency on supervised targets and expanding their applicability.

Method: The readout layer is trained to reconstruct input time series using UL, assuming known and invertible ESN parameters.

Result: IR via UL is feasible with prior knowledge of ESN parameters, enabling unsupervised applications like noise filtering and system replication.

Conclusion: This work establishes a new principle for ESNs, highlighting the role of parameter knowledge in reducing supervision and offering insights into brain-like computational mechanisms.

Abstract: Echo state networks (ESNs) are a class of recurrent neural networks in which only the readout layer is trainable, while the recurrent and input layers are fixed. This architectural constraint enables computationally efficient processing of time-series data. Traditionally, the readout layer in ESNs is trained using supervised learning with target outputs. In this study, we focus on input reconstruction (IR), where the readout layer is trained to reconstruct the input time series fed into the ESN. We show that IR can be achieved through unsupervised learning (UL), without access to supervised targets, provided that the ESN parameters are known a priori and satisfy invertibility conditions. This formulation allows applications relying on IR, such as dynamical system replication and noise filtering, to be reformulated within the UL framework via straightforward integration with existing algorithms. Our results suggest that prior knowledge of ESN parameters can reduce reliance on supervision, thereby establishing a new principle: not only by fixing part of the network parameters but also by exploiting their specific values. Furthermore, our UL-based algorithms for input reconstruction and related tasks are suitable for autonomous processing, offering insights into how analogous computational mechanisms might operate in the brain in principle. These findings contribute to a deeper understanding of the mathematical foundations of ESNs and their relevance to models in computational neuroscience.

[324] Year-over-Year Developments in Financial Fraud Detection via Deep Learning: A Systematic Literature Review

Yisong Chen, Chuqing Zhao, Yixin Xu, Chuanhao Nie, Yixin Zhang

Main category: cs.LG

TL;DR: A systematic review of deep learning techniques for financial fraud detection, analyzing 57 studies (2019-2024). Highlights effective models (CNNs, LSTMs, transformers) and performance metrics, while addressing challenges like imbalanced data and ethical concerns.

DetailsMotivation: To evaluate advancements and effectiveness of deep learning in financial fraud detection, addressing critical issues in the financial sector.

Method: Kitchenham systematic literature review approach, analyzing 57 studies from 2019-2024.

Result: Deep learning models (CNNs, LSTMs, transformers) show effectiveness across domains like credit card fraud and insurance claims. Challenges include data imbalance and interpretability.

Conclusion: Identifies gaps and future directions, offering insights for researchers and practitioners in advancing DL for fraud detection.

Abstract: This paper systematically reviews advancements in deep learning (DL) techniques for financial fraud detection, a critical issue in the financial sector. Using the Kitchenham systematic literature review approach, 57 studies published between 2019 and 2024 were analyzed. The review highlights the effectiveness of various deep learning models such as Convolutional Neural Networks, Long Short-Term Memory, and transformers across domains such as credit card transactions, insurance claims, and financial statement audits. Performance metrics such as precision, recall, F1-score, and AUC-ROC were evaluated. Key themes explored include the impact of data privacy frameworks and advancements in feature engineering and data preprocessing. The study emphasizes challenges such as imbalanced datasets, model interpretability, and ethical considerations, alongside opportunities for automation and privacy-preserving techniques such as blockchain integration and Principal Component Analysis. By examining trends over the past five years, this review identifies critical gaps and promising directions for advancing DL applications in financial fraud detection, offering actionable insights for researchers and practitioners.

[325] Utilizing Evolution Strategies to Train Transformers in Reinforcement Learning

Matyáš Lorenc, Roman Neruda

Main category: cs.LG

TL;DR: Evolution strategies successfully train transformer-based agents in RL, achieving strong performance in MuJoCo and Atari environments.

DetailsMotivation: To test if evolution strategies can effectively train complex transformer-based models in reinforcement learning.

Method: Used OpenAI’s parallelizable evolution strategy to train Decision Transformer in MuJoCo and Atari environments.

Result: The strategy achieved strong results, producing high-performing agents.

Conclusion: Evolution strategies are capable of training complex models like transformers in RL.

Abstract: We explore the capability of evolution strategies to train an agent with a policy based on a transformer architecture in a reinforcement learning setting. We performed experiments using OpenAI’s highly parallelizable evolution strategy to train Decision Transformer in the MuJoCo Humanoid locomotion environment and in the environment of Atari games, testing the ability of this black-box optimization technique to train even such relatively large and complicated models (compared to those previously tested in the literature). The examined evolution strategy proved to be, in general, capable of achieving strong results and managed to produce high-performing agents, showcasing evolution’s ability to tackle the training of even such complex models.
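
The core of an OpenAI-style evolution strategy in NumPy, as a sketch: antithetic Gaussian perturbations of the (flattened) transformer policy parameters yield a gradient estimate from rollout returns alone, with no backpropagation through the policy.

```python
import numpy as np

def es_gradient_estimate(theta, fitness, sigma=0.05, n_pairs=16, seed=0):
    """Antithetic evolution-strategy gradient estimate.

    theta: flattened policy parameters; fitness: callable mapping a
    parameter vector to a scalar episodic return (e.g., a Decision
    Transformer rollout). Each perturbation pair is independent, which
    is what makes the method highly parallelizable.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        r_plus = fitness(theta + sigma * eps)
        r_minus = fitness(theta - sigma * eps)
        grad += (r_plus - r_minus) * eps
    return grad / (2 * n_pairs * sigma)  # ascend this to improve returns
```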

[326] Lightweight Online Adaption for Time Series Foundation Model Forecasts

Thomas L. Lee, William Toner, Rajkarn Singh, Artjom Joosen, Martin Asenov

Main category: cs.LG

TL;DR: ELF is a lightweight mechanism for online adaptation of foundation models (FMs) in time series forecasting, improving performance by efficiently using online feedback.

DetailsMotivation: Deployed FMs fail to adapt to current data despite available feedback, prompting the need for a method to enhance FM performance.

Method: ELF consists of ELF-Forecaster (learns current data distribution) and ELF-Weighter (combines FM and ELF-Forecaster forecasts).

Result: ELF consistently improves performance across standard time series datasets when combined with various FMs.

Conclusion: Efficient usage of online feedback via ELF enhances FM forecasts, demonstrating its practical value.

Abstract: Foundation models (FMs) have emerged as a promising approach for time series forecasting. While effective, FMs typically remain fixed during deployment due to the high computational costs of learning them online. Consequently, deployed FMs fail to adapt their forecasts to current data characteristics, despite the availability of online feedback from newly arriving data. This raises the question of whether FM performance can be enhanced by the efficient usage of this feedback. We propose ELF to answer this question. ELF is a lightweight mechanism for the online adaption of FM forecasts in response to online feedback. ELF consists of two parts: a) the ELF-Forecaster which is used to learn the current data distribution; and b) the ELF-Weighter which is used to combine the forecasts of the FM and the ELF-Forecaster. We evaluate the performance of ELF in conjunction with several recent FMs across a suite of standard time series datasets. In all of our experiments we find that using ELF improves performance. This work demonstrates how efficient usage of online feedback can be used to improve FM forecasts.
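
An illustrative weighter in the spirit of the ELF-Weighter (the actual combination rule is not specified in the abstract): exponentially reweight the frozen FM and the lightweight forecaster by their recent errors, assuming scalar forecasts for simplicity.

```python
import numpy as np

class ForecastWeighter:
    """Online convex combination of two forecasters.

    Maintains multiplicative (Hedge-style) weights updated from each
    model's most recent squared error -- one plausible instantiation,
    not the paper's exact mechanism.
    """
    def __init__(self, eta=0.1):
        self.w = np.array([0.5, 0.5])
        self.eta = eta

    def combine(self, fm_pred, elf_pred):
        return self.w[0] * fm_pred + self.w[1] * elf_pred

    def update(self, fm_pred, elf_pred, actual):
        # Shift weight toward whichever model was more accurate.
        losses = np.array([(fm_pred - actual) ** 2,
                           (elf_pred - actual) ** 2])
        self.w *= np.exp(-self.eta * losses)
        self.w /= self.w.sum()
```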

[327] Koopman-Based Generalization of Deep Reinforcement Learning With Application to Wireless Communications

Atefeh Termehchi, Ekram Hossain, Isaac Woungang

Main category: cs.LG

TL;DR: The paper proposes a novel method to evaluate the generalizability of Deep Reinforcement Learning (DRL) using the Koopman operator and spectral analysis, addressing challenges of interpretability and non-i.i.d. data.

DetailsMotivation: DRL's limited interpretability and generalizability, especially with non-i.i.d. sequential data, hinder its broader application. Traditional methods fail to address these challenges.

Method: Model DRL state-action dynamics as stochastic functions, approximate them with the Koopman operator, and develop interpretable representations for generalizability analysis using spectral features and the H_∞ norm.

Result: A rigorous mathematical framework for DRL generalizability is developed and applied to compare soft actor-critic and proximal policy optimization in a UAV-assisted mmWave communication scenario.

Conclusion: The proposed method provides a systematic way to evaluate DRL generalizability, enhancing interpretability and applicability in complex scenarios like wireless communication.

Abstract: Deep Reinforcement Learning (DRL) is a key machine learning technology driving progress across various scientific and engineering fields, including wireless communication. However, its limited interpretability and generalizability remain major challenges. In supervised learning, generalizability is commonly evaluated through the generalization error using information-theoretic methods. In DRL, the training data is sequential and not independent and identically distributed (i.i.d.), rendering traditional information-theoretic methods unsuitable for generalizability analysis. To address this challenge, this paper proposes a novel analytical method for evaluating the generalizability of DRL. Specifically, we first model the evolution of states and actions in trained DRL algorithms as unknown discrete, stochastic, and nonlinear dynamical functions. Then, we employ a data-driven identification method, the Koopman operator, to approximate these functions, and propose two interpretable representations. Based on these interpretable representations, we develop a rigorous mathematical approach to evaluate the generalizability of DRL algorithms. This approach is formulated using the spectral feature analysis of the Koopman operator, leveraging the $H_\infty$ norm. Finally, we apply this generalization analysis to compare the soft actor-critic method, widely recognized as a robust DRL approach, against the proximal policy optimization algorithm for an unmanned aerial vehicle-assisted mmWave wireless communication scenario.
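
A sketch of the data-driven identification step via a least-squares (EDMD-style) Koopman fit; the paper's interpretable representations and $H_\infty$-based analysis are layered on top of spectra like these.

```python
import numpy as np

def fit_koopman(Phi, Phi_next):
    """Least-squares Koopman approximation with Phi_next ≈ K @ Phi.

    Phi, Phi_next: (d, T) matrices of observables evaluated on
    consecutive state-action pairs collected from rollouts of the
    trained policy. The eigenvalues of K are the spectral features
    that feed the generalizability analysis.
    """
    K = Phi_next @ np.linalg.pinv(Phi)
    return K, np.linalg.eigvals(K)
```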

[328] Local Mixtures of Experts: Essentially Free Test-Time Training via Model Merging

Ryo Bertolissi, Jonas Hübotter, Ido Hakimi, Andreas Krause

Main category: cs.LG

TL;DR: TTMM scales MoE models with more experts efficiently, approximating TTT but with much lower test-time cost.

DetailsMotivation: Current MoE models use few experts due to high training/inference costs; TTMM aims to scale this without overhead.

Method: Proposes TTMM, merging models at test-time to approximate TTT, avoiding its computational expense.

Result: TTMM improves with more experts, nearing TTT performance, and is 100x faster with a 1B parameter model.

Conclusion: TTMM is a cost-effective way to scale test-time training, balancing performance and efficiency.

Abstract: Mixture of expert (MoE) models are a promising approach to increasing model capacity without increasing inference cost, and are core components of many state-of-the-art language models. However, current MoE models typically use only few experts due to prohibitive training and inference cost. We propose Test-Time Model Merging (TTMM) which scales the MoE paradigm to an order of magnitude more experts and uses model merging to avoid almost any test-time overhead. We show that TTMM is an approximation of test-time training (TTT), which fine-tunes an expert model for each prediction task, i.e., prompt. TTT has recently been shown to significantly improve language models, but is computationally expensive. We find that performance of TTMM improves with more experts and approaches the performance of TTT. Moreover, we find that with a 1B parameter base model, TTMM is more than 100x faster than TTT at test-time by amortizing the cost of TTT at train-time. Thus, TTMM offers a promising cost-effective approach to scale test-time training.
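
The core mechanic, selecting a few experts per prompt and merging their parameters once so inference runs on a single merged model, can be sketched as below. This is a rough illustration with invented names (`test_time_merge`, cosine-similarity expert keys); the paper's expert construction and gating may differ.

```python
import numpy as np

def test_time_merge(base_params, expert_deltas, expert_keys, prompt_emb, top_k=4):
    """Merge the parameter deltas of the experts nearest to the prompt into
    the base model, once per prompt, so decoding incurs no per-token
    overhead. Sketch of the test-time merging idea only."""
    sims = expert_keys @ prompt_emb       # cosine similarity (inputs normalized)
    top = np.argsort(sims)[-top_k:]       # select the top-k experts
    w = np.exp(sims[top]); w /= w.sum()   # softmax weights over the selection
    return base_params + sum(wi * expert_deltas[i] for wi, i in zip(w, top))

# Toy usage with 8 experts over a 10-dim "parameter vector".
rng = np.random.default_rng(1)
base = rng.normal(size=10)
deltas = [rng.normal(scale=0.01, size=10) for _ in range(8)]
keys = rng.normal(size=(8, 16)); keys /= np.linalg.norm(keys, axis=1, keepdims=True)
prompt = rng.normal(size=16); prompt /= np.linalg.norm(prompt)
merged_params = test_time_merge(base, deltas, keys, prompt)
```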

[329] Adaptive State-Space Mamba for Real-Time Sensor Data Anomaly Detection

Alice Zhang, Chao Li

Main category: cs.LG

TL;DR: Proposes an Adaptive State-Space Mamba (ASSM) framework for real-time sensor data anomaly detection, outperforming baselines with adaptive gating for efficiency.

DetailsMotivation: State-space models excel in sequence analysis but are underexplored for real-time sensor anomaly detection. ASSM addresses this gap.

Method: Introduces an adaptive gating mechanism to dynamically update hidden states based on context and learned cues, ensuring computational efficiency.

Result: Superior detection performance on real-world and synthetic sensor datasets compared to existing baselines.

Conclusion: ASSM is effective for real-time anomaly detection and scalable for other time-series tasks requiring rapid detection.

Abstract: State-space modeling has emerged as a powerful paradigm for sequence analysis in various tasks such as natural language processing, time-series forecasting, and signal processing. In this work, we propose an \emph{Adaptive State-Space Mamba} (\textbf{ASSM}) framework for real-time sensor data anomaly detection. While state-space models have been previously employed for image processing applications (e.g., style transfer \cite{wang2024stylemamba}), our approach leverages the core idea of sequential hidden states to tackle a significantly different domain: detecting anomalies on streaming sensor data. In particular, we introduce an adaptive gating mechanism that dynamically modulates the hidden state update based on contextual and learned statistical cues. This design ensures that our model remains computationally efficient and scalable, even under rapid data arrival rates. Extensive experiments on real-world and synthetic sensor datasets demonstrate that our method achieves superior detection performance compared to existing baselines. Our approach is easily extensible to other time-series tasks that demand rapid and reliable detection capabilities.
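
The adaptive gating idea, a context-dependent gate interpolating between refreshing and retaining the hidden state, can be sketched in a few lines. Parameter names and the gate's inputs here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def assm_step(h, x, A, B, Wg, bg):
    """One gated state-space update: the gate g decides how strongly the
    hidden state is refreshed by the new observation x."""
    g = sigmoid(Wg @ np.concatenate([h, x]) + bg)  # context-dependent gate
    h_new = A @ h + B @ x                          # standard SSM update
    return g * h_new + (1.0 - g) * h               # adaptive interpolation

# Toy usage: 4-dim hidden state over a stream of 2-dim sensor readings.
rng = np.random.default_rng(2)
A, B = 0.9 * np.eye(4), rng.normal(scale=0.1, size=(4, 2))
Wg, bg = rng.normal(scale=0.1, size=(4, 6)), np.zeros(4)
h = np.zeros(4)
for x in rng.normal(size=(100, 2)):  # streaming sensor data
    h = assm_step(h, x, A, B, Wg, bg)
```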

[330] Outcome-based Reinforcement Learning to Predict the Future

Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

Main category: cs.LG

TL;DR: RLVR improves forecasting accuracy and calibration in event prediction using a compact 14B model, outperforming frontier models and yielding 10% ROI in simulations.

DetailsMotivation: To enhance RL's effectiveness in noisy, delayed real-world event forecasting by leveraging verifiable rewards.

Method: Applied RLVR to a novel dataset of prediction market questions and news headlines, using synthetic data augmentation, learning stability guardrails, and median prediction sampling.

Result: The model matched/surpassed frontier models’ accuracy, improved calibration, and achieved a 10% ROI in simulations.

Conclusion: RLVR is a promising approach for real-world event forecasting, offering practical benefits and outperforming larger models.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models’ reasoning in domains such as coding and mathematics. Here, we apply RLVR methods towards forecasting future real-world events - a challenging task for RL due to the very noisy (and delayed) outcomes involved. Using a novel dataset of recent questions from a prediction market, and accompanying relevant news headlines, we show that a compact (14B) reasoning model can be trained to match or surpass the predictive accuracy of frontier models like o1, while greatly improving probabilistic calibration. The model’s performance is also practically meaningful: in a Polymarket trading simulation, we estimate that its bets would have yielded a return on investment of over 10% across all questions in the test set. We detail and compare approaches used in training our model, including augmenting our training-data with synthetic prediction questions, guardrails for learning stability, and median prediction sampling at inference-time.
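
Of the techniques listed, median prediction sampling at inference time is simple enough to show directly: sample several probability forecasts per question and aggregate by the median, which is robust to outlier samples. A minimal sketch, with the Brier score as the calibration-sensitive metric:

```python
import numpy as np

def median_forecast(sampled_probs):
    """Aggregate several sampled probability forecasts for one question
    by taking their median (the inference-time trick named in the abstract)."""
    return float(np.median(sampled_probs))

def brier(p, outcome):
    """Brier score of a probabilistic forecast (lower is better)."""
    return (p - outcome) ** 2

# Toy usage: five samples from a reasoning model for one yes/no question.
samples = [0.55, 0.62, 0.40, 0.58, 0.70]
p = median_forecast(samples)   # 0.58, robust to the 0.40 outlier
print(brier(p, outcome=1))     # 0.1764
```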

[331] Uncertainty-Aware Graph Self-Training with Expectation-Maximization Regularization

Emily Wang, Michael Chen, Chao Li

Main category: cs.LG

TL;DR: Proposes an uncertainty-aware graph self-training method for semi-supervised node classification, using EM regularization to refine pseudo-labels and improve model robustness.

DetailsMotivation: To address limitations of conventional graph self-training pipelines that rely on fixed pseudo-labels and struggle with noisy graph structures and features.

Method: Introduces an EM regularization scheme for uncertainty-aware pseudo-label generation and iterative model retraining, focusing on reliable graph regions.

Result: Outperforms baselines by up to 2.5% in accuracy with lower performance variance across multiple runs.

Conclusion: The uncertainty-aware approach enhances semi-supervised node classification by effectively handling noise and improving reliability.

Abstract: In this paper, we propose a novel \emph{uncertainty-aware graph self-training} approach for semi-supervised node classification. Our method introduces an Expectation-Maximization (EM) regularization scheme to incorporate an uncertainty mechanism during pseudo-label generation and model retraining. Unlike conventional graph self-training pipelines that rely on fixed pseudo-labels, our approach iteratively refines label confidences with an EM-inspired uncertainty measure. This ensures that the predictive model focuses on reliable graph regions while gradually incorporating ambiguous nodes. Inspired by prior work on uncertainty-aware self-training techniques~\cite{wang2024uncertainty}, our framework is designed to handle noisy graph structures and feature spaces more effectively. Through extensive experiments on several benchmark graph datasets, we demonstrate that our method outperforms strong baselines by a margin of up to 2.5% in accuracy while maintaining lower variance in performance across multiple runs.
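
As a rough illustration of the E-step this kind of pipeline relies on, the sketch below produces soft pseudo-labels with per-node confidence weights that down-weight ambiguous nodes; the weights would then feed the M-step retraining as sample weights. The thresholds and the exact weighting rule are assumptions, not the paper's regularization.

```python
import numpy as np

def em_pseudo_labels(probs, tau=0.9, alpha=0.7):
    """E-step of an EM-flavoured self-training loop: pseudo-labels plus
    per-node confidence weights. `probs` holds model class probabilities for
    unlabeled nodes; nodes below the confidence threshold get low weight."""
    conf = probs.max(axis=1)                                  # predictive confidence
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)    # ambiguity measure
    weight = np.where(conf >= tau, 1.0, alpha * np.exp(-entropy))
    pseudo = probs.argmax(axis=1)
    return pseudo, weight  # use as sample weights in the M-step retraining

# Toy usage: 3 unlabeled nodes, 2 classes; the middle node gets low weight.
probs = np.array([[0.97, 0.03], [0.55, 0.45], [0.80, 0.20]])
labels, w = em_pseudo_labels(probs)
```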

[332] Curvature Dynamic Black-box Attack: revisiting adversarial robustness via dynamic curvature estimation

Peiran Sun

Main category: cs.LG

TL;DR: The paper proposes a query-efficient method, Dynamic Curvature Estimation (DCE), to estimate decision boundary curvature in black-box settings, linking it to adversarial robustness. It also introduces a new attack method, Curvature Dynamic Black-box Attack (CDBA).

DetailsMotivation: Existing curvature-based approaches focus on loss function or model parameters, not decision boundary curvature, which is harder to estimate. The paper aims to address this gap.

Method: Proposes DCE, a method to estimate decision boundary curvature using CGBA (a black-box attack). Also introduces CDBA, an improved attack leveraging dynamic curvature.

Result: Statistical connection found between decision boundary curvature and adversarial robustness. CDBA shows improved performance.

Conclusion: DCE effectively estimates decision boundary curvature, revealing its role in robustness. CDBA enhances attack efficiency.

Abstract: Adversarial attack reveals the vulnerability of deep learning models. For about a decade, countless attack and defense methods have been proposed, leading to robustified classifiers and better understanding of models. Among these methods, curvature-based approaches have attracted attention because it is assumed that high curvature may give rise to a rough decision boundary. However, the most commonly used \textit{curvature} is the curvature of the loss function, scores or other parameters from within the model as opposed to decision boundary curvature, since the former can be relatively easily formed using second order derivative. In this paper, we propose a new query-efficient method, dynamic curvature estimation (DCE), to estimate the decision boundary curvature in a black-box setting. Our approach is based on CGBA, a black-box adversarial attack. By performing DCE on a wide range of classifiers, we discovered, statistically, a connection between decision boundary curvature and adversarial robustness. We also propose a new attack method, curvature dynamic black-box attack (CDBA), with improved performance using the dynamically estimated curvature.

[333] Graph-Based Uncertainty-Aware Self-Training with Stochastic Node Labeling

Tom Liu, Anna Wu, Chao Li

Main category: cs.LG

TL;DR: Proposes GUST, a graph-based uncertainty-aware self-training framework for node classification, addressing over-confidence in pseudo-labels by integrating uncertainty and graph topology.

DetailsMotivation: Over-confidence in pseudo-labels is a challenge in self-training for semi-supervised learning.

Method: Uses Bayesian-inspired uncertainty estimation, EM-like pseudo-label generation, and iterative updates of node embeddings and adjacency transformations.

Result: Achieves state-of-the-art performance, especially with sparse labeled data.

Conclusion: GUST effectively addresses over-confidence and improves node classification in semi-supervised settings.

Abstract: Self-training has become a popular semi-supervised learning technique for leveraging unlabeled data. However, the over-confidence of pseudo-labels remains a key challenge. In this paper, we propose a novel \emph{graph-based uncertainty-aware self-training} (GUST) framework to combat over-confidence in node classification. Drawing inspiration from the uncertainty integration idea introduced by Wang \emph{et al.}~\cite{wang2024uncertainty}, our method largely diverges from previous self-training approaches by focusing on \emph{stochastic node labeling} grounded in the graph topology. Specifically, we deploy a Bayesian-inspired module to estimate node-level uncertainty, incorporate these estimates into the pseudo-label generation process via an expectation-maximization (EM)-like step, and iteratively update both node embeddings and adjacency-based transformations. Experimental results on several benchmark graph datasets demonstrate that our GUST framework achieves state-of-the-art performance, especially in settings where labeled data is extremely sparse.

[334] Unsupervised Learning: Comparative Analysis of Clustering Techniques on High-Dimensional Data

Vishnu Vardhan Baligodugula, Fathi Amsaad

Main category: cs.LG

TL;DR: A comparative analysis of K-means, DBSCAN, and Spectral Clustering on high-dimensional data, showing UMAP preprocessing improves clustering quality, with Spectral Clustering excelling on complex structures.

DetailsMotivation: To evaluate and compare clustering algorithms on high-dimensional datasets and determine the impact of dimensionality reduction techniques.

Method: Comparative analysis using PCA, t-SNE, and UMAP for dimensionality reduction, tested on MNIST, Fashion-MNIST, and UCI HAR datasets with quantitative metrics.

Result: UMAP preprocessing enhances clustering quality; Spectral Clustering performs best on complex structures, while K-means is efficient and DBSCAN handles irregular clusters.

Conclusion: Algorithm selection should consider data characteristics, with UMAP preprocessing recommended for improved clustering performance.

Abstract: This paper presents a comprehensive comparative analysis of prominent clustering algorithms K-means, DBSCAN, and Spectral Clustering on high-dimensional datasets. We introduce a novel evaluation framework that assesses clustering performance across multiple dimensionality reduction techniques (PCA, t-SNE, and UMAP) using diverse quantitative metrics. Experiments conducted on MNIST, Fashion-MNIST, and UCI HAR datasets reveal that preprocessing with UMAP consistently improves clustering quality across all algorithms, with Spectral Clustering demonstrating superior performance on complex manifold structures. Our findings show that algorithm selection should be guided by data characteristics, with K-means excelling in computational efficiency, DBSCAN in handling irregular clusters, and Spectral Clustering in capturing complex relationships. This research contributes a systematic approach for evaluating and selecting clustering techniques for high-dimensional data applications.
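
A self-contained version of the evaluation loop looks like the sketch below: reduce dimensionality, run each clustering algorithm, and score against ground truth. For brevity it uses PCA and the small scikit-learn digits dataset; the paper also evaluates t-SNE and UMAP (e.g., umap-learn's `UMAP` could replace `PCA` below) on larger benchmarks, and DBSCAN's `eps` is dataset-dependent.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)
X2 = PCA(n_components=16, random_state=0).fit_transform(X)  # dimensionality reduction

algorithms = {
    "kmeans": KMeans(n_clusters=10, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=8.0, min_samples=5),  # eps must be tuned per dataset
    "spectral": SpectralClustering(n_clusters=10, affinity="nearest_neighbors",
                                   random_state=0),
}
for name, algo in algorithms.items():
    labels = algo.fit_predict(X2)
    print(name, adjusted_rand_score(y, labels))  # external quality metric
```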

[335] Mitigating loss of variance in ensemble data assimilation: machine learning-based and distance-free localization

Vinicius L. S. Silva, Gabriel S. Seabra, Alexandre A. Emerick

Main category: cs.LG

TL;DR: Two machine learning-based methods for tabular data and distance-free localization improve covariance estimation in ensemble data assimilation, reducing variance loss and enhancing accuracy.

DetailsMotivation: To mitigate variance loss due to sampling errors in ensemble data assimilation and improve covariance estimation.

Method: Proposes two distance-free localization techniques using machine learning, integrated into the ES-MDA framework, and evaluates various ML models for suitability.

Result: Improved covariance accuracy, reduced variance loss, and better data assimilation results. Certain ML models outperform others in balancing accuracy and computational cost.

Conclusion: The study introduces practical, easy-to-implement methods that enhance ensemble-based data assimilation without requiring additional simulations or hyperparameter tuning.

Abstract: We propose two new methods, based on and inspired by machine learning for tabular data and distance-free localization, to enhance the covariance estimations in ensemble data assimilation. The main goal is to enhance the data assimilation results by mitigating loss of variance due to sampling errors. We also analyze the suitability of several machine learning models and the balance between accuracy and computational cost of the covariance estimations. We introduce two distance-free localization techniques leveraging machine learning methods specifically tailored for tabular data. The methods are integrated into the Ensemble Smoother with Multiple Data Assimilation (ES-MDA) framework. The results show that the proposed localizations improve covariance accuracy and enhance data assimilation and uncertainty quantification results. We observe reduced variance loss for the input variables using the proposed methods. Furthermore, we compare several machine learning models, assessing their suitability for the problem in terms of computational cost and quality of the covariance estimation and data match. The influence of ensemble size is also investigated, providing insights into balancing accuracy and computational efficiency. Our findings demonstrate that certain machine learning models are more suitable for this problem. This study introduces two novel methods that mitigate variance loss for model parameters in ensemble-based data assimilation, offering practical solutions that are easy to implement and do not require any additional numerical simulation or hyperparameter tuning.

[336] Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers

Jake Grigsby, Yuqi Xie, Justin Sasek, Steven Zheng, Yuke Zhu

Main category: cs.LG

TL;DR: A pipeline reconstructs first-person battle perspectives from third-person logs, enabling offline training of sequence models for Competitive Pokémon Singles. These models outperform heuristic search and LLM agents, achieving top 10% rankings online.

DetailsMotivation: To study adaptive policies in CPS using large datasets of real human battles, overcoming limitations of heuristic search and self-play.

Method: Develop a pipeline to convert spectator logs into first-person data, then train sequence models via imitation learning, offline RL, and self-play fine-tuning.

Result: Agents outperform heuristic search and LLM approaches, ranking in the top 10% of online players.

Conclusion: Offline training on real battle data yields competitive AI agents for CPS, with potential for broader applications.

Abstract: Competitive Pokémon Singles (CPS) is a popular strategy game where players learn to exploit their opponent based on imperfect information in battles that can last more than one hundred stochastic turns. AI research in CPS has been led by heuristic tree search and online self-play, but the game may also create a platform to study adaptive policies trained offline on large datasets. We develop a pipeline to reconstruct the first-person perspective of an agent from logs saved from the third-person perspective of a spectator, thereby unlocking a dataset of real human battles spanning more than a decade that grows larger every day. This dataset enables a black-box approach where we train large sequence models to adapt to their opponent based solely on their input trajectory while selecting moves without explicit search of any kind. We study a progression from imitation learning to offline RL and offline fine-tuning on self-play data in the hardcore competitive setting of Pokémon’s four oldest (and most partially observed) game generations. The resulting agents outperform a recent LLM Agent approach and a strong heuristic search engine. While playing anonymously in online battles against humans, our best agents climb to rankings inside the top 10% of active players. All agent checkpoints, training details, datasets, and baselines are available at https://metamon.tech.

[337] Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence

Adwait Datar, Nihat Ay

Main category: cs.LG

TL;DR: The paper analyzes gradient-based optimization of KL divergence in dual coordinate systems, comparing Euclidean GD and natural GD, showing NGD’s advantages in discrete time.

DetailsMotivation: To understand how parameterization impacts convergence in KL divergence minimization and compare the performance of gradient descent methods in different coordinate systems.

Method: Study Euclidean GD in θ and η coordinates versus coordinate-invariant NGD, analyzing convergence rates and robustness in continuous and discrete time.

Result: In continuous time, GD rates in θ and η coordinates bound NGD’s rate. NGD outperforms GD in discrete time with faster convergence and noise robustness.

Conclusion: NGD’s advantages are more pronounced in discrete time, making it superior for practical optimization despite its invariance in continuous time.

Abstract: The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry$-$ the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
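
A toy Bernoulli instance makes the comparison concrete. For target $q$ and model $p_\theta = \sigma(\theta)$, the gradient of $\mathrm{KL}(q \| p)$ is $p - q$ in the natural ($\theta$) coordinate and $(p - q)/(p(1-p))$ in the mixture ($\eta = p$) coordinate, and the Fisher information in $\theta$ is $p(1-p)$. The sketch below (illustrative, not from the paper) runs all three discrete-time updates; NGD is computed in $\theta$ coordinates here.

```python
import numpy as np

def kl_bernoulli(q, p):
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

sigmoid = lambda t: 1 / (1 + np.exp(-t))
q, lr, steps = 0.9, 0.1, 50

# GD in natural (theta) coordinates: dL/dtheta = p - q.
theta = 0.0
for _ in range(steps):
    p = sigmoid(theta); theta -= lr * (p - q)

# GD in mixture (eta = p) coordinates: dL/dp = (p - q) / (p (1 - p)).
p_eta = 0.5
for _ in range(steps):
    p_eta -= lr * (p_eta - q) / (p_eta * (1 - p_eta))

# NGD: precondition the theta-gradient by the inverse Fisher information
# p(1 - p); invariant to the affine reparameterizations discussed above.
theta_n = 0.0
for _ in range(steps):
    p_n = sigmoid(theta_n); theta_n -= lr * (p_n - q) / (p_n * (1 - p_n))

print(kl_bernoulli(q, sigmoid(theta)), kl_bernoulli(q, p_eta),
      kl_bernoulli(q, sigmoid(theta_n)))
```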

[338] Repetition Makes Perfect: Recurrent Graph Neural Networks Match Message Passing Limit

Eran Rosenbluth, Martin Grohe

Main category: cs.LG

TL;DR: Recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation can match the expressive power of the Color Refinement algorithm, unlike non-recurrent GNNs. They can also express all graph algorithms on connected graphs with random initialization.

DetailsMotivation: To precisely characterize the expressivity of recurrent GNNs and compare it with non-recurrent GNNs, especially in terms of their ability to match the Color Refinement algorithm's invariance.

Method: Analyze recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation, and compare their expressivity to non-recurrent GNNs. Introduce random initialization for connected graphs.

Result: Recurrent GNNs match the expressive power of the Color Refinement algorithm and can express all graph algorithms on connected graphs with random initialization, with polynomial overhead.

Conclusion: Recurrent GNNs are as expressive as the Color Refinement algorithm and can emulate any polynomial-time graph algorithm on connected graphs, highlighting their superior expressivity over non-recurrent GNNs.

Abstract: We precisely characterize the expressivity of computable Recurrent Graph Neural Networks (recurrent GNNs). We prove that recurrent GNNs with finite-precision parameters, sum aggregation, and ReLU activation, can compute any graph algorithm that respects the natural message-passing invariance induced by the Color Refinement (or Weisfeiler-Leman) algorithm. While it is well known that the expressive power of GNNs is limited by this invariance [Morris et al., AAAI 2019; Xu et al., ICLR 2019], we establish that recurrent GNNs can actually match this limit. This is in contrast to non-recurrent GNNs, which have the power of Weisfeiler-Leman only in a very weak, “non-uniform”, sense where each graph size requires a different GNN to compute with. Our construction introduces only a polynomial overhead in both time and space. Furthermore, we show that by incorporating random initialization, for connected graphs recurrent GNNs can express all graph algorithms. In particular, any polynomial-time graph algorithm can be emulated on connected graphs in polynomial time by a recurrent GNN with random initialization.
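
Color Refinement itself is short enough to show. The sketch below iteratively hashes each node's color together with the multiset of its neighbors' colors; two nodes end up with the same color exactly when no message-passing architecture can distinguish them, which is the invariance the paper proves recurrent GNNs can match.

```python
from collections import Counter

def color_refinement(adj, rounds=None):
    """1-WL / Color Refinement: refine node colors by the multiset of
    neighbor colors until the partition stabilizes."""
    n = len(adj)
    colors = [0] * n
    for _ in range(rounds or n):
        signatures = [
            (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in range(n)
        ]
        # Relabel signatures with small integers for the next round.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        new_colors = [palette[sig] for sig in signatures]
        if new_colors == colors:
            break
        colors = new_colors
    return colors

# A 6-cycle and two disjoint triangles get identical color histograms:
# a classic pair that Color Refinement (hence message passing) cannot split.
cycle6 = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
tri2 = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(Counter(color_refinement(cycle6)), Counter(color_refinement(tri2)))
```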

[339] Ownership Verification of DNN Models Using White-Box Adversarial Attacks with Specified Probability Manipulation

Teruki Sano, Minoru Kuribayashi, Masao Sakai, Shuji Isobe, Eisuke Koizumi

Main category: cs.LG

TL;DR: A novel framework for verifying ownership of DNN models in image classification using adversarial attacks, without needing the original model.

DetailsMotivation: To address the issue of unauthorized copying and use of DNN models in cloud environments, ensuring rightful ownership verification.

Method: Uses a white-box adversarial attack (iterative FGSM with control parameters) to manipulate output probabilities for verification.

Result: Experimental results confirm the framework’s effectiveness in identifying DNN models.

Conclusion: The proposed method successfully verifies model ownership in gray-box scenarios using adversarial attacks.

Abstract: In this paper, we propose a novel framework for ownership verification of deep neural network (DNN) models for image classification tasks. It allows verification of model identity by both the rightful owner and a third party without presenting the original model. We assume a gray-box scenario where an unauthorized user owns a model that is illegally copied from the original model and provides services in a cloud environment; the user submits images and receives the classification results as a probability distribution over output classes. The framework applies a white-box adversarial attack to align the output probability of a specific class to a designated value. Knowledge of the original model enables the owner to generate such adversarial examples. We propose a simple but effective adversarial attack method based on the iterative Fast Gradient Sign Method (FGSM) by introducing control parameters. Experimental results confirm the effectiveness of the identification of DNN models using adversarial attacks.
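
The verification signal, driving a chosen class's output probability toward a designated value with iterative FGSM, can be sketched as below. A linear softmax model stands in for the DNN so the gradient is analytic; with a real network one would use autograd instead. Function and parameter names are illustrative, and this is not the paper's exact attack with its control parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ifgsm_to_target_prob(x, W, b, cls, p_star, eps=0.01, iters=200):
    """Iterative FGSM that drives the probability of class `cls` toward the
    designated value p_star by descending L = (p_cls - p_star)^2."""
    x = x.copy()
    for _ in range(iters):
        p = softmax(W @ x + b)
        if abs(p[cls] - p_star) < 1e-3:
            break
        # dL/dz through the softmax, then back through the linear layer.
        dL_dz = 2 * (p[cls] - p_star) * p[cls] * ((np.arange(len(p)) == cls) - p)
        grad_x = W.T @ dL_dz
        x -= eps * np.sign(grad_x)  # signed-gradient step of size eps
    return x

rng = np.random.default_rng(3)
W, b = rng.normal(size=(5, 20)), np.zeros(5)
x_adv = ifgsm_to_target_prob(rng.normal(size=20), W, b, cls=2, p_star=0.31)
print(softmax(W @ x_adv + b)[2])  # close to the designated 0.31
```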

[340] MLMC-based Resource Adequacy Assessment with Active Learning Trained Surrogate Models

Ruiqi Zhang, Simon H. Tindemans

Main category: cs.LG

TL;DR: The paper introduces a speed metric for MLMC efficiency, incorporating training time, and proposes active learning to reduce labeling calls, improving variance reduction in power system reliability assessments.

DetailsMotivation: Pre-labeled datasets are often unavailable in resource adequacy assessments, and the time for labeling training data can offset efficiency gains from surrogate models in MLMC.

Method: A speed metric accounting for training time is introduced, and a vote-by-committee active learning approach is proposed to minimize labeling calls under limited time budgets.

Result: A case study shows that active learning combined with MLMC significantly reduces variance within a given computational budget.

Conclusion: Active learning enhances MLMC efficiency by reducing labeling efforts, making it more practical for large-scale power system reliability assessments.

Abstract: Multilevel Monte Carlo (MLMC) is a flexible and effective variance reduction technique for accelerating reliability assessments of complex power systems. Recently, data-driven surrogate models have been proposed as lower-level models in the MLMC framework due to their high correlation and negligible execution time once trained. However, in resource adequacy assessments, pre-labeled datasets are typically unavailable. For large-scale systems, the efficiency gains from surrogate models are often offset by the substantial time required for labeling training data. Therefore, this paper introduces a speed metric that accounts for training time in evaluating MLMC efficiency. Considering that the total time budget is limited, a vote-by-committee active learning approach is proposed to reduce the required labeling calls. A case study demonstrates that, within a given computational budget, active learning in combination with MLMC can result in a substantial reduction in variance.
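
The selection rule behind vote-by-committee active learning can be sketched generically: score each unlabeled sample by how much a committee of models disagrees (vote entropy) and spend the labeling budget on the most contested samples. The committee composition and scoring here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def committee_disagreement(committee, X_pool):
    """Rank unlabeled pool samples by committee vote entropy (query-by-
    committee): high entropy means the members disagree, so labeling that
    sample is most informative."""
    votes = np.stack([m.predict(X_pool) for m in committee])  # (members, n)
    scores = []
    for col in votes.T:
        _, counts = np.unique(col, return_counts=True)
        freq = counts / counts.sum()
        scores.append(-(freq * np.log(freq + 1e-12)).sum())  # vote entropy
    return np.argsort(scores)[::-1]  # indices, most disagreement first

# Toy usage: pick the 10 most informative pool points to label next.
rng = np.random.default_rng(4)
X_lab, y_lab = rng.normal(size=(60, 5)), rng.integers(0, 2, 60)
X_pool = rng.normal(size=(500, 5))
committee = [
    LogisticRegression().fit(X_lab, y_lab),
    RandomForestClassifier(n_estimators=50, random_state=0).fit(X_lab, y_lab),
    DecisionTreeClassifier(random_state=0).fit(X_lab, y_lab),
]
to_label = committee_disagreement(committee, X_pool)[:10]
```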

[341] Neural Networks as Universal Finite-State Machines: A Constructive ReLU Simulation Framework for NFAs

Sahil Rajesh Dhayalkar

Main category: cs.LG

TL;DR: A framework for simulating NFAs using ReLU neural networks, achieving exact recognition of regular languages with practical trainability.

DetailsMotivation: To bridge symbolic automata theory and neural networks, enabling precise and interpretable symbolic computation with modern architectures.

Method: Symbolically encodes NFA states as binary vectors and transitions as sparse linear transformations, using shared ReLU layers for nondeterministic branching.

Result: Proves exact recognition of regular languages by ReLU networks, validated by experiments showing perfect or near-perfect agreement.

Conclusion: Demonstrates that feedforward ReLU networks can perform interpretable and trainable symbolic computation, linking automata theory and neural architectures.

Abstract: We present a formal and constructive simulation framework for nondeterministic finite automata (NFAs) using standard feedforward ReLU neural networks. Unlike prior approaches that rely on recurrent architectures or post hoc extraction methods, our formulation symbolically encodes automaton states as binary vectors, transitions as sparse linear transformations, and nondeterministic branching - including $\epsilon$-closures - as compositions of shared ReLU layers. We prove that every regular language can be recognized exactly by a depth-unrolled ReLU network with shared parameters, independent of input length. Our construction yields not only formal equivalence between NFAs and ReLU networks, but also practical trainability: we demonstrate that the networks can learn NFA acceptance behavior through gradient descent using standard supervised data. Extensive experiments validate all theoretical results, achieving perfect or near-perfect agreement on acceptance, state propagation, and closure dynamics. This work establishes a new bridge between symbolic automata theory and modern neural architectures, showing that feedforward networks can perform precise, interpretable, and trainable symbolic computation.
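
The key encoding trick is easy to demonstrate: with active states as a 0/1 vector `s` and the transition relation as a 0/1 matrix `T`, the product `s @ T` counts a state's active predecessors, and the identity `min(1, z) = relu(z) - relu(z - 1)` (valid for non-negative integers z) clamps the count back to a binary state vector using only ReLU arithmetic. A minimal sketch, without the paper's epsilon-closure layers:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def nfa_relu_step(s, T):
    """One input symbol of NFA simulation in ReLU arithmetic. s is the
    binary vector of active states; T[i, j] = 1 iff state i -> state j on
    this symbol. relu(z) - relu(z - 1) clamps predecessor counts to {0, 1}."""
    z = s @ T
    return relu(z) - relu(z - 1.0)

# NFA over {a, b} accepting strings that end in "ab": states q0, q1, q2.
Ta = np.array([[1, 1, 0], [0, 0, 0], [0, 0, 0]], float)  # q0 -a-> {q0, q1}
Tb = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 0]], float)  # q0 -b-> q0; q1 -b-> q2
s = np.array([1.0, 0.0, 0.0])  # start in q0
for sym in "aab":
    s = nfa_relu_step(s, {"a": Ta, "b": Tb}[sym])
print(bool(s[2]))  # True: "aab" ends in "ab", so accepting state q2 is active
```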

[342] Trajectory First: A Curriculum for Discovering Diverse Policies

Cornelius V. Braun, Sayantan Auddy, Marc Toussaint

Main category: cs.LG

TL;DR: A curriculum-based approach improves diversity in RL by first exploring trajectories before learning step-based policies, addressing under-exploration in complex tasks like robotic manipulation.

DetailsMotivation: Diverse task-solving enhances robustness and avoids local optima, but current constrained-diversity RL methods under-explore in complex tasks, limiting policy diversity.

Method: Proposes a curriculum that starts with trajectory-level exploration before transitioning to step-based policy learning.

Result: Empirical evaluation shows the curriculum enhances skill diversity and highlights shortcomings of skill-based diversity optimization.

Conclusion: The curriculum effectively improves diversity in RL, particularly in complex tasks, by addressing exploration limitations.

Abstract: Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has emerged as a powerful reinforcement learning (RL) framework to train a diverse set of agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robotic manipulation, leading to a lack of policy diversity. To improve diversity optimization in RL, we therefore propose a curriculum that first explores at the trajectory level before learning step-based policies. In our empirical evaluation, we provide novel insights into the shortcomings of skill-based diversity optimization, and demonstrate empirically that our curriculum improves the diversity of the learned skills.

[343] Fully data-driven inverse hyperelasticity with hyper-network neural ODE fields

Vahidullah Taç, Amirhossein Amiri-Hezaveh, Manuel K. Rausch, Grace N. Bechtel, Francisco Sahli Costabal, Adrian Buganza Tepole

Main category: cs.LG

TL;DR: A neural network framework with Fourier features and NODEs identifies mechanical properties of heterogeneous materials without closed-form equations, validated by numerical examples.

DetailsMotivation: To address the challenge of identifying mechanical properties in heterogeneous materials without relying on closed-form constitutive equations.

Method: Uses neural networks with Fourier features for strain field approximation and NODEs for constitutive equation discovery, optimized via a multi-objective loss function.

Result: Demonstrates robustness in identifying mechanical properties under various conditions, including noise and experimental data.

Conclusion: Proposes a general and robust alternative to classical inverse methods for heterogeneous material analysis.

Abstract: We propose a new framework for identifying mechanical properties of heterogeneous materials without a closed-form constitutive equation. Given a full-field measurement of the displacement field, for instance as obtained from digital image correlation (DIC), a continuous approximation of the strain field is obtained by training a neural network that incorporates Fourier features to effectively capture sharp gradients in the data. A physics-based data-driven method built upon ordinary neural differential equations (NODEs) is employed to discover constitutive equations. The NODE framework can represent arbitrary materials while satisfying constraints in the theory of constitutive equations by default. To account for heterogeneity, a hyper-network is defined, where the input is the material coordinate system, and the output is the NODE-based constitutive equation. The parameters of the hyper-network are optimized by minimizing a multi-objective loss function that includes penalty terms for violations of the strong form of the equilibrium equations of elasticity and the associated Neumann boundary conditions. We showcase the framework with several numerical examples, including heterogeneity arising from variations in material parameters, spatial transitions from isotropy to anisotropy, material identification in the presence of noise, and, ultimately, application to experimental data. As the numerical results suggest, the proposed approach is robust and general in identifying the mechanical properties of heterogeneous materials with very few assumptions, making it a suitable alternative to classical inverse methods.
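
The Fourier-feature ingredient, embedding input coordinates with random sinusoids so an MLP can capture sharp gradients, is standard enough to sketch independently of the paper's NODE and hyper-network machinery. The projection scale below is an illustrative assumption controlling the bandwidth of the fit.

```python
import numpy as np

def fourier_features(x, B):
    """Random Fourier feature embedding gamma(x) = [cos(2*pi*Bx), sin(2*pi*Bx)],
    which lets a downstream MLP fit sharp strain gradients from DIC data."""
    proj = 2 * np.pi * x @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

# Toy usage: embed 2D material coordinates with 64 random frequencies.
rng = np.random.default_rng(6)
B = rng.normal(scale=10.0, size=(64, 2))  # scale sets expected sharpness
coords = rng.uniform(size=(1000, 2))      # DIC measurement grid
features = fourier_features(coords, B)    # (1000, 128) network input
```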

[344] The Effect of Stochasticity in Score-Based Diffusion Sampling: a KL Divergence Analysis

Bernardo P. Schaeffer, Ricardo M. S. Rosa, Glauco Valle

Main category: cs.LG

TL;DR: The paper analyzes the impact of stochasticity in score-based diffusion models on generation quality, using KL divergence bounds and examples.

DetailsMotivation: To understand how stochasticity affects sampling in diffusion models and its role in error correction or amplification.

Method: Theoretical analysis using KL divergence bounds, log-Sobolev inequalities, and numerical/analytical examples for linear forward SDEs with additive noise.

Result: Stochasticity can correct errors with exact scores but may amplify errors with approximate scores, depending on error structure.

Conclusion: Stochasticity’s effect varies; it can improve or degrade performance based on score accuracy and error characteristics.

Abstract: Sampling in score-based diffusion models can be performed by solving either a reverse-time stochastic differential equation (SDE) parameterized by an arbitrary time-dependent stochasticity parameter or a probability flow ODE, corresponding to the stochasticity parameter set to zero. In this work, we study the effect of this stochasticity on the generation process through bounds on the Kullback-Leibler (KL) divergence, complementing the analysis with numerical and analytical examples. Our main results apply to linear forward SDEs with additive noise and Lipschitz-continuous score functions, and quantify how errors from the prior distribution and score approximation propagate under different choices of the stochasticity parameter. The theoretical bounds are derived using log-Sobolev inequalities for the marginals of the forward process, which enable a more effective control of the KL divergence decay along sampling. For exact score functions, we find that stochasticity acts as an error-correcting mechanism, decreasing KL divergence along the sampling trajectory. For an approximate score function, there is a trade-off between error correction and score error amplification, so that stochasticity can either improve or worsen the performance, depending on the structure of the score error. Numerical experiments on simple datasets and a fully analytical example are included to illustrate and enlighten the theoretical results.
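
In the common formulation of score-based samplers, the family the abstract refers to can be written with a time-dependent stochasticity parameter $\lambda(t) \ge 0$; this is the standard form rather than a formula copied from the paper:

```latex
% Reverse-time sampler family for the forward SDE dx = f(x,t) dt + g(t) dW_t;
% \lambda = 1 recovers the usual reverse SDE, \lambda = 0 the probability flow ODE.
\mathrm{d}x = \Big[\, f(x,t) - \tfrac{1 + \lambda(t)^2}{2}\, g(t)^2\,
  \nabla_x \log p_t(x) \,\Big]\, \mathrm{d}t
  + \lambda(t)\, g(t)\, \mathrm{d}\bar{W}_t
```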

[345] Floating-Point Neural Networks Are Provably Robust Universal Approximators

Geonho Hwang, Wonyeol Lee, Yeachan Park, Sejun Park, Feras Saad

Main category: cs.LG

TL;DR: The paper introduces the first interval universal approximation (IUA) theorem for floating-point neural networks, proving their ability to perfectly capture the direct image map of any rounded target function, with no expressiveness limits.

DetailsMotivation: The motivation is to address the gap in classical UA and IUA theorems, which assume infinitely precise real numbers, by exploring their validity in the practical floating-point setting.

Method: The paper presents a theoretical proof for the IUA theorem in the floating-point setting, highlighting differences from the real-valued model.

Result: The result shows that floating-point neural networks can perfectly approximate the direct image map of any rounded target function, with surprising corollaries like provable robustness and computational completeness.

Conclusion: The conclusion is that floating-point neural networks exhibit no expressiveness limits and have significant implications, including robustness and computational completeness.

Abstract: The classical universal approximation (UA) theorem for neural networks establishes mild conditions under which a feedforward neural network can approximate a continuous function $f$ with arbitrary accuracy. A recent result shows that neural networks also enjoy a more general interval universal approximation (IUA) theorem, in the sense that the abstract interpretation semantics of the network using the interval domain can approximate the direct image map of $f$ (i.e., the result of applying $f$ to a set of inputs) with arbitrary accuracy. These theorems, however, rest on the unrealistic assumption that the neural network computes over infinitely precise real numbers, whereas their software implementations in practice compute over finite-precision floating-point numbers. An open question is whether the IUA theorem still holds in the floating-point setting. This paper introduces the first IUA theorem for floating-point neural networks that proves their remarkable ability to perfectly capture the direct image map of any rounded target function $f$, showing no limits exist on their expressiveness. Our IUA theorem in the floating-point setting exhibits material differences from the real-valued setting, which reflects the fundamental distinctions between these two computational models. This theorem also implies surprising corollaries, which include (i) the existence of provably robust floating-point neural networks; and (ii) the computational completeness of the class of straight-line programs that use only floating-point additions and multiplications for the class of all floating-point programs that halt.

[346] SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen

Main category: cs.LG

TL;DR: SmallThinker is a family of LLMs designed for local devices, addressing computational, memory, and storage constraints with innovative architecture and co-designed inference, achieving high performance on consumer CPUs.

DetailsMotivation: To enable LLM deployment on local devices with weak computational power, limited memory, and slow storage, challenging the reliance on GPU-powered cloud infrastructure.

Method: Introduces a deployment-aware architecture with a two-level sparse structure, pre-attention router for I/O efficiency, and NoPE-RoPE hybrid sparse attention for memory efficiency.

Result: SmallThinker models achieve state-of-the-art performance, outperforming larger LLMs, and run efficiently on consumer CPUs with minimal memory usage.

Conclusion: SmallThinker demonstrates that LLMs can be optimized for local devices without sacrificing performance, reducing dependency on cloud infrastructure.

Abstract: While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed - not adapted - for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for clouds, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, We introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to conquer the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we utilize NoPE-RoPE hybrid sparse attention mechanism to slash KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB and 8GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
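
The pre-attention router's purpose, choosing experts early enough that their weights can be fetched from slow storage while attention computes, can be sketched with a generic overlap pattern. Everything below (names, the threading model, the stand-in loader) is an illustrative assumption, not SmallThinker's inference engine.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def route_before_attention(h_pre, router_W, top_k=2):
    """Pick experts from the pre-attention hidden state so their weights can
    be prefetched from storage while the attention computation runs."""
    logits = h_pre @ router_W
    return np.argsort(logits)[-top_k:]

def load_expert(idx):
    time.sleep(0.01)            # stand-in for reading weights from slow storage
    return np.zeros((64, 64))   # hypothetical expert weight matrix

def attention(h):
    return h  # stand-in for the attention computation

rng = np.random.default_rng(7)
h_pre, router_W = rng.normal(size=64), rng.normal(size=(64, 32))
with ThreadPoolExecutor() as pool:
    experts = route_before_attention(h_pre, router_W)
    futures = [pool.submit(load_expert, i) for i in experts]  # prefetch I/O
    h_attn = attention(h_pre)            # compute overlaps with the I/O
    weights = [f.result() for f in futures]  # ready by the time FFN needs them
```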

[347] RocketStack: Level-aware deep recursive ensemble learning framework with adaptive feature fusion and model pruning dynamics

Çağatay Demirel

Main category: cs.LG

TL;DR: RocketStack introduces a level-aware recursive ensemble framework for deep stacking, addressing complexity and redundancy through pruning, noise addition, and feature compression, achieving higher accuracy and efficiency.

DetailsMotivation: To overcome the challenges of deep stacking, such as model complexity, feature redundancy, and computational burden, by enabling deeper recursive ensembling without excessive overhead.

Method: RocketStack uses incremental pruning of weaker learners at each level, adds mild Gaussian noise to OOF scores, and explores feature compression techniques (attention-based selection, SFE filter, autoencoders).

Result: Achieved 97.08% accuracy for binary and 98.60% for multi-class datasets at level 10, outperforming baselines by 5.14% and 6.11%, respectively, while reducing runtime and feature dimensionality.

Conclusion: Mild randomization and periodic compression are effective for deep recursive ensembling, enabling RocketStack to achieve high performance with tractable complexity.

Abstract: Ensemble learning remains a cornerstone of machine learning, with stacking used to integrate predictions from multiple base learners through a meta-model. However, deep stacking remains rare, as most designs prioritize horizontal diversity over recursive depth due to model complexity, feature redundancy, and computational burden. To address these challenges, RocketStack, a level-aware recursive ensemble framework, is introduced and explored up to ten stacking levels, extending beyond prior architectures. The framework incrementally prunes weaker learners at each level, enabling deeper stacking without excessive complexity. To mitigate early performance saturation, mild Gaussian noise is added to out-of-fold (OOF) scores before pruning, and compared against strict OOF pruning. Further, both per-level and periodic feature compressions are explored using attention-based selection, the Simple, Fast, Efficient (SFE) filter, and autoencoders. Across 33 datasets (23 binary, 10 multi-class), linear-trend tests confirmed rising accuracy with depth in most variants, and the top-performing meta-model at each level increasingly outperformed the strongest standalone ensemble. In the binary subset, periodic SFE with mild OOF-score randomization reached 97.08% at level 10, 5.14% above the strict-pruning configuration, and cut runtime by 10.5% relative to no compression. In the multi-class subset, periodic attention selection reached 98.60% at level 10, exceeding the strongest baseline by 6.11%, while reducing runtime by 56.1% and feature dimensionality by 74% compared to no compression. These findings highlight mild randomization as an effective regularizer and periodic compression as a stabilizer. Echoing the design of multistage rockets in aerospace (prune, compress, propel), RocketStack achieves deep recursive ensembling with tractable complexity.
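
One stacking level of this recipe (fit out-of-fold probabilities, add mild Gaussian noise to the scores, prune the weaker half, pass survivors' OOF features down a level) can be sketched as follows. Hyperparameters and the retained-fraction rule are illustrative, and the feature-compression stages are omitted.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def one_stacking_level(models, X, y, keep=0.5, noise=0.01, rng=None):
    """One RocketStack-style level: compute OOF probabilities per learner,
    perturb the OOF accuracy scores with mild Gaussian noise, prune the
    weaker learners, and return survivors plus their OOF meta-features."""
    rng = rng or np.random.default_rng(0)
    oofs, scores = [], []
    for m in models:
        oof = cross_val_predict(clone(m), X, y, cv=5, method="predict_proba")
        oofs.append(oof)
        scores.append((oof.argmax(axis=1) == y).mean())
    noisy = np.array(scores) + rng.normal(scale=noise, size=len(scores))
    survivors = np.argsort(noisy)[-max(1, int(len(models) * keep)):]
    next_X = np.hstack([oofs[i] for i in survivors])
    return [models[i] for i in survivors], next_X

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(random_state=0),
          GaussianNB(), KNeighborsClassifier()]
level_X = X
for level in range(3):  # the paper explores up to ten levels
    models, feats = one_stacking_level(models, level_X, y)
    level_X = np.hstack([X, feats])  # augment inputs with OOF meta-features
```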

[348] High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data

Patrick Alan Johnson, Gabriel Tseng, Yawen Zhang, Heather Heward, Virginia Sjahli, Favyen Bastani, Joseph Redmon, Patrick Beukema

Main category: cs.LG

TL;DR: The paper explores using a pretrained multimodal earth-observation model to create large-scale LFMC maps, improving accuracy by 20% over previous methods.

DetailsMotivation: Wildfires are intensifying, and LFMC is a key risk factor, but ground-based sampling is costly and sparse. AI and satellite data offer a scalable solution.

Method: Uses a pretrained multimodal earth-observation model to generate spatially complete LFMC maps, with an automated pipeline for rapid updates.

Result: Achieves a 20% reduction in RMSE compared to randomly initialized models, demonstrated in wildfire-impacted regions.

Conclusion: The approach provides a scalable, accurate solution for LFMC monitoring, aiding wildfire research and response.

Abstract: Wildfires are increasing in intensity and severity at an alarming rate. Recent advances in AI and publicly available satellite data enable monitoring critical wildfire risk factors globally, at high resolution and low latency. Live Fuel Moisture Content (LFMC) is a critical wildfire risk factor and is valuable for both wildfire research and operational response. However, ground-based LFMC samples are both labor intensive and costly to acquire, resulting in sparse and infrequent updates. In this work, we explore the use of a pretrained, highly-multimodal earth-observation model for generating large-scale spatially complete (wall-to-wall) LFMC maps. Our approach achieves significant improvements over previous methods using randomly initialized models (a 20% reduction in RMSE). We provide an automated pipeline that enables rapid generation of these LFMC maps across the United States, and demonstrate its effectiveness in two regions recently impacted by wildfire (Eaton and Palisades).

[349] A case for data valuation transparency via DValCards

Keziah Naggita, Julienne LaChance

Main category: cs.LG

TL;DR: The paper highlights biases and instability in data valuation methods, showing how pre-processing, subsampling, and undervaluation of underrepresented groups can impact fairness. It proposes the DValCards framework for transparency.

DetailsMotivation: To address the biases and instability in data valuation methods, which can lead to unfair compensation and misuse in data markets and ML systems.

Method: Analyzed 9 tabular classification datasets and 6 data valuation methods to demonstrate biases under algorithmic choices.

Result: Found that pre-processing alters data values, subsampling increases class imbalance, and underrepresented groups are undervalued.

Conclusion: Advocates for transparency via DValCards to mitigate misuse and build trust in responsible ML systems.

Abstract: Following the rise in popularity of data-centric machine learning (ML), various data valuation methods have been proposed to quantify the contribution of each datapoint to desired ML model performance metrics (e.g., accuracy). Beyond the technical applications of data valuation methods (e.g., data cleaning, data acquisition, etc.), it has been suggested that within the context of data markets, data buyers might utilize such methods to fairly compensate data owners. Here we demonstrate that data valuation metrics are inherently biased and unstable under simple algorithmic design choices, resulting in both technical and ethical implications. By analyzing 9 tabular classification datasets and 6 data valuation methods, we illustrate how (1) common and inexpensive data pre-processing techniques can drastically alter estimated data values; (2) subsampling via data valuation metrics may increase class imbalance; and (3) data valuation metrics may undervalue underrepresented group data. Consequently, we argue in favor of increased transparency associated with data valuation in-the-wild and introduce the novel Data Valuation Cards (DValCards) framework towards this aim. The proliferation of DValCards will reduce misuse of data valuation metrics, including in data pricing, and build trust in responsible ML systems.
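
The paper's first point, that cheap pre-processing can reorder estimated data values, can be reproduced with the simplest valuation metric: leave-one-out (LOO) value, the drop in test accuracy when a training point is removed. The sketch below is illustrative only; the paper studies six valuation methods, of which Shapley-style metrics generalize LOO.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def loo_values(X_tr, y_tr, X_te, y_te):
    """Leave-one-out data valuation: value(i) = accuracy drop when training
    point i is removed from the training set."""
    base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    vals = []
    for i in range(len(X_tr)):
        mask = np.arange(len(X_tr)) != i
        acc = LogisticRegression(max_iter=1000).fit(
            X_tr[mask], y_tr[mask]).score(X_te, y_te)
        vals.append(base - acc)
    return np.array(vals)

# Compare values before and after a routine scaling step.
rng = np.random.default_rng(9)
X = rng.normal(size=(80, 4)) * np.array([1, 100, 1, 100])  # mixed feature scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:60], X[60:], y[:60], y[60:]
raw_vals = loo_values(X_tr, y_tr, X_te, y_te)
scaler = StandardScaler().fit(X_tr)
scaled_vals = loo_values(scaler.transform(X_tr), y_tr,
                         scaler.transform(X_te), y_te)
# The rankings of raw_vals and scaled_vals generally differ.
```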

Thanh Hoang-Minh

Main category: cs.LG

TL;DR: The paper compares rule-based and deep learning methods for link prediction in knowledge graphs, introducing GCAT, an improved model over KBGAT, which outperforms traditional and neural methods.

DetailsMotivation: To systematically compare traditional rule-based and modern deep learning approaches for link prediction in knowledge graphs, and to improve upon existing neural models like KBGAT.

Method: Introduces GCAT, a refined graph neural network model using multi-head attention for better context aggregation and interaction between heterogeneous nodes.

Result: GCAT outperforms rule-based methods and achieves competitive or superior performance to existing neural models on four benchmark datasets.

Conclusion: Attention-based architectures like GCAT are effective for capturing complex relational patterns in knowledge graph completion tasks.

Abstract: Knowledge graphs offer a structured representation of real-world entities and their relationships, enabling a wide range of applications from information retrieval to automated reasoning. In this paper, we conduct a systematic comparison between traditional rule-based approaches and modern deep learning methods for link prediction. We focus on KBGAT, a graph neural network model that leverages multi-head attention to jointly encode both entity and relation features within local neighborhood structures. To advance this line of research, we introduce \textbf{GCAT} (Graph Collaborative Attention Network), a refined model that enhances context aggregation and interaction between heterogeneous nodes. Experimental results on four widely-used benchmark datasets demonstrate that GCAT not only consistently outperforms rule-based methods but also achieves competitive or superior performance compared to existing neural embedding models. Our findings highlight the advantages of attention-based architectures in capturing complex relational patterns for knowledge graph completion tasks.

[351] Compression Method for Deep Diagonal State Space Model Based on $H^2$ Optimal Reduction

Hiroki Sakamoto, Kazuhiro Sato

Main category: cs.LG

TL;DR: Proposes an efficient parameter reduction method for linear SSMs in deep learning, achieving 1/32 compression without performance loss.

DetailsMotivation: Address the challenge of deploying large-parameter linear SSMs on resource-constrained devices.

Method: Applies $H^{2}$ model order reduction techniques from control theory to linear SSM components.

Result: Outperforms Balanced Truncation, reducing parameters to 1/32 without performance loss.

Conclusion: The method effectively compresses SSMs while maintaining performance, aiding deployment on constrained devices.

Abstract: Deep learning models incorporating linear SSMs have gained attention for capturing long-range dependencies in sequential data. However, their large parameter sizes pose challenges for deployment on resource-constrained devices. In this study, we propose an efficient parameter reduction method for these models by applying $H^{2}$ model order reduction techniques from control theory to their linear SSM components. In experiments, the LRA benchmark results show that the model compression based on our proposed method outperforms an existing method using Balanced Truncation, while successfully reducing the number of parameters in the SSMs to $1/32$ without sacrificing the performance of the original models.
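
For intuition, a diagonal SSM $\dot{x} = \mathrm{diag}(a)x + bu$, $y = c^{\top}x$ decomposes into independent first-order modes, and each real stable mode contributes $|c_i b_i|^2 / (-2\,\mathrm{Re}(a_i))$ to the squared $H^2$ norm. The sketch below prunes modes by that per-mode energy; this is a cheap heuristic stand-in, whereas the paper solves the $H^2$-optimal reduction problem, which also accounts for interactions between modes.

```python
import numpy as np

def prune_diagonal_ssm(a, b, c, keep):
    """Rank the modes of a stable diagonal SSM by their individual H2
    energy |c_i b_i|^2 / (-2 Re(a_i)) and keep the strongest `keep` modes."""
    energy = np.abs(c * b) ** 2 / (-2.0 * a.real)
    idx = np.argsort(energy)[-keep:]
    return a[idx], b[idx], c[idx]

# Toy usage: compress a 64-mode diagonal SSM down to 8 modes.
rng = np.random.default_rng(5)
a = -np.abs(rng.normal(1.0, 0.5, size=64))   # stable real poles
b, c = rng.normal(size=64), rng.normal(size=64)
a_r, b_r, c_r = prune_diagonal_ssm(a, b, c, keep=8)
```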

[352] Emergence of Quantised Representations Isolated to Anisotropic Functions

George Bird

Main category: cs.LG

TL;DR: A novel method for studying representational alignment in autoencoders reveals that activation function symmetries can induce discrete or continuous representations, impacting interpretability and reconstruction error.

DetailsMotivation: To understand how discrete representations emerge in autoencoders and whether function-driven symmetries act as implicit inductive biases.

Method: A controlled ablation study altering only the activation function, using the Spotlight Resonance method to analyze representational alignment.

Result: Discrete activation functions lead to discretized representations, while continuous ones maintain continuity, confirming symmetry-induced biases. Quantization correlates with increased reconstruction error.

Conclusion: Algebraic symmetries in network primitives can unintentionally bias representations, influencing interpretability. This tool aids emergent interpretability research and suggests reassessing common functional forms.

Abstract: This paper presents a novel methodology for determining representational alignment, which builds upon the existing Spotlight Resonance method. Particularly, this new tool is used to gain insight into how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study in which only the activation function is altered. Using this technique, the validity of whether function-driven symmetries can act as implicit inductive biases on representations is determined. Representations are found to tend to discretise when the activation functions are defined through a discrete algebraic permutation-equivariant symmetry. In contrast, they remain continuous under a continuous algebraic orthogonal-equivariant definition. This confirms the hypothesis: algebraic symmetries of network primitives can carry unintended inductive biases which produce task-independent artefactual structures in representations. The discrete symmetry of contemporary forms is shown to be a strong predictor for the induction of discrete representations transformed from otherwise continuous structures – a quantisation effect. This motivates further reassessment of functional forms in common usage. Moreover, this supports a general causal model for one mode in which discrete representations may form, and could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and possibly Superposition. Hence, this tool and proposed mechanism for the influence of functional form on representations may provide insights into emergent interpretability research. Finally, preliminary results indicate that quantisation of representations appears to correlate with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.

[353] Provable Low-Frequency Bias of In-Context Learning of Representations

Yongyi Yang, Hidenori Tanaka, Wei Hu

Main category: cs.LG

TL;DR: The paper explains how in-context learning (ICL) in large language models (LLMs) works through a double convergence framework, leading to smooth representations and robustness to noise.

DetailsMotivation: To uncover the mechanisms behind ICL in LLMs, which surpass pretraining by internalizing data-generating processes, but lack theoretical understanding.

Method: Introduces a unified framework of double convergence (convergence over context and across layers) and analyzes its implicit bias towards smooth representations.

Result: The theory explains empirical observations like structured but locally distorted geometry and decaying energy, and predicts robustness to high-frequency noise, confirmed empirically.

Conclusion: Provides insights into ICL mechanisms and a theoretical foundation for further study, applicable to broader data distributions and settings.

Abstract: In-context learning (ICL) enables large language models (LLMs) to acquire new behaviors from the input sequence alone without any parameter updates. Recent studies have shown that ICL can surpass the original meaning learned in the pretraining stage by internalizing the structure of the data-generating process (DGP) of the prompt into the hidden representations. However, the mechanism by which LLMs achieve this ability remains open. In this paper, we present the first rigorous explanation of this phenomenon by introducing a unified framework of double convergence, where hidden representations converge both over context and across layers. This double convergence process leads to an implicit bias towards smooth (low-frequency) representations, which we prove analytically and verify empirically. Our theory explains several open empirical observations, including why learned representations exhibit globally structured but locally distorted geometry, and why their total energy decays without vanishing. Moreover, our theory predicts that ICL has an intrinsic robustness towards high-frequency noise, which we empirically confirm. These results provide new insights into the underlying mechanisms of ICL, and a theoretical foundation to study it that hopefully extends to more general data distributions and settings.

[354] Rethinking Individual Fairness in Deepfake Detection

Aryana Hou, Li Lin, Justin Li, Shu Hu

Main category: cs.LG

TL;DR: The paper addresses fairness in deepfake detection, focusing on individual fairness, and proposes a framework to improve it while maintaining detection performance.

DetailsMotivation: Deepfake detection lacks fairness, especially individual fairness, which is unexplored and critical for unbiased predictions.

Method: The authors propose a generalizable framework to enhance individual fairness in existing deepfake detectors.

Result: Experiments show the framework improves individual fairness and maintains robust detection, outperforming state-of-the-art methods.

Conclusion: The work fills a critical gap in deepfake detection fairness, offering a practical solution with demonstrated effectiveness.

Abstract: Generative AI models have substantially improved the realism of synthetic media, yet their misuse through sophisticated DeepFakes poses significant risks. Despite recent advances in deepfake detection, fairness remains inadequately addressed, enabling deepfake makers to exploit biases against specific populations. While previous studies have emphasized group-level fairness, individual fairness (i.e., ensuring similar predictions for similar individuals) remains largely unexplored. In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. To mitigate this, we propose the first generalizable framework that can be integrated into existing deepfake detectors to enhance individual fairness and generalization. Extensive experiments conducted on leading deepfake datasets demonstrate that our approach significantly improves individual fairness while maintaining robust detection performance, outperforming state-of-the-art methods. The code is available at https://github.com/Purdue-M2/Individual-Fairness-Deepfake-Detection.
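
For context, the classical individual-fairness principle the paper revisits is the Lipschitz condition of Dwork et al.: similar inputs should receive similar predictions. The sketch below only illustrates how a violation rate of that original criterion could be measured; it is not the paper's proposed framework, and all names and data are stand-ins.

```python
# Sketch of the classical individual-fairness check: flag pairs whose
# prediction gap exceeds a Lipschitz constant times their distance under a
# similarity metric. All data below are random stand-ins.
import numpy as np

def fairness_violation_rate(scores, feats, lipschitz=1.0):
    n, bad, total = len(scores), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if abs(scores[i] - scores[j]) > lipschitz * np.linalg.norm(feats[i] - feats[j]):
                bad += 1
    return bad / max(total, 1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 16))   # stand-in face embeddings
scores = rng.uniform(size=50)           # stand-in detector outputs in [0, 1]
print(fairness_violation_rate(scores, feats))
```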

[355] Distributional Unlearning: Forgetting Distributions, Not Just Samples

Youssef Allouah, Rachid Guerraoui, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper introduces distributional unlearning, a framework to remove entire sub-populations from trained models efficiently, ensuring minimal residual signal and negligible impact on retained performance.

DetailsMotivation: Existing unlearning tools focus on individual samples, leaving residual signals for downstream recovery. The need arises to delete entire domains (e.g., for GDPR compliance or copyright issues) without compromising retained data quality.

Method: The proposed framework uses Kullback-Leibler divergence to quantify removal and preservation, deriving the Pareto frontier for Gaussian cases. A distance-based selection rule reduces deletion budget quadratically compared to random removal.

Result: Experiments show 15-72% fewer deletions than random removal, with negligible impact on retained performance across synthetic Gaussians, Jigsaw Toxic Comments, SMS spam, and CIFAR-10 datasets.

Conclusion: Distributional unlearning provides a data-centric, model-agnostic solution for efficient domain-level unlearning, outperforming sample-oriented methods in deletion efficiency and retained performance.

Abstract: Machine unlearning seeks to remove unwanted information from trained models, initially at the individual-sample level, but increasingly at the level of entire sub-populations. In many deployments, models must delete whole topical domains to satisfy privacy, legal, or quality requirements, e.g., removing several users’ posts under GDPR or copyrighted web content. Existing unlearning tools remain largely sample-oriented, and straightforward point deletion often leaves enough residual signal for downstream learners to recover the unwanted domain. We introduce distributional unlearning, a data-centric, model-agnostic framework that asks: Given examples from an unwanted distribution and a retained distribution, what is the smallest set of points whose removal makes the edited dataset far from the unwanted domain yet close to the retained one? Using Kullback-Leibler divergence to quantify removal and preservation, we derive the exact Pareto frontier in the Gaussian case and prove that any model retrained on the edited data incurs log-loss shifts bounded by the divergence thresholds. We propose a simple distance-based selection rule satisfying these constraints with a quadratic reduction in deletion budget compared to random removal. Experiments on synthetic Gaussians, Jigsaw Toxic Comments, SMS spam, and CIFAR-10 show 15-72% fewer deletions than random, with negligible impact on retained performance.
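
The following is a hedged sketch of what a distance-based selection rule of this kind could look like in the Gaussian setting: score each point by a log-likelihood ratio between the unwanted and retained distributions and delete the most unwanted-like points first. The paper's exact rule and its KL thresholds may differ.

```python
# Hedged sketch of a distance-based deletion rule: fit Gaussians to the
# unwanted and retained samples, score every point by a log-likelihood ratio,
# and delete the most unwanted-like points within a fixed budget.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
unwanted = rng.normal(loc=+2.0, size=(200, 2))
retained = rng.normal(loc=-2.0, size=(800, 2))
data = np.vstack([unwanted, retained])

p_unw = multivariate_normal(mean=unwanted.mean(0), cov=np.cov(unwanted.T))
p_ret = multivariate_normal(mean=retained.mean(0), cov=np.cov(retained.T))

score = p_unw.logpdf(data) - p_ret.logpdf(data)  # high = looks like the unwanted domain
budget = 150
edited = np.delete(data, np.argsort(score)[-budget:], axis=0)
print(edited.shape)  # (850, 2): the edited dataset handed to retraining
```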

[356] TempRe: Template generation for single and direct multi-step retrosynthesis

Nguyen Xuan-Vu, Daniel P Armstrong, Zlatko Jončev, Philippe Schwaller

Main category: cs.LG

TL;DR: TempRe is a generative framework for retrosynthesis planning, combining the scalability of template-based methods with the flexibility of generative approaches, outperforming existing methods.

DetailsMotivation: Traditional retrosynthesis methods face scalability and generalization issues, while template-free approaches risk invalid reactions. TempRe aims to address these limitations.

Method: TempRe reformulates template-based retrosynthesis as sequence generation, enabling scalable and chemically plausible planning. It is evaluated on single-step and multi-step tasks.

Result: TempRe outperforms template classification and SMILES-based methods, achieving strong accuracy on multi-step benchmarks and enabling efficient multi-step synthesis route generation.

Conclusion: TempRe demonstrates the potential of template generative modeling as a powerful tool for computer-aided synthesis planning.

Abstract: Retrosynthesis planning remains a central challenge in molecular discovery due to the vast and complex chemical reaction space. While traditional template-based methods offer tractability, they suffer from poor scalability and limited generalization, and template-free generative approaches risk generating invalid reactions. In this work, we propose TempRe, a generative framework that reformulates template-based approaches as sequence generation, enabling scalable, flexible, and chemically plausible retrosynthesis. We evaluated TempRe across single-step and multi-step retrosynthesis tasks, demonstrating its superiority over both template classification and SMILES-based generation methods. On the PaRoutes multi-step benchmark, TempRe achieves strong top-k route accuracy. Furthermore, we extend TempRe to direct multi-step synthesis route generation, providing a lightweight and efficient alternative to conventional single-step and search-based approaches. These results highlight the potential of template generative modeling as a powerful paradigm in computer-aided synthesis planning.
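
For readers unfamiliar with reaction templates: a template is a SMARTS rewrite rule that maps a product onto candidate reactants, as in the RDKit sketch below. TempRe generates such templates as token sequences instead of classifying over a fixed library; the ester-hydrolysis template here is a generic illustration, not one from the paper.

```python
# Illustrative retro-template (ester hydrolysis) applied with RDKit; not a
# template from the paper. The SMARTS maps a product to candidate reactants.
from rdkit import Chem
from rdkit.Chem import AllChem

retro = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])[O:3][C:4]>>[C:1](=[O:2])[OH].[OH][C:4]'
)
product = Chem.MolFromSmiles('CC(=O)OCC')  # ethyl acetate
for reactants in retro.RunReactants((product,)):
    print('.'.join(Chem.MolToSmiles(m) for m in reactants))
# e.g. CC(=O)O.CCO  (acetic acid + ethanol)
```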

cs.MA

[357] From Cloud-Native to Trust-Native: A Protocol for Verifiable Multi-Agent Systems

Muyang Li

Main category: cs.MA

TL;DR: TrustTrack is a protocol embedding verifiable identity, policy commitments, and tamper-resistant logs into LLM-powered autonomous agents to ensure trust-native autonomy.

DetailsMotivation: The proliferation of LLM-powered agents in high-stakes domains necessitates verifiability and trust, moving beyond just intelligence.

Method: Introduces TrustTrack, a protocol with structural guarantees (identity, policies, logs) integrated into agent infrastructure.

Result: Enables trust-native autonomy, treating compliance as a design constraint, with applications in pharmaceuticals, legal workflows, and AI collaboration.

Conclusion: TrustTrack represents the next architectural layer for autonomous systems, shifting focus from AI to trust.

Abstract: As autonomous agents powered by large language models (LLMs) proliferate in high-stakes domains – from pharmaceuticals to legal workflows – the challenge is no longer just intelligence, but verifiability. We introduce TrustTrack, a protocol that embeds structural guarantees – verifiable identity, policy commitments, and tamper-resistant behavioral logs – directly into agent infrastructure. This enables a new systems paradigm: trust-native autonomy. By treating compliance as a design constraint rather than post-hoc oversight, TrustTrack reframes how intelligent agents operate across organizations and jurisdictions. We present the protocol design, system requirements, and use cases in regulated domains such as pharmaceutical R&D, legal automation, and AI-native collaboration. We argue that the Cloud -> AI -> Agent -> Trust transition represents the next architectural layer for autonomous systems.
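
One ingredient the abstract names, tamper-resistant behavioral logs, is commonly built as a hash chain, where each entry commits to its predecessor so retroactive edits are detectable. The sketch below shows that standard construction only; TrustTrack's actual protocol details are not reproduced here.

```python
# Standard tamper-evident (hash-chained) log: each entry commits to the
# previous entry's hash, so edits anywhere break verification downstream.
import hashlib, json, time

def append_entry(log, agent_id, action):
    body = {'agent': agent_id, 'action': action, 'ts': time.time(),
            'prev': log[-1]['hash'] if log else '0' * 64}
    body['hash'] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)

def verify(log):
    prev = '0' * 64
    for e in log:
        body = {k: v for k, v in e.items() if k != 'hash'}
        ok = e['prev'] == prev and e['hash'] == hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if not ok:
            return False
        prev = e['hash']
    return True

log = []
append_entry(log, 'agent-1', 'submitted filing')
append_entry(log, 'agent-1', 'signed document')
print(verify(log))             # True
log[0]['action'] = 'redacted'  # retroactive tampering
print(verify(log))             # False
```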

[358] Successor Features for Transfer in Alternating Markov Games

Sunny Amatya, Yi Ren, Zhe Xu, Wenlong Zhang

Main category: cs.MA

TL;DR: The paper introduces successor features and the GGPI algorithm for knowledge transfer in games, showing improved performance over baselines.

DetailsMotivation: Address the limitations of value and equilibrium transfers in games by leveraging successor features for better knowledge transfer.

Method: Proposes the Game Generalized Policy Improvement (GGPI) algorithm for transferring learning values and policies in Markov games.

Result: GGPI achieves high-reward interactions, one-shot policy transfer, and higher success rates with improved path efficiency.

Conclusion: Successor features and GGPI effectively enable knowledge transfer in games, outperforming traditional methods.

Abstract: This paper explores successor features for knowledge transfer in zero-sum, complete-information, and turn-based games. Prior research in single-agent systems has shown that successor features can provide a "jump start" for agents when facing new tasks with varying reward structures. However, knowledge transfer in games typically relies on value and equilibrium transfers, which heavily depend on the similarity between tasks. This reliance can lead to failures when the tasks differ significantly. To address this issue, this paper presents an application of successor features to games and introduces a novel algorithm called Game Generalized Policy Improvement (GGPI), designed to address Markov games in multi-agent reinforcement learning. The proposed algorithm enables the transfer of learning values and policies across games. An upper bound on the transfer error is derived as a function of the similarity between tasks. Through experiments with a turn-based pursuer-evader game, we demonstrate that the GGPI algorithm can generate high-reward interactions and one-shot policy transfer. When further tested in a wider set of initial conditions, the GGPI algorithm achieves higher success rates with improved path efficiency compared to those of the baseline algorithms.
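
For background, the successor-feature decomposition assumes rewards factor as $r(s,a) = \phi(s,a) \cdot w$, so that $Q^{\pi}(s,a) = \psi^{\pi}(s,a) \cdot w$ and stored policies can be re-evaluated on a new task instantly. The sketch below shows plain generalized policy improvement over a library of successor features; GGPI's extension to alternating Markov games is the paper's contribution and is not shown.

```python
# Generalized policy improvement over a library of successor features, with
# illustrative shapes. psi_i(s, a) are the successor features of a previously
# learned policy pi_i; w_new encodes the new task's reward.
import numpy as np

n_states, n_actions, d = 10, 4, 3
rng = np.random.default_rng(0)
psi_library = [rng.standard_normal((n_states, n_actions, d)) for _ in range(5)]
w_new = rng.standard_normal(d)

def gpi_action(state):
    # Q_i(s, a) = psi_i(s, a) . w_new, then act greedily on max_i Q_i(s, a).
    q = np.stack([psi[state] @ w_new for psi in psi_library])  # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

print(gpi_action(0))
```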

[359] Physics-Informed EvolveGCN: Satellite Prediction for Multi Agent Systems

Timothy Jacob Huber, Madhur Tiwari, Camilo A. Riano-Rios

Main category: cs.MA

TL;DR: A novel method using EvolveGCN for dynamic graph-based prediction of agent positions in multi-agent systems, enhanced by physics-constrained loss functions.

DetailsMotivation: To improve prediction accuracy and reliability in multi-agent systems by modeling evolving inter-agent relationships dynamically.

Method: Uses EvolveGCN (dynamic graph convolutional network) and incorporates physics-constrained loss functions based on Clohessy-Wiltshire equations.

Result: Enhanced reliability of future state estimations in multi-agent scenarios.

Conclusion: The proposed approach effectively combines dynamic graph modeling with physical constraints for better prediction in autonomous systems.

Abstract: In the rapidly evolving domain of autonomous systems, interaction among agents within a shared environment is both inevitable and essential for enhancing overall system capabilities. A key requirement in such multi-agent systems is the ability of each agent to reliably predict the future positions of its nearest neighbors. Traditionally, graphs and graph theory have served as effective tools for modeling inter-agent communication and relationships. While this approach is widely used, the present work proposes a novel method that leverages dynamic graphs in a forward-looking manner, employing EvolveGCN, a dynamic graph convolutional network, to forecast the evolution of inter-agent relationships over time. To improve prediction accuracy and ensure physical plausibility, this research incorporates physics-constrained loss functions based on the Clohessy-Wiltshire equations of motion. This integrated approach enhances the reliability of future state estimations in multi-agent scenarios.
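
As a concrete illustration of a physics-constrained loss built on the Clohessy-Wiltshire equations ($\ddot{x} = 3n^{2}x + 2n\dot{y}$, $\ddot{y} = -2n\dot{x}$, $\ddot{z} = -n^{2}z$ for mean motion $n$), the sketch below penalizes finite-difference violations of these dynamics along a predicted trajectory. The paper's exact formulation may differ.

```python
# Hedged sketch of a Clohessy-Wiltshire residual loss: penalize
# finite-difference violations of the relative-motion dynamics along a
# predicted trajectory. n is the reference orbit's mean motion [rad/s].
import torch

def cw_residual_loss(traj, dt, n=0.0011):      # traj: (T, 3) positions
    vel = (traj[1:] - traj[:-1]) / dt          # (T-1, 3)
    acc = (vel[1:] - vel[:-1]) / dt            # (T-2, 3)
    x, z = traj[1:-1, 0], traj[1:-1, 2]
    xdot, ydot = vel[:-1, 0], vel[:-1, 1]
    res_x = acc[:, 0] - (3 * n ** 2 * x + 2 * n * ydot)
    res_y = acc[:, 1] + 2 * n * xdot
    res_z = acc[:, 2] + n ** 2 * z
    return (res_x ** 2 + res_y ** 2 + res_z ** 2).mean()

traj = torch.randn(20, 3, requires_grad=True)
loss = cw_residual_loss(traj, dt=10.0)
loss.backward()  # usable as an auxiliary term next to the prediction loss
print(float(loss))
```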

[360] Multi-Agent Path Finding Among Dynamic Uncontrollable Agents with Statistical Safety Guarantees

Kegan J. Strawn, Thomy Phan, Eric Wang, Nora Ayanian, Sven Koenig, Lars Lindemann

Main category: cs.MA

TL;DR: A novel MAPF solver, CP-Solver, integrates uncertainty quantification for uncontrollable agents using conformal prediction, ensuring collision-free paths in dynamic environments.

DetailsMotivation: Existing MAPF solvers lack handling for uncertain behavior of uncontrollable agents, necessitating a robust solution for dynamic environments.

Method: Combines a learned predictor for uncontrollable agents’ movement, conformal prediction for uncertainty quantification, and integrates these into a modified ECBS solver.

Result: CP-Solver provides statistical guarantees for collision-free paths, scales to lifelong missions, and performs competitively in warehouse and game maps.

Conclusion: CP-Solver effectively addresses uncertainty in MAPF, offering scalable and collision-free solutions for dynamic environments.

Abstract: Existing multi-agent path finding (MAPF) solvers do not account for uncertain behavior of uncontrollable agents. We present a novel variant of Enhanced Conflict-Based Search (ECBS), for both one-shot and lifelong MAPF in dynamic environments with uncontrollable agents. Our method consists of (1) training a learned predictor for the movement of uncontrollable agents, (2) quantifying the prediction error using conformal prediction (CP), a tool for statistical uncertainty quantification, and (3) integrating these uncertainty intervals into our modified ECBS solver. Our method can account for uncertain agent behavior, comes with statistical guarantees on collision-free paths for one-shot missions, and scales to lifelong missions with a receding horizon sequence of one-shot instances. We run our algorithm, CP-Solver, across warehouse and game maps, with competitive throughput and reduced collisions.
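
For context, the split conformal prediction step the abstract relies on is simple: collect prediction errors on a calibration set, then pad future predictions with a finite-sample-corrected quantile so the true position is covered with probability at least $1-\alpha$. A minimal sketch follows; the integration into ECBS is the paper's contribution and is not shown.

```python
# Split conformal prediction in a few lines: the calibrated radius covers the
# true position with probability >= 1 - alpha under exchangeability.
import numpy as np

def conformal_radius(calib_errors, alpha=0.1):
    n = len(calib_errors)
    q = np.ceil((n + 1) * (1 - alpha)) / n       # finite-sample correction
    return np.quantile(calib_errors, min(q, 1.0), method='higher')

rng = np.random.default_rng(0)
calib_errors = np.abs(rng.normal(scale=0.5, size=500))  # |predicted - true|
r = conformal_radius(calib_errors, alpha=0.1)
print(r)  # inflate the predicted position of each uncontrollable agent by r
```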

[361] Towards Simulating Social Influence Dynamics with LLM-based Multi-agents

Hsien-Tsung Lin, Pei-Cing Huang, Chan-Tung Ku, Chan Hsu, Pei-Xuan Shieh, Yihuang Kang

Main category: cs.MA

TL;DR: LLM-based multi-agent simulations can mimic human social dynamics like conformity and polarization, with smaller models showing more conformity and reasoning-optimized models resisting social influence.

DetailsMotivation: To explore if LLMs can replicate human social interactions observed in online forums.

Method: Structured simulation framework evaluating conformity, group polarization, and fragmentation across model scales and reasoning capabilities.

Result: Smaller models exhibit higher conformity; reasoning-optimized models resist social influence.

Conclusion: LLMs can simulate human social dynamics, with model scale and reasoning affecting outcomes.

Abstract: Recent advancements in Large Language Models offer promising capabilities to simulate complex human social interactions. We investigate whether LLM-based multi-agent simulations can reproduce core human social dynamics observed in online forums. We evaluate conformity dynamics, group polarization, and fragmentation across different model scales and reasoning capabilities using a structured simulation framework. Our findings indicate that smaller models exhibit higher conformity rates, whereas models optimized for reasoning are more resistant to social influence.

cs.MM

[362] GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, Hongtao Xie

Main category: cs.MM

TL;DR: The paper introduces GestureHYDRA, a hybrid-modality system for co-speech gesture generation, focusing on semantically explicit hand gestures. It uses a diffusion transformer and a retrieval-augmented strategy for improved performance.

DetailsMotivation: Previous works neglect explicit semantic hand gestures in co-speech synthesis. This paper aims to generate gestures with instructional value, common in live streaming.

Method: A hybrid-modality diffusion transformer architecture (GestureHYDRA) is proposed, with motion-style injective layers and a cascaded retrieval-augmented generation strategy using a semantic gesture repository.

Result: The approach outperforms counterparts in quantitative and qualitative experiments, demonstrating superior gesture activation and efficiency.

Conclusion: GestureHYDRA effectively generates semantically explicit hand gestures, enhancing co-speech gesture synthesis with improved modeling and synchronization.

Abstract: While increasing attention has been paid to co-speech gesture synthesis, most previous works neglect to investigate hand gestures with explicit and essential semantics. In this paper, we study co-speech gesture generation with an emphasis on specific hand gesture activation, which can deliver more instructional information than common body movements. To achieve this, we first build a high-quality dataset of 3D human body movements including a set of semantically explicit hand gestures that are commonly used by live streamers. Then we present a hybrid-modality gesture generation system GestureHYDRA built upon a hybrid-modality diffusion transformer architecture with newly designed motion-style injective transformer layers, which enables advanced gesture modeling ability and versatile gesture operations. To guarantee that these specific hand gestures can be activated, we introduce a cascaded retrieval-augmented generation strategy built upon a semantic gesture repository annotated for each subject and an adaptive audio-gesture synchronization mechanism, which substantially improves semantic gesture activation and production efficiency. Quantitative and qualitative experiments demonstrate that our proposed approach achieves superior performance over all counterparts. The project page can be found at https://mumuwei.github.io/GestureHYDRA/.

eess.AS

[363] Scaling and Distilling Transformer Models for sEMG

Nicholas Mehlman, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Kelvin Niu, Alexander H. Miller, Shagun Sodhani

Main category: eess.AS

TL;DR: Vanilla transformers scaled up on sEMG data improve cross-user performance up to 110M parameters, with distillation enabling efficient smaller models.

DetailsMotivation: Limited training data and computational constraints hinder scaling sEMG models, but transformers show promise.

Method: Scale vanilla transformers on sEMG data and distill large models into smaller ones.

Result: Models up to 110M parameters outperform smaller ones, and distillation reduces size by 50x with <1.5% performance loss.

Conclusion: Scaled transformers and distillation enable efficient, high-performance sEMG models for real-world use.

Abstract: Surface electromyography (sEMG) signals offer a promising avenue for developing innovative human-computer interfaces by providing insights into muscular activity. However, the limited volume of training data and computational constraints during deployment have restricted the investigation of scaling up the model size for solving sEMG tasks. In this paper, we demonstrate that vanilla transformer models can be effectively scaled up on sEMG data and yield improved cross-user performance up to 110M parameters, surpassing the model size regime investigated in other sEMG research (usually <10M parameters). We show that >100M-parameter models can be effectively distilled into models 50x smaller with minimal loss of performance (<1.5% absolute). This results in efficient and expressive models suitable for complex real-time sEMG tasks in real-world environments.
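
The abstract does not give the distillation recipe, but a standard formulation of the loss involved is sketched below: the student matches the teacher's temperature-softened logits alongside the ground-truth labels. All shapes and hyperparameters are illustrative.

```python
# A standard knowledge-distillation loss: KL between temperature-softened
# teacher and student logits, mixed with the hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(8, 10), torch.randn(8, 10)  # student/teacher logits, 10 classes
y = torch.randint(0, 10, (8,))
print(float(distillation_loss(s, t, y)))
```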

[364] Tiny Noise-Robust Voice Activity Detector for Voice Assistants

Hamed Jafarzadeh Asl, Mahsa Ghazvini Nejad, Amin Edraki, Masoud Asgharian, Vahid Partovi Nia

Main category: eess.AS

TL;DR: A noise-robust Voice Activity Detection (VAD) method is proposed, enhancing accuracy in noisy environments without increasing model size or requiring fine-tuning.

DetailsMotivation: Accurate VAD is crucial for speech processing applications like voice assistants, especially in noisy AIoT devices, where existing models struggle with low signal-to-noise ratios.

Method: The approach combines a light-weight VAD with data pre-processing and post-processing modules to handle background noise.

Result: The method significantly improves VAD accuracy in noisy environments and also enhances clean speech detection, outperforming baselines.

Conclusion: The proposed noise-robust VAD is effective and practical for real-world applications with background noise.

Abstract: Voice Activity Detection (VAD) in the presence of background noise remains a challenging problem in speech processing. Accurate VAD is essential in automatic speech recognition, voice-to-text, conversational agents, and similar applications, where noise can severely degrade performance. A modern example is the voice assistant mounted on Artificial Intelligence of Things (AIoT) devices such as cell phones, smart glasses, and earbuds, where the voice signal includes background noise. VAD modules must therefore remain light-weight due to practical on-device limitations. Existing models often struggle with low signal-to-noise ratios across diverse acoustic environments: a simple VAD may detect the human voice reliably in a clean environment but struggle in noisy conditions. We propose a noise-robust VAD that comprises a light-weight VAD with added data pre-processing and post-processing modules to handle the background noise. This approach significantly enhances VAD accuracy in noisy environments and requires neither a larger model nor fine-tuning. Experimental results demonstrate that our approach achieves a notable improvement over baselines, particularly in environments with high background noise interference, while additionally improving clean speech detection.
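
As one example of the kind of post-processing module such a system might include (the paper does not specify its design), a common trick is "hangover" smoothing, which keeps the detector active for a few frames after the last speech frame so noise-induced dropouts do not fragment an utterance:

```python
# 'Hangover' smoothing of frame-level VAD decisions: stay active for `hang`
# frames after the last detected speech frame.
import numpy as np

def hangover_smooth(frames, hang=3):
    out, counter = np.zeros_like(frames), 0
    for i, f in enumerate(frames):
        counter = hang if f else max(counter - 1, 0)
        out[i] = 1 if counter > 0 else 0
    return out

raw = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1])
print(hangover_smooth(raw))
# -> [0 1 1 1 1 1 1 1 0 0 0 0 1 1]: dropouts inside speech are bridged
```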

[365] The Risks and Detection of Overestimated Privacy Protection in Voice Anonymisation

Michele Panariello, Sarina Meyer, Pierre Champion, Xiaoxiao Miao, Massimiliano Todisco, Ngoc Thang Vu, Nicholas Evans

Main category: eess.AS

TL;DR: The paper highlights the risk of overestimating voice anonymization performance due to poorly trained speaker verification systems and provides a method to detect untrustworthy assessments.

DetailsMotivation: To address the potential overestimation of privacy protection in voice anonymization caused by mismatched or poorly trained verification systems.

Method: The authors demonstrate the issue with examples from literature, introduce a detection method for untrustworthy performance assessments, and provide an open-source solution.

Result: They found cases where anonymization performance was overestimated by up to 74% and successfully detected all such scenarios.

Conclusion: The paper proposes a practical solution to ensure accurate performance assessment in voice anonymization, available as an open-source tool.

Abstract: Voice anonymisation aims to conceal the voice identity of speakers in speech recordings. Privacy protection is usually estimated from the difficulty of using a speaker verification system to re-identify the speaker post-anonymisation. Performance assessments are therefore dependent on the verification model as well as the anonymisation system. There is hence potential for privacy protection to be overestimated when the verification system is poorly trained, perhaps with mismatched data. In this paper, we demonstrate the insidious risk of overestimating anonymisation performance and show examples of exaggerated performance reported in the literature. For the worst case we identified, performance is overestimated by 74% relative. We then introduce a means to detect when performance assessment might be untrustworthy and show that it can identify all overestimation scenarios presented in the paper. Our solution is openly available as a fork of the 2024 VoicePrivacy Challenge evaluation toolkit.

[366] Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction

Xiajie Zhou, Candy Olivia Mawalim, Masashi Unoki

Main category: eess.AS

TL;DR: A method for predicting speech intelligibility by simulating auditory degradations from hearing loss, outperforming HASPI v2 in accuracy.

DetailsMotivation: Standard audiometry fails to capture frequency and temporal resolution deficits in hearing loss, necessitating a better prediction method.

Method: Simulates hearing loss effects via cochlear filter broadening and modulation filtering, uses STM representations and NCC matrices, and trains a Vision Transformer model.

Result: Outperforms HASPI v2, reducing error by 16.5% (mild) and 6.1% (moderate-to-severe hearing loss).

Conclusion: Explicit modeling of auditory degradations improves speech intelligibility prediction and interpretability.

Abstract: The diverse perceptual consequences of hearing loss severely impede speech communication, but standard clinical audiometry, which is focused on threshold-based frequency sensitivity, does not adequately capture deficits in frequency and temporal resolution. To address this limitation, we propose a speech intelligibility prediction method that explicitly simulates auditory degradations according to hearing loss severity by broadening cochlear filters and applying low-pass modulation filtering to temporal envelopes. Speech signals are subsequently analyzed using the spectro-temporal modulation (STM) representations, which reflect how auditory resolution loss alters the underlying modulation structure. In addition, normalized cross-correlation (NCC) matrices quantify the similarity between the STM representations of clean speech and speech in noise. These auditory-informed features are utilized to train a Vision Transformer-based regression model that integrates the STM maps and NCC embeddings to estimate speech intelligibility scores. Evaluations on the Clarity Prediction Challenge corpus show that the proposed method outperforms the Hearing-Aid Speech Perception Index v2 (HASPI v2) in both mild and moderate-to-severe hearing loss groups, with a relative root mean squared error reduction of 16.5% for the mild group and a 6.1% reduction for the moderate-to-severe group. These results highlight the importance of explicitly modeling listener-specific frequency and temporal resolution degradations to improve speech intelligibility prediction and provide interpretability in auditory distortions.
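
The normalized cross-correlation (NCC) feature computed between clean and noisy STM representations is, in its simplest form, a mean-and-variance-normalized inner product; the sketch below uses random stand-ins for the STM maps.

```python
# Normalized cross-correlation between two feature maps as a scalar
# similarity; the STM maps here are random stand-ins.
import numpy as np

def ncc(a, b, eps=1e-8):
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

rng = np.random.default_rng(0)
stm_clean = rng.standard_normal((32, 64))
stm_noisy = stm_clean + 0.5 * rng.standard_normal((32, 64))
print(ncc(stm_clean, stm_noisy))  # nearer 1 = cleaner modulation structure
```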

[367] A k-space approach to modeling multi-channel parametric array loudspeaker systems

Tao Zhuang, Longbiao He, Feng Niu, Jia-Xin Zhong, Jing Lu

Main category: eess.AS

TL;DR: A k-space method is proposed for efficient and accurate modeling of multi-channel parametric array loudspeakers (MCPAL) systems, achieving significant speed-up and computational efficiency.

DetailsMotivation: Efficient and accurate prediction of sound fields in MCPAL systems is challenging due to nonlinear behavior and multi-channel signal processing.

Method: The method uses a k-space approach, solving the linear ultrasound field with the angular spectrum method and computing the quasilinear audio sound field efficiently in k-space using 3D FFTs.

Result: The method achieves a speed-up of over four orders of magnitude compared to direct integration, maintaining accuracy without paraxial approximation.

Conclusion: The proposed approach enables advanced simulation and design of MCPAL systems.

Abstract: Multi-channel parametric array loudspeaker (MCPAL) systems offer enhanced flexibility and promise for generating highly directional audio beams in real-world applications. However, efficient and accurate prediction of their generated sound fields remains a major challenge due to the complex nonlinear behavior and multi-channel signal processing involved. To overcome this obstacle, we propose a k-space approach for modeling arbitrary MCPAL systems arranged on a baffled planar surface. In our method, the linear ultrasound field is first solved using the angular spectrum approach, and the quasilinear audio sound field is subsequently computed efficiently in k-space. By leveraging three-dimensional fast Fourier transforms, our approach not only achieves high computational and memory efficiency but also maintains accuracy without relying on the paraxial approximation. For the typical configurations studied, the proposed method demonstrates a speed-up of more than four orders of magnitude compared to the direct integration method. Our proposed approach paves the way for simulating and designing advanced MCPAL systems.
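
The linear first stage of this pipeline, angular spectrum propagation, has a compact FFT implementation: transform the source plane, multiply by the propagation phase $e^{jk_{z}z}$, and transform back. Below is a minimal 2D sketch with illustrative 40 kHz airborne-ultrasound parameters; the paper's quasilinear k-space audio computation is not reproduced.

```python
# Angular spectrum propagation of a 2D source plane: FFT, multiply by the
# propagation phase exp(j*kz*z), inverse FFT. Parameters are illustrative
# (roughly 40 kHz ultrasound in air).
import numpy as np

def angular_spectrum_propagate(u0, wavelength, dx, z):
    n = u0.shape[0]
    k = 2 * np.pi / wavelength
    fx = np.fft.fftfreq(n, d=dx)
    fxx, fyy = np.meshgrid(fx, fx)
    kz2 = k ** 2 - (2 * np.pi * fxx) ** 2 - (2 * np.pi * fyy) ** 2
    kz = np.sqrt(np.maximum(kz2, 0.0))   # drop evanescent components
    return np.fft.ifft2(np.fft.fft2(u0) * np.exp(1j * kz * z))

u0 = np.zeros((256, 256), dtype=complex)
u0[96:160, 96:160] = 1.0                 # square piston source
field = angular_spectrum_propagate(u0, wavelength=8.6e-3, dx=2e-3, z=0.5)
print(np.abs(field).max())
```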

[368] MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

Sungwoo Cho, Jeongsoo Choi, Sungnyun Kim, Se-Young Yun

Main category: eess.AS

TL;DR: A zero-shot AV2AV translation model using conditional flow matching (CFM) with dual audio-visual guidance to maintain speaker consistency and enhance translation.

DetailsMotivation: Addressing the challenge of speaker consistency in AV2AV translation by leveraging multimodal guidance.

Method: Proposes a CFM-based renderer with x-vectors for audio and emotional cues for visual rendering, independent of semantic content.

Result: Improved speaker consistency and translation performance, evidenced by better LSE and FID scores.

Conclusion: The model effectively handles zero-shot AV2AV translation, enhancing both speech and facial generation quality.

Abstract: Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. By leveraging multimodal guidance with CFM, our model robustly preserves speaker-specific characteristics and enhances zero-shot AV2AV translation abilities. For the audio modality, we enhance the CFM process by integrating robust speaker embeddings with x-vectors, which serve to bolster speaker consistency. Additionally, we convey emotional nuances to the face rendering module. The guidance provided by both audio and visual cues remains independent of semantic or linguistic content, allowing our renderer to effectively handle zero-shot translation tasks for monolingual speakers in different languages. We empirically demonstrate that the inclusion of high-quality mel-spectrograms conditioned on facial information not only enhances the quality of the synthesized speech but also positively influences facial generation, leading to overall performance improvements in LSE and FID score. Our code is available at https://github.com/Peter-SungwooCho/MAVFlow.
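
For reference, the conditional flow matching objective at the core of such a renderer regresses a velocity field onto the displacement between noise and data along a straight-line path. The sketch below is generic CFM with illustrative shapes; the paper's audio-visual conditioning and architecture are its own contribution.

```python
# Generic conditional flow matching loss: learn v_theta(x_t, t, c) to match
# the straight-line velocity x1 - x0. Shapes and the stand-in model are
# illustrative only.
import torch

def cfm_loss(model, x1, cond):
    x0 = torch.randn_like(x1)            # noise endpoint
    t = torch.rand(x1.shape[0], 1)       # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # point on the linear path
    target_v = x1 - x0                   # constant velocity of that path
    return ((model(x_t, t, cond) - target_v) ** 2).mean()

model = lambda x, t, c: torch.zeros_like(x)  # stand-in renderer network
x1 = torch.randn(4, 80)                      # e.g. mel-spectrogram frames
cond = torch.randn(4, 192)                   # e.g. x-vector speaker embedding
print(float(cfm_loss(model, x1, cond)))
```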

[369] Uncovering the role of semantic and acoustic cues in normal and dichotic listening

Sai Samrat Kankanala, Akshara Soman, Sriram Ganapathy

Main category: eess.AS

TL;DR: The paper quantifies the roles of acoustic and semantic cues in speech comprehension using EEG data and a match-mismatch task, proposing a deep-learning model (STEM) that outperforms existing methods.

DetailsMotivation: To understand how acoustic and semantic information streams contribute to speech comprehension in complex listening conditions.

Method: A match-mismatch classification task is designed, using EEG data, speech envelope (acoustic), and textual representations (semantic). A multimodal deep-learning model (STEM) is proposed.

Result: Speech perception is fragmented by word boundaries. Acoustic and semantic cues perform similarly in natural listening, but semantic cues outperform in dichotic listening. STEM shows significant improvements over prior models.

Conclusion: The study quantifies the roles of acoustic and semantic cues in speech comprehension, supporting right-ear advantage in dichotic listening and advancing EEG-based speech analysis.

Abstract: Speech comprehension is an involuntary task for the healthy human brain, yet the understanding of the mechanisms underlying this brain functionality remains obscure. In this paper, we aim to quantify the role of acoustic and semantic information streams in complex listening conditions. We propose a paradigm to understand the encoding of the speech cues in electroencephalogram (EEG) data, by designing a match-mismatch (MM) classification task. The MM task involves identifying whether the stimulus (speech) and response (EEG) correspond to each other. We build a multimodal deep-learning based sequence model STEM, which is input with acoustic stimulus (speech envelope), semantic stimulus (textual representations of speech), and the neural response (EEG data). We perform extensive experiments on two separate conditions: i) natural passive listening, and ii) a dichotic listening task requiring auditory attention. Using the MM task as the analysis framework, we observe that a) speech perception is fragmented based on word boundaries, b) acoustic and semantic cues offer similar levels of MM task performance in natural listening conditions, and c) semantic cues offer significantly improved MM classification over acoustic cues in the dichotic listening task. The comparison of STEM with previously proposed MM models shows significant performance improvements for the proposed approach. The analysis and understanding from this study allow the quantification of the roles played by acoustic and semantic cues in diverse listening tasks and provide further evidence of the right-ear advantage in dichotic listening.

[370] CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

Nian Shao, Rui Zhou, Pengyu Wang, Xian Li, Ying Fang, Yujie Yang, Xiaofei Li

Main category: eess.AS

TL;DR: CleanMel is a Mel-spectrogram denoising and dereverberation network that improves speech quality and ASR performance by processing noisy inputs into clean Mel-spectrograms.

DetailsMotivation: To enhance speech quality and ASR performance by addressing noise and reverberation in single-channel recordings.

Method: Uses interleaved cross-band and narrow-band processing in the Mel-frequency domain to learn spectral patterns and signal properties.

Result: Demonstrates significant improvements in speech quality and ASR performance across multiple datasets.

Conclusion: Mel-spectrogram enhancement is effective for speech quality and ASR, offering a compact and learnable representation.

Abstract: In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to the speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR. Experimental results on five English and one Chinese datasets demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online.
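
For context, the Mel-spectrogram representation such a model operates on can be computed as below (typical parameter values, not necessarily the paper's configuration); the network then maps the noisy-reverberant version of this input to its clean counterpart.

```python
# Computing a log-Mel-spectrogram with librosa (typical settings, not
# necessarily the paper's configuration).
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex('trumpet'), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=128, n_mels=80)
log_mel = np.log(mel + 1e-8)   # compact 80-band representation to enhance
print(log_mel.shape)           # (80, num_frames)
```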

[371] Controllable joint noise reduction and hearing loss compensation using a differentiable auditory model

Philippe Gonzalez, Torsten Dau, Tobias May

Main category: eess.AS

TL;DR: The paper proposes a multi-task learning approach for joint noise reduction (NR) and hearing loss compensation (HLC) using a differentiable auditory model, achieving comparable performance to single-task systems while allowing flexible balance adjustment during inference.

DetailsMotivation: The challenge in HLC is the lack of ground-truth targets and inflexibility in existing methods. Differentiable auditory models offer direct optimization but previous works lacked task balancing or focused on individual listeners.

Method: A multi-task learning framework is used to simultaneously predict denoised and compensated signals from noisy speech and audiograms, leveraging a differentiable auditory model.

Result: The system matches the performance of single-task systems in objective metrics and allows dynamic adjustment of NR and HLC balance during inference.

Conclusion: The proposed multi-task approach effectively combines NR and HLC, offering flexibility and comparable performance to specialized systems.

Abstract: Deep learning-based hearing loss compensation (HLC) seeks to enhance speech intelligibility and quality for hearing impaired listeners using neural networks. One major challenge of HLC is the lack of a ground-truth target. Recent works have used neural networks to emulate non-differentiable auditory peripheral models in closed-loop frameworks, but this approach lacks flexibility. Alternatively, differentiable auditory models allow direct optimization, yet previous studies focused on individual listener profiles, or joint noise reduction (NR) and HLC without balancing each task. This work formulates NR and HLC as a multi-task learning problem, training a system to simultaneously predict denoised and compensated signals from noisy speech and audiograms using a differentiable auditory model. Results show the system achieves similar objective metric performance to systems trained for each task separately, while being able to adjust the balance between NR and HLC during inference.

eess.IV

[372] A Segmentation Framework for Accurate Diagnosis of Amyloid Positivity without Structural Images

Penghan Zhu, Shurui Mei, Shushan Chen, Xiaobo Chu, Shanbo He, Ziyi Liu

Main category: eess.IV

TL;DR: A deep learning framework using a 3D U-Net for brain region segmentation and amyloid positivity classification from PET images alone, achieving high accuracy and potential for clinical use.

DetailsMotivation: To reduce reliance on structural MRI or CT for brain region segmentation and amyloid classification, enabling scalable and reproducible analysis in PET-only settings.

Method: A 3D U-Net with four layers was trained on 200 F18-florbetapir amyloid-PET scans (130/20/50 split). Performance was evaluated using Dice scores and normalized root mean square error for segmentation, and ROC AUC for classification.

Result: Segmentation Dice scores ranged from 0.45 to 0.88, with low normalized RMSE (0.0011). Classification accuracy was 0.98, with an AUC of 0.99.

Conclusion: The model shows promise for PET-only diagnostic pipelines, reducing manual efforts and structural imaging dependence. Future work includes clinical validation and extension to other PET tracers.

Abstract: This study proposes a deep learning-based framework for automated segmentation of brain regions and classification of amyloid positivity using positron emission tomography (PET) images alone, without the need for structural MRI or CT. A 3D U-Net architecture with four layers of depth was trained and validated on a dataset of 200 F18-florbetapir amyloid-PET scans, with a 130/20/50 train/validation/test split. Segmentation performance was evaluated using Dice similarity coefficients across 30 brain regions, with scores ranging from 0.45 to 0.88, demonstrating high anatomical accuracy, particularly in subcortical structures. Quantitative fidelity of PET uptake within clinically relevant regions (precuneus, prefrontal cortex, gyrus rectus, and lateral temporal cortex) was assessed using normalized root mean square error, achieving values as low as 0.0011. Furthermore, the model achieved a classification accuracy of 0.98 for amyloid positivity based on regional uptake quantification, with an area under the ROC curve (AUC) of 0.99. These results highlight the model's potential for integration into PET-only diagnostic pipelines, particularly in settings where structural imaging is not available. This approach reduces dependence on coregistration and manual delineation, enabling scalable, reliable, and reproducible analysis in clinical and research applications. Future work will focus on clinical validation and extension to diverse PET tracers including C11-PiB and other F18-labeled compounds.
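
For reference, the Dice similarity coefficient reported above is $2|A \cap B| / (|A| + |B|)$ over binary masks; a minimal implementation:

```python
# Dice similarity coefficient over binary masks: 2|A n B| / (|A| + |B|).
import numpy as np

def dice(pred, truth, eps=1e-8):
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + eps)

a = np.zeros((4, 4)); a[1:3, 1:3] = 1
b = np.zeros((4, 4)); b[1:3, 2:4] = 1
print(dice(a, b))  # 0.5: two of the four voxels in each region overlap
```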

[373] Whole-brain Transferable Representations from Large-Scale fMRI Data Improve Task-Evoked Brain Activity Decoding

Yueh-Po Peng, Vincent K. M. Cheung, Li Su

Main category: eess.IV

TL;DR: STDA-SwiFT, a transformer-based model, improves fMRI task-evoked activity decoding using spatial-temporal attention and self-supervised learning, leveraging large-scale data for transfer learning.

DetailsMotivation: Decoding mental states from fMRI data is challenging due to high dimensionality, noise, and limited data. The goal is to improve decoding performance.

Method: Proposes STDA-SwiFT, a transformer model with spatial-temporal divided attention and self-supervised contrastive learning, pretrained on HCP data.

Result: Substantial improvement in decoding task-evoked activity across sensory and cognitive domains, even with minimal preprocessing.

Conclusion: Transfer learning with large-scale datasets effectively addresses fMRI decoding challenges.

Abstract: A fundamental challenge in neuroscience is to decode mental states from brain activity. While functional magnetic resonance imaging (fMRI) offers a non-invasive approach to capture brain-wide neural dynamics with high spatial precision, decoding from fMRI data – particularly from task-evoked activity – remains challenging due to its high dimensionality, low signal-to-noise ratio, and limited within-subject data. Here, we leverage recent advances in computer vision and propose STDA-SwiFT, a transformer-based model that learns transferable representations from large-scale fMRI datasets via spatial-temporal divided attention and self-supervised contrastive learning. Using pretrained voxel-wise representations from 995 subjects in the Human Connectome Project (HCP), we show that our model substantially improves downstream decoding performance of task-evoked activity across multiple sensory and cognitive domains, even with minimal data preprocessing. We demonstrate performance gains from larger receptive fields afforded by our memory-efficient attention mechanism, as well as the impact of functional relevance in pretraining data when fine-tuning on small samples. Our work showcases transfer learning as a viable approach to harness large-scale datasets to overcome challenges in decoding brain activity from fMRI data.

[374] Towards Blind Bitstream-corrupted Video Recovery via a Visual Foundation Model-driven Framework

Tianyi Liu, Kejun Wu, Chen Cai, Yi Wang, Kim-Hui Yap, Lap-Pui Chau

Main category: eess.IV

TL;DR: A novel blind bitstream-corrupted video recovery framework integrates visual foundation models with a recovery model, eliminating manual annotation and improving recovery quality.

DetailsMotivation: Bitstream-corrupted videos degrade quality significantly, and existing methods require labor-intensive annotation, making recovery impractical.

Method: Proposes a framework with Detect Any Corruption (DAC) model and Corruption-aware Feature Completion (CFC) module, leveraging visual foundation models and bitstream prompts.

Result: Achieves outstanding recovery performance without manual masks, suppressing artifacts and enhancing residuals.

Conclusion: The method improves user experience and reliability in multimedia systems, enabling wider applications.

Abstract: Video signals are vulnerable in multimedia communication and storage systems, as even slight bitstream-domain corruption can lead to significant pixel-domain degradation. To recover faithful spatio-temporal content from corrupted inputs, bitstream-corrupted video recovery has recently emerged as a challenging and understudied task. However, existing methods require time-consuming and labor-intensive annotation of corrupted regions for each corrupted video frame, resulting in a large workload in practice. In addition, high-quality recovery remains difficult as part of the local residual information in corrupted frames may mislead feature completion and successive content recovery. In this paper, we propose the first blind bitstream-corrupted video recovery framework that integrates visual foundation models with a recovery model, which is adapted to different types of corruption and bitstream-level prompts. Within the framework, the proposed Detect Any Corruption (DAC) model leverages the rich priors of the visual foundation model while incorporating bitstream and corruption knowledge to enhance corruption localization and blind recovery. Additionally, we introduce a novel Corruption-aware Feature Completion (CFC) module, which adaptively processes residual contributions based on high-level corruption understanding. With VFM-guided hierarchical feature augmentation and high-level coordination in a mixture-of-residual-experts (MoRE) structure, our method suppresses artifacts and enhances informative residuals. Comprehensive evaluations show that the proposed method achieves outstanding performance in bitstream-corrupted video recovery without requiring a manually labeled mask sequence. The demonstrated effectiveness will help to realize improved user experience, wider application scenarios, and more reliable multimedia communication and storage systems.

[375] Learned Off-aperture Encoding for Wide Field-of-view RGBD Imaging

Haoyu Wei, Xin Liu, Yuhui Liu, Qiang Fu, Wolfgang Heidrich, Edmund Y. Lam, Yifan Peng

Main category: eess.IV

TL;DR: The paper proposes an off-aperture DOE design to enhance imaging fidelity in wide fields of view, outperforming traditional on-aperture systems by over 5 dB in PSNR.

DetailsMotivation: Existing E2E imaging systems struggle with high computational complexity and off-axis aberrations, limiting image fidelity at wide fields of view.

Method: The work positions a DOE off-aperture for local wavefront control, combining differentiable ray and wave optics modeling in hybrid refractive-diffractive systems.

Result: Off-aperture DOE improves imaging quality by over 5 dB PSNR at ~45° FoV and enables color/depth recovery at ~28° FoV with compound optics.

Conclusion: The off-aperture DOE design is effective and versatile, validated by physical prototypes for enhanced imaging performance.

Abstract: End-to-end (E2E) designed imaging systems integrate coded optical designs with decoding algorithms to enhance imaging fidelity for diverse visual tasks. However, existing E2E designs encounter significant challenges in maintaining high image fidelity at wide fields of view, due to high computational complexity, as well as difficulties in modeling off-axis wave propagation while accounting for off-axis aberrations. In particular, the common approach of placing the encoding element into the aperture or pupil plane results in only a global control of the wavefront. To overcome these limitations, this work explores an additional design choice by positioning a DOE off-aperture, enabling a spatial unmixing of the degrees of freedom and providing local control over the wavefront over the image plane. Our approach further leverages hybrid refractive-diffractive optical systems by linking differentiable ray and wave optics modeling, thereby optimizing depth imaging quality and demonstrating system versatility. Experimental results reveal that the off-aperture DOE enhances the imaging quality by over 5 dB in PSNR at a FoV of approximately $45^\circ$ when paired with a simple thin lens, outperforming traditional on-aperture systems. Furthermore, we successfully recover color and depth information at nearly $28^\circ$ FoV using off-aperture DOE configurations with compound optics. Physical prototypes for both applications validate the effectiveness and versatility of the proposed method.

[376] trAIce3D: A Prompt-Driven Transformer Based U-Net for Semantic Segmentation of Microglial Cells from Large-Scale 3D Microscopy Images

MohammadAmin Alamalhoda, Arsalan Firoozi, Alessandro Venturino, Sandra Siegert

Main category: eess.IV

TL;DR: trAIce3D is a deep-learning model for precise 3D microglia segmentation, addressing challenges like overlapping structures and noise. It uses a two-stage U-Net with transformers and cross-attention, achieving high accuracy and scalability.

DetailsMotivation: Current segmentation methods fail to accurately capture microglia structures, limiting clinical insights. trAIce3D aims to overcome these limitations for better analysis of neurodegenerative diseases.

Method: A two-stage 3D U-Net with vision transformers and cross-attention blocks. First stage detects somas; second refines branches using soma prompts. Training involves self-supervised and prompt-based phases.

Result: Evaluated on 41,230 microglial cells, trAIce3D improves segmentation accuracy and generalization, enabling scalable analysis of complex morphologies.

Conclusion: trAIce3D excels in microglia segmentation and can adapt to other cell types, enhancing neurobiological research.

Abstract: The shape of a cell contains essential information about its function within the biological system. Segmenting these structures from large-scale 3D microscopy images is challenging, limiting clinical insights especially for microglia, immune-associated cells involved in neurodegenerative diseases. Existing segmentation methods mainly focus on cell bodies, struggle with overlapping structures, perform poorly on noisy images, require hyperparameter tuning for each new dataset, or rely on tedious semi-automated approaches. We introduce trAIce3D, a deep-learning architecture designed for precise microglia segmentation, capturing both somas and branches. It employs a two-stage approach: first, a 3D U-Net with vision transformers in the encoder detects somas using a sliding-window technique to cover the entire image. Then, the same architecture, enhanced with cross-attention blocks in skip connections, refines each soma and its branches by using soma coordinates as a prompt and a 3D window around the target cell as input. Training occurs in two phases: self-supervised Soma Segmentation, followed by prompt-based Branch Segmentation, leveraging pre-trained weights from the first phase. Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies. While optimized for microglia, its architecture can extend to other intricate cell types, such as neurons and astrocytes, broadening its impact on neurobiological research.

[377] A Dual-Feature Extractor Framework for Accurate Back Depth and Spine Morphology Estimation from Monocular RGB Images

Yuxin Wei, Yue Zhang, Moxin Zhao, Chang Shi, Jason P. Y. Cheung, Teng Zhang, Nan Meng

Main category: eess.IV

TL;DR: A novel pipeline using depth and surface information for spine morphology estimation in scoliosis, overcoming limitations of X-rays and RGB images, achieves high accuracy.

DetailsMotivation: Current AIS assessment tools like X-rays have radiation risks and accessibility issues, while RGB images are unstable due to environmental factors.

Method: Proposes GAMA-Net for depth estimation, using dual encoders and attention modules, then integrates depth and surface data for spine analysis.

Result: Depth estimation scores 78.2%, 93.6%, and 97.5% on metrics; spine curve generation achieves 97% accuracy.

Conclusion: The integrated approach significantly improves spine morphology estimation, offering a safer and more reliable alternative to X-rays.

Abstract: Scoliosis is a prevalent condition that impacts both physical health and appearance, with adolescent idiopathic scoliosis (AIS) being the most common form. Currently, the main AIS assessment tool, X-rays, poses significant limitations, including radiation exposure and limited accessibility in poor and remote areas. To address this problem, current solutions use RGB images to analyze spine morphology. However, RGB images are highly susceptible to environmental factors, such as lighting conditions, compromising model stability and generalizability. Therefore, in this study, we propose a novel pipeline to accurately estimate the depth information of the unclothed back, compensating for the limitations of 2D information, and then estimate spine morphology by integrating both depth and surface information. To capture the subtle depth variations of the back surface with precision, we design an adaptive multiscale feature learning network named Grid-Aware Multiscale Adaptive Network (GAMA-Net). This model uses dual encoders to extract both patch-level and global features, which are then interacted by the Patch-Based Hybrid Attention (PBHA) module. The Adaptive Multiscale Feature Fusion (AMFF) module is used to dynamically fuse information in the decoder. As a result, our depth estimation model achieves remarkable accuracy across three different evaluation metrics, with scores of nearly 78.2%, 93.6%, and 97.5%, respectively. To further validate the effectiveness of the predicted depth, we integrate both surface and depth information for spine morphology estimation. This integrated approach enhances the accuracy of spine curve generation, achieving an impressive performance of up to 97%.

[378] Optimizing Federated Learning Configurations for MRI Prostate Segmentation and Cancer Detection: A Simulation Study

Ashkan Moradi, Fadila Zerka, Joeran S. Bosma, Mohammed R. S. Sunoqrot, Bendik S. Abrahamsen, Derya Yakar, Jeroen Geerdink, Henkjan Huisman, Tone Frost Bathen, Mattijs Elschot

Main category: eess.IV

TL;DR: A federated learning framework was optimized for MRI prostate segmentation and csPCa detection, showing improved performance over local models.

DetailsMotivation: To enhance the accuracy and generalizability of prostate segmentation and csPCa detection using federated learning across multiple clients.

Method: Used Flower FL to train a nnU-Net-based model, optimizing local epochs, federated rounds, and aggregation strategies for two tasks: segmentation (T2-weighted MRIs) and csPCa detection (biparametric MRIs).

Result: The optimized FL configurations (FedMedian for segmentation, FedAdagrad for detection) significantly outperformed the average of the local client models on both tasks; compared with the FL baseline, the optimized model improved lesion detection but showed no evidence of a difference for segmentation.

Conclusion: FL improves performance and generalizability for prostate segmentation and csPCa detection, with further gains from configuration optimization, especially in lesion detection.

Abstract: Purpose: To develop and optimize a federated learning (FL) framework across multiple clients for biparametric MRI prostate segmentation and clinically significant prostate cancer (csPCa) detection. Materials and Methods: A retrospective study was conducted using Flower FL to train a nnU-Net-based architecture for MRI prostate segmentation and csPCa detection, using data collected from January 2010 to August 2021. Model development included training and optimizing local epochs, federated rounds, and aggregation strategies for FL-based prostate segmentation on T2-weighted MRIs (four clients, 1294 patients) and csPCa detection using biparametric MRIs (three clients, 1440 patients). Performance was evaluated on independent test sets using the Dice score for segmentation and the Prostate Imaging: Cancer Artificial Intelligence (PI-CAI) score, defined as the average of the area under the receiver operating characteristic curve and average precision, for csPCa detection. P-values for performance differences were calculated using permutation testing. Results: The FL configurations were independently optimized for both tasks, showing the best performance with 1 local epoch over 300 federated rounds using FedMedian for prostate segmentation, and 5 local epochs over 200 rounds using FedAdagrad for csPCa detection. Compared with the average performance of the clients, the optimized FL model significantly improved performance in prostate segmentation and csPCa detection on the independent test set. The optimized FL model showed higher lesion detection performance compared to the FL-baseline model, but no evidence of a difference was observed for prostate segmentation. Conclusions: FL enhanced the performance and generalizability of MRI prostate segmentation and csPCa detection compared with local models, and optimizing its configuration further improved lesion detection performance.
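
The two optimized configurations map naturally onto Flower's strategy API. The sketch below wires them up with schematic stub clients; only the strategy names (FedMedian, FedAdagrad), round counts, local-epoch settings, and client counts come from the abstract, and everything else is a placeholder.

```python
import flwr as fl
from flwr.common import ndarrays_to_parameters

class StubClient(fl.client.NumPyClient):
    """Placeholder client; a real one would wrap local nnU-Net training."""
    def get_parameters(self, config): return []
    def fit(self, parameters, config): return [], 0, {}       # would train for config["local_epochs"]
    def evaluate(self, parameters, config): return 0.0, 0, {}

def client_fn(cid: str):
    return StubClient().to_client()

# Task 1: prostate segmentation -- FedMedian, 1 local epoch, 300 rounds, 4 clients.
seg_strategy = fl.server.strategy.FedMedian(
    on_fit_config_fn=lambda rnd: {"local_epochs": 1},
)
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=4,
    config=fl.server.ServerConfig(num_rounds=300),
    strategy=seg_strategy,
)

# Task 2: csPCa detection -- FedAdagrad, 5 local epochs, 200 rounds;
# launched the same way with num_clients=3 and num_rounds=200.
det_strategy = fl.server.strategy.FedAdagrad(
    initial_parameters=ndarrays_to_parameters([]),  # real use: the initial nnU-Net weights
    on_fit_config_fn=lambda rnd: {"local_epochs": 5},
)
```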

[379] Beyond Image Prior: Embedding Noise Prior into Conditional Denoising Transformer

Yuanfei Huang, Hua Huang

Main category: eess.IV

TL;DR: The paper introduces a conditional optimization framework for denoising, using a Locally Noise Prior Estimation (LoNPE) algorithm and a Conditional Denoising Transformer (Condformer) to improve generalization and flexibility.

DetailsMotivation: Existing denoising methods struggle with real-world noise variability. The paper addresses this by separating noise and image priors.

Method: Develops LoNPE for noise prior estimation from a single noisy image and Condformer, a transformer incorporating noise priors via conditional self-attention.

Result: Outperforms state-of-the-art methods on synthetic and real-world datasets.

Conclusion: The proposed framework enhances denoising by explicitly leveraging noise priors, improving generalization and flexibility.

Abstract: Existing learning-based denoising methods typically train models to generalize the image prior from large-scale datasets, suffering from the variability in noise distributions encountered in real-world scenarios. In this work, we propose a new perspective on the denoising challenge by highlighting the distinct separation between noise and image priors. This insight forms the basis for our development of a conditional optimization framework, designed to overcome the constraints of traditional denoising frameworks. To this end, we introduce a Locally Noise Prior Estimation (LoNPE) algorithm, which accurately estimates the noise prior directly from a single raw noisy image. This estimation acts as an explicit prior representation of the camera sensor’s imaging environment, distinct from the image prior of scenes. Additionally, we design an auxiliary learnable LoNPE network tailored for practical application to sRGB noisy images. Leveraging the estimated noise prior, we present a novel Conditional Denoising Transformer (Condformer), by incorporating the noise prior into a conditional self-attention mechanism. This integration allows the Condformer to segment the optimization process into multiple explicit subspaces, significantly enhancing the model’s generalization and flexibility. Extensive experimental evaluations on both synthetic and real-world datasets demonstrate that the proposed method achieves superior performance over current state-of-the-art methods. The source code is available at https://github.com/YuanfeiHuang/Condformer.
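
One plausible reading of noise-prior-conditioned self-attention is to embed the estimated prior as an extra key/value token that every pixel token can attend to, as in the sketch below; this is an illustrative interpretation, not the published Condformer design.

```python
import torch
import torch.nn as nn

class ConditionalSelfAttention(nn.Module):
    def __init__(self, dim: int = 48, prior_dim: int = 2, heads: int = 4):
        super().__init__()
        self.prior_embed = nn.Linear(prior_dim, dim)   # e.g. a 2D (read, shot) noise estimate
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, noise_prior: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) pixel/window tokens; noise_prior: (B, prior_dim)
        prior_tok = self.prior_embed(noise_prior).unsqueeze(1)   # (B, 1, C)
        kv = torch.cat([x, prior_tok], dim=1)                    # condition keys/values on the prior
        out, _ = self.attn(x, kv, kv)
        return self.norm(x + out)

tokens = torch.randn(2, 64, 48)
prior = torch.rand(2, 2)        # stand-in for a LoNPE-style noise estimate
print(ConditionalSelfAttention()(tokens, prior).shape)  # torch.Size([2, 64, 48])
```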

[380] Unsupervised Multi-Parameter Inverse Solving for Reducing Ring Artifacts in 3D X-Ray CBCT

Qing Wu, Hongjiang Wei, Jingyi Yu, Yuyao Zhang

Main category: eess.IV

TL;DR: Riner is an unsupervised method for reducing ring artifacts in 3D CBCT, outperforming supervised methods by modeling the problem as a multi-parameter inverse problem and learning artifact-free images directly from data.

DetailsMotivation: Ring artifacts degrade CBCT image quality, and supervised methods fail to generalize well or scale efficiently to 3D.

Method: Riner reformulates ring artifact reduction as a solvable inverse problem, using a differentiable forward model to learn artifact-free images and physical parameters without external data.

Result: Riner outperforms SOTA supervised methods on simulated and real-world datasets and is memory-efficient for 3D CBCT.

Conclusion: Riner provides a scalable, unsupervised solution for ring artifact reduction, addressing limitations of supervised approaches.

Abstract: Ring artifacts are prevalent in 3D cone-beam computed tomography (CBCT) due to non-ideal responses of X-ray detectors, substantially affecting image quality and diagnostic reliability. Existing state-of-the-art (SOTA) ring artifact reduction (RAR) methods rely on supervised learning with large-scale paired CT datasets. While effective in-domain, supervised methods tend to struggle to fully capture the physical characteristics of ring artifacts, leading to pronounced performance drops in complex real-world acquisitions. Moreover, their scalability to 3D CBCT is limited by high memory demands. In this work, we propose Riner, a new unsupervised RAR method. Based on a theoretical analysis of ring artifact formation, we reformulate RAR as a multi-parameter inverse problem, where the non-ideal responses of X-ray detectors are parameterized as solvable physical variables. Using a new differentiable forward model, Riner can jointly learn the implicit neural representation of artifact-free images and estimate the physical parameters directly from CT measurements, without external training data. Additionally, Riner is memory-friendly due to its ray-based optimization, enhancing its usability in large-scale 3D CBCT. Experiments on both simulated and real-world datasets show Riner outperforms existing SOTA supervised methods.
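
The sketch below illustrates the multi-parameter inverse formulation at toy scale: an implicit neural representation of the image and per-detector response parameters are optimized jointly through a differentiable forward model. The trivial ray integral, the exponential gain model, and the loss are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Coordinate MLP standing in for the implicit image representation."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:   # (N, 2) in [-1, 1]^2
        return self.net(coords).squeeze(-1)

n_det, n_samples = 32, 32
inr = TinyINR()
log_gain = nn.Parameter(torch.zeros(n_det))        # solvable per-detector response
opt = torch.optim.Adam([*inr.parameters(), log_gain], lr=1e-3)
measured = torch.rand(n_det)                       # stand-in for one ring-corrupted projection row

for step in range(200):
    xs = torch.linspace(-1, 1, n_det)              # one vertical "ray" per detector column
    ts = torch.linspace(-1, 1, n_samples)
    coords = torch.stack(torch.meshgrid(xs, ts, indexing="ij"), dim=-1).reshape(-1, 2)
    line_integrals = inr(coords).reshape(n_det, n_samples).mean(dim=1)  # crude ray integral
    predicted = torch.exp(log_gain) * line_integrals                    # non-ideal detector response
    loss = torch.mean((predicted - measured) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
```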

[381] Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

Du Chen, Liyi Chen, Zhengqiang Zhang, Lei Zhang

Main category: eess.IV

TL;DR: The paper proposes GSASR, a method using Gaussian Splatting (GS) for Arbitrary-scale Super-Resolution (ASR), overcoming limitations of Implicit Neural Representations (INR) by generalizing GS for ASR with novel techniques.

DetailsMotivation: INR-based models for ASR suffer from insufficient representation capability and low computational efficiency. GS shows advantages in 3D tasks, prompting exploration of its use for ASR.

Method: Two novel techniques: (1) an architecture to predict image-conditioned Gaussians for input images, and (2) a differentiable 2D GPU/CUDA-based rasterization for rendering.

Result: GSASR achieves effective ASR for any image and unseen scaling factors, validated through extensive experiments.

Conclusion: GSASR demonstrates superior performance in ASR by leveraging GS, offering a promising alternative to INR-based methods.

Abstract: Implicit Neural Representations (INR) have been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, INR-based models need to query the multi-layer perceptron module numerous times and render a pixel in each query, resulting in insufficient representation capability and low computational efficiency. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method that overfits each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we carefully design an architecture to predict the image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Each Gaussian can fit the shape and direction of an area of complex textures, showing powerful representation capability. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can perform ASR for any image and unseen scaling factors. Extensive experiments validate the effectiveness of our proposed method. The code and models are available at https://github.com/ChrisDud0257/GSASR.
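
A naive dense version of the rendering step conveys the core idea: each output pixel accumulates color from predicted continuous 2D Gaussians, so the same Gaussians can be sampled at any output resolution. The real method predicts anisotropic Gaussians and uses a CUDA rasterizer; the isotropic, per-pixel-normalized variant below is purely illustrative.

```python
import torch

def render_gaussians(mu, sigma, color, height, width):
    # mu: (G, 2) centers in [0, 1]^2; sigma: (G,) isotropic scales; color: (G, 3)
    ys = torch.linspace(0, 1, height)
    xs = torch.linspace(0, 1, width)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pix = torch.stack([gx, gy], dim=-1).reshape(-1, 2)          # (H*W, 2) pixel centers
    d2 = ((pix[:, None, :] - mu[None, :, :]) ** 2).sum(-1)      # (H*W, G) squared distances
    w = torch.exp(-0.5 * d2 / (sigma[None, :] ** 2))            # Gaussian weights
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)                 # normalize per pixel
    return (w @ color).reshape(height, width, 3)                # blended RGB image

g = 16
mu = torch.rand(g, 2); sigma = torch.full((g,), 0.08); color = torch.rand(g, 3)
for h, w_out in [(32, 32), (48, 48)]:       # same Gaussians, arbitrary output scales
    print(render_gaussians(mu, sigma, color, h, w_out).shape)
```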

[382] Clinical Utility of Foundation Segmentation Models in Musculoskeletal MRI: Biomarker Fidelity and Predictive Outcomes

Gabrielle Hoyer, Michelle W Tong, Rupsa Bhattacharjee, Valentina Pedoia, Sharmila Majumdar

Main category: eess.IV

TL;DR: Evaluation of three segmentation models (SAM, SAM2, MedSAM) in MSK MRI across diverse anatomical contexts, assessing accuracy, generalizability, and biomarker reliability. Finetuned models matched expert measurements, and AutoLabel enabled scalable segmentation with trade-offs. Applied to knee MRI triage and osteoarthritis prediction.

DetailsMotivation: To address the lack of evaluation of foundation segmentation models for accuracy and biomarker fidelity in diverse MSK MRI contexts.

Method: Evaluated three models across 11 MSK MRI datasets, assessing zero-shot and finetuned performance, segmentation accuracy, generalizability, and biomarker reliability. Used AutoLabel for scalable segmentation.

Result: Finetuned models agreed with expert measurements for biomarkers like cartilage thickness and muscle volume. AutoLabel enabled scalable segmentation with moderate accuracy trade-offs.

Conclusion: The framework provides a transparent method for benchmarking segmentation tools and linking model performance to clinical priorities, demonstrated in knee MRI triage and osteoarthritis prediction.

Abstract: Effective segmentation is fundamental for quantitative medical imaging; however, foundation segmentation models remain insufficiently evaluated for accuracy and biomarker fidelity across the diverse anatomical contexts and imaging protocols encountered in musculoskeletal (MSK) MRI. We evaluate three widely used segmentation models (SAM, SAM2, MedSAM) across eleven MSK MRI datasets spanning the knee, hip, spine, shoulder, and thigh. Our framework assesses both zero-shot and finetuned performance, with attention to segmentation accuracy, generalizability across imaging protocols, and reliability of derived quantitative biomarkers. Finetuned models showed consistent agreement with expert measurements for biomarkers including cartilage thickness, disc height, muscle volume, and compositional T1rho/T2 values. Automated prompting through the AutoLabel system enabled scalable segmentation, with moderate trade-offs in accuracy. As proof of concept, we applied the validated system to (i) a three-stage knee MRI triage cascade and (ii) a longitudinal landmark model that predicts total knee replacement and incident osteoarthritis. The framework offers a transparent method for benchmarking segmentation tools and connecting model performance to clinical imaging priorities.
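
For a sense of the biomarker readouts such an evaluation rests on, the sketch below computes a volume from a 3D mask (voxel count times voxel size) and checks agreement with an expert mask via Dice; the spacing values and arrays are placeholders.

```python
import numpy as np

def volume_ml(mask: np.ndarray, spacing_mm=(0.8, 0.8, 3.0)) -> float:
    """Volume in mL from a binary mask and voxel spacing in mm."""
    return float(mask.sum() * np.prod(spacing_mm) / 1000.0)   # mm^3 -> mL

def dice(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum() + 1e-8)

pred = np.zeros((16, 64, 64), dtype=bool); pred[4:12, 20:40, 20:40] = True
ref = np.zeros_like(pred); ref[5:12, 22:40, 20:42] = True
print(f"volume: {volume_ml(pred):.1f} mL, Dice vs expert: {dice(pred, ref):.3f}")
```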

[383] Skull-stripping induces shortcut learning in MRI-based Alzheimer’s disease classification

Christian Tinauer, Maximilian Sackl, Rudolf Stollberger, Reinhold Schmidt, Stefan Ropele, Christian Langkammer

Main category: eess.IV

TL;DR: The study evaluates the impact of T1w MRI preprocessing (skull-stripping and binarization) on AD classification, revealing that models rely on volumetric features (brain contours) rather than texture, highlighting potential biases.

DetailsMotivation: To clarify which MRI features (texture, volume, preprocessing artifacts) contribute to AD classification accuracy in deep learning models, addressing interpretability gaps.

Method: Used 990 ADNI T1w MRIs, varied preprocessing (skull-stripping, binarization), trained 3D CNNs, and analyzed feature relevance with Layer-wise Relevance Propagation and clustering.

Result: Classification performance remained stable across preprocessing, with models relying on brain contours (volumetric features) rather than texture, indicating shortcut learning.

Conclusion: Preprocessing artifacts can introduce biases (Clever Hans effect), underscoring the need for interpretability tools to ensure robust medical imaging AI.

Abstract: Objectives: High classification accuracy of Alzheimer’s disease (AD) from structural MRI has been achieved using deep neural networks, yet the specific image features contributing to these decisions remain unclear. In this study, the contributions of T1-weighted (T1w) gray-white matter texture, volumetric information, and preprocessing – particularly skull-stripping – were systematically assessed. Methods: A dataset of 990 matched T1w MRIs from AD patients and cognitively normal controls from the ADNI database was used. Preprocessing was varied through skull-stripping and intensity binarization to isolate texture and shape contributions. A 3D convolutional neural network was trained on each configuration, and classification performance was compared using exact McNemar tests with discrete Bonferroni-Holm correction. Feature relevance was analyzed using Layer-wise Relevance Propagation, image similarity metrics, and spectral clustering of relevance maps. Results: Despite substantial differences in image content, classification accuracy, sensitivity, and specificity remained stable across preprocessing conditions. Models trained on binarized images preserved performance, indicating minimal reliance on gray-white matter texture. Instead, volumetric features – particularly brain contours introduced through skull-stripping – were consistently used by the models. Conclusions: This behavior reflects a shortcut learning phenomenon, where preprocessing artifacts act as potentially unintended cues. The resulting Clever Hans effect emphasizes the critical importance of interpretability tools to reveal hidden biases and to ensure robust and trustworthy deep learning in medical imaging.
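
The key preprocessing manipulation is easy to sketch: binarizing a (possibly skull-stripped) volume removes gray-white texture while preserving shape, so stable accuracy on binarized inputs implicates volumetric cues. The median threshold below is an illustrative stand-in for the study's binarization.

```python
import numpy as np

def binarize(volume: np.ndarray, brain_mask=None) -> np.ndarray:
    vol = volume.copy()
    if brain_mask is not None:
        vol[~brain_mask] = 0                  # skull-stripping introduces the brain contour
    thr = np.median(vol[vol > 0])             # crude stand-in for an intensity threshold
    return (vol > thr).astype(np.float32)     # texture destroyed, shape preserved

t1w = np.random.rand(32, 32, 32)
mask = np.zeros_like(t1w, dtype=bool); mask[8:24, 8:24, 8:24] = True
print(binarize(t1w, mask).mean())             # fraction of suprathreshold voxels
```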

[384] Automated MRI Tumor Segmentation using hybrid U-Net with Transformer and Efficient Attention

Syed Haider Ali, Asrar Ahmad, Muhammad Ali, Asifullah Khan, Nadeem Shaukat

Main category: eess.IV

TL;DR: The paper proposes a hybrid UNet-Transformer model for accurate tumor segmentation in MRI datasets from a local hospital, achieving competitive results with limited data.

DetailsMotivation: Existing AI-based segmentation models lack adaptability to local patient populations, necessitating research on local datasets for clinical integration.

Method: A hybrid UNet-Transformer model with attention modules (efficient attention, SE blocks, CBAM, ResNeXt) was trained on 6080 local MRI images using pretrained weights and dual GPUs.

Result: Achieved a Dice score of 0.764 and IoU of 0.736, showing robust performance despite limited data.

Conclusion: Site-specific model development is crucial for clinical deployment, and the proposed method is effective for local datasets.

Abstract: Cancer is an abnormal growth with potential to invade locally and metastasize to distant organs. Accurate auto-segmentation of the tumor and surrounding normal tissues is required for radiotherapy treatment plan optimization. Recent AI-based segmentation models are generally trained on large public datasets, which lack the heterogeneity of local patient populations. While these studies advance AI-based medical image segmentation, research on local datasets is necessary to develop and integrate AI tumor segmentation models directly into hospital software for efficient and accurate oncology treatment planning and execution. This study enhances tumor segmentation using computationally efficient hybrid UNet-Transformer models on magnetic resonance imaging (MRI) datasets acquired from a local hospital under strict privacy protection. We developed a robust data pipeline for seamless DICOM extraction and preprocessing, followed by extensive image augmentation to ensure model generalization across diverse clinical settings, resulting in a total dataset of 6080 images for training. Our novel architecture integrates UNet-based convolutional neural networks with a transformer bottleneck and complementary attention modules, including efficient attention, Squeeze-and-Excitation (SE) blocks, Convolutional Block Attention Module (CBAM), and ResNeXt blocks. To accelerate convergence and reduce computational demands, we used a maximum batch size of 8 and initialized the encoder with pretrained ImageNet weights, training the model on dual NVIDIA T4 GPUs via checkpointing to overcome Kaggle’s runtime limits. Quantitative evaluation on the local MRI dataset yielded a Dice similarity coefficient of 0.764 and an Intersection over Union (IoU) of 0.736, demonstrating competitive performance despite limited data and underscoring the importance of site-specific model development for clinical deployment.
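
Of the named attention modules, the Squeeze-and-Excitation (SE) block is a useful concrete example: channels are reweighted using globally pooled statistics. Below is the standard SE formulation, not necessarily the paper's exact variant.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # squeeze: global channel statistics
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        scale = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * scale                                      # excitation: channel reweighting

print(SEBlock(64)(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```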

[385] DMCIE: Diffusion Model with Concatenation of Inputs and Errors to Improve the Accuracy of the Segmentation of Brain Tumors in MRI Images

Sara Yavari, Rahul Nitin Pandya, Jacob Furst

Main category: eess.IV

TL;DR: A novel diffusion model (DMCIE) improves brain tumor segmentation in MRI by using error maps and multimodal inputs, outperforming state-of-the-art methods.

DetailsMotivation: Accurate brain tumor segmentation in MRI is critical for diagnosis and treatment. Diffusion models show promise for such tasks.

Method: DMCIE combines a 3D U-Net for initial segmentation with a diffusion model guided by error maps and original MRI inputs (T1, T1ce, T2, FLAIR).

Result: Achieves Dice Score of 93.46 and HD95 of 5.94 mm on BraTS2020, surpassing other diffusion-based methods.

Conclusion: Error-guided diffusion enhances segmentation accuracy, proving effective for precise brain tumor delineation.

Abstract: Accurate segmentation of brain tumors in MRI scans is essential for reliable clinical diagnosis and effective treatment planning. Recently, diffusion models have demonstrated remarkable effectiveness in image generation and segmentation tasks. This paper introduces a novel approach to corrective segmentation based on diffusion models. We propose DMCIE (Diffusion Model with Concatenation of Inputs and Errors), a framework for accurate brain tumor segmentation in multi-modal MRI scans. We employ a 3D U-Net to generate an initial segmentation mask, from which an error map is generated by identifying the differences between the prediction and the ground truth. The error map, concatenated with the original MRI images, is used to guide a diffusion model. Using multimodal MRI inputs (T1, T1ce, T2, FLAIR), DMCIE effectively enhances segmentation accuracy by focusing on misclassified regions, guided by the original inputs. Evaluated on the BraTS2020 dataset, DMCIE outperforms several state-of-the-art diffusion-based segmentation methods, achieving a Dice Score of 93.46 and an HD95 of 5.94 mm. These results highlight the effectiveness of error-guided diffusion in producing precise and reliable brain tumor segmentations.
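
The conditioning tensor the DMCIE name describes can be sketched directly: the four modalities are concatenated with an error map marking disagreement between the initial mask and the reference, steering the diffusion model toward misclassified regions. The shapes and the binary-disagreement error definition are assumptions.

```python
import torch

def build_condition(t1, t1ce, t2, flair, initial_mask, reference_mask):
    # all inputs: (B, 1, D, H, W); masks are binary
    error_map = (initial_mask != reference_mask).float()        # 1 where the first stage was wrong
    return torch.cat([t1, t1ce, t2, flair, error_map], dim=1)   # (B, 5, D, H, W) guidance input

shape = (1, 1, 8, 32, 32)
mods = [torch.randn(shape) for _ in range(4)]
pred = torch.randint(0, 2, shape).float()
gt = torch.randint(0, 2, shape).float()
print(build_condition(*mods, pred, gt).shape)  # torch.Size([1, 5, 8, 32, 32])
```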

Last updated: 2025-08-22
Built with Hugo, theme modified from Stack