Daily arXiv Papers - 2025-07-21

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models

Atharva Bhargude, Ishan Gonehal, Chandler Haney, Dave Yoon, Kevin Zhu, Aaron Sandoval, Sean O’Brien, Kaustubh Vinnakota

Main category: cs.CL

TL;DR: The paper introduces ALP, a method using LLMs like GPT-4o and Gemini 1.5 Pro for phishing detection, achieving high accuracy (F1-score 0.93) by analyzing linguistic patterns, urgency cues, and manipulative diction.

Motivation: Phishing attacks are a major cybersecurity threat, requiring advanced detection techniques.

Method: ALP guides LLMs to analyze phishing content through structured semantic reasoning, integrating textual, visual, and URL-based analysis.

Result: ALP enhances detection accuracy, achieving an F1-score of 0.93, outperforming traditional methods.

Conclusion: ALP-integrated multimodal LLMs offer a robust, interpretable, and adaptive solution for phishing detection.

Abstract: Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93, surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.
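
The abstract frames ALP as a few-shot prompt that walks the model through specific deception cues. A minimal sketch of how such a prompt might be assembled is below; the cue categories follow the abstract's description, but the template wording, parameters, and example format are assumptions, not the paper's actual prompt.

```python
def alp_prompt(page_text: str, url: str, examples: list[str] | None = None,
               n_shots: int = 2) -> str:
    """Assemble a hypothetical few-shot ALP-style phishing-analysis prompt."""
    shots = "\n\n".join(examples[:n_shots]) if examples else ""
    return (
        f"{shots}\n\n"
        "Analyze the webpage below for phishing. Reason step by step over:\n"
        "1. Linguistic patterns of deception\n"
        "2. Urgency cues (deadlines, threats, time pressure)\n"
        "3. Manipulative diction (authority claims, reward bait)\n"
        f"URL: {url}\n"
        f"Page text: {page_text}\n"
        "Verdict (phishing / legitimate), with justification:"
    )
```

In the paper's setup, this text would be sent to a multimodal model such as GPT-4o alongside a screenshot of the page.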

[2] Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition

Keito Inoshita, Rushia Harada

Main category: cs.CL

TL;DR: PersonaGen is a framework for generating emotionally rich text using LLMs with multi-stage persona-based conditioning, addressing the scarcity of diverse emotional datasets.

Motivation: High-quality, diverse emotional datasets are scarce because emotional expression is subjective and data collection faces ethical and practical constraints.

Method: PersonaGen uses layered virtual personas (demographics, socio-cultural backgrounds, situational contexts) to guide emotion expression generation.

Result: Outperforms baselines in diversity, coherence, and discriminative emotion expression, validated through clustering, quality scoring, and downstream tasks.

Conclusion: PersonaGen is a robust alternative for augmenting or replacing real-world emotional datasets.

Abstract: In the field of emotion recognition, the development of high-performance models remains a challenge due to the scarcity of high-quality, diverse emotional datasets. Emotional expressions are inherently subjective, shaped by individual personality traits, socio-cultural backgrounds, and contextual factors, making large-scale, generalizable data collection both ethically and practically difficult. To address this issue, we introduce PersonaGen, a novel framework for generating emotionally rich text using a Large Language Model (LLM) through multi-stage persona-based conditioning. PersonaGen constructs layered virtual personas by combining demographic attributes, socio-cultural backgrounds, and detailed situational contexts, which are then used to guide emotion expression generation. We conduct comprehensive evaluations of the generated synthetic data, assessing semantic diversity through clustering and distributional metrics, human-likeness via LLM-based quality scoring, realism through comparison with real-world emotion corpora, and practical utility in downstream emotion classification tasks. Experimental results show that PersonaGen significantly outperforms baseline methods in generating diverse, coherent, and discriminative emotion expressions, demonstrating its potential as a robust alternative for augmenting or replacing real-world emotional datasets.

[3] SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation

Rafiq Kamel, Filippo Guerranti, Simon Geisler, Stephan Günnemann

Main category: cs.CL

TL;DR: SAFT introduces a structure-aware fine-tuning method for LLMs to handle graph-structured inputs like AMRs, improving text generation performance by 3.5 BLEU.

Motivation: Current methods for AMR-to-text generation discard structural cues or use incompatible architectures, limiting LLM performance.

Method: SAFT injects graph topology into LLMs using direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs, without altering the LLM architecture.

Result: SAFT achieves a 3.5 BLEU improvement on AMR 3.0, with gains scaling with graph complexity.

Conclusion: SAFT provides a general and effective way to integrate structured data with LLMs, enhancing their performance on tasks like AMR-to-text generation.

Abstract: Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.
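
The central object is the magnetic Laplacian, a Hermitian matrix whose complex phases encode edge direction. The numpy sketch below shows one standard way to derive direction-sensitive positional encodings from it; the charge parameter q, the number of eigenvectors, and the handling of the transformed AMR are illustrative, and the projection into the LLM embedding space is omitted.

```python
import numpy as np

def magnetic_laplacian_pe(A: np.ndarray, q: float = 0.25, k: int = 4) -> np.ndarray:
    """Direction-sensitive positional encodings for a directed graph.

    A: (n, n) float adjacency matrix (e.g., of a transformed AMR).
    q: charge parameter; q = 0 recovers the ordinary symmetrized Laplacian.
    """
    A_sym = (A + A.T) / 2.0                       # symmetrized adjacency
    theta = 2.0 * np.pi * q * (A - A.T)           # phases encode edge direction
    H = A_sym * np.exp(1j * theta)                # Hermitian adjacency
    L = np.diag(A_sym.sum(axis=1)) - H            # magnetic Laplacian
    _, eigvecs = np.linalg.eigh(L)                # eigh handles Hermitian input
    pe = eigvecs[:, :k]                           # k lowest-frequency modes
    return np.concatenate([pe.real, pe.imag], axis=1)  # (n, 2k) real features
```

Each node's 2k-dimensional vector would then be projected into the LLM's embedding space and added to the corresponding token embeddings.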

[4] Context-Based Fake News Detection using Graph Based Approach: A COVID-19 Use-case

Chandrashekar Muniyappa, Sirisha Velampalli

Main category: cs.CL

TL;DR: A graph-based approach using NLP and MDL-based GBAD algorithm to detect fake news by identifying anomalous patterns in contextual graphs.

Motivation: Address the rapid spread of fake news in the digital world by leveraging contextual data and graph structures.

Method: Transform news articles into contextual graphs using NLP, then apply MDL-based GBAD algorithm for anomaly detection.

Result: Identifies normative and anomalous patterns in news articles, enhancing fake news detection.

Conclusion: The proposed graph-based method effectively detects fake news by uncovering deviations from normative patterns.

Abstract: In today's digital world, fake news is spreading with immense speed, and it is a significant concern to address. In this work, we address that challenge using a novel graph-based approach. We took a dataset from Kaggle that contains real and fake news articles. To test our approach, we incorporated recent COVID-19-related news articles containing both genuine and fake news relevant to this problem, further enhancing the dataset rather than relying completely on the original dataset. We propose a contextual graph-based approach to detect fake news articles. Since news articles must be converted into an appropriate schema, we leverage Natural Language Processing (NLP) techniques to transform them into contextual graph structures. We then apply the Minimum Description Length (MDL)-based Graph-Based Anomaly Detection (GBAD) algorithm for graph mining. Graph-based methods are particularly effective for handling rich contextual data, as they enable the discovery of complex patterns that traditional query-based or statistical techniques might overlook. Our proposed approach identifies normative patterns within the dataset and subsequently uncovers anomalous patterns that deviate from these established norms.

[5] PARAM-1 BharatGen 2.9B Model

Kundeshwar Pundalik, Piyush Sawarkar, Nihar Sahoo, Abhishek Shinde, Prateek Chanda, Vedant Goswami, Ajay Nagpal, Atul Singh, Viraj Thakur, Vijay Dewane, Aamod Thakur, Bhargav Patel, Smita Gautam, Bhagwan Panditi, Shyam Pawar, Madhav Kotcha, Suraj Racha, Saral Sureka, Pankaj Singh, Rishi Bal, Rohit Saluja, Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: PARAM-1 is a 2.9B parameter LLM designed for Indian linguistic diversity, trained on Hindi and English with equitable representation, fair tokenization, and culturally aligned evaluation.

Motivation: Address the under-representation of linguistically diverse regions like India in LLMs, which are dominated by English-centric designs.

Method: Trained on a bilingual Hindi-English dataset with 25% Indic language allocation, adapted tokenizer, and culturally aligned benchmarks.

Result: PARAM-1 serves as a competent general-purpose model and robust baseline for India-centric applications.

Conclusion: PARAM-1 provides a design-first blueprint for equitable foundation modeling by embedding diversity at the pretraining level.

Abstract: Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level-rather than deferring it to post-hoc alignment-PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.
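
The tokenization-fairness principle comes down to training the tokenizer so Devanagari text is not fragmented into many more pieces per word than English. A hedged sketch of such a SentencePiece training call follows; the corpus file name, vocabulary size, and coverage are illustrative assumptions, since PARAM-1's actual configuration is not given in the abstract.

```python
import sentencepiece as spm

# Hypothetical settings for a shared Hindi-English vocabulary: high character
# coverage keeps Devanagari conjuncts intact instead of falling back to bytes.
spm.SentencePieceTrainer.train(
    input="hi_en_corpus.txt",       # assumed bilingual training corpus
    model_prefix="param1_tokenizer",
    vocab_size=64000,
    character_coverage=0.9999,
    model_type="unigram",
)
```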

[6] TopicImpact: Improving Customer Feedback Analysis with Opinion Units for Topic Modeling and Star-Rating Prediction

Emil Häglund, Johanna Björklund

Main category: cs.CL

TL;DR: The paper improves topic modeling by using opinion units (text excerpts with sentiment scores) for better coherence and interpretability, linking topics and sentiments to business metrics like star ratings.

Motivation: To enhance the extraction of insights from customer reviews by restructuring the topic modeling pipeline to focus on opinion units, improving topic coherence and sentiment capture.

Method: Restructures the topic modeling pipeline to operate on opinion units (extracted using large language models), correlates topics and sentiments with business metrics, and evaluates the system’s effectiveness.

Result: Improved topic modeling performance, coherent and interpretable topics, and accurate sentiment-linked insights impacting business outcomes.

Conclusion: The proposed system outperforms other solutions, offering better topic coherence and sentiment integration, and effectively predicts star ratings.

Abstract: We improve the extraction of insights from customer reviews by restructuring the topic modeling pipeline to operate on opinion units - distinct statements that include relevant text excerpts and associated sentiment scores. Prior work has demonstrated that such units can be reliably extracted using large language models. The result is a heightened performance of the subsequent topic modeling, leading to coherent and interpretable topics while also capturing the sentiment associated with each topic. By correlating the topics and sentiments with business metrics, such as star ratings, we can gain insights on how specific customer concerns impact business outcomes. We present our system’s implementation, use cases, and advantages over other topic modeling and classification solutions. We also evaluate its effectiveness in creating coherent topics and assess methods for integrating topic and sentiment modalities for accurate star-rating prediction.
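
Concretely, the pipeline topic-models opinion units rather than whole reviews and then links topic and sentiment features to star ratings. The sketch below uses generic stand-ins (TF-IDF, NMF, linear regression); the paper does not name these components, and the LLM-based opinion-unit extraction is assumed to happen upstream.

```python
import numpy as np
from dataclasses import dataclass
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LinearRegression

@dataclass
class OpinionUnit:
    excerpt: str      # text span extracted by an LLM (extraction not shown)
    sentiment: float  # sentiment score attached to the excerpt

def fit_star_predictor(units: list[OpinionUnit], stars: list[float], n_topics: int = 10):
    """Topic-model opinion units, then regress star ratings on topic + sentiment."""
    X = TfidfVectorizer(max_features=5000).fit_transform([u.excerpt for u in units])
    topics = NMF(n_components=n_topics, random_state=0).fit_transform(X)
    features = np.hstack([topics, [[u.sentiment] for u in units]])
    return LinearRegression().fit(features, stars)
```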

[7] Feature-based analysis of oral narratives from Afrikaans and isiXhosa children

Emma Sharratt, Annelien Smith, Retief Louw, Daleen Klop, Febe de Wet, Herman Kamper

Main category: cs.CL

TL;DR: The study uses machine learning to analyze oral narratives of Afrikaans- and isiXhosa-speaking children, identifying lexical diversity and utterance length as key indicators of typical development, while specific verbs and auxiliaries predict reduced intervention need.

Motivation: To identify features of oral narratives that predict literacy development and intervention needs in multilingual children.

Method: Simple machine learning analysis of recorded stories from four- and five-year-old children speaking Afrikaans and isiXhosa.

Result: Lexical diversity and utterance length indicate typical development; specific verbs and auxiliaries correlate with reduced intervention likelihood. Language-specific and shared predictors were found.

Conclusion: The study highlights language-specific and universal narrative features for early assessment in multilingual contexts.

Abstract: Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.
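
The two features the study finds most informative are cheap to compute from transcribed utterances. A minimal sketch (whitespace tokenization is a simplification; the study's exact feature definitions may differ):

```python
def narrative_features(utterances: list[str]) -> dict[str, float]:
    """Lexical diversity and mean utterance length from a child's narrative."""
    tokens = [w.lower() for u in utterances for w in u.split()]
    return {
        "unique_words": float(len(set(tokens))),                  # lexical diversity
        "mean_utterance_length": len(tokens) / max(1, len(utterances)),
    }
```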

[8] Mitigating Stylistic Biases of Machine Translation Systems via Monolingual Corpora Only

Xuanqi Gao, Weipeng Jiang, Juan Zhai, Shiqing Ma, Siyi Xie, Xinyang Yin, Chao Shen

Main category: cs.CL

TL;DR: Babel is a novel framework for enhancing stylistic fidelity in NMT using monolingual corpora, achieving high precision and improved style preservation without parallel data.

Motivation: Preserving stylistic nuances in NMT is challenging, especially without parallel corpora. Babel addresses this gap.

Method: Babel uses a style detector and diffusion-based style applicator to refine translations post-processing, integrating with existing NMT systems.

Result: Babel achieves 88.21% precision in detecting inconsistencies, improves style preservation by 150%, and maintains semantic similarity (0.92 score).

Conclusion: Babel effectively preserves style in translations without parallel data, validated by human evaluation.

Abstract: The advent of neural machine translation (NMT) has revolutionized cross-lingual communication, yet preserving stylistic nuances remains a significant challenge. While existing approaches often require parallel corpora for style preservation, we introduce Babel, a novel framework that enhances stylistic fidelity in NMT using only monolingual corpora. Babel employs two key components: (1) a style detector based on contextual embeddings that identifies stylistic disparities between source and target texts, and (2) a diffusion-based style applicator that rectifies stylistic inconsistencies while maintaining semantic integrity. Our framework integrates with existing NMT systems as a post-processing module, enabling style-aware translation without requiring architectural modifications or parallel stylistic data. Extensive experiments on five diverse domains (law, literature, scientific writing, medicine, and educational content) demonstrate Babel’s effectiveness: it identifies stylistic inconsistencies with 88.21% precision and improves stylistic preservation by 150% while maintaining a high semantic similarity score of 0.92. Human evaluation confirms that translations refined by Babel better preserve source text style while maintaining fluency and adequacy.
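
The detector's job is to flag translations whose contextual embedding diverges stylistically from the source. As a stand-in sketch, a general-purpose sentence encoder and a fixed cosine threshold are used below; Babel's actual style encoder, threshold, and diffusion-based applicator are not reproduced here.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a style encoder

def style_mismatch(source: str, translation: str, threshold: float = 0.7) -> bool:
    """Flag a stylistic inconsistency when embeddings diverge (illustrative)."""
    a, b = encoder.encode([source, translation])
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine < threshold  # True would trigger the style applicator
```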

[9] A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Kirill Borodin, Nikita Vasiliev, Vasiliy Kudryavtsev, Maxim Maslov, Mikhail Gorodnichev, Oleg Rogov, Grach Mkrtchian

Main category: cs.CL

TL;DR: Balalaika, a new Russian speech dataset with 2,000+ hours of annotated studio-quality speech, improves synthesis and enhancement tasks.

Motivation: Addressing challenges in Russian speech synthesis like vowel reduction, consonant devoicing, and unnatural intonation.

Method: Introduces Balalaika dataset with detailed annotations (punctuation, stress markings) and describes its construction pipeline.

Result: Models trained on Balalaika outperform those using existing datasets in synthesis and enhancement.

Conclusion: Balalaika provides a high-quality resource for advancing Russian speech technology.

Abstract: Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.

[10] Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies

Carlos Mena, Pol Serra, Jacobo Romero, Abir Messaoudi, Jose Giraldo, Carme Armentano-Oller, Rodolfo Zevallos, Ivan Meza, Javier Hernando

Main category: cs.CL

TL;DR: Improving ASR for Catalan-Spanish code-switching using synthetic data, monolingual audio concatenation, and real CS data with language tokens.

Motivation: Code-switching (CS) challenges ASR due to scarce training data and linguistic similarities, especially in multilingual societies like those using Catalan-Spanish CS.

Method: Three strategies: (1) synthetic CS data generation, (2) monolingual audio concatenation, (3) real CS data with language tokens. Fine-tuned OpenAI’s Whisper models.

Result: Combining synthetic CS data with the dominant language token yields the best transcription performance.

Conclusion: Effective ASR for CS can be achieved by blending synthetic and real data, with language tokens enhancing performance.

Abstract: Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. The lack of dedicated CS datasets limits ASR performance, as most models rely on monolingual or mixed-language corpora that fail to reflect real-world CS patterns. This issue is critical in multilingual societies where CS occurs in informal and formal settings. A key example is Catalan-Spanish CS, widely used in media and parliamentary speeches. In this work, we improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. We extract CS data from Catalan speech corpora and fine-tune OpenAI’s Whisper models, making them available on Hugging Face. Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
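
Strategy (2), concatenating monolingual audio, is simple to sketch: join a Catalan and a Spanish clip and their transcripts into one pseudo code-switched training example. The shared sample rate and the Whisper-style dominant-language token below are assumptions.

```python
import numpy as np

def make_cs_example(wav_ca: np.ndarray, txt_ca: str,
                    wav_es: np.ndarray, txt_es: str,
                    lang_token: str = "<|ca|>"):
    """Pseudo code-switched example from two monolingual clips (same sample rate)."""
    audio = np.concatenate([wav_ca, wav_es])
    transcript = f"{lang_token} {txt_ca} {txt_es}"  # dominant-language token first
    return audio, transcript
```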

[11] Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O’Brien

Main category: cs.CL

TL;DR: The paper explores using sparse autoencoder (SAE) features to control the target language of multilingual LLMs in zero-shot settings, achieving up to 90% success by modifying a single feature.

Motivation: Controlling the output language of multilingual LLMs without explicit prompts or fine-tuning is challenging. The study investigates whether SAE features can steer language generation.

Method: Pretrained SAEs on Gemma-2B and Gemma-9B residual streams identify features with significant activation differences between English and four target languages. A single SAE feature is modified to steer language.

Result: Language shifts are achieved with 90% success while preserving semantic fidelity. Steering is most effective in mid-to-late transformer layers and linked to specific attention heads.

Conclusion: Sparse feature steering is a lightweight, interpretable method for controllable multilingual generation.

Abstract: Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
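
Mechanically, steering amounts to adding one SAE decoder direction to the residual stream at a chosen layer during the forward pass. A hypothetical PyTorch sketch follows; the model and SAE interfaces, layer indexing, and the scale alpha are assumptions rather than the paper's exact intervention.

```python
import torch

def steer_language(model, sae, layer: int, feat_idx: int, alpha: float = 8.0):
    """Add a single SAE feature's decoder direction to one layer's output.

    Assumes `model.layers[layer]` outputs the residual-stream tensor of shape
    (batch, seq, d_model) and `sae.decoder.weight` has shape (d_model, n_feats).
    """
    direction = sae.decoder.weight[:, feat_idx].detach()

    def hook(module, inputs, output):
        return output + alpha * direction  # shift activations toward the feature

    handle = model.layers[layer].register_forward_hook(hook)
    return handle  # call handle.remove() to disable steering
```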

[12] Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic

Lilit Grigoryan, Nikolay Karpov, Enas Albasiri, Vitaly Lavrukhin, Boris Ginsburg

Main category: cs.CL

TL;DR: The paper introduces a universal methodology for Arabic speech and text processing, training two FastConformer-based models for Modern Standard Arabic (MSA) and a unified MSA-Classical Arabic (CA) model, achieving SOTA results.

Motivation: Addressing the lack of attention to Arabic language variations and limited public ASR models, despite Arabic's widespread use.

Method: Developed a universal methodology for Arabic processing, trained two FastConformer models: one for MSA and another unified for MSA and CA.

Result: The MSA model set a new SOTA benchmark, while the unified model achieved SOTA accuracy for CA with diacritics and strong MSA performance.

Conclusion: The models and training recipes are open-sourced to promote reproducibility and advance Arabic ASR research.

Abstract: Despite Arabic being one of the most widely spoken languages, the development of Arabic Automatic Speech Recognition (ASR) systems faces significant challenges due to the language’s complexity, and only a limited number of public Arabic ASR models exist. While much of the focus has been on Modern Standard Arabic (MSA), there is considerably less attention given to the variations within the language. This paper introduces a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. Using this methodology, we train two novel models based on the FastConformer architecture: one designed specifically for MSA and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA. To promote reproducibility, we open-source the models and their training recipes.

[13] Aligning Knowledge Graphs and Language Models for Factual Accuracy

Nur A Zarin Nishat, Andrea Coletta, Luigi Bellomarini, Kossi Amouzouvi, Jens Lehmann, Sahar Vahdati

Main category: cs.CL

TL;DR: ALIGNed-LLM integrates Knowledge Graphs (KGs) into language models to reduce hallucination by aligning entity and text embeddings, improving factuality in tasks like question answering.

Motivation: Large language models (LLMs) suffer from hallucination despite their NLP advancements. KGs offer structured, reliable data to address this.

Method: ALIGNed-LLM aligns KG embeddings (e.g., TransE) with text embeddings via a trainable projection layer, enhancing entity distinction and factual grounding.

Result: Tested on QA benchmarks and a financial use case, ALIGNed-LLM significantly improved LLM factuality and accuracy.

Conclusion: ALIGNed-LLM effectively reduces hallucination in LLMs by leveraging KGs, demonstrating practical utility in high-stakes domains like finance.

Abstract: Large language models like GPT-4, Gemini, and Claude have transformed natural language processing (NLP) tasks such as question answering, dialogue generation, summarization, and so forth; yet their susceptibility to hallucination stands as one of the major challenges. Among numerous approaches to overcome this challenge, integration of Knowledge Graphs (KGs) into language models has emerged as a promising solution, as it provides structured, reliable, domain-specific, and up-to-date external information to the language models. In this paper, we introduce ALIGNed-LLM, a simple yet effective approach to improve language models’ factuality via a lean strategy to infuse KGs into the latent space of language models, inspired by LLaVA, where visual and textual information is infused. We use embeddings from a pre-trained Knowledge Graph Embedding (KGE) model, such as TransE, and a trainable projection layer to align entity and text embeddings. This alignment enables the language model to distinguish between similar entities, improving factual grounding and reducing hallucination. We tested our approach on three popular question-answering benchmark datasets alongside language models of varying sizes, showing significant improvement. Furthermore, we applied our approach to a real-world financial use case from a large central bank in Europe, which demands high accuracy and precision, demonstrating a substantial improvement in the LLM answers.
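
The core mechanism is a small trainable projector mapping frozen KGE vectors into the LLM's token-embedding space, analogous to LLaVA's vision projector. A sketch with illustrative dimensions (TransE at 200 dimensions, a 4096-dimensional LLM) follows.

```python
import torch.nn as nn

class EntityProjector(nn.Module):
    """Project pre-trained KG embeddings (e.g., TransE) into LLM embedding space."""

    def __init__(self, kge_dim: int = 200, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(kge_dim, llm_dim)  # the only trainable piece

    def forward(self, entity_emb):
        # entity_emb: (batch, kge_dim) frozen KGE vectors for entities in the prompt
        return self.proj(entity_emb)  # (batch, llm_dim), fed alongside token embeddings
```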

[14] Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers

Liang Lin, Zhihao Xu, Xuehai Tang, Shi Liu, Biyu Zhou, Fuqing Zhu, Jizhong Han, Songlin Hu

Main category: cs.CL

TL;DR: The paper introduces a novel jailbreaking method, Paper Summary Attack (PSA), exploiting LLMs’ trust in authoritative sources to achieve high attack success rates on well-aligned models.

Motivation: To investigate vulnerabilities in LLMs due to their propensity to trust authoritative sources like academic papers.

Method: Proposes PSA, which synthesizes content from safety papers to craft adversarial prompts with harmful queries.

Result: PSA achieves 97-98% attack success rates on models like Claude3.5-Sonnet and Deepseek-R1, revealing model-specific vulnerabilities.

Conclusion: The findings highlight significant LLM vulnerabilities and suggest future research directions for adversarial methods and safety alignment.

Abstract: The safety of large language models (LLMs) has garnered significant research attention. In this paper, we argue that previous empirical studies demonstrate LLMs exhibit a propensity to trust information from authoritative sources, such as academic papers, implying new possible vulnerabilities. To verify this possibility, a preliminary analysis is designed to illustrate our two findings. Based on this insight, a novel jailbreaking method, Paper Summary Attack (PSA), is proposed. It systematically synthesizes content from either attack-focused or defense-focused LLM safety papers to construct an adversarial prompt template, while strategically infilling harmful queries as adversarial payloads within predefined subsections. Extensive experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning models like Deepseek-R1. PSA achieves a 97% attack success rate (ASR) on well-aligned models like Claude3.5-Sonnet and an even higher 98% ASR on Deepseek-R1. More intriguingly, our work has further revealed diametrically opposed vulnerability bias across different base models, and even between different versions of the same model, when exposed to either attack-focused or defense-focused papers. This phenomenon potentially indicates future research clues for both adversarial methodologies and safety alignment. Code is available at https://github.com/233liang/Paper-Summary-Attack

[15] Revisiting LLM Value Probing Strategies: Are They Robust and Expressive?

Siqi Shen, Mehar Singh, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, Rada Mihalcea

Main category: cs.CL

TL;DR: The paper evaluates the robustness and expressiveness of value representations in LLMs, highlighting vulnerabilities in probing methods and weak alignment with real-world actions.

Motivation: To address gaps in understanding LLM value orientations, including the lack of systematic comparison of probing methods and unclear alignment with real-world behaviors.

Method: Evaluated three probing strategies using prompt and option variations, and introduced tasks to assess demographic context responsiveness and alignment with value-based actions.

Result: Found large variances under input perturbations, minimal effect of demographic context on free-text generation, and weak correlation between probed values and real-world actions.

Conclusion: Emphasizes the need for more careful examination of LLM value probing and awareness of its limitations.

Abstract: There has been extensive research on assessing the value orientation of Large Language Models (LLMs) as it can shape user experiences across demographic groups. However, several challenges remain. First, while the Multiple Choice Question (MCQ) setting has been shown to be vulnerable to perturbations, there is no systematic comparison of probing methods for value probing. Second, it is unclear to what extent the probed values capture in-context information and reflect models’ preferences for real-world actions. In this paper, we evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We use variations in prompts and options, showing that all methods exhibit large variances under input perturbations. We also introduce two tasks studying whether the values are responsive to demographic context, and how well they align with the models’ behaviors in value-related scenarios. We show that the demographic context has little effect on the free-text generation, and the models’ values only weakly correlate with their preference for value-based actions. Our work highlights the need for a more careful examination of LLM value probing and awareness of its limitations.

[16] Automatically assessing oral narratives of Afrikaans and isiXhosa children

Retief Louw, Emma Sharratt, Febe de Wet, Christiaan Jacobs, Annelien Smith, Herman Kamper

Main category: cs.CL

TL;DR: A system for automatically assessing preschool children’s oral narratives in Afrikaans and isiXhosa using speech recognition and machine learning, with LLMs outperforming simpler models and matching human expert accuracy in identifying children needing intervention.

Motivation: Early childhood narrative and comprehension skills are crucial for literacy, but teachers in large preschool classrooms struggle to identify students needing intervention.

Method: Uses automatic speech recognition and machine learning (linear model vs. LLM) to score oral narratives and predict comprehension scores.

Result: LLM-based system outperforms the linear model and is comparable to human experts in flagging children for intervention.

Conclusion: The system provides a foundation for automatic oral assessments, freeing teachers to focus on personalized learning support.

Abstract: Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children’s learning.

[17] Encoding syntactic objects and Merge operations in function spaces

Matilde Marcolli, Robert C. Berwick

Main category: cs.CL

TL;DR: The paper demonstrates a mathematical framework for representing syntactic objects in a function space, compatible with neurocomputational models, using wavelets and operad algebra.

Motivation: To theoretically justify the neurocomputational feasibility of syntactic structures by embedding them in a function space with specific algebraic properties.

Method: Constructs a commutative non-associative semiring using Renyi entropy, models syntactic operations as operad algebra circuits, and implements Merge via Hopf algebra Markov chains.

Result: A faithful representation of syntactic objects in function space, with Merge expressible as cross-frequency phase synchronization or a semiring successor function.

Conclusion: The work provides a theoretical basis for neurocomputational syntax realization, linking Merge to algebraic and phase-synchronization models.

Abstract: We provide a mathematical argument showing that, given a representation of lexical items as functions (wavelets, for instance) in some function space, it is possible to construct a faithful representation of arbitrary syntactic objects in the same function space. This space can be endowed with a commutative non-associative semiring structure built using the second Renyi entropy. The resulting representation of syntactic objects is compatible with the magma structure. The resulting set of functions is an algebra over an operad, where the operations in the operad model circuits that transform the input wave forms into a combined output that encodes the syntactic structure. The action of Merge on workspaces is faithfully implemented as action on these circuits, through a coproduct and a Hopf algebra Markov chain. The results obtained here provide a constructive argument showing the theoretical possibility of a neurocomputational realization of the core computational structure of syntax. We also present a particular case of this general construction where this type of realization of Merge is implemented as a cross frequency phase synchronization on sinusoidal waves. This also shows that Merge can be expressed in terms of the successor function of a semiring, thus clarifying the well known observation of its similarities with the successor function of arithmetic.

[18] A Computational Approach to Modeling Conversational Systems: Analyzing Large-Scale Quasi-Patterned Dialogue Flows

Mohamed Achref Ben Ammar, Mohamed Taha Bennani

Main category: cs.CL

TL;DR: A novel framework for analyzing conversational dynamics using graph simplification (Filter & Reconnect) improves semantic metrics and structural clarity in dialogue modeling.

Motivation: To address the challenge of analyzing loosely organized dialogues in large language model-based systems, ensuring semantic coherence and structural integrity.

Method: Proposes the Filter & Reconnect technique for simplifying conversational graphs, combined with large language models.

Result: Achieves a 2.06x improvement in semantic metric S and enforces tree-like structure with 0 δ-hyperbolicity.

Conclusion: The framework enhances analysis of large-scale dialogue datasets, benefiting applications like chatbots and user behavior analytics.

Abstract: The analysis of conversational dynamics has gained increasing importance with the rise of large language model-based systems, which interact with users across diverse contexts. In this work, we propose a novel computational framework for constructing conversational graphs that capture the flow and structure of loosely organized dialogues, referred to as quasi-patterned conversations. We introduce the Filter & Reconnect method, a novel graph simplification technique that minimizes noise while preserving semantic coherence and structural integrity of conversational graphs. Through comparative analysis, we demonstrate that the use of large language models combined with our graph simplification technique has resulted in semantic metric S increasing by a factor of 2.06 compared to previous approaches while simultaneously enforcing a tree-like structure with 0 δ-hyperbolicity, ensuring optimal clarity in conversation modeling. This work provides a computational method for analyzing large-scale dialogue datasets, with practical applications related to monitoring automated systems such as chatbots, dialogue management tools, and user behavior analytics.
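
The abstract does not spell the algorithm out, but its name suggests two stages: prune weak edges, then restore connectivity with the strongest remaining links so the graph trends toward a tree. A speculative networkx sketch under those assumptions (connected, weighted, undirected input):

```python
import networkx as nx

def filter_and_reconnect(G: nx.Graph, keep_ratio: float = 0.3) -> nx.Graph:
    """Speculative Filter & Reconnect-style simplification (not the paper's exact method)."""
    edges = sorted(G.edges(data="weight"), key=lambda e: e[2], reverse=True)
    H = nx.Graph()
    H.add_nodes_from(G.nodes)
    H.add_weighted_edges_from(edges[: max(1, int(keep_ratio * len(edges)))])  # filter
    comps = list(nx.connected_components(H))
    while len(comps) > 1:  # reconnect via the strongest original crossing edge
        crossing = (e for e in edges
                    if any((e[0] in c) != (e[1] in c) for c in comps))
        u, v, w = max(crossing, key=lambda e: e[2])
        H.add_edge(u, v, weight=w)
        comps = list(nx.connected_components(H))
    return H
```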

[19] Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder

Feng Chen, Weizhe Xu, Changye Li, Serguei Pakhomov, Alex Cohen, Simran Bhola, Sandy Yin, Sunny X Tang, Michael Mackinley, Lena Palaniyappan, Dror Ben-Zeev, Trevor Cohen

Main category: cs.CL

TL;DR: The study evaluates integrating pause features from ASR with semantic coherence metrics to assess FTD severity, showing improved predictive performance over semantic-only models.

Motivation: FTD assessment is resource-intensive; automated speech analysis offers scalable alternatives.

Method: Pause features and semantic coherence metrics were integrated across three datasets, using SVR to predict clinical FTD scores.

Result: Pause features alone robustly predict FTD severity; integration with semantic metrics enhances performance.

Conclusion: Combining temporal and semantic analyses improves FTD assessment, advancing automated speech analysis in psychosis.

Abstract: Formal thought disorder (FTD), a hallmark of schizophrenia spectrum disorders, manifests as incoherent speech and poses challenges for clinical assessment. Traditional clinical rating scales, though validated, are resource-intensive and lack scalability. Automated speech analysis with automatic speech recognition (ASR) allows for objective quantification of linguistic and temporal features of speech, offering scalable alternatives. The use of utterance timestamps in ASR captures pause dynamics, which are thought to reflect the cognitive processes underlying speech production. However, the utility of integrating these ASR-derived features for assessing FTD severity requires further evaluation. This study integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH, n = 140), structured picture descriptions (TOPSY, n = 72), and dream narratives (PsyCL, n = 43). We evaluated pause-related features alongside established coherence measures, using support vector regression (SVR) to predict clinical FTD scores. Key findings demonstrate that pause features alone robustly predict the severity of FTD. Integrating pause features with semantic coherence metrics enhanced predictive performance compared to semantic-only models, with integration of independent models achieving correlations up to ρ = 0.649 and AUC = 83.71% for severe case detection (TOPSY, with best ρ = 0.584 and AUC = 79.23% for semantic-only models). The performance gains from semantic and pause feature integration held consistently across all contexts, though the nature of pause patterns was dataset-dependent. These findings suggest that frameworks combining temporal and semantic analyses provide a roadmap for refining the assessment of disorganized speech and advance automated speech analysis in psychosis.
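
Pause features fall straight out of ASR word timestamps, and the regression stage is standard SVR. A minimal sketch; the 250 ms pause threshold and the three summary statistics are illustrative choices, not necessarily the paper's.

```python
import numpy as np
from sklearn.svm import SVR

def pause_features(word_times: list[tuple[float, float]]) -> list[float]:
    """Summary pause statistics from (start, end) word timestamps in seconds."""
    gaps = [b[0] - a[1] for a, b in zip(word_times, word_times[1:])]
    pauses = [g for g in gaps if g > 0.25]  # 250 ms threshold (illustrative)
    return [len(pauses),
            float(np.mean(pauses)) if pauses else 0.0,
            float(np.sum(pauses))]

# X stacks pause features (optionally with coherence scores); y holds FTD ratings.
# model = SVR(kernel="rbf").fit(X, y)
```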

[20] Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models

Sergio E. Zanotto, Segun Aroyehun

Main category: cs.CL

TL;DR: The study analyzes linguistic features of human-written and machine-generated texts across domains and models, finding simpler syntax and richer semantics in human texts, with newer LLMs showing homogenized outputs.

Motivation: To characterize and differentiate human-written and machine-generated texts using linguistic features, addressing gaps in existing research focused solely on classification.

Method: Analyzed texts from 8 domains and 11 LLMs, calculating features like dependency length and emotionality, and using statistical analysis and style embeddings.

Result: Human texts had simpler syntax and more semantic diversity. Newer LLMs produced homogenized outputs, while humans showed greater stylistic variation.

Conclusion: Human-written texts are linguistically distinct, with newer LLMs converging in style, highlighting trends in machine-generated text evolution.

Abstract: The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality and use them to characterize human-written and machine-generated texts along with different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.

[21] Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters

Shanbo Cheng, Yu Bao, Qian Cao, Luyang Huang, Liyan Kang, Zhicheng Liu, Yu Lu, Wenhao Zhu, Zhichao Huang, Tao Li, Sitong Liu, Ningxin Peng, Shuaijie She, Lu Xu, Nuo Xu, Sen Yang, Runsheng Yu, Yiming Yu, Liehao Zou, Hang Li, Lu Lu, Yuxuan Wang, Yonghui Wu

Main category: cs.CL

TL;DR: Seed-X, a 7B-parameter open-source LLM family, excels in multilingual translation, matching top closed-source models like GPT-4o and outperforming larger open-source models.

Motivation: Addressing challenges in multilingual translation by leveraging diverse data and advanced techniques like CoT reasoning and RL.

Method: Pre-training on 28-language data, finetuning with CoT reasoning, and enhancing via reinforcement learning.

Result: Comparable to Gemini-2.5 and GPT-4o, outperforms larger open-source models in metrics and human evaluations.

Conclusion: Seed-X advances translation research by sharing optimized practices and public parameters.

Abstract: Multilingual translation is a challenging task for large language models (LLMs), which must handle intricate language patterns and the stilted translations that arise in automated systems. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with a 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate via Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices from our optimization process and make the parameters publicly available for advancing translation research and applications.

[22] CU-ICU: Customizing Unsupervised Instruction-Finetuned Language Models for ICU Datasets via Text-to-Text Transfer Transformer

Teerapong Panboonyuen

Main category: cs.CL

TL;DR: CU-ICU customizes unsupervised language models for ICU datasets using T5, improving accuracy and interpretability with minimal supervision.

Motivation: Challenges in adapting large language models to healthcare due to domain adaptation and limited labeled data.

Method: Sparse fine-tuning with few-shot prompting and selective parameter updates on T5 architecture.

Result: 15% increase in sepsis detection accuracy, 20% better clinical explanations, updating <1% parameters.

Conclusion: CU-ICU is scalable and efficient for accurate, interpretable clinical decision support in ICUs.

Abstract: Integrating large language models into specialized domains like healthcare presents unique challenges, including domain adaptation and limited labeled data. We introduce CU-ICU, a method for customizing unsupervised instruction-finetuned language models for ICU datasets by leveraging the Text-to-Text Transfer Transformer (T5) architecture. CU-ICU employs a sparse fine-tuning approach that combines few-shot prompting with selective parameter updates, enabling efficient adaptation with minimal supervision. Our evaluation across critical ICU tasks (early sepsis detection, mortality prediction, and clinical note generation) demonstrates that CU-ICU consistently improves predictive accuracy and interpretability over standard fine-tuning methods. Notably, CU-ICU achieves up to a 15% increase in sepsis detection accuracy and a 20% enhancement in generating clinically relevant explanations while updating fewer than 1% of model parameters in its most efficient configuration. These results establish CU-ICU as a scalable, low-overhead solution for delivering accurate and interpretable clinical decision support in real-world ICU environments.
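
The "fewer than 1% of model parameters" figure implies freezing nearly all of T5 and updating only a small slice. Which slice CU-ICU updates is not stated in the abstract; the sketch below unfreezes the layer norms, one common choice in sparse fine-tuning.

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

for p in model.parameters():          # freeze everything first
    p.requires_grad = False

trainable = 0
for name, p in model.named_parameters():
    if "layer_norm" in name:          # assumed update set; CU-ICU's is unspecified
        p.requires_grad = True
        trainable += p.numel()

print(f"trainable fraction: {trainable / model.num_parameters():.4%}")
```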

[23] KiC: Keyword-inspired Cascade for Cost-Efficient Text Generation with LLMs

Woo-Chan Kim, Ji-Hoon Park, Seong-Whan Lee

Main category: cs.CL

TL;DR: KiC is a cost-efficient framework for free-form text generation, using a weaker model first and escalating to a stronger one based on semantic alignment, achieving near-GPT-4 accuracy with reduced costs.

Motivation: High-performing LLMs are costly via APIs, and existing cascade methods fail to reliably assess free-form outputs due to reliance on exact text matching.

Method: KiC identifies the most representative answer from a weaker model and evaluates semantic alignment to decide if escalation to a stronger model is needed.

Result: KiC achieves 97.53% of GPT-4’s accuracy, reduces API costs by 28.81%, and outperforms GPT-4 in one benchmark.

Conclusion: KiC offers a cost-effective solution for free-form text generation by leveraging semantic alignment, balancing performance and cost.

Abstract: Large language models (LLMs) have demonstrated state-of-the-art performance across a wide range of natural language processing tasks. However, high-performing models are typically accessible only via APIs, incurring substantial inference costs. Cascade methods address this by initially employing a cheaper model and escalating to a stronger one only when necessary. Nevertheless, existing cascade approaches struggle to select a reliable representative response and assess the overall reliability of free-form outputs, as they rely on exact text matching. To overcome these limitations, we propose Keyword-inspired Cascade (KiC), a novel framework for cost-efficient free-form text generation. KiC identifies the most representative answer among multiple outputs from a weaker model and evaluates the semantic alignment of other responses with it. Based on the degree of alignment, KiC determines whether to accept the weaker model’s output or escalate to a stronger model. Experiments on three free-form text generation benchmarks show that KiC achieves 97.53 percent of GPT-4’s accuracy while reducing API costs by 28.81 percent on average, and even outperforms GPT-4 in a specific benchmark.
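
The cascade logic can be sketched compactly: sample several weak-model answers, pick the one most similar to the rest as representative, and escalate only when agreement is low. The `weak_generate`, `strong_generate`, and `embed` callables and the threshold below are hypothetical placeholders, not the paper's components.

```python
import numpy as np

def kic_cascade(prompt, weak_generate, strong_generate, embed,
                n: int = 5, tau: float = 0.8):
    """Keyword-inspired Cascade sketch: accept the weak model only on consensus."""
    answers = [weak_generate(prompt) for _ in range(n)]
    vecs = np.stack([embed(a) for a in answers])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T                            # pairwise cosine similarities
    rep = int(np.argmax(sims.sum(axis=1)))          # most representative answer
    alignment = (sims[rep].sum() - 1.0) / (n - 1)   # mean similarity to the others
    return answers[rep] if alignment >= tau else strong_generate(prompt)
```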

[24] LoopServe: An Adaptive Dual-phase LLM Inference Acceleration System for Multi-Turn Dialogues

Haoyang Li, Zhanchao Xu, Yiming Li, Xuejia Chen, Darian Li, Anxin Tian, Qingfa Xiao, Cheng Deng, Jun Wang, Qing Li, Lei Chen, Mingxuan Yuan

Main category: cs.CL

TL;DR: LoopServe is an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues, improving efficiency and responsiveness by dynamically selecting important attention parts and compressing key value caches.

Motivation: Existing methods struggle with computational and memory challenges in long multi-turn dialogues due to fixed heuristics, prompting the need for adaptive solutions.

Method: LoopServe introduces online sparsification during prefilling and progressive key value compression during decoding, adapting to dynamic conversation patterns.

Result: LoopServe outperforms baselines, significantly accelerating LLM inference in long-context dialogue tasks.

Conclusion: LoopServe provides an effective and adaptive solution for accelerating large language models in multi-turn dialogues, supported by a new benchmark.

Abstract: Multi-turn dialogues are essential in many real-world applications of large language models, such as chatbots and virtual assistants. As conversation histories become longer, existing large language models face increasing computational and memory challenges, which hinder their ability to provide efficient and responsive interactions. Most current acceleration methods either compress the context or optimize key value caching, but they often rely on fixed or position-based heuristics that do not adapt well to the dynamic and unpredictable patterns found in actual multi-turn conversations. In this paper, we present LoopServe, an adaptive dual-phase inference acceleration framework for large language models in multi-turn dialogues. LoopServe introduces two main innovations. First, it performs online sparsification during the prefilling phase by dynamically selecting the most important parts of the attention matrix for each new input. Second, it uses progressive key value compression during decoding by adaptively maintaining a relevant and efficient cache based on the most recently generated output tokens. We also propose a new benchmark (https://huggingface.co/datasets/TreeAILab/Multi-turn_Long-context_Benchmark_for_LLMs) with eleven multi-turn datasets that reflect realistic query positions and conversational dependencies. Extensive experiments demonstrate that LoopServe consistently achieves superior effectiveness compared to existing baselines and significantly accelerates LLM inference across a wide range of long-context dialogue tasks.
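
The decoding-phase idea, progressive key-value compression, can be approximated by keeping only the cache positions that recently generated tokens attend to most. A toy sketch; LoopServe's actual selection policy and budget are not public in the abstract.

```python
import torch

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                attn_mass: torch.Tensor, keep_ratio: float = 0.25):
    """Keep the most-attended cached positions (illustrative policy only).

    keys, values: (seq, d) cached tensors; attn_mass: (seq,) attention received
    by each cached position from the most recent output tokens.
    """
    k = max(1, int(keep_ratio * keys.size(0)))
    idx = torch.topk(attn_mass, k).indices.sort().values  # keep original order
    return keys[idx], values[idx]
```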

[25] Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations

Cedric Waterschoot, Nava Tintarev, Francesco Barile

Main category: cs.CL

TL;DR: LLMs in Group Recommender Systems (GRS) often mimic Additive Utilitarian (ADD) recommendations but provide inconsistent explanations. Group structure doesn’t affect recommendations, but LLMs introduce extra criteria, impacting transparency.

Motivation: To evaluate LLMs as joint decision-makers and explanation generators in GRS, comparing them to social choice-based aggregation strategies.

Method: Comparison of LLM-generated recommendations and explanations with Additive Utilitarian (ADD) aggregation strategies, analyzing group structure and additional criteria introduced by LLMs.

Result: LLM recommendations resembled ADD aggregation, but explanations were inconsistent, introducing extra criteria like user/item similarity or undefined metrics. Group structure had no impact.

Conclusion: LLMs in GRS may undermine transparency due to inconsistent explanations, and standard aggregation methods might be inefficient for larger item sets.

Abstract: Large Language Models (LLMs) are increasingly being implemented as joint decision-makers and explanation generators for Group Recommender Systems (GRS). In this paper, we evaluate these recommendations and explanations by comparing them to social choice-based aggregation strategies. Our results indicate that LLM-generated recommendations often resembled those produced by Additive Utilitarian (ADD) aggregation. However, the explanations typically referred to averaging ratings (resembling but not identical to ADD aggregation). Group structure, uniform or divergent, did not impact the recommendations. Furthermore, LLMs regularly claimed additional criteria such as user or item similarity, diversity, or used undefined popularity metrics or thresholds. Our findings have important implications for LLMs in the GRS pipeline as well as standard aggregation strategies. Additional criteria in explanations were dependent on the number of ratings in the group scenario, indicating potential inefficiency of standard aggregation methods at larger item set sizes. Additionally, inconsistent and ambiguous explanations undermine transparency and explainability, which are key motivations behind the use of LLMs for GRS.
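
For reference, the Additive Utilitarian (ADD) strategy the LLM outputs resembled is just a per-item sum of the group members' ratings, ranked highest first:

```python
from collections import defaultdict

def additive_utilitarian(group_ratings: dict[str, dict[str, float]]) -> list[str]:
    """ADD aggregation: rank items by the sum of ratings across group members."""
    totals: dict[str, float] = defaultdict(float)
    for ratings in group_ratings.values():
        for item, rating in ratings.items():
            totals[item] += rating
    return sorted(totals, key=totals.get, reverse=True)
```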

Guillaume Zambrano

Main category: cs.CL

TL;DR: Machine learning predicts child custody outcomes in French courts, showing judges’ individual patterns influence decisions, supporting legal realism. Specialist models outperform generalist ones.

Motivation: To challenge the assumption of judicial neutrality by examining whether individual judges' decision-making patterns affect case outcomes.

Method: Analyzed 18,937 rulings using hybrid ML (LLMs for feature extraction, RF/XGB/SVC for prediction) and compared specialist (judge-specific) vs. generalist models.

Result: Specialist models (F1 up to 92.85%) outperformed generalist models (82.63%), showing stable, non-transferable judicial patterns.

Conclusion: Judicial identity measurably impacts legal outcomes, supporting legal realism. Data and code will be shared.

Abstract: This study examines the role of human judges in legal decision-making by using machine learning to predict child physical custody outcomes in French appellate courts. Building on the legal realism-formalism debate, we test whether individual judges’ decision-making patterns significantly influence case outcomes, challenging the assumption that judges are neutral variables that apply the law uniformly. To ensure compliance with French privacy laws, we implement a strict pseudonymization process. Our analysis uses 18,937 living arrangements rulings extracted from 10,306 cases. We compare models trained on individual judges’ past rulings (specialist models) with a judge-agnostic model trained on aggregated data (generalist models). The prediction pipeline is a hybrid approach combining large language models (LLMs) for structured feature extraction and ML models for outcome prediction (RF, XGB and SVC). Our results show that specialist models consistently achieve higher predictive accuracy than the general model, with top-performing models reaching F1 scores as high as 92.85%, compared to the generalist model’s 82.63% trained on 20x to 100x more samples. Specialist models capture stable individual patterns that are not transferable to other judges. In-Domain and Cross-Domain validity tests provide empirical support for legal realism, demonstrating that judicial identity plays a measurable role in legal outcomes. All data and code used will be made available.
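
A minimal sketch of the specialist-versus-generalist comparison, assuming case features have already been extracted upstream (the paper uses an LLM for that step) and using one of the paper's classifier families via scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def specialist_vs_generalist(X, y, judge_ids, judge):
    """Fit a judge-specific model alongside a pooled one.
    X: case features, y: custody outcomes, judge_ids: judge per case."""
    X, y, judge_ids = np.asarray(X), np.asarray(y), np.asarray(judge_ids)
    mask = judge_ids == judge
    specialist = RandomForestClassifier().fit(X[mask], y[mask])
    generalist = RandomForestClassifier().fit(X, y)
    return specialist, generalist
```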

[27] PRIDE – Parameter-Efficient Reduction of Identity Discrimination for Equality in LLMs

Maluna Menke, Thilo Hagendorff

Main category: cs.CL

TL;DR: The paper evaluates LoRA and soft-prompt tuning to reduce LGBTQIA+ biases in LLMs, finding LoRA effective with minimal computational cost.

DetailsMotivation: LLMs often reproduce biases against LGBTQIA+ identities, necessitating lightweight solutions for bias mitigation.

Method: Two PEFT techniques (LoRA and soft-prompt tuning) are tested on open-source LLMs using the WinoQueer benchmark and a QueerNews corpus.

Result: LoRA reduces bias scores by up to 50 points and increases neutrality to 36%, while soft-prompt tuning shows marginal improvements.

Conclusion: LoRA is a promising, efficient method for bias reduction, advocating for community-informed PEFT and larger queer-authored corpora.

Abstract: Large Language Models (LLMs) frequently reproduce the gender- and sexual-identity prejudices embedded in their training corpora, leading to outputs that marginalize LGBTQIA+ users. Hence, reducing such biases is of great importance. To achieve this, we evaluate two parameter-efficient fine-tuning (PEFT) techniques - Low-Rank Adaptation (LoRA) and soft-prompt tuning - as lightweight alternatives to full-model fine-tuning for mitigating such biases. Using the WinoQueer benchmark, we quantify bias in three open-source LLMs and observe baseline bias scores reaching up to 98 (out of 100) across a range of queer identities defined by gender and/or sexual orientation, where 50 would indicate neutrality. Fine-tuning with LoRA (< 0.1% additional parameters) on a curated QueerNews corpus reduces those scores by up to 50 points and raises neutrality from virtually 0% to as much as 36%. Soft-prompt tuning (10 virtual tokens) delivers only marginal improvements. These findings show that LoRA can deliver meaningful fairness gains with minimal computation. We advocate broader adoption of community-informed PEFT, the creation of larger queer-authored corpora, and richer evaluation suites beyond WinoQueer, coupled with ongoing audits to keep LLMs inclusive.
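
For orientation, LoRA fine-tuning with the Hugging Face peft library looks roughly as follows; the base model, rank, and target modules are placeholders, since the paper's exact configuration is not given here.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
config = LoraConfig(
    r=8,                        # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in GPT-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of weights
```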

[28] Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models

Palash Nandi, Maithili Joshi, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: The paper explores how prompt design in Visual Language Models (VLMs) can be exploited to generate inappropriate content, identifying three key factors and proposing a framework to increase jailbreak success.

DetailsMotivation: To understand and mitigate the vulnerabilities of VLMs to prompt sensitivity, particularly in generating harmful content.

Method: Analyzes three prompt design factors (detailed visual info, adversarial examples, positively framed phrases) and proposes a skip-connection framework for jailbreak testing.

Result: VLMs struggle with multimodal inputs; each factor independently triggers jailbreaks, and memes can be as effective as toxic visuals.

Conclusion: VLMs have subtle vulnerabilities; the proposed framework highlights risks and potential mitigation strategies.

Abstract: Language models are highly sensitive to prompt formulations - small changes in input can drastically alter their output. This raises a critical question: To what extent can prompt sensitivity be exploited to generate inapt content? In this paper, we investigate how discrete components of prompt design influence the generation of inappropriate content in Visual Language Models (VLMs). Specifically, we analyze the impact of three key factors on successful jailbreaks: (a) the inclusion of detailed visual information, (b) the presence of adversarial examples, and (c) the use of positively framed beginning phrases. Our findings reveal that while a VLM can reliably distinguish between benign and harmful inputs in unimodal settings (text-only or image-only), this ability significantly degrades in multimodal contexts. Each of the three factors is independently capable of triggering a jailbreak, and we show that even a small number of in-context examples (as few as three) can push the model toward generating inappropriate outputs. Furthermore, we propose a framework that utilizes a skip-connection between two internal layers of the VLM, which substantially increases jailbreak success rates, even when using benign images. Finally, we demonstrate that memes, often perceived as humorous or harmless, can be as effective as toxic visuals in eliciting harmful content, underscoring the subtle and complex vulnerabilities of VLMs.
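
The authors' framework is not spelled out in the summary, but the general mechanism of grafting a skip connection between two internal layers of a frozen model can be sketched with PyTorch forward hooks, assuming each layer returns its hidden states first:

```python
def graft_skip_connection(layers, src_idx, dst_idx, alpha=1.0):
    """Add layer src_idx's output onto layer dst_idx's output via forward
    hooks, assuming each layer returns hidden states (possibly in a tuple).
    Returns the hook handles; call .remove() on both to undo."""
    cache = {}

    def save_src(module, inputs, output):
        cache["h"] = output[0] if isinstance(output, tuple) else output

    def add_to_dst(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + alpha * cache["h"],) + output[1:]
        return output + alpha * cache["h"]

    return (layers[src_idx].register_forward_hook(save_src),
            layers[dst_idx].register_forward_hook(add_to_dst))
```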

[29] An Enhanced Model-based Approach for Short Text Clustering

Enhao Cheng, Shoujia Zhang, Jianhua Yin, Xuemeng Song, Tian Gan, Liqiang Nie

Main category: cs.CL

TL;DR: The paper proposes GSDMM and its improved version GSDMM+ for short text clustering, addressing sparsity and high dimensionality while optimizing performance through noise reduction and adaptive word weighting.

DetailsMotivation: Short text clustering is challenging due to sparsity, high dimensionality, and computational intensity. Existing methods (topic models and deep learning) have limitations.

Method: GSDMM uses collapsed Gibbs Sampling for Dirichlet Multinomial Mixture. GSDMM+ reduces initialization noise, adjusts word weights adaptively, and employs strategic cluster merging.

Result: Experiments show GSDMM+ outperforms classical and state-of-the-art methods in efficiency and effectiveness.

Conclusion: GSDMM+ improves clustering granularity and aligns better with true category distributions, offering a robust solution for short text clustering.

Abstract: Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.
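
At the heart of GSDMM is the collapsed-Gibbs conditional for reassigning a single short text to a cluster. A sketch of that one step, assuming the document's own counts have been removed from the cluster statistics and dropping the normalising constant shared across clusters:

```python
from collections import Counter
import numpy as np

def gsdmm_log_score(doc, m_k, n_k, n_kw, V, alpha, beta):
    """Unnormalised log p(z = k | rest) for one short text under the
    Dirichlet Multinomial Mixture. m_k: docs in cluster k; n_k: tokens
    in k; n_kw: per-word Counter for k; V: vocabulary size."""
    log_p = np.log(m_k + alpha)               # cluster popularity term
    i = 0
    for w, count in Counter(doc).items():
        for j in range(count):                # handle repeated words
            log_p += np.log(n_kw[w] + beta + j)
            log_p -= np.log(n_k + V * beta + i)
            i += 1
    return log_p
```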

[30] Question-Answer Extraction from Scientific Articles Using Knowledge Graphs and Large Language Models

Hosein Azarbonyad, Zi Long Zhu, Georgios Cheirmpos, Zubair Afzal, Vikrant Yadav, Georgios Tsatsaronis

Main category: cs.CL

TL;DR: The paper proposes two methods for generating QA pairs from scientific articles: one using LLMs for direct content extraction and another leveraging a KG built from fine-tuned ER extraction. The KG-based method outperforms in capturing main ideas.

DetailsMotivation: Scholars need quick identification of key concepts in articles. The paper aims to automate this by extracting QA pairs to summarize contributions.

Method: Two approaches: (1) LLM-based QA generation from salient paragraphs, and (2) KG-based QA generation using fine-tuned ER extraction and triplet saliency metrics.

Result: The KG approach effectively captures main ideas, and fine-tuning the ER model is critical for high-quality triplet extraction.

Conclusion: The KG-based method is superior for QA generation, and fine-tuning ER models on scientific corpora enhances triplet quality.

Abstract: When deciding to read an article or incorporate it into their research, scholars often seek to quickly identify and understand its main ideas. In this paper, we aim to extract these key concepts and contributions from scientific articles in the form of Question and Answer (QA) pairs. We propose two distinct approaches for generating QAs. The first approach involves selecting salient paragraphs, using a Large Language Model (LLM) to generate questions, ranking these questions by the likelihood of obtaining meaningful answers, and subsequently generating answers. This method relies exclusively on the content of the articles. However, assessing an article’s novelty typically requires comparison with the existing literature. Therefore, our second approach leverages a Knowledge Graph (KG) for QA generation. We construct a KG by fine-tuning an Entity Relationship (ER) extraction model on scientific articles and using it to build the graph. We then employ a salient triplet extraction method to select the most pertinent ERs per article, utilizing metrics such as the centrality of entities based on a triplet TF-IDF-like measure. This measure assesses the saliency of a triplet based on its importance within the article compared to its prevalence in the literature. For evaluation, we generate QAs using both approaches and have them assessed by Subject Matter Experts (SMEs) through a set of predefined metrics to evaluate the quality of both questions and answers. Our evaluations demonstrate that the KG-based approach effectively captures the main ideas discussed in the articles. Furthermore, our findings indicate that fine-tuning the ER extraction model on our scientific corpus is crucial for extracting high-quality triplets from such documents.
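
The triplet TF-IDF-like measure can be pictured as ordinary TF-IDF lifted from words to (head, relation, tail) triplets; the exact weighting used in the paper may differ from this hedged sketch.

```python
import math

def triplet_saliency(tf_in_article, n_articles, df_in_corpus):
    """Salient if the triplet is frequent in this article but rare across
    the literature (document frequency smoothed by +1)."""
    return tf_in_article * math.log(n_articles / (1 + df_in_corpus))
```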

[31] The Expressions of Depression and Anxiety in Chinese Psycho-counseling: Usage of First-person Singular Pronoun and Negative Emotional Words

Lizhi Ma, Tong Zhao, Shuai Zhang, Nirui Song, Hongliang He, Anqi Li, Ran Feng, Huachuan Qiu, Jingsong Ma, Zhenzhong Lan

Main category: cs.CL

TL;DR: The study examines how linguistic expressions (first-person pronouns and negative words) relate to depression and anxiety in Chinese counseling. Negative words correlate with severity, but first-person pronouns don’t, differing from Western findings due to cultural and conversational contexts.

DetailsMotivation: To understand how language reflects psychological states in Chinese counseling, addressing gaps in non-Western, collectivist contexts.

Method: Analyzed 735 online counseling sessions using LIWC and a general linear mixed-effect model.

Result: Negative words linked to severity of depression/anxiety; first-person pronouns showed no significant correlation, differing from Western studies.

Conclusion: Cultural and conversational contexts shape language use in mental health, offering insights for Chinese therapeutic practices.

Abstract: This study explores the relationship between linguistic expressions and psychological states of depression and anxiety within Chinese psycho-counseling interactions, focusing specifically on the usage of first-person singular pronouns and negative emotional words. Utilizing a corpus derived from 735 online counseling sessions, the analysis employed a general linear mixed-effect model to assess linguistic patterns quantified by the Linguistic Inquiry and Word Count (LIWC) software. Results indicate a significant positive correlation between the frequency of negative emotional words and the severity of both depressive and anxious states among clients. However, contrary to prior findings predominantly derived from English-language contexts, the usage frequency of first-person singular pronouns did not vary significantly with the clients’ psychological conditions. These outcomes are discussed within the framework of cultural distinctions between collectivist Chinese contexts and individualistic Western settings, as well as the interactive dynamics unique to psycho-counseling conversations. The findings highlight the nuanced influence of cultural and conversational contexts on language use in mental health communications, providing insights into psycholinguistic markers relevant to therapeutic practices in Chinese-speaking populations.
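
An analysis of this shape is easy to reproduce with statsmodels; the data frame below is fabricated purely to show the model structure (LIWC-style frequencies as fixed effects, a random intercept per client).

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder frame: one row per counseling session (all values invented).
sessions = pd.DataFrame({
    "severity":   [12, 18, 7, 22, 15, 9],
    "neg_words":  [3.1, 4.8, 1.2, 5.5, 3.9, 2.0],
    "i_pronouns": [6.0, 7.2, 5.1, 7.9, 6.4, 5.5],
    "client_id":  ["a", "a", "b", "b", "c", "c"],
})

# Random intercept per client, since clients contribute several sessions.
model = smf.mixedlm("severity ~ neg_words + i_pronouns",
                    data=sessions, groups=sessions["client_id"])
print(model.fit().summary())
```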

[32] Modeling Fair Play in Detective Stories with Language Models

Eitan Wagner, Renana Keydar, Omri Abend

Main category: cs.CL

TL;DR: A probabilistic framework for detective fiction defines fair play, balancing coherence and surprise, and evaluates LLM-generated stories, finding them lacking in this balance.

DetailsMotivation: To formalize the concept of fair play in detective fiction and assess its balance with surprise, especially in LLM-generated stories.

Method: Develop a probabilistic framework to define fair play and design metrics, then apply it to LLM-generated detective stories.

Result: LLM-generated stories are unpredictable but fail to balance surprise and fair play, leading to poor quality.

Conclusion: The framework highlights the tension between coherence and surprise, revealing shortcomings in LLM storytelling.

Abstract: Effective storytelling relies on a delicate balance between meeting the reader’s prior expectations and introducing unexpected developments. In the domain of detective fiction, this tension is known as fair play, which includes the implicit agreement between the writer and the reader as to the range of possible resolutions the mystery story may have. In this work, we present a probabilistic framework for detective fiction that allows us to define desired qualities. Using this framework, we formally define fair play and design appropriate metrics for it. Stemming from these definitions is an inherent tension between the coherence of the story, which measures how much it “makes sense”, and the surprise it induces. We validate the framework by applying it to LLM-generated detective stories. This domain is appealing since we have an abundance of data, we can sample from the distribution generating the story, and the story-writing capabilities of LLMs are interesting in their own right. Results show that while LLM-generated stories may be unpredictable, they generally fail to balance the trade-off between surprise and fair play, which greatly contributes to their poor quality.
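
One loose way to operationalise the framework's two competing quantities, offered as an illustration rather than the paper's actual metrics: surprise as the negative log prior probability of the resolution, coherence as its log probability once the full story is conditioned on.

```python
import math

def surprise_and_coherence(p_prior, p_posterior):
    """p_prior: probability of the resolution before the reveal;
    p_posterior: its probability given the full story. Both would be
    estimated by sampling from the story-generating model."""
    surprise = -math.log(p_prior)       # high if the ending felt unlikely
    coherence = math.log(p_posterior)   # high if the ending "makes sense"
    return surprise, coherence
```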

[33] InTraVisTo: Inside Transformer Visualisation Tool

Nicolò Brunello, Davide Rigamonti, Andrea Sassella, Vincenzo Scotti, Mark James Carman

Main category: cs.CL

TL;DR: InTraVisTo is a visualization tool for Transformer-based LLMs, helping researchers trace token generation and understand internal computations.

DetailsMotivation: LLMs are complex and unpredictable, making their use in production challenging. A tool is needed to investigate their internal reasoning processes.

Method: InTraVisTo visualizes token embeddings and information flow in Transformer models using decoded embeddings and Sankey diagrams.

Result: The tool provides insights into internal patterns and reasoning processes of LLMs.

Conclusion: InTraVisTo aids in understanding LLM computations, potentially improving their reliability and predictability.

Abstract: The reasoning capabilities of Large Language Models (LLMs) have increased greatly over the last few years, as have their size and complexity. Nonetheless, the use of LLMs in production remains challenging due to their unpredictable nature and discrepancies that can exist between their desired behavior and their actual model output. In this paper, we introduce a new tool, InTraVisTo (Inside Transformer Visualisation Tool), designed to enable researchers to investigate and trace the computational process that generates each token in a Transformer-based LLM. InTraVisTo provides a visualization of both the internal state of the Transformer model (by decoding token embeddings at each layer of the model) and the information flow between the various components across the different layers of the model (using a Sankey diagram). With InTraVisTo, we aim to help researchers and practitioners better understand the computations being performed within the Transformer model and thus to shed some light on internal patterns and reasoning processes employed by LLMs.
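
Decoding token embeddings at each layer is essentially the logit-lens recipe; this sketch uses gpt2 as a stand-in (InTraVisTo's own decoders may differ, and re-applying the final layer norm to the last hidden state is approximate).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Project the last token's hidden state at every layer through the output
# head to see which token the model "currently" favours.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax())))
```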

[34] Label Unification for Cross-Dataset Generalization in Cybersecurity NER

Maciej Jalocha, Johan Hausted Schmidt, William Michelseen

Main category: cs.CL

TL;DR: The paper addresses label unification in cybersecurity NER, evaluates cross-dataset performance, and proposes alternative models, finding limited improvements.

DetailsMotivation: Standardized labels are lacking in cybersecurity NER, hindering dataset combination and usability.

Method: Coarse-grained label unification, cross-dataset evaluations with BiLSTM, and proposing multihead and graph-based transfer models.

Result: Models trained on unified datasets generalize poorly; proposed models show marginal or no significant improvements.

Conclusion: Label unification remains challenging, and alternative architectures offer limited gains in cross-dataset generalization.

Abstract: The field of cybersecurity NER lacks standardized labels, making it challenging to combine datasets. We investigate label unification across four cybersecurity datasets to increase data resource usability. We perform a coarse-grained label unification and conduct pairwise cross-dataset evaluations using BiLSTM models. Qualitative analysis of predictions reveals errors, limitations, and dataset differences. To address unification limitations, we propose alternative architectures including a multihead model and a graph-based transfer model. Results show that models trained on unified datasets generalize poorly across datasets. The multihead model with weight sharing provides only marginal improvements over unified training, while our graph-based transfer model built on BERT-base-NER shows no significant performance gains compared to BERT-base-NER.
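
Coarse-grained label unification boils down to a mapping from each dataset's tag inventory onto a shared one; the tags below are invented examples, not the four datasets' actual schemes.

```python
# Invented example tag sets; real unification would cover all four datasets.
UNIFIED = {
    "MALWARE": "MALWARE", "Malware_Name": "MALWARE",
    "TOOL": "TOOL", "Attack_Tool": "TOOL",
    "ORG": "ORGANIZATION", "Organization": "ORGANIZATION",
}

def unify(tags):
    """Map a dataset-specific tag sequence onto the shared inventory;
    anything unmapped falls back to "O" (outside any entity)."""
    return [UNIFIED.get(t, "O") for t in tags]
```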

[35] Using LLMs to identify features of personal and professional skills in an open-response situational judgment test

Cole Walsh, Rodica Ivan, Muhammad Zafar Iqbal, Colleen Robb

Main category: cs.CL

TL;DR: The paper explores using large language models (LLMs) to automate scoring of Situational Judgment Tests (SJTs) for personal and professional skills, addressing scalability and construct validity issues.

DetailsMotivation: The need for scalable systems to measure and develop personal and professional skills alongside technical expertise in academic programs.

Method: A novel approach using LLMs to extract construct-relevant features from SJT responses, demonstrated with the Casper SJT.

Result: The study shows promise for automated scoring of SJTs, addressing past limitations in NLP-based systems.

Conclusion: This work lays the foundation for future advancements in automated scoring for personal and professional skills.

Abstract: Academic programs are increasingly recognizing the importance of personal and professional skills and their critical role alongside technical expertise in preparing students for future success in diverse career paths. With this growing demand comes the need for scalable systems to measure, evaluate, and develop these skills. Situational Judgment Tests (SJTs) offer one potential avenue for measuring these skills in a standardized and reliable way, but open-response SJTs have traditionally relied on trained human raters for evaluation, presenting operational challenges to delivering SJTs at scale. Past attempts at developing NLP-based scoring systems for SJTs have fallen short due to issues with construct validity of these systems. In this article, we explore a novel approach to extracting construct-relevant features from SJT responses using large language models (LLMs). We use the Casper SJT to demonstrate the efficacy of this approach. This study sets the foundation for future developments in automated scoring for personal and professional skills.

[36] Political Leaning and Politicalness Classification of Texts

Matous Volf, Jakub Simko

Main category: cs.CL

TL;DR: The paper focuses on improving text classification for political leaning and politicalness using transformer models, addressing poor generalization by compiling diverse datasets and benchmarking models.

DetailsMotivation: Current approaches for political text classification create siloed solutions with poor out-of-distribution performance, prompting the need for better generalization.

Method: Combined 12 datasets for political leaning and extended 18 datasets for politicalness, then benchmarked models using leave-one-in and leave-one-out methodologies.

Result: New models with enhanced generalization capabilities were trained and evaluated.

Conclusion: The study highlights the importance of diverse datasets and robust methodologies for improving political text classification.

Abstract: This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
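
The leave-one-out protocol, sketched generically: train on every dataset but one and evaluate on the held-out one, so each score reflects out-of-distribution generalisation. `train_fn` and `eval_fn` are hypothetical callables.

```python
def leave_one_out_eval(datasets, train_fn, eval_fn):
    """datasets: dict name -> dataset; train_fn takes a list of datasets
    and returns a model; eval_fn scores a model on one dataset."""
    results = {}
    for held_out in datasets:
        train = [d for name, d in datasets.items() if name != held_out]
        model = train_fn(train)
        results[held_out] = eval_fn(model, datasets[held_out])
    return results
```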

[37] The Levers of Political Persuasion with Conversational AI

Kobi Hackenburg, Ben M. Tappin, Luke Hewitt, Ed Saunders, Sid Black, Hause Lin, Catherine Fist, Helen Margetts, David G. Rand, Christopher Summerfield

Main category: cs.CL

TL;DR: Current AI’s persuasive power stems more from post-training and prompting methods than personalization or scale, but these methods decrease factual accuracy.

DetailsMotivation: To address fears about AI's influence on human beliefs by evaluating LLMs' persuasiveness and factual accuracy.

Method: Three large-scale experiments with 19 LLMs, testing persuasiveness on 707 political issues and checking 466,769 claims for accuracy.

Result: Post-training and prompting boosted persuasiveness by up to 51% and 27%, but decreased factual accuracy.

Conclusion: AI’s persuasive power is driven by post-training and prompting, not scale, but these methods compromise accuracy.

Abstract: There are widespread fears that conversational AI could soon exert unprecedented influence over human beliefs. Here, in three large-scale experiments (N=76,977), we deployed 19 LLMs (including some post-trained explicitly for persuasion) to evaluate their persuasiveness on 707 political issues. We then checked the factual accuracy of 466,769 resulting LLM claims. Contrary to popular concerns, we show that the persuasive power of current and near-future AI is likely to stem more from post-training and prompting methods (which boosted persuasiveness by as much as 51% and 27% respectively) than from personalization or increasing model scale. We further show that these methods increased persuasion by exploiting LLMs’ unique ability to rapidly access and strategically deploy information and that, strikingly, where they increased AI persuasiveness they also systematically decreased factual accuracy.

[38] Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support

Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert

Main category: cs.CL

TL;DR: Marcel is a lightweight, open-source conversational agent for admission inquiries, using retrieval-augmented generation and an FAQ retriever to improve response quality and reduce staff workload.

DetailsMotivation: To support prospective students with admission-related questions while reducing the burden on university staff.

Method: Uses retrieval-augmented generation and an FAQ retriever to map user questions to knowledge-base entries, improving retrieval quality over standard methods.

Result: Designed for easy deployment in resource-constrained settings, with technical evaluation and real-world deployment insights provided.

Conclusion: Marcel effectively addresses admission inquiries with personalized, verifiable responses, easing staff workload.

Abstract: We present Marcel, a lightweight and open-source conversational agent designed to support prospective students with admission-related inquiries. The system aims to provide fast and personalized responses, while reducing workload of university staff. We employ retrieval-augmented generation to ground answers in university resources and to provide users with verifiable, contextually relevant information. To improve retrieval quality, we introduce an FAQ retriever that maps user questions to knowledge-base entries, allowing administrators to steer retrieval, and improving over standard dense/hybrid retrieval strategies. The system is engineered for easy deployment in resource-constrained academic settings. We detail the system architecture, provide a technical evaluation of its components, and report insights from a real-world deployment.
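
An FAQ retriever of the kind described can be approximated with a sentence encoder and nearest-neighbour search; the encoder choice and toy knowledge base below are assumptions, not Marcel's actual components.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder
faq = ["How do I apply?", "What documents are required?",
       "When is the application deadline?"]
faq_embs = encoder.encode(faq, convert_to_tensor=True)

def retrieve_faq(question, top_k=1):
    """Map a user question to its nearest FAQ entries by cosine similarity."""
    q = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q, faq_embs, top_k=top_k)[0]
    return [faq[hit["corpus_id"]] for hit in hits]

print(retrieve_faq("Which papers do I need to send in?"))
```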

[39] Exploiting Primacy Effect To Improve Large Language Models

Bianca Raimondi, Maurizio Gabbrielli

Main category: cs.CL

TL;DR: Fine-tuned LLMs exhibit amplified primacy bias in MCQA, which can be strategically leveraged by reordering answer options to improve performance.

DetailsMotivation: LLMs show human-like biases, such as primacy effects, which impact MCQA accuracy. Understanding and leveraging these biases can enhance model performance.

Method: Reordering answer options based on semantic similarity to the query, without knowing the correct answer, to exploit primacy bias.

Result: The approach significantly improves performance in MCQA tasks.

Conclusion: Biases in LLMs can be both challenges and opportunities, offering insights for bias-aware model design and NLP applications.

Abstract: Large Language Models (LLMs) have become essential in many Natural Language Processing (NLP) tasks, leveraging extensive pre-training and fine-tuning to achieve high accuracy. However, like humans, LLMs exhibit biases, particularly positional biases such as primacy and recency effects, which can influence the accuracy of the answers. The primacy effect, where items presented first are more likely to be remembered or selected, plays a key role in Multiple Choice Question Answering (MCQA), where the order of answer options can affect prediction outcomes. This study focuses on primacy bias in fine-tuned LLMs: we first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. Hence, we strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer. Our experimental results show that this approach significantly improves performance in MCQA. More generally, our findings underscore the dual nature of biases as both challenges and opportunities, offering insights for bias-aware model design and NLP applications.
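
The reordering trick is easy to picture: rank the answer options by embedding similarity to the question and present the most similar first. The sentence encoder below is an assumption; the paper's similarity model is not specified in this summary.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def reorder_options(question, options):
    """Put the options most semantically similar to the question first,
    exploiting the primacy effect; the correct answer is never consulted."""
    q = encoder.encode(question, convert_to_tensor=True)
    o = encoder.encode(options, convert_to_tensor=True)
    sims = util.cos_sim(q, o)[0]
    return [options[i] for i in sims.argsort(descending=True)]
```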

[40] Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

Bhishma Dedhia, Yuval Kansal, Niraj K. Jha

Main category: cs.CL

TL;DR: The paper proposes a bottom-up approach using knowledge graphs (KGs) to train language models for domain-specific reasoning, validated in medicine with significant performance improvements.

DetailsMotivation: Traditional top-down training on general corpora lacks deep domain expertise, necessitating a method to compose simple concepts into complex ones.

Method: A KG-based task generation pipeline synthesizes tasks from primitives, fine-tuning models like QwQ-32B to create QwQ-Med-3, evaluated using ICD-Bench.

Result: QwQ-Med-3 outperforms state-of-the-art models on ICD-Bench and transfers expertise to improve base model performance.

Conclusion: Domain-specific superintelligence via composable KG-based training is a viable path toward AGI.

Abstract: Language models traditionally used for cross-domain generalization have recently demonstrated task-specific reasoning. However, their top-down training approach on general corpora is insufficient for acquiring abstractions needed for deep domain expertise. This may require a bottom-up approach that acquires expertise by learning to compose simple domain concepts into more complex ones. A knowledge graph (KG) provides this compositional structure, where domain primitives are represented as head-relation-tail edges and their paths encode higher-level concepts. We present a task generation pipeline that synthesizes tasks directly from KG primitives, enabling models to acquire and compose them for reasoning. We fine-tune language models on the resultant KG-grounded curriculum to demonstrate domain-specific superintelligence. While broadly applicable, we validate our approach in medicine, where reliable KGs exist. Using a medical KG, we curate 24,000 reasoning tasks paired with thinking traces derived from diverse medical primitives. We fine-tune the QwQ-32B model on this curriculum to obtain QwQ-Med-3 that takes a step towards medical superintelligence. We also introduce ICD-Bench, an evaluation suite to quantify reasoning abilities across 15 medical domains. Our experiments demonstrate that QwQ-Med-3 significantly outperforms state-of-the-art reasoning models on ICD-Bench categories. Further analysis reveals that QwQ-Med-3 utilizes acquired primitives to widen the performance gap on the hardest tasks of ICD-Bench. Finally, evaluation on medical question-answer benchmarks shows that QwQ-Med-3 transfers acquired expertise to enhance the base model’s performance. While the industry’s approach to artificial general intelligence (AGI) emphasizes broad expertise, we envision a future in which AGI emerges from the composable interaction of efficient domain-specific superintelligent agents.
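
The task-generation idea, reduced to a toy: walk the knowledge graph to chain primitives, then verbalise the path into a question whose answer is the final entity. The graph structure and sampling below are illustrative only.

```python
import random

def sample_kg_path(kg, hops=2):
    """Chain primitives by walking the graph; downstream templating would
    verbalise the path into a question whose answer is the final entity.
    kg: dict head -> list of (relation, tail) edges."""
    node = random.choice(list(kg))
    path = []
    for _ in range(hops):
        edges = kg.get(node)
        if not edges:
            break
        relation, tail = random.choice(edges)
        path.append((node, relation, tail))
        node = tail
    return path, node   # reasoning trace and gold answer
```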

[41] Efficient Temporal Tokenization for Mobility Prediction with Large Language Models

Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang

Main category: cs.CL

TL;DR: RHYTHM is a framework using LLMs for spatio-temporal prediction, improving accuracy and efficiency by tokenizing trajectories and freezing the LLM backbone.

DetailsMotivation: To enhance human mobility prediction by leveraging LLMs while reducing computational overhead.

Method: Partitions trajectories into daily tokens with hierarchical attention, uses pre-computed prompt embeddings, and freezes the LLM backbone.

Result: 2.4% accuracy improvement, 5.0% better weekend performance, and 24.6% faster training.

Conclusion: RHYTHM efficiently improves mobility prediction accuracy and computational efficiency.

Abstract: We introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a framework that leverages large language models (LLMs) as spatio-temporal predictors and trajectory reasoners. RHYTHM partitions trajectories into daily segments encoded as discrete tokens with hierarchical attention, capturing both daily and weekly dependencies while substantially reducing the sequence length. Token representations are enriched with pre-computed prompt embeddings via a frozen LLM, enhancing the model’s ability to capture interdependencies without extensive computational overhead. By freezing the LLM backbone, RHYTHM achieves significant computational efficiency. Evaluation on three real-world datasets demonstrates a 2.4% improvement in accuracy, 5.0% increase on weekends, and 24.6% reduction in training time compared to state-of-the-art methods.
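
The pipeline starts by partitioning a time-stamped trajectory into daily segments, roughly as below (field names are illustrative; the discrete encoding and hierarchical attention layers are not shown).

```python
from collections import defaultdict

def daily_segments(trajectory):
    """Split a stamped trajectory into per-day location sequences, the
    unit that would then be encoded as a discrete token.
    trajectory: iterable of (timestamp: datetime, location_id)."""
    days = defaultdict(list)
    for ts, loc in sorted(trajectory):
        days[ts.date()].append(loc)
    return [days[d] for d in sorted(days)]
```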

[42] CPC-CMS: Cognitive Pairwise Comparison Classification Model Selection Framework for Document-level Sentiment Analysis

Jianfei Li, Kevin Kam Fung Yuen

Main category: cs.CL

TL;DR: The CPC-CMS framework selects the best sentiment analysis model using expert-weighted criteria and a decision matrix. ALBERT performs best without time constraints, but no single model excels when time is considered.

DetailsMotivation: To improve document-level sentiment analysis by systematically selecting the best classification model based on weighted evaluation criteria.

Method: Uses expert knowledge to weight criteria (e.g., accuracy, F1-score) and forms a decision matrix to compare models like Naive Bayes, LSTM, and ALBERT.

Result: ALBERT is best without time constraints; no single model consistently outperforms others when time is included.

Conclusion: CPC-CMS is effective for model selection in sentiment analysis and can be adapted for other classification tasks.

Abstract: This study proposes the Cognitive Pairwise Comparison Classification Model Selection (CPC-CMS) framework for document-level sentiment analysis. The CPC, based on expert knowledge judgment, is used to calculate the weights of evaluation criteria, including accuracy, precision, recall, F1-score, specificity, Matthews Correlation Coefficient (MCC), Cohen’s Kappa (Kappa), and efficiency. Naive Bayes, Linear Support Vector Classification (LSVC), Random Forest, Logistic Regression, Extreme Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), and A Lite Bidirectional Encoder Representations from Transformers (ALBERT) are chosen as classification baseline models. A weighted decision matrix, consisting of classification evaluation scores weighted by the criteria, is formed to select the best classification model for a classification problem. Three open datasets of social media are used to demonstrate the feasibility of the proposed CPC-CMS. Based on our simulation, for evaluation results excluding the time factor, ALBERT is the best for the three datasets; if time consumption is included, no single model always performs better than the others. The CPC-CMS can be applied to other classification applications in different areas.
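
The selection step amounts to scoring a weighted decision matrix; a hedged sketch, with max-normalisation chosen here for concreteness (the CPC expert-judgment procedure that produces the weights is not reproduced).

```python
import numpy as np

def select_model(score_matrix, weights):
    """Rows: candidate classifiers; columns: criteria (accuracy, F1, ...).
    Criteria where lower is better (e.g., runtime) should be inverted
    before this step. Returns the index of the best model."""
    s = np.asarray(score_matrix, dtype=float)
    s = s / s.max(axis=0)                      # max-normalise each criterion
    return int(np.argmax(s @ np.asarray(weights)))
```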

[43] Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks

Israt Jahan, Md Tahmid Rahman Laskar, Chun Peng, Jimmy Huang

Main category: cs.CL

TL;DR: The paper evaluates cost-efficient LLMs for biomedical tasks, finding no single model excels in all tasks. Open-source models sometimes match or outperform closed-source ones, offering benefits like speed and privacy.

DetailsMotivation: To assess the performance of various LLMs in biomedical tasks to guide optimal model selection for specific applications.

Method: Evaluated closed-source and open-source LLMs on tasks like text classification, generation, QA, and multimodal image processing.

Result: No single LLM consistently outperforms others; performance varies by task. Open-source models can match or exceed closed-source ones, with added advantages.

Conclusion: The study provides insights for choosing the best-suited LLMs for specific biomedical tasks, highlighting the potential of open-source models.

Abstract: This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as biomedical text classification and generation, question answering, and multimodal image processing. Our experimental findings indicate that there is no single LLM that can consistently outperform others across all tasks. Instead, different LLMs excel in different tasks. While some closed-source LLMs demonstrate strong performance on specific tasks, their open-source counterparts achieve comparable results (sometimes even better), with additional benefits like faster inference and enhanced privacy. Our experimental results offer valuable insights for selecting models that are optimally suited for specific biomedical applications.

[44] Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog

Lautaro Estienne, Gabriel Ben Zenou, Nona Naderi, Jackie Cheung, Pablo Piantanida

Main category: cs.CL

TL;DR: The paper introduces Collaborative Rational Speech Act (CRSA), an extension of the RSA framework, to improve pragmatic reasoning in multi-turn, collaborative AI dialogues.

DetailsMotivation: AI systems need to reason about shared goals and beliefs in collaborative settings, but existing RSA extensions struggle with multi-turn scenarios.

Method: CRSA extends RSA using information-theoretic principles, optimizing a gain function adapted from rate-distortion theory for multi-turn dialogues.

Result: CRSA outperforms baselines in referential games and medical dialogues, showing more consistent, interpretable, and collaborative behavior.

Conclusion: CRSA advances pragmatic and socially aware AI language agents, making them more effective in collaborative scenarios.

Abstract: As AI systems take on collaborative roles, they must reason about shared goals and beliefs, not just generate fluent language. The Rational Speech Act (RSA) framework offers a principled approach to pragmatic reasoning, but existing extensions face challenges in scaling to multi-turn, collaborative scenarios. In this paper, we introduce Collaborative Rational Speech Act (CRSA), an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain extends the one maximized in the original RSA model, accounting for the scenario in which both agents in a conversation have private information and produce utterances conditioned on the dialog. We demonstrate the effectiveness of CRSA on referential games and template-based doctor-patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines, paving the way for more pragmatic and socially aware language agents.
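
For context, the single-turn RSA recursion that CRSA generalises fits in a few lines; CRSA's rate-distortion gain and multi-turn conditioning are beyond this sketch.

```python
import numpy as np

def rsa_listener(lexicon, alpha=1.0):
    """One round of vanilla RSA. lexicon[u, m] = 1 if utterance u is
    literally true of meaning m; uniform priors assumed throughout."""
    literal = lexicon / lexicon.sum(axis=1, keepdims=True)    # L0(m|u)
    speaker = literal ** alpha
    speaker = speaker / speaker.sum(axis=0, keepdims=True)    # S1(u|m)
    return speaker / speaker.sum(axis=1, keepdims=True)       # L1(m|u)
```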

[45] DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits

Garapati Keerthana, Manik Gupta

Main category: cs.CL

TL;DR: DENSE is a system that generates clinically coherent progress notes by leveraging a retrieval strategy and LLM, improving longitudinal patient narratives in EHR datasets.

DetailsMotivation: Progress notes are crucial in EHRs but underrepresented in datasets like MIMIC-III, creating gaps in patient narratives.

Method: DENSE uses fine-grained note categorization, temporal alignment, and a retrieval strategy to prompt an LLM for generating progress notes.

Result: Generated notes show strong longitudinal fidelity with a temporal alignment ratio of 1.089, outperforming original notes.

Conclusion: DENSE enhances narrative coherence, supporting downstream tasks like summarization and predictive modeling in healthcare.

Abstract: Progress notes are among the most clinically meaningful artifacts in an Electronic Health Record (EHR), offering temporally grounded insights into a patient’s evolving condition, treatments, and care decisions. Despite their importance, they are severely underrepresented in large-scale EHR datasets. For instance, in the widely used Medical Information Mart for Intensive Care III (MIMIC-III) dataset, only about 8.56% of hospital visits include progress notes, leaving gaps in longitudinal patient narratives. In contrast, the dataset contains a diverse array of other note types, each capturing different aspects of care. We present DENSE (Documenting Evolving Progress Notes from Scattered Evidence), a system designed to align with clinical documentation workflows by simulating how physicians reference past encounters while drafting progress notes. The system introduces a fine-grained note categorization and a temporal alignment mechanism that organizes heterogeneous notes across visits into structured, chronological inputs. At its core, DENSE leverages a clinically informed retrieval strategy to identify temporally and semantically relevant content from both current and prior visits. This retrieved evidence is used to prompt a large language model (LLM) to generate clinically coherent and temporally aware progress notes. We evaluate DENSE on a curated cohort of patients with multiple visits and complete progress note documentation. The generated notes demonstrate strong longitudinal fidelity, achieving a temporal alignment ratio of 1.089, surpassing the continuity observed in original notes. By restoring narrative coherence across fragmented documentation, our system supports improved downstream tasks such as summarization, predictive modeling, and clinical decision support, offering a scalable solution for LLM-driven note synthesis in real-world healthcare settings.
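
A toy rendering of retrieval that blends semantic relevance with recency when picking evidence for note drafting; the weights and fields are assumptions, not DENSE's actual clinically informed strategy.

```python
def rank_notes(notes, query_emb, now, w_sem=0.7, w_time=0.3):
    """notes: dicts with 'emb' (unit vector, e.g. numpy array) and
    'time' (datetime); query_emb: unit vector for the drafting context."""
    def score(note):
        sem = float(query_emb @ note["emb"])       # cosine for unit vectors
        age_days = max((now - note["time"]).days, 0) + 1
        return w_sem * sem + w_time / age_days     # newer evidence wins ties
    return sorted(notes, key=score, reverse=True)
```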

[46] Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track

Brian Ondov, William Xia, Kush Attal, Ishita Unde, Jerry He, Hoa Dang, Ian Soboroff, Dina Demner-Fushman

Main category: cs.CL

TL;DR: The PLABA track evaluated language models for adapting biomedical abstracts into plain language, showing promise but highlighting limitations in simplicity, brevity, and automatic evaluation.

DetailsMotivation: To make biomedical literature accessible to patients and caregivers by adapting it into plain language, while ensuring rigorous evaluation due to potential harm.

Method: Hosted the PLABA track with tasks for rewriting abstracts and replacing difficult terms, using professional references and manual evaluations.

Result: Top models matched human factual accuracy but lacked simplicity and brevity. Automatic metrics poorly correlated with manual judgments. LLMs excelled in term replacement accuracy but struggled with brevity.

Conclusion: LLMs show potential for adapting biomedical texts but need improvements in simplicity, brevity, and better automatic evaluation tools.

Abstract: Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.

[47] ViMMRC 2.0 – Enhancing Machine Reading Comprehension on Vietnamese Literature Text

Son T. Luu, Khoi Trong Hoang, Tuong Quang Pham, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: ViMMRC 2.0 extends ViMMRC for Vietnamese multiple-choice reading comprehension, featuring harder questions and a multi-stage approach combining MAN and NLI to improve performance.

DetailsMotivation: Enhance computer understanding of Vietnamese texts by creating a more challenging dataset (ViMMRC 2.0) and improving reading comprehension models.

Method: Propose a multi-stage approach combining multi-step attention network (MAN) and natural language inference (NLI) task. Compare with BERTology models.

Result: Challenges include understanding implicit context and linking information. The proposed method shows improved performance.

Conclusion: ViMMRC 2.0 aims to advance Vietnamese language understanding in computers and inspire further research.

Abstract: Machine reading comprehension has been an interesting and challenging task in recent years, with the purpose of extracting useful information from texts. To attain the computer ability to understand the reading text and answer relevant information, we introduce ViMMRC 2.0 - an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks which contain the reading articles for students from Grade 1 to Grade 12. This dataset has 699 reading passages which are prose and poems, and 5,273 questions. The questions in the new dataset are not fixed with four options as in the previous version. Moreover, the difficulty of questions is increased, which challenges the models to find the correct choice. The computer must understand the whole context of the reading passage, the question, and the content of each choice to extract the right answers. Hence, we propose a multi-stage approach that combines the multi-step attention network (MAN) with the natural language inference (NLI) task to enhance the performance of the reading comprehension model. Then, we compare the proposed methodology with the baseline BERTology models on the new dataset and the ViMMRC 1.0. From the results of the error analysis, we found that the challenge of the reading comprehension models is understanding the implicit context in texts and linking them together in order to find the correct answers. Finally, we hope our new dataset will motivate further research to enhance the ability of computers to understand the Vietnamese language.

[48] Meta4XNLI: A Crosslingual Parallel Corpus for Metaphor Detection and Interpretation

Elisa Sanchez-Bayona, Rodrigo Agerri

Main category: cs.CL

TL;DR: Meta4XNLI is a parallel dataset for metaphor detection and interpretation in Spanish and English, used to evaluate language models’ abilities in understanding metaphors through monolingual and cross-lingual experiments.

DetailsMotivation: Metaphors are pervasive in language, and understanding them is essential for Language Models. The lack of annotated resources for metaphor tasks in multiple languages motivates the creation of Meta4XNLI.

Method: The study introduces Meta4XNLI, a parallel dataset with metaphor annotations in Spanish and English. It conducts experiments to evaluate language models’ metaphor detection and interpretation abilities, including error analysis.

Result: The experiments reveal insights into how models handle metaphors, highlighting challenges in non-literal language understanding. The parallel data also enables exploration of metaphor transferability and translation effects.

Conclusion: Meta4XNLI provides a valuable resource for metaphor research, demonstrating the need for improved models in figurative language understanding and the potential of multilingual annotated datasets.

Abstract: Metaphors, although occasionally unperceived, are ubiquitous in our everyday language. Thus, it is crucial for Language Models to be able to grasp the underlying meaning of this kind of figurative language. In this work, we present Meta4XNLI, a novel parallel dataset for the tasks of metaphor detection and interpretation that contains metaphor annotations in both Spanish and English. We investigate language models’ metaphor identification and understanding abilities through a series of monolingual and cross-lingual experiments by leveraging our proposed corpus. In order to comprehend how these non-literal expressions affect models’ performance, we look over the results and perform an error analysis. Additionally, parallel data offers many potential opportunities to investigate metaphor transferability between these languages and the impact of translation on the development of multilingual annotated resources.

[49] psifx – Psychological and Social Interactions Feature Extraction Package

Guillaume Rochette, Mathieu Rochat, Matthew J. Vowels

Main category: cs.CL

TL;DR: psifx is a multi-modal feature extraction toolkit for human sciences, automating annotation, promoting open-source research, and enabling non-expert use.

DetailsMotivation: To automate and standardize data annotation, develop open-source psychology tools, and make ML accessible to non-experts.

Method: Modular framework with tools for audio, video, and text feature extraction, including speaker diarization, pose estimation, and LLM-supported text analysis.

Result: Enables large-scale, standardized behavioral analysis in psychology and social sciences.

Conclusion: psifx democratizes advanced ML for human sciences, fostering community-driven research and real-time behavioral studies.

Abstract: psifx is a plug-and-play multi-modal feature extraction toolkit, aiming to facilitate and democratize the use of state-of-the-art machine learning techniques for human sciences research. It is motivated by a need (a) to automate and standardize data annotation processes that typically require expensive, lengthy, and inconsistent human labour; (b) to develop and distribute open-source community-driven psychology research software; and (c) to enable large-scale access and ease of use for non-expert users. The framework contains an array of tools for tasks such as speaker diarization, closed-caption transcription and translation from audio; body, hand, and facial pose estimation and gaze tracking with multi-person tracking from video; and interactive textual feature extraction supported by large language models. The package has been designed with a modular and task-oriented approach, enabling the community to add or update new tools easily. This combination creates new opportunities for in-depth study of real-time behavioral phenomena in psychological and social science research.

[50] Sparse Rewards Can Self-Train Dialogue Agents

Barrett Martin Lattimer, Varun Gangal, Ryan McDonald, Yi Yang

Main category: cs.CL

TL;DR: The paper introduces JOSH, a self-alignment algorithm for LLMs to autonomously improve performance without human feedback, using a sparse reward simulation environment.

DetailsMotivation: Human feedback for LLM improvement is costly and may become impractical as models surpass human capabilities.

Method: JOSH leverages sparse reward simulations (ToolWOZ) to train LLMs on their own outputs.

Result: Models trained with JOSH show improved tool-based interactions without losing general capabilities.

Conclusion: JOSH offers a scalable alternative to human feedback for LLM self-improvement.

Abstract: Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub at https://github.com/asappresearch/josh-llm-simulation-training
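
The self-training loop can be caricatured as reward-filtered behaviour cloning: roll the agent through the simulator, keep the rollouts that reach the sparse reward, and fine-tune on them. Every interface below is hypothetical.

```python
def harvest_training_data(simulator, agent, episodes, reward_threshold=1.0):
    """`simulator.run` returning ((prompt, response) pairs, reward) is a
    hypothetical interface, not JOSH's actual API."""
    dataset = []
    for _ in range(episodes):
        trajectory, reward = simulator.run(agent)
        if reward >= reward_threshold:       # sparse reward reached
            dataset.extend(trajectory)
    return dataset   # feed to ordinary supervised fine-tuning
```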

[51] Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

Main category: cs.CL

TL;DR: The paper introduces the Cross Lingual Auto Evaluation (CIA) Suite, including the Hercule model and Recon test set, to address the lack of multilingual evaluation frameworks in NLP. Hercule outperforms proprietary models in aligning with human judgments, even in low-resource and zero-shot scenarios.

DetailsMotivation: Current evaluation methods for machine-generated text are English-centric, leaving a gap for multilingual frameworks. The study aims to bridge this gap by providing tools for cross-lingual evaluation.

Method: The CIA Suite includes Hercule, a cross-lingual evaluator LLM, and Recon, a multilingual test set with human-annotated instructions and scores. Hercule learns from English references to evaluate non-English responses.

Result: Hercule aligns better with human judgments than proprietary models, proving effective in low-resource and zero-shot evaluation.

Conclusion: The study presents a scalable, effective approach for multilingual evaluation, with all resources made publicly available to advance research in this area.

Abstract: Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.

[52] Temporal reasoning for timeline summarisation in social media

Jiayu Song, Mahmud Elahi Akhter, Dana Atzil Slonim, Maria Liakata

Main category: cs.CL

TL;DR: Enhancing temporal reasoning in LLMs improves timeline summarization, especially for complex, emotional social media threads.

DetailsMotivation: To improve timeline summarization by leveraging temporal reasoning, addressing gaps in existing datasets and methods.

Method: Combines temporal reasoning with summarization via knowledge distillation: fine-tunes a teacher model on temporal tasks, then distills knowledge into a student model for summarization.

Result: Superior performance on out-of-domain mental health-related timeline summarization tasks, handling long, repetitive, emotional threads.

Conclusion: Temporal reasoning enhances timeline summarization, proving its importance and generalizability.

Abstract: This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarisation, the task of summarising long texts containing sequences of events, such as social media threads. We first introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarisation through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarisation. Experimental results demonstrate that our model achieves superior performance on out-of-domain mental health-related timeline summarisation tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance and generalisability of leveraging temporal reasoning to improve timeline summarisation.
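
The distillation step pairs a soft teacher-matching term with the standard task loss; a common formulation is sketched below (the temperature, mixing weight, and exact loss composition used in the paper are assumptions here):

```python
# Standard knowledge-distillation objective: KL between temperature-softened
# teacher and student distributions, mixed with cross-entropy on gold labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients by T^2
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,                        # skip padded positions
    )
    return alpha * kd + (1 - alpha) * ce
```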

[53] Consistency of Responses and Continuations Generated by Large Language Models on Social Media

Wenlu Fan, Yuqi Zhu, Chenyang Wang, Bin Wang, Wentao Xu

Main category: cs.CL

TL;DR: This study explores how LLMs (Gemma and Llama) handle emotional content and semantic coherence in social media contexts, revealing distinct emotional patterns and high semantic similarity with human-authored text.

DetailsMotivation: To understand the emotional consistency and semantic coherence of LLMs in social media contexts, as these aspects are insufficiently studied.

Method: Analyzed climate change discussions from Twitter and Reddit using Gemma and Llama for continuation and response tasks, examining emotional transitions, intensity, and semantic similarity.

Result: Gemma amplifies negative emotions (e.g., anger) but retains some positive ones (e.g., optimism), while Llama preserves a broader emotional spectrum. Both models attenuate emotional intensity and show a positive bias in responses, maintaining strong semantic coherence.

Conclusion: LLMs exhibit distinct emotional patterns and high semantic coherence, offering insights for their use in social media and human-AI interaction design.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities in text generation, yet their emotional consistency and semantic coherence in social media contexts remain insufficiently understood. This study investigates how LLMs handle emotional content and maintain semantic relationships through continuation and response tasks using two open-source models: Gemma and Llama. By analyzing climate change discussions from Twitter and Reddit, we examine emotional transitions, intensity patterns, and semantic similarity between human-authored and LLM-generated content. Our findings reveal that while both models maintain high semantic coherence, they exhibit distinct emotional patterns: Gemma shows a tendency toward negative emotion amplification, particularly anger, while maintaining certain positive emotions like optimism. Llama demonstrates superior emotional preservation across a broader spectrum of affects. Both models systematically generate responses with attenuated emotional intensity compared to human-authored content and show a bias toward positive emotions in response tasks. Additionally, both models maintain strong semantic similarity with original texts, though performance varies between continuation and response tasks. These findings provide insights into LLMs' emotional and semantic processing capabilities, with implications for their deployment in social media contexts and human-AI interaction design.

[54] ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Mohita Chowdhury, Yajie Vera He, Jared Joselowitz, Aisling Higham, Ernest Lim

Main category: cs.CL

TL;DR: The paper introduces ASTRID, an automated evaluation triad for clinical QA systems using RAG, addressing limitations of current metrics with three new metrics: Context Relevance, Refusal Accuracy, and Conversational Faithfulness.

DetailsMotivation: Current automated RAG metrics perform poorly in clinical and conversational QA, and human evaluations are costly and unscalable.

Method: ASTRID is proposed, consisting of three metrics (CR, RA, CF), validated on a dataset of 200+ real-world patient questions and clinician-selected scenarios.

Result: CF outperforms existing metrics in predicting human ratings of faithfulness, and the triad aligns with clinician assessments. Across nine LLMs, the three metrics closely agree with human evaluations.

Conclusion: ASTRID offers a scalable, automated solution for evaluating clinical QA systems, with potential for broader LLM-driven evaluation pipelines.

Abstract: Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model’s response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.

[55] Culture is Not Trivia: Sociocultural Theory for Cultural NLP

Naitian Zhou, David Bamman, Isaac L. Bleaman

Main category: cs.CL

TL;DR: The paper critiques cultural NLP’s reliance on proxies for culture, highlights limitations like coarse boundaries and static benchmarks, and proposes sociocultural linguistics as a solution for better cultural competence and localization.

DetailsMotivation: The need for effective and safe language technologies across diverse cultures drives the study, addressing gaps in current cultural NLP methodologies.

Method: The paper uses a case study to illustrate methodological constraints and suggests paths forward, drawing on sociocultural linguistics theory.

Result: The study identifies recurring limitations in cultural NLP and proposes localization as a better framing for cultural competence.

Conclusion: The paper advocates for a theoretical shift in cultural NLP, emphasizing dynamic, nuanced cultural understanding and localization.

Abstract: The field of cultural NLP has recently experienced rapid growth, driven by a pressing need to ensure that language technologies are effective and safe across a pluralistic user base. This work has largely progressed without a shared conception of culture, instead choosing to rely on a wide array of cultural proxies. However, this leads to a number of recurring limitations: coarse national boundaries fail to capture nuanced differences that lay within them, limited coverage restricts datasets to only a subset of usually highly-represented cultures, and a lack of dynamicity results in static cultural benchmarks that do not change as culture evolves. In this position paper, we argue that these methodological limitations are symptomatic of a theoretical gap. We draw on a well-developed theory of culture from sociocultural linguistics to fill this gap by 1) demonstrating in a case study how it can clarify methodological constraints and affordances, 2) offering theoretically-motivated paths forward to achieving cultural competence, and 3) arguing that localization is a more useful framing for the goals of much current work in cultural NLP.

[56] When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models

Julia Mendelsohn, Ceren Budak

Main category: cs.CL

TL;DR: A computational method measures metaphorical language in immigration discourse on social media, revealing ideological differences and engagement effects.

DetailsMotivation: To understand how metaphors shape political discourse and public perception, especially in immigration debates.

Method: Developed a technique combining word-level and document-level signals to measure metaphors in 400K US tweets about immigration.

Result: Conservatives use more dehumanizing metaphors, but effects vary by concept. Creature-related metaphors increase retweets, especially for liberals.

Conclusion: Computational methods can enhance qualitative research in analyzing implicit language in political discourse.

Abstract: Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. “water” or “vermin”). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.
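
To make the "word-level plus document-level signals" idea concrete, here is a deliberately simplified scoring sketch; the seed lexicon, the TF-IDF stand-in for document similarity, and the mixing weight are all our assumptions rather than the paper's technique:

```python
# Toy metaphor score for the "water" concept: a lexicon-match rate at the
# word level blended with a document-level similarity to the concept's seeds.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

WATER_SEEDS = ["flood", "wave", "pour", "surge", "drown"]  # hypothetical seeds

def metaphor_score(tweet: str, corpus: list[str], weight: float = 0.5) -> float:
    tokens = [t.strip(".,!?").lower() for t in tweet.split()]
    word_signal = sum(t in WATER_SEEDS for t in tokens) / max(len(tokens), 1)
    vec = TfidfVectorizer().fit(corpus + [" ".join(WATER_SEEDS)])
    doc_signal = cosine_similarity(
        vec.transform([tweet]), vec.transform([" ".join(WATER_SEEDS)])
    )[0, 0]
    return weight * word_signal + (1 - weight) * doc_signal

corpus = ["a wave of migrants poured over the border",
          "new visa policy announced today"]
print(metaphor_score(corpus[0], corpus))  # scores higher than the neutral tweet
```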

[57] Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

William Jurayj, Jeffrey Cheng, Benjamin Van Durme

Main category: cs.CL

TL;DR: The paper explores the impact of test-time compute scaling on large language models, focusing on confidence in responses and the appropriateness of always answering. It introduces confidence-based thresholding and suggests evaluation methods for non-zero response risk.

DetailsMotivation: Existing evaluations assume models should always answer, ignoring confidence and appropriateness. The study aims to address these gaps by analyzing confidence and proposing new evaluation paradigms.

Method: Extract confidence scores during reasoning for thresholding responses. Analyze the effect of increased compute on correctness and confidence. Extend evaluation to include non-zero response risk.

Result: Increased compute improves correctness and confidence in responses. The study provides a framework for evaluating models under non-zero risk settings.

Conclusion: Test-time compute scaling enhances model performance and confidence. The paper advocates for evaluations that account for response risk and confidence.

Abstract: Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
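
The thresholding recipe is easy to state in code: answer only when confidence clears a threshold, then report coverage and accuracy together so non-zero response risk stays visible. The records below are illustrative; how confidence is extracted from the reasoning trace is left abstract here:

```python
# Selective QA: answer only above a confidence threshold, and report the
# (coverage, accuracy-on-answered) pair rather than accuracy alone.
def selective_qa(examples, threshold=0.8):
    answered = [e for e in examples if e["confidence"] >= threshold]
    coverage = len(answered) / len(examples)
    accuracy = sum(e["correct"] for e in answered) / len(answered) if answered else 0.0
    return coverage, accuracy

examples = [  # illustrative records; confidences would come from the model
    {"confidence": 0.95, "correct": True},
    {"confidence": 0.60, "correct": False},
    {"confidence": 0.85, "correct": True},
]
print(selective_qa(examples, threshold=0.8))  # -> (0.666..., 1.0)
```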

[58] HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu

Main category: cs.CL

TL;DR: The paper introduces HoH, a benchmark to evaluate the impact of outdated information in RAG systems, showing it degrades performance and can cause harmful outputs.

DetailsMotivation: Address the overlooked challenge of outdated information coexisting in RAG knowledge bases, which current research inadequately tackles.

Method: Uses token-level diff algorithms and LLM pipelines to create a large-scale QA dataset capturing temporal knowledge evolution.

Result: Outdated information reduces response accuracy and can mislead models into harmful outputs, even with current information available.

Conclusion: Highlights the need for innovative solutions to handle temporal challenges in RAG, as current approaches struggle with outdated information.

Abstract: While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
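
A token-level diff of an old fact statement against its update, of the kind the benchmark pipeline relies on, can be sketched with the standard library (the example facts are invented):

```python
# Token-level diff between two versions of a fact using difflib.
import difflib

old = "The mayor of Springfield is Alice Johnson".split()
new = "The mayor of Springfield is Bob Lee".split()

for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
    if op != "equal":
        print(op, old[i1:i2], "->", new[j1:j2])
# prints: replace ['Alice', 'Johnson'] -> ['Bob', 'Lee']
```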

[59] MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs

Jaap Jumelet, Leonie Weissweiler, Joakim Nivre, Arianna Bisazza

Main category: cs.CL

TL;DR: MultiBLiMP 1.0 is a multilingual benchmark with 128,000+ minimal pairs across 101 languages, evaluating LLMs on subject-verb agreement.

DetailsMotivation: To assess LLMs' linguistic abilities at a large multilingual scale and identify gaps in low-resource language modeling.

Method: Uses an automated pipeline with Universal Dependencies and UniMorph resources to create minimal pairs.

Result: Benchmark covers 101 languages and highlights current LLM shortcomings in low-resource languages.

Conclusion: MultiBLiMP 1.0 provides a scalable tool for evaluating LLMs’ multilingual linguistic capabilities.

Abstract: We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates the abilities of LLMs at an unprecedented multilingual scale and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
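
Minimal-pair evaluation reduces to a log-likelihood comparison: the model passes an item if it assigns higher probability to the grammatical sentence. A sketch with an off-the-shelf causal LM follows (the model choice and the English example are ours; MultiBLiMP itself spans 101 languages):

```python
# Score a minimal pair by total sentence log-probability under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss      # mean NLL over predicted tokens
    return -loss.item() * (ids.size(1) - 1)     # total log-probability

good = "The keys to the cabinet are on the table."
bad = "The keys to the cabinet is on the table."
print(sentence_logprob(good) > sentence_logprob(bad))  # expected: True
```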

[60] ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: ParaPO is a post-training method that reduces verbatim regurgitation in LMs while preserving utility, outperforming prior unlearning methods.

DetailsMotivation: Address concerns about copyright, plagiarism, privacy, and creativity due to LMs memorizing and reproducing pretraining data.

Method: Fine-tunes LMs to prefer paraphrased versions of memorized segments, using system prompts to control regurgitation for famous quotations.

Result: ParaPO reduces regurgitation metrics (e.g., 17.3 to 12.9 in creative writing) and maintains quotation recall when prompted.

Conclusion: ParaPO effectively mitigates unintentional regurgitation while preserving LM utility, outperforming traditional unlearning methods.

Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).
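
Since ParaPO is preference tuning over (paraphrase, verbatim) pairs, a DPO-style objective is a natural sketch; whether ParaPO uses exactly this loss is an assumption on our part:

```python
# DPO-style preference loss where the "chosen" completion is the paraphrase
# and the "rejected" one is the verbatim memorized segment.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # sequence-level log-probs under the policy and a frozen reference model
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# toy batch of sequence log-probs
lp_c, lp_r = torch.tensor([-12.0]), torch.tensor([-9.0])
ref_c, ref_r = torch.tensor([-11.5]), torch.tensor([-8.0])
print(preference_loss(lp_c, lp_r, ref_c, ref_r))
```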

[61] DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, Chong Ruan

Main category: cs.CL

TL;DR: DeepSeek-Prover-V2 is an open-source LLM for formal theorem proving in Lean 4, achieving state-of-the-art results on benchmarks like MiniF2F and PutnamBench.

DetailsMotivation: To integrate informal and formal mathematical reasoning into a unified model for theorem proving.

Method: Uses a recursive theorem proving pipeline with DeepSeek-V3 to decompose problems, synthesize proofs, and initialize reinforcement learning.

Result: Achieves an 88.9% pass ratio on MiniF2F-test, solves 49 of 658 PutnamBench problems, and solves 6 of 15 AIME problems.

Conclusion: The gap between formal and informal reasoning in LLMs is narrowing, as shown by DeepSeek-Prover-V2’s performance.

Abstract: We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3’s step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.

[62] On-Policy Optimization with Group Equivalent Preference for Multi-Programming Language Understanding

Haoyuan Wu, Rui Ming, Jilong Gao, Hangyu Zhao, Xueyi Chen, Yikai Yang, Haisheng Zheng, Zhuolun He, Bei Yu

Main category: cs.CL

TL;DR: The paper addresses performance disparities in LLMs for code generation by using code translation tasks and a novel RL framework (OORL) with GEPO for preference optimization, improving cross-language coding proficiency.

DetailsMotivation: The performance gap in LLMs for code generation between popular and less common programming languages needs addressing to enhance versatility.

Method: Proposes OORL, an RL framework combining on-policy and off-policy strategies, and GEPO, a preference optimization method that operates over groups of intermediate representations (IRs).

Result: OORL with code translation tasks significantly improves LLM performance on code benchmarks across multiple languages.

Conclusion: The approach effectively bridges the capability gap in LLMs for diverse programming languages by leveraging translation and RL.

Abstract: Large language models (LLMs) achieve remarkable performance in code generation tasks. However, a significant performance disparity persists between popular programming languages (e.g., Python, C++) and others. To address this capability gap, we leverage the code translation task to train LLMs, thereby facilitating the transfer of coding proficiency across diverse programming languages. Moreover, we introduce OORL for training, a novel reinforcement learning (RL) framework that integrates on-policy and off-policy strategies. Within OORL, on-policy RL is applied during code translation, guided by a rule-based reward signal derived from unit tests. Complementing this coarse-grained rule-based reward, we propose Group Equivalent Preference Optimization (GEPO), a novel preference optimization method. Specifically, GEPO trains the LLM using intermediate representations (IRs) groups. LLMs can be guided to discern IRs equivalent to the source code from inequivalent ones, while also utilizing signals about the mutual equivalence between IRs within the group. This process allows LLMs to capture nuanced aspects of code functionality. By employing OORL for training with code translation tasks, LLMs improve their recognition of code functionality and their understanding of the relationships between code implemented in different languages. Extensive experiments demonstrate that our OORL for LLMs training with code translation tasks achieves significant performance improvements on code benchmarks across multiple programming languages.

[63] Exploring Graph Representations of Logical Forms for Language Modeling

Michael Sullivan

Main category: cs.CL

TL;DR: Language models over logical forms (LFLMs) are more data-efficient than textual models, demonstrated by the GFoLDS prototype, which outperforms BERT on downstream tasks.

DetailsMotivation: To show that LFLMs are more data-efficient and can leverage inherent linguistic knowledge for learning complex patterns.

Method: Introduce GFoLDS, a pretrained LM over graph representations of logical forms, and compare it with textual LMs like BERT.

Result: GFoLDS outperforms BERT on downstream tasks, showing LFLMs require less data and scale well with more parameters and data.

Conclusion: LFLMs, like GFoLDS, are viable for real-world applications due to their data efficiency and scalability.

Abstract: We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs (BERT) pretrained on the same data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.

[64] On the class of coding optimality of human languages and the origins of Zipf’s law

Ramon Ferrer-i-Cancho

Main category: cs.CL

TL;DR: The paper introduces a new class of optimality for coding systems, linking Zipf’s law to linear displacement from optimal coding. It identifies human languages as members of this class and explores conditions for Zipf’s law emergence in compressing systems.

DetailsMotivation: To understand the origins of Zipf's law in coding systems and identify conditions under which it emerges, particularly in human languages and certain animal communication systems.

Method: The study analyzes coding systems, focusing on linear displacement from optimal coding, and examines frequency-rank distributions in double logarithmic scale.

Result: Human languages align with the new class, exhibiting Zipf’s law, while some animal systems (e.g., dolphins, whales) may also qualify. A straight line in frequency-rank plots may indicate coding close to optimality.

Conclusion: Zipf’s law likely stems from compression, and the paper provides testable conditions for its emergence in compressing systems.

Abstract: Here we present a new class of optimality for coding systems. Members of that class are displaced linearly from optimal coding and thus exhibit Zipf’s law, namely a power-law distribution of frequency ranks. Within that class, Zipf’s law, the size-rank law and the size-probability law form a group-like structure. We identify human languages that are members of the class. All languages showing sufficient agreement with Zipf’s law are potential members of the class. In contrast, there are communication systems in other species that cannot be members of that class because they exhibit an exponential distribution instead, although dolphins and humpback whales might qualify. We provide a new insight into plots of frequency versus rank in double logarithmic scale. For any system, a straight line in that scale indicates that the lengths of optimal codes under non-singular coding and under uniquely decodable encoding are displaced by a linear function whose slope is the exponent of Zipf’s law. For systems under compression and constrained to be uniquely decodable, such a straight line may indicate that the system is coding close to optimality. We provide support for the hypothesis that Zipf’s law originates from compression and define testable conditions for the emergence of Zipf’s law in compressing systems.
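
For reference, the two relationships at the heart of the abstract can be stated compactly (notation is ours):

```latex
% Zipf's rank-frequency law and its straight-line form in double
% logarithmic scale, where the slope is the exponent \alpha.
\begin{align}
  f(r) &\propto r^{-\alpha} \\
  \log f(r) &= c - \alpha \log r
\end{align}
```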

[65] RExBench: Can coding agents autonomously implement AI research extensions?

Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

Main category: cs.CL

TL;DR: RExBench evaluates LLM agents’ ability to autonomously implement research extensions, finding current agents fall short without human guidance.

DetailsMotivation: To assess the capability of LLM agents in autonomously extending and implementing research tasks, a critical skill for advanced AI systems.

Method: Introduces RExBench, a benchmark with 12 realistic research extension tasks, evaluated using automatic execution and domain expert instructions.

Result: All nine LLM agents tested failed to autonomously implement most extensions, with success rates below 40% even with human hints.

Conclusion: Current LLM agents lack the ability to handle realistic research extensions without significant human assistance.

Abstract: Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

[66] STACK: Adversarial Attacks on LLM Safeguard Pipelines

Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave

Main category: cs.CL

TL;DR: The paper evaluates AI defense pipelines, introduces a new classifier outperforming existing safeguards, and demonstrates vulnerabilities through a staged attack method (STACK).

DetailsMotivation: To address the unclear security of AI defense pipelines and the lack of prior evaluation or attacks on such systems.

Method: Developed an open-source defense pipeline, tested a novel few-shot-prompted classifier, and introduced the STACK attack procedure.

Result: The new classifier reduced attack success rate (ASR) to 0% on ClearHarm, while STACK achieved 71% ASR in black-box and 33% in transfer settings.

Conclusion: Highlights vulnerabilities in AI defense pipelines and suggests mitigations to counter staged attacks.

Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

[67] Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving

Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou

Main category: cs.CL

TL;DR: Agent KB is a shared knowledge base enabling AI agents to transfer problem-solving strategies and execution lessons, improving performance on tasks like GAIA and SWE-bench.

DetailsMotivation: Current AI agents lack effective learning from each other's experiences or past successes, limiting their problem-solving and error-correction capabilities.

Method: Agent KB uses a teacher-student dual-phase retrieval mechanism for hierarchical knowledge transfer, combining high-level strategies and execution-level refinements.

Result: Agent KB improved success rates by up to 6.06 percentage points on GAIA and raised o3-mini's SWE-bench resolution rate by 8.67 percentage points, both under pass@1.

Conclusion: Agent KB enhances AI agent performance by enabling cross-framework knowledge transfer and diverse reasoning pathways.

Abstract: Current AI agents cannot effectively learn from each other’s problem-solving experiences or use past successes to guide self-reflection and error correction in new tasks. We introduce Agent KB, a shared knowledge base that captures both high-level problem-solving strategies and detailed execution lessons, enabling knowledge transfer across agent frameworks. Agent KB implements a novel teacher-student dual-phase retrieval mechanism where student agents retrieve workflow-level patterns for strategic guidance while teacher agents identify execution-level patterns for refinement. This hierarchical approach enables agents to break out of limited reasoning pathways by incorporating diverse strategies from external sources. Evaluations on the GAIA benchmark demonstrate substantial performance gains, with Agent KB improving success rates by up to 6.06 percentage points overall under pass@1. For SWE-bench code repair tasks, our system significantly improved resolution rates, with o3-mini achieving an 8.67 percentage point gain (23 percent to 31.67 percent) in pass@1.
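
A minimal sketch of the dual-phase retrieval described above, with a hypothetical knowledge-base interface (`kb.search` and its arguments are our invention):

```python
# Teacher-student dual-phase retrieval: the student phase fetches
# workflow-level strategies for the task, the teacher phase fetches
# execution-level lessons matched to what actually happened.
def dual_phase_retrieve(kb, task_description: str, execution_trace: str, top_k: int = 3):
    strategies = kb.search(query=task_description, kind="workflow", k=top_k)
    lessons = kb.search(query=execution_trace, kind="execution", k=top_k)
    return strategies, lessons
```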

[68] From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation

Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee

Main category: cs.CL

TL;DR: The paper introduces two Korean expert-level benchmarks, KMMLU-Redux and KMMLU-Pro, to evaluate LLMs in real-world industrial and professional contexts.

DetailsMotivation: To address the need for robust benchmarks that assess LLMs' applicability in real-world industrial and professional scenarios, particularly in Korea.

Method: Developed KMMLU-Redux (revised from KMMLU, removing errors) and KMMLU-Pro (based on Korean licensure exams) to represent industrial and professional knowledge.

Result: The benchmarks effectively represent industrial knowledge in Korea, as demonstrated by experiments.

Conclusion: The benchmarks are publicly released to aid in evaluating LLMs for real-world applications in Korea.

Abstract: The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We make our dataset publicly available.

[69] Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models

Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gerstenberg, Timothy O’Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson

Main category: cs.CL

TL;DR: The paper explores how people use distributed and symbolic representations to construct mental models for novel situations, proposing a ‘Model Synthesis Architecture’ (MSA) combining language models and probabilistic programs. MSA outperforms language model-only baselines in mimicking human reasoning.

DetailsMotivation: To understand how people draw on diverse background knowledge for coherent reasoning in novel situations and to replicate this ability computationally.

Method: Proposes MSA, combining language models for relevance-based retrieval and probabilistic programs for coherent world models. Evaluated on a novel reasoning dataset (‘Model Olympics’).

Result: MSA captures human judgments better than language model-only baselines, showing improved reasoning over globally relevant variables.

Conclusion: MSA offers a viable approach to replicating human-like, open-ended reasoning, bridging symbolic and distributed representations.

Abstract: When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea – a ``Model Synthesis Architecture’’ (MSA) – using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset – built around a Model Olympics domain of sports vignettes – tests models’ capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people’s ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.

cs.CV

[70] Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives

Yang Zhou, Junjie Li, CongYang Ou, Dawei Yan, Haokui Zhang, Xizhe Xue

Main category: cs.CV

TL;DR: A survey on open-vocabulary object detection (OVOD) in UAV aerial scenes, highlighting its advantages over traditional methods, reviewing existing techniques, datasets, challenges, and future directions.

DetailsMotivation: Traditional UAV aerial object detection is limited to predefined categories, while OVOD, enabled by cross-modal text-image alignment (e.g., CLIP), allows detection of unseen objects via natural language, enhancing UAV intelligence.

Method: Aligns OVOD principles with UAV vision characteristics, constructs a taxonomy of OVOD methods for aerial imagery, and reviews datasets.

Result: Identifies key challenges and open problems in OVOD for UAV scenes, providing a structured overview of current methods.

Conclusion: Outlines future research directions and application prospects, serving as a roadmap for researchers in this evolving field.

Abstract: Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in Unmanned Aerial Vehicles (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text-image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We continue to track related work at https://github.com/zhouyang2002/OVOD-in-UVA-imagery

[71] Low-Light Enhancement via Encoder-Decoder Network with Illumination Guidance

Le-Anh Tran, Chung Nguyen Tran, Ngoc-Luu Nguyen, Nhan Cach Dang, Jordi Carrabina, David Castells-Rufas, Minh Son Nguyen

Main category: cs.CV

TL;DR: EDNIG is a deep learning framework for low-light image enhancement using an encoder-decoder network with illumination guidance and SPP for multi-scale features, optimized via GAN with a composite loss.

DetailsMotivation: To enhance low-light images effectively by focusing on underexposed regions and handling diverse lighting conditions.

Method: Uses U-Net with BCP-derived illumination guidance, SPP for multi-scale features, Swish activation, and GAN optimization with adversarial, MSE, and perceptual losses.

Result: Competes with state-of-the-art methods in metrics and visual quality, with lower complexity.

Conclusion: EDNIG is effective for real-world low-light image enhancement, with open-source code available.

Abstract: This paper introduces a novel deep learning framework for low-light image enhancement, named the Encoder-Decoder Network with Illumination Guidance (EDNIG). Building upon the U-Net architecture, EDNIG integrates an illumination map, derived from Bright Channel Prior (BCP), as a guidance input. This illumination guidance helps the network focus on underexposed regions, effectively steering the enhancement process. To further improve the model’s representational power, a Spatial Pyramid Pooling (SPP) module is incorporated to extract multi-scale contextual features, enabling better handling of diverse lighting conditions. Additionally, the Swish activation function is employed to ensure smoother gradient propagation during training. EDNIG is optimized within a Generative Adversarial Network (GAN) framework using a composite loss function that combines adversarial loss, pixel-wise mean squared error (MSE), and perceptual loss. Experimental results show that EDNIG achieves competitive performance compared to state-of-the-art methods in quantitative metrics and visual quality, while maintaining lower model complexity, demonstrating its suitability for real-world applications. The source code for this work is available at https://github.com/tranleanh/ednig.
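
The composite objective the abstract lists combines adversarial, pixel-wise MSE, and perceptual terms; a sketch follows, where the loss weights and the VGG16 feature extractor for the perceptual term are conventional choices we assume rather than the paper's exact settings:

```python
# Composite generator loss: adversarial + pixel MSE + perceptual (VGG features).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_feats = vgg16(weights="DEFAULT").features[:16].eval()  # frozen feature extractor
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def generator_loss(fake_logits, enhanced, target, w_adv=1e-3, w_mse=1.0, w_perc=0.1):
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    mse = F.mse_loss(enhanced, target)                         # pixel-wise fidelity
    perc = F.mse_loss(vgg_feats(enhanced), vgg_feats(target))  # perceptual similarity
    return w_adv * adv + w_mse * mse + w_perc * perc
```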

[72] HeCoFuse: Cross-Modal Complementary V2X Cooperative Perception with Heterogeneous Sensors

Chuheng Wei, Ziye Qin, Walter Zimmer, Guoyuan Wu, Matthew J. Barth

Main category: cs.CV

TL;DR: HeCoFuse is a unified framework for cooperative perception in V2X systems with heterogeneous sensor setups, using adaptive feature fusion and learning strategies to improve performance.

DetailsMotivation: Address challenges like feature misalignment and imbalanced representation in V2X systems due to heterogeneous sensor configurations.

Method: Hierarchical fusion with channel-wise and spatial attention, adaptive spatial resolution adjustment, and cooperative learning for dynamic fusion.

Result: Achieves 43.22% 3D mAP (LC+LC) and 43.38% (L+LC), outperforming baselines and maintaining robustness across nine configurations.

Conclusion: HeCoFuse is state-of-the-art, validated by CVPR 2025 DriveX challenge, and robust for diverse V2X sensor deployments.

Abstract: Real-world Vehicle-to-Everything (V2X) cooperative perception systems often operate under heterogeneous sensor configurations due to cost constraints and deployment variability across vehicles and infrastructure. This heterogeneity poses significant challenges for feature fusion and perception reliability. To address these issues, we propose HeCoFuse, a unified framework designed for cooperative perception across mixed sensor setups where nodes may carry Cameras (C), LiDARs (L), or both. By introducing a hierarchical fusion mechanism that adaptively weights features through a combination of channel-wise and spatial attention, HeCoFuse can tackle critical challenges such as cross-modality feature misalignment and imbalanced representation quality. In addition, an adaptive spatial resolution adjustment module is employed to balance computational cost and fusion effectiveness. To enhance robustness across different configurations, we further implement a cooperative learning strategy that dynamically adjusts fusion type based on available modalities. Experiments on the real-world TUMTraf-V2X dataset demonstrate that HeCoFuse achieves 43.22% 3D mAP under the full sensor configuration (LC+LC), outperforming the CoopDet3D baseline by 1.17%, and reaches an even higher 43.38% 3D mAP in the L+LC scenario, while maintaining 3D mAP in the range of 21.74% to 43.38% across nine heterogeneous sensor configurations. These results, validated by our first-place finish in the CVPR 2025 DriveX challenge, establish HeCoFuse as the current state-of-the-art on the TUMTraf-V2X dataset while demonstrating robust performance across diverse sensor deployments.
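
As a rough illustration of weighting fused features with channel-wise and spatial attention (the actual HeCoFuse architecture is more involved; everything below is a simplified assumption):

```python
# Fuse camera and LiDAR feature maps with spatial then channel attention.
import torch
import torch.nn as nn

class AttnFuse(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, cam_feat, lidar_feat):        # each: (B, C, H, W)
        x = torch.cat([cam_feat, lidar_feat], dim=1)
        x = x * self.spatial_gate(x)                # where to look
        return self.proj(x) * self.channel_gate(x)  # which channels matter

fuse = AttnFuse(64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```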

[73] VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Shmuel Berman, Jia Deng

Main category: cs.CV

TL;DR: VLMs struggle with nonlocal visual reasoning tasks, despite excelling in complex visual tasks. Flagship models fail comparative perception, saccadic search, and smooth visual search tasks, highlighting a gap in core visual reasoning capabilities.

DetailsMotivation: To evaluate VLMs' capacity for nonlocal visual reasoning, isolating distinct forms like comparative perception, saccadic search, and smooth visual search.

Method: Presented an evaluation suite testing VLMs on nonlocal reasoning tasks, comparing performance of flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini) to human accuracy.

Result: Flagship models failed the tests, barely exceeding random accuracy, while humans found the tasks trivial.

Conclusion: Current VLMs lack core visual reasoning capabilities despite advances in raw visual acuity, revealing a critical limitation in their design.

Abstract: Visual Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models’ capacity for nonlocal visual reasoning – reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves searching smoothly along a continuous contour. Flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test if VLMs can perform similar visual algorithms to humans. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.

[74] Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning

Binbin Ji, Siddharth Agrawal, Qiance Tang, Yvonne Wu

Main category: cs.CV

TL;DR: The study explores spatial reasoning in VLMs using CoT prompting and reinforcement learning, finding structured SceneGraph CoT and GRPO fine-tuning improve accuracy and robustness.

DetailsMotivation: To understand and enhance the spatial reasoning capabilities of VLMs, addressing limitations of simple CoT prompting and overfitting in supervised fine-tuning.

Method: Evaluated CoT prompting strategies, introduced SceneGraph CoT, and fine-tuned models using GRPO on the SAT dataset, tested on CVBench.

Result: SceneGraph CoT outperforms simple CoT, and GRPO achieves higher accuracy and robustness compared to SFT, especially under OOD conditions.

Conclusion: Structured prompting and reinforcement learning (GRPO) significantly improve spatial reasoning and generalization in VLMs.

Abstract: This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought (CoT) prompting and reinforcement learning. We begin by evaluating the impact of different prompting strategies and find that simple CoT formats, where the model generates a reasoning step before the answer, not only fail to help, but can even harm the model’s original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy. Furthermore, to improve spatial reasoning ability, we fine-tune models using Group Relative Policy Optimization (GRPO) on the SAT dataset and evaluate their performance on CVBench. Compared to supervised fine-tuning (SFT), GRPO achieves higher accuracy on Pass@1 evaluations and demonstrates superior robustness under out-of-distribution (OOD) conditions. In particular, we find that SFT overfits to surface-level linguistic patterns and may degrade performance when test-time phrasing changes (e.g., from “closer to” to “farther from”). GRPO, on the other hand, generalizes more reliably and maintains stable performance under such shifts. Our findings provide insights into how reinforcement learning and structured prompting improve the spatial reasoning capabilities and generalization behavior of modern VLMs. All code is open source at: https://github.com/Yvonne511/spatial-vlm-investigator
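
A sketch of the structured SceneGraph CoT idea: prompt the model to emit a scene graph first, then reason over it. The template wording here is ours, not the paper's:

```python
# Hypothetical SceneGraph CoT template: scene graph first, answer second.
SCENEGRAPH_COT = (
    "Step 1: List the objects in the image and their pairwise spatial "
    "relations as a scene graph, e.g. (cup, left-of, laptop).\n"
    "Step 2: Answer the question using only the scene graph.\n"
    "Question: {question}"
)

print(SCENEGRAPH_COT.format(
    question="Is the cup closer to the camera than the laptop?"))
```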

[75] TimeNeRF: Building Generalizable Neural Radiance Fields across Time from Few-Shot Input Views

Hsiang-Hui Hung, Huu-Phu Do, Yung-Hui Li, Ching-Chun Huang

Main category: cs.CV

TL;DR: TimeNeRF is a neural rendering method for generating novel views at any viewpoint and time with few input views, addressing the need for efficient, immersive 3D scene modeling.

DetailsMotivation: The need for efficient, immersive 3D scene modeling in applications like the metaverse, where transitioning between day and night is crucial, drives this work. Current NeRF-based methods lack dedicated datasets and exploration for temporal modeling.

Method: Combines multi-view stereo, neural radiance fields (NeRF), and disentanglement strategies to enable generalizability in few-shot settings, construct implicit content radiance fields, and build NeRFs at arbitrary times.

Result: TimeNeRF renders novel views without per-scene optimization and excels in smooth temporal transitions, capturing natural scene changes like dawn to dusk.

Conclusion: TimeNeRF advances neural rendering by enabling efficient, generalizable temporal 3D scene modeling with few input views.

Abstract: We present TimeNeRF, a generalizable neural rendering approach for rendering novel views at arbitrary viewpoints and at arbitrary times, even with few input views. For real-world applications, it is expensive to collect multiple views and inefficient to re-optimize for unseen scenes. Moreover, as the digital realm, particularly the metaverse, strives for increasingly immersive experiences, the ability to model 3D environments that naturally transition between day and night becomes paramount. While current techniques based on Neural Radiance Fields (NeRF) have shown remarkable proficiency in synthesizing novel views, the exploration of NeRF’s potential for temporal 3D scene modeling remains limited, with no dedicated datasets available for this purpose. To this end, our approach harnesses the strengths of multi-view stereo, neural radiance fields, and disentanglement strategies across diverse datasets. This equips our model with the capability for generalizability in a few-shot setting, allows us to construct an implicit content radiance field for scene representation, and further enables the building of neural radiance fields at any arbitrary time. Finally, we synthesize novel views of that time via volume rendering. Experiments show that TimeNeRF can render novel views in a few-shot setting without per-scene optimization. Most notably, it excels in creating realistic novel views that transition smoothly across different times, adeptly capturing intricate natural scene changes from dawn to dusk.
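
The final rendering step is the standard NeRF volume-rendering integral in its discrete form (standard notation, not specific to TimeNeRF):

```latex
% Colour of a ray as a transmittance-weighted sum over N samples, where
% \sigma_i is density, \mathbf{c}_i colour, and \delta_i the sample spacing.
\begin{equation}
  \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i,
  \qquad
  T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr)
\end{equation}
```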

[76] Just Add Geometry: Gradient-Free Open-Vocabulary 3D Detection Without Human-in-the-Loop

Atharv Goel, Mehar Khurana

Main category: cs.CV

TL;DR: The paper introduces a method for open-vocabulary 3D object detection using 2D vision-language models without human-annotated 3D labels, achieving competitive performance in various settings.

DetailsMotivation: Existing 3D object detection datasets are limited by narrow taxonomies and costly annotations, hindering scalability. The work leverages 2D foundation models' rich semantic understanding for open-world 3D detection.

Method: The pipeline uses a 2D vision-language detector for text-conditioned proposals, segments them with SAM, and back-projects into 3D using camera geometry and pseudo-depth. A geometric inflation strategy infers 3D bounding boxes without training.

Result: The method achieves competitive localization performance in LiDAR-based and RGB-D settings, remaining training-free and open-vocabulary.

Conclusion: The work demonstrates the potential of 2D foundation models for scalable 3D perception, with open-sourced code and resources.

Abstract: Modern 3D object detection datasets are constrained by narrow class taxonomies and costly manual annotations, limiting their ability to scale to open-world settings. In contrast, 2D vision-language models trained on web-scale image-text pairs exhibit rich semantic understanding and support open-vocabulary detection via natural language prompts. In this work, we leverage the maturity and category diversity of 2D foundation models to perform open-vocabulary 3D object detection without any human-annotated 3D labels. Our pipeline uses a 2D vision-language detector to generate text-conditioned proposals, which are segmented with SAM and back-projected into 3D using camera geometry and either LiDAR or monocular pseudo-depth. We introduce a geometric inflation strategy based on DBSCAN clustering and Rotating Calipers to infer 3D bounding boxes without training. To simulate adverse real-world conditions, we construct Pseudo-nuScenes, a fog-augmented, RGB-only variant of the nuScenes dataset. Experiments demonstrate that our method achieves competitive localization performance across multiple settings, including LiDAR-based and purely RGB-D inputs, all while remaining training-free and open-vocabulary. Our results highlight the untapped potential of 2D foundation models for scalable 3D perception. We open-source our code and resources at https://github.com/atharv0goel/open-world-3D-det.
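
The back-projection step the pipeline depends on is plain pinhole geometry; the intrinsic values below are illustrative:

```python
# Back-project a pixel (u, v) with metric depth into a 3D camera-frame point.
import numpy as np

K = np.array([[1000.0,    0.0, 640.0],   # fx,  0, cx
              [   0.0, 1000.0, 360.0],   #  0, fy, cy
              [   0.0,    0.0,   1.0]])

def backproject(u: float, v: float, depth: float) -> np.ndarray:
    return depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))

print(backproject(800.0, 400.0, 12.5))  # a point 12.5 m in front of the camera
```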

[77] OmniVec2 – A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning

Siddharth Srivastava, Gaurav Sharma

Main category: cs.CV

TL;DR: A novel multimodal multitask network with a shared transformer architecture and cross-attention mechanisms achieves state-of-the-art performance across 12 modalities and 25 datasets.

DetailsMotivation: To address the challenge of processing and integrating diverse data modalities (e.g., image, video, audio) into a unified framework for multitask learning.

Method: Uses modality-specific tokenizers, a shared transformer, and cross-attention. Introduces iterative modality switching for pretraining and a training algorithm balancing joint and pairwise modality training.

Result: Demonstrates state-of-the-art performance across 25 datasets from 12 modalities.

Conclusion: The proposed architecture, pretraining strategy, and training algorithm effectively handle multimodal multitask scenarios.

Abstract: We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities, namely image, video, audio, text, depth, point cloud, time series, tabular, graph, X-ray, infrared, IMU, and hyperspectral. The proposed approach utilizes modality-specialized tokenizers, a shared transformer architecture, and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network, and a training algorithm which trades off fully joint training over all modalities with training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state-of-the-art performance, demonstrating the effectiveness of the proposed architecture, pretraining strategy, and adapted multitask training.
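To make the architecture concrete, here is a toy skeleton of the tokenizer/shared-transformer/task-head layout (a sketch of the general pattern only; dimensions, the modality set, and the head design are invented for illustration):

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Per-modality tokenizers feeding one shared transformer encoder."""
    def __init__(self, dims={"image": 768, "audio": 128}, d_model=256, n_classes=10):
        super().__init__()
        self.tokenizers = nn.ModuleDict(
            {m: nn.Linear(d, d_model) for m, d in dims.items()})
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)   # shared weights
        self.heads = nn.ModuleDict(
            {m: nn.Linear(d_model, n_classes) for m in dims})       # per-modality heads

    def forward(self, x, modality):
        tokens = self.tokenizers[modality](x)   # (B, T, d_model)
        pooled = self.encoder(tokens).mean(1)   # pooled unified embedding
        return self.heads[modality](pooled)
```

The pretraining idea of iterative modality switching then amounts to cycling which modality’s batches (and head) are active at each step while the shared encoder accumulates gradients from all of them.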

[78] Transformer-Based Framework for Motion Capture Denoising and Anomaly Detection in Medical Rehabilitation

Yeming Cai, Yang Wang, Zhenglin Li

Main category: cs.CV

TL;DR: A deep learning framework combines motion capture and Transformers for medical rehab, addressing noise, missing data, and real-time anomaly detection.

DetailsMotivation: Improve rehabilitation by handling noisy/missing motion data and ensuring patient safety with real-time anomaly detection.

Method: End-to-end deep learning with Transformer-based temporal sequence modeling for denoising and data completion.

Result: Outperforms on stroke/orthopedic datasets in reconstruction and anomaly detection, enabling scalable remote rehab.

Conclusion: The framework offers robust, cost-effective rehab solutions with minimal supervision.

Abstract: This paper proposes an end-to-end deep learning framework integrating optical motion capture with a Transformer-based model to enhance medical rehabilitation. It tackles data noise and missing data caused by occlusion and environmental factors, while detecting abnormal movements in real time to ensure patient safety. Utilizing temporal sequence modeling, our framework denoises and completes motion capture data, improving robustness. Evaluations on stroke and orthopedic rehabilitation datasets show superior performance in data reconstruction and anomaly detection, providing a scalable, cost-effective solution for remote rehabilitation with reduced on-site supervision.
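A minimal version of the denoising/completion idea can be sketched as a masked-sequence Transformer (an illustration of the general recipe, not the paper’s architecture; marker count and layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class MocapDenoiser(nn.Module):
    """Denoise and complete marker sequences; missing frames are flagged
    by a binary mask and zero-filled on input."""
    def __init__(self, n_markers=39, d_model=128):
        super().__init__()
        self.inp = nn.Linear(n_markers * 3 + 1, d_model)   # +1 for the mask bit
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, n_markers * 3)

    def forward(self, noisy_seq, observed_mask):
        # noisy_seq: (B, T, n_markers*3); observed_mask: (B, T, 1), 0 = missing
        x = torch.cat([noisy_seq * observed_mask, observed_mask], dim=-1)
        return self.out(self.enc(self.inp(x)))             # reconstructed sequence
```

Anomaly detection then falls out naturally: frames whose reconstruction error exceeds a calibrated threshold are flagged as abnormal movements.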

[79] Enhancing Breast Cancer Detection with Vision Transformers and Graph Neural Networks

Yeming Cai, Zhenglin Li, Yang Wang

Main category: cs.CV

TL;DR: A novel framework combining Vision Transformers (ViT) and Graph Neural Networks (GNN) improves breast cancer detection with 84.2% accuracy, offering interpretable insights for radiologists.

DetailsMotivation: Early detection of breast cancer is crucial for survival, and current methods need improvement in accuracy and interpretability.

Method: The framework integrates ViT for global image features and GNN for structural relationships, tested on the CBIS-DDSM dataset.

Result: Achieves 84.2% accuracy, surpassing traditional methods, with interpretable attention heatmaps.

Conclusion: The proposed framework enhances detection accuracy and provides clinical utility through interpretability.

Abstract: Breast cancer is a leading cause of death among women globally, and early detection is critical for improving survival rates. This paper introduces an innovative framework that integrates Vision Transformers (ViT) and Graph Neural Networks (GNN) to enhance breast cancer detection using the CBIS-DDSM dataset. Our framework leverages ViT’s ability to capture global image features and GNN’s strength in modeling structural relationships, achieving an accuracy of 84.2%, outperforming traditional methods. Additionally, interpretable attention heatmaps provide insights into the model’s decision-making process, aiding radiologists in clinical settings.
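The ViT-plus-GNN fusion can be illustrated with one round of message passing over a patch-adjacency graph (plain PyTorch for self-containment; the paper’s exact graph construction and fusion are not specified here, so treat this as a hypothetical layout):

```python
import torch
import torch.nn as nn

class ViTGNNFusion(nn.Module):
    """Fuse global ViT patch features with graph-aggregated structural context."""
    def __init__(self, d=768, n_classes=2):
        super().__init__()
        self.msg = nn.Linear(d, d)
        self.cls = nn.Linear(2 * d, n_classes)

    def forward(self, patch_feats, adj):
        # patch_feats: (B, N, d) ViT patch embeddings
        # adj: (N, N) row-normalized patch adjacency (e.g., 8-neighborhood)
        neighbors = torch.einsum("ij,bjd->bid", adj, patch_feats)
        gnn_feats = torch.relu(self.msg(neighbors))          # structural context
        fused = torch.cat([patch_feats.mean(1), gnn_feats.mean(1)], dim=-1)
        return self.cls(fused)
```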

[80] Butter: Frequency Consistency and Hierarchical Fusion for Autonomous Driving Object Detection

Xiaojian Lin, Wenxin Zhang, Yuchu Jiang, Wangyu Wu, Yiran Guo, Kangxu Wang, Zongzheng Zhang, Guijin Wang, Lei Jin, Hao Zhao

Main category: cs.CV

TL;DR: Butter is a novel object detection framework for autonomous driving, enhancing hierarchical feature representation with FAFCE and PHFFNet, improving accuracy and efficiency.

DetailsMotivation: Existing architectures like YOLO and DETR struggle with feature consistency across scales and balancing precision with computational efficiency in dynamic environments.

Method: Butter introduces FAFCE for multi-scale feature consistency and PHFFNet for progressive hierarchical feature fusion.

Result: Experiments on BDD100K, KITTI, and Cityscapes show improved detection accuracy and reduced complexity.

Conclusion: Butter balances accuracy, deployability, and efficiency, advancing real-time object detection for autonomous driving.

Abstract: Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations for improving detection robustness. Specifically, Butter introduces two key innovations: Frequency-Adaptive Feature Consistency Enhancement (FAFCE) Component, which refines multi-scale feature consistency by leveraging adaptive frequency filtering to enhance structural and boundary precision, and Progressive Hierarchical Feature Fusion Network (PHFFNet) Module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. Through extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, leading to notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter provides an advanced approach to object detection that achieves a balance between accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.
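The frequency-adaptive idea behind FAFCE can be previewed with a learned Fourier-domain gate over feature maps (a toy stand-in, not the published component; shapes are fixed for clarity):

```python
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """Reweight each spatial frequency of a feature map with a learned gate."""
    def __init__(self, channels, h, w):
        super().__init__()
        # rfft2 halves the last dimension: width becomes w // 2 + 1
        self.gate = nn.Parameter(torch.ones(channels, h, w // 2 + 1))

    def forward(self, x):                        # x: (B, C, H, W) feature map
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * torch.sigmoid(self.gate)   # suppress or boost frequencies
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
```

Gating high frequencies up or down in this way is one concrete mechanism for sharpening structural and boundary detail while keeping multi-scale features consistent.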

[81] Smart Routing for Multimodal Video Retrieval: When to Search What

Kevin Dela Rosa

Main category: cs.CV

TL;DR: ModaRoute is an LLM-based routing system for multimodal video retrieval, reducing computational costs by 41% while maintaining competitive performance.

DetailsMotivation: Existing methods like dense text captions are expensive and miss critical visual information, necessitating a smarter routing solution.

Method: ModaRoute uses GPT-4.1 to analyze query intent and dynamically select optimal modalities (ASR, OCR, visual indices), averaging 1.78 modalities per query.

Result: Achieves 60.9% Recall@5, reduces computational overhead by 41%, and scales effectively on 1.8M video clips.

Conclusion: Intelligent routing with ModaRoute offers a practical, cost-effective solution for scalable multimodal retrieval systems.

Abstract: We introduce ModaRoute, an LLM-based intelligent routing system that dynamically selects optimal modalities for multimodal video retrieval. While dense text captions can achieve 75.9% Recall@5, they require expensive offline processing and miss critical visual information present in 34% of clips with scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (text), and visual indices, averaging 1.78 modalities per query versus exhaustive 3.0 modality search. Evaluation on 1.8M video clips demonstrates that intelligent routing provides a practical solution for scaling multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
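The routing step itself is small: the LLM sees the query and returns the subset of indices to search. A hedged sketch (the prompt wording and JSON contract are invented; `call_llm` stands in for whatever client invokes GPT-4.1):

```python
import json

ROUTING_PROMPT = """You route video-search queries to indices.
Query: "{query}"
Reply with JSON only: {{"modalities": [...]}}, choosing from "asr", "ocr", "visual"."""

def route_query(query, call_llm):
    """call_llm: any function mapping a prompt string to the model's text reply."""
    try:
        reply = call_llm(ROUTING_PROMPT.format(query=query))
        chosen = set(json.loads(reply)["modalities"]) & {"asr", "ocr", "visual"}
    except (ValueError, KeyError):
        chosen = set()
    # Fall back to exhaustive search if the reply is empty or malformed.
    return sorted(chosen) or ["asr", "ocr", "visual"]
```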

[82] A Comprehensive Survey for Real-World Industrial Defect Detection: Challenges, Approaches, and Prospects

Yuqi Cheng, Yunkang Cao, Haiming Yao, Wei Luo, Cheng Jiang, Hui Zhang, Weiming Shen

Main category: cs.CV

TL;DR: A survey on industrial defect detection, comparing closed-set and open-set methods in 2D/3D modalities, highlighting open-set advancements and challenges.

DetailsMotivation: Conventional inspection methods fall short in meeting modern manufacturing demands, prompting the need for advanced, scalable defect detection techniques.

Method: In-depth analysis of closed-set and open-set defect detection strategies in 2D and 3D modalities, tracking recent developments.

Result: Open-set frameworks reduce reliance on extensive annotations and improve anomaly recognition, gaining prominence in the field.

Conclusion: The survey provides a comprehensive overview of industrial defect detection, emphasizing open-set methods and identifying key challenges and trends.

Abstract: Industrial defect detection is vital for upholding product quality across contemporary manufacturing systems. As the expectations for precision, automation, and scalability intensify, conventional inspection approaches are increasingly found wanting in addressing real-world demands. Notable progress in computer vision and deep learning has substantially bolstered defect detection capabilities across both 2D and 3D modalities. A significant development has been the pivot from closed-set to open-set defect detection frameworks, which diminishes the necessity for extensive defect annotations and facilitates the recognition of novel anomalies. Despite such strides, a cohesive and contemporary understanding of industrial defect detection remains elusive. Consequently, this survey delivers an in-depth analysis of both closed-set and open-set defect detection strategies within 2D and 3D modalities, charting their evolution in recent years and underscoring the rising prominence of open-set techniques. We distill critical challenges inherent in practical detection environments and illuminate emerging trends, thereby providing a current and comprehensive vista of this swiftly progressing field.

[83] Using Multiple Input Modalities Can Improve Data-Efficiency and O.O.D. Generalization for ML with Satellite Imagery

Arjun Rao, Esther Rolf

Main category: cs.CV

TL;DR: The paper explores the impact of integrating additional geospatial data layers with optical satellite imagery in supervised learning tasks, finding significant performance improvements, especially in data-limited and out-of-sample scenarios.

DetailsMotivation: To understand the value of combining non-optical geospatial data (e.g., elevation, temperature) with optical satellite imagery in machine learning models for geospatial tasks.

Method: Augmented benchmark datasets by appending additional geographic data layers to existing tasks (classification, regression, segmentation) and compared model performance with and without these inputs.

Result: Fusing additional geographic inputs with optical imagery improves model performance, particularly in data-limited and out-of-sample settings. Hard-coded fusion strategies outperformed learned ones.

Conclusion: Multi-modal inputs enhance data-efficiency and out-of-sample performance in satellite imagery-based machine learning, with hard-coded fusion being unexpectedly effective.

Abstract: A large variety of geospatial data layers is available around the world ranging from remotely-sensed raster data like satellite imagery, digital elevation models, predicted land cover maps, and human-annotated data, to data derived from environmental sensors such as air temperature or wind speed data. A large majority of machine learning models trained on satellite imagery (SatML), however, are designed primarily for optical input modalities such as multi-spectral satellite imagery. To better understand the value of using other input modalities alongside optical imagery in supervised learning settings, we generate augmented versions of SatML benchmark tasks by appending additional geographic data layers to datasets spanning classification, regression, and segmentation. Using these augmented datasets, we find that fusing additional geographic inputs with optical imagery can significantly improve SatML model performance. Benefits are largest in settings where labeled data are limited and in geographic out-of-sample settings, suggesting that multi-modal inputs may be especially valuable for data-efficiency and out-of-sample performance of SatML models. Surprisingly, we find that hard-coded fusion strategies outperform learned variants, with interesting implications for future work.
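The hard-coded fusion that performed surprisingly well is essentially input-level channel stacking. A minimal sketch (the normalization scheme is an assumption; layers must already be co-registered and resampled to the optical grid):

```python
import numpy as np

def fuse_inputs(optical, extra_layers):
    """Append geographic layers (elevation, temperature, ...) as extra channels.

    optical: (C, H, W) multi-spectral image
    extra_layers: list of (H, W) rasters aligned to the same grid
    """
    extras = [(l - l.mean()) / (l.std() + 1e-8) for l in extra_layers]  # z-score
    return np.concatenate([optical, np.stack(extras)], axis=0)  # (C + K, H, W)
```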

[84] From Binary to Semantic: Utilizing Large-Scale Binary Occupancy Data for 3D Semantic Occupancy Prediction

Chihiro Noguchi, Takaki Yamamoto

Main category: cs.CV

TL;DR: The paper proposes a framework leveraging binary occupancy data for 3D semantic occupancy prediction, improving accuracy and reducing annotation costs.

DetailsMotivation: High annotation costs for LiDAR-based semantic occupancy prediction and the availability of cheaper binary occupancy data motivate exploring its use.

Method: Decomposes prediction into binary and semantic occupancy modules, utilizing binary data for pre-training and auto-labeling.

Result: Outperforms existing methods in pre-training and auto-labeling tasks, enhancing 3D semantic occupancy prediction.

Conclusion: The framework effectively leverages binary occupancy data, improving performance and reducing reliance on costly annotations.

Abstract: Accurate perception of the surrounding environment is essential for safe autonomous driving. 3D occupancy prediction, which estimates detailed 3D structures of roads, buildings, and other objects, is particularly important for vision-centric autonomous driving systems that do not rely on LiDAR sensors. However, in 3D semantic occupancy prediction – where each voxel is assigned a semantic label – annotated LiDAR point clouds are required, making data acquisition costly. In contrast, large-scale binary occupancy data, which only indicate occupied or free space without semantic labels, can be collected at a lower cost. Despite their availability, the potential of leveraging such data remains unexplored. In this study, we investigate the utilization of large-scale binary occupancy data from two perspectives: (1) pre-training and (2) learning-based auto-labeling. We propose a novel binary occupancy-based framework that decomposes the prediction process into binary and semantic occupancy modules, enabling effective use of binary occupancy data. Our experimental results demonstrate that the proposed framework outperforms existing methods in both pre-training and auto-labeling tasks, highlighting its effectiveness in enhancing 3D semantic occupancy prediction. The code is available at https://github.com/ToyotaInfoTech/b2s-occupancy
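The decomposition can be pictured as two heads over shared voxel features, with the binary head pre-trainable on cheap unlabeled occupancy data (an illustrative sketch; the paper’s actual modules and fusion are more involved):

```python
import torch
import torch.nn as nn

class DecomposedOccupancy(nn.Module):
    """Binary occupied-vs-free head plus a semantic head over occupied space."""
    def __init__(self, feat_dim=64, n_classes=17):
        super().__init__()
        self.binary_head = nn.Conv3d(feat_dim, 1, kernel_size=1)
        self.semantic_head = nn.Conv3d(feat_dim, n_classes, kernel_size=1)

    def forward(self, voxel_feats):                # (B, F, X, Y, Z)
        occ = torch.sigmoid(self.binary_head(voxel_feats))     # P(occupied)
        sem = self.semantic_head(voxel_feats).softmax(dim=1)   # P(class | occupied)
        return occ * sem                           # per-class occupancy probabilities
```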

[85] Minimalist Concept Erasure in Generative Models

Yang Zhang, Er Jin, Yanfei Dong, Yixuan Wu, Philip Torr, Ashkan Khakzar, Johannes Stegmaier, Kenji Kawaguchi

Main category: cs.CV

TL;DR: A minimalist approach for concept erasure in generative models is proposed, focusing on distributional distance of outputs, avoiding excessive modifications, and maintaining model utility.

DetailsMotivation: Address safety and copyright concerns in generative models by erasing unwanted concepts without compromising model performance.

Method: Formulate a novel minimalist concept erasure objective based on distributional distance, derive a tractable loss for end-to-end optimization, and incorporate neuron masking for robustness.

Result: Empirical evaluations show robust concept erasure without degrading model performance in state-of-the-art flow-matching models.

Conclusion: The method enables safer and more responsible generative models by effectively erasing unwanted concepts while preserving utility.

Abstract: Recent advances in generative models have demonstrated remarkable capabilities in producing high-quality images, but their reliance on large-scale unlabeled data has raised significant safety and copyright concerns. Efforts to address these issues by erasing unwanted concepts have shown promise. However, many existing erasure methods involve excessive modifications that compromise the overall utility of the model. In this work, we address these issues by formulating a novel minimalist concept erasure objective based only on the distributional distance of final generation outputs. Building on our formulation, we derive a tractable loss for differentiable optimization that leverages backpropagation through all generation steps in an end-to-end manner. We also conduct extensive analysis to show theoretical connections with other models and methods. To improve the robustness of the erasure, we incorporate neuron masking as an alternative to model fine-tuning. Empirical evaluations on state-of-the-art flow-matching models demonstrate that our method robustly erases concepts without degrading overall model performance, paving the way for safer and more responsible generative models.
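Neuron masking as an alternative to fine-tuning can be sketched as a learnable gate per output unit, optimized under the erasure loss while the original weights stay frozen (a minimal illustration, not the paper’s exact parameterization):

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Wrap a frozen linear layer with a trainable per-neuron gate."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear.requires_grad_(False)   # weights stay fixed
        self.logits = nn.Parameter(torch.zeros(linear.out_features))

    def forward(self, x):
        gate = torch.sigmoid(self.logits)            # ~1 keeps, ~0 silences a neuron
        return self.linear(x) * gate
```

Only the gate logits are optimized, so the edit is small, reversible, and less likely to damage unrelated capabilities than full fine-tuning.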

[86] Tackling fake images in cybersecurity – Interpretation of a StyleGAN and lifting its black-box

Julia Laubmann, Johannes Reschke

Main category: cs.CV

TL;DR: Analysis of StyleGAN’s generator reveals weight pruning reduces computation without major output loss, while latent vector manipulation allows precise facial feature control, raising ethical concerns.

DetailsMotivation: To understand StyleGAN's inner workings and its potential for misuse in generating realistic synthetic faces.

Method: Analyzed StyleGAN’s generator, explored key techniques like Equalized Learning Rate, trained a model in PyTorch, pruned weights, and examined latent vector manipulation.

Result: Pruning reduces computational needs; latent vector changes allow precise facial feature control, highlighting ethical risks.

Conclusion: StyleGAN’s capabilities pose ethical concerns due to potential misuse in creating fake identities.

Abstract: In today’s digital age, concerns about the dangers of AI-generated images are increasingly common. One powerful tool in this domain is StyleGAN (style-based generative adversarial networks), a generative adversarial network capable of producing highly realistic synthetic faces. To gain a deeper understanding of how such a model operates, this work focuses on analyzing the inner workings of StyleGAN’s generator component. Key architectural elements and techniques, such as the Equalized Learning Rate, are explored in detail to shed light on the model’s behavior. A StyleGAN model is trained using the PyTorch framework, enabling direct inspection of its learned weights. Through pruning, it is revealed that a significant number of these weights can be removed without drastically affecting the output, leading to reduced computational requirements. Moreover, the role of the latent vector – which heavily influences the appearance of the generated faces – is closely examined. Global alterations to this vector primarily affect aspects like color tones, while targeted changes to individual dimensions allow for precise manipulation of specific facial features. This ability to finetune visual traits is not only of academic interest but also highlights a serious ethical concern: the potential misuse of such technology. Malicious actors could exploit this capability to fabricate convincing fake identities, posing significant risks in the context of digital deception and cybercrime.
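The pruning probe described above is, in essence, global magnitude pruning. A compact sketch (threshold selection simplified; not the authors’ exact procedure):

```python
import torch

def prune_by_magnitude(model, fraction=0.3):
    """Zero out the smallest-magnitude weights across the whole generator."""
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(fraction * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values        # k-th smallest magnitude
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((p.abs() > threshold).to(p.dtype))
```

The latent-space probe is even simpler: perturbing a single dimension of the latent vector, e.g. `w_edit = w.clone(); w_edit[:, dim] += step`, isolates the facial attribute that dimension controls.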

[87] InSyn: Modeling Complex Interactions for Pedestrian Trajectory Prediction

Kaiyuan Zhai, Juan Chen, Chao Wang, Zeyi Xu

Main category: cs.CV

TL;DR: Proposes InSyn, a Transformer-based model for pedestrian trajectory prediction, capturing diverse interaction patterns and improving accuracy in crowded scenarios. Introduces SSOS training to reduce initial-step errors.

DetailsMotivation: Existing methods overlook specific pedestrian interaction patterns, limiting prediction accuracy in crowded scenarios.

Method: InSyn (Interaction-Synchronization Network) uses Transformers to model diverse interactions and direction-sensitive behaviors. SSOS training strategy reduces initial-step divergence.

Result: Outperforms baselines on ETH and UCY datasets, especially in high-density scenarios. SSOS reduces initial-step error by ~6.58%.

Conclusion: InSyn and SSOS effectively improve pedestrian trajectory prediction, addressing interaction modeling and initial-step divergence.

Abstract: Accurate pedestrian trajectory prediction is crucial for intelligent applications, yet it remains highly challenging due to the complexity of interactions among pedestrians. Previous methods have primarily relied on relative positions to model pedestrian interactions; however, they tend to overlook specific interaction patterns such as paired walking or conflicting behaviors, limiting the prediction accuracy in crowded scenarios. To address this issue, we propose InSyn (Interaction-Synchronization Network), a novel Transformer-based model that explicitly captures diverse interaction patterns (e.g., walking in sync or conflicting) while effectively modeling direction-sensitive social behaviors. Additionally, we introduce a training strategy termed Seq-Start of Seq (SSOS), designed to alleviate the common issue of initial-step divergence in numerical time-series prediction. Experiments on the ETH and UCY datasets demonstrate that our model outperforms recent baselines significantly, especially in high-density scenarios. Furthermore, the SSOS strategy proves effective in improving sequential prediction performance, reducing the initial-step prediction error by approximately 6.58%.

[88] A Quantum-assisted Attention U-Net for Building Segmentation over Tunis using Sentinel-1 Data

Luigi Russo, Francesco Mauro, Babak Memar, Alessandro Sebastianelli, Silvia Liberata Ullo, Paolo Gamba

Main category: cs.CV

TL;DR: The paper explores Quanvolutional pre-processing with Attention U-Net for building segmentation in urban areas, using SAR imagery from Tunis. It shows comparable accuracy to standard methods with fewer parameters.

DetailsMotivation: Accurate building segmentation in dense urban areas is challenging due to large, high-resolution satellite images. The study aims to enhance segmentation using quantum-assisted methods.

Method: Quanvolutional pre-processing extracts informative features from SAR imagery, integrated with Attention U-Net for segmentation.

Result: The method matches standard Attention U-Net accuracy while reducing network parameters, improving computational efficiency.

Conclusion: Quantum-assisted Deep Learning shows promise for efficient, large-scale urban building segmentation.

Abstract: Building segmentation in urban areas is essential in fields such as urban planning, disaster response, and population mapping. Yet accurately segmenting buildings in dense urban regions presents challenges due to the large size and high resolution of satellite images. This study investigates the use of Quanvolutional pre-processing to enhance the capability of the Attention U-Net model in building segmentation. Specifically, this paper focuses on the urban landscape of Tunis, utilizing Sentinel-1 Synthetic Aperture Radar (SAR) imagery. In this work, Quanvolution was used to extract more informative feature maps that capture essential structural details in radar imagery, proving beneficial for accurate building segmentation. Preliminary results indicate that the proposed methodology achieves comparable test accuracy to the standard Attention U-Net model while significantly reducing network parameters. This result aligns with findings from previous works, confirming that Quanvolution not only maintains model accuracy but also increases computational efficiency. These promising outcomes highlight the potential of quantum-assisted Deep Learning frameworks for large-scale building segmentation in urban environments.
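For readers unfamiliar with Quanvolution, it slides a small window over the image and replaces each patch with the measurement outcomes of a quantum circuit. A sketch in the style of the well-known PennyLane quanvolution demo (the circuit design and sizes are illustrative, not the paper’s configuration):

```python
import numpy as np
import pennylane as qml

n_wires = 4
dev = qml.device("default.qubit", wires=n_wires)
rand_weights = np.random.uniform(0, 2 * np.pi, size=(1, n_wires))  # fixed random layer

@qml.qnode(dev)
def patch_circuit(pixels):
    # Encode a 2x2 patch as RY rotations, mix with a random layer, read out Z.
    for i in range(n_wires):
        qml.RY(np.pi * pixels[i], wires=i)
    qml.RandomLayers(rand_weights, wires=range(n_wires))
    return [qml.expval(qml.PauliZ(i)) for i in range(n_wires)]

def quanvolve(img):
    """Map an (H, W) image to (H // 2, W // 2, n_wires) quantum feature maps."""
    out = np.zeros((img.shape[0] // 2, img.shape[1] // 2, n_wires))
    for r in range(0, img.shape[0] - 1, 2):
        for c in range(0, img.shape[1] - 1, 2):
            patch = [img[r, c], img[r, c + 1], img[r + 1, c], img[r + 1, c + 1]]
            out[r // 2, c // 2] = patch_circuit(patch)
    return out
```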

[89] MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

Shreya Kadambi, Risheek Garrepalli, Shubhankar Borse, Munawar Hyatt, Fatih Porikli

Main category: cs.CV

TL;DR: The paper introduces MADI, a framework enhancing diffusion models for structured, controllable generation and editing via Masking-Augmented gaussian Diffusion (MAgD) and inference-time scaling with Pause Tokens.

DetailsMotivation: To address challenges in grounded visual editing and compositional control in diffusion models, leveraging self-supervised learning and in-context generative modeling.

Method: Proposes MADI with MAgD (dual corruption process combining denoising and masked reconstruction) and Pause Tokens for inference-time capacity scaling.

Result: MADI improves editability, compositionality, and controllability of diffusion models, enabling localized and structure-aware editing.

Conclusion: MADI advances diffusion models for general-purpose, in-context generative architectures.

Abstract: Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains challenging. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance diffusion model capacity for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with a dual corruption process that combines standard denoising score matching and masked reconstruction by masking the noisy input from the forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt to increase computational capacity at inference time. Our findings show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.
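The dual corruption in MAgD pairs diffusion noise with input masking. A hedged sketch of one training step (an x0-prediction variant written for brevity; the paper trains with denoising score matching, and `model(x, t)` is a hypothetical denoiser interface):

```python
import torch

def dual_corruption_loss(model, x0, mask_ratio=0.3):
    """Corrupt with noise AND a random mask, then reconstruct the clean image."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1, 1)    # diffusion time in [0, 1]
    xt = (1 - t) * x0 + t * torch.randn_like(x0)            # noisy interpolant
    keep = (torch.rand(b, 1, *x0.shape[2:], device=x0.device) > mask_ratio).float()
    pred = model(xt * keep, t.flatten())                    # model sees masked noisy input
    return ((pred - x0) ** 2).mean()                        # recover the clean target
```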

[90] UL-DD: A Multimodal Drowsiness Dataset Using Video, Biometric Signals, and Behavioral Data

Morteza Bodaghi, Majid Hosseini, Raju Gottumukkala, Ravi Teja Bhupatiraju, Iftikhar Ahmad, Moncef Gabbouj

Main category: cs.CV

TL;DR: A multimodal dataset for driver drowsiness detection was created, combining facial, behavioral, and biometric signals, with 40-minute continuous sessions per subject.

DetailsMotivation: To provide a comprehensive dataset capturing gradual changes in driver drowsiness, unlike existing datasets with discrete labels.

Method: Data collection included 3D facial video, IR footage, biometric signals, grip sensor data, and telemetry from 19 subjects in alert and drowsy states, using the KSS for self-reported drowsiness.

Result: A dataset of 1,400 minutes was compiled, featuring continuous monitoring of driver states.

Conclusion: The dataset offers a richer resource for drowsiness detection research by integrating diverse signals and continuous state changes.

Abstract: In this study, we present a comprehensive public dataset for driver drowsiness detection, integrating multimodal signals of facial, behavioral, and biometric indicators. Our dataset includes 3D facial video using a depth camera, IR camera footage, posterior videos, and biometric signals such as heart rate, electrodermal activity, blood oxygen saturation, skin temperature, and accelerometer data. The dataset also provides grip sensor data from the steering wheel and telemetry data from the American Truck Simulator game, offering further insight into driver behavior while alert and drowsy. Drowsiness levels were self-reported every four minutes using the Karolinska Sleepiness Scale (KSS). The simulation environment uses a three-monitor setup, and the driving controls closely mimic those of a real car. Data were collected from 19 subjects (15 M, 4 F) in two conditions: when they were fully alert and when they exhibited signs of sleepiness. Unlike other datasets, our multimodal dataset has a continuous duration of 40 minutes for each data collection session per subject, contributing to a total length of 1,400 minutes, and we recorded gradual changes in the driver state rather than discrete alert/drowsy labels. This study aims to create a comprehensive multimodal dataset of driver drowsiness that captures a wider range of physiological, behavioral, and driving-related signals. The dataset will be available upon request to the corresponding author.

[91] AortaDiff: Volume-Guided Conditional Diffusion Models for Multi-Branch Aortic Surface Generation

Delin An, Pan Du, Jian-Xun Wang, Chaoli Wang

Main category: cs.CV

TL;DR: AortaDiff is a diffusion-based framework for generating smooth, CFD-compatible 3D aortic surfaces from CT/MRI volumes, reducing reliance on large labeled datasets and manual intervention.

DetailsMotivation: Accurate 3D aortic construction is essential for clinical diagnosis and CFD simulations, but existing methods require extensive manual work and large datasets.

Method: AortaDiff uses a volume-guided conditional diffusion model to generate aortic centerlines, extracts vessel contours, and fits them into smooth 3D surfaces.

Result: AortaDiff produces high-fidelity meshes suitable for CFD, even with limited data, and handles both normal and pathological cases.

Conclusion: AortaDiff is a practical, end-to-end solution for cardiovascular research, offering high-quality visualizations and CFD compatibility.

Abstract: Accurate 3D aortic construction is crucial for clinical diagnosis, preoperative planning, and computational fluid dynamics (CFD) simulations, as it enables the estimation of critical hemodynamic parameters such as blood flow velocity, pressure distribution, and wall shear stress. Existing construction methods often rely on large annotated training datasets and extensive manual intervention. While the resulting meshes can serve for visualization purposes, they struggle to produce geometrically consistent, well-constructed surfaces suitable for downstream CFD analysis. To address these challenges, we introduce AortaDiff, a diffusion-based framework that generates smooth aortic surfaces directly from CT/MRI volumes. AortaDiff first employs a volume-guided conditional diffusion model (CDM) to iteratively generate aortic centerlines conditioned on volumetric medical images. Each centerline point is then automatically used as a prompt to extract the corresponding vessel contour, ensuring accurate boundary delineation. Finally, the extracted contours are fitted into a smooth 3D surface, yielding a continuous, CFD-compatible mesh representation. AortaDiff offers distinct advantages over existing methods, including an end-to-end workflow, minimal dependency on large labeled datasets, and the ability to generate CFD-compatible aorta meshes with high geometric fidelity. Experimental results demonstrate that AortaDiff performs effectively even with limited training data, successfully constructing both normal and pathologically altered aorta meshes, including cases with aneurysms or coarctation. This capability enables the generation of high-quality visualizations and positions AortaDiff as a practical solution for cardiovascular research.
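The centerline generator can be pictured with a generic conditional sampler. The sketch below assumes a velocity (flow-matching) parameterization purely for brevity, with `model(x, t, cond)` as a hypothetical interface; the paper’s volume-guided CDM may use a different diffusion parameterization:

```python
import torch

@torch.no_grad()
def sample_centerline(model, volume_feats, n_points=128, steps=50):
    """Euler-integrate a conditional velocity model from noise to a centerline."""
    x = torch.randn(1, n_points, 3)              # start from pure noise (t = 1)
    for i in range(steps):
        t = torch.full((1,), 1.0 - i / steps)
        v = model(x, t, volume_feats)            # predicted velocity toward noise
        x = x - v / steps                        # Euler step from t = 1 down to 0
    return x                                     # ordered (x, y, z) centerline points
```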

[92] COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O’Brien, Vasu Sharma

Main category: cs.CV

TL;DR: COREVQA is a new benchmark for evaluating VLMs on visual entailment tasks, revealing their limitations in reasoning over crowded scenes.

DetailsMotivation: Existing benchmarks lack evaluation of VLMs' visual entailment abilities, especially in crowded images.

Method: Proposed COREVQA, a benchmark with 5608 image and synthetic true/false statement pairs derived from CrowdHuman dataset.

Result: Top VLMs scored below 80% accuracy, with others ranging 39.98%-69.95%, showing significant limitations.

Conclusion: VLMs struggle with visual entailment in crowded scenes, highlighting a key area for improvement.

Abstract: Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model’s ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image-question pairs in crowded scenes.

[93] IConMark: Robust Interpretable Concept-Based Watermark For AI Images

Vinu Sankar Sadasivan, Mehrdad Saberi, Soheil Feizi

Main category: cs.CV

TL;DR: IConMark is a novel semantic watermarking method for AI-generated images, embedding interpretable concepts to combat adversarial attacks and misinformation.

DetailsMotivation: The rise of generative AI and synthetic media necessitates robust methods to distinguish AI-generated images from real ones, as traditional watermarking is vulnerable to attacks.

Method: IConMark embeds meaningful semantic attributes into images, making watermarks human-readable and resilient to adversarial manipulation. It also combines with existing techniques (StegaStamp, TrustMark) for enhanced robustness.

Result: IConMark and its variants (+TM, +SS) outperform baselines, achieving 10.8%, 14.5%, and 15.9% higher AUROC scores for watermark detection.

Conclusion: IConMark offers a robust, interpretable solution for watermarking AI-generated images, with potential for further enhancement through hybrid approaches.

Abstract: With the rapid rise of generative AI and synthetic media, distinguishing AI-generated images from real ones has become crucial in safeguarding against misinformation and ensuring digital authenticity. Traditional watermarking techniques have shown vulnerabilities to adversarial attacks, undermining their effectiveness in the presence of attackers. We propose IConMark, a novel in-generation robust semantic watermarking method that embeds interpretable concepts into AI-generated images, as a first step toward interpretable watermarking. Unlike traditional methods, which rely on adding noise or perturbations to AI-generated images, IConMark incorporates meaningful semantic attributes, making it interpretable to humans and hence resilient to adversarial manipulation. This method is not only robust against various image augmentations but also human-readable, enabling manual verification of watermarks. We present a detailed evaluation of IConMark’s effectiveness, demonstrating its superiority in terms of detection accuracy while maintaining image quality. Moreover, IConMark can be combined with existing watermarking techniques to further enhance and complement its robustness. We introduce IConMark+SS and IConMark+TM, hybrid approaches combining IConMark with StegaStamp and TrustMark, respectively, to further bolster robustness against multiple types of image manipulations. Our base watermarking technique (IConMark) and its variants (+TM and +SS) achieve 10.8%, 14.5%, and 15.9% higher mean area under the receiver operating characteristic curve (AUROC) scores for watermark detection, respectively, compared to the best baseline on various datasets.
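The detection side of a concept-based watermark is deliberately simple and human-checkable: verify how many of the embedded concepts appear in the image. A hedged sketch (the scoring rule and `detect` interface are illustrative, not the paper’s exact detector):

```python
def watermark_score(image, concepts, detect):
    """Fraction of watermark concepts found in the image.

    concepts: e.g. ["red balloon", "wall clock"] chosen at generation time
    detect:   any open-vocabulary detector, detect(image, concept) -> bool
    """
    hits = sum(detect(image, c) for c in concepts)
    return hits / len(concepts)   # compare against a threshold, e.g. > 0.5
```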

[94] A Deep Learning-Based Ensemble System for Automated Shoulder Fracture Detection in Clinical Radiographs

Hemanth Kumar M, Karthika M, Saianiruth M, Vasanthakumar Venugopal, Anandakumar D, Revathi Ezhumalai, Charulatha K, Kishore Kumar J, Dayana G, Kalyan Sivasailam, Bargava Subramanian

Main category: cs.CV

TL;DR: An AI-driven multi-model deep learning system achieves high accuracy (95.5%) in detecting shoulder fractures in X-rays, outperforming individual models and showing potential for clinical integration.

DetailsMotivation: To address the underdiagnosis of shoulder fractures in emergency settings by leveraging AI for scalable and early detection.

Method: Developed a multi-model system using 10,000 annotated X-rays, employing Faster R-CNN, EfficientDet, and RF-DETR architectures, enhanced with ensemble techniques like Soft-NMS, WBF, and NMW fusion.

Result: The NMW ensemble achieved 95.5% accuracy and an F1-score of 0.9610, excelling in recall and localization precision.

Conclusion: Ensemble-based AI is effective for reliable shoulder fracture detection, suitable for clinical workflows, though limited to binary detection for rapid screening.

Abstract: Background: Shoulder fractures are often underdiagnosed, especially in emergency and high-volume clinical settings. Studies report up to 10% of such fractures may be missed by radiologists. AI-driven tools offer a scalable way to assist early detection and reduce diagnostic delays. We address this gap through a dedicated AI system for shoulder radiographs. Methods: We developed a multi-model deep learning system using 10,000 annotated shoulder X-rays. Architectures include Faster R-CNN (ResNet50-FPN, ResNeXt), EfficientDet, and RF-DETR. To enhance detection, we applied bounding box and classification-level ensemble techniques such as Soft-NMS, WBF, and NMW fusion. Results: The NMW ensemble achieved 95.5% accuracy and an F1-score of 0.9610, outperforming individual models across all key metrics. It demonstrated strong recall and localization precision, confirming its effectiveness for clinical fracture detection in shoulder X-rays. Conclusion: The results show ensemble-based AI can reliably detect shoulder fractures in radiographs with high clinical relevance. The model’s accuracy and deployment readiness position it well for integration into real-time diagnostic workflows. The current model is limited to binary fracture detection, reflecting its design for rapid screening and triage support rather than detailed orthopedic classification.
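Of the fusion schemes compared, Weighted Boxes Fusion is easy to reproduce with the open-source ensemble-boxes package (the package choice and thresholds here are assumptions for illustration; the paper does not specify its implementation):

```python
from ensemble_boxes import weighted_boxes_fusion  # pip install ensemble-boxes

def fuse_detections(boxes_list, scores_list, labels_list):
    """Fuse per-model detections; boxes are [x1, y1, x2, y2] normalized to [0, 1].

    Each argument is a list with one entry per model in the ensemble.
    """
    boxes, scores, labels = weighted_boxes_fusion(
        boxes_list, scores_list, labels_list,
        iou_thr=0.55,       # boxes overlapping above this are averaged together
        skip_box_thr=0.05,  # discard very low-confidence candidates first
    )
    return boxes, scores, labels
```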

[95] AI-ming backwards: Vanishing archaeological landscapes in Mesopotamia and automatic detection of sites on CORONA imagery

Alessandro Pistola, Valentina Orrù, Nicolò Marchetti, Marco Roccetti

Main category: cs.CV

TL;DR: Upgrading a deep learning model with CORONA satellite imagery improved archaeological site detection in transformed landscapes, achieving high accuracy and discovering new sites.

DetailsMotivation: To enhance AI models for identifying archaeological sites in environments altered or destroyed over decades, leveraging historical satellite imagery.

Method: Retrained a Bing-based convolutional network model using CORONA satellite imagery for Abu Ghraib, focusing on image segmentation and site detection.

Result: Achieved 85% IoU and 90% accuracy in detecting sites, identifying four new archaeological locations confirmed by fieldwork.

Conclusion: Combining AI with CORONA imagery is effective for discovering vanished archaeological sites, offering breakthroughs in landscape studies.

Abstract: By upgrading an existing deep learning model with the knowledge provided by one of the oldest sets of grayscale satellite imagery, known as CORONA, we improved the model’s ability to automatically identify archaeological sites in an environment that has been completely transformed in the last five decades, including the complete destruction of many of those same sites. The initial Bing-based convolutional network model was retrained using CORONA satellite imagery for the district of Abu Ghraib, west of Baghdad, in the central Mesopotamian floodplain. The results were twofold and surprising. First, the detection precision on the area of interest increased considerably: in particular, the Intersection over Union (IoU) values at the image segmentation level surpassed 85 percent, while the overall accuracy in detecting archaeological sites reached 90 percent. Second, the retrained model identified four new sites of archaeological interest (confirmed through field verification) that archaeologists using traditional techniques had not previously found. This confirms the efficacy of combining AI techniques with CORONA imagery from the 1960s to discover archaeological sites that are currently no longer visible, a concrete breakthrough with significant consequences for the study of landscapes whose archaeological evidence has vanished through anthropization.
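The segmentation-level IoU quoted above is the standard pixel-wise ratio, which is worth pinning down for readers outside computer vision:

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Pixel-level Intersection over Union for binary site-segmentation masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```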

[96] CaSTFormer: Causal Spatio-Temporal Transformer for Driving Intention Prediction

Sirui Wang, Zhou Guan, Bingxi Zhao, Tongjia Gu

Main category: cs.CV

TL;DR: CaSTFormer, a Causal Spatio-Temporal Transformer, improves driving intention prediction by modeling causal interactions between driver behavior and environment, achieving state-of-the-art results.

DetailsMotivation: Current methods fail to accurately model the complex spatio-temporal dependencies and variability in human driving behavior, limiting the safety and efficiency of human-machine co-driving systems.

Method: CaSTFormer uses Reciprocal Shift Fusion (RSF) for temporal alignment, Causal Pattern Extraction (CPE) to remove spurious correlations, and a Feature Synthesis Network (FSN) to synthesize purified representations.

Result: CaSTFormer outperforms existing methods on the Brain4Cars dataset, capturing complex dependencies and improving prediction accuracy and transparency.

Conclusion: CaSTFormer advances driving intention prediction by effectively modeling causal spatio-temporal relationships, enhancing system safety and interactive efficiency.

Abstract: Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems. It serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatio-temporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaSTFormer, a Causal Spatio-Temporal Transformer to explicitly model causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaSTFormer introduces a novel Reciprocal Shift Fusion (RSF) mechanism for precise temporal alignment of internal and external feature streams, a Causal Pattern Extraction (CPE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent spatio-temporal inferences. We evaluate the proposed CaSTFormer on the public Brain4Cars dataset, and it achieves state-of-the-art performance. It effectively captures complex causal spatio-temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.

[97] PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

Main category: cs.CV

TL;DR: PhyWorldBench evaluates video generation models’ adherence to physics, testing 12 models with 1,050 prompts across fundamental, complex, and anti-physics scenarios.

DetailsMotivation: Assessing and improving video generation models' ability to simulate physical phenomena accurately.

Method: Introduces PhyWorldBench with human and MLLM-based evaluation, testing models on diverse physics scenarios.

Result: Identifies challenges in models’ physics adherence and provides prompt-crafting recommendations.

Conclusion: Highlights gaps in physics simulation and suggests improvements for future models.

Abstract: Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel “Anti-Physics” category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current MLLMs to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. Through systematic testing of their outputs across 1,050 curated prompts, spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.
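The zero-shot MLLM evaluation can be sketched as a simple judge call (the prompt wording and `call_mllm` interface are invented for illustration; the paper’s actual rubric is more detailed):

```python
JUDGE_PROMPT = (
    "You are shown frames from a generated video for the prompt: '{prompt}'.\n"
    "Does the depicted motion obey real-world physics? Answer YES or NO."
)

def physics_realism(frames, prompt, call_mllm):
    """call_mllm(images, text) -> str; any multimodal LLM client works here."""
    reply = call_mllm(frames, JUDGE_PROMPT.format(prompt=prompt))
    return reply.strip().upper().startswith("YES")
```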

[98] Uncertainty Quantification Framework for Aerial and UAV Photogrammetry through Error Propagation

Debao Huang, Rongjun Qin

Main category: cs.CV

TL;DR: The paper introduces a framework for quantifying uncertainty in photogrammetric point clouds, addressing gaps in Multi-view Stereo (MVS) uncertainty estimation by using a self-calibrating method with reliable n-view points.

DetailsMotivation: Photogrammetric point clouds lack standardized uncertainty quantification, especially in the MVS stage, due to its non-differentiable and multi-modal nature.

Method: Proposes a self-calibrating method using reliable n-view points (n>=6) to regress disparity uncertainty with cues from MVS, adhering to error propagation paths.

Result: Outperforms existing approaches by achieving high bounding rates without overestimating uncertainty, validated on airborne and UAV datasets.

Conclusion: The framework provides robust and certifiable uncertainty quantification for photogrammetric point clouds, closing a critical gap in the field.

Abstract: Uncertainty quantification of the photogrammetry process is essential for providing per-point accuracy credentials of the point clouds. Unlike airborne LiDAR, which typically delivers consistent accuracy across various scenes, the accuracy of photogrammetric point clouds is highly scene-dependent, since it relies on algorithm-generated measurements (i.e., stereo or multi-view stereo). Generally, errors of the photogrammetric point clouds propagate through a two-step process: Structure-from-Motion (SfM) with Bundle Adjustment (BA), followed by Multi-view Stereo (MVS). While uncertainty estimation in the SfM stage has been well studied using the first-order statistics of the reprojection error function, that in the MVS stage remains largely unsolved and non-standardized, primarily due to its non-differentiable and multi-modal nature (i.e., from pixel values to geometry). In this paper, we present an uncertainty quantification framework closing this gap by associating an error covariance matrix per point that accounts for this two-step photogrammetry process. Specifically, to estimate the uncertainty in the MVS stage, we propose a novel, self-calibrating method that takes reliable n-view points (n>=6) per view to regress the disparity uncertainty using highly relevant cues (such as matching cost values) from the MVS stage. Compared to existing approaches, our method uses self-contained, reliable 3D points extracted directly from the MVS process, with the benefit of being self-supervised and naturally adhering to the error propagation path of the photogrammetry process, thereby providing robust and certifiable uncertainty quantification across diverse scenes. We evaluate the framework using a variety of publicly available airborne and UAV imagery datasets. Results demonstrate that our method outperforms existing approaches by achieving high bounding rates without overestimating uncertainty.
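The backbone of the framework is classical first-order error propagation, applied per point so the SfM/BA uncertainty is chained through matching into a 3D covariance (a generic sketch of the principle; the Jacobian layout below is an assumption, not the paper’s exact parameterization):

```python
import numpy as np

def propagate_covariance(jacobian, cov_in):
    """First-order propagation: for y = f(x), Cov(y) ~= J Cov(x) J^T."""
    return jacobian @ cov_in @ jacobian.T

# Example: J is the 3x4 Jacobian of the triangulated point (X, Y, Z) with
# respect to (u, v, disparity, baseline); combined with the 4x4 covariance of
# those inputs, it yields the 3x3 per-point error covariance.
J = np.random.randn(3, 4)
cov_point = propagate_covariance(J, np.eye(4) * 0.01)   # (3, 3)
```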

[99] Sugar-Beet Stress Detection using Satellite Image Time Series

Bhumika Laxman Sadbhave, Philipp Vaeth, Denise Dejon, Gunther Schorcht, Magda Gregorová

Main category: cs.CV

TL;DR: Unsupervised 3D convolutional autoencoder for stress detection in sugar-beet fields using Sentinel-2 SITS data.

DetailsMotivation: Leverage SITS data for agricultural stress detection without labeled data, focusing on sugar-beet fields.

Method: 3D convolutional autoencoder with temporal encodings to extract features from Sentinel-2 sequences, followed by clustering.

Result: Effective unsupervised system for detecting stressed sugar-beet fields, applicable across different years.

Conclusion: Proposed method provides a practical, label-free tool for agricultural stress monitoring.

Abstract: Satellite Image Time Series (SITS) data has proven effective for agricultural tasks due to its rich spectral and temporal nature. In this study, we tackle the task of stress detection in sugar-beet fields using a fully unsupervised approach. We propose a 3D convolutional autoencoder model to extract meaningful features from Sentinel-2 image sequences, combined with acquisition-date-specific temporal encodings to better capture the growth dynamics of sugar-beets. The learned representations are used in a downstream clustering task to separate stressed from healthy fields. The resulting stress detection system can be directly applied to data from different years, offering a practical and accessible tool for stress detection in sugar-beets.
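A minimal version of the model is a 3D-convolutional autoencoder whose bottleneck embedding feeds the clustering stage (a sketch with invented sizes; the acquisition-date temporal encodings described above are omitted for brevity):

```python
import torch
import torch.nn as nn

class SITSAutoencoder(nn.Module):
    """Autoencode a Sentinel-2 sequence shaped (B, bands, time, H, W)."""
    def __init__(self, bands=10):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(bands, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, bands, 4, stride=2, padding=1))

    def forward(self, x):
        z = self.enc(x)
        field_embedding = z.mean(dim=(2, 3, 4))   # pooled vector for clustering
        return self.dec(z), field_embedding
```

Training minimizes reconstruction error; afterwards, a standard clustering algorithm (e.g., k-means) over the field embeddings separates stressed from healthy fields without any labels.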

[100] SparseC-AFM: a deep learning method for fast and accurate characterization of MoS$_2$ with C-AFM

Levi Harris, Md Jayed Hossain, Mufan Qiu, Ruichen Zhang, Pingchuan Ma, Tianlong Chen, Jiaqi Gu, Seth Ariel Tongay, Umberto Celano

Main category: cs.CV

TL;DR: SparseC-AFM, a deep learning model, accelerates conductivity mapping of 2D materials like MoS$_2$ from sparse C-AFM scans, reducing acquisition time by 11x while maintaining accuracy.

DetailsMotivation: The need for faster and robust electrical characterization of 2D materials in nanoelectronics, overcoming the slow data acquisition of traditional C-AFM.

Method: Introduces SparseC-AFM, a deep learning model that reconstructs conductivity maps from sparse C-AFM scans, validated across various conditions.

Result: Achieves 11x faster acquisition time, accurately extracts material parameters, and matches electrical properties of full-resolution C-AFM data.

Conclusion: SparseC-AFM bridges AI-assisted 2D material characterization from lab research to industrial applications, with open-source code available.

Abstract: The increasing use of two-dimensional (2D) materials in nanoelectronics demands robust metrology techniques for electrical characterization, especially for large-scale production. While atomic force microscopy (AFM) techniques like conductive AFM (C-AFM) offer high accuracy, they suffer from slow data acquisition speeds due to the raster scanning process. To address this, we introduce SparseC-AFM, a deep learning model that rapidly and accurately reconstructs conductivity maps of 2D materials like MoS$_2$ from sparse C-AFM scans. Our approach is robust across various scanning modes, substrates, and experimental conditions. We report a comparison between (a) classic flow implementation, where a high pixel density C-AFM image (e.g., 15 minutes to collect) is manually parsed to extract relevant material parameters, and (b) our SparseC-AFM method, which achieves the same operation using data that requires substantially less acquisition time (e.g., under 5 minutes). SparseC-AFM enables efficient extraction of critical material parameters in MoS$_2$, including film coverage, defect density, and identification of crystalline island boundaries, edges, and cracks. We achieve over 11x reduction in acquisition time compared to manual extraction from a full-resolution C-AFM image. Moreover, we demonstrate that our model-predicted samples exhibit remarkably similar electrical properties to full-resolution data gathered using classic-flow scanning. This work represents a significant step toward translating AI-assisted 2D material characterization from laboratory research to industrial fabrication. Code and model weights are available at github.com/UNITES-Lab/sparse-cafm.
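Training data for such a sparse-to-dense model can be simulated directly from full-resolution scans by dropping raster lines (a hypothetical pair-construction helper; the authors’ sampling scheme may differ):

```python
import numpy as np

def make_training_pair(full_scan, keep_every=4):
    """Keep every k-th scan line to mimic a fast, sparse C-AFM acquisition."""
    sparse = np.zeros_like(full_scan)
    mask = np.zeros_like(full_scan)
    sparse[::keep_every] = full_scan[::keep_every]
    mask[::keep_every] = 1.0
    # Model input: sparse map + validity mask; target: the dense scan.
    return np.stack([sparse, mask]), full_scan
```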

[101] Total Generalized Variation of the Normal Vector Field and Applications to Mesh Denoising

Lukas Baumgärtner, Ronny Bergmann, Roland Herzog, Stephan Schmidt, Manuel Weiß

Main category: cs.CV

TL;DR: A novel formulation for second-order total generalized variation (TGV) of normal vectors on triangular meshes in 3D space, using a tailored tangential Raviart-Thomas finite element space.

DetailsMotivation: To extend discrete TGV models to manifold-valued functions (normal vectors on the unit sphere) for improved mesh denoising.

Method: Construct a tangential Raviart-Thomas type finite element space for the manifold setting and apply it to the TGV formulation.

Result: The new regularizer is evaluated in mesh denoising experiments and compared to existing methods.

Conclusion: The proposed method effectively extends TGV to manifold-valued functions, offering potential improvements in mesh denoising.

Abstract: We propose a novel formulation for the second-order total generalized variation (TGV) of the normal vector on an oriented, triangular mesh embedded in $\mathbb{R}^3$. The normal vector is considered as a manifold-valued function, taking values on the unit sphere. Our formulation extends previous discrete TGV models for piecewise constant scalar data that utilize a Raviart-Thomas function space. To extend this formulation to the manifold setting, a tailor-made tangential Raviart-Thomas type finite element space is constructed in this work. The new regularizer is compared to existing methods in mesh denoising experiments.
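For orientation, the classical Euclidean second-order TGV that the paper generalizes reads as follows (standard background, not the manifold-valued discrete formulation introduced in the paper, which replaces the gradient of $u$ by a derivative of the sphere-valued normal field and discretizes $w$ in the tangential Raviart-Thomas space):

```latex
\mathrm{TGV}^2_{\alpha}(u)
  = \min_{w}\; \alpha_1 \int_{\Omega} \lvert \nabla u - w \rvert \,\mathrm{d}x
  + \alpha_0 \int_{\Omega} \lvert \mathcal{E}(w) \rvert \,\mathrm{d}x,
\qquad
\mathcal{E}(w) = \tfrac{1}{2}\left(\nabla w + \nabla w^{\top}\right).
```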

[102] $\nabla$NABLA: Neighborhood Adaptive Block-Level Attention

Dmitrii Mikhailov, Aleksey Letunovskiy, Maria Kovaleva, Vladimir Arkhipkin, Vladimir Korviakov, Vladimir Polovnikov, Viacheslav Vasilev, Evelina Sidorova, Denis Dimitrov

Main category: cs.CV

TL;DR: NABLA introduces a Neighborhood Adaptive Block-Level Attention mechanism for video diffusion transformers, reducing computational overhead while maintaining generative quality.

DetailsMotivation: Quadratic complexity of full attention in transformers is a bottleneck for high-resolution and long-duration video generation.

Method: Proposes NABLA, a block-wise attention mechanism with adaptive sparsity-driven thresholds, compatible with PyTorch’s Flex Attention.

Result: Achieves up to 2.7x faster training/inference with minimal quality loss in metrics like CLIP score and human evaluation.

Conclusion: NABLA efficiently addresses computational challenges in video generation without sacrificing quality, offering practical integration.

Abstract: Recent progress in transformer-based architectures has demonstrated remarkable success in video generation tasks. However, the quadratic complexity of full attention mechanisms remains a critical bottleneck, particularly for high-resolution and long-duration video sequences. In this paper, we propose NABLA, a novel Neighborhood Adaptive Block-Level Attention mechanism that dynamically adapts to sparsity patterns in video diffusion transformers (DiTs). By leveraging block-wise attention with an adaptive sparsity-driven threshold, NABLA reduces computational overhead while preserving generative quality. Our method does not require custom low-level operator design and can be seamlessly integrated with PyTorch’s Flex Attention operator. Experiments demonstrate that NABLA achieves up to 2.7x faster training and inference compared to the baseline, with almost no loss in quantitative metrics (CLIP score, VBench score, human evaluation score) or visual quality. The code and model weights are available here: https://github.com/gen-ai-team/Wan2.1-NABLA
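
For intuition, the sketch below shows one way block-level attention with an adaptive mass threshold can be realized in plain PyTorch. It is a single-head toy: the pooling rule, the `keep` budget, and the function name are illustrative assumptions, not NABLA's actual operator (which integrates with Flex Attention inside video DiTs).

```python
# Minimal sketch: block-pooled affinities choose which key blocks each query
# block attends to; the kept set is the smallest covering `keep` attention mass.
import torch

def nabla_style_attention(q, k, v, block=64, keep=0.25):
    """q, k, v: (seq, dim) with seq divisible by `block`."""
    seq, dim = q.shape
    nb = seq // block
    qb = q.view(nb, block, dim).mean(dim=1)           # pooled query blocks
    kb = k.view(nb, block, dim).mean(dim=1)           # pooled key blocks
    probs = ((qb @ kb.T) / dim ** 0.5).softmax(dim=-1)  # (nb, nb) block affinity
    # Adaptive sparsity: keep blocks until cumulative mass reaches `keep`.
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    keep_sorted = (sorted_p.cumsum(dim=-1) - sorted_p) < keep  # top-1 always kept
    block_mask = torch.zeros_like(probs, dtype=torch.bool).scatter(1, idx, keep_sorted)
    # Expand the block mask to token resolution and run masked attention.
    token_mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

out = nabla_style_attention(torch.randn(256, 32), torch.randn(256, 32), torch.randn(256, 32))
```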

[103] LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning

Kaihong Wang, Donghyun Kim, Margrit Betke

Main category: cs.CV

TL;DR: A LoRA-enhanced synthetic-replay framework improves continual learning in vision-language models by adapting Stable Diffusion with task-specific low-rank adapters and confidence-based sample selection.

DetailsMotivation: Existing synthetic-replay methods produce misaligned samples due to domain-specific nuances, undermining knowledge retention.

Method: Proposes a LoRA-enhanced framework with two-stage confidence-based sample selection for task-specific adaptation and synthetic sample distillation.

Result: Outperforms previous synthetic-replay techniques on the MTIL benchmark, balancing plasticity, stability, and zero-shot capability.

Conclusion: Generator adaptation via LoRA effectively enhances continual learning robustness in VLMs.

Abstract: Continual learning for vision-language models has achieved remarkable performance through synthetic replay, where samples are generated using Stable Diffusion to regularize during finetuning and retain knowledge. However, real-world downstream applications often exhibit domain-specific nuances and fine-grained semantics not captured by generators, causing synthetic-replay methods to produce misaligned samples that misguide finetuning and undermine retention of prior knowledge. In this work, we propose a LoRA-enhanced synthetic-replay framework that injects task-specific low-rank adapters into a frozen Stable Diffusion model, efficiently capturing each new task’s unique visual and semantic patterns. Specifically, we introduce a two-stage, confidence-based sample selection: we first rank real task data by post-finetuning VLM confidence to focus LoRA finetuning on the most representative examples, then generate synthetic samples and again select them by confidence for distillation. Our approach integrates seamlessly with existing replay pipelines: simply swap in the adapted generator to boost replay fidelity. Extensive experiments on the Multi-domain Task Incremental Learning (MTIL) benchmark show that our method outperforms previous synthetic-replay techniques, achieving an optimal balance among plasticity, stability, and zero-shot capability. These results demonstrate the effectiveness of generator adaptation via LoRA for robust continual learning in VLMs.
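
A minimal sketch of the two-stage, confidence-based selection loop described in the abstract, under the assumption that the post-finetuning VLM exposes one scalar confidence per sample; `vlm_confidence` and `generate_with_lora` are hypothetical stand-ins for the paper's components, and the 50% fractions are arbitrary.

```python
import torch

def select_top_fraction(samples, scores, frac=0.5):
    """Keep the highest-confidence fraction of `samples` (scores: 1D tensor)."""
    k = max(1, int(len(samples) * frac))
    idx = torch.topk(scores, k).indices
    return [samples[i] for i in idx]

def lora_loop_selection(real_data, vlm_confidence, generate_with_lora):
    # Stage 1: rank real task data; finetune the LoRA on the most representative.
    lora_train = select_top_fraction(real_data, vlm_confidence(real_data))
    # Stage 2: generate synthetic replay samples, then filter them again.
    synth = generate_with_lora(lora_train)
    return select_top_fraction(synth, vlm_confidence(synth))

# Dummy stand-ins just to show the data flow:
replay = lora_loop_selection(
    real_data=[f"img_{i}" for i in range(8)],
    vlm_confidence=lambda xs: torch.rand(len(xs)),
    generate_with_lora=lambda xs: [x + "_synth" for x in xs],
)
```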

[104] NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision

Tengkai Wang, Weihao Li, Ruikai Cui, Shi Qiu, Nick Barnes

Main category: cs.CV

TL;DR: NoiseSDF2NoiseSDF extends the Noise2Noise paradigm to 3D neural fields, enabling clean neural SDF learning from noisy point clouds by minimizing MSE loss between noisy SDFs.

DetailsMotivation: Low-quality scanning devices produce noisy point clouds, leading to inaccurate surface reconstructions.

Method: NoiseSDF2NoiseSDF uses noisy supervision to learn clean neural SDFs by minimizing MSE loss between noisy SDF representations.

Result: The method significantly improves surface reconstruction quality on benchmarks like ShapeNet, ABC, Famous, and Real datasets.

Conclusion: NoiseSDF2NoiseSDF effectively denoises and refines surface estimations from noisy point clouds.

Abstract: Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs directly from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs.
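
The core recipe is Noise2Noise transplanted to neural fields: fit an SDF network against targets derived from one noisy scan while supervising with another independent noisy scan of the same surface. The sketch below illustrates that loss structure; the MLP, the unsigned nearest-neighbor distance proxy, and the noise model are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x):                  # x: (N, 3) query points
        return self.net(x).squeeze(-1)     # (N,) distance estimates

def noisy_sdf_target(queries, noisy_cloud):
    # Unsigned nearest-neighbor distance to one noisy scan: a crude proxy
    # for a noisy SDF realization (sign handling is omitted in this sketch).
    return torch.cdist(queries, noisy_cloud).min(dim=1).values

model, opt = SDFNet(), None
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

clean = torch.rand(2048, 3)                        # hypothetical surface samples
scan_a = clean + 0.01 * torch.randn_like(clean)    # noisy realization A
scan_b = clean + 0.01 * torch.randn_like(clean)    # noisy realization B

queries = scan_a[:512] + 0.05 * torch.randn(512, 3)
opt.zero_grad()
loss = nn.functional.mse_loss(model(queries),      # network fit near scan A ...
                              noisy_sdf_target(queries, scan_b))  # ... vs scan B
loss.backward(); opt.step()
```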

[105] Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

Chengxu Liu, Lu Qi, Jinshan Pan, Xueming Qian, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: A novel diffusion model (DM)-based framework is proposed for unsupervised image deblurring using unpaired data, outperforming state-of-the-art methods.

DetailsMotivation: Acquiring paired blurry-sharp images is costly and impractical, prompting the need for learning deblurring from unpaired data. Existing adversarial methods fail to address real-world blur complexity.

Method: The framework uses a Texture Prior Encoder (TPE) with a memory mechanism for DM training and a Texture Transfer Transformer (TTformer) with Filter-Modulated Multi-head Self-Attention (FM-MSA) for adaptive blur removal. A wavelet-based adversarial loss preserves high-frequency details.

Result: Extensive evaluations show the framework outperforms SOTA methods in benchmarks.

Conclusion: The proposed DM-based framework offers a promising unsupervised solution for image deblurring, effectively handling real-world blur patterns.

Abstract: Since acquiring large amounts of realistic blurry-sharp image pairs is difficult and expensive, learning blind image deblurring from unpaired data is a more practical and promising solution. Unfortunately, dominant approaches rely heavily on adversarial learning to bridge the gap from blurry domains to sharp domains, ignoring the complex and unpredictable nature of real-world blur patterns. In this paper, we propose a novel diffusion model (DM)-based framework for image deblurring that learns a spatially varying texture prior from unpaired data. In particular, our framework performs DM to generate the prior knowledge that aids in recovering the textures of blurry images. To implement this, we propose a Texture Prior Encoder (TPE) that introduces a memory mechanism to represent the image textures and provides supervision for DM training. To fully exploit the generated texture priors, we present the Texture Transfer Transformer layer (TTformer), in which a novel Filter-Modulated Multi-head Self-Attention (FM-MSA) efficiently removes spatially varying blurring through adaptive filtering. Furthermore, we implement a wavelet-based adversarial loss to preserve high-frequency texture details. Extensive evaluations show that our framework provides a promising unsupervised deblurring solution and outperforms SOTA methods on widely-used benchmarks.

[106] Efficient Burst Super-Resolution with One-step Diffusion

Kento Kawai, Takeru Oba, Kyotaro Tokoro, Kazutoshi Akita, Norimichi Ukita

Main category: cs.CV

TL;DR: The paper proposes a diffusion model for burst SR to produce sharp, high-fidelity images, improving efficiency with a stochastic sampler and one-step diffusion, reducing runtime to 1.6% of baseline.

DetailsMotivation: Prior burst SR methods produce blurry images; the goal is to achieve sharp, high-fidelity SR images using a diffusion model.

Method: Uses a diffusion model with a stochastic sampler (high-order ODE) and one-step diffusion via knowledge distillation.

Result: Reduces runtime to 1.6% of baseline while maintaining SR quality in distortion and perceptual metrics.

Conclusion: The method efficiently enhances burst SR images, balancing speed and quality.

Abstract: While a burst of Low-Resolution (LR) images is useful for improving Super-Resolution (SR) compared to a single LR image, prior burst SR methods are trained in a deterministic manner, which produces a blurry SR image. Since such blurry images are perceptually degraded, we aim to reconstruct sharp and high-fidelity SR images by a diffusion model. Our method improves the efficiency of the diffusion model with a stochastic sampler with a high-order ODE as well as one-step diffusion using knowledge distillation. Our experimental results demonstrate that our method can reduce the runtime to 1.6% of its baseline while maintaining the SR quality measured based on image distortion and perceptual quality.

[107] CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks

Yanan Wang, Julio Vizcarra, Zhi Li, Hao Niu, Mori Kurokawa

Main category: cs.CV

TL;DR: CoTasks introduces a framework for enhancing VideoLLMs with chain-of-thought reasoning by decomposing video questions into entity-level tasks, improving performance on benchmarks like NeXT-QA.

DetailsMotivation: Existing VideoLLMs lack fine-grained object-level reasoning due to high-level video-text training data. CoTasks addresses this gap by enabling structured, step-by-step reasoning.

Method: CoTasks decomposes video questions into four foundational tasks (frame localization, entity tracking, spatial/temporal relation extraction) and embeds intermediate reasoning steps into inputs.

Result: CoTasks boosts performance: LLaVA-video-7B improves by +3.3 points, Qwen2.5-VL-3B by +17.4, with significant gains in causal, temporal, and descriptive reasoning.

Conclusion: CoTasks effectively enhances compositional video reasoning through structured chain-of-thought supervision.

Abstract: Despite recent progress in video large language models (VideoLLMs), a key open challenge remains: how to equip models with chain-of-thought (CoT) reasoning abilities grounded in fine-grained object-level video understanding. Existing instruction-tuned models, such as the Qwen and LLaVA series, are trained on high-level video-text pairs, often lacking structured annotations necessary for compositional, step-by-step reasoning. We propose CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks, a new framework that decomposes complex video questions of existing datasets (e.g., NeXT-QA, STAR) into four entity-level foundational tasks: frame localization, entity tracking, spatial and temporal relation extraction. By embedding these intermediate CoT-style reasoning steps into the input, CoTasks enables models to explicitly perform object-centric spatiotemporal reasoning. Experiments on the NeXT-QA benchmark show that CoTasks significantly enhance inference performance: LLaVA-video-7B improves by +3.3 points in average GPT-4 evaluation score, and Qwen2.5-VL-3B gains +17.4, with large boosts in causal (+14.6), temporal (+10.9), and descriptive (+48.1) subcategories. These results demonstrate the effectiveness of CoTasks as a structured CoT-style supervision framework for improving compositional video reasoning.
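
A toy illustration of how the four entity-level sub-tasks could be serialized into a CoT-style input; the sub-tasks come from the abstract, while the prompt wording and helper names are assumptions.

```python
# Hypothetical prompt assembly in the spirit of CoTasks.
FOUNDATIONAL_TASKS = [
    "Frame localization: which frames are relevant to the question?",
    "Entity tracking: which objects/persons appear, and where over time?",
    "Spatial relation extraction: how are the entities arranged in each frame?",
    "Temporal relation extraction: how do entity relations change over time?",
]

def build_cotasks_prompt(question: str, intermediate_answers: list[str]) -> str:
    steps = "\n".join(
        f"Step {i + 1}. {task}\n  -> {ans}"
        for i, (task, ans) in enumerate(zip(FOUNDATIONAL_TASKS, intermediate_answers))
    )
    return f"{steps}\nFinal question: {question}\nAnswer using the steps above."

print(build_cotasks_prompt(
    "Why does the child reach for the cup?",
    ["frames 12-30", "child, cup, table", "cup on table, child left of table",
     "child moves toward the table across frames"],
))
```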

[108] Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation

Masahiro Ogawa, Qi An, Atsushi Yamashita

Main category: cs.CV

TL;DR: FoELS integrates optical flow and texture to separate moving and static objects in complex scenes with camera motion, outperforming existing methods.

DetailsMotivation: Existing optical flow-based methods struggle in complex, structured scenes with camera motion, necessitating a more robust solution.

Method: FoELS combines focus of expansion (FoE) computation from optical flow with texture-based segmentation to estimate moving object probability.

Result: FoELS achieves state-of-the-art performance on DAVIS 2016 and real-world traffic videos.

Conclusion: FoELS effectively addresses challenges like complex scenes and camera motion, proving superior to traditional optical flow methods.

Abstract: Separating moving and static objects from a moving camera viewpoint is essential for 3D reconstruction, autonomous navigation, and scene understanding in robotics. Existing approaches often rely primarily on optical flow, which struggles to detect moving objects in complex, structured scenes involving camera motion. To address this limitation, we propose Focus of Expansion Likelihood and Segmentation (FoELS), a method based on the core idea of integrating both optical flow and texture information. FoELS computes the focus of expansion (FoE) from optical flow and derives an initial motion likelihood from the outliers of the FoE computation. This likelihood is then fused with a segmentation-based prior to estimate the final moving probability. The method effectively handles challenges including complex structured scenes, rotational camera motion, and parallel motion. Comprehensive evaluations on the DAVIS 2016 dataset and real-world traffic videos demonstrate its effectiveness and state-of-the-art performance.
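
The FoE step has a classic closed form: under pure camera translation every flow vector lies on a line through the focus of expansion, so the FoE is the least-squares intersection of those lines, and large residuals flag candidate moving pixels. A minimal NumPy sketch of that fit follows; the outlier-to-likelihood mapping and the segmentation fusion from the paper are not reproduced.

```python
import numpy as np

def estimate_foe(points, flows):
    """points, flows: (N, 2). Solves n_i . e = n_i . p_i in least squares,
    where n_i is the normal of the flow line through pixel p_i."""
    u = flows / (np.linalg.norm(flows, axis=1, keepdims=True) + 1e-9)
    n = np.stack([-u[:, 1], u[:, 0]], axis=1)      # normals to flow directions
    b = np.einsum("ij,ij->i", n, points)
    foe, *_ = np.linalg.lstsq(n, b, rcond=None)
    residuals = np.abs(n @ foe - b)                # line-to-FoE distances;
    return foe, residuals                          # large residual => likely moving

pts = np.random.rand(500, 2) * 100
flows = pts - np.array([50.0, 50.0])               # ideal expanding flow field
foe, res = estimate_foe(pts, flows)
print(foe)                                         # ~ [50, 50]
```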

[109] EPSilon: Efficient Point Sampling for Lightening of Hybrid-based 3D Avatar Generation

Seungjun Moon, Sangjoon Yu, Gyeong-Moon Park

Main category: cs.CV

TL;DR: EPSilon introduces efficient point sampling strategies (ERO and EIO) to reduce computational costs in hybrid 3D avatar generation, achieving faster inference and training while maintaining quality.

DetailsMotivation: Hybrid models combining NeRF and SMPL-based mesh suffer from slow inference due to unnecessary deformation computations on empty points.

Method: Proposes empty ray omission (ERO) and empty interval omission (EIO) to skip empty spaces during rendering, reducing sampled points and computational load.

Result: EPSilon uses only 3.9% of sampled points, achieves 20x faster inference, and 4x faster training convergence while preserving quality.

Conclusion: EPSilon’s efficient sampling strategies significantly improve performance in hybrid 3D avatar generation without sacrificing realism.

Abstract: The rapid advancement of neural radiance fields (NeRF) has paved the way to generate animatable human avatars from a monocular video. However, the sole usage of NeRF suffers from a lack of details, which results in the emergence of hybrid representation that utilizes SMPL-based mesh together with NeRF representation. While hybrid-based models show photo-realistic human avatar generation qualities, they suffer from extremely slow inference due to their deformation scheme: to be aligned with the mesh, hybrid-based models use the deformation based on SMPL skinning weights, which needs high computational costs on each sampled point. We observe that since most of the sampled points are located in empty space, they do not affect the generation quality but result in inference latency with deformation. In light of this observation, we propose EPSilon, a hybrid-based 3D avatar generation scheme with novel efficient point sampling strategies that boost both training and inference. In EPSilon, we propose two methods to omit empty points at rendering; empty ray omission (ERO) and empty interval omission (EIO). In ERO, we wipe out rays that progress through the empty space. Then, EIO narrows down the sampling interval on the ray, which wipes out the region not occupied by either clothes or mesh. The delicate sampling scheme of EPSilon enables not only great computational cost reduction during deformation but also the designation of the important regions to be sampled, which enables a single-stage NeRF structure without hierarchical sampling. Compared to existing methods, EPSilon maintains the generation quality while using only 3.9% of sampled points and achieves around 20 times faster inference, together with 4 times faster training convergence. We provide video results on https://github.com/seungjun-moon/epsilon.
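
A minimal sketch of the two omission steps, assuming occupancy can be tested against a coarse surface proxy (here, distance to a point set standing in for the SMPL mesh); the thresholds and the proxy test are illustrative assumptions, not EPSilon's exact criteria.

```python
import torch

def prune_samples(ray_o, ray_d, mesh_pts, n=64, near=0.1, far=3.0, tau=0.05):
    """ray_o, ray_d: (R, 3); mesh_pts: (V, 3) proxy surface points.
    Returns sample positions, a per-ray keep mask (ERO), and a per-sample mask (EIO)."""
    t = torch.linspace(near, far, n)                              # (n,) depths
    pts = ray_o[:, None] + t[None, :, None] * ray_d[:, None]     # (R, n, 3)
    d = torch.cdist(pts.reshape(-1, 3), mesh_pts).min(dim=-1).values
    occupied = (d < tau).reshape(ray_o.shape[0], n)               # (R, n)
    ray_keep = occupied.any(dim=1)                                # ERO: drop empty rays
    # EIO: narrow each surviving ray to its [first, last] occupied sample.
    idx = torch.arange(n).expand_as(occupied)
    first = torch.where(occupied, idx, torch.full_like(idx, n)).min(dim=1).values
    last = torch.where(occupied, idx, torch.full_like(idx, -1)).max(dim=1).values
    interval = (idx >= first[:, None]) & (idx <= last[:, None])
    return pts, ray_keep, interval & ray_keep[:, None]

pts, keep_rays, keep_samples = prune_samples(
    torch.zeros(16, 3),
    torch.nn.functional.normalize(torch.randn(16, 3), dim=1),
    torch.rand(2000, 3),
)
```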

[110] When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework

Xiao Wang, Qian Zhu, Shujuan Wu, Bo Jiang, Shiliang Zhang, Yaowei Wang, Yonghong Tian, Bin Luo

Main category: cs.CV

TL;DR: The paper introduces EvReID, a large-scale RGB-event person ReID dataset, and proposes TriPro-ReID, a contrastive learning framework leveraging pedestrian attributes for better feature learning.

DetailsMotivation: Addressing data scarcity and lack of generalization in event camera-based person ReID by providing a large-scale dataset and improving feature learning.

Method: Creation of the EvReID dataset with 118,988 image pairs and 1200 identities. Proposal of TriPro-ReID, a pedestrian attribute-guided contrastive learning framework.

Result: EvReID dataset and TriPro-ReID framework validated on EvReID and MARS datasets, showing effectiveness.

Conclusion: The work provides a benchmark dataset and a robust framework for future RGB-event person ReID research.

Abstract: Recent researchers have proposed using event cameras for person re-identification (ReID); owing to their promising performance and better balance in terms of privacy protection, event camera-based person ReID has attracted significant attention. Currently, mainstream event-based person ReID algorithms primarily focus on fusing visible light and event stream, as well as preserving privacy. Although significant progress has been made, these methods are typically trained and evaluated on small-scale or simulated event camera datasets, making it difficult to assess their real identification performance and generalization ability. To address the issue of data scarcity, this paper introduces a large-scale RGB-event based person ReID dataset, called EvReID. The dataset contains 118,988 image pairs and covers 1200 pedestrian identities, with data collected across multiple seasons, scenes, and lighting conditions. We also evaluate 15 state-of-the-art person ReID algorithms, laying a solid foundation for future research in terms of both data and benchmarking. Based on our newly constructed dataset, this paper further proposes a pedestrian attribute-guided contrastive learning framework to enhance feature learning for person re-identification, termed TriPro-ReID. This framework not only effectively explores the visual features from both RGB frames and event streams, but also fully utilizes pedestrian attributes as mid-level semantic features. Extensive experiments on the EvReID and MARS datasets fully validate the effectiveness of our proposed RGB-Event person ReID framework. The benchmark dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID

[111] Global Modeling Matters: A Fast, Lightweight and Effective Baseline for Efficient Image Restoration

Xingyu Jiang, Ning Gao, Hongkun Dou, Xiuhui Zhang, Xiaoqing Zhong, Yue Deng, Hongjue Li

Main category: cs.CV

TL;DR: The paper proposes PW-FNet, a novel image restoration method using pyramid Wavelet-Fourier processing, outperforming state-of-the-art methods in quality and efficiency.

DetailsMotivation: Adverse weather degrades image quality, hindering downstream tasks. Existing transformer-based methods are complex and inefficient for real-time use.

Method: PW-FNet integrates pyramid wavelet-based multi-scale decomposition and Fourier transforms to replace self-attention, reducing complexity.

Result: PW-FNet excels in tasks like deraining, super-resolution, and dehazing, offering better quality and efficiency than current methods.

Conclusion: PW-FNet is a highly efficient and effective baseline for image restoration, balancing performance and computational cost.

Abstract: Natural image quality is often degraded by adverse weather conditions, significantly impairing the performance of downstream tasks. Image restoration has emerged as a core solution to this challenge and has been widely discussed in the literature. Although recent transformer-based approaches have made remarkable progress in image restoration, their increasing system complexity poses significant challenges for real-time processing, particularly in real-world deployment scenarios. To address this, most existing methods attempt to simplify the self-attention mechanism, for example via channel self-attention or state-space models. However, these methods primarily focus on network architecture while neglecting the inherent characteristics of image restoration itself. In this context, we explore a pyramid Wavelet-Fourier iterative pipeline to demonstrate the potential of Wavelet-Fourier processing for image restoration. Inspired by the above findings, we propose a novel and efficient restoration baseline, named Pyramid Wavelet-Fourier Network (PW-FNet). Specifically, PW-FNet features two key design principles: 1) at the inter-block level, integrates a pyramid wavelet-based multi-input multi-output structure to achieve multi-scale and multi-frequency bands decomposition; and 2) at the intra-block level, incorporates Fourier transforms as an efficient alternative to self-attention mechanisms, effectively reducing computational complexity while preserving global modeling capability. Extensive experiments on tasks such as image deraining, raindrop removal, image super-resolution, motion deblurring, image dehazing, image desnowing and underwater/low-light enhancement demonstrate that PW-FNet not only surpasses state-of-the-art methods in restoration quality but also achieves superior efficiency, with significantly reduced parameter size, computational cost and inference time.
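
The intra-block idea, FFT-based global mixing as a self-attention substitute, can be sketched in a few lines. The learnable per-frequency filter below is an assumption in the spirit of the abstract, not the actual PW-FNet layer, and the wavelet pyramid is omitted.

```python
import torch
import torch.nn as nn

class FourierMix(nn.Module):
    """Global mixing in O(HW log HW): FFT -> learnable filter -> inverse FFT."""
    def __init__(self, channels, h, w):
        super().__init__()
        # One complex weight per rFFT frequency bin (stored as two real parts).
        self.filter = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                        # x: (B, C, H, W)
        f = torch.fft.rfft2(x, norm="ortho")     # (B, C, H, W//2+1), complex
        f = f * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(f, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 16, 64, 64)
print(FourierMix(16, 64, 64)(x).shape)           # torch.Size([2, 16, 64, 64])
```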

[112] MaskHOI: Robust 3D Hand-Object Interaction Estimation via Masked Pre-training

Yuechen Xie, Haobo Jiang, Jian Yang, Yigong Zhang, Jin Xie

Main category: cs.CV

TL;DR: MaskHOI is a novel MAE-driven pretraining framework for 3D hand-object interaction pose estimation, addressing challenges like geometric ambiguity and occlusions through region-specific masking and SDF-driven learning.

DetailsMotivation: Precise joint pose estimation in 3D hand-object interactions is challenging due to RGB image ambiguity and mutual occlusions.

Method: Proposes MaskHOI with region-specific masking ratios and skeleton-driven hand masking, plus a masked SDF-driven multimodal learning mechanism.

Result: Outperforms state-of-the-art methods in experiments.

Conclusion: MaskHOI effectively enhances geometric-aware and occlusion-robust representation learning for HOI tasks.

Abstract: In 3D hand-object interaction (HOI) tasks, estimating precise joint poses of hands and objects from monocular RGB input remains highly challenging due to the inherent geometric ambiguity of RGB images and the severe mutual occlusions that occur during interaction. To address these challenges, we propose MaskHOI, a novel Masked Autoencoder (MAE)-driven pretraining framework for enhanced HOI pose estimation. Our core idea is to leverage the masking-then-reconstruction strategy of MAE to encourage the feature encoder to infer missing spatial and structural information, thereby facilitating geometric-aware and occlusion-robust representation learning. Specifically, based on our observation that human hands exhibit far greater geometric complexity than rigid objects, conventional uniform masking fails to effectively guide the reconstruction of fine-grained hand structures. To overcome this limitation, we introduce a Region-specific Mask Ratio Allocation, primarily comprising the region-specific masking assignment and the skeleton-driven hand masking guidance. The former adaptively assigns lower masking ratios to hand regions than to rigid objects, balancing their feature learning difficulty, while the latter prioritizes masking critical hand parts (e.g., fingertips or entire fingers) to realistically simulate occlusion patterns in real-world interactions. Furthermore, to enhance the geometric awareness of the pretrained encoder, we introduce a novel Masked Signed Distance Field (SDF)-driven multimodal learning mechanism. Through the self-masking 3D SDF prediction, the learned encoder is able to perceive the global geometric structure of hands and objects beyond the 2D image plane, overcoming the inherent limitations of monocular input and alleviating self-occlusion issues. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches.

[113] Gaussian kernel-based motion measurement

Hongyi Liu, Haifeng Wang

Main category: cs.CV

TL;DR: A novel Gaussian kernel-based method for high-precision, sub-pixel motion measurement in structural health monitoring, eliminating manual parameter tuning.

DetailsMotivation: Address the lack of accuracy and manual parameter dependency in current vision-based motion measurement techniques for structural health monitoring.

Method: Uses Gaussian kernel tracking for motion extraction, incorporating motion consistency and super-resolution constraints for enhanced accuracy and robustness.

Result: Achieves consistent high accuracy without customized parameter setups, validated numerically and experimentally.

Conclusion: The proposed method offers a reliable, automated solution for precise motion measurement in structural monitoring.

Abstract: The growing demand for structural health monitoring has driven increasing interest in high-precision motion measurement, as structural information derived from extracted motions can effectively reflect the current condition of the structure. Among various motion measurement techniques, vision-based methods stand out due to their low cost, easy installation, and large-scale measurement. However, when it comes to sub-pixel-level motion measurement, current vision-based methods either lack sufficient accuracy or require extensive manual parameter tuning (e.g., pyramid layers, target pixels, and filter parameters) to reach good precision. To address this issue, we developed a novel Gaussian kernel-based motion measurement method, which can extract the motion between different frames via tracking the location of Gaussian kernels. The motion consistency, which fits practical structural conditions, and a super-resolution constraint, are introduced to increase accuracy and robustness of our method. Numerical and experimental validations show that it can consistently reach high accuracy without customized parameter setup for different test samples.
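
To make the idea concrete: sub-pixel motion can be read off by localizing a Gaussian-like kernel in each frame and differencing the centers. The intensity-weighted centroid below is a deliberately simple estimator standing in for the paper's kernel tracking; the motion-consistency and super-resolution constraints are not reproduced.

```python
import numpy as np

def gaussian_center(patch):
    """Sub-pixel peak location of a bright Gaussian-like blob in `patch`."""
    y, x = np.mgrid[: patch.shape[0], : patch.shape[1]]
    w = patch - patch.min()
    w = w / w.sum()
    return np.array([(w * x).sum(), (w * y).sum()])   # (cx, cy)

def track_motion(frame_a, frame_b):
    return gaussian_center(frame_b) - gaussian_center(frame_a)

# Synthetic check: a Gaussian blob shifted by 0.3 px along x.
yy, xx = np.mgrid[:21, :21]
blob = lambda cx, cy: np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / 8.0)
print(track_motion(blob(10.0, 10.0), blob(10.3, 10.0)))   # ~ [0.3, 0.0]
```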

[114] Can Synthetic Images Conquer Forgetting? Beyond Unexplored Doubts in Few-Shot Class-Incremental Learning

Junsu Kim, Yunhoe Ku, Seungryul Baek

Main category: cs.CV

TL;DR: Diffusion-FSCIL uses a frozen text-to-image diffusion model for few-shot class-incremental learning, leveraging its generative and representational strengths to outperform existing methods.

DetailsMotivation: Address the challenge of few-shot class-incremental learning (FSCIL) with limited data while minimizing catastrophic forgetting and adapting to new classes.

Method: Utilizes a frozen diffusion model for multi-scale feature extraction and latent replay, combined with feature distillation to reduce biases.

Result: Outperforms state-of-the-art methods on CUB-200, miniImageNet, and CIFAR-100, maintaining performance on old classes and adapting well to new ones.

Conclusion: Diffusion-FSCIL effectively leverages generative models for FSCIL, achieving strong performance with minimal trainable components.

Abstract: Few-shot class-incremental learning (FSCIL) is challenging due to extremely limited training data; while aiming to reduce catastrophic forgetting and learn new information. We propose Diffusion-FSCIL, a novel approach that employs a text-to-image diffusion model as a frozen backbone. Our conjecture is that FSCIL can be tackled using a large generative model’s capabilities benefiting from 1) generation ability via large-scale pre-training; 2) multi-scale representation; 3) representational flexibility through the text encoder. To maximize the representation capability, we propose to extract multiple complementary diffusion features to play roles as latent replay with slight support from feature distillation for preventing generative biases. Our framework realizes efficiency through 1) using a frozen backbone; 2) minimal trainable components; 3) batch processing of multiple feature extractions. Extensive experiments on CUB-200, miniImageNet, and CIFAR-100 show that Diffusion-FSCIL surpasses state-of-the-art methods, preserving performance on previously learned classes and adapting effectively to new ones.

[115] GOSPA and T-GOSPA quasi-metrics for evaluation of multi-object tracking algorithms

Ángel F. García-Fernández, Jinhao Gu, Lennart Svensson, Yuxuan Xia, Jan Krejčí, Oliver Kost, Ondřej Straka

Main category: cs.CV

TL;DR: The paper introduces two quasi-metrics for evaluating multi-object tracking (MOT) algorithms, extending GOSPA and T-GOSPA metrics with flexible cost penalties.

DetailsMotivation: To address limitations in existing MOT evaluation metrics by allowing asymmetric costs for missed/false objects and non-symmetric localization errors.

Method: Proposes two quasi-metrics: one for object sets (extending GOSPA) and one for trajectory sets (extending T-GOSPA), incorporating flexible cost structures.

Result: Demonstrates the quasi-metrics’ utility in assessing Bayesian MOT algorithms through simulations.

Conclusion: The proposed quasi-metrics offer enhanced flexibility for MOT evaluation, particularly in applications requiring asymmetric cost penalties.

Abstract: This paper introduces two quasi-metrics for performance assessment of multi-object tracking (MOT) algorithms. In particular, one quasi-metric is an extension of the generalised optimal subpattern assignment (GOSPA) metric and measures the discrepancy between sets of objects. The other quasi-metric is an extension of the trajectory GOSPA (T-GOSPA) metric and measures the discrepancy between sets of trajectories. Similar to the GOSPA-based metrics, these quasi-metrics include costs for localisation error for properly detected objects, the number of false objects and the number of missed objects. The T-GOSPA quasi-metric also includes a track switching cost. Differently from the GOSPA and T-GOSPA metrics, the proposed quasi-metrics have the flexibility of penalising missed and false objects with different costs, and the localisation costs are not required to be symmetric. These properties can be useful in MOT evaluation in certain applications. The performance of several Bayesian MOT algorithms is assessed with the T-GOSPA quasi-metric via simulations.
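
For context, the first display below is the standard GOSPA metric in its $\alpha=2$ form; the second is the kind of asymmetric-cost variant the abstract describes, with missed objects in $X$ penalized by $c_m$, false objects in $Y$ by $c_f$, and a base cost $\bar d$ not required to be symmetric. The second display is a plausible reading of the abstract, not the authors' exact definition.

$$
d_p^{(c,2)}(X,Y)=\Big(\min_{\gamma\in\Gamma}\ \sum_{(i,j)\in\gamma} d(x_i,y_j)^p \;+\; \frac{c^p}{2}\big(|X|+|Y|-2|\gamma|\big)\Big)^{1/p}
$$

$$
\bar d_p(X,Y)=\Big(\min_{\gamma\in\Gamma}\ \sum_{(i,j)\in\gamma} \bar d(x_i,y_j)^p \;+\; \frac{c_m^p}{2}\big(|X|-|\gamma|\big) \;+\; \frac{c_f^p}{2}\big(|Y|-|\gamma|\big)\Big)^{1/p}
$$

Here $\gamma$ ranges over partial assignments between $X$ and $Y$; setting $c_m=c_f=c$ and a symmetric $\bar d$ recovers the original metric.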

[116] Learning Spectral Diffusion Prior for Hyperspectral Image Reconstruction

Mingyang Yu, Zhijian Wu, Dingjiang Huang

Main category: cs.CV

TL;DR: The paper proposes a Spectral Diffusion Prior (SDP) and Spectral Prior Injector Module (SPIM) to enhance HSI reconstruction by capturing high-frequency details, outperforming existing methods by 0.5 dB.

DetailsMotivation: Existing deep learning-based HSI reconstruction methods struggle to capture high-frequency details accurately.

Method: The paper introduces SDP, a prior learned via a diffusion model, and SPIM to dynamically guide HSI detail recovery.

Result: The method outperforms existing networks by about 0.5 dB when integrated into two representative HSI models, MST and BISRNet.

Conclusion: The proposed SDP and SPIM effectively improve HSI reconstruction performance by better capturing details.

Abstract: Hyperspectral image (HSI) reconstruction aims to recover 3D HSI from its degraded 2D measurements. Recently great progress has been made in deep learning-based methods, however, these methods often struggle to accurately capture high-frequency details of the HSI. To address this issue, this paper proposes a Spectral Diffusion Prior (SDP) that is implicitly learned from hyperspectral images using a diffusion model. Leveraging the powerful ability of the diffusion model to reconstruct details, this learned prior can significantly improve the performance when injected into the HSI model. To further improve the effectiveness of the learned prior, we also propose the Spectral Prior Injector Module (SPIM) to dynamically guide the model to recover the HSI details. We evaluate our method on two representative HSI methods: MST and BISRNet. Experimental results show that our method outperforms existing networks by about 0.5 dB, effectively improving the performance of HSI reconstruction.

[117] PoemTale Diffusion: Minimising Information Loss in Poem to Image Generation with Multi-Stage Prompt Refinement

Sofia Jamil, Bollampalli Areen Reddy, Raghvendra Kumar, Sriparna Saha, Koustava Goswami, K. J. Joseph

Main category: cs.CV

TL;DR: The paper introduces PoemTale Diffusion, a training-free method to improve image generation from poetic texts by refining prompts and modifying self-attention in diffusion models. It also releases the P4I dataset for research.

DetailsMotivation: Text-to-image models struggle with creative expressions like poetry, which often involve abstract or layered meanings. This work aims to enhance the interpretability of poetic texts for better image generation.

Method: The approach integrates a multi-stage prompt refinement loop into language models and modifies self-attention mechanisms in diffusion models to generate consistent images. The P4I dataset (1111 poems) supports this research.

Result: Human and quantitative evaluations confirm the method’s efficacy, showing improved information capture in images generated from poetic texts.

Conclusion: PoemTale Diffusion offers a novel perspective on poem-to-image generation, addressing the challenges of creative language interpretation and contributing a valuable dataset for future research.

Abstract: Recent advancements in text-to-image diffusion models have achieved remarkable success in generating realistic and diverse visual content. A critical factor in this process is the model’s ability to accurately interpret textual prompts. However, these models often struggle with creative expressions, particularly those involving complex, abstract, or highly descriptive language. In this work, we introduce a novel training-free approach tailored to improve image generation for a unique form of creative language: poetic verse, which frequently features layered, abstract, and dual meanings. Our proposed PoemTale Diffusion approach aims to minimise the information that is lost during poetic text-to-image conversion by integrating a multi-stage prompt refinement loop into Language Models to enhance the interpretability of poetic texts. To support this, we adapt existing state-of-the-art diffusion models by modifying their self-attention mechanisms with a consistent self-attention technique to generate multiple consistent images, which are then collectively used to convey the poem’s meaning. Moreover, to encourage research in the field of poetry, we introduce the P4I (PoemForImage) dataset, consisting of 1111 poems sourced from multiple online and offline resources. We engaged a panel of poetry experts for qualitative assessments. The results from both human and quantitative evaluations validate the efficacy of our method and contribute a novel perspective to poem-to-image generation with enhanced information capture in the generated images.

[118] Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang

Main category: cs.CV

TL;DR: The paper introduces ClearVQA, a benchmark to evaluate VLMs’ ability to resolve ambiguities in VQA through interaction, addressing gaps in existing research.

DetailsMotivation: Existing VQA research focuses on rephrasing ambiguous questions but ignores interactive clarification, which is natural in user-VLM interactions.

Method: The authors propose ClearVQA, a benchmark targeting three common ambiguity categories in VQA, designed to assess VLMs’ interactive clarification capabilities.

Result: ClearVQA provides a framework to evaluate VLMs’ ability to handle ambiguities through interaction, filling a gap in current benchmarks.

Conclusion: ClearVQA addresses the lack of benchmarks for interactive clarification in VQA and encourages VLMs to seek user feedback, improving ambiguity resolution.

Abstract: In visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in VQA context, and encompasses various VQA scenarios.

[119] Localized FNO for Spatiotemporal Hemodynamic Upsampling in Aneurysm MRI

Kyriakos Flouris, Moritz Halter, Yolanne Y. R. Lee, Samuel Castonguay, Luuk Jacobs, Pietro Dirix, Jonathan Nestmann, Sebastian Kozerke, Ender Konukoglu

Main category: cs.CV

TL;DR: LoFNO enhances hemodynamic analysis by improving spatiotemporal resolution and predicting wall shear stress directly from clinical imaging data, outperforming traditional methods.

DetailsMotivation: Low spatiotemporal resolution and signal-to-noise ratio in magnetic resonance flow imaging limit its diagnostic utility for aneurysm rupture prediction.

Method: Proposes LoFNO, a 3D architecture integrating Laplacian eigenvectors for geometric priors and EDSR for upsampling, combining neural operator frameworks for de-noising and upsampling.

Result: Superior velocity and WSS predictions compared to interpolation and other deep learning methods.

Conclusion: LoFNO enables more precise cerebrovascular diagnostics by enhancing resolution and accuracy.

Abstract: Hemodynamic analysis is essential for predicting aneurysm rupture and guiding treatment. While magnetic resonance flow imaging enables time-resolved volumetric blood velocity measurements, its low spatiotemporal resolution and signal-to-noise ratio limit its diagnostic utility. To address this, we propose the Localized Fourier Neural Operator (LoFNO), a novel 3D architecture that enhances both spatial and temporal resolution with the ability to predict wall shear stress (WSS) directly from clinical imaging data. LoFNO integrates Laplacian eigenvectors as geometric priors for improved structural awareness on irregular, unseen geometries and employs an Enhanced Deep Super-Resolution Network (EDSR) layer for robust upsampling. By combining geometric priors with neural operator frameworks, LoFNO de-noises and spatiotemporally upsamples flow data, achieving superior velocity and WSS predictions compared to interpolation and alternative deep learning methods, enabling more precise cerebrovascular diagnostics.

[120] Augmented Reality in Cultural Heritage: A Dual-Model Pipeline for 3D Artwork Reconstruction

Daniele Pannone, Alessia Castronovo, Maurizio Mancini, Gian Luca Foresti, Claudio Piciarelli, Rossana Gabrieli, Muhammad Yasir Bilal, Danilo Avola

Main category: cs.CV

TL;DR: An AR pipeline for museums uses two pre-trained depth models (GLPN and Depth-Anything) to create accurate 3D models from single images, improving reconstruction and AR experiences.

DetailsMotivation: To enhance museum visitor engagement by providing immersive AR experiences through accurate 3D reconstructions of artworks.

Method: Integrates GLPN for global scene structure and Depth-Anything for local details, converting depth maps into point clouds and meshes.

Result: Significant improvements in reconstruction accuracy and visual realism, making the system robust for museums.

Conclusion: The proposed pipeline is effective for creating interactive digital content in museums, enhancing visitor experiences.

Abstract: This paper presents an innovative augmented reality pipeline tailored for museum environments, aimed at recognizing artworks and generating accurate 3D models from single images. By integrating two complementary pre-trained depth estimation models, i.e., GLPN for capturing global scene structure and Depth-Anything for detailed local reconstruction, the proposed approach produces optimized depth maps that effectively represent complex artistic features. These maps are then converted into high-quality point clouds and meshes, enabling the creation of immersive AR experiences. The methodology leverages state-of-the-art neural network architectures and advanced computer vision techniques to overcome challenges posed by irregular contours and variable textures in artworks. Experimental results demonstrate significant improvements in reconstruction accuracy and visual realism, making the system a highly robust tool for museums seeking to enhance visitor engagement through interactive digital content.

[121] One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion

Haoang Lu, Yuanqi Su, Xiaoning Zhang, Hao Hu

Main category: cs.CV

TL;DR: CF-SSC introduces a temporal SSC framework using pseudo-future frame prediction to enhance 3D scene completion, outperforming existing methods.

DetailsMotivation: Existing monocular SSC methods struggle with occlusions and limited field of view in real-world traffic scenarios.

Method: CF-SSC combines poses and depths for accurate 3D correspondences, fusing past, present, and predicted future frames in 3D space.

Result: Achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360 benchmarks, improving occlusion reasoning and accuracy.

Conclusion: CF-SSC effectively addresses occlusion challenges and enhances 3D scene completion through temporal modeling.

Abstract: In recent years, visual 3D Semantic Scene Completion (SSC) has emerged as a critical perception task for autonomous driving due to its ability to infer complete 3D scene layouts and semantics from single 2D images. However, in real-world traffic scenarios, a significant portion of the scene remains occluded or outside the camera’s field of view – a fundamental challenge that existing monocular SSC methods fail to address adequately. To overcome these limitations, we propose Creating the Future SSC (CF-SSC), a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model’s effective perceptual range. Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically-consistent fusion of past, present, and predicted future frames in 3D space. Unlike conventional methods that rely on simple feature stacking, our 3D-aware architecture achieves more robust scene completion by explicitly modeling spatial-temporal relationships. Comprehensive experiments on SemanticKITTI and SSCBench-KITTI-360 benchmarks demonstrate state-of-the-art performance, validating the effectiveness of our approach, highlighting our method’s ability to improve occlusion reasoning and 3D scene completion accuracy.
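
The pose-plus-depth correspondence step boils down to unprojecting pixels with the camera intrinsics and re-expressing the points under a relative pose (e.g., one predicted for a pseudo-future frame). A minimal NumPy sketch with assumed matrix conventions:

```python
import numpy as np

def unproject(depth, K):
    """depth: (H, W); K: 3x3 intrinsics. Returns (H*W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    return (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

def to_future_frame(pts, T_future_from_now):
    """pts: (N, 3); T: 4x4 relative pose (here, a predicted future pose)."""
    homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    return (T_future_from_now @ homo.T).T[:, :3]

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = unproject(np.full((480, 640), 5.0), K)       # flat 5 m depth, for testing
print(to_future_frame(pts, np.eye(4)).shape)       # (307200, 3)
```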

[122] Encapsulated Composition of Text-to-Image and Text-to-Video Models for High-Quality Video Synthesis

Tongtong Su, Chengyu Wang, Bingyan Liu, Jun Huang, Dongming Lu

Main category: cs.CV

TL;DR: EVS combines text-to-image (T2I) and text-to-video (T2V) models to improve video quality and motion smoothness without additional training.

DetailsMotivation: Existing T2V models struggle with visual fidelity and motion consistency, often causing flickering and artifacts.

Method: EVS uses a diffusion-based T2I model to refine frames and T2V backbones for motion dynamics, encapsulating their strengths.

Result: EVS enhances video quality and motion smoothness, with a 1.6x-4.5x speedup in inference time.

Conclusion: EVS effectively leverages T2I and T2V models, outperforming previous methods in quality and efficiency.

Abstract: In recent years, large text-to-video (T2V) synthesis models have garnered considerable attention for their abilities to generate videos from textual descriptions. However, achieving both high imaging quality and effective motion representation remains a significant challenge for these T2V models. Existing approaches often adapt pre-trained text-to-image (T2I) models to refine video frames, leading to issues such as flickering and artifacts due to inconsistencies across frames. In this paper, we introduce EVS, a training-free Encapsulated Video Synthesizer that composes T2I and T2V models to enhance both visual fidelity and motion smoothness of generated videos. Our approach utilizes a well-trained diffusion-based T2I model to refine low-quality video frames by treating them as out-of-distribution samples, effectively optimizing them with noising and denoising steps. Meanwhile, we employ T2V backbones to ensure consistent motion dynamics. By encapsulating the T2V temporal-only prior into the T2I generation process, EVS successfully leverages the strengths of both types of models, resulting in videos of improved imaging and motion quality. Experimental results validate the effectiveness of our approach compared to previous approaches. Our composition process also leads to a significant improvement of 1.6x-4.5x speedup in inference time. Source codes: https://github.com/Tonniia/EVS.
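
The frame-refinement step is SDEdit-like: partially re-noise a low-quality frame, then run the tail of a pretrained T2I sampler. The sketch below captures only that control flow; `t2i_denoise_step`, the sigma schedule, and the `strength` knob are hypothetical stand-ins, not the paper's exact procedure.

```python
import torch

def refine_frame(frame, t2i_denoise_step, sigmas, strength=0.4):
    """frame: (C, H, W) in [-1, 1]; sigmas: decreasing noise levels.
    Treats the frame as out-of-distribution, re-enters the diffusion chain
    partway, and denoises back to the data manifold with the T2I model."""
    start = int(len(sigmas) * strength)
    x = frame + sigmas[start] * torch.randn_like(frame)
    for i in range(start, len(sigmas) - 1):
        x = t2i_denoise_step(x, sigmas[i], sigmas[i + 1])   # one sampler step
    return x

# Hypothetical usage: refined = refine_frame(frame, sampler_step, sigma_schedule)
```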

[123] Feature Engineering is Not Dead: Reviving Classical Machine Learning with Entropy, HOG, and LBP Feature Fusion for Image Classification

Abhijit Sen, Giridas Maiti, Bikram K. Parida, Bhanu P. Mishra, Mahima Arya, Denys I. Bondar

Main category: cs.CV

TL;DR: The paper proposes a novel feature extraction method for image classification using Permutation Entropy (PE), combined with HOG and LBP, achieving competitive results without deep learning.

DetailsMotivation: To address the need for interpretable and computationally efficient alternatives to deep learning models in image classification.

Method: Extends PE to 2D images, integrates HOG and LBP for feature extraction, and trains SVM classifiers with the resulting 780-dimensional feature set.

Result: Achieves competitive performance on benchmark datasets (Fashion-MNIST, KMNIST, EMNIST, CIFAR-10) without deep architectures.

Conclusion: The fusion of PE, HOG, and LBP offers a lightweight, interpretable, and effective solution for image classification, showcasing the potential of entropy-based descriptors.

Abstract: Feature engineering continues to play a critical role in image classification, particularly when interpretability and computational efficiency are prioritized over deep learning models with millions of parameters. In this study, we revisit classical machine learning based image classification through a novel approach centered on Permutation Entropy (PE), a robust and computationally lightweight measure traditionally used in time series analysis but rarely applied to image data. We extend PE to two-dimensional images and propose a multiscale, multi-orientation entropy-based feature extraction approach that characterizes spatial order and complexity along rows, columns, diagonals, anti-diagonals, and local patches of the image. To enhance the discriminatory power of the entropy features, we integrate two classic image descriptors: the Histogram of Oriented Gradients (HOG) to capture shape and edge structure, and Local Binary Patterns (LBP) to encode micro-texture of an image. The resulting hand-crafted feature set, comprising 780 dimensions, is used to train Support Vector Machine (SVM) classifiers optimized through grid search. The proposed approach is evaluated on multiple benchmark datasets, including Fashion-MNIST, KMNIST, EMNIST, and CIFAR-10, where it delivers competitive classification performance without relying on deep architectures. Our results demonstrate that the fusion of PE with HOG and LBP provides a compact, interpretable, and effective alternative to computationally expensive deep learning models with limited interpretability. This shows the potential of entropy-based descriptors in image classification and contributes a lightweight and generalizable solution to interpretable machine learning in image classification and computer vision.
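
A compact sketch of the fusion using scikit-image, with a simple ordinal-pattern permutation entropy over rows and columns; the paper's full multiscale, multi-orientation layout that yields exactly 780 dimensions is not reproduced here.

```python
import numpy as np
from itertools import permutations
from skimage.feature import hog, local_binary_pattern

def permutation_entropy(seq, order=3):
    """Shannon entropy of ordinal patterns of length `order` in a 1D sequence."""
    patterns = list(permutations(range(order)))
    counts = np.zeros(len(patterns))
    for i in range(len(seq) - order + 1):
        counts[patterns.index(tuple(np.argsort(seq[i:i + order])))] += 1
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def extract_features(img):                     # img: 2D grayscale array in [0, 1]
    pe_rows = np.mean([permutation_entropy(row) for row in img])
    pe_cols = np.mean([permutation_entropy(col) for col in img.T])
    hog_feat = hog(img, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))     # shape and edge structure
    lbp = local_binary_pattern((img * 255).astype(np.uint8), P=8, R=1,
                               method="uniform")   # micro-texture codes
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([[pe_rows, pe_cols], hog_feat, lbp_hist])

feat = extract_features(np.random.rand(28, 28))    # e.g., a Fashion-MNIST image
print(feat.shape)                                  # feed such vectors to an SVM
```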

[124] Team of One: Cracking Complex Video QA with Model Synergy

Jun Xie, Zhaoran Zhao, Xiongjun Guan, Yingjian Zhu, Hongzhu Yi, Xinming Wang, Feng Chen, Zhepeng Wang

Main category: cs.CV

TL;DR: A novel framework for open-ended video question answering improves reasoning depth and robustness by integrating multiple Video-Language Models (VLMs) with structured chains of thought and an LLM evaluator, outperforming existing methods.

DetailsMotivation: Existing Video-LMMs lack contextual understanding, temporal modeling, and generalization to complex queries, prompting the need for a more robust solution.

Method: The framework uses a prompting-and-response integration mechanism to coordinate multiple VLMs with structured reasoning pathways, evaluated and integrated by an LLM.

Result: The method outperforms baselines in all metrics, demonstrating superior generalization and robustness.

Conclusion: The approach provides a lightweight, extensible strategy for advancing multimodal reasoning without retraining, setting a foundation for future Video-LMM development.

Abstract: We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.

[125] SuperCM: Improving Semi-Supervised Learning and Domain Adaptation through differentiable clustering

Durgesh Singh, Ahcène Boubekki, Robert Jenssen, Michael Kampffmeyer

Main category: cs.CV

TL;DR: The paper introduces a differentiable clustering module for SSL and UDA, explicitly leveraging supervised data for centroids, showing effectiveness in low supervision regimes.

DetailsMotivation: To enhance SSL and UDA by explicitly enforcing the clustering assumption, improving performance with limited labeled data.

Method: Uses a differentiable clustering module, integrating supervised data for centroid computation, and employs end-to-end training.

Result: Demonstrates effectiveness in SSL and UDA, especially in low supervision, as both a standalone model and a regularizer.

Conclusion: The approach is simple yet effective, offering benefits for learning with limited supervision.

Abstract: Semi-Supervised Learning (SSL) and Unsupervised Domain Adaptation (UDA) enhance the model performance by exploiting information from labeled and unlabeled data. The clustering assumption has proven advantageous for learning with limited supervision and states that data points belonging to the same cluster in a high-dimensional space should be assigned to the same category. Recent works have utilized different training mechanisms to implicitly enforce this assumption for the SSL and UDA. In this work, we take a different approach by explicitly involving a differentiable clustering module which is extended to leverage the supervised data to compute its centroids. We demonstrate the effectiveness of our straightforward end-to-end training strategy for SSL and UDA over extensive experiments and highlight its benefits, especially in low supervision regimes, both as a standalone model and as a regularizer for existing approaches.
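
A minimal sketch of a differentiable clustering regularizer whose centroids come from labeled data, matching the summary above in spirit; the temperature and the entropy-style loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_centroids(feats_l, labels, num_classes):
    """Mean embedding per class from the labeled batch."""
    one_hot = F.one_hot(labels, num_classes).float()          # (Nl, K)
    return (one_hot.T @ feats_l) / (one_hot.sum(0).unsqueeze(1) + 1e-8)

def clustering_loss(feats_u, centroids, temp=0.1):
    """Encourage confident assignment of unlabeled features to some centroid,
    enforcing the clustering assumption in a fully differentiable way."""
    d = torch.cdist(feats_u, centroids)                       # (Nu, K)
    q = F.softmax(-d / temp, dim=1)                           # soft assignments
    return -(q * q.clamp_min(1e-8).log()).sum(1).mean()       # entropy minimization

feats_l, labels = torch.randn(32, 64), torch.randint(0, 10, (32,))
feats_u = torch.randn(128, 64)
print(clustering_loss(feats_u, supervised_centroids(feats_l, labels, 10)))
```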

[126] When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models

Francesco Ortu, Zhijing Jin, Diego Doimo, Alberto Cazzaniga

Main category: cs.CV

TL;DR: The paper investigates how vision-language models (VLMs) handle conflicts between internal knowledge and external visual inputs, identifying specific model heads that control these conflicts and demonstrating their role in steering model responses.

DetailsMotivation: Understanding how VLMs resolve conflicts between internal parametric knowledge and external visual information to prevent hallucinations and unreliable responses.

Method: Introduces a dataset of multimodal counterfactual queries to analyze conflict resolution, localizes controlling heads via logit inspection, and modifies these heads to steer model behavior.

Result: Identifies specific heads that control conflicts, shows their modification can steer responses, and demonstrates their attention outperforms gradient-based attribution in pinpointing image regions.

Conclusion: The study provides insights into conflict resolution mechanisms in VLMs, offering a way to control model behavior and improve reliability.

Abstract: Vision-language models (VLMs) increasingly leverage diverse knowledge sources to address complex tasks, often encountering conflicts between their internal parametric knowledge and external information. Knowledge conflicts can result in hallucinations and unreliable responses, but the mechanisms governing such interactions remain unknown. To address this gap, we analyze the mechanisms that VLMs use to resolve cross-modal conflicts by introducing a dataset of multimodal counterfactual queries that deliberately contradict internal commonsense knowledge. We localize with logit inspection a small set of heads that control the conflict. Moreover, by modifying these heads, we can steer the model towards its internal knowledge or the visual inputs. Finally, we show that attention from such heads pinpoints localized image regions driving visual overrides, outperforming gradient-based attribution in precision.
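
Once the controlling heads are localized, steering can be as simple as rescaling their output channels with a forward hook. The sketch below assumes a tensor-valued attention output of shape (batch, tokens, heads x head_dim); the layer and head indices, and the scaling rule itself, are hypothetical.

```python
import torch

def scale_heads_hook(head_ids, num_heads, alpha):
    """Returns a hook that rescales the chosen heads in a (B, T, H*Dh) output."""
    def hook(module, inputs, output):
        b, t, d = output.shape
        out = output.view(b, t, num_heads, d // num_heads).clone()
        out[:, :, head_ids] *= alpha          # alpha < 1: damp these heads;
        return out.view(b, t, d)              # alpha > 1: amplify them
    return hook

# Hypothetical usage on one attention block of a VLM (indices illustrative):
# handle = model.layers[17].self_attn.register_forward_hook(
#     scale_heads_hook(head_ids=[3, 9], num_heads=32, alpha=0.2))
# ... run inference, then handle.remove()
```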

[127] DynFaceRestore: Balancing Fidelity and Quality in Diffusion-Guided Blind Face Restoration with Dynamic Blur-Level Mapping and Guidance

Huu-Phu Do, Yu-Wei Chen, Yi-Cheng Liao, Chi-Wei Hsiao, Han-Yang Wang, Wei-Chen Chiu, Ching-Chun Huang

Main category: cs.CV

TL;DR: DynFaceRestore dynamically adjusts diffusion sampling and guidance for blind face restoration, balancing fidelity and quality.

DetailsMotivation: Existing methods use fixed diffusion parameters, leading to under- or over-diffusion and imbalanced results.

Method: Learns to map degraded inputs to Gaussian blurry images, dynamically selects timesteps, and adjusts guidance scaling locally.

Result: Achieves state-of-the-art performance in quantitative and qualitative evaluations.

Conclusion: DynFaceRestore effectively balances fidelity and quality in blind face restoration.

Abstract: Blind Face Restoration aims to recover high-fidelity, detail-rich facial images from unknown degraded inputs, presenting significant challenges in preserving both identity and detail. Pre-trained diffusion models have been increasingly used as image priors to generate fine details. Still, existing methods often use fixed diffusion sampling timesteps and a global guidance scale, assuming uniform degradation. This limitation and potentially imperfect degradation kernel estimation frequently lead to under- or over-diffusion, resulting in an imbalance between fidelity and quality. We propose DynFaceRestore, a novel blind face restoration approach that learns to map any blindly degraded input to Gaussian blurry images. By leveraging these blurry images and their respective Gaussian kernels, we dynamically select the starting timesteps for each blurry image and apply closed-form guidance during the diffusion sampling process to maintain fidelity. Additionally, we introduce a dynamic guidance scaling adjuster that modulates the guidance strength across local regions, enhancing detail generation in complex areas while preserving structural fidelity in contours. This strategy effectively balances the trade-off between fidelity and quality. DynFaceRestore achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating robustness and effectiveness in blind face restoration.

[128] NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh, Georgii Fedorov, Bulat Suleimanov, Vladimir Dokholyan, Aleksandr Gordeev

Main category: cs.CV

TL;DR: An automated pipeline for mining high-fidelity image-text triplets for training generative models, releasing a large open dataset (NHR-Edit) and a fine-tuned model (Bagel-NHR-Edit).

DetailsMotivation: The need for scalable, high-quality training data for image editing assistants without relying on human annotation.

Method: Uses public generative models, a Gemini validator for scoring, and techniques like inversion and compositional bootstrapping to expand the dataset.

Result: Created NHR-Edit (358k triplets) and Bagel-NHR-Edit, achieving state-of-the-art performance in evaluations.

Conclusion: The approach automates data mining, enabling large-scale training and democratizing research in generative modeling.

Abstract: Recent advances in generative modeling enable image editing assistants that follow natural language instructions without additional user input. Their supervised training requires millions of triplets: original image, instruction, edited image. Yet mining pixel-accurate examples is hard. Each edit must affect only prompt-specified regions, preserve stylistic coherence, respect physical plausibility, and retain visual appeal. The lack of robust automated edit-quality metrics hinders reliable automation at scale. We present an automated, modular pipeline that mines high-fidelity triplets across domains, resolutions, instruction complexities, and styles. Built on public generative models and running without human intervention, our system uses a task-tuned Gemini validator to score instruction adherence and aesthetics directly, removing any need for segmentation or grounding models. Inversion and compositional bootstrapping enlarge the mined set by approximately 2.2x, enabling large-scale high-fidelity training data. By automating the most repetitive annotation steps, the approach allows a new scale of training without human labeling effort. To democratize research in this resource-intensive area, we release NHR-Edit: an open dataset of 358k high-quality triplets. In the largest cross-dataset evaluation, it surpasses all public alternatives. We also release Bagel-NHR-Edit, an open-source fine-tuned Bagel model, which achieves state-of-the-art metrics in our experiments.
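
The core of the pipeline is a validator-gated mining loop: generate an edit, score it, keep the triplet only if it passes. A minimal sketch, with `generate_edit` and `score_with_validator` as hypothetical stand-ins for the public generative model and the task-tuned Gemini validator:

```python
# Minimal sketch of validator-gated triplet mining. generate_edit and
# score_with_validator are hypothetical stand-ins for the public generative
# model and the task-tuned Gemini validator described in the abstract.
def mine_triplets(images, instructions, generate_edit, score_with_validator,
                  adherence_min=0.9, aesthetic_min=0.8):
    kept = []
    for img, instr in zip(images, instructions):
        edited = generate_edit(img, instr)
        adherence, aesthetics = score_with_validator(img, instr, edited)
        if adherence >= adherence_min and aesthetics >= aesthetic_min:
            kept.append((img, instr, edited))      # high-fidelity triplet
    return kept

# Toy demo with stub components; inversion and compositional bootstrapping
# would then expand `kept` by reusing accepted edits.
triplets = mine_triplets(
    images=["img0", "img1"], instructions=["add a hat", "remove the car"],
    generate_edit=lambda img, instr: f"{img}+edit",
    score_with_validator=lambda img, instr, edited: (0.95, 0.9),
)
print(len(triplets))  # 2
```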

[129] Real-Time Fusion of Visual and Chart Data for Enhanced Maritime Vision

Marten Kreis, Benjamin Kiefer

Main category: cs.CV

TL;DR: A novel method enhances marine vision by fusing real-time video with nautical charts using a transformer-based neural network for accurate buoy detection and matching.

DetailsMotivation: Improving marine navigation by integrating live visual data with chart information for better accuracy in dynamic environments.

Method: A transformer-based end-to-end neural network predicts buoy bounding boxes and confidence scores, enabling direct matching of image detections with chart markers; it is compared against ray-casting and YOLOv7-based baselines.

Result: Significant improvement in object localization and association accuracy in real-world maritime scenes.

Conclusion: The proposed method outperforms baseline approaches, offering robust performance in challenging marine environments.

Abstract: This paper presents a novel approach to enhancing marine vision by fusing real-time visual data with chart information. Our system overlays nautical chart data onto live video feeds by accurately matching detected navigational aids, such as buoys, with their corresponding representations in chart data. To achieve robust association, we introduce a transformer-based end-to-end neural network that predicts bounding boxes and confidence scores for buoy queries, enabling the direct matching of image-domain detections with world-space chart markers. The proposed method is compared against baseline approaches, including a ray-casting model that estimates buoy positions via camera projection and a YOLOv7-based network extended with a distance estimation module. Experimental results on a dataset of real-world maritime scenes demonstrate that our approach significantly improves object localization and association accuracy in dynamic and challenging environments.
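
Direct matching of image-domain detections with world-space chart markers can be illustrated as a one-to-one Hungarian assignment. The cost below (projected pixel distance minus a confidence bonus) is an assumed stand-in, not the paper's learned matching:

```python
# Illustrative one-to-one assignment of predicted buoy boxes to chart markers
# via the Hungarian algorithm. The cost (projected pixel distance minus a
# confidence bonus) is an assumed stand-in, not the paper's learned matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

det_xy = np.array([[120.0, 340.0], [400.0, 310.0], [640.0, 355.0]])  # detections (px)
det_conf = np.array([0.95, 0.60, 0.88])                              # predicted confidences
chart_xy = np.array([[115.0, 338.0], [650.0, 350.0]])                # chart markers in image space

cost = np.linalg.norm(det_xy[:, None, :] - chart_xy[None, :, :], axis=-1)
cost -= 50.0 * det_conf[:, None]               # prefer confident detections
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"detection {r} (conf {det_conf[r]:.2f}) -> chart marker {c}")
```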

[130] GRAM-MAMBA: Holistic Feature Alignment for Wireless Perception with Adaptive Low-Rank Compensation

Weiqi Yang, Xu Zhou, Jingfu Guan, Hao Du, Tianyu Bai

Main category: cs.CV

TL;DR: GRAM-MAMBA is a framework for efficient and robust multimodal fusion in IoT, addressing challenges like high complexity, unidirectional alignment, and missing data. It uses Mamba for time-series processing, GRAM matrix for alignment, and LoRA-inspired adaptation for missing data.

DetailsMotivation: Existing IoT multimodal systems struggle with high complexity, poor alignment, and robustness to missing data, limiting real-world deployment.

Method: GRAM-MAMBA combines Mamba for efficient time-series processing, GRAM matrix for pairwise alignment, and adaptive low-rank layers for handling missing modalities.

Result: On SPAWC2021, the pre-trained model shows lower positioning error than baselines, and adapting to missing modalities yields a 24.5% performance boost while training under 0.2% of parameters. On USC-HAD, it achieves 93.55% F1 and 93.81% OA, with the update strategy improving F1 by 23% while training under 0.3% of parameters.

Conclusion: GRAM-MAMBA offers efficient, robust multimodal perception for resource-constrained IoT environments, validated by superior performance and adaptability.

Abstract: Multi-modal fusion is crucial for Internet of Things (IoT) perception, widely deployed in smart homes, intelligent transport, industrial automation, and healthcare. However, existing systems often face challenges: high model complexity hinders deployment in resource-constrained environments, unidirectional modal alignment neglects inter-modal relationships, and robustness suffers when sensor data is missing. These issues impede efficient and robust multimodal perception in real-world IoT settings. To overcome these limitations, we propose GRAM-MAMBA. This framework utilizes the linear-complexity Mamba model for efficient sensor time-series processing, combined with an optimized GRAM matrix strategy for pairwise alignment among modalities, addressing the shortcomings of traditional single-modality alignment. Inspired by Low-Rank Adaptation (LoRA), we introduce an adaptive low-rank layer compensation strategy to handle missing modalities post-training. This strategy freezes the pre-trained model core and irrelevant adaptive layers, fine-tuning only those related to available modalities and the fusion process. Extensive experiments validate GRAM-MAMBA’s effectiveness. On the SPAWC2021 indoor positioning dataset, the pre-trained model shows lower error than baselines; adapting to missing modalities yields a 24.5% performance boost by training less than 0.2% of parameters. On the USC-HAD human activity recognition dataset, it achieves 93.55% F1 and 93.81% Overall Accuracy (OA), outperforming prior work; the update strategy increases F1 by 23% while training less than 0.3% of parameters. These results highlight GRAM-MAMBA’s potential for achieving efficient and robust multimodal perception in resource-constrained environments.
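
A minimal sketch of the Gram-matrix alignment idea, assuming the objective matches the sample-similarity structure between each modality pair rather than forcing one embedding onto the other (the paper's exact loss may differ):

```python
# Illustrative Gram-matrix alignment loss between two modalities: match the
# sample-similarity structure of their embeddings rather than forcing one
# embedding onto the other. Not necessarily the paper's exact objective.
import torch
import torch.nn.functional as F

def gram(feats):                       # feats: [batch, dim]
    f = F.normalize(feats, dim=-1)
    return f @ f.T                     # [batch, batch] similarity Gram matrix

def gram_alignment_loss(feats_a, feats_b):
    return F.mse_loss(gram(feats_a), gram(feats_b))

wifi = torch.randn(32, 128)            # e.g., features from one sensor's Mamba encoder
imu = torch.randn(32, 128)             # e.g., features from another sensor
print(gram_alignment_loss(wifi, imu).item())
```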

[131] Generalist Forecasting with Frozen Video Models via Latent Diffusion

Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar

Main category: cs.CV

TL;DR: The paper explores the link between a vision model’s perceptual ability and its forecasting performance, introducing a generalist framework using latent diffusion models for future feature prediction.

DetailsMotivation: To understand the correlation between perceptual ability and forecasting performance in vision models, aiming to improve temporally grounded video understanding.

Method: A novel generalist forecasting framework using latent diffusion models on frozen vision backbones, with task-specific readouts for decoding.

Result: Strong correlation found between perceptual ability and forecasting performance across diverse models and tasks.

Conclusion: Bridging representation learning and generative modeling enhances temporally grounded video understanding.

Abstract: Forecasting what will happen next is a critical skill for general-purpose systems that plan or act in the world at different levels of abstraction. In this paper, we identify a strong correlation between a vision model’s perceptual ability and its generalist forecasting performance over short time horizons. This trend holds across a diverse set of pretrained models, including those trained generatively, and across multiple levels of abstraction, from raw pixels to depth, point tracks, and object motion. The result is made possible by a novel generalist forecasting framework that operates on any frozen vision backbone: we train latent diffusion models to forecast future features in the frozen representation space, which are then decoded via lightweight, task-specific readouts. To enable consistent evaluation across tasks, we introduce distributional metrics that compare distributional properties directly in the space of downstream tasks and apply this framework to nine models and four tasks. Our results highlight the value of bridging representation learning and generative modeling for temporally grounded video understanding.

[132] SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing

Yingying Zhang, Lixiang Ru, Kang Wu, Lei Yu, Lei Liang, Yansheng Li, Jingdong Chen

Main category: cs.CV

TL;DR: SkySense V2 is a unified multi-modal remote sensing foundation model using a single transformer backbone, tailored SSL for RS data, and innovative modules like adaptive patch merging and MoE, outperforming SkySense by 1.8 points.

DetailsMotivation: Existing MM-RSFMs require separate backbones per modality, causing redundancy, and SSL methods do not suit RS image traits such as complex semantic distributions.

Method: Uses a single transformer backbone with tailored SSL, adaptive patch merging, learnable modality prompts, and MoE for enhanced performance.

Result: Outperforms SkySense by an average of 1.8 points across 16 datasets and 7 tasks, showing strong generalization.

Conclusion: SkySense V2 efficiently handles multi-modal RS data with improved performance and generalization.

Abstract: The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches generally require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.

[133] Depth3DLane: Fusing Monocular 3D Lane Detection with Self-Supervised Monocular Depth Estimation

Max van den Hoven, Kishaan Jeeveswaran, Pieter Piscaer, Thijs Wensveen, Elahe Arani, Bahram Zonooz

Main category: cs.CV

TL;DR: Depth3DLane is a dual-pathway framework for monocular 3D lane detection, integrating self-supervised depth estimation to avoid reliance on expensive sensors or ground-truth depth data.

DetailsMotivation: Existing methods rely on costly sensors or impractical ground-truth depth data, and assume known camera parameters, limiting scalability and applicability.

Method: Depth3DLane combines self-supervised depth estimation with dual pathways (bird’s-eye view for spatial info, front view for semantic info) and 3D lane anchors for accurate geometry inference. It also predicts camera parameters per-frame.

Result: Depth3DLane achieves competitive performance on OpenLane benchmark and works without ground-truth camera parameters.

Conclusion: The framework enables accurate 3D lane detection without costly sensors or calibration, expanding applicability to scenarios like crowdsourced HD mapping.

Abstract: Monocular 3D lane detection is essential for autonomous driving, but challenging due to the inherent lack of explicit spatial information. Multi-modal approaches rely on expensive depth sensors, while methods incorporating fully-supervised depth networks rely on ground-truth depth data that is impractical to collect at scale. Additionally, existing methods assume that camera parameters are available, limiting their applicability in scenarios like crowdsourced high-definition (HD) lane mapping. To address these limitations, we propose Depth3DLane, a novel dual-pathway framework that integrates self-supervised monocular depth estimation to provide explicit structural information, without the need for expensive sensors or additional ground-truth depth data. Leveraging a self-supervised depth network to obtain a point cloud representation of the scene, our bird’s-eye view pathway extracts explicit spatial information, while our front view pathway simultaneously extracts rich semantic information. Depth3DLane then uses 3D lane anchors to sample features from both pathways and infer accurate 3D lane geometry. Furthermore, we extend the framework to predict camera parameters on a per-frame basis and introduce a theoretically motivated fitting procedure to enhance stability on a per-segment basis. Extensive experiments demonstrate that Depth3DLane achieves competitive performance on the OpenLane benchmark dataset. Furthermore, experimental results show that using learned parameters instead of ground-truth parameters allows Depth3DLane to be applied in scenarios where camera calibration is infeasible, unlike previous methods.
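
The self-supervised depth pathway hinges on a standard unprojection step: lift the predicted depth map into a camera-frame point cloud using the intrinsics, which the framework can itself predict per frame. A sketch with made-up values:

```python
# Standard pinhole unprojection of a predicted depth map into a camera-frame
# point cloud, the input to the bird's-eye-view pathway. Values are made up;
# in the paper the intrinsics can themselves be predicted per frame.
import numpy as np

H, W = 4, 6
depth = np.full((H, W), 10.0)                 # self-supervised depth (metres)
fx = fy = 500.0
cx, cy = W / 2, H / 2                         # principal point (stand-in)

u, v = np.meshgrid(np.arange(W), np.arange(H))
x = (u - cx) / fx * depth                     # camera-frame X
y = (v - cy) / fy * depth                     # camera-frame Y
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
print(points.shape)                           # (24, 3)
```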

[134] PositionIC: Unified Position and Identity Consistency for Image Customization

Junjie Hu, Tianyang Han, Kai Ma, Jialin Gao, Hao Dou, Song Yang, Xianhua He, Jianhui Zhang, Junfeng Luo, Xiaoming Wei, Wenqiang Zhang

Main category: cs.CV

TL;DR: PositionIC introduces a framework for precise spatial control in multi-subject image customization, addressing the lack of scalable datasets with positional cues.

DetailsMotivation: Current image customization lacks fine-grained spatial control due to missing datasets with identity-position binding, limiting real-world applications.

Method: PositionIC uses a bidirectional generation pipeline for data synthesis and a positional modulation layer to decouple spatial embeddings, ensuring independent subject placement.

Result: The framework achieves precise spatial control and high consistency in image customization, validated by extensive experiments.

Conclusion: PositionIC enables controllable, high-fidelity customization in open-world scenarios and will be released for further research.

Abstract: Recent subject-driven image customization has achieved significant advancements in fidelity, yet fine-grained entity-level spatial control remains elusive, hindering the broader real-world application. This limitation is mainly attributed to the absence of scalable datasets that bind identity with precise positional cues. To this end, we introduce PositionIC, a unified framework that enforces position and identity consistency for multi-subject customization. We construct a scalable synthesis pipeline that employs a bidirectional generation paradigm to eliminate subject drift and maintain semantic coherence. On top of these data, we design a lightweight positional modulation layer that decouples spatial embeddings among subjects, enabling independent, accurate placement while preserving visual fidelity. Extensive experiments demonstrate that our approach can achieve precise spatial control while maintaining high consistency in image customization task. PositionIC paves the way for controllable, high-fidelity image customization in open-world, multi-entity scenarios and will be released to foster further research.

[135] CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models

Quang-Binh Nguyen, Minh Luu, Quang Nguyen, Anh Tran, Khoi Nguyen

Main category: cs.CV

TL;DR: CSD-VAR is a novel method for content-style decomposition using Visual Autoregressive Modeling, outperforming prior approaches with improved disentanglement and fidelity.

DetailsMotivation: To leverage VAR's scale-wise generation for better content-style disentanglement, addressing limitations in current diffusion-based methods.

Method: Introduces scale-aware optimization, SVD-based rectification, and Augmented K-V memory to enhance separation and identity preservation.

Result: CSD-VAR achieves superior content preservation and stylization fidelity, validated on the new CSD-100 dataset.

Conclusion: VAR is a viable framework for CSD, with CSD-VAR setting a new benchmark for performance in this task.

Abstract: Disentangling content and style from a single image, known as content-style decomposition (CSD), enables recontextualization of extracted content and stylization of extracted styles, offering greater creative flexibility in visual synthesis. While recent personalization methods have explored the decomposition of explicit content and style, they remain tailored for diffusion models. Meanwhile, Visual Autoregressive Modeling (VAR) has emerged as a promising alternative with a next-scale prediction paradigm, achieving performance comparable to that of diffusion models. In this paper, we explore VAR as a generative framework for CSD, leveraging its scale-wise generation process for improved disentanglement. To this end, we propose CSD-VAR, a novel method that introduces three key innovations: (1) a scale-aware alternating optimization strategy that aligns content and style representation with their respective scales to enhance separation, (2) an SVD-based rectification method to mitigate content leakage into style representations, and (3) an Augmented Key-Value (K-V) memory enhancing content identity preservation. To benchmark this task, we introduce CSD-100, a dataset specifically designed for content-style decomposition, featuring diverse subjects rendered in various artistic styles. Experiments demonstrate that CSD-VAR outperforms prior approaches, achieving superior content preservation and stylization fidelity.
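
The SVD-based rectification can be pictured as projecting the style representation off the dominant content subspace. The sketch below uses a rank-1 subspace and random tensors as illustrative assumptions, not the paper's settings:

```python
# Hedged sketch of SVD-based rectification: project the style representation
# off the dominant content subspace so content does not leak into style.
# The rank-1 subspace and random tensors are illustrative assumptions.
import torch

content = torch.randn(16, 512)    # content token embeddings
style = torch.randn(512)          # style embedding with possible content leakage

U, S, Vh = torch.linalg.svd(content, full_matrices=False)
top_dirs = Vh[:1]                 # dominant content direction(s)
style_clean = style - top_dirs.T @ (top_dirs @ style)   # remove that component
print(torch.dot(style_clean, Vh[0]).item())             # ~0: leakage removed
```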

[136] PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

Yu Wei, Jiahui Zhang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Main category: cs.CV

TL;DR: PCR-GS improves 3D Gaussian Splatting (3D-GS) by co-regularizing camera poses, addressing challenges in scenes with complex trajectories.

DetailsMotivation: Existing 3D-GS struggles with complex camera trajectories, leading to poor pose estimation and optimization issues.

Method: PCR-GS uses feature reprojection and wavelet-based frequency regularization to align semantic features and optimize camera poses.

Result: PCR-GS outperforms in pose-free 3D-GS modeling, especially in scenes with dramatic camera trajectory changes.

Conclusion: PCR-GS offers a robust solution for high-quality 3D scene reconstruction without relying on COLMAP.

Abstract: COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories featuring drastic rotation and translation across adjacent camera views, leading to degraded estimation of camera poses and further local minima in joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3D-GS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization which exploits discrepancy in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.

[137] Enhancing LiDAR Point Features with Foundation Model Priors for 3D Object Detection

Yujian Mo, Yan Wu, Junqiao Zhao, Jijun Wang, Yinghao Hu, Jun Yan

Main category: cs.CV

TL;DR: The paper enhances LiDAR-based 3D object detection by integrating depth priors from DepthAnything, improving point features and detection accuracy.

DetailsMotivation: To address the limited expressiveness of raw LiDAR point features, especially weak reflectance attributes, by leveraging dense depth priors from monocular RGB images.

Method: Fuses depth priors with LiDAR attributes, introduces a point-wise feature extraction module, and uses a Dual-Path RoI framework with a bidirectional gated fusion module.

Result: Improved detection accuracy on the KITTI benchmark.

Conclusion: Incorporating visual foundation model priors enhances LiDAR-based 3D object detection.

Abstract: Recent advances in foundation models have opened up new possibilities for enhancing 3D perception. In particular, DepthAnything offers dense and reliable geometric priors from monocular RGB images, which can complement sparse LiDAR data in autonomous driving scenarios. However, such priors remain underutilized in LiDAR-based 3D object detection. In this paper, we address the limited expressiveness of raw LiDAR point features, especially the weak discriminative capability of the reflectance attribute, by introducing depth priors predicted by DepthAnything. These priors are fused with the original LiDAR attributes to enrich each point’s representation. To leverage the enhanced point features, we propose a point-wise feature extraction module. Then, a Dual-Path RoI feature extraction framework is employed, comprising a voxel-based branch for global semantic context and a point-based branch for fine-grained structural details. To effectively integrate the complementary RoI features, we introduce a bidirectional gated RoI feature fusion module that balances global and local cues. Extensive experiments on the KITTI benchmark show that our method consistently improves detection accuracy, demonstrating the value of incorporating visual foundation model priors into LiDAR-based 3D object detection.
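
The fusion step amounts to projecting each LiDAR point into the image, sampling the DepthAnything prior at that pixel, and appending it to the point's attributes. A sketch with a stand-in calibration matrix and simplified shapes:

```python
# Sketch of the fusion idea: project each LiDAR point into the image, sample
# the DepthAnything prior at that pixel, and append it to the point attributes.
# The projection matrix and tensor shapes are simplified stand-ins.
import numpy as np

points = np.array([[1.0, -0.5, 10.0, 0.3]])   # (x, y, z, reflectance), camera frame
depth_prior = np.random.rand(375, 1242)        # dense monocular depth map
P = np.array([[700.0,   0.0, 621.0, 0.0],      # 3x4 projection matrix (stand-in)
              [  0.0, 700.0, 187.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])

xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
uvw = xyz1 @ P.T
u = np.clip((uvw[:, 0] / uvw[:, 2]).astype(int), 0, 1241)
v = np.clip((uvw[:, 1] / uvw[:, 2]).astype(int), 0, 374)
enriched = np.concatenate([points, depth_prior[v, u][:, None]], axis=1)
print(enriched.shape)                          # (1, 5): x, y, z, reflectance, depth prior
```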

[138] DiViD: Disentangled Video Diffusion for Static-Dynamic Factorization

Marzieh Gheisari, Auguste Genovesio

Main category: cs.CV

TL;DR: DiViD is a novel video diffusion framework for disentangling static appearance and dynamic motion in videos, outperforming existing methods with explicit factorization and reduced leakage.

DetailsMotivation: Existing VAE- and GAN-based approaches struggle with information leakage and blurry reconstructions in disentangling static and dynamic content in videos.

Method: DiViD uses a sequence encoder to extract static and dynamic tokens, a conditional DDPM decoder with shared-noise schedules, time-varying KL bottlenecks, and cross-attention, along with an orthogonality regularizer.

Result: DiViD achieves the highest swap-based joint accuracy, improves dynamic transfer, preserves static fidelity, and reduces cross-leakage compared to state-of-the-art methods.

Conclusion: DiViD effectively disentangles static and dynamic content in videos, setting a new benchmark for sequential disentanglement.

Abstract: Unsupervised disentanglement of static appearance and dynamic motion in video remains a fundamental challenge, often hindered by information leakage and blurry reconstructions in existing VAE- and GAN-based approaches. We introduce DiViD, the first end-to-end video diffusion framework for explicit static-dynamic factorization. DiViD’s sequence encoder extracts a global static token from the first frame and per-frame dynamic tokens, explicitly removing static content from the motion code. Its conditional DDPM decoder incorporates three key inductive biases: a shared-noise schedule for temporal consistency, a time-varying KL-based bottleneck that tightens at early timesteps (compressing static information) and relaxes later (enriching dynamics), and cross-attention that routes the global static token to all frames while keeping dynamic tokens frame-specific. An orthogonality regularizer further prevents residual static-dynamic leakage. We evaluate DiViD on real-world benchmarks using swap-based accuracy and cross-leakage metrics. DiViD outperforms state-of-the-art sequential disentanglement methods: it achieves the highest swap-based joint accuracy, preserves static fidelity while improving dynamic transfer, and reduces average cross-leakage.
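
The orthogonality regularizer can be written compactly: penalise the cosine similarity between the global static token and each frame's dynamic token. An illustrative form (the paper's exact regularizer may differ):

```python
# Illustrative form of the orthogonality regularizer: penalise cosine
# similarity between the global static token and each frame's dynamic token.
import torch
import torch.nn.functional as F

def orthogonality_loss(static_tok, dynamic_toks):
    # static_tok: [batch, dim]; dynamic_toks: [batch, frames, dim]
    s = F.normalize(static_tok, dim=-1).unsqueeze(1)
    d = F.normalize(dynamic_toks, dim=-1)
    return ((s * d).sum(-1) ** 2).mean()   # squared cosine pushed toward zero

print(orthogonality_loss(torch.randn(8, 256), torch.randn(8, 16, 256)).item())
```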

[139] VLA-Mark: A cross modal watermark for large vision-language alignment model

Shuliang Liu, Qi Zheng, Jesse Jiaxi Xu, Yibo Yan, He Geng, Aiwei Liu, Peijie Jiang, Jia Liu, Yik-Cheung Tam, Xuming Hu

Main category: cs.CV

TL;DR: VLA-Mark is a vision-aligned watermarking framework for vision-language models that preserves multimodal coherence while protecting intellectual property.

DetailsMotivation: Existing text watermarking methods disrupt visual-textual alignment and leave semantic-critical concepts vulnerable.

Method: VLA-Mark integrates multiscale visual-textual alignment metrics and an entropy-sensitive mechanism to guide watermark injection without model retraining.

Result: Achieves 7.4% lower PPL, 26.6% higher BLEU, 98.8% AUC detection, and 96.1% attack resilience.

Conclusion: VLA-Mark sets new standards for quality-preserving multimodal watermarking by maintaining text-visual consistency.

Abstract: Vision-language models demand watermarking solutions that protect intellectual property without compromising multimodal coherence. Existing text watermarking methods disrupt visual-textual alignment through biased token selection and static strategies, leaving semantic-critical concepts vulnerable. We propose VLA-Mark, a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns, to guide watermark injection without model retraining. An entropy-sensitive mechanism dynamically balances watermark strength and semantic preservation, prioritizing visual grounding during low-uncertainty generation phases. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection (98.8% AUC). The framework demonstrates 96.1% attack resilience against attacks such as paraphrasing and synonym substitution, while maintaining text-visual consistency, establishing new standards for quality-preserving multimodal watermarking.
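
The entropy-sensitive mechanism can be illustrated in the style of green-list logit watermarking: scale the injected bias by the next-token entropy so that low-uncertainty (often visually grounded) steps are barely perturbed. This is a hedged sketch of the gating idea only, not VLA-Mark's actual algorithm:

```python
# Hedged sketch of entropy-gated watermark injection in the style of green-list
# logit watermarking: bias "green" tokens less when the next-token distribution
# is low-entropy (the model is certain, often during visual grounding).
# VLA-Mark's actual mechanism differs; this only illustrates the gating idea.
import torch

def watermark_logits(logits, green_mask, delta_max=2.0):
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    strength = delta_max * entropy / max_entropy      # weak bias when certain
    return logits + strength.unsqueeze(-1) * green_mask

vocab = 1000
logits = torch.randn(1, vocab)
green_mask = (torch.rand(vocab) < 0.5).float()        # pseudo-random green list
print(watermark_logits(logits, green_mask).shape)     # torch.Size([1, 1000])
```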

[140] Evaluation of Human Visual Privacy Protection: A Three-Dimensional Framework and Benchmark Dataset

Sara Abdulaziz, Giacomo D’Amicantonio, Egor Bondarev

Main category: cs.CV

TL;DR: A framework for evaluating visual privacy-protection methods is introduced, along with the HR-VISPR dataset, to assess privacy, utility, and practicality.

DetailsMotivation: Addressing concerns over AI-powered surveillance and the need for objective privacy evaluation techniques.

Method: Proposes a three-dimensional framework (privacy, utility, practicality) and uses the HR-VISPR dataset to evaluate 11 privacy protection methods.

Result: The framework highlights trade-offs between privacy, utility, and practicality, aligning privacy levels with human perception.

Conclusion: The study provides a structured evaluation tool and dataset applicable in diverse contexts.

Abstract: Recent advances in AI-powered surveillance have intensified concerns over the collection and processing of sensitive personal data. In response, research has increasingly focused on privacy-by-design solutions, raising the need for objective techniques to evaluate privacy protection. This paper presents a comprehensive framework for evaluating visual privacy-protection methods across three dimensions: privacy, utility, and practicality. In addition, it introduces HR-VISPR, a publicly available human-centric dataset with biometric, soft-biometric, and non-biometric labels to train an interpretable privacy metric. We evaluate 11 privacy protection methods, ranging from conventional techniques to advanced deep-learning methods, through the proposed framework. The framework differentiates privacy levels in alignment with human visual perception, while highlighting trade-offs between privacy, utility, and practicality. This study, along with the HR-VISPR dataset, serves as an insightful tool and offers a structured evaluation framework applicable across diverse contexts.

[141] DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation

Haoran Li, Yuli Tian, Kun Lan, Yong Liao, Lin Wang, Pan Hui, Peng Yuan Zhou

Main category: cs.CV

TL;DR: DreamScene is an end-to-end framework for generating high-quality, editable 3D scenes from text or dialogue, addressing automation, consistency, and control challenges.

DetailsMotivation: Existing methods lack automation, 3D consistency, and fine-grained control for 3D scene generation from natural language.

Method: DreamScene uses a GPT-4 agent for scene planning, a graph-based placement algorithm for layouts, Formation Pattern Sampling for geometry, and progressive camera sampling for consistency. It also supports fine-grained editing.

Result: DreamScene outperforms prior methods in quality, consistency, and flexibility, enabling open-domain 3D content creation.

Conclusion: DreamScene provides a practical solution for generating and editing 3D scenes from text, advancing the field of 3D content creation.

Abstract: Generating 3D scenes from natural language holds great promise for applications in gaming, film, and design. However, existing methods struggle with automation, 3D consistency, and fine-grained control. We present DreamScene, an end-to-end framework for high-quality and editable 3D scene generation from text or dialogue. DreamScene begins with a scene planning module, where a GPT-4 agent infers object semantics and spatial constraints to construct a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Based on this layout, Formation Pattern Sampling (FPS) generates object geometry using multi-timestep sampling and reconstructive optimization, enabling fast and realistic synthesis. To ensure global consistency, DreamScene employs a progressive camera sampling strategy tailored to both indoor and outdoor settings. Finally, the system supports fine-grained scene editing, including object movement, appearance changes, and 4D dynamic motion. Experiments demonstrate that DreamScene surpasses prior methods in quality, consistency, and flexibility, offering a practical solution for open-domain 3D content creation. Code and demos are available at https://dreamscene-project.github.io.

[142] Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations

Yong Feng, Xiaolei Zhang, Shijin Feng, Yong Zhao, Yihan Chen

Main category: cs.CV

TL;DR: A two-step deep learning method for classifying and segmenting tunnel cracks, combining DenseNet-169 for classification and DeepLabV3+ for segmentation, achieves high accuracy and efficiency, validated by superior experimental results.

DetailsMotivation: Tunnel lining cracks are critical for safety assessment; existing methods lack accuracy and efficiency. This study aims to improve crack detection using deep learning.

Method: Proposes a two-step approach: (1) DenseNet-169 for classifying tunnel images, (2) DeepLabV3+ for segmenting cracks, with visual explanations for model transparency.

Result: Classification accuracy: 92.23%, FPS: 39.80; Segmentation IoU: 57.01%, F1 score: 67.44%, outperforming other models.

Conclusion: The method enhances crack detection accuracy and efficiency, with visual explanations aiding model understanding, supporting tunnel health assessment.

Abstract: Tunnel lining crack is a crucial indicator of tunnels’ safety status. Aiming to classify and segment tunnel cracks with enhanced accuracy and efficiency, this study proposes a two-step deep learning-based method. An automatic tunnel image classification model is developed using the DenseNet-169 in the first step. The proposed crack segmentation model in the second step is based on the DeepLabV3+, whose internal logic is evaluated via a score-weighted visual explanation technique. The proposed method combines tunnel image classification and segmentation, so that images identified as containing cracks in the first step are segmented in the second step to improve detection accuracy and efficiency. The superior performance of the two-step method is validated by experiments. The results show that the accuracy and frames per second (FPS) of the tunnel crack classification model are 92.23% and 39.80, respectively, which are higher than those of other convolutional neural network (CNN)-based and Transformer-based models. Also, the intersection over union (IoU) and F1 score of the tunnel crack segmentation model are 57.01% and 67.44%, respectively, outperforming other state-of-the-art models. Moreover, the provided visual explanations in this study are conducive to understanding the “black box” of deep learning-based models. The developed two-step deep learning-based method integrating visual explanations provides a basis for fast and accurate quantitative assessment of tunnel health status.
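
A sketch of the two-step gating logic with off-the-shelf torchvision backbones; the untrained weights, the binary head, and the DeepLabV3 (rather than DeepLabV3+) segmenter are stand-ins, not the paper's trained models:

```python
# Sketch of the two-step gating logic with off-the-shelf torchvision backbones.
# The untrained weights, the binary head, and the DeepLabV3 (rather than
# DeepLabV3+) segmenter are stand-ins, not the paper's trained models.
import torch
from torchvision import models

classifier = models.densenet169(weights=None)
classifier.classifier = torch.nn.Linear(classifier.classifier.in_features, 2)
segmenter = models.segmentation.deeplabv3_resnet50(weights=None, num_classes=2)
classifier.eval(); segmenter.eval()

image = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    has_crack = classifier(image).argmax(1).item() == 1
    if has_crack:                                  # step 2 runs only on crack images
        mask = segmenter(image)["out"].argmax(1)   # per-pixel crack mask
```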

[143] Multi-Centre Validation of a Deep Learning Model for Scoliosis Assessment

Šimon Kubov, Simon Klíčník, Jakub Dandár, Zdeněk Straka, Karolína Kvaková, Daniel Kvak

Main category: cs.CV

TL;DR: Automated deep learning software (Carebot AI Bones) achieves expert-level accuracy in Cobb angle measurement for scoliosis, with strong agreement and correlation compared to radiologists.

DetailsMotivation: Manual Cobb angle measurement for scoliosis is time-consuming and prone to inter-observer variability, necessitating an automated solution.

Method: A retrospective, multi-center evaluation of the AI software on 103 radiographs, compared against measurements by two radiologists using Bland-Altman analysis, MAE, RMSE, Pearson correlation, and Cohen kappa.

Result: The AI showed high agreement with radiologists (MAE ~3.9 degrees, Pearson r ~0.9) and moderate to substantial Cohen kappa for severity grading (0.51-0.64).

Conclusion: The AI software reliably replicates expert measurements, offering potential to streamline scoliosis reporting and clinical workflows.

Abstract: Scoliosis affects roughly 2 to 4 percent of adolescents, and treatment decisions depend on precise Cobb angle measurement. Manual assessment is time-consuming and subject to inter-observer variation. We conducted a retrospective, multi-centre evaluation of a fully automated deep learning software (Carebot AI Bones, Spine Measurement functionality; Carebot s.r.o.) on 103 standing anteroposterior whole-spine radiographs collected from ten hospitals. Two musculoskeletal radiologists independently measured each study and served as reference readers. Agreement between the AI and each radiologist was assessed with Bland-Altman analysis, mean absolute error (MAE), root mean squared error (RMSE), Pearson correlation coefficient, and Cohen kappa for four-grade severity classification. Against Radiologist 1 the AI achieved an MAE of 3.89 degrees (RMSE 4.77 degrees) with a bias of 0.70 degrees and limits of agreement from −8.59 to +9.99 degrees. Against Radiologist 2 the AI achieved an MAE of 3.90 degrees (RMSE 5.68 degrees) with a bias of 2.14 degrees and limits from −8.23 to +12.50 degrees. Pearson correlations were r = 0.906 and r = 0.880 (inter-reader r = 0.928), while Cohen kappa for severity grading reached 0.51 and 0.64 (inter-reader kappa 0.59). These results demonstrate that the proposed software reproduces expert-level Cobb angle measurements and categorical grading across multiple centres, suggesting its utility for streamlining scoliosis reporting and triage in clinical workflows.
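
All of the reported agreement statistics are standard and easy to reproduce. The sketch below computes Bland-Altman bias and limits of agreement, MAE, RMSE, Pearson r, and Cohen kappa on made-up Cobb angles; the severity cut-offs are illustrative, not the study's:

```python
# The reported agreement statistics are all standard; a sketch on made-up Cobb
# angles (the severity cut-offs below are illustrative, not the study's).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

ai = np.array([12.0, 25.5, 41.0, 8.2, 33.7])     # AI Cobb angles (degrees)
rad = np.array([10.5, 27.0, 44.0, 9.0, 30.1])    # radiologist reference

diff = ai - rad
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))
mae = np.abs(diff).mean()
rmse = np.sqrt((diff ** 2).mean())
r, _ = pearsonr(ai, rad)

bins = np.array([10.0, 25.0, 40.0])              # illustrative grade boundaries
kappa = cohen_kappa_score(np.digitize(ai, bins), np.digitize(rad, bins))
print(f"bias={bias:.2f}, LoA=({loa[0]:.2f}, {loa[1]:.2f}), MAE={mae:.2f}, "
      f"RMSE={rmse:.2f}, r={r:.3f}, kappa={kappa:.2f}")
```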

[144] Analysis of Plant Nutrient Deficiencies Using Multi-Spectral Imaging and Optimized Segmentation Model

Ji-Yan Wu, Zheng Yong Poh, Anoop C. Patil, Bongsoo Park, Giovanni Volpe, Daisuke Urano

Main category: cs.CV

TL;DR: A deep learning framework using multispectral imaging and an enhanced YOLOv5 model with a transformer-based attention head improves nutrient deficiency detection in plant leaves, outperforming baseline YOLOv5 by 12% in Dice score and IoU.

DetailsMotivation: Accurate detection of nutrient deficiency in plant leaves is crucial for precision agriculture, enabling early intervention in fertilization and stress management.

Method: The study uses a deep learning framework with multispectral imaging and an enhanced YOLOv5 model featuring a transformer-based attention head for leaf anomaly segmentation.

Result: The proposed model outperforms baseline YOLOv5 by about 12% in Dice score and IoU, excelling in detecting symptoms like chlorosis and pigment accumulation.

Conclusion: Combining multispectral imaging with spectral-spatial feature learning shows promise for advancing plant phenotyping and precision agriculture.

Abstract: Accurate detection of nutrient deficiency in plant leaves is essential for precision agriculture, enabling early intervention in fertilization, disease, and stress management. This study presents a deep learning framework for leaf anomaly segmentation using multispectral imaging and an enhanced YOLOv5 model with a transformer-based attention head. The model is tailored for processing nine-channel multispectral input and uses self-attention mechanisms to better capture subtle, spatially-distributed symptoms. The plants in the experiments were grown under controlled nutrient stress conditions for evaluation. We carry out extensive experiments to benchmark the proposed model against the baseline YOLOv5. The results show that the proposed model significantly outperforms the baseline, with an average Dice score and IoU (Intersection over Union) improvement of about 12%. In particular, this model is effective in detecting challenging symptoms like chlorosis and pigment accumulation. These results highlight the promise of combining multi-spectral imaging with spectral-spatial feature learning for advancing plant phenotyping and precision agriculture.
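
For reference, the two reported segmentation metrics computed on binary masks (toy inputs):

```python
# The two reported segmentation metrics on binary masks (toy inputs).
import numpy as np

def dice(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return (inter + eps) / (np.logical_or(pred, gt).sum() + eps)

pred = np.random.rand(64, 64) > 0.5
gt = np.random.rand(64, 64) > 0.5
print(f"Dice={dice(pred, gt):.3f}, IoU={iou(pred, gt):.3f}")
```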

[145] Moodifier: MLLM-Enhanced Emotion-Driven Image Editing

Jiarong Ye, Sharon X. Huang

Main category: cs.CV

TL;DR: The paper introduces a system for emotion-driven image editing, combining a dataset (MoodArchive), a vision-language model (MoodifyCLIP), and an editing model (Moodifier) to translate emotions into visual changes while preserving content integrity.

DetailsMotivation: Emotion-driven image editing is challenging due to the abstract nature of emotions and their varied visual manifestations. The paper aims to bridge this gap for creative industries.

Method: The approach includes: 1) MoodArchive dataset with emotional annotations, 2) MoodifyCLIP model for translating emotions to visual attributes, and 3) Moodifier, a training-free editing model using MLLMs.

Result: Moodifier outperforms existing methods in emotional accuracy and content preservation, enabling precise emotional transformations across diverse domains.

Conclusion: The system links abstract emotions to visual changes, unlocking new possibilities for emotional content creation. Resources will be released publicly.

Abstract: Bridging emotions and visual content for emotion-driven image editing holds great potential in creative industries, yet precise manipulation remains challenging due to the abstract nature of emotions and their varied manifestations across different contexts. We tackle this challenge with an integrated approach consisting of three complementary components. First, we introduce MoodArchive, an 8M+ image dataset with detailed hierarchical emotional annotations generated by LLaVA and partially validated by human evaluators. Second, we develop MoodifyCLIP, a vision-language model fine-tuned on MoodArchive to translate abstract emotions into specific visual attributes. Third, we propose Moodifier, a training-free editing model leveraging MoodifyCLIP and multimodal large language models (MLLMs) to enable precise emotional transformations while preserving content integrity. Our system works across diverse domains such as character expressions, fashion design, jewelry, and home décor, enabling creators to quickly visualize emotional variations while preserving identity and structure. Extensive experimental evaluations show that Moodifier outperforms existing methods in both emotional accuracy and content preservation, providing contextually appropriate edits. By linking abstract emotions to concrete visual changes, our solution unlocks new possibilities for emotional content creation in real-world applications. We will release the MoodArchive dataset, MoodifyCLIP model, and make the Moodifier code and demo publicly available upon acceptance.

[146] QuantEIT: Ultra-Lightweight Quantum-Assisted Inference for Chest Electrical Impedance Tomography

Hao Fang, Sihao Teng, Hao Yu, Siyi Yuan, Huaiwu He, Zhe Liu, Yunjie Yang

Main category: cs.CV

TL;DR: QuantEIT, an ultra-lightweight quantum-assisted framework, improves EIT image reconstruction by reducing model complexity and parameters, outperforming conventional methods with minimal resources.

DetailsMotivation: EIT's ill-posed inverse problem challenges accurate image reconstruction, and existing DL methods are inefficient due to complex architectures.

Method: QuantEIT uses a QA-Net with parallel 2-qubit quantum circuits for latent representations and a linear layer for conductivity reconstruction, operating unsupervised.

Result: QuantEIT achieves superior accuracy with 0.2% of parameters, outperforming conventional methods in 2D/3D lung imaging, and is noise-robust.

Conclusion: QuantEIT is a scalable, efficient, and innovative quantum-assisted solution for EIT image reconstruction.

Abstract: Electrical Impedance Tomography (EIT) is a non-invasive, low-cost bedside imaging modality with high temporal resolution, making it suitable for bedside monitoring. However, its inherently ill-posed inverse problem poses significant challenges for accurate image reconstruction. Deep learning (DL)-based approaches have shown promise but often rely on complex network architectures with a large number of parameters, limiting efficiency and scalability. Here, we propose an Ultra-Lightweight Quantum-Assisted Inference (QuantEIT) framework for EIT image reconstruction. QuantEIT leverages a Quantum-Assisted Network (QA-Net), combining parallel 2-qubit quantum circuits to generate expressive latent representations that serve as implicit nonlinear priors, followed by a single linear layer for conductivity reconstruction. This design drastically reduces model complexity and parameter number. Uniquely, QuantEIT operates in an unsupervised, training-data-free manner and represents the first integration of quantum circuits into EIT image reconstruction. Extensive experiments on simulated and real-world 2D and 3D EIT lung imaging data demonstrate that QuantEIT outperforms conventional methods, achieving comparable or superior reconstruction accuracy using only 0.2% of the parameters, with enhanced robustness to noise.
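
A hedged sketch of the quantum-circuit prior idea: a parallel bank of parameterised 2-qubit circuits whose expectation values form a compact latent vector that a single linear layer could map to conductivity. PennyLane is used purely for illustration; QuantEIT's exact circuits and unsupervised fitting loop are not shown:

```python
# Hedged sketch of the quantum-circuit prior: a parallel bank of parameterised
# 2-qubit circuits whose expectation values form a compact latent vector for a
# linear conductivity head. PennyLane is used for illustration; QuantEIT's
# exact circuits and unsupervised fitting loop are not shown.
import numpy as np
import pennylane as qml

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def two_qubit_block(theta):
    qml.RY(theta[0], wires=0)
    qml.RY(theta[1], wires=1)
    qml.CNOT(wires=[0, 1])                         # entangle the pair
    return [qml.expval(qml.PauliZ(w)) for w in range(2)]

rng = np.random.default_rng(0)
params = rng.uniform(0, np.pi, size=(8, 2))        # 8 parallel 2-qubit circuits
latent = np.array([two_qubit_block(p) for p in params]).ravel()
print(latent.shape)                                # (16,) latent representation
```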

[147] Training-free Token Reduction for Vision Mamba

Qiankun Ma, Ziyao Zhang, Chi Su, Jie Chen, Zhen Song, Hairong Zheng, Wen Gao

Main category: cs.CV

TL;DR: Vision Mamba competes with ViTs but lacks efficient token reduction. MTR, a training-free framework, addresses this by evaluating token importance without attention mechanisms, reducing FLOPs by ~40% with minimal performance loss.

DetailsMotivation: Token reduction in Vision Mamba is underexplored, and direct ViT techniques fail due to Mamba's lack of attention mechanisms. A tailored solution is needed.

Method: Proposes MTR, a Mamba-aware token reduction framework using a simple importance score, requiring no training or extra parameters.

Result: MTR reduces FLOPs by ~40% on Vim-B with only a 1.6% drop in ImageNet performance, proving effective across tasks and backbones.

Conclusion: MTR efficiently compresses Vision Mamba models without retraining, enabling broader applications with minimal performance impact.

Abstract: Vision Mamba has emerged as a strong competitor to Vision Transformers (ViTs) due to its ability to efficiently capture long-range dependencies with linear computational complexity. However, token reduction, an effective compression technique in ViTs, has rarely been explored in Vision Mamba. Exploring Vision Mamba’s efficiency is essential for enabling broader applications. However, we find that directly applying existing token reduction techniques for ViTs to Vision Mamba leads to significant performance degradation. This is primarily because Mamba is a sequence model without attention mechanisms, whereas most token reduction techniques for ViTs rely on attention mechanisms for importance measurement and overlook the order of compressed tokens. In this paper, we investigate a Mamba structure-aware importance score to evaluate token importance in a simple and effective manner. Building on this score, we further propose MTR, a training-free Mamba Token Reduction framework. Without the need for training or additional tuning parameters, our method can be seamlessly integrated as a plug-and-play component across various Mamba models. Extensive experiments demonstrate that our approach significantly reduces computational workload while minimizing performance impact across various tasks and multiple backbones. Notably, MTR reduces FLOPs by approximately 40% on the Vim-B backbone, with only a 1.6% drop in ImageNet performance without retraining.
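
Order preservation is the key difference from attention-based pruning: after selecting the top-k tokens, their original sequence order must be restored before they re-enter the Mamba blocks. A sketch with token norm as a stand-in for MTR's structure-aware score:

```python
# Order-preserving, training-free token reduction: score tokens, keep the
# top-k, and restore their original sequence order (Mamba is order-sensitive).
# Token norm here stands in for MTR's structure-aware importance score.
import torch

def reduce_tokens(tokens, keep_ratio=0.6):
    scores = tokens.norm(dim=-1)                   # stand-in importance score
    k = int(tokens.shape[1] * keep_ratio)
    idx = scores.topk(k, dim=1).indices
    idx, _ = idx.sort(dim=1)                       # crucial: keep sequence order
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))

x = torch.randn(2, 196, 384)                       # [batch, tokens, dim]
print(reduce_tokens(x).shape)                      # torch.Size([2, 117, 384])
```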

[148] Foundation Models as Class-Incremental Learners for Dermatological Image Classification

Mohamed Elkhayat, Mohamed Mahmoud, Jamil Fayyad, Nourhan Bayasi

Main category: cs.CV

TL;DR: The paper evaluates frozen foundation models (FMs) for Class-Incremental Learning (CIL) in dermatology, proposing a lightweight MLP approach that outperforms existing methods and explores zero-training scenarios with prototypes.

DetailsMotivation: To leverage the rich representations of pretrained FMs for CIL in dermatology, addressing the gap in their application for incremental learning in medical domains.

Method: A frozen FM backbone is used with a lightweight MLP trained incrementally for each task; zero-training scenarios are explored using nearest mean classifiers with prototypes.

Result: The proposed approach achieves state-of-the-art performance without forgetting, and the prototype-based variant also yields competitive results.

Conclusion: Frozen FMs are highly effective for continual learning in dermatology, supporting their broader use in medical applications.

Abstract: Class-Incremental Learning (CIL) aims to learn new classes over time without forgetting previously acquired knowledge. The emergence of foundation models (FM) pretrained on large datasets presents new opportunities for CIL by offering rich, transferable representations. However, their potential for enabling incremental learning in dermatology remains largely unexplored. In this paper, we systematically evaluate frozen FMs pretrained on large-scale skin lesion datasets for CIL in dermatological disease classification. We propose a simple yet effective approach where the backbone remains frozen, and a lightweight MLP is trained incrementally for each task. This setup achieves state-of-the-art performance without forgetting, outperforming regularization, replay, and architecture-based methods. To further explore the capabilities of frozen FMs, we examine zero-training scenarios using nearest mean classifiers with prototypes derived from their embeddings. Through extensive ablation studies, we demonstrate that this prototype-based variant can also achieve competitive results. Our findings highlight the strength of frozen FMs for continual learning in dermatology and support their broader adoption in real-world medical applications. Our code and datasets are available here.
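
The prototype-based variant is almost a one-liner: class prototypes are mean frozen-FM embeddings, new classes are appended over time without retraining, and inference is nearest mean. A sketch with random embeddings standing in for frozen features:

```python
# Sketch of the zero-training variant: class prototypes are mean frozen-FM
# embeddings, new classes are appended over time without retraining, and
# inference is nearest mean. Random embeddings stand in for frozen features.
import torch
import torch.nn.functional as F

prototypes, labels = [], []

def add_class(embeddings, class_id):               # one incremental step
    prototypes.append(F.normalize(embeddings.mean(0), dim=-1))
    labels.append(class_id)

def classify(embedding):
    sims = torch.stack(prototypes) @ F.normalize(embedding, dim=-1)
    return labels[sims.argmax().item()]

add_class(torch.randn(50, 768), "melanoma")        # task 1
add_class(torch.randn(50, 768), "nevus")           # task 2, added later
print(classify(torch.randn(768)))
```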

[149] Unmasking Performance Gaps: A Comparative Study of Human Anonymization and Its Effects on Video Anomaly Detection

Sara Abdulaziz, Egor Bondarev

Main category: cs.CV

TL;DR: The paper analyzes how four anonymization techniques (blurring, masking, encryption, avatar replacement) affect anomaly detection performance on the UCF-Crime dataset, revealing model-specific sensitivities and trade-offs between privacy and utility.

DetailsMotivation: Address privacy concerns in anomaly detection due to sensitive human data collection while maintaining detection performance.

Method: Evaluate four anomaly detection methods (MGFN, UR-DMU, BN-WVAD, PEL4VAD) on anonymized UCF-Crime data using four obfuscation techniques.

Result: Anomaly detection remains viable under anonymization, with some models performing better under certain techniques (e.g., encryption, masking). Algorithm design and learning strategy influence performance.

Conclusion: The study highlights algorithm-specific sensitivities to anonymization and the trade-off between privacy and utility, providing insights for balancing privacy with detection demands.

Abstract: Advancements in deep learning have improved anomaly detection in surveillance videos, yet they raise urgent privacy concerns due to the collection of sensitive human data. In this paper, we present a comprehensive analysis of anomaly detection performance under four human anonymization techniques, including blurring, masking, encryption, and avatar replacement, applied to the UCF-Crime dataset. We evaluate four anomaly detection methods, MGFN, UR-DMU, BN-WVAD, and PEL4VAD, on the anonymized UCF-Crime to reveal how each method responds to different obfuscation techniques. Experimental results demonstrate that anomaly detection remains viable under anonymized data and is dependent on the algorithmic design and the learning strategy. For instance, under certain anonymization patterns, such as encryption and masking, some models inadvertently achieve higher AUC performance compared to raw data, due to the strong responsiveness of their algorithmic components to these noise patterns. These results highlight the algorithm-specific sensitivities to anonymization and emphasize the trade-off between preserving privacy and maintaining detection utility. Furthermore, we compare these conventional anonymization techniques with the emerging privacy-by-design solutions, highlighting an often overlooked trade-off between robust privacy protection and utility flexibility. Through comprehensive experiments and analyses, this study provides a compelling benchmark and insights into balancing human privacy with the demands of anomaly detection.

[150] C-DOG: Training-Free Multi-View Multi-Object Association in Dense Scenes Without Visual Feature via Connected δ-Overlap Graphs

Yung-Hong Sun, Ting-Hung Lin, Jiangang Chen, Hongrui Jiang, Yu Hen Hu

Main category: cs.CV

TL;DR: C-DOG is a training-free framework for robust multi-view multi-object association, combining graph modeling and epipolar geometry without relying on visual features.

DetailsMotivation: Existing methods fail with visually indistinguishable objects or noisy observations, necessitating a more robust solution.

Method: Uses connected delta-overlap graph modeling and epipolar geometry, with IQR filtering and 3D back-projection error for robustness.

Result: Outperforms geometry-based baselines in synthetic benchmarks, handling high object density and limited camera overlap.

Conclusion: C-DOG is effective for scalable 3D reconstruction in real-world scenarios.

Abstract: Multi-view multi-object association is a fundamental step in 3D reconstruction pipelines, enabling consistent grouping of object instances across multiple camera views. Existing methods often rely on appearance features or geometric constraints such as epipolar consistency. However, these approaches can fail when objects are visually indistinguishable or observations are corrupted by noise. We propose C-DOG, a training-free framework that serves as an intermediate module bridging object detection (or pose estimation) and 3D reconstruction, without relying on visual features. It combines connected delta-overlap graph modeling with epipolar geometry to robustly associate detections across views. Each 2D observation is represented as a graph node, with edges weighted by epipolar consistency. A delta-neighbor-overlap clustering step identifies strongly consistent groups while tolerating noise and partial connectivity. To further improve robustness, we incorporate Interquartile Range (IQR)-based filtering and a 3D back-projection error criterion to eliminate inconsistent observations. Extensive experiments on synthetic benchmarks demonstrate that C-DOG outperforms geometry-based baselines and remains robust under challenging conditions, including high object density, absence of visual features, and limited camera overlap, making it well-suited for scalable 3D reconstruction in real-world scenarios.
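
Two of C-DOG's ingredients are easy to sketch in isolation: an epipolar-consistency edge weight (point-to-epipolar-line distance under a fundamental matrix) and IQR-based outlier filtering. Values below are random stand-ins:

```python
# Two C-DOG ingredients in isolation: an epipolar-consistency edge weight
# (point-to-epipolar-line distance under a fundamental matrix F) and IQR-based
# outlier filtering. The matrix and points are random stand-ins.
import numpy as np

def epipolar_distance(x1, x2, F):
    line = F @ x1                                  # epipolar line of x1 in view 2
    return abs(x2 @ line) / np.hypot(line[0], line[1])

def iqr_filter(values, k=1.5):
    q1, q3 = np.percentile(values, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if lo <= v <= hi]

F = np.random.rand(3, 3)                           # stand-in fundamental matrix
x1 = np.array([100.0, 50.0, 1.0])                  # homogeneous point, view 1
x2 = np.array([110.0, 52.0, 1.0])                  # candidate match, view 2
print(epipolar_distance(x1, x2, F), iqr_filter([1.0, 1.2, 0.9, 15.0]))
```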

[151] Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

Shashanka Venkataramanan, Valentinos Pariza, Mohammadreza Salehi, Lukas Knobel, Spyros Gidaris, Elias Ramzi, Andrei Bursuc, Yuki M. Asano

Main category: cs.CV

TL;DR: Franca is the first fully open-source vision foundation model that outperforms proprietary models like DINOv2 and CLIP. It uses transparent training with public data and introduces innovations in clustering and positional disentanglement for better performance and efficiency.

DetailsMotivation: To create a high-performance, fully open-source vision foundation model that addresses limitations in current SSL clustering methods and positional biases in dense representations.

Method: Uses a transparent training pipeline with public data (ImageNet-21K, ReLAION-2B), introduces a multi-head clustering projector with nested Matryoshka representations, and proposes positional disentanglement to remove biases.

Result: Matches or surpasses proprietary models, achieves consistent gains on downstream benchmarks, and improves memory efficiency.

Conclusion: Franca sets a new standard for transparent, high-performance vision models, promoting reproducibility and generalizability in AI.

Abstract: We present Franca (pronounced Fran-ka): free one; the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models, e.g., DINOv2, CLIP, SigLIPv2, etc. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, enabling both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.

[152] Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Prasad Gabbur

Main category: cs.CV

TL;DR: Using a GMM kernel in DDIM improves sample quality, especially with fewer sampling steps, outperforming Gaussian kernels in FID and IS metrics.

DetailsMotivation: To enhance the quality of generated samples in DDIM by replacing Gaussian kernels with GMM kernels, leveraging moment matching for better performance.

Method: Proposed using a GMM as a reverse transition operator in DDIM, matching first and second order moments of DDPM forward marginals.

Result: Achieved better FID (6.94 vs 10.15) and IS (207.85 vs 196.73) on ImageNet 256x256 with 10 steps using GMM.

Conclusion: GMM kernels in DDIM significantly improve sample quality, especially with limited sampling steps.

Abstract: We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, and class-conditional models trained on ImageNet. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel.
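
To make the moment-matching idea concrete, here is a toy construction (not the paper's exact kernel) of a symmetric two-component 1D GMM whose mixture mean and variance match a given Gaussian's first and second moments.

```python
# Illustrative sketch: a symmetric two-component 1D GMM constrained to
# match a target mean and variance, as in first/second moment matching.
import numpy as np

def moment_matched_gmm(mu, var, delta):
    """Means, common std, and weights of a 2-component GMM with mixture
    mean `mu` and mixture variance `var` (requires delta**2 < var)."""
    assert delta**2 < var, "component spread must not exceed target variance"
    comp_var = var - delta**2           # Var = E[comp var] + Var[comp means]
    means = np.array([mu - delta, mu + delta])
    weights = np.array([0.5, 0.5])
    return means, np.sqrt(comp_var), weights

def sample(means, std, weights, n, rng=np.random.default_rng(0)):
    idx = rng.choice(len(means), size=n, p=weights)
    return rng.normal(means[idx], std)

m, s, w = moment_matched_gmm(mu=0.0, var=1.0, delta=0.6)
x = sample(m, s, w, 100_000)
print(x.mean(), x.var())  # ~0.0 and ~1.0
```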

[153] Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis

Main category: cs.CV

TL;DR: The paper proposes a method to improve CLIP’s zero-shot performance in domains like Remote Sensing and medical imagery by aligning distinct modalities with CLIP’s visual and textual embeddings, achieving significant gains without task-specific parameters or catastrophic forgetting.

DetailsMotivation: CLIP performs well in many tasks but struggles in domains like Remote Sensing and medical imagery due to distribution shifts and reliance on non-RGB modalities.

Method: A two-stage approach: fine-tuning CLIP with PAINT patching to address distribution shifts, then aligning RS modalities with CLIP’s embeddings via knowledge distillation.

Result: Significant performance improvements in RS imagery classification and cross-modal retrieval benchmarks, achieved without task-specific parameters or catastrophic forgetting.

Conclusion: The method effectively extends CLIP’s zero-shot capabilities to challenging domains while maintaining simplicity and avoiding additional training burdens.

Abstract: Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines, despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions from natural images, but also commonly rely on complementary modalities, beyond RGB, to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the aforementioned distribution shift, extends the zero-shot capabilities of CLIP and enriches CLIP’s shared embedding space with domain-specific knowledge. Initially, we robustly fine-tune CLIP according to the PAINT (Ilharco et al., 2022) patching protocol, in order to deal with the distribution shift. Building upon this foundation, we facilitate the cross-modal alignment of a RS modality encoder by distilling knowledge from the CLIP visual and textual encoders. We empirically show that both patching and cross-modal alignment translate to significant performance gains, across several RS imagery classification and cross-modal retrieval benchmark datasets. Notably, these enhancements are achieved without the reliance on textual descriptions, without introducing any task-specific parameters, without training from scratch and without catastrophic forgetting. We make our code implementation and weights for all experiments publicly available at https://github.com/Orion-AI-Lab/MindTheModalityGap.
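
The cross-modal alignment stage can be pictured as standard feature distillation. The sketch below aligns a remote-sensing modality encoder to frozen CLIP image embeddings of paired RGB views with a cosine loss; the model handles and the exact loss choice are placeholders, not the paper's precise recipe.

```python
# Hedged sketch of distilling a remote-sensing encoder toward frozen CLIP
# visual embeddings; rs_encoder / clip_visual are assumed callables.
import torch
import torch.nn.functional as F

def alignment_loss(rs_encoder, clip_visual, rs_batch, rgb_batch):
    with torch.no_grad():
        target = F.normalize(clip_visual(rgb_batch), dim=-1)   # frozen teacher
    student = F.normalize(rs_encoder(rs_batch), dim=-1)
    return 1.0 - (student * target).sum(dim=-1).mean()          # cosine distance
```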

[154] On Pre-training of Multimodal Language Models Customized for Chart Understanding

Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

Main category: cs.CV

TL;DR: The paper addresses the gap in MLLMs’ ability to extract numeric values from charts by proposing CHOPINLLM, a model improved through raw data alignment, textual representation, and data extraction steps.

DetailsMotivation: Existing MLLMs for chart comprehension neglect the discrepancy between natural image-caption data and chart-QA data, limiting numeric value extraction.

Method: The study explores training processes like raw data alignment, textual representation substitution, and data extraction-first QA to enhance MLLMs.

Result: CHOPINLLM outperforms in interpreting annotated and unannotated charts, supported by a new benchmark.

Conclusion: CHOPINLLM demonstrates robust chart comprehension, setting a new standard for MLLM evaluation in this domain.

Abstract: Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models’ capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs’ comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Replacing images with their textual representation randomly during end-to-end fine-tuning transfers the language reasoning capability to chart interpretation skills. (3) Requiring the model to first extract the underlying chart data and then answer the question during fine-tuning can further improve accuracy. Consequently, we introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension. CHOPINLLM effectively interprets various types of charts, including unannotated ones, while maintaining robust reasoning abilities. Furthermore, we establish a new benchmark to evaluate MLLMs’ understanding of different chart types across various comprehension levels. Experimental results show that CHOPINLLM exhibits strong performance in understanding both annotated and unannotated charts across a wide range of types.

[155] SecurePose: Automated Face Blurring and Human Movement Kinematics Extraction from Videos Recorded in Clinical Settings

Rishabh Bajpai, Bhooma Aravamuthan

Main category: cs.CV

TL;DR: SecurePose is an open-source tool for de-identifying patient videos while extracting kinematics, outperforming existing methods in speed and accuracy.

DetailsMotivation: Movement disorder diagnosis relies on video analysis, but current de-identification methods are manual, inconsistent, and may compromise kinematic data.

Method: SecurePose uses pose estimation (OpenPose) to extract kinematics, track individuals, and blur faces automatically.

Result: Validated on 116 children with cerebral palsy, SecurePose was 91.08% faster than manual blurring and matched its accuracy. Usability was confirmed by researchers.

Conclusion: SecurePose is a practical tool for privacy protection and accurate kinematics extraction in clinical settings.

Abstract: Movement disorder diagnosis often relies on expert evaluation of patient videos, but sharing these videos poses privacy risks. Current methods for de-identifying videos, such as blurring faces, are often manual, inconsistent, or inaccurate. Furthermore, these methods can compromise objective kinematic analysis - a crucial component of diagnosis. To address these challenges, we developed SecurePose, an open-source software that simultaneously provides reliable de-identification and automated kinematic extraction from videos recorded in clinic settings using smartphones/tablets. SecurePose utilizes pose estimation (using OpenPose) to extract full body kinematics, track individuals, identify the patient, and then accurately blur faces in the videos. We validated SecurePose on gait videos recorded in outpatient clinic visits of 116 children with cerebral palsy, assessing both the accuracy of its de-identification compared to the ground truth (manual blurring) and the reliability of the intermediate steps of kinematics extraction. Results demonstrate that SecurePose outperformed six existing methods in automated face detection and achieved comparable accuracy to robust manual blurring, but in significantly less time (91.08% faster). Ten experienced researchers also confirmed SecurePose’s usability via System Usability Scale scores. These findings validate SecurePose as a practical and effective tool for protecting patient privacy while enabling accurate kinematics extraction in clinical settings.
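
A minimal sketch of the kind of keypoint-driven face blurring SecurePose automates, assuming head keypoints already obtained from a pose estimator; the tracking and patient-identification steps are omitted.

```python
# Hedged sketch: blur a circular region around an estimated head keypoint.
# Radius and kernel size are illustrative assumptions.
import cv2
import numpy as np

def blur_face(frame, head_xy, radius=60):
    """Gaussian-blur a circular region centered on the head keypoint."""
    x, y = map(int, head_xy)
    h, w = frame.shape[:2]
    x0, x1 = max(0, x - radius), min(w, x + radius)
    y0, y1 = max(0, y - radius), min(h, y + radius)
    roi = frame[y0:y1, x0:x1]
    blurred = cv2.GaussianBlur(roi, (51, 51), 0)
    mask = np.zeros(roi.shape[:2], np.uint8)
    cv2.circle(mask, (x - x0, y - y0), radius, 255, -1)
    frame[y0:y1, x0:x1] = np.where(mask[..., None] == 255, blurred, roi)
    return frame
```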

[156] VAPO: Visibility-Aware Keypoint Localization for Efficient 6DoF Object Pose Estimation

Ruyi Lian, Yuewei Lin, Longin Jan Latecki, Haibin Ling

Main category: cs.CV

TL;DR: The paper introduces VAPO, a visibility-aware pose estimator, to improve 6DoF object pose estimation by addressing unreliable keypoint localization. It generates visibility labels and derives importance scores, integrating them with a state-of-the-art algorithm.

DetailsMotivation: Unreliable localization of invisible 3D keypoints in 2D images degrades 3D-2D correspondences for pose estimation. The lack of visibility labels in datasets motivates the need for a visibility-aware solution.

Method: Proposes generating binary visibility labels from object-level annotations and deriving real-valued importance using PageRank. Integrates these with positional encoding into VAPO for CAD-based and CAD-free settings.

Result: VAPO achieves state-of-the-art performance on benchmarks like Linemod, Linemod-Occlusion, and YCB-V.

Conclusion: Visibility-aware keypoint localization significantly improves pose estimation, with VAPO outperforming existing methods.

Abstract: Localizing predefined 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for instance-level 6DoF object pose estimation. However, unreliable localization results of invisible keypoints degrade the quality of correspondences. In this paper, we address this issue by localizing the important keypoints in terms of visibility. Since keypoint visibility information is currently missing in the dataset collection process, we propose an efficient way to generate binary visibility labels from available object-level annotations, for keypoints of both asymmetric objects and symmetric objects. We further derive real-valued visibility-aware importance from binary labels based on the PageRank algorithm. Taking advantage of the flexibility of our visibility-aware importance, we construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm, along with additional positional encoding. VAPO can work in both CAD-based and CAD-free settings. Extensive experiments are conducted on popular pose estimation benchmarks including Linemod, Linemod-Occlusion, and YCB-V, demonstrating that VAPO clearly achieves state-of-the-art performances. Project page: https://github.com/RuyiLian/VAPO.
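
As a hedged illustration of deriving real-valued importance from binary visibility labels with PageRank, the sketch below runs power iteration on a keypoint co-visibility graph; the actual graph construction in VAPO may differ.

```python
# Sketch: PageRank over a co-visibility graph built from binary labels.
import numpy as np

def pagerank(A, d=0.85, iters=100):
    """Power iteration on a nonnegative adjacency matrix A (keypoints x keypoints)."""
    n = A.shape[0]
    col_sums = A.sum(axis=0, keepdims=True)
    P = np.divide(A, col_sums, out=np.full_like(A, 1.0 / n), where=col_sums > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P @ r)
    return r / r.sum()

# visibility: (num_images, num_keypoints) binary labels
visibility = np.random.default_rng(0).integers(0, 2, size=(500, 8)).astype(float)
co_vis = visibility.T @ visibility      # how often keypoint pairs are visible together
np.fill_diagonal(co_vis, 0.0)
importance = pagerank(co_vis)           # real-valued visibility-aware importance
print(importance)
```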

[157] Computer-Vision-Enabled Worker Video Analysis for Motion Amount Quantification

Hari Iyer, Neel Macwan, Shenghan Guo, Heejin Jeong

Main category: cs.CV

TL;DR: A framework for tracking and quantifying worker limb motions using posture estimation and Hotelling’s T² statistic, showing correlation with workload and high accuracy in identifying ergonomic risks.

DetailsMotivation: Monitoring and assessing physical worker motions is challenging, but real-time video analysis can improve performance and safety.

Method: Uses joint position data from posture estimation and Hotelling’s T² statistic to quantify motion, with a Random Forest model for risk pattern identification.

Result: Positive correlation between motion warnings and workload (r=0.218, p=0.0024); model achieves 94% accuracy in identifying ergonomic risks.

Conclusion: The framework effectively monitors worker motions and identifies ergonomic risks, generalizing well across environments.

Abstract: The performance of physical workers is significantly influenced by the extent of their motions. However, monitoring and assessing these motions remains a challenge. Recent advancements have enabled in-situ video analysis for real-time observation of worker behaviors. This paper introduces a novel framework for tracking and quantifying upper and lower limb motions, issuing alerts when critical thresholds are reached. Using joint position data from posture estimation, the framework employs Hotelling’s $T^2$ statistic to quantify and monitor motion amounts. A significant positive correlation was noted between motion warnings and the overall NASA Task Load Index (TLX) workload rating (r = 0.218, p = 0.0024). A supervised Random Forest model trained on the collected motion data was benchmarked against multiple datasets including UCF Sports Action and UCF50, and was found to effectively generalize across environments, identifying ergonomic risk patterns with accuracies up to 94%.
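
The motion-quantification core is just Hotelling's $T^2$ on windows of joint-position features; a minimal sketch follows, in which the baseline statistics, window size, and warning threshold are illustrative assumptions.

```python
# Sketch: Hotelling's T^2 over a window of joint-position feature vectors.
import numpy as np

def hotelling_t2(window, mu, S_inv):
    """T^2 for the mean of a window against baseline mean mu and inverse covariance."""
    xbar = window.mean(axis=0)
    diff = xbar - mu
    return len(window) * diff @ S_inv @ diff

rng = np.random.default_rng(0)
baseline = rng.normal(size=(1000, 6))           # calm-period joint features
mu, S_inv = baseline.mean(axis=0), np.linalg.inv(np.cov(baseline.T))

window = rng.normal(loc=0.8, size=(30, 6))      # a high-motion window
t2 = hotelling_t2(window, mu, S_inv)
print("warn" if t2 > 30.0 else "ok", t2)        # threshold is an assumption
```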

[158] FBSDiff: Plug-and-Play Frequency Band Substitution of Diffusion Features for Highly Controllable Text-Driven Image Translation

Xiang Gao, Jiaying Liu

Main category: cs.CV

TL;DR: The paper introduces a plug-and-play method for text-driven image-to-image translation using pre-trained diffusion models, enhancing controllability without training or fine-tuning.

DetailsMotivation: Current text-to-image models lack controllability for practical use, prompting the need for methods to leverage reference images for better synthesis.

Method: Proposes frequency band substitution in DCT spectral space to dynamically control T2I generation with a reference image, enabling flexible adjustments via frequency band tuning.

Result: Achieves high-quality, versatile I2I translation with superior visual quality and controllability compared to existing methods.

Conclusion: The approach effectively bridges the gap in controllability for diffusion models, offering a practical solution for real-life content creation.

Abstract: Large-scale text-to-image diffusion models have been a revolutionary milestone in the evolution of generative AI and multimodal technology, allowing wonderful image generation with natural-language text prompt. However, the issue of lacking controllability of such models restricts their practical applicability for real-life content creation. Thus, attention has been focused on leveraging a reference image to control text-to-image synthesis, which is also regarded as manipulating (or editing) a reference image as per a text prompt, namely, text-driven image-to-image translation. This paper contributes a novel, concise, and efficient approach that adapts pre-trained large-scale text-to-image (T2I) diffusion model to the image-to-image (I2I) paradigm in a plug-and-play manner, realizing high-quality and versatile text-driven I2I translation without any model training, model fine-tuning, or online optimization process. To guide T2I generation with a reference image, we propose to decompose diverse guiding factors with different frequency bands of diffusion features in the DCT spectral space, and accordingly devise a novel frequency band substitution layer which realizes dynamic control of the reference image to the T2I generation result in a plug-and-play manner. We demonstrate that our method allows flexible control over both guiding factor and guiding intensity of the reference image simply by tuning the type and bandwidth of the substituted frequency band, respectively. Extensive qualitative and quantitative experiments verify superiority of our approach over related methods in I2I translation visual quality, versatility, and controllability. The code is publicly available at: https://github.com/XiangGao1102/FBSDiff.
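
A rough sketch of frequency-band substitution in DCT space: a low-frequency band of a reference map replaces that of a generated one. The square band shape and where this hooks into the diffusion features follow the paper only loosely.

```python
# Hedged sketch: substitute the low-frequency DCT band of a reference
# feature map into a generated one; cutoff controls guiding intensity.
import numpy as np
from scipy.fft import dctn, idctn

def substitute_band(gen, ref, cutoff):
    """Copy the low-frequency DCT band (indices < cutoff) from ref into gen."""
    G, R = dctn(gen, norm="ortho"), dctn(ref, norm="ortho")
    G[:cutoff, :cutoff] = R[:cutoff, :cutoff]   # band substitution
    return idctn(G, norm="ortho")

rng = np.random.default_rng(0)
gen, ref = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
mixed = substitute_band(gen, ref, cutoff=8)     # wider cutoff -> stronger guidance
```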

[159] Entropy Loss: An Interpretability Amplifier of 3D Object Detection Network for Intelligent Driving

Haobo Yang, Shiyan Zhang, Zhuoyi Yang, Xinyu Zhang, Jilong Guo, Zongyou Yang, Jun Li

Main category: cs.CV

TL;DR: The paper introduces ‘Entropy Loss,’ a novel loss function and training strategy to improve interpretability in intelligent driving perception models, achieving better accuracy in 3D object detection.

DetailsMotivation: Address the 'black box' issue in deep learning-based intelligent driving perception by enhancing interpretability.

Method: Develops Entropy Loss based on feature compression networks, modeling layer outputs as continuous random variables to quantify information changes.

Result: The method improves 3D object detection accuracy by up to 4.47% on the KITTI test set and speeds up training.

Conclusion: Entropy Loss effectively enhances interpretability and performance in intelligent driving perception models.

Abstract: With the increasing complexity of the traffic environment, the significance of safety perception in intelligent driving is intensifying. Traditional methods in the field of intelligent driving perception rely on deep learning, which suffers from limited interpretability, often described as a “black box.” This paper introduces a novel type of loss function, termed “Entropy Loss,” along with an innovative training strategy. Entropy Loss is formulated based on the functionality of feature compression networks within the perception model. Drawing inspiration from communication systems, the information transmission process in a feature compression network is expected to demonstrate steady changes in information volume and a continuous decrease in information entropy. By modeling network layer outputs as continuous random variables, we construct a probabilistic model that quantifies changes in information volume. Entropy Loss is then derived based on these expectations, guiding the update of network parameters to enhance network interpretability. Our experiments indicate that the Entropy Loss training strategy accelerates the training process. Utilizing the same 60 training epochs, the accuracy of 3D object detection models using Entropy Loss on the KITTI test set improved by up to 4.47% compared to models without Entropy Loss, underscoring the method’s efficacy. The implementation code is available at https://github.com/yhbcode000/Eloss-Interpretability.
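
To give a flavor of an entropy-style penalty, the sketch below assumes (purely for illustration) diagonal-Gaussian layer activations, so differential entropy has a closed form, and penalizes any entropy increase between consecutive compression layers; the paper's actual probabilistic model and loss differ in detail.

```python
# Rough sketch of an entropy-decrease penalty under a Gaussian assumption.
import torch

def gaussian_entropy(feats):
    """Mean per-dimension differential entropy of flattened activations,
    under a diagonal-Gaussian assumption: H = 0.5 * log(2*pi*e*var)."""
    var = feats.flatten(1).var(dim=0) + 1e-6
    return 0.5 * torch.log(2 * torch.pi * torch.e * var).mean()

def entropy_loss(layer_outputs):
    """Penalize any increase in entropy from one compression layer to the next."""
    Hs = [gaussian_entropy(f) for f in layer_outputs]
    return sum(torch.relu(Hs[i + 1] - Hs[i]) for i in range(len(Hs) - 1))
```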

[160] Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space

Junho Lee, Jeongwoo Shin, Seung Woo Ko, Seongsu Ha, Joonseok Lee

Main category: cs.CV

TL;DR: A novel method reduces the search space for frame sampling from O(T^N) to O(T) by selecting top N frames based on per-frame confidence, ensuring efficiency and performance.

DetailsMotivation: The vast search space of frame sampling (O(T^N)) makes existing methods computationally expensive, especially for large N.

Method: Proposes a semi-optimal policy that selects top N frames using independently estimated per-frame confidence, reducing complexity.

Result: The method efficiently approximates the optimal policy and maintains stable, high performance across datasets and model architectures.

Conclusion: The semi-optimal policy offers a computationally efficient and effective solution for frame sampling in video classification.

Abstract: Given a video with $T$ frames, frame sampling is a task to select $N \ll T$ frames, so as to maximize the performance of a fixed video classifier. Most existing methods, not just brute-force search, suffer from the vast search space of $\binom{T}{N}$, especially when $N$ gets large. To address this challenge, we introduce a novel perspective of reducing the search space from $O(T^N)$ to $O(T)$. Instead of exploring the entire $O(T^N)$ space, our proposed semi-optimal policy selects the top $N$ frames based on the independently estimated value of each frame using per-frame confidence, significantly reducing the computational complexity. We verify that our semi-optimal policy can efficiently approximate the optimal policy, particularly under practical settings. Additionally, through extensive experiments on various datasets and model architectures, we demonstrate that learning our semi-optimal policy ensures stable and high performance regardless of the size of $N$ and $T$.
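
The semi-optimal policy itself is simple to state in code: score every frame independently, then keep the top $N$. The scores below are random stand-ins for a learned per-frame confidence network.

```python
# Minimal sketch of the semi-optimal policy: O(T) top-N selection.
import numpy as np

def semi_optimal_sample(frame_scores, n):
    """Select indices of the N highest-confidence frames, in temporal order."""
    top = np.argpartition(frame_scores, -n)[-n:]
    return np.sort(top)

scores = np.random.default_rng(0).random(300)   # per-frame confidences, T=300
print(semi_optimal_sample(scores, n=8))
```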

[161] Horticultural Temporal Fruit Monitoring via 3D Instance Segmentation and Re-Identification using Colored Point Clouds

Daniel Fusaro, Federico Magistri, Jens Behley, Alberto Pretto, Cyrill Stachniss

Main category: cs.CV

TL;DR: A novel method for fruit instance segmentation and re-identification in 3D point clouds, outperforming existing techniques in dynamic orchard environments.

DetailsMotivation: Automated fruit monitoring is challenging due to variations in fruit appearance and orchard dynamics.

Method: Uses learning-based instance segmentation on point clouds, 3D sparse CNN for descriptors, and an attention-based matching network for temporal tracking.

Result: Outperforms existing methods in segmentation and re-identification for strawberries and apples.

Conclusion: Enables robust and precise fruit monitoring in complex orchards.

Abstract: Accurate and consistent fruit monitoring over time is a key step toward automated agricultural production systems. However, this task is inherently difficult due to variations in fruit size, shape, occlusion, orientation, and the dynamic nature of orchards where fruits may appear or disappear between observations. In this article, we propose a novel method for fruit instance segmentation and re-identification on 3D terrestrial point clouds collected over time. Our approach directly operates on dense colored point clouds, capturing fine-grained 3D spatial detail. We segment individual fruits using a learning-based instance segmentation method applied directly to the point cloud. For each segmented fruit, we extract a compact and discriminative descriptor using a 3D sparse convolutional neural network. To track fruits across different times, we introduce an attention-based matching network that associates fruits with their counterparts from previous sessions. Matching is performed using a probabilistic assignment scheme, selecting the most likely associations across time. We evaluate our approach on real-world datasets of strawberries and apples, demonstrating that it outperforms existing methods in both instance segmentation and temporal re-identification, enabling robust and precise fruit monitoring across complex and dynamic orchard environments.
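
As a simplified stand-in for the attention-based matching network, the sketch below turns descriptor similarities between two sessions into a one-to-one assignment with the Hungarian algorithm and a minimum-similarity gate; the paper's probabilistic assignment scheme is more elaborate.

```python
# Hedged sketch: associate fruit descriptors across sessions by maximizing
# total cosine similarity; descriptors are assumed L2-normalized.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_fruits(desc_prev, desc_curr, min_sim=0.5):
    sim = desc_prev @ desc_curr.T            # pairwise cosine similarities
    rows, cols = linear_sum_assignment(-sim) # maximize total similarity
    return [(i, j, sim[i, j]) for i, j in zip(rows, cols) if sim[i, j] >= min_sim]
```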

[162] Progressively Exploring and Exploiting Cost-Free Data to Break Fine-Grained Classification Barriers

Li-Jun Zhao, Zhen-Duo Chen, Zhi-Yuan Xue, Xin Luo, Xin-Shun Xu

Main category: cs.CV

TL;DR: The paper proposes a novel learning paradigm (EXP2) for fine-grained classification, addressing challenges like diverse features and dynamic semantics by enabling progressive learning during inference.

DetailsMotivation: Fine-grained classification faces challenges in real-world scenarios due to difficult data annotation and dynamic features, limiting traditional methods.

Method: The paper introduces EXP2, a strategy that explores and exploits useful inference data samples to optimize classifiers, leveraging cost-free data.

Result: Experimental results show the method’s effectiveness in improving fine-grained classification accuracy.

Conclusion: The proposed paradigm and EXP2 method offer a promising direction for real-world fine-grained classification, guiding future research.

Abstract: Current fine-grained classification research primarily focuses on fine-grained feature learning. However, in real-world scenarios, fine-grained data annotation is challenging, and the features and semantics are highly diverse and frequently changing. These issues create inherent barriers between traditional experimental settings and real-world applications, limiting the effectiveness of conventional fine-grained classification methods. Although some recent studies have provided potential solutions to these issues, most of them still rely on limited supervised information and thus fail to offer effective solutions. In this paper, based on theoretical analysis, we propose a novel learning paradigm to break the barriers in fine-grained classification. This paradigm enables the model to progressively learn during inference, thereby leveraging cost-free data to more accurately represent fine-grained categories and adapt to dynamic semantic changes. On this basis, an efficient EXPloring and EXPloiting strategy and method (EXP2) is designed. Within this strategy, useful inference data samples are explored according to class representations and exploited to optimize the classifiers. Experimental results demonstrate the general effectiveness of our method, providing guidance for future in-depth understanding and exploration of real-world fine-grained classification.

[163] EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation

Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang

Main category: cs.CV

TL;DR: EvolveNav improves LLM-based vision-language navigation by combining formalized CoT training and self-reflective post-training, enhancing reasoning and decision accuracy.

DetailsMotivation: Addressing the difficulty in mapping learning and unexplainable decisions in VLN tasks by leveraging LLMs' reasoning abilities.

Method: Two-stage framework: (1) Formalized CoT supervised fine-tuning, (2) Self-reflective post-training with self-enriched CoT labels and an auxiliary task.

Result: Superior performance on VLN benchmarks compared to previous LLM-based approaches.

Conclusion: EvolveNav effectively boosts navigational reasoning and interpretability in VLN tasks.

Abstract: Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs’ reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs’ training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model’s navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.

[164] SIC: Similarity-Based Interpretable Image Classification with Neural Networks

Tom Nuno Wolf, Emre Kavak, Fabian Bongratz, Christian Wachinger

Main category: cs.CV

TL;DR: SIC is an interpretable neural network using case-based reasoning for local and global explanations, achieving competitive accuracy while providing verified insights.

DetailsMotivation: Balancing accuracy and interpretability in deep learning for critical domains.

Method: SIC uses case-based reasoning, support vectors, and B-Cos transformations for coherent explanations.

Result: Competitive accuracy on fine-grained, multi-label, and pathology tasks with verified explanations.

Conclusion: SIC is effective for applications requiring both accuracy and interpretability.

Abstract: The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce SIC, an inherently interpretable neural network that provides local and global explanations of its decision-making process. Leveraging the concept of case-based reasoning, SIC extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input’s latent feature vector. We employ B-Cos transformations, which align model weights with inputs, to yield coherent pixel-level explanations in addition to global explanations of case-based reasoning. We evaluate SIC on three tasks: fine-grained classification on Stanford Dogs and FunnyBirds, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that SIC not only achieves competitive accuracy compared to state-of-the-art black-box and inherently interpretable models but also offers insightful explanations verified through practical evaluation on the FunnyBirds benchmark. Our theoretical analysis proves that these explanations fulfill established axioms for explanations. Our findings underscore SIC’s potential for applications where understanding model decisions is as critical as the decisions themselves.

[165] LLM-driven Medical Report Generation via Communication-efficient Heterogeneous Federated Learning

Haoxuan Che, Haibo Jin, Zhengrui Guo, Yi Lin, Cheng Jin, Hao Chen

Main category: cs.CV

TL;DR: FedMRG is a federated learning framework for privacy-preserving, multi-center development of LLM-driven medical report generation, addressing communication overhead and data heterogeneity.

DetailsMotivation: Centralizing medical image-report pairs for LLM-driven MRG is challenging due to privacy regulations, hindering model development.

Method: FedMRG employs low-rank factorization for efficient parameter updates and introduces client-aware contrastive learning and a dual-adapter mechanism to handle data heterogeneity.

Result: FedMRG demonstrates generalizability, adaptability, and communication efficiency in generating clinically accurate reports.

Conclusion: FedMRG effectively leverages multi-center data for LLM-driven MRG while maintaining privacy and efficiency.

Abstract: LLMs have demonstrated significant potential in Medical Report Generation (MRG), yet their development requires large amounts of medical image-report pairs, which are commonly scattered across multiple centers. Centralizing these data is exceptionally challenging due to privacy regulations, thereby impeding model development and broader adoption of LLM-driven MRG models. To address this challenge, we present FedMRG, the first framework that leverages Federated Learning (FL) to enable privacy-preserving, multi-center development of LLM-driven MRG models, specifically designed to overcome the critical challenge of communication-efficient LLM training under multi-modal data heterogeneity. To start with, our framework tackles the fundamental challenge of communication overhead in FL-LLM tuning by employing low-rank factorization to efficiently decompose parameter updates, significantly reducing gradient transmission costs and making LLM-driven MRG feasible in bandwidth-constrained FL settings. Furthermore, we observed the dual heterogeneity in MRG under the FL scenario: varying image characteristics across medical centers, as well as diverse reporting styles and terminology preferences. To address this, we further enhance FedMRG with (1) client-aware contrastive learning in the MRG encoder, coupled with diagnosis-driven prompts, which capture both globally generalizable and locally distinctive features while maintaining diagnostic accuracy; and (2) a dual-adapter mutual boosting mechanism in the MRG decoder that harmonizes generic and specialized adapters to address variations in reporting styles and terminology. Through extensive evaluation of our established FL-MRG benchmark, we demonstrate the generalizability and adaptability of FedMRG, underscoring its potential in harnessing multi-center data and generating clinically accurate reports while maintaining communication efficiency.
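
The communication-saving idea is the familiar low-rank decomposition of parameter updates: clients transmit two small factors instead of a full weight delta. Below is a generic LoRA-style module in that spirit, not FedMRG's exact implementation.

```python
# Illustrative sketch: only the small factors A and B are uploaded each
# federated round, not the full fine-tuned weight matrix.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T   # W x + B A x

    def comm_payload(self):
        """What a client would upload each round: just the two factors."""
        return {"A": self.A.detach(), "B": self.B.detach()}
```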

[166] Accelerating Diffusion Transformer via Error-Optimized Cache

Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Xingyu Zhu, Yanbin Hao

Main category: cs.CV

TL;DR: The paper introduces Error-Optimized Cache (EOC) to reduce sampling time and improve content quality in Diffusion Transformer (DiT) by minimizing caching-induced errors.

DetailsMotivation: Existing caching methods in DiT reduce sampling time but degrade content quality due to unaddressed caching errors.

Method: EOC improves caching by (1) extracting prior knowledge, (2) judging cache optimization needs, and (3) reducing caching errors.

Result: EOC significantly reduces error accumulation, improving FID scores on ImageNet at various caching levels (e.g., 28.8% improvement at 75% caching).

Conclusion: EOC effectively balances sampling speed and content quality, outperforming existing caching methods.

Abstract: Diffusion Transformer (DiT) is a crucial method for content generation, but its sampling is time-consuming. Many studies have attempted to use caching to reduce the time consumption of sampling. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next, but they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality when caching intensity increases. To solve this problem, we propose the Error-Optimized Cache (EOC). This method introduces three key improvements: (1) Prior knowledge extraction: extract and process the caching differences; (2) A judgment method for cache optimization: determine whether certain caching steps need to be optimized; (3) Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching, especially excessive caching. On the ImageNet dataset, without substantially increasing the computational load, this method improves the FID of the generated images when the rule-based model FORA has a caching level of 75%, 50%, and 25%, and the training-based model Learning-to-cache has a caching level of 22%. Specifically, the FID values change from 30.454 to 21.690 (28.8%), from 6.857 to 5.821 (15.1%), from 3.870 to 3.692 (4.6%), and from 3.539 to 3.451 (2.5%) respectively. Code is available at https://github.com/qiujx0520/EOC_MM2025.git.

[167] CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation

Xiao Lin, Yun Peng, Liuyi Wang, Xianyou Zhong, Minghao Zhu, Jingwei Yang, Yi Feng, Chengju Liu, Qijun Chen

Main category: cs.CV

TL;DR: CleanPose integrates causal learning and knowledge distillation to improve category-level object pose estimation by addressing spurious correlations and enhancing generalization.

DetailsMotivation: Existing methods suffer from spurious correlations due to unclean confounders, limiting performance on novel instances with variations.

Method: Proposes CleanPose with a causal inference module (front-door adjustment) and residual-based knowledge distillation.

Result: Outperforms state-of-the-art methods on benchmarks (REAL275, CAMERA25, HouseCat6D).

Conclusion: CleanPose effectively mitigates spurious correlations and improves generalization in category-level pose estimation.

Abstract: Category-level object pose estimation aims to recover the rotation, translation and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show they suffer from spurious correlations arising from “unclean” confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach integrating causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be available at https://github.com/chrislin0621/CleanPose.

[168] Accelerating Diffusion Transformer via Gradient-Optimized Cache

Junxiang Qiu, Lin Liu, Shuo Wang, Jinda Lu, Kezhou Chen, Yanbin Hao

Main category: cs.CV

TL;DR: The paper introduces Gradient-Optimized Cache (GOC) to improve diffusion transformer sampling by addressing error accumulation and dynamic perturbation patterns in feature caching.

DetailsMotivation: Feature caching accelerates DiT sampling but suffers from error accumulation and suboptimal error correction due to neglected dynamic patterns.

Method: GOC uses cached gradient propagation and inflection-aware optimization to compensate for errors and align gradient updates with critical phases.

Result: GOC achieves higher IS (216.28) and lower FID (3.907) with 50% cached blocks, outperforming baseline DiT.

Conclusion: GOC offers a robust trade-off between efficiency and quality, adaptable to various cache ratios.

Abstract: Feature caching has emerged as an effective strategy to accelerate diffusion transformer (DiT) sampling through temporal feature reuse. It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. These gradients are weighted and propagated to subsequent steps, directly compensating for the approximation errors introduced by caching. (2) Inflection-Aware Optimization: Through statistical analysis of feature variation patterns, we identify critical inflection points where the denoising trajectory changes direction. By aligning gradient updates with these detected phases, we prevent conflicting gradient directions during error correction. Extensive evaluations on ImageNet demonstrate GOC’s superior trade-off between efficiency and quality. With 50% cached blocks, GOC achieves IS 216.28 (26.3% higher) and FID 3.907 (43% lower) compared to baseline DiT, while maintaining identical computational costs. These improvements persist across various cache ratios, demonstrating robust adaptability to different acceleration requirements. Code is available at https://github.com/qiujx0520/GOC_ICCV2025.git.
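
A toy sketch of cached-gradient compensation: when a block is recomputed, store its difference from the cached feature and use a weighted version of that difference to correct later cached reuses. The weighting and inflection-aware scheduling of GOC are simplified away.

```python
# Hedged sketch of a per-block cache with gradient-style error compensation.
import torch

class GradientCache:
    def __init__(self, weight=0.5):
        self.feat, self.delta, self.weight = None, None, weight

    def recompute(self, fresh):
        """Store a freshly computed feature and the observed caching error."""
        if self.feat is not None:
            self.delta = fresh - self.feat      # direction of the caching error
        self.feat = fresh
        return fresh

    def reuse(self):
        """Return the cached feature, nudged along the last observed error."""
        if self.delta is None:
            return self.feat
        return self.feat + self.weight * self.delta  # propagate correction
```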

[169] Cycle-Consistent Multi-Graph Matching for Self-Supervised Annotation of C.Elegans

Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter, Bogdan Savchynskyy, Dagmar Kainmueller

Main category: cs.CV

TL;DR: A novel unsupervised method for multi-graph matching using Gaussian-distributed keypoint features, achieving state-of-the-art accuracy without ground truth.

DetailsMotivation: To enable unsupervised semantic cell annotation in 3D microscopy images, overcoming the bottleneck of requiring ground truth data.

Method: Uses cycle consistency as a self-supervised loss and Bayesian Optimization for Gaussian parameter determination, scaling efficiently to large datasets.

Result: Achieves accuracy comparable to supervised methods and creates the first unsupervised atlas of C. elegans cell nuclei.

Conclusion: The approach enables unsupervised cell-level atlas construction for model organisms, potentially advancing biomedical studies.

Abstract: In this work we present a novel approach for unsupervised multi-graph matching, which applies to problems for which a Gaussian distribution of keypoint features can be assumed. We leverage cycle consistency as loss for self-supervised learning, and determine Gaussian parameters through Bayesian Optimization, yielding a highly efficient approach that scales to large datasets. Our fully unsupervised approach enables us to reach the accuracy of state-of-the-art supervised methodology for the biomedical use case of semantic cell annotation in 3D microscopy images of the worm C. elegans. To this end, our approach yields the first unsupervised atlas of C. elegans, i.e. a model of the joint distribution of all of its cell nuclei, without the need for any ground truth cell annotation. This advancement enables highly efficient semantic annotation of cells in large microscopy datasets, overcoming a current key bottleneck. Beyond C. elegans, our approach offers fully unsupervised construction of cell-level atlases for any model organism with a stereotyped body plan down to the level of unique semantic cell labels, and thus has the potential to catalyze biomedical studies in a range of other species.

[170] Consistency Trajectory Matching for One-Step Generative Super-Resolution

Weiyi You, Mingyang Zhang, Leheng Zhang, Xingyu Zhou, Kexuan Shi, Shuhang Gu

Main category: cs.CV

TL;DR: CTMSR is a distillation-free method for super-resolution (SR) that generates high-quality results in one step, avoiding the limitations of teacher-student models.

DetailsMotivation: Current SR methods using diffusion models are slow and rely on costly distillation techniques, which limit performance and increase training costs.

Method: CTMSR uses Probability Flow ODE trajectories and Consistency Training to directly map low-resolution images to high-resolution ones in one step, enhanced by a Distribution Trajectory Matching loss.

Result: The method achieves comparable or superior performance on synthetic and real datasets with minimal inference latency.

Conclusion: CTMSR offers an efficient, high-quality alternative to diffusion-based SR methods without relying on pre-trained models or distillation.

Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

[171] Hands-On: Segmenting Individual Signs from Continuous Sequences

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: A transformer-based model for continuous sign language segmentation, using BIO tagging and HaMeR hand features, achieves state-of-the-art results.

DetailsMotivation: Addressing the challenge of continuous sign language segmentation for translation and annotation.

Method: Transformer-based architecture with BIO tagging, HaMeR hand features, and 3D Angles.

Result: State-of-the-art on DGS Corpus; features outperform on BSLCorpus.

Conclusion: Proposed method effectively models temporal dynamics and improves segmentation.

Abstract: This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
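
Framing segmentation as BIO sequence labeling means sign boundaries fall out of the tags directly; a small decoder from per-frame tags to (start, end) spans is sketched below.

```python
# Sketch: decode per-frame BIO tags into sign segments (frame index spans).
def bio_to_segments(tags):
    """Turn a list like ['O','B','I','I','O','B','I'] into (start, end) spans."""
    segments, start = [], None
    for i, t in enumerate(tags):
        if t == "B":                       # a new sign begins
            if start is not None:
                segments.append((start, i - 1))
            start = i
        elif t == "O" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(tags) - 1))
    return segments

print(bio_to_segments(["O", "B", "I", "I", "O", "B", "I"]))  # [(1, 3), (5, 6)]
```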

[172] CDUPatch: Color-Driven Universal Adversarial Patch Attack for Dual-Modal Visible-Infrared Detectors

Jiahuan Long, Wen Yao, Tingsong Jiang, Chao Ma

Main category: cs.CV

TL;DR: CDUPatch is a universal cross-modal adversarial patch attack for visible-infrared object detectors, improving effectiveness across scales, views, and scenarios by leveraging color-to-thermal mapping and multi-scale strategies.

DetailsMotivation: Existing dual-modal adversarial patches lack effectiveness in diverse physical scenarios, prompting the need for a more robust solution.

Method: Uses an RGB-to-infrared adapter for unified optimization, learns optimal color distribution for thermal manipulation, and employs multi-scale clipping and dataset augmentation (MSDrone).

Result: Outperforms existing attacks in digital tests and shows strong transferability in physical scenarios.

Conclusion: CDUPatch enhances adversarial patch robustness for dual-modal detectors, validated across benchmarks and real-world conditions.

Abstract: Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.

[173] BeetleVerse: A Study on Taxonomic Classification of Ground Beetles

S M Rayeed, Alyson East, Samuel Stevens, Sydne Record, Charles V Stewart

Main category: cs.CV

TL;DR: The paper evaluates vision models for automated taxonomic classification of ground beetles, achieving high accuracy with a Vision and Language Transformer. It also explores sample efficiency and domain adaptation challenges.

DetailsMotivation: Ground beetles are underutilized for biodiversity monitoring due to manual taxonomic challenges. Automating classification can enable widespread use.

Method: 12 vision models were tested on four datasets (230 genera, 1769 species), including lab and field images. Focus on sample efficiency and domain adaptation.

Result: Best model achieved 97% genus and 94% species accuracy. Sample efficiency improved (50% less data), but domain adaptation from lab to field images was challenging.

Conclusion: The study advances automated beetle classification and highlights challenges in cross-domain adaptation, paving the way for large-scale ecological applications.

Abstract: Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. However, they are currently an underutilized resource due to the manual effort required by taxonomic experts to perform challenging species differentiations based on subtle morphological differences, precluding widespread applications. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets spanning over 230 genera and 1769 species, with images ranging from controlled laboratory settings to challenging field-collected (in-situ) photographs. We further explore taxonomic classification in two important real-world contexts: sample efficiency and domain adaptation. Our results show that the Vision and Language Transformer combined with an MLP head is the best performing model, with 97% accuracy at genus and 94% at species level. Sample efficiency analysis shows that we can reduce train data requirements by up to 50% with minimal compromise in performance. The domain adaptation experiments reveal significant challenges when transferring models from lab to in-situ images, highlighting a critical domain gap. Overall, our study lays a foundation for large-scale automated taxonomic classification of beetles, and beyond that, advances sample-efficient learning and cross-domain adaptation for diverse long-tailed ecological datasets.

[174] PosePilot: Steering Camera Pose for Generative World Models with Self-supervised Depth

Bu Jin, Weize Li, Baihan Yang, Zhenxin Zhu, Junpeng Jiang, Huan-ang Gao, Haiyang Sun, Kun Zhan, Hengtong Hu, Xueyang Zhang, Peng Jia, Hao Zhao

Main category: cs.CV

TL;DR: PosePilot enhances camera pose control in generative world models for autonomous driving by leveraging self-supervised depth and pose estimation, improving viewpoint synthesis and motion reasoning.

DetailsMotivation: Precise camera pose control is critical for accurate viewpoint transformation and realistic scene dynamics in autonomous driving systems, but current methods lack flexibility and precision.

Method: PosePilot integrates self-supervised depth and pose estimation, uses pose-aware frame warping with photometric loss, and refines pose estimation with reverse warping and pose regression loss.

Result: Experiments show PosePilot improves structural understanding and motion reasoning in diffusion-based and auto-regressive world models, setting a new benchmark for pose controllability.

Conclusion: PosePilot advances camera pose control in generative world models, enabling physically consistent and reliable viewpoint synthesis for autonomous driving.

Abstract: Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.
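
The pose-aware warping loss can be sketched as classic self-supervised depth/pose photometric consistency: back-project the current frame with predicted depth, reproject into the source view with the predicted relative pose, sample, and compare. The pinhole model, tensor layouts, and L1 penalty below are assumptions, not PosePilot's exact formulation.

```python
# Hedged sketch of a photometric warping loss with predicted depth and pose.
import torch
import torch.nn.functional as F

def photometric_loss(curr, prev, depth, pose, K, K_inv):
    """L1 photometric loss after pose-aware warping.
    curr/prev: (B,3,H,W), depth: (B,1,H,W), pose: (B,4,4), K/K_inv: (B,3,3)."""
    B, _, H, W = curr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()         # (3,H,W)
    cam = (K_inv @ pix.reshape(3, -1)) * depth.reshape(B, 1, -1)        # back-project
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)            # homogeneous
    proj = K @ (pose @ cam_h)[:, :3]                                    # into prev view
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], -1) * 2 - 1
    warped = F.grid_sample(prev, grid.reshape(B, H, W, 2), align_corners=True)
    return (warped - curr).abs().mean()
```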

[175] TextDiffuser-RL: Efficient and Robust Text Layout Optimization for High-Fidelity Text-to-Image Synthesis

Kazi Mahathir Rahman, Showrin Rahman, Sharmin Sultana Srishty

Main category: cs.CV

TL;DR: A novel two-stage pipeline using RL for efficient text layout generation and diffusion-based image synthesis, outperforming TextDiffuser-2 in speed and flexibility while maintaining quality.

DetailsMotivation: Existing text-to-image methods like TextDiffuser-2 are resource-intensive and inefficient on CPUs/GPUs.

Method: Integrates RL for optimized text layout generation and a diffusion model for image synthesis.

Result: Achieves faster runtime (97.64% faster), reduced memory (2MB), and maintains/surpasses TextDiffuser-2’s quality.

Conclusion: The proposed framework is efficient, flexible, and high-quality for text-embedded image generation.

Abstract: Text-embedded image generation plays a critical role in industries such as graphic design, advertising, and digital content creation. Text-to-Image generation methods leveraging diffusion models, such as TextDiffuser-2, have demonstrated promising results in producing images with embedded text. TextDiffuser-2 effectively generates bounding box layouts that guide the rendering of visual text, achieving high fidelity and coherence. However, existing approaches often rely on resource-intensive processes and are limited in their ability to run efficiently on both CPU and GPU platforms. To address these challenges, we propose a novel two-stage pipeline that integrates reinforcement learning (RL) for rapid and optimized text layout generation with a diffusion-based image synthesis model. Our RL-based approach significantly accelerates the bounding box prediction step while reducing overlaps, allowing the system to run efficiently on both CPUs and GPUs. Extensive evaluations demonstrate that our framework maintains or surpasses TextDiffuser-2’s quality in text placement and image synthesis, with markedly faster runtime and increased flexibility. Our approach has been evaluated on the MARIOEval benchmark, achieving OCR and CLIPScore metrics close to state-of-the-art models, while being 97.64% faster and requiring only 2MB of memory to run.
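
As a toy example of the kind of layout objective an RL policy could optimize, the sketch below rewards box sets with little pairwise overlap; the reward shaping is an assumption, not the paper's exact formulation.

```python
# Toy sketch: a layout reward that penalizes pairwise box overlap (IoU).
def iou(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix = max(0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0, min(ay1, by1) - max(ay0, by0))
    inter = ix * iy
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union else 0.0

def layout_reward(boxes):
    """Fewer overlapping text boxes -> higher reward."""
    overlap = sum(iou(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:])
    return -overlap

print(layout_reward([(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]))
```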

[176] FDSG: Forecasting Dynamic Scene Graphs

Yi Yang, Yuren Cong, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang

Main category: cs.CV

TL;DR: FDSG predicts future entity labels, bounding boxes, and relationships in videos, outperforming existing methods in dynamic scene graph generation.

DetailsMotivation: Existing methods lack explicit modeling of temporal dynamics or predict only relationships, limiting video scene understanding.

Method: FDSG uses query decomposition and neural stochastic differential equations for dynamics, plus a temporal aggregation module for refining predictions.

Result: FDSG outperforms state-of-the-art methods on dynamic scene graph generation, anticipation, and forecasting tasks.

Conclusion: FDSG advances video scene understanding by predicting full future scene graphs, with code to be released.

Abstract: Dynamic scene graph generation extends scene graph generation from images to videos by modeling entity relationships and their temporal evolution. However, existing methods either generate scene graphs from observed frames without explicitly modeling temporal dynamics, or predict only relationships while assuming static entity labels and locations. These limitations hinder effective extrapolation of both entity and relationship dynamics, restricting video scene understanding. We propose Forecasting Dynamic Scene Graphs (FDSG), a novel framework that predicts future entity labels, bounding boxes, and relationships for unobserved frames, while also generating scene graphs for observed frames. Our scene graph forecast module leverages query decomposition and neural stochastic differential equations to model entity and relationship dynamics. A temporal aggregation module further refines predictions by integrating forecasted and observed information via cross-attention. To benchmark FDSG, we introduce Scene Graph Forecasting, a new task for full future scene graph prediction. Experiments on Action Genome show that FDSG outperforms state-of-the-art methods on dynamic scene graph generation, scene graph anticipation, and scene graph forecasting. Code will be released upon publication.
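
To make the neural-SDE forecasting idea concrete, here is a minimal PyTorch sketch that rolls query embeddings forward with Euler-Maruyama steps; the network sizes, step size, and rollout length are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class QuerySDE(nn.Module):
    """Euler-Maruyama rollout of a learned SDE over scene-graph query embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        self.diffusion = nn.Sequential(nn.Linear(dim, dim), nn.Softplus())

    def forward(self, q, steps, dt=0.1):
        # q: (num_queries, dim) -- embeddings from the last observed frame
        for _ in range(steps):
            noise = torch.randn_like(q) * (dt ** 0.5)
            q = q + self.drift(q) * dt + self.diffusion(q) * noise
        return q  # forecasted queries for an unobserved future frame
```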

[177] VFaith: Do Large Multimodal Models Really Reason on Seen Images Rather than Previous Memories?

Jiachen Yu, Yufei Zhan, Ziheng Wu, Yousong Zhu, Jinqiao Wang, Minghui Qiu

Main category: cs.CV

TL;DR: The paper introduces a method to evaluate the visual reasoning faithfulness of MLLMs by creating a benchmark (VFaith-Bench) and an editing pipeline to alter visual cues.

DetailsMotivation: Understanding why long CoT enhances MLLMs' problem-solving and quantifying the role of visual cues in reasoning.

Method: Developed an automatic editing pipeline (GPT-Image-1) to modify visual cues and created VFaith-Bench with 755 entries for testing.

Result: The benchmark and pipeline reveal the relationship between MLLMs’ reasoning and visual perception through accuracy differences.

Conclusion: VFaith-Bench provides insights into MLLMs’ visual reasoning capabilities and their dependence on visual faithfulness.

Abstract: Extensive recent work has demonstrated that introducing long CoT can effectively enhance the ability of MLLMs to solve complex problems. However, the reasons for the effectiveness of such paradigms remain unclear. It is challenging to quantify how much the model’s extraction of visual cues, as opposed to its subsequent reasoning during inference, contributes to the performance improvements. Therefore, evaluating the faithfulness of MLLMs’ reasoning to visual information is crucial. To address this issue, we first present a cue-driven automatic and controllable editing pipeline with the help of GPT-Image-1. It enables the automatic and precise editing of specific visual cues based on the instruction. Furthermore, we introduce VFaith-Bench, the first benchmark to evaluate MLLMs’ visual reasoning capabilities and analyze the source of such capabilities with an emphasis on visual faithfulness. Using the designed pipeline, we constructed comparative question-answer pairs by altering the visual cues in images that are crucial for solving the original reasoning problem, thereby changing the question’s answer. By testing similar questions with images that have different details, the average accuracy reflects the model’s visual reasoning ability, while the difference in accuracy before and after editing the test set images effectively reveals the relationship between the model’s reasoning ability and visual perception. We further designed specific metrics to expose this relationship. VFaith-Bench includes 755 entries divided into five distinct subsets, along with an additional human-labeled perception task. We conducted in-depth testing and analysis of existing mainstream flagship models and prominent open-source model series/reasoning models on VFaith-Bench, further investigating the underlying factors of their reasoning capabilities.
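
The benchmark's core signal, the accuracy difference before and after cue editing, reduces to a simple computation. A sketch, assuming parallel lists of predictions and gold answers:

```python
def faithfulness_gap(preds_orig, preds_edited, gold_orig, gold_edited):
    """Accuracy before vs. after cue editing. A large drop on edited images
    suggests answers driven by memorized priors rather than the seen image."""
    acc = lambda p, g: sum(x == y for x, y in zip(p, g)) / len(g)
    a0, a1 = acc(preds_orig, gold_orig), acc(preds_edited, gold_edited)
    return {"acc_original": a0, "acc_edited": a1, "gap": a0 - a1}
```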

[178] ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

ZongHan Hsieh, Tzer-Jen Wei, ShengJing Yang

Main category: cs.CV

TL;DR: ZonUI-3B is a lightweight Vision-Language Model trained on a single GPU, achieving performance comparable to larger models on GUI grounding tasks through innovations like cross-platform datasets, two-stage fine-tuning, and data redundancy reduction.

DetailsMotivation: Addressing data scarcity in high-resolution desktop GUI environments and improving model adaptability with limited computational resources.

Method: Combines cross-platform datasets, employs a two-stage fine-tuning strategy (initial cross-platform training followed by specialized fine-tuning), and uses data curation to reduce redundancy.

Result: Achieves 84.9% accuracy on ScreenSpot and 86.4% on ScreenSpot-v2, outperforming prior models under 4B parameters.

Conclusion: ZonUI-3B demonstrates that balanced sampling and two-stage fine-tuning enhance robustness, especially in high-resolution scenarios, while being computationally efficient.

Abstract: In this paper, we present ZonUI-3B, a lightweight Vision-Language Model (VLM) that can be fully trained on a single consumer-grade GPU (RTX 4090) while delivering performance comparable to significantly larger models on GUI grounding tasks. The model incorporates several key innovations: (i) a combined cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks, including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights ZonUI-3B’s exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. ZonUI-3B is available at: https://github.com/Han1018/ZonUI-3B

[179] SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Liangbin Xie, Yu Li, Shian Du, Menghan Xia, Xintao Wang, Fanghua Yu, Ziyan Chen, Pengfei Wan, Jiantao Zhou, Chao Dong

Main category: cs.CV

TL;DR: The paper proposes a two-stage video generation method using latent diffusion models, focusing on cascaded video super-resolution (VSR) for high-resolution outputs. It introduces degradation strategies, analyzes VSR model behavior, and presents architectural innovations for efficiency.

DetailsMotivation: User demands for higher-resolution video outputs make latent computation alone insufficient, necessitating a decoupled approach for semantic content generation and detail synthesis.

Method: The study explores degradation strategies for training pairs, analyzes timestep sampling and noise augmentation, and introduces interleaving temporal units and sparse local attention for efficient VSR.

Result: The framework outperforms existing methods, with ablation studies validating each design choice’s effectiveness.

Conclusion: The work establishes a practical baseline for cascaded VSR, offering insights for future efficient synthesis systems.

Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on the key design principles for the latter cascaded VSR models, which remain underexplored. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, and (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce an interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.

[180] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping

Jannik Franzen, Fabian H. Reith, Claudia Winklmayr, Jerome Luescher, Nora Koreuber, Elias Baumann, Christian M. Schuerch, Dagmar Kainmueller, Josef Lorenz Rumberger

Main category: cs.CV

TL;DR: PhenoBench is a new benchmark for evaluating foundational models (FMs) on cell phenotyping in H&E-stained histopathology images, featuring PhenoCell dataset and benchmarking code. Existing FMs perform poorly on PhenoCell, highlighting its challenge and utility for future research.

DetailsMotivation: The lack of a unified benchmark for evaluating FMs on cell phenotyping in histopathology images motivated the creation of PhenoBench.

Method: PhenoBench includes PhenoCell, a new dataset with 14 granular cell types, and provides fine-tuning and benchmarking code to evaluate FMs under various generalization scenarios.

Result: Existing FMs score poorly on PhenoCell (as low as 0.20 F1), unlike their performance on other benchmarks (e.g., Lizard, PanNuke), indicating its higher difficulty.

Conclusion: PhenoCell is a valuable resource for benchmarking FMs and supervised models, revealing gaps in current model capabilities.

Abstract: Digital pathology has seen the advent of a wealth of foundational models (FM), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PhenoBench: A comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (H&E) stained histopathology images. We provide both PhenoCell, a new H&E dataset featuring 14 granular cell types identified by using multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in different generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores > 0.70 on previously established benchmarks such as Lizard and PanNuke, on PhenoCell, we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PhenoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.
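
The reported scores are macro F1 over the 14 granular cell types, which weights each class equally so rare phenotypes are not drowned out by abundant ones. A minimal sketch using scikit-learn, with toy labels standing in for the real classes:

```python
from sklearn.metrics import f1_score

def phenotyping_macro_f1(y_true, y_pred):
    """Macro F1 over cell-type labels: every class counts equally, so rare
    phenotypes contribute as much to the score as common ones."""
    return f1_score(y_true, y_pred, average="macro")

# Toy usage with string labels standing in for the 14 granular cell types.
print(phenotyping_macro_f1(
    ["T cell", "B cell", "T cell", "macrophage"],
    ["T cell", "T cell", "T cell", "macrophage"],
))
```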

[181] GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing

Xianzhi Ma, Jianhui Li, Changhua Pei, Hao Liu

Main category: cs.CV

TL;DR: GeoMag is a new framework for remote sensing image understanding, addressing limitations of current Vision-Language Models (VLMs) by dynamically adjusting attention and resolution for multi-granularity tasks.

DetailsMotivation: Existing RS-VLMs struggle with pixel-level tasks, small-object recognition, and high computational costs for high-resolution images.

Method: GeoMag uses Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC) to focus on task-relevant areas and reduce computational overhead.

Result: GeoMag outperforms existing RS-VLMs in pixel-level tasks and maintains competitive performance across other granularities.

Conclusion: GeoMag enhances RS image parsing efficiency and accuracy, making it a practical solution for diverse remote sensing tasks.

Abstract: The application of Vision-Language Models (VLMs) in remote sensing (RS) image understanding has achieved notable progress, demonstrating the basic ability to recognize and describe geographical entities. However, existing RS-VLMs are mostly limited to image-level and region-level tasks, lacking the capability to handle pixel-level tasks and performing poorly in small-object recognition scenarios. Moreover, RS-VLMs consume significant computational resources when processing high-resolution RS images, further restricting their practical applicability. In this context, we propose GeoMag (Geographical Magnifier), an end-to-end general-purpose large model framework for RS. GeoMag dynamically focuses the attention scope based on prompt semantics to effectively perform remote sensing image parsing across multiple levels of granularity. This method introduces Task-driven Multi-granularity Resolution Adjustment (TMRA) and Prompt-guided Semantic-aware Cropping (PSC), which adaptively reduce the spatial resolution of task-irrelevant regions while enhancing the visual representation of task-relevant areas. This approach improves the model’s perception of critical target regions, suppresses background redundancy, and reduces the computational cost of interpreting high-resolution RS imagery. Extensive comparative experiments on 10 benchmarks demonstrate that GeoMag not only excels in handling pixel-level tasks but also maintains competitive performance across tasks of other granularities compared to existing RS-VLMs.

[182] Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models

Zejian Li, Yize Li, Chenye Meng, Zhongni Liu, Yang Ling, Shengyuan Zhang, Guang Yang, Changyuan Yang, Zhiyuan Yang, Lingyun Sun

Main category: cs.CV

TL;DR: Inversion-DPO is a novel alignment framework for diffusion models that avoids reward modeling by using DDIM inversion, improving training efficiency and precision.

DetailsMotivation: Existing alignment methods for diffusion models are computationally intensive and may reduce accuracy and efficiency.

Method: Inversion-DPO reformulates Direct Preference Optimization (DPO) with DDIM inversion, eliminating the need for reward models.

Result: The method shows significant performance improvements in text-to-image and compositional image generation tasks.

Conclusion: Inversion-DPO offers an efficient, high-precision alignment approach for diffusion models, enhancing their applicability to complex tasks.

Abstract: Recent advancements in diffusion models (DMs) have been propelled by alignment methods that post-train models to better conform to human preferences. However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. Our method conducts intractable posterior sampling in Diffusion-DPO with the deterministic inversion from winning and losing samples to noise, and thus derives a new post-training paradigm. This paradigm eliminates the need for auxiliary reward models or inaccurate approximation, significantly enhancing both the precision and efficiency of training. We apply Inversion-DPO to a basic task of text-to-image generation and a challenging task of compositional image generation. Extensive experiments show substantial performance improvements achieved by Inversion-DPO compared to existing post-training methods and highlight the ability of the trained generative models to generate high-fidelity, compositionally coherent images. For the post-training of compositional image generation, we curate a paired dataset consisting of 11,140 images with complex structural annotations and comprehensive scores, designed to enhance the compositional capabilities of generative models. Inversion-DPO explores a new avenue for efficient, high-precision alignment in diffusion models, advancing their applicability to complex realistic generation tasks. Our code is available at https://github.com/MIGHTYEZ/Inversion-DPO
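
The underlying preference objective is the standard DPO loss; Inversion-DPO's contribution is obtaining the sample log-likelihoods via deterministic DDIM inversion rather than from a reward model. A sketch of the loss itself, with beta as an assumed hyperparameter:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (winning, losing) sample log-probs. In
    Inversion-DPO these log-probs come from deterministic DDIM inversion of
    the two samples, not from an auxiliary reward model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```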

[183] Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization

Hayat Ullah, Arslan Munir, Oliver Nina

Main category: cs.CV

TL;DR: PCL-Former, a hierarchical multi-stage transformer architecture, improves temporal action localization by using specialized transformer modules for proposal, classification, and localization, outperforming state-of-the-art methods.

DetailsMotivation: The success of transformers and multi-stage architectures in video recognition and object detection inspired their application to temporal action localization (TAL) to leverage spatio-temporal properties.

Method: PCL-Former consists of three dedicated transformer modules: Proposal-Former for segment identification, Classification-Former for action categorization, and Localization-Former for precise boundary prediction.

Result: PCL-Former outperformed state-of-the-art TAL methods by 2.8%, 1.2%, and 4.8% on THUMOS-14, ActivityNet-1.3, and HACS datasets, respectively.

Conclusion: The hierarchical multi-stage transformer approach is effective for TAL, demonstrating superior performance and validating the design of specialized modules.

Abstract: Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS-14, ActivityNet-1.3, and HACS datasets, respectively.

[184] Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

Ethan Dack, Chengliang Dai

Main category: cs.CV

TL;DR: The paper revisits the ‘Name That Dataset’ task for chest X-ray datasets to explore biases, applies transformations, and analyzes results to ensure AI methods focus on pathology.

DetailsMotivation: To investigate if biases exist in popular open-source chest X-ray datasets and ensure AI methods prioritize relevant pathology over shortcuts.

Method: Applies the ‘Name That Dataset’ task to NIH, CheXpert, MIMIC-CXR, and PadChest datasets, uses transformations, and tests various network architectures.

Result: Identifies and explains biases in the datasets, emphasizing the need for explainable research.

Conclusion: Encourages more explainable research and open-source datasets in medical imaging to improve AI applications.

Abstract: Recent works have revisited the infamous task ‘Name That Dataset’, demonstrating that non-medical datasets contain underlying biases and that the dataset origin task can be solved with high accuracy. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. To extend our work, we apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. Given the importance of AI applications in medical imaging, it is vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. Our code can be found here: https://github.com/eedack01/x_ray_ds_bias.
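
The ‘Name That Dataset’ task itself is easy to reproduce: train a classifier whose labels are the dataset of origin rather than any pathology. A hedged PyTorch sketch under that framing, assuming grayscale inputs and the four sources above; the backbone and optimizer are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# 4 origin classes: NIH, CheXpert, MIMIC-CXR, PadChest
model = models.resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # grayscale X-rays
model.fc = nn.Linear(model.fc.in_features, 4)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images, dataset_labels):
    """One step of the dataset-origin task: high held-out accuracy here
    signals dataset bias that a pathology model could shortcut on."""
    optimizer.zero_grad()
    loss = criterion(model(images), dataset_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```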

[185] How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

Main category: cs.CV

TL;DR: Evaluation of VLMs in medical tasks shows general-purpose models match or outperform medical-specific ones, but reasoning lags understanding, and reliability for clinical use remains unmet.

DetailsMotivation: Assess the competence of VLMs in medical tasks, given their increasing healthcare applications, and identify gaps in performance and reliability.

Method: Comprehensive evaluation of open-source general-purpose and medical VLMs (3B-72B parameters) across eight benchmarks, analyzing understanding and reasoning separately.

Result: General-purpose VLMs match or surpass medical-specific ones in some tasks, but reasoning is weaker than understanding, and performance varies widely across benchmarks.

Conclusion: No VLM meets clinical reliability standards; stronger multimodal alignment and rigorous evaluation protocols are needed.

Abstract: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

[186] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments

Hayat Ullah, Abbas Khan, Arslan Munir, Hari Kalva

Main category: cs.CV

TL;DR: The paper introduces OD-VIRAT Large and Tiny benchmarks for human surveillance, featuring diverse scenes and rich annotations, and evaluates state-of-the-art object detection models on challenging conditions.

DetailsMotivation: To advance robust computer vision models for surveillance by providing diverse, realistic datasets and benchmarking modern object detection architectures.

Method: Creation of two benchmarks (OD-VIRAT Large and Tiny) with extensive annotations, followed by evaluation of RTMDet, YOLOX, RetinaNet, DETR, and Deformable-DETR on these datasets.

Result: The benchmarks include 8.7M and 288K annotated instances, respectively, and provide insights into model performance under challenging conditions like occlusion and small-scale objects.

Conclusion: The work sets a foundation for developing more efficient object detection architectures in surveillance, addressing real-world complexities.

Abstract: Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for human and human-interacting object detection in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different scenes of human surveillance recorded from significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories, where OD-VIRAT Large has 8.7 million annotated instances in 599,996 images and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also focuses on benchmarking state-of-the-art object detection architectures, including RTMDet, YOLOX, RetinaNet, DETR, and Deformable-DETR, on this object detection-specific variant of the VIRAT dataset. To the best of our knowledge, this is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings will help in providing insights concerning the performance of the selected object detection models and lay the groundwork for developing more efficient and robust object detection architectures.

[187] Demographic-aware fine-grained classification of pediatric wrist fractures

Ammar Ahmed, Ali Shariq Imran, Zenun Kastrati, Sher Muhammad Daudpota

Main category: cs.CV

TL;DR: A study improves wrist pathology diagnosis by combining fine-grained recognition, metadata fusion, and fine-grained pre-training, achieving 2-10% accuracy gains.

DetailsMotivation: Diagnosing wrist pathologies is time-consuming and requires expertise; limited datasets and reliance on single modalities like images are inadequate.

Method: Uses fine-grained recognition for subtle X-ray pathologies, fuses patient metadata with images, and employs fine-grained pre-training instead of ImageNet.

Result: Improves diagnostic accuracy by 2% with limited data and over 10% with a larger fracture-focused dataset.

Conclusion: Fine-grained strategies and metadata integration enhance wrist pathology diagnosis, even with limited datasets.

Abstract: Wrist pathologies are frequently observed, particularly among children who constitute the majority of fracture cases. However, diagnosing these conditions is time-consuming and requires specialized expertise. Computer vision presents a promising avenue, contingent upon the availability of extensive datasets, a notable challenge in medical imaging. Therefore, reliance solely on one modality, such as images, proves inadequate, especially in an era of diverse and plentiful data types. In this study, we employ a multifaceted approach to address the challenge of recognizing wrist pathologies using an extremely limited dataset. Initially, we approach the problem as a fine-grained recognition task, aiming to identify subtle X-ray pathologies that conventional CNNs overlook. Secondly, we enhance network performance by fusing patient metadata with X-ray images. Thirdly, rather than pre-training on a coarse-grained dataset like ImageNet, we utilize weights trained on a fine-grained dataset. While metadata integration has been used in other medical domains, this is a novel application for wrist pathologies. Our results show that a fine-grained strategy and metadata integration improve diagnostic accuracy by 2% with a limited dataset and by over 10% with a larger fracture-focused dataset.
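
Metadata fusion of the kind described is typically implemented as late fusion: embed the tabular metadata, concatenate with image features, and classify. A minimal PyTorch sketch under that assumption, with layer sizes as illustrative choices:

```python
import torch
import torch.nn as nn

class MetaFusionNet(nn.Module):
    """Late fusion of CNN image features with patient metadata (e.g. age, sex)."""
    def __init__(self, backbone, img_dim, meta_dim, n_classes):
        super().__init__()
        self.backbone = backbone                      # maps image -> (B, img_dim)
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU())
        self.head = nn.Linear(img_dim + 32, n_classes)

    def forward(self, image, metadata):
        feats = torch.cat([self.backbone(image), self.meta_mlp(metadata)], dim=1)
        return self.head(feats)
```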

[188] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition

Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, Arslan Munir

Main category: cs.CV

TL;DR: DVFL-Net, a lightweight Video Focal Modulation Network, uses knowledge distillation and spatio-temporal feature modulation to reduce computation while maintaining high accuracy, making it suitable for real-time HAR applications.

DetailsMotivation: Transformers for video recognition are computationally expensive, especially with dense video data, necessitating a more efficient solution.

Method: Proposes DVFL-Net, which distills knowledge from a large teacher model into a compact student model using forward KL divergence and spatio-temporal focal modulation.

Result: DVFL-Net achieves lower memory usage, reduced GFLOPs, and strong accuracy on benchmarks like UCF101 and Kinetics-400.

Conclusion: DVFL-Net balances performance and efficiency, making it practical for real-time HAR.

Abstract: The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
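
The forward KL distillation term can be written compactly. A sketch, assuming temperature-scaled logits; the temperature value is an assumption, not the paper's setting:

```python
import torch.nn.functional as F

def forward_kl_distill(student_logits, teacher_logits, T=4.0):
    """Forward KL D(teacher || student) with temperature smoothing, the usual
    scaling by T^2 keeps gradient magnitudes comparable across temperatures."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```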

cs.AI

[189] GraphTrafficGPT: Enhancing Traffic Management Through Graph-Based AI Agent Coordination

Nabil Abdelaziz Ferhat Taleb, Abdolazim Rezaei, Raj Atulkumar Patel, Mehdi Sookhak

Main category: cs.AI

TL;DR: GraphTrafficGPT improves traffic management by replacing sequential task execution with a graph-based architecture, reducing token usage and latency while enabling parallel processing.

DetailsMotivation: Current LLM-based traffic systems like TrafficGPT suffer from inefficiencies due to sequential execution, high token usage, and poor scalability, limiting their real-world applicability.

Method: GraphTrafficGPT uses a graph-based architecture with a Brain Agent to decompose queries, construct dependency graphs, and coordinate specialized agents for parallel task execution and dynamic resource allocation.

Result: The system reduces token consumption by 50.2%, response latency by 19.0%, and improves efficiency by 23.0% in multi-query scenarios compared to TrafficGPT.

Conclusion: GraphTrafficGPT offers a scalable and efficient solution for LLM-driven traffic management, outperforming existing sequential systems.

Abstract: Large Language Models (LLMs) offer significant promise for intelligent traffic management; however, current chain-based systems like TrafficGPT are hindered by sequential task execution, high token usage, and poor scalability, making them inefficient for complex, real-world scenarios. To address these limitations, we propose GraphTrafficGPT, a novel graph-based architecture, which fundamentally redesigns the task coordination process for LLM-driven traffic applications. GraphTrafficGPT represents tasks and their dependencies as nodes and edges in a directed graph, enabling efficient parallel execution and dynamic resource allocation. The main idea behind the proposed model is a Brain Agent that decomposes user queries, constructs optimized dependency graphs, and coordinates a network of specialized agents for data retrieval, analysis, visualization, and simulation. By introducing advanced context-aware token management and supporting concurrent multi-query processing, the proposed architecture handles interdependent tasks typical of modern urban mobility environments. Experimental results demonstrate that GraphTrafficGPT reduces token consumption by 50.2% and average response latency by 19.0% compared to TrafficGPT, while supporting simultaneous multi-query execution with up to 23.0% improvement in efficiency.
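
The gain over chain-based execution comes from running every task whose dependencies are satisfied in parallel. A minimal sketch of such a dependency-graph scheduler; the task names in the usage example are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_dependency_graph(tasks, deps):
    """Execute tasks level-by-level: everything whose prerequisites are done
    runs in parallel. tasks: {name: callable}; deps: {name: set of prereqs}."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(tasks):
            ready = [n for n in tasks if n not in done and deps.get(n, set()) <= done]
            if not ready:
                raise ValueError("cycle in dependency graph")
            futures = {n: pool.submit(tasks[n]) for n in ready}
            for n, f in futures.items():
                results[n] = f.result()
            done.update(ready)
    return results

# Hypothetical three-task query: fetch runs after parse, visualize after fetch.
results = run_dependency_graph(
    {"parse": lambda: "query parsed", "fetch": lambda: "traffic data", "viz": lambda: "chart"},
    {"fetch": {"parse"}, "viz": {"fetch"}},
)
```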

[190] PrefPalette: Personalized Preference Modeling with Latent Attributes

Shuyue Stella Li, Melanie Sclar, Hunter Lang, Ansong Ni, Jacqueline He, Puxin Xu, Andrew Cohen, Chan Young Park, Yulia Tsvetkov, Asli Celikyilmaz

Main category: cs.AI

TL;DR: PrefPalette improves AI preference prediction by decomposing preferences into interpretable attributes and tailoring them to social communities, outperforming GPT-4o by 46.6%.

DetailsMotivation: Current preference models treat human judgment as a black box, lacking understanding of underlying reasons for preferences.

Method: PrefPalette uses multi-attribute decision making: (1) scalable counterfactual attribute synthesis for isolated effects, and (2) attention-based preference modeling for dynamic attribute weighting by communities.

Result: Outperforms GPT-4o by 46.6% in prediction accuracy, revealing community-specific profiles (e.g., scholarly communities prioritize verbosity).

Conclusion: PrefPalette offers superior, interpretable preference modeling, advancing trustworthy, value-aware AI personalization.

Abstract: Personalizing AI systems requires understanding not just what users prefer, but the reasons that underlie those preferences - yet current preference models typically treat human judgment as a black box. We introduce PrefPalette, a framework that decomposes preferences into attribute dimensions and tailors its preference prediction to distinct social community values in a human-interpretable manner. PrefPalette operationalizes a cognitive science principle known as multi-attribute decision making in two ways: (1) a scalable counterfactual attribute synthesis step that involves generating synthetic training data to isolate individual attribute effects (e.g., formality, humor, cultural values), and (2) attention-based preference modeling that learns how different social communities dynamically weight these attributes. This approach moves beyond aggregate preference modeling to capture the diverse evaluation frameworks that drive human judgment. When evaluated on 45 social communities from the online platform Reddit, PrefPalette outperforms GPT-4o by 46.6% in average prediction accuracy. Beyond raw predictive improvements, PrefPalette also sheds light on intuitive, community-specific profiles: scholarly communities prioritize verbosity and stimulation, conflict-oriented communities value sarcasm and directness, and support-based communities emphasize empathy. By modeling the attribute-mediated structure of human judgment, PrefPalette delivers both superior preference modeling and transparent, interpretable insights, and serves as a first step toward more trustworthy, value-aware personalized applications.
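
The attention-based weighting can be pictured as a learned, community-conditioned softmax over per-attribute scores. A PyTorch sketch; the embedding-as-query design is an assumption, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CommunityAttention(nn.Module):
    """Weights per-attribute scores (formality, humor, ...) with a learned,
    community-conditioned attention vector."""
    def __init__(self, n_attrs, n_communities):
        super().__init__()
        self.community_query = nn.Embedding(n_communities, n_attrs)

    def forward(self, attr_scores, community_id):
        # attr_scores: (B, n_attrs); community_id: (B,)
        weights = torch.softmax(self.community_query(community_id), dim=-1)
        return (weights * attr_scores).sum(dim=-1)  # one preference score per item
```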

[191] GOFAI meets Generative AI: Development of Expert Systems by means of Large Language Models

Eduardo C. Garrido-Merchán, Cristina Puente

Main category: cs.AI

TL;DR: A new approach combines LLMs with symbolic systems (Prolog) to create controlled, transparent expert systems, reducing hallucinations and ensuring reliability.

DetailsMotivation: Address the disadvantages of LLMs, such as hallucinations and unverifiable facts, by developing a controlled and transparent method for expert systems.

Method: Use domain-limited, structured prompts to extract knowledge into Prolog for human validation, ensuring interpretability and reliability.

Result: Quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1 show strong fact adherence and semantic coherence.

Conclusion: The hybrid approach combines LLM recall with symbolic precision, enabling dependable AI in sensitive domains.

Abstract: The development of large language models (LLMs) has successfully transformed knowledge-based systems such as open-domain question answering, which can automatically produce vast amounts of seemingly coherent information. Yet, those models have several disadvantages, such as hallucinations and the confident generation of incorrect or unverifiable facts. In this paper, we introduce a new approach to the development of expert systems using LLMs in a controlled and transparent way. By limiting the domain and employing a well-structured prompt-based extraction approach, we produce a symbolic representation of knowledge in Prolog, which can be validated and corrected by human experts. This approach also guarantees the interpretability, scalability and reliability of the developed expert systems. Via quantitative and qualitative experiments with Claude Sonnet 3.7 and GPT-4.1, we show strong adherence to facts and semantic coherence on our generated knowledge bases. We present a transparent hybrid solution that combines the recall capacity of LLMs with the precision of symbolic systems, thereby laying the foundation for dependable AI applications in sensitive domains.
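
A sketch of what the extraction step could look like: prompt the LLM for Prolog clauses only, then keep syntactically plausible lines for human validation. The predicate names and the llm_call interface are hypothetical, not the paper's prompt:

```python
PROMPT_TEMPLATE = """You are a knowledge engineer. From the text below, emit
only Prolog facts and rules, one clause per line, using the predicates
symptom/2, causes/2 and diagnosis/2.

Text: {text}"""

def extract_prolog(llm_call, text):
    """Query the LLM for a symbolic knowledge base; every retained clause is
    then validated (and corrected) by a human expert before use."""
    raw = llm_call(PROMPT_TEMPLATE.format(text=text))
    return [line.strip() for line in raw.splitlines()
            if line.strip().endswith(".")]  # keep syntactically plausible clauses
```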

[192] DailyLLM: Context-Aware Activity Log Generation Using Multi-Modal Sensors and LLMs

Ye Tian, Xiaoyuan Ren, Zihao Wang, Onat Gungor, Xiaofan Yu, Tajana Rosing

Main category: cs.AI

TL;DR: DailyLLM is a novel system for generating and summarizing activity logs by integrating contextual data from smartphones and smartwatches, outperforming SOTA methods in accuracy and efficiency.

DetailsMotivation: Existing activity log generation methods lack accuracy, efficiency, and semantic richness, despite the potential of LLMs.

Method: DailyLLM uses a lightweight LLM framework with structured prompting and efficient feature extraction to integrate location, motion, environment, and physiology data.

Result: DailyLLM achieves a 17% higher BERTScore precision and 10x faster inference speed than a 70B-parameter SOTA baseline.

Conclusion: DailyLLM effectively addresses limitations in log generation, offering a scalable and efficient solution for ubiquitous computing.

Abstract: Rich and context-aware activity logs facilitate user behavior analysis and health monitoring, making them a key research focus in ubiquitous computing. The remarkable semantic understanding and generation capabilities of Large Language Models (LLMs) have recently created new opportunities for activity log generation. However, existing methods continue to exhibit notable limitations in terms of accuracy, efficiency, and semantic richness. To address these challenges, we propose DailyLLM. To the best of our knowledge, this is the first log generation and summarization system that comprehensively integrates contextual activity information across four dimensions: location, motion, environment, and physiology, using only sensors commonly available on smartphones and smartwatches. To achieve this, DailyLLM introduces a lightweight LLM-based framework that integrates structured prompting with efficient feature extraction to enable high-level activity understanding. Extensive experiments demonstrate that DailyLLM outperforms state-of-the-art (SOTA) log generation methods and can be efficiently deployed on personal computers and Raspberry Pi. Utilizing only a 1.5B-parameter LLM model, DailyLLM achieves a 17% improvement in log generation BERTScore precision compared to the 70B-parameter SOTA baseline, while delivering nearly 10x faster inference speed.
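
Structured prompting over the four context dimensions might look like the following; the field names and wording are illustrative, not the paper's template:

```python
def build_log_prompt(sensors):
    """Assemble a structured prompt from the four context dimensions."""
    return (
        "Write a concise first-person activity log entry.\n"
        f"Location: {sensors['location']}\n"
        f"Motion: {sensors['motion']}\n"
        f"Environment: {sensors['environment']}\n"
        f"Physiology: {sensors['physiology']}\n"
    )

# Hypothetical smartwatch/smartphone readings.
prompt = build_log_prompt({
    "location": "office, 3rd floor",
    "motion": "sitting, 12 steps in last 10 min",
    "environment": "62 dB, 24 C, indoor light",
    "physiology": "heart rate 71 bpm",
})
```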

[193] Why Isn’t Relational Learning Taking Over the World?

David Poole

Main category: cs.AI

TL;DR: The paper argues for focusing on relational learning (modeling entities, properties, and relations) over traditional AI methods (like text and image modeling), highlighting its underutilization despite its potential.

DetailsMotivation: Current AI focuses on modeling pixels and words, but the world is made of entities and relations. Relational data (e.g., spreadsheets, databases) is valuable yet underrepresented in AI research.

Method: The paper critiques the dominance of non-relational AI methods and advocates for relational learning, discussing its current limitations and potential.

Result: Relational learning is underused except in niche cases, despite its relevance to real-world data.

Conclusion: The paper calls for advancing relational learning to better model the world’s structure and unlock its full potential.

Abstract: AI seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can’t be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world – except in a few cases with restricted relations – and what needs to be done to bring it to it’s rightful prominence.

[194] BifrostRAG: Bridging Dual Knowledge Graphs for Multi-Hop Question Answering in Construction Safety

Yuxin Zhang, Xi Wang, Mo Hu, Zhenyu Zhang

Main category: cs.AI

TL;DR: BifrostRAG is a dual-graph RAG system combining linguistic and document structure modeling, outperforming traditional methods in multi-hop question answering for compliance checking.

DetailsMotivation: The complexity of regulatory text and multi-hop queries in compliance checking challenges traditional RAG systems.

Method: BifrostRAG uses an Entity Network Graph for linguistic relationships and a Document Navigator Graph for document structure, enabling hybrid retrieval.

Result: Achieves 92.8% precision, 85.5% recall, and 87.3% F1 score, outperforming vector-only and graph-only RAG baselines.

Conclusion: BifrostRAG is a robust solution for compliance checking, offering a transferable approach for complex technical documents.

Abstract: Information retrieval and question answering from safety regulations are essential for automated construction compliance checking but are hindered by the linguistic and structural complexity of regulatory text. Many compliance-related queries are multi-hop, requiring synthesis of information across interlinked clauses. This poses a challenge for traditional retrieval-augmented generation (RAG) systems. To overcome this, we introduce BifrostRAG: a dual-graph RAG-integrated system that explicitly models both linguistic relationships (via an Entity Network Graph) and document structure (via a Document Navigator Graph). This architecture powers a hybrid retrieval mechanism that combines graph traversal with vector-based semantic search, enabling large language models to reason over both the meaning and the structure of the text. Evaluation on a multi-hop question dataset shows that BifrostRAG achieves 92.8 percent precision, 85.5 percent recall, and an F1 score of 87.3 percent. These results significantly outperform vector-only and graph-only RAG baselines that represent current leading approaches. Error analysis further highlights the comparative advantages of our hybrid method over single-modality RAGs. These findings establish BifrostRAG as a robust knowledge engine for LLM-driven compliance checking. Its dual-graph, hybrid retrieval mechanism offers a transferable blueprint for navigating complex technical documents across knowledge-intensive engineering domains.
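
Hybrid retrieval of this kind usually blends a vector-similarity score with a graph-proximity bonus. A sketch under that assumption, with alpha as a hypothetical mixing weight:

```python
import numpy as np

def hybrid_score(clause, query_vec, entity_hops, alpha=0.6):
    """Blend vector similarity with a graph-proximity bonus: clauses reachable
    in fewer hops from the query's entities rank higher."""
    sim = float(np.dot(clause["vec"], query_vec) /
                (np.linalg.norm(clause["vec"]) * np.linalg.norm(query_vec)))
    hops = entity_hops.get(clause["id"], float("inf"))  # hops in the entity graph
    return alpha * sim + (1 - alpha) / (1.0 + hops)
```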

[195] Cross-modal Causal Intervention for Alzheimer’s Disease Prediction

Yutao Jin, Haowen Xiao, Jielei Chu, Fengmao Lv, Yuxiao Li, Tianrui Li

Main category: cs.AI

TL;DR: A novel visual-language causal intervention framework (ADPC) is proposed for Alzheimer’s Disease (AD) prediction, using MRI/fMRI and LLM-generated text to classify CN, MCI, and AD, outperforming non-causal methods by addressing confounders.

DetailsMotivation: Early AD diagnosis is challenging due to data biases and complex variable relationships, necessitating a robust method to eliminate confounders.

Method: ADPC integrates MRI/fMRI and LLM-processed clinical data, applying causal intervention to remove confounders for accurate CN/MCI/AD classification.

Result: ADPC achieves SOTA performance in distinguishing CN, MCI, and AD, demonstrating superior reliability over non-causal models.

Conclusion: The study highlights the effectiveness of causal reasoning in multi-modal learning for neurological disease diagnosis.

Abstract: Mild Cognitive Impairment (MCI) serves as a prodromal stage of Alzheimer’s Disease (AD), where early identification and intervention can effectively slow the progression to dementia. However, diagnosing AD remains a significant challenge in neurology due to the confounders caused mainly by the selection bias of multimodal data and the complex relationships between variables. To address these issues, we propose a novel visual-language causal intervention framework named Alzheimer’s Disease Prediction with Cross-modal Causal Intervention (ADPC) for diagnostic assistance. Our ADPC employs large language model (LLM) to summarize clinical data under strict templates, maintaining structured text outputs even with incomplete or unevenly distributed datasets. The ADPC model utilizes Magnetic Resonance Imaging (MRI), functional MRI (fMRI) images and textual data generated by LLM to classify participants into Cognitively Normal (CN), MCI, and AD categories. Because of the presence of confounders, such as neuroimaging artifacts and age-related biomarkers, non-causal models are likely to capture spurious input-output correlations, generating less reliable results. Our framework implicitly eliminates confounders through causal intervention. Experimental results demonstrate the outstanding performance of our method in distinguishing CN/MCI/AD cases, achieving state-of-the-art (SOTA) metrics across most evaluation metrics. The study showcases the potential of integrating causal reasoning with multi-modal learning for neurological disease diagnosis.

[196] Buggy rule diagnosis for combined steps through final answer evaluation in stepwise tasks

Gerben van der Hoek, Johan Jeuring, Rogier Bos

Main category: cs.AI

TL;DR: The paper explores using final answers for error diagnosis in intelligent tutoring systems to avoid combinatorial explosion in stepwise tasks, validating the approach with a dataset of quadratic equation solutions.

DetailsMotivation: Combinatorial explosion in diagnosing stepwise tasks makes error diagnosis challenging. Using final answers can simplify this process.

Method: Designs a service for buggy rule diagnosis based on final answers and tests it on a dataset of quadratic equation solutions.

Result: Final answer evaluation diagnosed 29.4% of previously undiagnosed steps, with 97% alignment to teacher diagnoses in a subset.

Conclusion: The approach shows promise for further exploration in error diagnosis for intelligent tutoring systems.

Abstract: Many intelligent tutoring systems can support a student in solving a stepwise task. When a student combines several steps in one step, the number of possible paths connecting consecutive inputs may be very large. This combinatorial explosion makes error diagnosis hard. Using a final answer to diagnose a combination of steps can mitigate the combinatorial explosion, because there are generally fewer possible (erroneous) final answers than (erroneous) solution paths. An intermediate input for a task can be diagnosed by automatically completing it according to the task solution strategy and diagnosing this solution. This study explores the potential of automated error diagnosis based on a final answer. We investigate the design of a service that provides a buggy rule diagnosis when a student combines several steps. To validate the approach, we apply the service to an existing dataset (n=1939) of unique student steps when solving quadratic equations, which could not be diagnosed by a buggy rule service that tries to connect consecutive inputs with a single rule. Results show that final answer evaluation can diagnose 29.4% of these steps. Moreover, a comparison of the generated diagnoses with teacher diagnoses on a subset (n=115) shows that the diagnoses align in 97% of the cases. These results can be considered a basis for further exploration of the approach.
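
The idea is that competing buggy rules are cheap to distinguish at the level of final answers. A toy sketch for quadratic equations; the two misconceptions encoded here are hypothetical examples, not the paper's rule set:

```python
import math

def solve_quadratic(a, b, c):
    d = b * b - 4 * a * c
    if d < 0:
        return set()
    return {(-b + math.sqrt(d)) / (2 * a), (-b - math.sqrt(d)) / (2 * a)}

# Hypothetical buggy rules: each maps coefficients to the final answers a
# student would reach by applying that misconception throughout.
BUGGY_RULES = {
    "sign_error_on_b": lambda a, b, c: solve_quadratic(a, -b, c),
    "dropped_4ac_term": lambda a, b, c: {(-b + abs(b)) / (2 * a),
                                         (-b - abs(b)) / (2 * a)},
}

def _norm(xs, nd=6):
    return {round(x, nd) for x in xs}  # tolerate float noise when comparing

def diagnose_final_answer(a, b, c, student_answers):
    """Return the buggy rules whose predicted final answers match the student's."""
    return [name for name, rule in BUGGY_RULES.items()
            if _norm(rule(a, b, c)) == _norm(student_answers)]
```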

[197] Combining model tracing and constraint-based modeling for multistep strategy diagnoses

Gerben van der Hoek, Johan Jeuring, Rogier Bos

Main category: cs.AI

TL;DR: The paper proposes a hybrid approach combining model tracing and constraint-based modeling to diagnose student input in stepwise tasks, validated with a dataset of quadratic equation solutions.

DetailsMotivation: To improve student input diagnosis by merging the strengths of model tracing (identifying consecutive steps) and constraint-based modeling (handling combined steps).

Method: Defines constraints as shared properties between student input and strategy steps, enabling diagnosis of deviations even when steps are combined. Evaluated using a dataset of quadratic equation solutions (n=2136) and teacher-coded samples (n=140).

Result: The system’s diagnoses aligned perfectly with teacher coding in all 140 student steps.

Conclusion: The hybrid approach effectively diagnoses student input in multistep tasks, even when steps are combined, as validated by teacher agreement.

Abstract: Model tracing and constraint-based modeling are two approaches to diagnose student input in stepwise tasks. Model tracing supports identifying consecutive problem-solving steps taken by a student, whereas constraint-based modeling supports student input diagnosis even when several steps are combined into one step. We propose an approach that merges both paradigms. By defining constraints as properties that a student input has in common with a step of a strategy, it is possible to provide a diagnosis when a student deviates from a strategy, even when the student combines several steps. In this study we explore the design of a system for multistep strategy diagnoses, and evaluate these diagnoses. As a proof of concept, we generate diagnoses for an existing dataset containing steps students take when solving quadratic equations (n=2136). To compare with human diagnoses, two teachers coded a random sample of deviations (n=70) and applications of the strategy (n=70). Results show that the system diagnosis aligned with the teacher coding in all of the 140 student steps.

[198] OntView: What you See is What you Meant

Carlos Bobed, Carlota Quintana, Eduardo Mena, Jorge Bobed, Fernando Bobillo

Main category: cs.AI

TL;DR: OntView is an ontology viewer offering intuitive visualization of ontology structures, including GCIs, with features to simplify views and avoid overload.

DetailsMotivation: Existing ontology tools lack effective visualization, making it hard to comprehend large frameworks.

Method: OntView uses a DL reasoner for inferred knowledge, visualizes GCIs, and provides simplified views via summaries, focused TBox elements, and dynamic branch hiding.

Result: OntView successfully addresses visualization challenges with its open-source tool.

Conclusion: OntView enhances ontology comprehension through intuitive and dynamic visualization, filling a gap in existing tools.

Abstract: In the field of knowledge management and computer science, ontologies provide a structured framework for modeling domain-specific knowledge by defining concepts and their relationships. However, the lack of tools that provide effective visualization is still a significant challenge. While numerous ontology editors and viewers exist, most of them fail to graphically represent ontology structures in a meaningful and non-overwhelming way, limiting users’ ability to comprehend dependencies and properties within large ontological frameworks. In this paper, we present OntView, an ontology viewer that is designed to provide users with an intuitive visual representation of ontology concepts and their formal definitions through a user-friendly interface. Building on the use of a DL reasoner, OntView follows a “What you see is what you meant” paradigm, showing the actual inferred knowledge. One key aspect for this is its ability to visualize General Concept Inclusions (GCI), a feature absent in existing visualization tools. Moreover, to avoid a possible information overload, OntView also offers different ways to show a simplified view of the ontology by: 1) creating ontology summaries by assessing the importance of the concepts (according to different available algorithms), 2) focusing the visualization on the existing TBox elements between two given classes and 3) allowing to hide/show different branches in a dynamic way without losing the semantics. OntView has been released with an open-source license for the whole community.

[199] From Extraction to Synthesis: Entangled Heuristics for Agent-Augmented Strategic Reasoning

Renato Ghisellini, Remo Pareschi, Marco Pedroni, Giovanni Battista Raggi

Main category: cs.AI

TL;DR: A hybrid architecture for strategic reasoning combines heuristics, semantic activation, and compositional synthesis, fusing conflicting heuristics into context-sensitive narratives. Demonstrated via a Meta vs. FTC case study.

DetailsMotivation: To improve strategic reasoning by integrating diverse heuristics and semantic interdependence, moving beyond traditional rule-based decision engines.

Method: Combines heuristic extraction, semantic activation, and compositional synthesis, inspired by quantum cognition. Uses semantic interaction modeling and rhetorical framing.

Result: Preliminary validation via semantic metrics in a Meta vs. FTC case study, showing coherent fusion of conflicting heuristics.

Conclusion: The framework offers a novel approach to strategic reasoning, with potential extensions like dynamic interference tuning discussed.

Abstract: We present a hybrid architecture for agent-augmented strategic reasoning, combining heuristic extraction, semantic activation, and compositional synthesis. Drawing on sources ranging from classical military theory to contemporary corporate strategy, our model activates and composes multiple heuristics through a process of semantic interdependence inspired by research in quantum cognition. Unlike traditional decision engines that select the best rule, our system fuses conflicting heuristics into coherent and context-sensitive narratives, guided by semantic interaction modeling and rhetorical framing. We demonstrate the framework via a Meta vs. FTC case study, with preliminary validation through semantic metrics. Limitations and extensions (e.g., dynamic interference tuning) are discussed.

[200] EAGLE: A Lightweight Framework for Temporal Link Prediction in Dynamic Graphs

Haoyang Li, Yuming Xu, Yiming Li, Hanmo Liu, Darian Li, Chen Jason Zhang, Lei Chen, Qing Li

Main category: cs.AI

TL;DR: EAGLE is a lightweight framework for temporal link prediction in dynamic graphs, combining short-term recency and long-term structural patterns for efficiency and scalability.

DetailsMotivation: Existing Temporal Graph Neural Networks (T-GNNs) face scalability and efficiency issues due to high computational overhead.

Method: EAGLE integrates a time-aware module for recent neighbor aggregation and a structure-aware module using temporal personalized PageRank, with adaptive weighting.

Result: EAGLE outperforms state-of-the-art T-GNNs, achieving a 50x speedup over transformer-based T-GNNs.

Conclusion: EAGLE offers a scalable and efficient solution for temporal link prediction without complex architectures.

Abstract: Temporal link prediction in dynamic graphs is a critical task with applications in diverse domains such as social networks, recommendation systems, and e-commerce platforms. While existing Temporal Graph Neural Networks (T-GNNs) have achieved notable success by leveraging complex architectures to model temporal and structural dependencies, they often suffer from scalability and efficiency challenges due to high computational overhead. In this paper, we propose EAGLE, a lightweight framework that integrates short-term temporal recency and long-term global structural patterns. EAGLE consists of a time-aware module that aggregates information from a node’s most recent neighbors to reflect its immediate preferences, and a structure-aware module that leverages temporal personalized PageRank to capture the influence of globally important nodes. To balance these attributes, EAGLE employs an adaptive weighting mechanism to dynamically adjust their contributions based on data characteristics. Also, EAGLE eliminates the need for complex multi-hop message passing or memory-intensive mechanisms, enabling significant improvements in efficiency. Extensive experiments on seven real-world temporal graphs demonstrate that EAGLE consistently achieves superior performance against state-of-the-art T-GNNs in both effectiveness and efficiency, delivering more than a 50x speedup over effective transformer-based T-GNNs.
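
As a rough illustration of the two-module design, the sketch below combines an assumed recency score over a node's most recent neighbors with a temporal-PageRank-style global score, weighted by a coefficient alpha that EAGLE would set adaptively; the names, toy values, and simple aggregations are all assumptions.

```python
# Illustrative sketch only: a learned, adaptive alpha and trained embeddings
# would replace the toy values and mean-pooling used here.
import numpy as np

def recency_score(recent_neighbor_embs, query_emb):
    """Time-aware module: aggregate a node's most recent neighbors."""
    return float(np.mean(recent_neighbor_embs @ query_emb))

def structural_score(ppr, candidate):
    """Structure-aware module: temporal personalized PageRank mass."""
    return ppr.get(candidate, 0.0)

def link_score(recent_embs, query_emb, ppr, candidate, alpha):
    # In EAGLE, the weighting is adjusted adaptively from data characteristics.
    return alpha * recency_score(recent_embs, query_emb) \
        + (1 - alpha) * structural_score(ppr, candidate)

rng = np.random.default_rng(0)
recent = rng.random((5, 16))   # embeddings of the 5 most recent neighbors
query = rng.random(16)         # embedding of the query node
ppr = {42: 0.07, 7: 0.01}      # toy temporal PPR scores per candidate node
print(f"score for node 42: {link_score(recent, query, ppr, 42, alpha=0.6):.3f}")
```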

[201] Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments

Kathrin Korte, Christian Medeiros Adriano, Sona Ghahremani, Holger Giese

Main category: cs.AI

TL;DR: A causal knowledge transfer framework improves MARL in non-stationary environments by sharing compact causal representations, reducing retraining costs.

DetailsMotivation: Address the challenge of knowledge transfer in MARL for non-stationary environments with changing goals, where traditional methods fail to generalize.

Method: Introduces a framework where agents learn and share causal representations of paths, modeling collisions as causal interventions with recovery macros transferred online.

Result: Agents bridged half the gap between random exploration and full retraining, with effectiveness tied to environment complexity and goal heterogeneity.

Conclusion: Causal knowledge transfer is viable for MARL in dynamic settings, though its impact varies with environmental and goal factors.

Abstract: [Context] Multi-agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors. However, transferring knowledge across agents remains challenging in non-stationary environments with changing goals. [Problem] Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt. [Approach] This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non-stationary environment. As the environment changes (new obstacles), agents’ collisions require adaptive recovery strategies. We model each collision as a causal intervention instantiated as a sequence of recovery actions (a macro) whose effect corresponds to a causal knowledge of how to circumvent the obstacle while increasing the chances of achieving the agent’s goal (maximizing cumulative reward). This recovery action macro is transferred online from a second agent and is applied in a zero-shot fashion, i.e., without retraining, just by querying a lookup model with local context information (collisions). [Results] Our findings reveal two key insights: (1) agents with heterogeneous goals were able to bridge about half of the gap between random exploration and a fully retrained policy when adapting to new environments, and (2) the impact of causal knowledge transfer depends on the interplay between environment complexity and agents’ heterogeneous goals.
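
The zero-shot transfer step can be pictured with a minimal sketch: a dictionary keyed on local collision context stands in for the paper's lookup model, and the contexts and actions are invented for illustration.

```python
# Dict-backed stand-in for the lookup model; contexts and actions are made up.
recovery_macros = {}  # context observed at a collision -> recovery actions

def record_macro(context, actions):
    """Source agent stores the recovery macro behind a causal intervention."""
    recovery_macros[context] = actions

def query_macro(context):
    """Target agent applies a transferred macro zero-shot (no retraining)."""
    return recovery_macros.get(context, ["explore_randomly"])

record_macro(("obstacle_ahead", "corridor"), ["turn_left", "forward", "turn_right"])
print(query_macro(("obstacle_ahead", "corridor")))    # transferred macro
print(query_macro(("obstacle_ahead", "open_field")))  # no match: fall back
```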

[202] Large Language Models as Innovators: A Framework to Leverage Latent Space Exploration for Novelty Discovery

Mateusz Bystroński, Mikołaj Hołysz, Grzegorz Piotrowski, Nitesh V. Chawla, Tomasz Kajdanowicz

Main category: cs.AI

TL;DR: A model-agnostic latent-space ideation framework is proposed to enhance AI creativity by navigating embedding spaces, avoiding brittle heuristics and enabling scalable, controlled novelty.

DetailsMotivation: Large language models (LLMs) often lack novelty and relevance in idea generation, relying on training patterns and requiring extensive prompt engineering. Existing solutions are domain-specific and hard to generalize.

Method: The paper introduces a latent-space ideation framework that navigates continuous embedding spaces of ideas, eliminating the need for handcrafted rules and adapting to various domains and tasks.

Result: Preliminary results show the framework’s potential as a general-purpose co-ideator for human-AI collaboration, demonstrating controlled and scalable creativity.

Conclusion: The proposed framework offers a promising, adaptable solution for enhancing AI creativity without domain-specific constraints, paving the way for more effective human-AI collaboration.

Abstract: Innovative idea generation remains a core challenge in AI, as large language models (LLMs) often struggle to produce outputs that are both novel and relevant. Despite their fluency, LLMs tend to replicate patterns seen during training, limiting their ability to diverge creatively without extensive prompt engineering. Prior work has addressed this through domain-specific heuristics and structured prompting pipelines, but such solutions are brittle and difficult to generalize. In this paper, we propose a model-agnostic latent-space ideation framework that enables controlled, scalable creativity by navigating the continuous embedding space of ideas. Unlike prior methods, our framework requires no handcrafted rules and adapts easily to different domains, input formats, and creative tasks. This paper introduces an early-stage prototype of our method, outlining the conceptual framework and preliminary results highlighting its potential as a general-purpose co-ideator for human-AI collaboration.
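
A hedged sketch of what latent-space ideation might look like: embed seed ideas, move through the embedding space by interpolation plus noise, and decode by nearest neighbor over a candidate pool. The `embed` function is a deterministic stand-in for a real text-embedding model, and nearest-neighbor retrieval stands in for a decoder back to text.

```python
# Sketch under stated assumptions: `embed` and the candidate pool are
# placeholders, not the paper's components.
import numpy as np

rng = np.random.default_rng(0)

def embed(text):
    # Placeholder: a real system would call a text-embedding model here.
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.standard_normal(64)

pool = ["solar-powered delivery drones", "community tool libraries",
        "gamified recycling", "edible packaging"]
pool_embs = np.stack([embed(t) for t in pool])

def ideate(seed_a, seed_b, noise=0.3):
    """Interpolate two seed ideas in latent space, perturb, decode by NN."""
    z = 0.5 * (embed(seed_a) + embed(seed_b)) + noise * rng.standard_normal(64)
    sims = pool_embs @ z / (np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(z))
    return pool[int(np.argmax(sims))]

print(ideate("urban farming", "subscription services"))
```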

[203] Towards Constraint Temporal Answer Set Programming

Pedro Cabalar, Martín Diéguez, François Olivier, Torsten Schaub, Igor Stéphan

Main category: cs.AI

TL;DR: A novel temporal and constraint-based extension of the logic of Here-and-There is introduced for nonmonotonic temporal reasoning in ASP, combining linear-time logic and constraint handling for dynamic systems.

DetailsMotivation: Addressing the challenges of reasoning about dynamic systems with fine-grained temporal and numeric resolution in ASP.

Method: Combines linear-time logic of Here-and-There (for nonmonotonic temporal reasoning) with constraint-based logic (for numeric constraints).

Result: An expressive system tailored for ASP, enabling high-resolution reasoning for complex dynamic systems.

Conclusion: Establishes a foundational logical framework for dynamic systems within ASP, integrating temporal and constraint-based reasoning.

Abstract: Reasoning about dynamic systems with a fine-grained temporal and numeric resolution presents significant challenges for logic-based approaches like Answer Set Programming (ASP). To address this, we introduce and elaborate upon a novel temporal and constraint-based extension of the logic of Here-and-There and its nonmonotonic equilibrium extension, representing, to the best of our knowledge, the first approach to nonmonotonic temporal reasoning with constraints specifically tailored for ASP. This expressive system is achieved by a synergistic combination of two foundational ASP extensions: the linear-time logic of Here-and-There, providing robust nonmonotonic temporal reasoning capabilities, and the logic of Here-and-There with constraints, enabling the direct integration and manipulation of numeric constraints, among others. This work establishes the foundational logical framework for tackling complex dynamic systems with high resolution within the ASP paradigm.

[204] KROMA: Ontology Matching with Knowledge Retrieval and Large Language Models

Lam Nguyen, Erika Barcelos, Roger French, Yinghui Wu

Main category: cs.AI

TL;DR: KROMA is a new OM framework using LLMs and RAG to enhance semantic context, outperforming traditional and LLM-based methods with optimized efficiency.

DetailsMotivation: Existing OM systems rely on rigid rules or specialized models, lacking adaptability. KROMA aims to improve flexibility and performance.

Method: KROMA uses LLMs in a RAG pipeline, integrating bisimilarity-based concept matching and lightweight ontology refinement to reduce overhead.

Result: Experiments show KROMA outperforms classic and LLM-based OM systems while maintaining low communication overhead.

Conclusion: KROMA demonstrates the feasibility of optimized techniques (knowledge retrieval, prompt enrichment, refinement) for scalable OM.

Abstract: Ontology Matching (OM) is a cornerstone task of semantic interoperability, yet existing systems often rely on handcrafted rules or specialized models with limited adaptability. We present KROMA, a novel OM framework that harnesses Large Language Models (LLMs) within a Retrieval-Augmented Generation (RAG) pipeline to dynamically enrich the semantic context of OM tasks with structural, lexical, and definitional knowledge. To optimize both performance and efficiency, KROMA integrates a bisimilarity-based concept matching and a lightweight ontology refinement step, which prune candidate concepts and substantially reduce the communication overhead from invoking LLMs. Through experiments on multiple benchmark datasets, we show that integrating knowledge retrieval with context-augmented LLMs significantly enhances ontology matching, outperforming both classic OM systems and cutting-edge LLM-based approaches while keeping communication overhead comparable. Our study highlights the feasibility and benefit of the proposed optimization techniques (targeted knowledge retrieval, prompt enrichment, and ontology refinement) for ontology matching at scale.
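
The retrieval-augmented matching step might look roughly like the sketch below, where `retrieve`, the knowledge bases, `stub_llm`, and the prompt wording are illustrative assumptions rather than KROMA's actual components; KROMA's bisimilarity-based pruning, which cuts down the pairs sent to the LLM, is omitted here.

```python
# A minimal RAG-style matching sketch; every name here is a stand-in.
def retrieve(concept, kb):
    """Gather definitional context for a concept (a fuller pipeline would
    also add structural and lexical context)."""
    return kb.get(concept, "no context")

def match(src, tgt, kb_src, kb_tgt, llm):
    prompt = (f"Concept A: {src}\nContext A: {retrieve(src, kb_src)}\n"
              f"Concept B: {tgt}\nContext B: {retrieve(tgt, kb_tgt)}\n"
              "Do A and B denote the same concept? Answer yes or no.")
    return llm(prompt).strip().lower().startswith("yes")

kb_src = {"myocardial infarction": "necrosis of heart muscle from ischemia"}
kb_tgt = {"heart attack": "death of heart muscle due to blocked blood supply"}
stub_llm = lambda prompt: "yes"  # stand-in for an actual LLM call
print(match("myocardial infarction", "heart attack", kb_src, kb_tgt, stub_llm))
```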

[205] Glucose-ML: A collection of longitudinal diabetes datasets for development of robust AI solutions

Temiloluwa Prioleau, Baiying Lu, Yanjun Cui

Main category: cs.AI

TL;DR: The paper introduces Glucose-ML, a collection of 10 publicly available diabetes datasets, to address barriers in AI development for diabetes management. It includes over 300,000 days of CGM data and provides benchmarks for blood glucose prediction.

DetailsMotivation: Access to large, high-quality datasets is a barrier in developing robust AI solutions for diabetes management. The authors aim to accelerate transparent and reproducible AI development by providing a curated dataset collection.

Method: The authors compile 10 diabetes datasets (Glucose-ML) with 38 million glucose samples from 2500+ participants. They conduct a comparative analysis and a case study on blood glucose prediction to benchmark performance across datasets.

Result: The study shows that AI algorithms yield significantly different prediction results depending on the dataset used, highlighting the importance of dataset selection for robust AI solutions.

Conclusion: The Glucose-ML collection and benchmarks support researchers in developing robust AI solutions for diabetes. The findings emphasize the need for careful dataset selection in health-related AI applications.

Abstract: Artificial intelligence (AI) algorithms are a critical part of state-of-the-art digital health technology for diabetes management. Yet, access to large high-quality datasets is creating barriers that impede development of robust AI solutions. To accelerate development of transparent, reproducible, and robust AI solutions, we present Glucose-ML, a collection of 10 publicly available diabetes datasets, released within the last 7 years (i.e., 2018 - 2025). The Glucose-ML collection comprises over 300,000 days of continuous glucose monitor (CGM) data with a total of 38 million glucose samples collected from 2500+ people across 4 countries. Participants include persons living with type 1 diabetes, type 2 diabetes, prediabetes, and no diabetes. To support researchers and innovators with using this rich collection of diabetes datasets, we present a comparative analysis to guide algorithm developers with data selection. Additionally, we conduct a case study for the task of blood glucose prediction - one of the most common AI tasks within the field. Through this case study, we provide a benchmark for short-term blood glucose prediction across all 10 publicly available diabetes datasets within the Glucose-ML collection. We show that the same algorithm can have significantly different prediction results when developed/evaluated with different datasets. Findings from this study are then used to inform recommendations for developing robust AI solutions within the diabetes or broader health domain. We provide direct links to each longitudinal diabetes dataset in the Glucose-ML collection and openly provide our code.
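
For orientation, a baseline of the kind such a benchmark would compare against is linear extrapolation from the most recent CGM readings; the sketch below assumes 5-minute sampling, a 30-minute horizon, and made-up glucose values.

```python
# Assumes 5-minute CGM sampling and a 30-minute horizon; values are made up.
import numpy as np

def extrapolate(series, horizon_steps=6):
    """Predict glucose horizon_steps samples ahead from the last local slope."""
    slope = series[-1] - series[-2]
    return series[-1] + horizon_steps * slope

cgm = np.array([110, 114, 119, 125, 128, 131], dtype=float)  # mg/dL readings
pred = extrapolate(cgm)        # forecast 30 minutes ahead
actual = 150.0                 # hypothetical reference value at the horizon
error = abs(pred - actual)
print(f"prediction: {pred:.0f} mg/dL, absolute error: {error:.0f} mg/dL")
```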

[206] Generative AI-Driven High-Fidelity Human Motion Simulation

Hari Iyer, Neel Macwan, Atharva Jitendra Hude, Heejin Jeong, Shenghan Guo

Main category: cs.AI

TL;DR: G-AI-HMS uses generative AI to improve human motion simulation by integrating text-to-text and text-to-motion models, outperforming human-created descriptions in accuracy and alignment.

DetailsMotivation: Existing human motion simulation methods lack fidelity. G-AI-HMS aims to enhance simulation quality for industrial tasks by leveraging AI.

Method: Combines Large Language Models (LLMs) and MotionGPT for task-to-motion translation, validated via computer vision and posture estimation.

Result: AI-enhanced motions showed lower error in spatial accuracy, alignment, and temporal similarity compared to human-created descriptions.

Conclusion: G-AI-HMS significantly improves motion simulation fidelity, reducing joint error and temporal misalignment while maintaining posture accuracy.

Abstract: Human motion simulation (HMS) supports cost-effective evaluation of worker behavior, safety, and productivity in industrial tasks. However, existing methods often suffer from low motion fidelity. This study introduces Generative-AI-Enabled HMS (G-AI-HMS), which integrates text-to-text and text-to-motion models to enhance simulation quality for physical tasks. G-AI-HMS tackles two key challenges: (1) translating task descriptions into motion-aware language using Large Language Models aligned with MotionGPT’s training vocabulary, and (2) validating AI-enhanced motions against real human movements using computer vision. Posture estimation algorithms are applied to real-time videos to extract joint landmarks, and motion similarity metrics are used to compare them with AI-enhanced sequences. In a case study involving eight tasks, the AI-enhanced motions showed lower error than human-created descriptions in most scenarios, performing better in six tasks based on spatial accuracy, four tasks based on alignment after pose normalization, and seven tasks based on overall temporal similarity. Statistical analysis showed that AI-enhanced prompts significantly (p < 0.0001) reduced joint error and temporal misalignment while retaining comparable posture accuracy.
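
A minimal sketch of one spatial metric of the kind described, mean per-joint position error over pose trajectories; the arrays are synthetic stand-ins for pose-estimated versus generated joint landmarks.

```python
# Synthetic stand-ins for pose-estimated (real) vs. generated joint tracks.
import numpy as np

def mpjpe(real, generated):
    """Mean per-joint position error over (frames, joints, xyz) arrays."""
    return float(np.mean(np.linalg.norm(real - generated, axis=-1)))

rng = np.random.default_rng(1)
real = rng.random((100, 17, 3))                       # 100 frames, 17 joints
generated = real + 0.02 * rng.standard_normal(real.shape)
print(f"MPJPE: {mpjpe(real, generated):.4f}")
```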

[207] Automated Interpretation of Non-Destructive Evaluation Contour Maps Using Large Language Models for Bridge Condition Assessment

Viraj Nishesh Darji, Callie C. Liao, Duoduo Liao

Main category: cs.AI

TL;DR: The study explores using LLMs to interpret NDE contour maps for bridge maintenance, showing improved efficiency and accuracy, with ChatGPT-4 and Claude 3.5 Sonnet performing best.

DetailsMotivation: Bridge maintenance relies on NDE data interpretation, which is time-consuming and expertise-dependent. LLMs offer automation potential to streamline this process.

Method: Several LLMs were tested with tailored prompts to interpret NDE contour maps, evaluating their ability to describe images, identify defects, and provide recommendations.

Result: Four of nine LLMs excelled in image descriptions, with ChatGPT-4 and Claude 3.5 Sonnet producing the most effective summaries.

Conclusion: LLMs can enhance bridge inspection workflows by improving efficiency and accuracy, offering a promising tool for infrastructure management.

Abstract: Bridge maintenance and safety are essential for transportation authorities, and Non-Destructive Evaluation (NDE) techniques are critical to assessing structural integrity. However, interpreting NDE data can be time-consuming and requires expertise, potentially delaying decision-making. Recent advancements in Large Language Models (LLMs) offer new ways to automate and improve this analysis. This pilot study introduces a holistic assessment of LLM capabilities for interpreting NDE contour maps and demonstrates the effectiveness of LLMs in providing detailed bridge condition analyses. It establishes a framework for integrating LLMs into bridge inspection workflows, indicating that LLM-assisted analysis can enhance efficiency without compromising accuracy. In this study, several LLMs are explored with prompts specifically designed to enhance the quality of image descriptions, which are applied to interpret five different NDE contour maps obtained through technologies for assessing bridge conditions. Each LLM model is evaluated based on its ability to produce detailed descriptions, identify defects, provide actionable recommendations, and demonstrate overall accuracy. The research indicates that four of the nine models provide better image descriptions, effectively covering a wide range of topics related to the bridge’s condition. The outputs from these four models are summarized using five different LLMs to form a comprehensive overview of the bridge. Notably, LLMs ChatGPT-4 and Claude 3.5 Sonnet generate more effective summaries. The findings suggest that LLMs have the potential to significantly improve efficiency and accuracy. This pilot study presents an innovative approach that leverages LLMs for image captioning in parallel and summarization, enabling faster decision-making in bridge maintenance and enhancing infrastructure management and safety assessments.

[208] CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

Main category: cs.AI

TL;DR: CUDA-L1 is an automated reinforcement learning framework for CUDA optimization, achieving significant speedups across various GPU architectures and uncovering key optimization principles.

DetailsMotivation: The rapid growth in GPU demand, driven by Large Language Models, necessitates automated CUDA optimization due to the low success rates of current models.

Method: CUDA-L1 uses reinforcement learning to optimize CUDA kernels, trained on NVIDIA A100, and tested across multiple GPU architectures.

Result: Achieves average speedups up to x17.7 on A100, with peak speedups of x449, and demonstrates portability across GPUs (e.g., x19.0 on RTX 3090).

Conclusion: CUDA-L1 shows RL can transform LLMs into effective optimizers, extending reasoning to new kernels and promising to enhance GPU efficiency.

Abstract: The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g., R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) Discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) Uncovers fundamental principles of CUDA optimization; 3) Identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends its acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
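
The speedup-based reward can be sketched as a simple timing ratio. The harness below is a CPU toy assuming stand-in callables rather than real CUDA kernels; a real harness would launch kernels, synchronize the device, and discard warm-up runs.

```python
# CPU toy of a speedup reward; not the paper's actual timing harness.
import time

def median_time(run, repeats=10):
    """Median wall-clock time of a callable."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def speedup_reward(baseline_run, candidate_run):
    """Reward = how many times faster the candidate is than the baseline."""
    return median_time(baseline_run) / median_time(candidate_run)

baseline = lambda: sum(i * i for i in range(200_000))   # reference "kernel"
candidate = lambda: sum(i * i for i in range(100_000))  # "optimized" variant
print(f"reward (speedup): {speedup_reward(baseline, candidate):.2f}x")
```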

[209] CorMulT: A Semi-supervised Modality Correlation-aware Multimodal Transformer for Sentiment Analysis

Yangmin Li, Ruiqi Zhu, Wengen Li

Main category: cs.AI

TL;DR: Proposes CorMulT, a two-stage semi-supervised model for multimodal sentiment analysis, addressing weak modality correlations and outperforming existing methods.

DetailsMotivation: Existing methods rely on strong modality correlations and perform poorly with weak correlations, limiting their effectiveness.

Method: Introduces CorMulT with pre-training (modality correlation contrastive learning) and prediction stages, fusing learned correlations with modality representations.

Result: CorMulT outperforms state-of-the-art methods on the CMU-MOSEI dataset.

Conclusion: CorMulT effectively addresses weak modality correlations, enhancing sentiment analysis performance.

Abstract: Multimodal sentiment analysis is an active research area that combines multiple data modalities, e.g., text, image and audio, to analyze human emotions and benefits a variety of applications. Existing multimodal sentiment analysis methods can be classified as modality interaction-based methods, modality transformation-based methods and modality similarity-based methods. However, most of these methods highly rely on the strong correlations between modalities, and cannot fully uncover and utilize the correlations between modalities to enhance sentiment analysis. Therefore, these methods usually perform poorly when identifying the sentiment of multimodal data with weak correlations. To address this issue, we proposed a two-stage semi-supervised model termed Correlation-aware Multimodal Transformer (CorMulT), which consists of a pre-training stage and a prediction stage. At the pre-training stage, a modality correlation contrastive learning module is designed to efficiently learn modality correlation coefficients between different modalities. At the prediction stage, the learned correlation coefficients are fused with modality representations to make the sentiment prediction. According to the experiments on the popular multimodal dataset CMU-MOSEI, CorMulT clearly surpasses state-of-the-art multimodal sentiment analysis methods.
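
A rough sketch of the prediction-stage fusion: a correlation coefficient between modality representations reweights how they are combined. Plain cosine similarity stands in for the contrastively learned coefficient, and the fusion rule below is an assumption for illustration.

```python
# Cosine similarity as a stand-in for the learned correlation coefficient.
import numpy as np

def correlation(z_a, z_b):
    return float(z_a @ z_b / (np.linalg.norm(z_a) * np.linalg.norm(z_b)))

def fuse(z_text, z_audio):
    rho = correlation(z_text, z_audio)
    # Lean on cross-modal averaging when the modalities agree; fall back
    # toward the text representation when their correlation is weak.
    return rho * 0.5 * (z_text + z_audio) + (1 - rho) * z_text

rng = np.random.default_rng(2)
z_t, z_a = rng.random(32), rng.random(32)
print(f"correlation: {correlation(z_t, z_a):.2f}, fused dim: {fuse(z_t, z_a).shape}")
```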

[210] UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception

Chuang Chen, Xiao Sun, Zhi Liu

Main category: cs.AI

TL;DR: UniEmoX is a cross-modal pretraining framework for visual emotion analysis, integrating psychological insights and contrastive learning to improve emotional representation across diverse scenarios.

DetailsMotivation: Existing methods for visual emotion analysis lack generalizability due to emotion ambiguity and data diversity. UniEmoX addresses this by combining psychological theories with modern techniques.

Method: UniEmoX integrates scene-centric and person-centric image features, leverages CLIP’s semantic knowledge, and uses contrastive learning and masked image modeling.

Result: UniEmoX outperforms benchmarks on six datasets, validated by two downstream tasks. The Emo8 dataset supports diverse emotional scenarios.

Conclusion: UniEmoX advances visual emotion analysis by unifying psychology and deep learning, demonstrating strong performance and generalizability.

Abstract: Visual emotion analysis holds significant research value in both computer vision and psychology. However, existing methods for visual emotion analysis suffer from limited generalizability due to the ambiguity of emotion perception and the diversity of data scenarios. To tackle this issue, we introduce UniEmoX, a cross-modal semantic-guided large-scale pretraining framework. Inspired by psychological research emphasizing the inseparability of the emotional exploration process from the interaction between individuals and their environment, UniEmoX integrates scene-centric and person-centric low-level image spatial structural information, aiming to derive more nuanced and discriminative emotional representations. By exploiting the similarity between paired and unpaired image-text samples, UniEmoX distills rich semantic knowledge from the CLIP model to enhance emotional embedding representations more effectively. To the best of our knowledge, this is the first large-scale pretraining framework that integrates psychological theories with contemporary contrastive learning and masked image modeling techniques for emotion analysis across diverse scenarios. Additionally, we develop a visual emotional dataset titled Emo8. Emo8 samples cover a range of domains, including cartoon, natural, realistic, science fiction and advertising cover styles, covering nearly all common emotional scenes. Comprehensive experiments conducted on six benchmark datasets across two downstream tasks validate the effectiveness of UniEmoX. The source code is available at https://github.com/chincharles/u-emo.

[211] BLAST: A Stealthy Backdoor Leverage Attack against Cooperative Multi-Agent Deep Reinforcement Learning based Systems

Jing Fang, Saihao Yan, Xueyu Yin, Yinbo Yu, Chunwei Tian, Jiajia Liu

Main category: cs.AI

TL;DR: BLAST is a novel backdoor attack in c-MADRL, targeting a single agent to compromise the entire team with stealthy spatiotemporal triggers and unilateral reward hacking.

DetailsMotivation: Existing backdoor attacks in c-MADRL lack stealthiness or require additional networks, prompting the need for a more practical and covert method.

Method: BLAST uses adversary spatiotemporal behavior patterns as triggers and hacks the reward function of a single agent to achieve a leverage attack effect.

Result: BLAST achieves high attack success rates with low clean performance variance in tests against 3 c-MADRL algorithms and 2 defenses.

Conclusion: BLAST demonstrates effective and stealthy backdoor attacks in c-MADRL, highlighting vulnerabilities in cooperative systems.

Abstract: Recent studies have shown that cooperative multi-agent deep reinforcement learning (c-MADRL) is under the threat of backdoor attacks. Once a backdoor trigger is observed, it will perform malicious actions leading to failures or malicious goals. However, existing backdoor attacks suffer from several issues, e.g., instant trigger patterns lack stealthiness, the backdoor is trained or activated by an additional network, or all agents are backdoored. To this end, in this paper, we propose a novel backdoor leverage attack against c-MADRL, BLAST, which attacks the entire multi-agent team by embedding the backdoor only in a single agent. Firstly, we introduce adversary spatiotemporal behavior patterns as the backdoor trigger rather than manually injected fixed visual patterns or instant status, and control the period to perform malicious actions. This method can guarantee the stealthiness and practicality of BLAST. Secondly, we hack the original reward function of the backdoor agent via unilateral guidance to inject BLAST, so as to achieve the leverage attack effect that can pry open the entire multi-agent system via a single backdoor agent. We evaluate our BLAST against 3 classic c-MADRL algorithms (VDN, QMIX, and MAPPO) in 2 popular c-MADRL environments (SMAC and Pursuit), and 2 existing defense mechanisms. The experimental results demonstrate that BLAST can achieve a high attack success rate while maintaining a low clean performance variance rate.

[212] To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization

Haozhe Wang, Long Li, Chao Qu, Fengming Zhu, Weidi Xu, Wei Chu, Fangzhen Lin

Main category: cs.AI

TL;DR: The paper introduces an Expectation-Maximization (EM) framework to improve autonomous code integration in language models, addressing limitations of rigid hybrid frameworks and inefficient RL exploration.

DetailsMotivation: Existing hybrid frameworks lack metacognitive awareness, relying on rigid instructions for code integration, which limits adaptability as models evolve.

Method: Proposes an EM framework combining structured exploration (E-step) with off-policy RL optimization (M-step) to enhance autonomous tool-use decisions.

Result: The 7B model achieves over 11% improvement on MATH500 and 9.4% on AIME, demonstrating superior performance through better exploration.

Conclusion: The EM framework effectively addresses the inefficiency of RL in learning autonomous code integration, enabling dynamic adaptation and improved problem-solving.

Abstract: Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness – the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
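
The E-step/M-step alternation can be illustrated with a deliberately tiny toy: a single policy parameter for "invoke code or not", explored in the E-step and re-estimated from reward-weighted samples in the M-step. The oracle reward and problem set are fabricated, and simple reweighting stands in for the paper's off-policy RL optimization.

```python
# Toy EM loop over a binary tool-use decision; purely illustrative.
import random

random.seed(0)
problems = {"heavy arithmetic": True, "big product": True,
            "long multiplication": True, "word problem": False}

def reward(needs_code, use_code):
    # Assumed oracle: invoking code pays off exactly when the problem needs it.
    return 1.0 if use_code == needs_code else 0.2

p_use_code = 0.5  # policy parameter: probability of invoking code execution
for _ in range(20):
    # E-step: structured exploration of tool-use decisions under the policy.
    samples = [(needs, random.random() < p_use_code)
               for needs in problems.values() for _ in range(16)]
    # M-step: reward-weighted re-estimation (stand-in for off-policy RL).
    w_code = sum(reward(n, u) for n, u in samples if u)
    w_all = sum(reward(n, u) for n, u in samples)
    p_use_code = w_code / w_all
print(f"learned P(use code) = {p_use_code:.2f}")  # drifts upward here, since
# three of the four toy problems benefit from code execution
```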

[213] From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios

Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz

Main category: cs.AI

TL;DR: The paper proposes using LLMs with structured parsing and prompt engineering to automate the evaluation and generation of safety-critical driving scenarios, reducing reliance on handcrafted methods.

DetailsMotivation: Current scenario-based testing for autonomous vehicles relies on handcrafted scenarios, which are labor-intensive and lack scalability.

Method: Combines LLMs with structured scenario parsing and prompt engineering, introducing Cartesian and Ego-centric prompts for evaluation and an adversarial generation module for creating critical scenarios.

Result: The evaluation module detects collisions and assesses safety, while the generation module identifies high-risk agents and creates realistic scenarios.

Conclusion: LLMs with domain-informed prompting can effectively evaluate and generate safety-critical scenarios, reducing dependence on handcrafted metrics.

Abstract: Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.

[214] Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che

Main category: cs.AI

TL;DR: This survey provides a unified perspective on Long Chain-of-Thought (Long CoT) in reasoning with large language models, distinguishing it from Short CoT, exploring its characteristics, investigating key phenomena, and identifying future research directions.

DetailsMotivation: Despite advancements in reasoning with large language models, a comprehensive survey on Long CoT is lacking, hindering understanding of its distinctions from Short CoT and complicating debates on issues like 'overthinking' and 'inference-time scaling.'

Method: The survey introduces a taxonomy for reasoning paradigms, explores Long CoT’s characteristics (deep reasoning, extensive exploration, feasible reflection), and investigates key phenomena like overthinking and inference-time scaling.

Result: Long CoT enhances reasoning abilities, enabling models to handle complex tasks more efficiently and coherently than Short CoT. Key phenomena and research gaps are identified.

Conclusion: The survey aims to inspire future research in logical reasoning, highlighting directions like multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks.

Abstract: Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like “overthinking” and “inference-time scaling.” This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.

[215] What the F*ck Is Artificial General Intelligence?

Michael Timothy Bennett

Main category: cs.AI

TL;DR: The paper provides an overview of AGI, comparing definitions and tools, and discusses meta-approaches like scale-maxing, concluding AGI will combine tools and methods.

DetailsMotivation: To clarify the meaning of AGI and settle debates through scientific investigation by comparing definitions and foundational tools.

Method: Compares definitions of intelligence, discusses foundational tools (search and approximation), and analyzes meta-approaches (scale-maxing, simp-maxing, w-maxing) with examples like AIXI and language models.

Result: Scale-maxed approximation dominates, but AGI will require a fusion of tools and meta-approaches, with current bottlenecks being sample and energy efficiency.

Conclusion: AGI’s future lies in combining diverse tools and approaches, with hardware improvements enabling progress, though efficiency remains a challenge.

Abstract: Artificial general intelligence (AGI) is an established field of research. Yet some have questioned if the term still has meaning. AGI has been subject to so much hype and speculation it has become something of a Rorschach test. Melanie Mitchell argues the debate will only be settled through long term, scientific investigation. To that end here is a short, accessible and provocative overview of AGI. I compare definitions of intelligence, settling on intelligence in terms of adaptation and AGI as an artificial scientist. Taking my cue from Sutton’s Bitter Lesson I describe two foundational tools used to build adaptive systems: search and approximation. I compare pros, cons, hybrids and architectures like o3, AlphaGo, AERA, NARS and Hyperon. I then discuss overall meta-approaches to making systems behave more intelligently. I divide them into scale-maxing, simp-maxing, w-maxing based on the Bitter Lesson, Ockham’s and Bennett’s Razors. These maximise resources, simplicity of form, and the weakness of constraints on functionality. I discuss examples including AIXI, the free energy principle and The Embiggening of language models. I conclude that though scale-maxed approximation dominates, AGI will be a fusion of tools and meta-approaches. The Embiggening was enabled by improvements in hardware. Now the bottlenecks are sample and energy efficiency.

[216] SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator

Xueyang Zhou, Weidong Wang, Lin Lu, Jiawen Shi, Guiyao Tie, Yongtian Xu, Lixing Chen, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun

Main category: cs.AI

TL;DR: AutoSafe is a framework for enhancing the safety of LLM-based agents through automated synthetic data generation, addressing risks from dynamic interactions and tool usage.

DetailsMotivation: Ensuring safety in LLM-based agents is challenging due to complex risks from user interactions and tool usage.

Method: AutoSafe uses an open threat model (OTS) and an automated pipeline to simulate unsafe behaviors and generate safe responses, creating a safety training dataset.

Result: AutoSafe improves safety scores by 45% on average and achieves a 28.91% boost on real-world tasks.

Conclusion: AutoSafe advances the safety and scalability of LLM-based agents for real-world deployment.

Abstract: Large Language Model (LLM)-based agents are increasingly deployed in real-world applications such as “digital assistants, autonomous customer service, and decision-support systems”, where their ability to “interact in multi-turn, tool-augmented environments” makes them indispensable. However, ensuring the safety of these agents remains a significant challenge due to the diverse and complex risks arising from dynamic user interactions, external tool usage, and the potential for unintended harmful behaviors. To address this critical issue, we propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. Concretely, 1) we introduce an open and extensible threat model, OTS, which formalizes how unsafe behaviors emerge from the interplay of user instructions, interaction contexts, and agent actions. This enables precise modeling of safety risks across diverse scenarios. 2) we develop a fully automated data generation pipeline that simulates unsafe user behaviors, applies self-reflective reasoning to generate safe responses, and constructs a large-scale, diverse, and high-quality safety training dataset-eliminating the need for hazardous real-world data collection. To evaluate the effectiveness of our framework, we design comprehensive experiments on both synthetic and real-world safety benchmarks. Results demonstrate that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks, validating the generalization ability of our learned safety strategies. These results highlight the practical advancement and scalability of AutoSafe in building safer LLM-based agents for real-world deployment. We have released the project page at https://auto-safe.github.io/.

[217] Strategic Reflectivism In Intelligent Systems

Nick Byrd

Main category: cs.AI

TL;DR: The paper synthesizes historical debates on rationality and dual-process theories to propose Strategic Reflectivism, advocating pragmatic switching between intuitive and reflective thinking for intelligent systems.

DetailsMotivation: To bridge historical debates on rationality with modern applications in AI and cognitive science, emphasizing the importance of balancing intuitive and reflective thinking.

Method: Combines historical analysis of rationality theories with recent experimental results from human and machine cognition.

Result: Proposes Strategic Reflectivism, a framework for intelligent systems to pragmatically switch between intuitive and reflective inference.

Conclusion: Strategic Reflectivism offers actionable insights for designing intelligent systems, transcending traditional indicators of reflection and applying to both humans and AI.

Abstract: By late 20th century, the rationality wars had launched debates about the nature and norms of intuitive and reflective thinking. Those debates drew from mid-20th century ideas such as bounded rationality, which challenged more idealized notions of rationality observed since the 19th century. Now that 21st century cognitive scientists are applying the resulting dual process theories to artificial intelligence, it is time to dust off some lessons from this history. So this paper synthesizes old ideas with recent results from experiments on humans and machines. The result is Strategic Reflectivism, the position that one key to intelligent systems (human or artificial) is pragmatic switching between intuitive and reflective inference to optimally fulfill competing goals. Strategic Reflectivism builds on American Pragmatism, transcends superficial indicators of reflective thinking such as model size or chains of thought, applies to both individual and collective intelligence systems (including human-AI teams), and becomes increasingly actionable as we learn more about the value of intuition and reflection.

[218] The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Main category: cs.AI

TL;DR: The paper investigates the capabilities and limitations of Large Reasoning Models (LRMs) through controlled puzzle environments, revealing their scaling limits, accuracy collapse at high complexities, and inconsistent reasoning patterns.

DetailsMotivation: To address the insufficient understanding of LRMs' fundamental capabilities, scaling properties, and limitations, especially in reasoning traces beyond final answer accuracy.

Method: Uses controllable puzzle environments to manipulate complexity while analyzing both final answers and internal reasoning traces of LRMs.

Result: LRMs show accuracy collapse beyond certain complexities, exhibit counterintuitive scaling limits, and perform inconsistently across task complexities compared to standard LLMs.

Conclusion: LRMs have limitations in exact computation and inconsistent reasoning, raising questions about their capabilities despite advantages in medium-complexity tasks.

Abstract: Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrates advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.

[219] Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?

Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar

Main category: cs.AI

TL;DR: The paper explores uncertainty quantification in reasoning models, finding they are often overconfident, especially with deeper reasoning, and proposes introspective UQ to improve calibration.

DetailsMotivation: To address the issue of reasoning models generating incorrect but confident responses (hallucinations), ensuring safe deployment in real-world applications.

Method: Introduces introspective uncertainty quantification (UQ) to evaluate model calibration, testing three key questions about calibration and reasoning depth.

Result: Findings show reasoning models are typically overconfident, worsen with deeper reasoning, and can improve calibration through introspection, though not uniformly.

Conclusion: Highlights the need for better UQ benchmarks and methods to enhance reasoning model calibration.

Abstract: Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
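
A standard way to quantify the miscalibration reported here is expected calibration error (ECE) over self-verbalized confidences; the sketch below uses synthetic confidence/correctness pairs rather than any of the paper's evaluation data.

```python
# ECE sketch over synthetic confidence/correctness pairs.
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples in each bin."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap
    return total

# Overconfident pattern like the one reported: high stated confidence,
# mixed actual accuracy.
conf = [0.90, 0.95, 0.85, 0.90, 0.99, 0.60]
hit = [1, 0, 1, 0, 1, 1]
print(f"ECE = {ece(conf, hit):.3f}")
```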

[220] GATSim: Urban Mobility Simulation with Generative Agents

Qi Liu, Can Li, Wanjing Ma

Main category: cs.AI

TL;DR: GATSim introduces generative agents for urban mobility simulations, enhancing realism with adaptive behaviors, memory systems, and learning, outperforming traditional rule-based methods.

DetailsMotivation: Traditional rule-based urban mobility simulations lack adaptability and behavioral diversity, prompting the use of AI advancements for more human-like agents.

Method: GATSim integrates an urban mobility foundation model with agent cognitive systems, hierarchical memory, and adaptive planning mechanisms.

Result: Generative agents produce believable travel behaviors, matching human annotators with 92% posterior probability and realistic traffic patterns.

Conclusion: GATSim demonstrates the potential of generative agents for realistic urban mobility simulations, offering a scalable and adaptable framework.

Abstract: Traditional agent-based urban mobility simulations often rely on rigid rule-based systems that struggle to capture the complexity, adaptability, and behavioral diversity inherent in human travel decision making. Recent advancements in large language models and AI agent technologies present new opportunities to develop agents with enhanced reasoning capabilities, persistent memory, and adaptive learning. We introduce GATSim (Generative-Agent Transport Simulation), a novel framework that leverages these advancements to simulate urban mobility using generative agents with rich, human-like behaviors. Unlike conventional approaches, GATSim agents are characterized by diverse socioeconomic profiles, individual lifestyles, and evolving preferences shaped through psychologically informed memory systems, tool usage, and lifelong learning. The main contributions of this work are: (1) a comprehensive architecture that integrates an urban mobility foundation model with agent cognitive systems and a transport simulation environment; (2) a hierarchical memory designed for efficient retrieval of contextually relevant information, incorporating spatial and temporal associations, keyword matching, and semantic relevance; (3) innovative planning and reactive mechanisms for modeling adaptive mobility behaviors which integrate a multi-scale reflection process to transform specific travel experiences into generalized behavioral insights. We implement a prototype system and conduct systematic validation, demonstrating that generative agents produce believable and coherent travel behaviors. Experimental results indicate that generative agents perform at least as well as human annotators with 92% posterior probability, while naturally producing realistic macroscopic traffic patterns. The code for the prototype implementation is publicly available at https://github.com/qiliuchn/gatsim.
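
The hierarchical memory's retrieval could score candidate memories along the axes mentioned; a minimal sketch, assuming exponential recency decay, keyword overlap, and a dot-product semantic term, with made-up weights and unit-vector embeddings.

```python
# Weights, the recency half-life, and the tiny embeddings are assumptions.
import math

def retrieval_score(memory, query_keywords, query_vec, now,
                    weights=(0.3, 0.3, 0.4), half_life=24.0):
    recency = math.exp(-math.log(2) * (now - memory["time"]) / half_life)
    keyword = (len(set(memory["keywords"]) & set(query_keywords))
               / max(len(query_keywords), 1))
    semantic = sum(a * b for a, b in zip(memory["vec"], query_vec))
    return weights[0] * recency + weights[1] * keyword + weights[2] * semantic

memory = {"time": 10.0, "keywords": ["bus", "delay"], "vec": [0.6, 0.8]}
print(f"{retrieval_score(memory, ['bus', 'crowded'], [0.8, 0.6], now=20.0):.3f}")
```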

[221] Multi-Agent LLMs as Ethics Advocates for AI-Based Systems

Asma Yamani, Malak Baslyman, Moataz Ahmed

Main category: cs.AI

TL;DR: The paper proposes a framework using an ethics advocate agent in a multi-agent LLM setting to automate ethics requirement drafts, showing effectiveness but needing human oversight.

DetailsMotivation: Ethics requirements are often neglected in requirements elicitation due to time and resource constraints, despite their importance for ethically aligned systems.

Method: Introduces an ethics advocate agent in a multi-agent LLM to critique and generate ethics requirements from system descriptions, evaluated via two case studies.

Result: The framework captures most ethics requirements from interviews and adds new ones, but reliability issues necessitate human feedback.

Conclusion: The framework aids in integrating ethics into requirements engineering, though human involvement remains crucial for reliability.

Abstract: Incorporating ethics into the requirement elicitation process is essential for creating ethically aligned systems. Although manual elicitation of ethics requirements is effective, it requires diverse input from multiple stakeholders, which can be challenging due to time and resource constraints. Moreover, it is often given a low priority in the requirements elicitation process. This study proposes a framework for generating ethics requirements drafts by introducing an ethics advocate agent in a multi-agent LLM setting. This agent critiques and provides input on ethical issues based on the system description. The proposed framework is evaluated through two case studies from different contexts, demonstrating that it captures the majority of ethics requirements identified by researchers during 30-minute interviews and introduces several additional relevant requirements. However, it also highlights reliability issues in generating ethics requirements, emphasizing the need for human feedback in this sensitive domain. We believe this work can facilitate the broader adoption of ethics in the requirements engineering process, ultimately leading to more ethically aligned products.

[222] Instance space analysis of the capacitated vehicle routing problem

Alessandra M. M. M. Gouvêa, Nuno Paulos, Eduardo Uchoa, Mariá C. V. Nascimento

Main category: cs.AI

TL;DR: The paper introduces Instance Space Analysis (ISA) to study how CVRP instance characteristics affect metaheuristic performance, using DIMACS data and dimensionality reduction.

DetailsMotivation: To understand the nuanced relationships between CVRP instance characteristics and metaheuristic performance.

Method: Combines ISA with DIMACS data, using PRELIM, SIFTED, and PILOT stages for dimensionality reduction and machine learning to project instance space.

Result: Identified 23 relevant instance characteristics and created a 2D projection of the instance space, providing a projection matrix for future analysis.

Conclusion: ISA offers a new perspective and tool for CVRP research, enabling easier incorporation of new instances and advanced instance analysis.

Abstract: This paper seeks to advance CVRP research by addressing the challenge of understanding the nuanced relationships between instance characteristics and metaheuristic (MH) performance. We present Instance Space Analysis (ISA) as a valuable tool that allows for a new perspective on the field. By combining the ISA methodology with a dataset from the DIMACS 12th Implementation Challenge on Vehicle Routing, our research enabled the identification of 23 relevant instance characteristics. Our use of the PRELIM, SIFTED, and PILOT stages, which employ dimensionality reduction and machine learning methods, allowed us to create a two-dimensional projection of the instance space to understand how the structure of instances affects the behavior of MHs. A key contribution of our work is that we provide a projection matrix, which makes it straightforward to incorporate new instances into this analysis and allows for a new method for instance analysis in the CVRP field.
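
The practical payoff of the released projection matrix is that placing a new instance in the 2-D space is a single linear map. A minimal sketch of that step, with a random stand-in matrix and hypothetical standardization constants in place of the published values:

```python
import numpy as np

# Stand-in for the published 2x23 projection matrix: each column gives
# the 2-D coordinates contributed by one instance feature.
A = np.random.randn(2, 23)

def project_instance(features, mean, std):
    """Standardize a 23-dim CVRP feature vector, then map it into the
    2-D instance space via the linear projection z = A @ f."""
    f = (np.asarray(features) - mean) / std
    return A @ f

z = project_instance(np.random.rand(23), mean=0.5, std=0.2)
print(z)  # 2-D coordinates of the new instance
```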

[223] Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light

Mani Hamidi, Terrence W. Deacon

Main category: cs.AI

TL;DR: The paper critiques three core tenets of RL, proposing an evolutionary-inspired framework to rethink agency, learning objectives, and the reward hypothesis, with implications for biological learning and practical RL applications.

DetailsMotivation: To address conceptual limitations in RL by drawing parallels with evolutionary theory, aiming to refine RL's theoretical foundations and applicability to biological systems.

Method: The authors revisit each RL dogma using evolutionary insights, argue for evolutionary dynamics in brains, and integrate origins-of-life theory for agency.

Result: A framework is proposed to rethink RL assumptions, emphasizing evolutionary adaptation, multi-objective rewards, and thermodynamic foundations for agency.

Conclusion: Evolutionary theory enriches RL but alone cannot resolve agency; integrating origins-of-life thermodynamics offers a promising path forward.

Abstract: Three core tenets of reinforcement learning (RL)–concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis–have been highlighted as key targets for conceptual revision, with major implications for theory and application. We propose a framework, inspired by open-ended evolutionary theory, to reconsider these three “dogmas.” We revisit each assumption and address related concerns raised alongside them. To make our arguments relevant to RL as a model of biological learning, we first establish that evolutionary dynamics can plausibly operate within living brains over an individual’s lifetime, and are not confined to cross-generational processes. We begin by revisiting the second dogma, drawing on evolutionary insights to enrich the “adaptation-rather-than-search” view of learning. We then address the third dogma regarding the limits of the reward hypothesis, using analogies from evolutionary fitness to illuminate the scalar reward vs. multi-objective debate. After discussing practical implications for exploration in RL, we turn to the first–and arguably most fundamental–issue: the absence of a formal account of agency. We argue that unlike the other two problems, the evolutionary paradigm alone cannot resolve the agency question, though it gestures in a productive direction. We advocate integrating ideas from origins-of-life theory, where the thermodynamics of sustenance and replication offer promising foundations for understanding agency and resource-constrained reinforcement learning in biological systems.

[224] From Roots to Rewards: Dynamic Tree Reasoning with RL

Ahmed Bahloul, Simon Malberg

Main category: cs.AI

TL;DR: A dynamic reinforcement learning framework enhances tree-structured reasoning by adapting ProbTree’s static approach, improving efficiency and solution quality.

DetailsMotivation: Address limitations of static tree-structured reasoning (ProbTree) in language models, such as lack of dynamic adaptation and computational inefficiency.

Method: Introduces a dynamic reinforcement learning framework for adaptive tree construction and action selection (decomposition, retrieval, aggregation).

Result: Improves solution quality and computational efficiency through selective expansion and resource allocation.

Conclusion: Establishes a flexible, reliable paradigm for tree-structured reasoning in real-world question answering.

Abstract: Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) (Cao et al., 2023) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.
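
As a rough illustration of the adaptive process, the sketch below shows one plausible confidence-gated, epsilon-greedy choice among the three node-level actions; the gating rule, threshold, and names are illustrative, not the paper's learned policy.

```python
import random

ACTIONS = ["decompose", "retrieve", "aggregate"]

def select_action(q_values, node_confidence, eps=0.1, threshold=0.9):
    """Pick an action for one tree node. Sufficiently confident nodes
    are answered directly instead of being expanded; otherwise a learned
    value estimate (here a dict of q-values) drives epsilon-greedy
    selection among the ProbTree-style actions."""
    if node_confidence >= threshold:
        return "answer"                                # skip expansion
    if random.random() < eps:
        return random.choice(ACTIONS)                  # explore
    return max(ACTIONS, key=lambda a: q_values[a])     # exploit
```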

cs.SD

[225] Temporal Adaptation of Pre-trained Foundation Models for Music Structure Analysis

Yixiao Zhang, Haonan Chen, Ju-Chiang Wang, Jitong Chen

Main category: cs.SD

TL;DR: The paper introduces a temporal adaptation method for fine-tuning music foundation models to improve music structure analysis (MSA), addressing limitations of high temporal resolution and short audio windows.

DetailsMotivation: Current music foundation models for MSA are inefficient and biased due to high temporal resolution and short audio windows, limiting their application to long-form audio.

Method: The proposed method incorporates audio window extension and low-resolution adaptation to enable efficient full-length song analysis in a single forward pass.

Result: Experiments on Harmonix Set and RWC-Pop datasets show improved boundary detection and structural function prediction without compromising memory usage or inference speed.

Conclusion: The temporal adaptation approach effectively enhances MSA performance while maintaining efficiency, making it suitable for long-form audio analysis.

Abstract: Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural function prediction, while maintaining comparable memory usage and inference speed.

[226] Controlling the Parameterized Multi-channel Wiener Filter using a tiny neural network

Eric Grinstein, Ashutosh Pandey, Cole Li, Shanmukha Srinivas, Juan Azcarreta, Jacob Donley, Sanha Lee, Ali Aroudi, Cagdas Bilen

Main category: cs.SD

TL;DR: NeuralPMWF combines PMWF with a neural network for balanced noise suppression and low speech distortion in speech enhancement.

DetailsMotivation: To balance noise suppression and speech distortion in multi-channel SE, addressing limitations of neural networks and classical methods.

Method: Uses a neural network to control the PMWF beamformer, creating a low-complexity system.

Result: Achieves better perceptual and objective SE compared to baselines with similar compute.

Conclusion: NeuralPMWF effectively balances noise reduction and speech distortion, outperforming existing methods.

Abstract: Noise suppression and speech distortion are two important aspects to be balanced when designing multi-channel Speech Enhancement (SE) algorithms. Although neural network models have achieved state-of-the-art noise suppression, their non-linear operations often introduce high speech distortion. Conversely, classical signal processing algorithms such as the Parameterized Multi-channel Wiener Filter (PMWF) beamformer offer explicit mechanisms for controlling the suppression/distortion trade-off. In this work, we present NeuralPMWF, a system where the PMWF is entirely controlled using a low-latency, low-compute neural network, resulting in a low-complexity system offering high noise reduction and low speech distortion. Experimental results show that our proposed approach results in significantly better perceptual and objective speech enhancement in comparison to several competitive baselines using similar computational resources.
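
For reference, the PMWF itself has a compact closed form per frequency bin in one standard formulation: the trade-off parameter beta recovers the MVDR beamformer at beta = 0 and the multi-channel Wiener filter at beta = 1. A minimal NumPy sketch with oracle covariances; in the paper's system a small neural network supplies the filter's inputs instead.

```python
import numpy as np

def pmwf_weights(phi_s, phi_n, beta=1.0, ref=0):
    """Parameterized multi-channel Wiener filter for one frequency bin.
    phi_s: (M, M) speech covariance; phi_n: (M, M) noise covariance.
    beta trades noise suppression against speech distortion:
    beta = 0 gives MVDR, beta = 1 gives the multi-channel Wiener filter."""
    num = np.linalg.solve(phi_n, phi_s)     # Phi_n^{-1} Phi_s
    lam = np.trace(num).real                # generalized SNR term
    return num[:, ref] / (beta + lam)       # filter for the reference mic
```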

[227] OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe

Main category: cs.SD

TL;DR: OpenBEATs extends BEATs with multi-domain audio pre-training, achieving state-of-the-art performance across diverse audio tasks while being more efficient than larger models.

DetailsMotivation: To address the limited exploration of masked token prediction for general audio understanding and the lack of open-source pre-training code for BEATs.

Method: Extends BEATs via multi-domain audio pre-training and evaluates across six task types, 25 datasets, and three audio domains.

Result: State-of-the-art performance on bioacoustics, environmental sound, and reasoning datasets, outperforming larger models.

Conclusion: Multi-domain datasets and masked token prediction are effective for general-purpose audio representations; OpenBEATs promotes reproducibility with released resources.

Abstract: Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application for general audio understanding remains underexplored, with BEATs being the only notable example. BEATs has seen limited modifications due to the absence of open-source pre-training code. Furthermore, BEATs was trained only on AudioSet, restricting its broader downstream applicability. To address these gaps, we present OpenBEATs, an open-source framework that extends BEATs via multi-domain audio pre-training. We conduct comprehensive evaluations across six types of tasks, twenty-five datasets, and three audio domains, including audio reasoning tasks such as audio question answering, entailment, and captioning. OpenBEATs achieves state-of-the-art performance on six bioacoustics datasets, two environmental sound datasets and five reasoning datasets, performing better than models exceeding a billion parameters at one-fourth their parameter size. These results demonstrate the effectiveness of multi-domain datasets and the masked token prediction task for learning general-purpose audio representations. To promote further research and reproducibility, we release all pre-training and evaluation code, pretrained and fine-tuned checkpoints, and training logs at https://shikhar-s.github.io/OpenBEATs
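
The pre-training objective itself is compact. A minimal PyTorch sketch of BEATs-style masked token prediction on discrete audio tokens; the encoder, tokenizer targets, and simple zero-masking are simplifying assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def masked_token_loss(encoder, tokens, feats, mask_ratio=0.75):
    """Masked token prediction on discrete audio tokens.
    feats:  (B, T, D) acoustic features fed to the encoder
    tokens: (B, T)    long tensor of target tokens from an audio tokenizer
    The encoder predicts token logits; loss is taken at masked positions."""
    B, T, _ = feats.shape
    mask = torch.rand(B, T) < mask_ratio      # positions to mask
    masked_feats = feats.clone()
    masked_feats[mask] = 0.0                  # simple zero-masking
    logits = encoder(masked_feats)            # (B, T, vocab_size)
    return F.cross_entropy(logits[mask], tokens[mask])
```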

[228] Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Yixiao Zhang, Yukara Ikemiya, Woosung Choi, Naoki Murata, Marco A. Martínez-Ramírez, Liwei Lin, Gus Xia, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon

Main category: cs.SD

TL;DR: Instruct-MusicGen finetunes a pretrained MusicGen model for efficient text-to-music editing, outperforming baselines with minimal added parameters and training steps.

DetailsMotivation: Existing methods for text-to-music editing are resource-intensive or imprecise, necessitating a more efficient and accurate solution.

Method: Modifies MusicGen with text and audio fusion modules to process instructions and audio inputs concurrently for precise editing.

Result: Achieves superior performance with only 8% new parameters and 5K training steps, matching task-specific models.

Conclusion: Instruct-MusicGen enhances efficiency and broadens applicability in dynamic music production.

Abstract: Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths and address these limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen only introduces 8% new parameters to the original MusicGen model and only trains for 5K steps, yet it achieves superior performance across all tasks compared to existing baselines, and demonstrates performance comparable to the models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.

[229] Source Separation by Flow Matching

Robin Scheibler, John R. Hershey, Arnaud Doucet, Henry Li

Main category: cs.SD

TL;DR: FLOSS (FLOw matching for Source Separation) is a method for single-channel audio source separation using flow matching and equivariant neural networks to reconstruct multiple sources from a mixture.

DetailsMotivation: The problem of separating multiple audio sources from a single-channel mixture is ill-posed, requiring innovative methods to ensure accurate reconstruction.

Method: FLOSS uses flow matching to learn a transformation between the mixture and source distributions, augmented with artificial noise and an equivariant neural network to handle source permutations.

Result: The method is demonstrated to effectively separate overlapping speech sources.

Conclusion: FLOSS provides a robust framework for audio source separation by leveraging flow matching and equivariant architectures.

Abstract: We consider the problem of single-channel audio source separation with the goal of reconstructing $K$ sources from their mixture. We address this ill-posed problem with FLOSS (FLOw matching for Source Separation), a constrained generation method based on flow matching, ensuring strict mixture consistency. Flow matching is a general methodology that, when given samples from two probability distributions defined on the same space, learns an ordinary differential equation to output a sample from one of the distributions when provided with a sample from the other. In our context, we have access to samples from the joint distribution of $K$ sources, and so to the corresponding samples from the lower-dimensional distribution of their mixture. To apply flow matching, we augment these mixture samples with artificial noise components to match the dimensionality of the $K$ source distribution. Additionally, as any permutation of the sources yields the same mixture, we adopt an equivariant formulation of flow matching which relies on a neural network architecture that is equivariant by design. We demonstrate the performance of the method for the separation of overlapping speech.
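
For readers unfamiliar with flow matching, the core training step is short. The sketch below shows generic conditional flow matching with straight-line interpolation paths; FLOSS's mixture-consistency constraint and permutation-equivariant architecture are omitted here.

```python
import torch

def flow_matching_loss(v_net, x0, x1):
    """One conditional flow-matching step for (B, D) inputs: interpolate
    between a source sample x0 (e.g., noise-augmented mixture) and a data
    sample x1, then regress the network's velocity field onto the
    constant straight-line target x1 - x0."""
    t = torch.rand(x0.shape[0], 1)        # one time per batch element
    xt = (1 - t) * x0 + t * x1            # point on the interpolation path
    target = x1 - x0                      # velocity of the straight path
    return ((v_net(xt, t) - target) ** 2).mean()
```

At sampling time, integrating the learned ODE from x0 toward t = 1 produces a sample from the target distribution.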

[230] SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

Main category: cs.SD

TL;DR: SpecMaskFoley improves ControlNet-based foley synthesis by aligning video features with a pretrained audio model, outperforming from-scratch methods.

DetailsMotivation: To bridge the performance gap between ControlNet-based and from-scratch foley synthesis models by leveraging pretrained audio models and video features.

Method: Uses SpecMaskGIT with ControlNet and a frequency-aware temporal feature aligner to synchronize video and audio without complex conditioning.

Result: Outperforms from-scratch baselines on a foley synthesis benchmark.

Conclusion: SpecMaskFoley advances ControlNet-based foley synthesis, offering a simpler and more effective approach.

Abstract: Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for the complicated conditioning mechanisms widely used in prior art. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley can even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models. Demo page: https://zzaudio.github.io/SpecMaskFoley_Demo/

[231] MuteSwap: Visual-informed Silent Video Identity Conversion

Yifan Liu, Yu Fang, Zhouhan Lin

Main category: cs.SD

TL;DR: MuteSwap enables voice conversion from silent videos using visual inputs, outperforming audio-dependent methods in noisy conditions.

DetailsMotivation: Address the challenge of voice conversion when clean audio is unavailable, such as in silent videos or noisy environments.

Method: Introduces MuteSwap, a framework using contrastive learning to align cross-modality identities and minimize mutual information for feature separation.

Result: Achieves impressive performance in speech synthesis and identity conversion, especially in noisy conditions.

Conclusion: Demonstrates the effectiveness of MuteSwap and feasibility of Silent Face-based Voice Conversion (SFVC).

Abstract: Conventional voice conversion modifies voice characteristics from a source speaker to a target speaker, relying on audio input from both sides. However, this process becomes infeasible when clean audio is unavailable, such as in silent videos or noisy environments. In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs: given images of a target speaker and a silent video of a source speaker containing lip motion, SFVC generates speech that matches the identity of the target speaker while preserving the speech content of the source silent video. As this task requires generating intelligible speech and converting identity using only visual cues, it is particularly challenging. To address this, we introduce MuteSwap, a novel framework that employs contrastive learning to align cross-modality identities and minimize mutual information to separate shared visual features. Experimental results show that MuteSwap achieves impressive performance in both speech synthesis and identity conversion, especially under noisy conditions where methods dependent on audio input fail to produce intelligible results, demonstrating both the effectiveness of our training approach and the feasibility of SFVC.

[232] WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling

Qihui Yang, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack

Main category: cs.SD

TL;DR: WildFX introduces a Docker-based pipeline for generating multi-track audio mixing datasets with professional DSP workflows, bridging AI research and practical DSP demands.

DetailsMotivation: AI struggles to replicate nuanced DSP workflows and parameter interactions in professional audio processing, leading to inferior performance compared to real-world tools.

Method: WildFX uses a professional DAW backend to containerize audio mixing datasets, supporting cross-platform plugins (VST/VST3/LV2/CLAP) and enabling complex signal flows like sidechains.

Result: Experiments show WildFX’s ability to estimate mixing graphs and plugin parameters, validating its practical utility.

Conclusion: WildFX successfully bridges AI research with professional DSP needs, offering a scalable and efficient solution for audio effect modeling.

Abstract: Despite rapid progress in end-to-end AI music generation, AI-driven modeling of professional Digital Signal Processing (DSP) workflows remains challenging. In particular, while there is growing interest in neural black-box modeling of audio effect graphs (e.g. reverb, compression, equalization), AI-based approaches struggle to replicate the nuanced signal flow and parameter interactions used in professional workflows. Existing differentiable plugin approaches often diverge from real-world tools, exhibiting inferior performance relative to simplified neural controllers under equivalent computational constraints. We introduce WildFX, a pipeline containerized with Docker for generating multi-track audio mixing datasets with rich effect graphs, powered by a professional Digital Audio Workstation (DAW) backend. WildFX supports seamless integration of cross-platform commercial plugins, or any plugins in the wild, in VST/VST3/LV2/CLAP formats, enabling structural complexity (e.g., sidechains, crossovers) and achieving efficient parallelized processing. A minimalist metadata interface simplifies project/plugin configuration. Experiments demonstrate the pipeline’s validity through blind estimation of mixing graphs and plugin/gain parameters, and show its ability to bridge AI research with practical DSP demands. The code is available on: https://github.com/IsaacYQH/WildFX.

cs.LG

[233] Physical models realizing the transformer architecture of large language models

Zeqian Chen

Main category: cs.LG

TL;DR: The paper explores the transformer architecture from a physical perspective, modeling it as an open quantum system in Fock space to better understand its theoretical foundations.

DetailsMotivation: There is a gap in theoretical understanding of why the transformer architecture works, prompting a physical interpretation.

Method: Constructs physical models in Fock space over the Hilbert space of tokens, treating transformers as open quantum systems.

Result: Develops physical models that underlie the transformer architecture for large language models.

Conclusion: The study provides a physical framework to explain the transformer’s effectiveness, bridging theoretical gaps.

Abstract: The introduction of the transformer architecture in 2017 (Vaswani et al., 2017) marked the most striking advancement in natural language processing. The transformer is a model architecture relying entirely on an attention mechanism to draw global dependencies between input and output. However, we believe there is a gap in our theoretical understanding of what the transformer is, and why it works physically. In this paper, from a physical perspective on modern chips, we construct physical models in the Fock space over the Hilbert space of tokens, realizing large language models based on a transformer architecture as open quantum systems. Our physical models underlie the transformer architecture for large language models.

[234] Whose View of Safety? A Deep DIVE Dataset for Pluralistic Alignment of Text-to-Image Models

Charvi Rastogi, Tian Huey Teh, Pushkar Mishra, Roma Patel, Ding Wang, Mark Díaz, Alicia Parrish, Aida Mostafazadeh Davani, Zoe Ashwood, Michela Paganini, Vinodkumar Prabhakaran, Verena Rieser, Lora Aroyo

Main category: cs.LG

TL;DR: The paper proposes pluralistic alignment for text-to-image (T2I) models to address misalignment with diverse human values. It introduces the DIVE dataset, confirms demographics as a proxy for diverse viewpoints, and discusses implications for equitable T2I systems.

DetailsMotivation: Current T2I models often misalign with diverse human experiences, necessitating a pluralistic approach to alignment.

Method: Introduces the DIVE dataset for pluralistic alignment, uses intersectional human raters, and analyzes demographic differences in harm perception.

Result: Demographics are a key proxy for diverse viewpoints, revealing context-dependent differences in harm perception.

Conclusion: The research provides foundational tools for more equitable and aligned T2I systems, emphasizing efficient data collection and model steerability.

Abstract: Current text-to-image (T2I) models often fail to account for diverse human experiences, leading to misaligned systems. We advocate for pluralistic alignment, where an AI understands and is steerable towards diverse, and often conflicting, human values. Our work provides three core contributions to achieve this in T2I models. First, we introduce a novel dataset for Diverse Intersectional Visual Evaluation (DIVE) – the first multimodal dataset for pluralistic alignment. It enables deep alignment to diverse safety perspectives through a large pool of demographically intersectional human raters who provided extensive feedback across 1000 prompts, with high replication, capturing nuanced safety perceptions. Second, we empirically confirm demographics as a crucial proxy for diverse viewpoints in this domain, revealing significant, context-dependent differences in harm perception that diverge from conventional evaluations. Finally, we discuss implications for building aligned T2I models, including efficient data collection strategies, LLM judgment capabilities, and model steerability towards diverse perspectives. This research offers foundational tools for more equitable and aligned T2I systems. Content Warning: The paper includes sensitive content that may be harmful.

[235] Improving KAN with CDF normalization to quantiles

Jakub Strawa, Jarek Duda

Main category: cs.LG

TL;DR: The paper highlights the benefits of CDF normalization, a method from copula theory, in machine learning, demonstrating its advantages over traditional rescaling methods using Kolmogorov-Arnold Networks (KANs).

DetailsMotivation: Traditional normalization methods in machine learning (mean subtraction, standard deviation division, or fixed-range rescaling) are common but may not be optimal. The paper explores CDF normalization, a less-known method from copula theory, to improve model performance and reduce overfitting.

Method: The study employs CDF normalization, transforming data to approximate quantiles using the estimated cumulative distribution function (CDF), resulting in a near-uniform distribution in [0,1]. This method is tested on Kolmogorov-Arnold Networks (KANs), replacing traditional rescaling.

Result: Switching to CDF normalization in KANs improves predictions compared to traditional methods like Legendre-KAN. The approach also enables mixed moments as neuron weights, facilitating local joint distribution modeling and flexible propagation of probability distributions.

Conclusion: CDF normalization, though underutilized in machine learning, offers significant advantages, including improved prediction accuracy and reduced overfitting, as demonstrated in KANs. It also provides interpretability and flexibility in modeling joint distributions.

Abstract: Data normalization is crucial in machine learning; it is usually performed by subtracting the mean and dividing by the standard deviation, or by rescaling to a fixed range. Copula theory, popular in finance, instead normalizes data to approximate quantiles by transforming x to CDF(x) using an estimated CDF (cumulative distribution function), yielding a nearly uniform distribution on [0,1] and allowing simpler representations that are less likely to overfit. This technique seems nearly unknown in machine learning, so we present some of its advantages using the recently popular Kolmogorov-Arnold Networks (KANs) as an example, improving predictions over Legendre-KAN simply by switching from rescaling to CDF normalization. Additionally, in the HCR interpretation, the weights of such neurons are mixed moments that provide local joint distribution models, allow probability distributions to be propagated as well, and permit changing the propagation direction.
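
The transformation itself is essentially one line once an empirical CDF is chosen. A minimal sketch using ranks as a simple empirical-CDF estimate (the paper may use a different estimator):

```python
import numpy as np
from scipy.stats import rankdata

def cdf_normalize(x):
    """Map each value to its empirical CDF value, i.e. its approximate
    quantile. The output is nearly Uniform(0, 1) regardless of the
    input distribution."""
    n = len(x)
    return rankdata(x) / (n + 1)   # n+1 keeps values strictly inside (0, 1)

x = np.random.lognormal(size=1000)   # heavily skewed input
u = cdf_normalize(x)                 # approximately uniform on (0, 1)
print(u.min(), u.max())
```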

[236] Selective Embedding for Deep Learning

Mert Sehri, Zehui Hua, Francisco de Assis Boldt, Patrick Dumond

Main category: cs.LG

TL;DR: The paper introduces selective embedding, a novel data loading strategy for deep learning, improving generalization and computational efficiency by alternating data segments from multiple sources within a single channel.

DetailsMotivation: Deep learning struggles with nonstationary conditions and dissimilar domains, especially in time-domain data, limiting generalization or increasing computational costs.

Method: Selective embedding alternates short data segments from multiple sources in a single input channel, inspired by human-like information processing.

Result: Validated on six time-domain datasets, the method achieves high classification accuracy and reduces training times across various architectures.

Conclusion: Selective embedding is scalable and resource-efficient, ideal for real-world applications requiring robustness and adaptability.

Abstract: Deep learning has revolutionized many industries by enabling models to automatically learn complex patterns from raw data, reducing dependence on manual feature engineering. However, deep learning algorithms are sensitive to input data, and performance often deteriorates under nonstationary conditions and across dissimilar domains, especially when using time-domain data. Conventional single-channel or parallel multi-source data loading strategies either limit generalization or increase computational costs. This study introduces selective embedding, a novel data loading strategy, which alternates short segments of data from multiple sources within a single input channel. Drawing inspiration from cognitive psychology, selective embedding mimics human-like information processing to reduce model overfitting, enhance generalization, and improve computational efficiency. Validation is conducted using six time-domain datasets, demonstrating that the proposed method consistently achieves high classification accuracy across various deep learning architectures while significantly reducing training times. The approach proves particularly effective for complex systems with multiple data sources, offering a scalable and resource-efficient solution for real-world applications in healthcare, heavy machinery, marine, railway, and agriculture, where robustness and adaptability are critical.
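
One plausible reading of the loading strategy, sketched below: short segments are drawn in turn from each source signal and placed into a single input channel. The round-robin order and segment length are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def selective_embed(sources, segment_len):
    """Build a single input channel by alternating short segments drawn
    from multiple equal-length 1-D source signals."""
    n = len(sources[0])
    out, pos, k = [], 0, 0
    while pos < n:
        seg = sources[k % len(sources)][pos:pos + segment_len]
        out.append(seg)
        k += 1
        pos += segment_len
    return np.concatenate(out)

a = np.sin(np.linspace(0, 50, 4096))   # source 1
b = np.random.randn(4096)              # source 2
x = selective_embed([a, b], segment_len=256)  # alternates a/b segments
```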

[237] Scalable Submodular Policy Optimization via Pruned Submodularity Graph

Aditi Anand, Suman Banerjee, Dildar Ali

Main category: cs.LG

TL;DR: The paper introduces a variant of RL with submodular reward functions, proposing a pruned submodularity graph-based approach for efficient optimization.

DetailsMotivation: Traditional RL assumes additive rewards, but many real-world problems (e.g., path planning) exhibit diminishing returns, modeled as submodular functions. This work addresses such scenarios.

Method: A pruned submodularity graph-based approach is developed to find an optimal policy, with analysis of time, space, and performance guarantees.

Result: Experiments on a benchmark setup show the proposed method outperforms baselines in reward maximization.

Conclusion: The approach effectively handles submodular rewards in RL, offering computational feasibility and improved performance.

Abstract: In Reinforcement Learning (abbreviated as RL), an agent interacts with the environment via a set of possible actions, and a reward is generated from some unknown distribution. The task here is to find an optimal set of actions such that the reward after a certain time step gets maximized. In a traditional setup, the reward function in an RL problem is considered additive. In reality, however, many problems, including path planning and coverage control, exhibit diminishing returns in their reward, which can be modeled as a submodular function. In this paper, we study a variant of the RL problem where the reward function is submodular, and our objective is to find an optimal policy such that this reward function gets maximized. We have proposed a pruned submodularity graph-based approach that provides a provably approximate solution in feasible computation time. The proposed approach has been analyzed to understand its time and space requirements as well as its performance guarantee. We have experimented with a benchmark agent-environment setup that has been used in similar previous studies, and the results are reported. From the results, we observe that the policy obtained by our proposed approach leads to more reward than the baseline methods.
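
Diminishing returns are what make greedy-style selection attractive here. As a reference point (not the paper's pruned-graph method), the classic greedy algorithm for monotone submodular maximization under a cardinality constraint looks like this on a toy coverage reward:

```python
def greedy_submodular(ground_set, f, k):
    """Classic greedy: repeatedly add the element with the largest
    marginal gain. For monotone submodular f under a cardinality
    constraint, this achieves a (1 - 1/e) approximation."""
    S = set()
    for _ in range(k):
        best = max((a for a in ground_set if a not in S),
                   key=lambda a: f(S | {a}) - f(S))
        S.add(best)
    return S

# Coverage reward: adding overlapping regions yields diminishing
# returns, so f is submodular.
regions = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}}
f = lambda S: len(set().union(*(regions[i] for i in S))) if S else 0
print(greedy_submodular(regions.keys(), f, k=2))   # picks {3, 1}
```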

[238] LightAutoDS-Tab: Multi-AutoML Agentic System for Tabular Data

Aleksey Lapin, Igor Hromov, Stanislav Chumakov, Mile Mitrovic, Dmitry Simakov, Nikolay O. Nikitin, Andrey V. Savchenko

Main category: cs.LG

TL;DR: LightAutoDS-Tab is a multi-AutoML system combining LLM-based code generation with AutoML tools, enhancing flexibility and robustness for tabular data tasks, outperforming existing solutions.

DetailsMotivation: AutoML's efficiency is limited by tool dependency; this paper aims to improve flexibility and robustness in pipeline design for tabular data tasks.

Method: Integrates LLM-based code generation with multiple AutoML tools to create a multi-agentic system.

Result: Outperforms state-of-the-art open-source solutions on Kaggle data science tasks.

Conclusion: LightAutoDS-Tab offers a more flexible and robust approach for AutoML in tabular data tasks, with open-source availability.

Abstract: AutoML has advanced in handling complex tasks through the integration of LLMs, yet its efficiency remains limited by its dependence on specific underlying tools. In this paper, we introduce LightAutoDS-Tab, a multi-AutoML agentic system for tasks with tabular data, which combines LLM-based code generation with several AutoML tools. Our approach improves the flexibility and robustness of pipeline design, outperforming state-of-the-art open-source solutions on several data science tasks from Kaggle. The code of LightAutoDS-Tab is available in the open repository https://github.com/sb-ai-lab/LADS

[239] Gauge Flow Models

Alexander Strunk, Roland Assam

Main category: cs.LG

TL;DR: Gauge Flow Models, a new class of Generative Flow Models, use a learnable Gauge Field in Flow ODEs, outperforming traditional Flow Models in Gaussian Mixture Model experiments.

DetailsMotivation: To improve performance of Generative Flow Models by integrating a learnable Gauge Field into the Flow ODE framework.

Method: Introduces Gauge Flow Models with a mathematical framework and tests them using Flow Matching on Gaussian Mixture Models.

Result: Gauge Flow Models show significantly better performance than traditional Flow Models, even when smaller in size.

Conclusion: Gauge Flow Models are promising for generative tasks, with potential for broader applications.

Abstract: This paper introduces Gauge Flow Models, a novel class of Generative Flow Models. These models incorporate a learnable Gauge Field within the Flow Ordinary Differential Equation (ODE). A comprehensive mathematical framework for these models, detailing their construction and properties, is provided. Experiments using Flow Matching on Gaussian Mixture Models demonstrate that Gauge Flow Models yield significantly better performance than traditional Flow Models of comparable or even larger size. Additionally, unpublished research indicates a potential for enhanced performance across a broader range of generative tasks.

[240] Single- to multi-fidelity history-dependent learning with uncertainty quantification and disentanglement: application to data-driven constitutive modeling

Jiaxiang Yi, Bernardo P. Ferreira, Miguel A. Bessa

Main category: cs.LG

TL;DR: The paper generalizes data-driven learning to handle history-dependent multi-fidelity data, quantifying epistemic uncertainty and separating it from data noise. It adapts to various learning scenarios, from simple deterministic models to complex Bayesian recurrent neural networks.

DetailsMotivation: To address challenges in data-driven constitutive modeling, especially in scenarios with multi-fidelity data and uncertainty, by providing a versatile and hierarchical learning framework.

Method: Proposes a hierarchical, adaptive methodology for multi-fidelity variance estimation using Bayesian recurrent neural networks, applicable to deterministic and noisy data scenarios.

Result: The method accurately predicts responses, quantifies model error, and identifies noise distributions, demonstrating versatility in diverse data-driven modeling cases.

Conclusion: The framework opens opportunities for real-world applications in design and analysis under uncertainty, particularly in scientific and engineering domains.

Abstract: Data-driven learning is generalized to consider history-dependent multi-fidelity data, while quantifying epistemic uncertainty and disentangling it from data noise (aleatoric uncertainty). This generalization is hierarchical and adapts to different learning scenarios: from training the simplest single-fidelity deterministic neural networks up to the proposed multi-fidelity variance estimation Bayesian recurrent neural networks. The versatility and generality of the proposed methodology are demonstrated by applying it to different data-driven constitutive modeling scenarios that include multiple fidelities with and without aleatoric uncertainty (noise). The method accurately predicts the response and quantifies model error while also discovering the noise distribution (when present). This opens opportunities for future real-world applications in diverse scientific and engineering domains; especially, the most challenging cases involving design and analysis under uncertainty.
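
A common recipe for the disentanglement described above, sketched with a deep ensemble standing in for the paper's Bayesian recurrent networks: each member predicts a mean and a log-variance, so the average predicted variance captures data noise (aleatoric) while the spread of the means captures model error (epistemic). The model output signature is an assumption.

```python
import torch

def gaussian_nll(mean, log_var, y):
    """Heteroscedastic Gaussian negative log-likelihood: the predicted
    variance term absorbs data noise (aleatoric uncertainty)."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

def predict_with_uncertainty(models, x):
    """Ensemble prediction with disentangled uncertainties.
    Each model is assumed to return (mean, log_var) for input x."""
    means, log_vars = zip(*(m(x) for m in models))
    means, log_vars = torch.stack(means), torch.stack(log_vars)
    aleatoric = log_vars.exp().mean(0)   # average predicted noise variance
    epistemic = means.var(0)             # disagreement between members
    return means.mean(0), aleatoric, epistemic
```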

[241] Agentic Neural Networks: Self-Evolving Multi-Agent Systems via Textual Backpropagation

Xiaowen Ma, Chenyang Lin, Yao Zhang, Volker Tresp, Yunpu Ma

Main category: cs.LG

TL;DR: The paper introduces Agentic Neural Network (ANN), a framework for dynamic multi-agent collaboration inspired by neural network architecture, improving accuracy and adaptability in complex tasks.

DetailsMotivation: Current multi-agent systems rely on static configurations, limiting flexibility and scalability. ANN aims to address this by enabling dynamic, data-driven collaboration.

Method: ANN uses a two-phase optimization: (1) Forward Phase for task decomposition and team formation, and (2) Backward Phase for iterative refinement of agent roles and coordination.

Result: ANN outperforms existing multi-agent baselines across four benchmark datasets, demonstrating consistent performance gains.

Conclusion: ANN offers a scalable, neuro-symbolic framework for multi-agent systems, combining LLM collaboration with neural network efficiency. The framework will be open-sourced.

Abstract: Leveraging multiple Large Language Models(LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network(ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative “team” focused on a specific subtask. Agentic Neural Network follows a two-phase optimization strategy: (1) Forward Phase-Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase-Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables ANN to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across four benchmark datasets, ANN surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements. Our findings indicate that ANN provides a scalable, data-driven framework for multi-agent systems, combining the collaborative capabilities of LLMs with the efficiency and flexibility of neural network principles. We plan to open-source the entire framework.

[242] Soft-ECM: An extension of Evidential C-Means for complex data

Armel Soubeiga, Thomas Guyet, Violaine Antoine

Main category: cs.LG

TL;DR: The paper introduces Soft-ECM, a belief function-based clustering algorithm for complex data like mixed or non-tabular data, using semi-metrics instead of Euclidean space properties.

DetailsMotivation: Existing belief function clustering algorithms fail for complex data (e.g., mixed or non-tabular) due to reliance on Euclidean space properties.

Method: Reformulates Evidential C-Means (ECM) for complex data, proposing Soft-ECM, which uses semi-metrics for centroid positioning in imprecise clusters.

Result: Soft-ECM performs comparably to fuzzy clustering on numerical data and handles mixed data effectively, showing benefits with semi-metrics like DTW for time series.

Conclusion: Soft-ECM extends belief function clustering to complex data, offering a flexible and effective alternative to traditional methods.

Abstract: Clustering based on belief functions has been gaining increasing attention in the machine learning community due to its ability to effectively represent uncertainty and/or imprecision. However, none of the existing algorithms can be applied to complex data, such as mixed data (numerical and categorical) or non-tabular data like time series. Indeed, these types of data are, in general, not represented in a Euclidean space, and the aforementioned algorithms make use of the properties of such spaces, in particular for the construction of barycenters. In this paper, we reformulate the Evidential C-Means (ECM) problem for clustering complex data. We propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters while requiring only a semi-metric. Our experiments show that Soft-ECM presents results comparable to conventional fuzzy clustering approaches on numerical data, and we demonstrate its ability to handle mixed data and its benefits when combining fuzzy clustering with semi-metrics such as DTW for time series data.

[243] Air Traffic Controller Task Demand via Graph Neural Networks: An Interpretable Approach to Airspace Complexity

Edward Henderson, Dewi Gould, Richard Everson, George De Ath, Nick Pepper

Main category: cs.LG

TL;DR: A Graph Neural Network (GNN) framework is introduced to predict ATCO task demand by analyzing traffic scenarios, outperforming heuristics and baselines.

DetailsMotivation: Existing complexity metrics lack nuance in assessing ATCO task demand in crowded airspace.

Method: An attention-based GNN predicts upcoming clearances and derives per-aircraft task demand scores via systematic ablation.

Result: The framework outperforms ATCO-inspired heuristics and baselines, providing reliable complexity estimation.

Conclusion: The tool offers interpretable task demand attribution, aiding controller training and airspace redesign.

Abstract: Real-time assessment of near-term Air Traffic Controller (ATCO) task demand is a critical challenge in an increasingly crowded airspace, as existing complexity metrics often fail to capture nuanced operational drivers beyond simple aircraft counts. This work introduces an interpretable Graph Neural Network (GNN) framework to address this gap. Our attention-based model predicts the number of upcoming clearances, the instructions issued to aircraft by ATCOs, from interactions within static traffic scenarios. Crucially, we derive an interpretable, per-aircraft task demand score by systematically ablating aircraft and measuring the impact on the model’s predictions. Our framework significantly outperforms an ATCO-inspired heuristic and is a more reliable estimator of scenario complexity than established baselines. The resulting tool can attribute task demand to specific aircraft, offering a new way to analyse and understand the drivers of complexity for applications in controller training and airspace redesign.
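
The per-aircraft attribution described above reduces to a leave-one-out loop. A pseudocode-style sketch; `model` and `graph.remove_node` are hypothetical placeholders for the trained GNN and whatever graph library is in use.

```python
def per_aircraft_demand(model, graph, aircraft_ids):
    """Attribute task demand to each aircraft by ablation: remove one
    aircraft node at a time and measure how the model's predicted
    number of upcoming clearances changes."""
    base = model(graph)                    # predicted clearance count
    scores = {}
    for a in aircraft_ids:
        reduced = graph.remove_node(a)     # hypothetical helper: copy
        scores[a] = float(base - model(reduced))  # drop in prediction
    return scores                          # higher score = more demand
```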

[244] Improving Out-of-distribution Human Activity Recognition via IMU-Video Cross-modal Representation Learning

Seyyed Saeid Cheshmi, Buyao Lyu, Thomas Lisko, Rajesh Rajamani, Robert A. McGovern, Yogatheesan Varatharajah

Main category: cs.LG

TL;DR: The paper proposes a cross-modal self-supervised pretraining method for Human Activity Recognition (HAR) using IMU-video data, improving generalizability for out-of-distribution datasets, including Parkinson’s disease patients.

DetailsMotivation: Current HAR methods lack generalizability across environments or populations due to reliance on application-specific labels.

Method: A cross-modal self-supervised pretraining approach using large-scale unlabeled IMU-video data.

Result: Outperforms state-of-the-art IMU-video and IMU-only pretraining in zero-shot and few-shot evaluations.

Conclusion: Cross-modal pretraining is effective for learning generalizable representations in dynamic data like IMU signals.

Abstract: Human Activity Recognition (HAR) based on wearable inertial sensors plays a critical role in remote health monitoring. In patients with movement disorders, the ability to detect abnormal patient movements in their home environments can enable continuous optimization of treatments and help alert caretakers as needed. Machine learning approaches have been proposed for HAR tasks using Inertial Measurement Unit (IMU) data; however, most rely on application-specific labels and lack generalizability to data collected in different environments or populations. To address this limitation, we propose a new cross-modal self-supervised pretraining approach to learn representations from large-scale unlabeled IMU-video data and demonstrate improved generalizability in HAR tasks on out-of-distribution (OOD) IMU datasets, including a dataset collected from patients with Parkinson’s disease. Specifically, our results indicate that the proposed cross-modal pretraining approach outperforms the current state-of-the-art IMU-video pretraining approach and IMU-only pretraining under zero-shot and few-shot evaluations. Broadly, our study provides evidence that in highly dynamic data modalities, such as IMU signals, cross-modal pretraining may be a useful tool to learn generalizable data representations. Our software is available at https://github.com/scheshmi/IMU-Video-OOD-HAR.
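
Cross-modal pretraining of this kind typically pairs time-aligned clips with a symmetric InfoNCE objective: matched IMU/video pairs are positives, all other pairs in the batch are negatives. A minimal sketch (the abstract does not state the exact loss or temperature used):

```python
import torch
import torch.nn.functional as F

def infonce(imu_emb, video_emb, tau=0.07):
    """Symmetric InfoNCE between batches of time-aligned IMU and video
    clip embeddings of shape (B, D)."""
    z1 = F.normalize(imu_emb, dim=-1)
    z2 = F.normalize(video_emb, dim=-1)
    logits = z1 @ z2.t() / tau             # (B, B) similarity matrix
    labels = torch.arange(z1.shape[0])     # diagonal = matched pairs
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```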

[245] Model-free Reinforcement Learning for Model-based Control: Towards Safe, Interpretable and Sample-efficient Agents

Thomas Banker, Ali Mesbah

Main category: cs.LG

TL;DR: The paper explores model-based agents as an alternative to model-free RL for safer, more interpretable, and sample-efficient decision-making in autonomous systems.

DetailsMotivation: Model-free RL, while effective, suffers from sample inefficiency, unsafe learning, and limited interpretability due to reliance on deep neural networks.

Method: The work introduces model-based agents (e.g., model predictive control) that leverage adaptable models of system dynamics, cost, and constraints. It combines these with model-free RL to address model mismatch.

Result: Model-based agents offer safer, more interpretable, and sample-efficient learning, with potential synergies when combined with model-free RL.

Conclusion: The interplay between model-based and model-free RL presents untapped potential for developing efficient, safe, and interpretable decision-making agents.

Abstract: Training sophisticated agents for optimal decision-making under uncertainty has been key to the rapid development of modern autonomous systems across fields. Notably, model-free reinforcement learning (RL) has enabled decision-making agents to improve their performance directly through system interactions, with minimal prior knowledge about the system. Yet, model-free RL has generally relied on agents equipped with deep neural network function approximators, appealing to the networks’ expressivity to capture the agent’s policy and value function for complex systems. However, neural networks amplify the issues of sample inefficiency, unsafe learning, and limited interpretability in model-free RL. To this end, this work introduces model-based agents as a compelling alternative for control policy approximation, leveraging adaptable models of system dynamics, cost, and constraints for safe policy learning. These models can encode prior system knowledge to inform, constrain, and aid in explaining the agent’s decisions, while deficiencies due to model mismatch can be remedied with model-free RL. We outline the benefits and challenges of learning model-based agents – exemplified by model predictive control – and detail the primary learning approaches: Bayesian optimization, policy search RL, and offline strategies, along with their respective strengths. While model-free RL has long been established, its interplay with model-based agents remains largely unexplored, motivating our perspective on their combined potentials for sample-efficient learning of safe and interpretable decision-making agents.
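
To make the model-based agent concrete, here is a minimal random-shooting MPC loop on a toy double integrator; a real system would swap in a learned dynamics model, constraints, and a stronger optimizer.

```python
import numpy as np

def mpc_action(dynamics, cost, x0, horizon=10, n_samples=256, act_dim=1):
    """Random-shooting MPC: sample action sequences, roll each through
    the dynamics model, and execute only the first action of the
    cheapest sequence (receding horizon)."""
    seqs = np.random.uniform(-1, 1, (n_samples, horizon, act_dim))
    costs = np.zeros(n_samples)
    for i, seq in enumerate(seqs):
        x = x0
        for u in seq:
            costs[i] += cost(x, u)
            x = dynamics(x, u)
    return seqs[costs.argmin(), 0]

# Toy double integrator: state [position, velocity]; drive position to 0.
dyn = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u[0]])
cst = lambda x, u: x[0] ** 2 + 0.01 * u[0] ** 2
print(mpc_action(dyn, cst, np.array([1.0, 0.0])))
```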

[246] Fake or Real: The Impostor Hunt in Texts for Space Operations

Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Przemysław Biecek, Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Artur Janicki, Evridiki Ntagiou

Main category: cs.LG

TL;DR: The Kaggle competition ‘Fake or Real’ challenges participants to detect maliciously modified outputs from Large Language Models (LLMs), addressing AI security threats like data poisoning and overreliance.

DetailsMotivation: The competition aims to tackle under-researched AI security threats, specifically data poisoning and overreliance in LLMs, identified in the ESA-funded 'Assurance for Space Domain AI Applications' project.

Method: Participants must develop or adapt techniques to distinguish between genuine and maliciously altered LLM outputs.

Result: The competition seeks innovative solutions to a novel problem, fostering research in AI security.

Conclusion: This initiative highlights the need for robust methods to detect manipulated AI outputs, contributing to safer AI applications in critical domains.

Abstract: The “Fake or Real” competition hosted on Kaggle (https://www.kaggle.com/competitions/fake-or-real-the-impostor-hunt) is the second part of a series of follow-up competitions and hackathons related to the “Assurance for Space Domain AI Applications” project funded by the European Space Agency (https://assurance-ai.space-codev.org/). The competition idea is based on two real-life AI security threats identified within the project – data poisoning and overreliance in Large Language Models. The task is to distinguish between the proper output of an LLM and output generated under malicious modification of the LLM. As this problem has not been extensively researched, participants are required to develop new techniques to address it or adjust existing ones to this problem statement.

[247] Provable Low-Frequency Bias of In-Context Learning of Representations

Yongyi Yang, Hidenori Tanaka, Wei Hu

Main category: cs.LG

TL;DR: The paper explains how in-context learning (ICL) in LLMs works through a double convergence framework, leading to smooth representations and robustness to noise.

DetailsMotivation: To uncover the mechanisms behind ICL in LLMs, which surpass pretraining by internalizing data-generating processes.

Method: Introduces a unified framework of double convergence (convergence over context and layers) to analyze hidden representations.

Result: Proves and verifies an implicit bias towards smooth representations, explains empirical observations, and predicts noise robustness.

Conclusion: Provides theoretical insights into ICL mechanisms, offering a foundation for broader studies.

Abstract: In-context learning (ICL) enables large language models (LLMs) to acquire new behaviors from the input sequence alone, without any parameter updates. Recent studies have shown that ICL can surpass the original meanings learned in the pretraining stage by internalizing the structure of the prompt’s data-generating process (DGP) into the hidden representations. However, the mechanisms by which LLMs achieve this ability remain open. In this paper, we present the first rigorous explanation of such phenomena by introducing a unified framework of double convergence, where hidden representations converge both over context and across layers. This double convergence process leads to an implicit bias towards smooth (low-frequency) representations, which we prove analytically and verify empirically. Our theory explains several open empirical observations, including why learned representations exhibit globally structured but locally distorted geometry, and why their total energy decays without vanishing. Moreover, our theory predicts that ICL has an intrinsic robustness towards high-frequency noise, which we empirically confirm. These results provide new insights into the underlying mechanisms of ICL, and a theoretical foundation to study it that hopefully extends to more general data distributions and settings.

[248] Acoustic Index: A Novel AI-Driven Parameter for Cardiac Disease Risk Stratification Using Echocardiography

Beka Begiashvili, Carlos J. Fernandez-Candel, Matías Pérez Paredes

Main category: cs.LG

TL;DR: The paper introduces the Acoustic Index, an AI-derived echocardiographic parameter for early detection of cardiac dysfunction, outperforming traditional methods like EF and GLS.

DetailsMotivation: Limitations of traditional echocardiographic parameters (EF, GLS) in detecting early cardiac dysfunction drive the need for reproducible, interpretable, and operator-independent alternatives.

Method: The Acoustic Index combines Extended Dynamic Mode Decomposition (EDMD) with a hybrid neural network, incorporating clinical metadata and spatiotemporal dynamics from echocardiographic sequences.

Result: In a cohort of 736 patients, the Acoustic Index achieved an AUC of 0.89, with sensitivity and specificity exceeding 0.8 in cross-validation.

Conclusion: The Acoustic Index is a promising, scalable, and vendor-independent tool for early cardiac dysfunction detection, with potential for future validation and disease-specific adaptation.

Abstract: Traditional echocardiographic parameters such as ejection fraction (EF) and global longitudinal strain (GLS) have limitations in the early detection of cardiac dysfunction. EF often remains normal despite underlying pathology, and GLS is influenced by load conditions and vendor variability. There is a growing need for reproducible, interpretable, and operator-independent parameters that capture subtle and global cardiac functional alterations. We introduce the Acoustic Index, a novel AI-derived echocardiographic parameter designed to quantify cardiac dysfunction from standard ultrasound views. The model combines Extended Dynamic Mode Decomposition (EDMD) based on Koopman operator theory with a hybrid neural network that incorporates clinical metadata. Spatiotemporal dynamics are extracted from echocardiographic sequences to identify coherent motion patterns. These are weighted via attention mechanisms and fused with clinical data using manifold learning, resulting in a continuous score from 0 (low risk) to 1 (high risk). In a prospective cohort of 736 patients, encompassing various cardiac pathologies and normal controls, the Acoustic Index achieved an area under the curve (AUC) of 0.89 in an independent test set. Cross-validation across five folds confirmed the robustness of the model, showing that both sensitivity and specificity exceeded 0.8 when evaluated on independent data. Threshold-based analysis demonstrated stable trade-offs between sensitivity and specificity, with optimal discrimination near the chosen operating threshold. The Acoustic Index represents a physics-informed, interpretable AI biomarker for cardiac function. It shows promise as a scalable, vendor-independent tool for early detection, triage, and longitudinal monitoring. Future directions include external validation, longitudinal studies, and adaptation to disease-specific classifiers.
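
EDMD, the dynamical core of the method, is compact enough to sketch. The following toy illustration fits a finite-dimensional Koopman approximation on synthetic data (the monomial dictionary and damped oscillator are stand-ins, not the clinical pipeline):

```python
import numpy as np

def edmd(X, Y, dictionary):
    """Extended Dynamic Mode Decomposition: fit a Koopman approximation K
    such that psi(x_{t+1}) ≈ K psi(x_t).

    X, Y: (n_snapshots, state_dim) arrays, Y being the one-step successors of X.
    dictionary: maps (n, state_dim) -> (n, n_features) lifted observables.
    Returns K and its eigenvalues (coherent-mode frequencies and growth rates).
    """
    Psi_x, Psi_y = dictionary(X), dictionary(Y)
    # Least-squares solution of Psi_x @ sol ≈ Psi_y (row convention), K = sol^T
    K = np.linalg.lstsq(Psi_x, Psi_y, rcond=None)[0].T
    return K, np.linalg.eigvals(K)

# Toy example: noisy damped oscillation, monomial dictionary up to degree 2.
rng = np.random.default_rng(0)
t = np.arange(400) * 0.05
x = np.stack([np.exp(-0.1 * t) * np.cos(t),
              np.exp(-0.1 * t) * np.sin(t)], axis=1) + 0.01 * rng.normal(size=(400, 2))

def monomials(Z):
    z1, z2 = Z[:, 0], Z[:, 1]
    return np.stack([np.ones_like(z1), z1, z2, z1 * z2, z1**2, z2**2], axis=1)

K, eigvals = edmd(x[:-1], x[1:], monomials)
print(np.sort(np.abs(eigvals))[::-1])  # leading |eigenvalue| just below 1: a damped mode
```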

[249] Time Series Forecastability Measures

Rui Wang, Steven Klee, Alexis Roos

Main category: cs.LG

TL;DR: The paper introduces two metrics—spectral predictability score and largest Lyapunov exponent—to assess time series forecastability before model development, showing strong correlation with actual forecast performance.

DetailsMotivation: To evaluate the inherent forecastability of time series data before modeling, helping practitioners prioritize efforts and set realistic expectations.

Method: Uses spectral predictability score for frequency regularity and Lyapunov exponents for chaos/stability, tested on synthetic and M5 competition datasets.

Result: Metrics effectively reflect inherent forecastability and correlate with model performance, aiding in strategic planning.

Conclusion: Pre-model forecastability assessment improves resource allocation and expectation setting for time series forecasting.

Abstract: This paper proposes using two metrics to quantify the forecastability of time series prior to model development: the spectral predictability score and the largest Lyapunov exponent. Unlike traditional model evaluation metrics, these measures assess the inherent forecastability characteristics of the data before any forecast attempts. The spectral predictability score evaluates the strength and regularity of frequency components in the time series, whereas the Lyapunov exponents quantify the chaos and stability of the system generating the data. We evaluated the effectiveness of these metrics on both synthetic and real-world time series from the M5 forecast competition dataset. Our results demonstrate that these two metrics can correctly reflect the inherent forecastability of a time series and have a strong correlation with the actual forecast performance of various models. By understanding the inherent forecastability of time series before model training, practitioners can focus their planning efforts on products and supply chain levels that are more forecastable, while setting appropriate expectations or seeking alternative strategies for products with limited forecastability.
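
Both metrics are straightforward to compute. The sketch below uses one common formulation of each: spectral predictability as one minus the normalized spectral entropy of the periodogram, and a Rosenstein-style largest-Lyapunov estimate; the paper’s exact definitions may differ:

```python
import numpy as np

def spectral_predictability(y):
    """1 - normalized spectral entropy: near 1 for a pure sinusoid,
    near 0 for white noise (one common formulation)."""
    y = np.asarray(y, float) - np.mean(y)
    p = np.abs(np.fft.rfft(y))[1:] ** 2          # drop the DC bin
    p = p / p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))
    return 1.0 - entropy

def largest_lyapunov(y, dim=5, lag=1, horizon=20):
    """Rosenstein-style estimate: delay-embed, follow each point's nearest
    neighbor, and fit the slope of the mean log-divergence curve."""
    y = np.asarray(y, float)
    n = len(y) - (dim - 1) * lag
    emb = np.stack([y[i * lag: i * lag + n] for i in range(dim)], axis=1)
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)
    d += np.where(np.abs(np.subtract.outer(np.arange(n), np.arange(n))) < 10,
                  np.inf, 0.0)                    # exclude temporal neighbors
    nn = d.argmin(axis=1)
    div = []
    for k in range(1, horizon):
        valid = (np.arange(n) + k < n) & (nn + k < n)
        steps = np.linalg.norm(emb[np.arange(n)[valid] + k] - emb[nn[valid] + k], axis=1)
        div.append(np.log(steps + 1e-12).mean())
    return np.polyfit(np.arange(1, horizon), div, 1)[0]  # slope ≈ largest exponent

rng = np.random.default_rng(0)
sine = np.sin(0.3 * np.arange(1000)) + 0.05 * rng.normal(size=1000)
print(spectral_predictability(sine), spectral_predictability(rng.normal(size=1000)))

x = np.empty(800); x[0] = 0.3
for i in range(799):
    x[i + 1] = 4.0 * x[i] * (1 - x[i])            # chaotic logistic map
print(largest_lyapunov(x))                        # positive for chaotic data (theory: ln 2)
```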

[250] Change of Thought: Adaptive Test-Time Computation

Mrinal Mathur, Mike Doan, Barak Pearlmutter, Sergey Plis

Main category: cs.LG

TL;DR: The SELF-Transformer enhances encoder Transformers by iteratively refining attention weights internally, avoiding token-level autoregression, and improving accuracy by up to 20% without extra parameters.

DetailsMotivation: Transformers with fixed-depth passes are limited in expressive power. Autoregressive methods rely on externalizing intermediate states, unlike biological brains. The goal is to boost expressive power without token-level autoregression.

Method: Introduces the SELF-Transformer, an encoder layer that iteratively updates its attention weights internally to a fixed point, scaling computation with input difficulty.

Result: Achieves up to 20% accuracy gains on benchmarks without increasing parameter count, showing benefits of input-adaptive alignment.

Conclusion: SELF-Transformers recover expressive power of iterative reasoning while maintaining encoder simplicity, offering significant accuracy improvements with modest extra compute.

Abstract: Transformers evaluated in a single, fixed-depth pass are provably limited in expressive power to the constant-depth circuit class TC0. Running a Transformer autoregressively removes that ceiling – first in next-token prediction and, more recently, in chain-of-thought reasoning. Both regimes rely on feedback loops that decode internal states into tokens only to re-encode them in subsequent steps. While this “thinking aloud” mirrors human reasoning, biological brains iterate without externalising intermediate states as language. To boost the expressive power of encoder Transformers without resorting to token-level autoregression, we introduce the SELF-Transformer: an encoder layer that iteratively refines its own attention weights to a fixed point. Instead of producing – in one pass – the alignment matrix that remixes the input sequence, the SELF-Transformer iteratively updates that matrix internally, scaling test-time computation with input difficulty. This adaptivity yields up to 20% accuracy gains on encoder-style benchmarks without increasing parameter count, demonstrating that input-adaptive alignment at test time offers substantial benefits for only a modest extra compute budget. Self-Transformers thus recover much of the expressive power of iterative reasoning while preserving the simplicity of pure encoder architectures.
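
One plausible reading of the fixed-point idea, sketched in PyTorch (hyperparameters, weight sharing, and the stopping rule are illustrative, not the authors’ architecture):

```python
import torch
import torch.nn as nn

class SelfRefiningAttention(nn.Module):
    """Toy encoder layer that iterates its own alignment matrix to a fixed
    point instead of computing it in one pass (a minimal sketch of the
    SELF-Transformer idea)."""

    def __init__(self, d_model, max_iters=10, tol=1e-4):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5
        self.max_iters, self.tol = max_iters, tol

    def forward(self, x):                      # x: (batch, seq, d_model)
        h, prev_attn = x, None
        for _ in range(self.max_iters):        # test-time compute scales
            attn = torch.softmax(              # with input difficulty
                self.q(h) @ self.k(x).transpose(-2, -1) / self.scale, dim=-1)
            h = attn @ self.v(x)               # remix the *original* sequence
            if prev_attn is not None and (attn - prev_attn).abs().max() < self.tol:
                break                          # alignment reached a fixed point
            prev_attn = attn
        return h

layer = SelfRefiningAttention(d_model=32)
print(layer(torch.randn(2, 16, 32)).shape)     # torch.Size([2, 16, 32])
```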

[251] Apple Intelligence Foundation Language Models: Tech Report 2025

Hanzhi Zhou, Erik Hornberger, Pengsheng Guo, Xiyou Zhou, Saiwen Wang, Xin Wang, Yifei He, Xuankai Chang, Rene Rauch, Louis D’hauwe, John Peebles, Alec Doane, Kohen Chia, Jenna Thibodeau, Zi-Yi Dou, Yuanyang Zhang, Ruoming Pang, Reed Li, Zhifeng Chen, Jeremy Warner, Zhaoyang Xu, Sophy Lee, David Mizrahi, Ramsey Tantawi, Chris Chaney, Kelsey Peterson, Jun Qin, Alex Dombrowski, Mira Chiang, Aiswarya Raghavan, Gerard Casamayor, Qibin Chen, Aonan Zhang, Nathalie Tran, Jianyu Wang, Hang Su, Thomas Voice, Alessandro Pappalardo, Brycen Wershing, Prasanth Yadla, Rui Li, Priyal Chhatrapati, Ismael Fernandez, Yusuf Goren, Xin Zheng, Forrest Huang, Tao Lei, Eray Yildiz, Alper Kokmen, Gokul Santhanam, Areeba Kamal, Kaan Elgin, Dian Ang Yap, Jeremy Liu, Peter Gray, Howard Xing, Kieran Liu, Matteo Ronchi, Moritz Schwarzer-Becker, Yun Zhu, Mandana Saebi, Jeremy Snow, David Griffiths, Guillaume Tartavel, Erin Feldman, Simon Lehnerer, Fernando Bermúdez-Medina, Hans Han, Joe Zhou, Xiaoyi Ren, Sujeeth Reddy, Zirui Wang, Tom Gunter, Albert Antony, Yuanzhi Li, John Dennison, Tony Sun, Yena Han, Yi Qin, Sam Davarnia, Jeffrey Bigham, Wayne Shan, Hannah Gillis Coleman, Guillaume Klein, Peng Liu, Muyang Yu, Jack Cackler, Yuan Gao, Crystal Xiao, Binazir Karimzadeh, Zhengdong Zhang, Felix Bai, Albin Madappally Jose, Feng Nan, Nazir Kamaldin, Dong Yin, Hans Hao, Yanchao Sun, Yi Hua, Charles Maalouf, Alex Guillen Garcia, Guoli Yin, Lezhi Li, Mohana Prasad Sathya Moorthy, Hongbin Gao, Jay Tang, Joanna Arreaza-Taylor, Faye Lao, Carina Peng, Josh Shaffer, Dan Masi, Sushma Rao, Tommi Vehvilainen, Senyu Tong, Dongcai Shen, Yang Zhao, Chris Bartels, Peter Fu, Qingqing Cao, Christopher Neubauer, Ethan Li, Mingfei Gao, Rebecca Callahan, Richard Wei, Patrick Dong, Alex Braunstein, Sachin Ravi, Adolfo Lopez Mendez, Kaiwei Huang, Kun Duan, Haoshuo Huang, Rui Qian, Stefano Ligas, Jordan Huffaker, Dongxu Li, Bailin Wang, Nanzhu Wang, Anuva Agarwal, Tait Madsen, Josh Newnham, Abhishek Sharma, Zhile Ren, Deepak Gopinath, Erik Daxberger, Saptarshi Guha, Oron Levy, Jing Lu, Nan Dun, Marc Kirchner, Yinfei Yang, Manjot Bilkhu, Dave Nelson, Anthony Spalvieri-Kruse, Juan Lao Tebar, Yang Xu, Phani Mutyala, Gabriel Jacoby-Cooper, Yingbo Wang, Karla Vega, Vishaal Mahtani, Darren Botten, Eric Wang, Hanli Li, Matthias Paulik, Haoran Yan, Navid Shiee, Yihao Qian, Bugu Wu, Qi Zhu, Ob Adaranijo, Bhuwan Dhingra, Zhe Gan, Nicholas Seidl, Grace Duanmu, Rong Situ, Yiping Ma, Yin Xia, David Riazati, Vasileios Saveris, Anh Nguyen, Michael, Lee, Patrick Sonnenberg, Chinguun Erdenebileg, Yanghao Li, Vivian Ma, James Chou, Isha Garg, Mark Lee, Keen You, Yuhong Li, Ransen Niu, Nandhitha Raghuram, Pulkit Agrawal, Henry Mason, Sumeet Singh, Keyu He, Hong-You Chen, Lucas Guibert, Shiyu Li, Varsha Paidi, Narendran Raghavan, Mingze Xu, Yuli Yang, Sergiu Sima, Irina Belousova, Sprite Chu, Afshin Dehghan, Philipp Dufter, David Haldimann, Zhen Yang, Margit Bowler, Chang Liu, Ying-Chang Cheng, Vivek Rathod, Syd Evans, Wilson Tsao, Dustin Withers, Haitian Sun, Biyao Wang, Peter Grasch, Walker Cheng, Yihao Feng, Vivek Kumar, Frank Chu, Victoria MönchJuan Haladjian, Doug Kang, Jiarui Lu, Ciro Sannino, Max Lam, Floris Weers, Bowen Pan, Kenneth Jung, Dhaval Doshi, Fangping Shi, Olli Saarikivi, Alp Aygar, Josh Elman, Cheng Leong, Eshan Verma, Matthew Lei, Jeff Nichols, Jiulong Shan, Donald Zhang, Lawrence Zhou, Stephen Murphy, Xianzhi Du, Chang Lan, Ankur Jain, Elmira Amirloo, Marcin Eichner, Naomy Sabo, Anupama Mann Anupama, David Qiu, Zhao Meng, Michael FitzMaurice, 
Peng Zhang, Simon Yeung, Chen Chen, Marco Zuliani, Andrew Hansen, Yang Lu, Brent Ramerth, Ziyi Zhong, Parsa Mazaheri, Matthew Hopkins, Mengyu Li, Simon Wang, David Chen, Farzin Rasteh, Chong Wang, Josh Gardner, Asaf Liberman, Haoxuan You, Andrew Walkingshaw, Xingyu Zhou, Jinhao Lei, Yan Meng, Quentin Keunebroek, Sam Wiseman, Anders Boesen Lindbo Larsen, Yi Zhang, Zaid Ahmed, Haiming Gang, Aaron Franklin, Kelvin Zou, Guillaume Seguin, Jonathan Janke, Rachel Burger, Co Giang, Cheng Shen, Jen Liu, Sanskruti Shah, Xiang Kong, Yiran Fei, TJ Collins, Chen Zhang, Zhiyun Lu, Michael Booker, Qin Ba, Yasutaka Tanaka, Andres Romero Mier Y Teran, Federico Scozzafava, Regan Poston, Jane Li, Eduardo Jimenez, Bas Straathof, Karanjeet Singh, Lindsay Hislop, Rajat Arora, Deepa Seshadri, Boyue Li, Colorado Reed, Zhen Li, TJ Lu, Yi Wang, Kaelen Haag, Nicholas Lusskin, Raunak Sinha, Rahul Nair, Eldon Schoop, Mary Beth Kery, Mehrdad Farajtbar, Brenda Yang, George Horrell, Shiwen Zhao, Dhruti Shah, Cha Chen, Bowen Zhang, Chang Gao, Devi Krishna, Jennifer Mallalieu, Javier Movellan, Di Feng, Emily Zhang, Sam Xu, Junting Pan, Dominik Moritz, Suma Jayaram, Kevin Smith, Dongseong Hwang, Daniel Parilla, Jiaming Hu, You-Cyuan Jhang, Emad Soroush, Fred Hohman, Nan Du, Emma Wang, Sam Dodge, Pragnya Sridhar, Joris Pelemans, Wei Fang, Nina Wenzel, Joseph Yitan Cheng, Hadas Kotek, Chung-Cheng Chiu, Meng Cao, Haijing Fu, Ruixuan Hou, Ke Ye, Diane Zhu, Nikhil Bhendawade, Joseph Astrauskas, Jian Liu, Sai Aitharaju, Wentao Wu, Artsiom Peshko, Hyunjik Kim, Nilesh Shahdadpuri, Andy De Wang, Qi Shan, Piotr Maj, Raul Rea Menacho, Justin Lazarow, Eric Liang Yang, Arsalan Farooq, Donghan Yu, David Güera, Minsik Cho, Kavya Nerella, Yongqiang Wang, Tao Jia, John Park, Jeff Lai, Haotian Zhang, Futang Peng, Daniele Molinari, Aparna Rajamani, Tyler Johnson, Lauren Gardiner, Chao Jia, Violet Yao, Wojciech Kryscinski, Xiujun Li, Shang-Chen Wu

Main category: cs.LG

TL;DR: Apple introduces two multilingual, multimodal foundation models for on-device and server use, optimized with innovative techniques and trained on diverse datasets, outperforming benchmarks while ensuring privacy and responsibility.

DetailsMotivation: To enhance Apple Intelligence features with efficient, high-quality models that support multiple languages and modalities while prioritizing user privacy and responsible AI practices.

Method: Developed a 3B-parameter on-device model with KV-cache sharing and 2-bit quantization, and a server model using PT-MoE transformer with track parallelism and sparse computation. Trained on multilingual, multimodal datasets and refined with supervised fine-tuning and reinforcement learning.

Result: Both models match or surpass open benchmarks, support additional languages, and handle images and tool calls effectively.

Conclusion: Apple’s models demonstrate superior performance and scalability, backed by a responsible AI framework and privacy innovations like Private Cloud Compute.

Abstract: We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple’s Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users’ privacy with innovations like Private Cloud Compute.
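
The 2-bit quantization-aware training mentioned in the abstract typically relies on fake quantization with a straight-through estimator. A generic sketch of that building block (not Apple’s specific scheme; the symmetric level placement is an illustrative assumption):

```python
import torch

class FakeQuant2Bit(torch.autograd.Function):
    """Straight-through 2-bit fake quantization: weights are rounded to one
    of 4 symmetric levels in the forward pass, while gradients flow through
    unchanged to the full-precision weights."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / 1.5 + 1e-12   # levels {-1.5, -0.5, 0.5, 1.5} * scale
        q = torch.clamp(torch.round(w / scale - 0.5) + 0.5, -1.5, 1.5)
        return q * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                        # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)
wq = FakeQuant2Bit.apply(w)
print(torch.unique(wq).numel() <= 4)           # at most 4 distinct weight values
wq.sum().backward()                            # gradients reach the FP weights
print(w.grad.abs().sum() > 0)
```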

[252] Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries

Hyunji Nam, Yanming Wan, Mickel Liu, Jianxun Lian, Natasha Jaques

Main category: cs.LG

TL;DR: PLUS is a framework for personalizing LLM responses by learning user-specific summaries to condition reward models, outperforming traditional RLHF and enabling zero-shot personalization.

DetailsMotivation: Traditional RLHF lacks personalization by modeling all users with a single reward model, ignoring individual preferences.

Method: PLUS learns text-based summaries of user preferences and updates the reward model in an online co-adaptation loop.

Result: PLUS captures meaningful user preferences, works robustly across datasets, and enables zero-shot personalization of models like GPT-4.

Conclusion: PLUS offers personalized, interpretable, and portable user summaries, enhancing transparency and user control in LLM alignment.

Abstract: As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user’s preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning, and update the reward model simultaneously, creating an online co-adaptation loop. We show that in contrast with prior personalized RLHF techniques or with in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user’s preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4. The resulting user summaries are not only concise and portable, they are easy for users to interpret and modify, allowing for more transparency and user control in LLM alignment.
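
The conditioning interface is the heart of PLUS: the reward model scores a response given a textual user summary. A toy sketch of that interface (the hashing embedder and layer sizes are stand-ins for a real sentence encoder; the actual system also trains the summarizer with RL in a co-adaptation loop):

```python
import zlib
import numpy as np
import torch
import torch.nn as nn

def hash_embed(text, dim=64):
    """Toy stand-in for a sentence encoder: hashed bag-of-words."""
    v = np.zeros(dim, dtype=np.float32)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    v /= np.linalg.norm(v) + 1e-8
    return torch.from_numpy(v)

class SummaryConditionedRewardModel(nn.Module):
    """Reward head scoring a candidate response *conditioned on* a textual
    user summary (the PLUS idea in miniature)."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, summary, response):
        z = torch.cat([hash_embed(summary), hash_embed(response)])
        return self.mlp(z)

rm = SummaryConditionedRewardModel()
summary = "prefers concise answers with code examples"
r1 = rm(summary, "short answer with a snippet")
r2 = rm(summary, "a very long narrative explanation")
print(float(r1), float(r2))   # untrained scores; training would separate them
```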

[253] Off-Policy Evaluation and Learning for Matching Markets

Yudai Hayashi, Shuhei Goda, Yuta Saito

Main category: cs.LG

TL;DR: Proposes novel OPE estimators (DiPS and DPR) for matching markets to address variance and reward sparsity, outperforming conventional methods in offline evaluation and policy learning.

DetailsMotivation: A/B tests are costly for frequent policy updates in matching markets, and standard OPE methods are unreliable due to large-scale, bidirectional interactions.

Method: Combines DM, IPS, and DR estimators with intermediate labels for better bias-variance control. Theoretically analyzes bias and variance, and extends to offline policy learning.

Result: Empirical evaluation on synthetic and real job-matching data shows superiority over existing methods in OPE and policy learning.

Conclusion: DiPS and DPR effectively address challenges in matching markets, enabling reliable offline evaluation and improved recommendation policies.

Abstract: Matching users based on mutual preferences is a fundamental aspect of services driven by reciprocal recommendations, such as job search and dating applications. Although A/B tests remain the gold standard for evaluating new policies in recommender systems for matching markets, it is costly and impractical for frequent policy updates. Off-Policy Evaluation (OPE) thus plays a crucial role by enabling the evaluation of recommendation policies using only offline logged data naturally collected on the platform. However, unlike conventional recommendation settings, the large scale and bidirectional nature of user interactions in matching platforms introduce variance issues and exacerbate reward sparsity, making standard OPE methods unreliable. To address these challenges and facilitate effective offline evaluation, we propose novel OPE estimators, DiPS and DPR, specifically designed for matching markets. Our methods combine elements of the Direct Method (DM), Inverse Propensity Score (IPS), and Doubly Robust (DR) estimators while incorporating intermediate labels, such as initial engagement signals, to achieve better bias-variance control in matching markets. Theoretically, we derive the bias and variance of the proposed estimators and demonstrate their advantages over conventional methods. Furthermore, we show that these estimators can be seamlessly extended to offline policy learning methods for improving recommendation policies for making more matches. We empirically evaluate our methods through experiments on both synthetic data and A/B testing logs from a real job-matching platform. The empirical results highlight the superiority of our approach over existing methods in off-policy evaluation and learning tasks for a variety of configurations.
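
DiPS and DPR build on the three standard estimators named in the abstract. A compact numpy sketch of those baselines on synthetic logged data (the paper’s contribution, incorporating intermediate engagement labels, is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 5000, 4
x = rng.integers(0, 10, size=n)               # logged contexts (discrete for simplicity)
pi0 = np.full((n, n_actions), 1 / n_actions)  # uniform logging policy
a = rng.integers(0, n_actions, size=n)        # logged actions
q_true = (x[:, None] % n_actions == np.arange(n_actions)).astype(float)
r = rng.binomial(1, 0.1 + 0.8 * q_true[np.arange(n), a])   # logged rewards

pi_e = np.eye(n_actions)[(x + 1) % n_actions]  # deterministic target policy
q_hat = q_true * 0.8 + 0.1                     # imperfect reward model (stand-in)

w = pi_e[np.arange(n), a] / pi0[np.arange(n), a]            # importance weights

V_dm = (pi_e * q_hat).sum(axis=1).mean()                    # Direct Method
V_ips = (w * r).mean()                                      # Inverse Propensity Scoring
V_dr = V_dm + (w * (r - q_hat[np.arange(n), a])).mean()     # Doubly Robust
print(f"DM={V_dm:.3f}  IPS={V_ips:.3f}  DR={V_dr:.3f}")     # all near the true value 0.1
```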

[254] Generalist Bimanual Manipulation via Foundation Video Diffusion Models

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: VIDAR is a two-stage framework using video diffusion pre-training and masked inverse dynamics for bimanual robotic manipulation, achieving strong generalization with minimal data.

DetailsMotivation: Data scarcity and embodiment heterogeneity hinder scaling in bimanual robotic manipulation.

Method: VIDAR combines large-scale video diffusion pre-training (750K multi-view videos) with a masked inverse dynamics model for action prediction.

Result: With only 20 minutes of human demonstrations (1% of typical data), VIDAR generalizes to unseen tasks and backgrounds, outperforming state-of-the-art methods.

Conclusion: Video foundation models with masked action prediction enable scalable and generalizable robotic manipulation in diverse real-world settings.

Abstract: Bimanual robotic manipulation, which involves the coordinated control of two robotic arms, is foundational for solving challenging tasks. Despite recent progress in general-purpose manipulation, data scarcity and embodiment heterogeneity remain serious obstacles to further scaling up in bimanual settings. In this paper, we introduce VIdeo Diffusion for Action Reasoning (VIDAR), a two-stage framework that leverages large-scale, diffusion-based video pre-training and a novel masked inverse dynamics model for action prediction. We pre-train the video diffusion model on 750K multi-view videos from three real-world bimanual robot platforms, utilizing a unified observation space that encodes robot, camera, task, and scene contexts. Our masked inverse dynamics model learns masks to extract action-relevant information from generated trajectories without requiring pixel-level labels, and the masks can effectively generalize to unseen backgrounds. Our experiments demonstrate that with only 20 minutes of human demonstrations on an unseen robot platform (only 1% of typical data requirements), VIDAR generalizes to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods. Our findings highlight the potential of video foundation models, coupled with masked action prediction, to enable scalable and generalizable robotic manipulation in diverse real-world settings.

[255] Tri-Learn Graph Fusion Network for Attributed Graph Clustering

Binxiong Li, Yuefei Wang, Xu Xiang, Xue Li, Binyu Zhao, Heyang Gao, Qinyu Zhao, Xi Yu

Main category: cs.LG

TL;DR: The paper introduces Tri-GFN, a deep clustering framework combining GCN, AE, and Graph Transformer to address challenges like over-smoothing in graph data analysis. It improves clustering accuracy significantly on benchmark datasets.

DetailsMotivation: Existing GCN models face issues like over-smoothing and limited performance on heterogeneous graph data. Graph Transformers help but still fall short, prompting the need for a more robust solution.

Method: Proposes Tri-GFN, integrating GCN, AE, and Graph Transformer via a tri-learning mechanism and feature fusion strategy to enhance global and local information differentiation.

Result: Achieves accuracy improvements of 0.87% (ACM), 14.14% (Reuters), and 7.58% (USPS), outperforming state-of-the-art methods.

Conclusion: Tri-GFN excels in graph clustering, with potential applications in news classification and topic retrieval, demonstrating superior performance on heterogeneous datasets.

Abstract: In recent years, models based on Graph Convolutional Networks (GCN) have made significant strides in the field of graph data analysis. However, challenges such as over-smoothing and over-compression remain when handling large-scale and complex graph datasets, leading to a decline in clustering quality. Although the Graph Transformer architecture has mitigated some of these issues, its performance is still limited when processing heterogeneous graph data. To address these challenges, this study proposes a novel deep clustering framework comprising GCN, Autoencoder (AE), and Graph Transformer, termed the Tri-Learn Graph Fusion Network (Tri-GFN). This framework enhances the differentiation and consistency of global and local information through a unique tri-learning mechanism and feature fusion enhancement strategy. The framework integrates GCN, AE, and Graph Transformer modules. These components are meticulously fused by a triple-channel enhancement module, which maximizes the use of both node attributes and topological structures, ensuring robust clustering representation. The tri-learning mechanism allows mutual learning among these modules, while the feature fusion strategy enables the model to capture complex relationships, yielding highly discriminative representations for graph clustering. It surpasses many state-of-the-art methods, achieving an accuracy improvement of approximately 0.87% on the ACM dataset, 14.14% on the Reuters dataset, and 7.58% on the USPS dataset. Due to its outstanding performance on the Reuters dataset, Tri-GFN can be applied to automatic news classification, topic retrieval, and related fields.

[256] FedSkipTwin: Digital-Twin-Guided Client Skipping for Communication-Efficient Federated Learning

Daniel Commey, Kamel Abbad, Garth V. Crosby, Lyes Khoukhi

Main category: cs.LG

TL;DR: FedSkipTwin reduces FL communication overhead by 12-15.5% and improves accuracy by 0.5% using server-side LSTM twins to predict client updates.

DetailsMotivation: Communication overhead in FL, especially for mobile/IoT devices with limited bandwidth, is a bottleneck.

Method: FedSkipTwin uses server-side LSTM twins to predict client update magnitude and uncertainty, skipping rounds when predictions fall below thresholds.

Result: Reduced communication by 12-15.5% and improved accuracy by 0.5% on UCI-HAR and MNIST datasets.

Conclusion: Prediction-guided skipping is effective for resource-aware FL in bandwidth-constrained environments.

Abstract: Communication overhead remains a primary bottleneck in federated learning (FL), particularly for applications involving mobile and IoT devices with constrained bandwidth. This work introduces FedSkipTwin, a novel client-skipping algorithm driven by lightweight, server-side digital twins. Each twin, implemented as a simple LSTM, observes a client’s historical sequence of gradient norms to forecast both the magnitude and the epistemic uncertainty of its next update. The server leverages these predictions, requesting communication only when either value exceeds a predefined threshold; otherwise, it instructs the client to skip the round, thereby saving bandwidth. Experiments are conducted on the UCI-HAR and MNIST datasets with 10 clients under a non-IID data distribution. The results demonstrate that FedSkipTwin reduces total communication by 12-15.5% across 20 rounds while simultaneously improving final model accuracy by up to 0.5 percentage points compared to the standard FedAvg algorithm. These findings establish that prediction-guided skipping is a practical and effective strategy for resource-aware FL in bandwidth-constrained edge environments.
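
The skip rule is simple to express. A sketch of the server-side twin, with an LSTM forecasting the next gradient norm and MC dropout standing in for the uncertainty estimate (sizes and thresholds are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GradientNormTwin(nn.Module):
    """Server-side digital twin of one client: forecasts the client's next
    gradient norm from its history."""
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.drop = nn.Dropout(0.2)            # kept active for MC dropout
        self.head = nn.Linear(hidden, 1)

    def forward(self, history):                # history: (1, T, 1)
        out, _ = self.lstm(history)
        return self.head(self.drop(out[:, -1]))

def should_communicate(twin, history, mag_thresh=0.5, unc_thresh=0.1, samples=20):
    twin.train()                               # keep dropout on for uncertainty
    with torch.no_grad():
        preds = torch.stack([twin(history) for _ in range(samples)])
    magnitude, uncertainty = preds.mean().item(), preds.std().item()
    return magnitude > mag_thresh or uncertainty > unc_thresh

twin = GradientNormTwin()
history = torch.tensor([[0.9], [0.7], [0.55], [0.48]]).reshape(1, 4, 1)
print(should_communicate(twin, history))       # False -> instruct client to skip the round
```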

[257] A Comprehensive Review of Transformer-based language models for Protein Sequence Analysis and Design

Nimisha Ghosh, Daniele Santoni, Debaleena Nawn, Eleonora Ottaviani, Giovanni Felici

Main category: cs.LG

TL;DR: A review of Transformer-based models for protein sequence analysis and design, covering applications like gene ontology, protein identification, and de novo protein generation, while highlighting strengths, weaknesses, and future research directions.

DetailsMotivation: To explore the adoption and impact of Transformer-based models in bioinformatics, specifically for protein sequence analysis and design, and to provide a comprehensive overview for researchers.

Method: Review and analysis of significant works applying Transformer-based models to protein-related tasks, evaluating their strengths and weaknesses.

Result: Identified key applications and limitations of Transformer models in protein analysis, along with gaps in current research.

Conclusion: The review serves as a guide for researchers, summarizing state-of-the-art advancements and suggesting future directions for improving Transformer-based approaches in bioinformatics.

Abstract: The impact of Transformer-based language models has been unprecedented in Natural Language Processing (NLP). The success of such models has also led to their adoption in other fields including bioinformatics. Taking this into account, this paper discusses recent advances in Transformer-based models for protein sequence analysis and design. In this review, we have discussed and analysed a significant number of works pertaining to such applications. These applications encompass gene ontology, functional and structural protein identification, generation of de novo proteins and binding of proteins. We attempt to shed light on the strength and weaknesses of the discussed works to provide a comprehensive insight to readers. Finally, we highlight shortcomings in existing research and explore potential avenues for future developments. We believe that this review will help researchers working in this field to have an overall idea of the state of the art in this field, and to orient their future studies.

[258] Kolmogorov-Arnold Networks-based GRU and LSTM for Loan Default Early Prediction

Yue Yang, Zihan Su, Ying Zhang, Chang Chuan Goh, Yuxiang Lin, Anthony Graham Bellotti, Boon Giin Lee

Main category: cs.LG

TL;DR: The paper introduces GRU-KAN and LSTM-KAN models for early loan default prediction, outperforming existing methods with over 92% accuracy three months in advance.

DetailsMotivation: To improve early loan default prediction for financial institutions, addressing limitations of current methods like accuracy and time-frame dependency.

Method: Proposes GRU-KAN and LSTM-KAN, combining Kolmogorov-Arnold Networks with GRU and LSTM, evaluated against baseline models on accuracy, precision, recall, F1, and AUC.

Result: Achieves 92% accuracy three months in advance and 88% eight months in advance, surpassing baselines.

Conclusion: The proposed models significantly enhance early loan default prediction, offering practical value for financial risk management.

Abstract: This study addresses a critical challenge in time series anomaly detection: enhancing the predictive capability of loan default models more than three months in advance to enable early identification of default events, helping financial institutions implement preventive measures before risk events materialize. Existing methods have significant drawbacks, such as their lack of accuracy in early predictions and their dependence on training and testing within the same year and specific time frames. These issues limit their practical use, particularly with out-of-time data. To address these, the study introduces two innovative architectures, GRU-KAN and LSTM-KAN, which merge Kolmogorov-Arnold Networks (KAN) with Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM) networks. The proposed models were evaluated against the baseline models (LSTM, GRU, LSTM-Attention, and LSTM-Transformer) in terms of accuracy, precision, recall, F1, and AUC across different feature window lengths, sample sizes, and early prediction intervals. The results demonstrate that the proposed model achieves a prediction accuracy of over 92% three months in advance and over 88% eight months in advance, significantly outperforming existing baselines.
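
A simplified sketch of the GRU-KAN idea: a GRU encoder followed by a KAN-style head whose per-edge functions are learnable combinations of a fixed RBF basis (real KAN implementations use B-splines with adaptive grids; sizes here are illustrative):

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """Simplified KAN layer: each input-output edge gets its own learnable
    1-D function, parameterized over a fixed Gaussian RBF basis (a sketch
    of the idea, not a full KAN)."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2, 2, n_basis), requires_grad=False)
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x):                       # x: (batch, in_dim)
        basis = torch.exp(-((x[..., None] - self.centers) ** 2))  # (B, in, K)
        return torch.einsum("bik,oik->bo", basis, self.coef)      # y_o = sum_j phi_oj(x_j)

class GRUKAN(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.kan = RBFKANLayer(hidden, 1)

    def forward(self, x):                       # x: (batch, T, n_features)
        out, _ = self.gru(x)
        return torch.sigmoid(self.kan(out[:, -1]))   # default probability

model = GRUKAN(n_features=10)
print(model(torch.randn(4, 12, 10)).shape)      # torch.Size([4, 1])
```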

[259] Binarizing Physics-Inspired GNNs for Combinatorial Optimization

Martin Krutský, Gustav Šír, Vyacheslav Kungurtsev, Georgios Korpas

Main category: cs.LG

TL;DR: PI-GNNs show declining performance with denser problem graphs due to a phase transition and solution discrepancy. Proposed methods based on fuzzy logic and binarized networks improve results.

DetailsMotivation: Address the performance drop of PI-GNNs in dense combinatorial problem graphs and bridge the gap between relaxed model outputs and binary solutions.

Method: Analyze PI-GNNs’ training dynamics, identify phase transition, and propose alternatives inspired by fuzzy logic and binarized neural networks.

Result: Performance of PI-GNNs declines with graph density; proposed methods significantly improve results in dense settings.

Conclusion: The study highlights limitations of PI-GNNs in dense graphs and offers effective solutions to enhance their performance.

Abstract: Physics-inspired graph neural networks (PI-GNNs) have been utilized as an efficient unsupervised framework for relaxing combinatorial optimization problems encoded through a specific graph structure and loss, reflecting dependencies between the problem’s variables. While the framework has yielded promising results in various combinatorial problems, we show that the performance of PI-GNNs systematically plummets with an increasing density of the combinatorial problem graphs. Our analysis reveals an interesting phase transition in the PI-GNNs’ training dynamics, associated with degenerate solutions for the denser problems, highlighting a discrepancy between the relaxed, real-valued model outputs and the binary-valued problem solutions. To address the discrepancy, we propose principled alternatives to the naive strategy used in PI-GNNs by building on insights from fuzzy logic and binarized neural networks. Our experiments demonstrate that the portfolio of proposed methods significantly improves the performance of PI-GNNs in increasingly dense settings.
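
One binarized-network-inspired remedy is a straight-through estimator, which lets training see hard binary assignments while gradients flow through the relaxation. A minimal sketch on a Max-Cut QUBO, with learnable logits standing in for the GNN’s node outputs (illustrative, not the authors’ exact method):

```python
import torch

torch.manual_seed(0)

class Binarize(torch.autograd.Function):
    """Straight-through binarization: hard 0/1 forward, identity backward."""
    @staticmethod
    def forward(ctx, p):
        return (p > 0.5).float()
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out

def maxcut_qubo_loss(x, edges):
    """QUBO Hamiltonian whose minimum is the maximum cut:
    H = sum over edges (i,j) of 2*x_i*x_j - x_i - x_j."""
    i, j = edges[:, 0], edges[:, 1]
    return (2 * x[i] * x[j] - x[i] - x[j]).sum()

# Toy 5-cycle; learnable per-node logits stand in for the PI-GNN.
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4], [4, 0]])
logits = torch.randn(5, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(300):
    x = Binarize.apply(torch.sigmoid(logits))   # optimize through *binary* states
    loss = maxcut_qubo_loss(x, edges)
    opt.zero_grad(); loss.backward(); opt.step()

x = Binarize.apply(torch.sigmoid(logits))
print(x.tolist(), "cut size:", int(-maxcut_qubo_loss(x, edges)))  # max cut of a 5-cycle is 4
```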

[260] Bayesian Optimization for Molecules Should Be Pareto-Aware

Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige

Main category: cs.LG

TL;DR: Multi-objective Bayesian optimization (MOBO) with EHVI outperforms scalarized EI in molecular design, showing better Pareto front coverage, speed, and diversity.

DetailsMotivation: To empirically compare MOBO (using EHVI) with scalarized alternatives (using EI) in molecular optimization tasks.

Method: Benchmarked EHVI against scalarized EI using identical Gaussian Process surrogates and molecular representations under controlled conditions.

Result: EHVI consistently outperformed scalarized EI in Pareto front coverage, convergence speed, and chemical diversity.

Conclusion: Pareto-aware acquisition (EHVI) is advantageous in molecular optimization, especially with limited evaluation budgets and non-trivial trade-offs.

Abstract: Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy – Expected Hypervolume Improvement (EHVI) – against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants – including random or adaptive schemes – our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.
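
The quantity driving EHVI is the hypervolume improvement a candidate adds to the incumbent Pareto front. In two objectives it has a simple exact form (numpy sketch; real EHVI integrates this quantity over the GP posterior rather than evaluating it at a point):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Exact hypervolume dominated by a 2-objective *maximization* front
    relative to a reference point `ref` (larger is better in both)."""
    pts = np.asarray([p for p in front if (p > ref).all()])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]        # sweep by f1 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > prev_f2:                     # non-dominated step
            hv += (f1 - ref[0]) * (f2 - prev_f2)
            prev_f2 = f2
    return hv

def hypervolume_improvement(candidate, front, ref):
    return hypervolume_2d(front + [candidate], ref) - hypervolume_2d(front, ref)

front = [np.array([0.9, 0.2]), np.array([0.6, 0.6]), np.array([0.2, 0.9])]
ref = np.array([0.0, 0.0])
print(hypervolume_improvement(np.array([0.8, 0.5]), front, ref))  # 0.06: extends the front
print(hypervolume_improvement(np.array([0.3, 0.3]), front, ref))  # 0.0: dominated
```

Scalarized EI, by contrast, collapses the objectives to a single weighted score before acquisition, which is what the benchmark shows can underperform when trade-offs matter.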

[261] Learning Deformable Body Interactions With Adaptive Spatial Tokenization

Hao Wang, Yu Liu, Daniel Biggs, Haoru Wang, Jiandong Yu, Ping Huang

Main category: cs.LG

TL;DR: Proposes Adaptive Spatial Tokenization (AST) for scalable simulation of deformable body interactions using grid-based tokenization and attention mechanisms.

DetailsMotivation: Address scalability issues in learning-based methods (e.g., GNNs) for modeling deformable body interactions, which are computationally intensive for large-scale meshes.

Method: Divides simulation space into a grid, maps unstructured meshes onto it, groups adjacent nodes, and uses cross-attention for compact embeddings. Self-attention predicts next states in latent space.

Result: Outperforms state-of-the-art methods, especially in large-scale simulations (100,000+ nodes), and introduces a novel dataset for future research.

Conclusion: AST combines tokenization efficiency and attention mechanisms for accurate, scalable simulations, advancing deformable body interaction modeling.

Abstract: Simulating interactions between deformable bodies is vital in fields like material science, mechanical design, and robotics. While learning-based methods with Graph Neural Networks (GNNs) are effective at solving complex physical systems, they encounter scalability issues when modeling deformable body interactions. To model interactions between objects, pairwise global edges have to be created dynamically, which is computationally intensive and impractical for large-scale meshes. To overcome these challenges, drawing on insights from geometric representations, we propose an Adaptive Spatial Tokenization (AST) method for efficient representation of physical states. By dividing the simulation space into a grid of cells and mapping unstructured meshes onto this structured grid, our approach naturally groups adjacent mesh nodes. We then apply a cross-attention module to map the sparse cells into a compact, fixed-length embedding, serving as tokens for the entire physical state. Self-attention modules are employed to predict the next state over these tokens in latent space. This framework leverages the efficiency of tokenization and the expressive power of attention mechanisms to achieve accurate and scalable simulation results. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in modeling deformable body interactions. Notably, it remains effective on large-scale simulations with meshes exceeding 100,000 nodes, where existing methods are hindered by computational limitations. Additionally, we contribute a novel large-scale dataset encompassing a wide range of deformable body interactions to support future research in this area.
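
The tokenization step can be sketched compactly: voxelize the nodes, mean-pool per occupied cell, and let a fixed set of learnable tokens cross-attend to the cells (sizes and grid resolution are illustrative, not the paper’s configuration):

```python
import torch
import torch.nn as nn

class SpatialTokenizer(nn.Module):
    """Sketch of grid-based tokenization: mesh nodes are pooled into occupied
    voxel cells, then a fixed set of learnable tokens cross-attends to the
    cells, yielding a fixed-length embedding of the physical state."""
    def __init__(self, feat_dim, d_model=64, n_tokens=16, grid=8):
        super().__init__()
        self.grid = grid
        self.cell_proj = nn.Linear(feat_dim + 3, d_model)
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, pos, feat):               # pos: (N, 3) in [0,1], feat: (N, F)
        cell_id = (pos.clamp(0, 1 - 1e-6) * self.grid).long()
        flat = (cell_id * torch.tensor([self.grid**2, self.grid, 1])).sum(-1)
        uniq, inv = torch.unique(flat, return_inverse=True)
        # Mean-pool node positions and features into their occupied cells.
        pooled = torch.zeros(len(uniq), pos.shape[1] + feat.shape[1])
        pooled.index_add_(0, inv, torch.cat([pos, feat], dim=1))
        counts = torch.zeros(len(uniq)).index_add_(0, inv, torch.ones(len(inv)))
        cells = self.cell_proj(pooled / counts[:, None]).unsqueeze(0)
        # Fixed-length tokens attend to the variable number of occupied cells.
        out, _ = self.attn(self.tokens.unsqueeze(0), cells, cells)
        return out.squeeze(0)                   # (n_tokens, d_model)

tok = SpatialTokenizer(feat_dim=5)
emb = tok(torch.rand(1000, 3), torch.randn(1000, 5))
print(emb.shape)                                # torch.Size([16, 64])
```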

[262] Benchmarking of EEG Analysis Techniques for Parkinson’s Disease Diagnosis: A Comparison between Traditional ML Methods and Foundation DL Methods

Danilo Avola, Andrea Bernardini, Giancarlo Crocetti, Andrea Ladogana, Mario Lezoche, Maurizio Mancini, Daniele Pannone, Amedeo Ranaldi

Main category: cs.LG

TL;DR: The paper benchmarks traditional ML and DL models for classifying Parkinson’s Disease (PD) using EEG data, aiming to establish reliable baselines for future research.

DetailsMotivation: Early PD diagnosis is critical, and EEG offers a non-invasive method, but reliable automated models are lacking. This study aims to compare ML and DL approaches to identify the best-performing models.

Method: A seven-step preprocessing pipeline is applied to a public oddball task dataset, with consistent cross-validation and evaluation criteria. Models include CNN-LSTM (DL) and XGBoost (ML).

Result: CNN-LSTM models perform best, highlighting the importance of temporal dependencies, while XGBoost also shows strong accuracy and calibrated decisions.

Conclusion: The study provides a reference framework for future EEG-based PD diagnostics, emphasizing the need for rigorous baselines to ensure scientific rigor and reproducibility.

Abstract: Parkinson’s Disease (PD) is a progressive neurodegenerative disorder that affects motor and cognitive functions, with early diagnosis being critical for effective clinical intervention. Electroencephalography (EEG) offers a noninvasive and cost-effective means of detecting PD-related neural alterations, yet the development of reliable automated diagnostic models remains a challenge. In this study, we conduct a systematic benchmark of traditional machine learning (ML) and deep learning (DL) models for classifying PD using a publicly available oddball task dataset. Our aim is to lay the groundwork for developing an effective learning system and to determine which approach produces the best results. We implement a unified seven-step preprocessing pipeline and apply consistent subject-wise cross-validation and evaluation criteria to ensure comparability across models. Our results demonstrate that while baseline deep learning architectures, particularly CNN-LSTM models, achieve the best performance compared to other deep learning architectures, underlining the importance of capturing long-range temporal dependencies, several traditional classifiers such as XGBoost also offer strong predictive accuracy and calibrated decision boundaries. By rigorously comparing these baselines, our work provides a solid reference framework for future studies aiming to develop and evaluate more complex or specialized architectures. Establishing a reliable set of baseline results is essential to contextualize improvements introduced by novel methods, ensuring scientific rigor and reproducibility in the evolving field of EEG-based neurodiagnostics.

[263] Bi-GRU Based Deception Detection using EEG Signals

Danilo Avola, Muhammad Yasir Bilal, Emad Emam, Cristina Lakasz, Daniele Pannone, Amedeo Ranaldi

Main category: cs.LG

TL;DR: A deep learning model (Bi-GRU) achieves 97% accuracy in detecting deception using EEG signals from the Bag-of-Lies dataset.

DetailsMotivation: Deception detection is crucial in security, psychology, and forensics, but challenging. This study explores EEG-based methods for naturalistic scenarios.

Method: A Bidirectional Gated Recurrent Unit (Bi-GRU) neural network was trained on EEG signals for binary classification of deceptive vs. truthful behavior.

Result: The model achieved 97% test accuracy, with high precision, recall, and F1-scores for both classes.

Conclusion: Bidirectional temporal modeling is effective for EEG-based deception detection, showing promise for real-time applications and advanced neural architectures.

Abstract: Deception detection is a significant challenge in fields such as security, psychology, and forensics. This study presents a deep learning approach for classifying deceptive and truthful behavior using ElectroEncephaloGram (EEG) signals from the Bag-of-Lies dataset, a multimodal corpus designed for naturalistic, casual deception scenarios. A Bidirectional Gated Recurrent Unit (Bi-GRU) neural network was trained to perform binary classification of EEG samples. The model achieved a test accuracy of 97%, along with high precision, recall, and F1-scores across both classes. These results demonstrate the effectiveness of using bidirectional temporal modeling for EEG-based deception detection and suggest potential for real-time applications and future exploration of advanced neural architectures.
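
The architecture is compact enough to sketch in PyTorch (channel count, window length, and layer sizes are stand-ins, not the paper’s exact configuration):

```python
import torch
import torch.nn as nn

class BiGRUDeceptionClassifier(nn.Module):
    """Binary classifier over EEG windows with a bidirectional GRU."""
    def __init__(self, n_channels=14, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_channels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True, dropout=0.3)
        self.head = nn.Linear(2 * hidden, 1)    # concat of both directions

    def forward(self, x):                       # x: (batch, time, channels)
        out, _ = self.gru(x)
        return self.head(out[:, -1])            # logit: deceptive vs. truthful

model = BiGRUDeceptionClassifier()
logits = model(torch.randn(8, 256, 14))         # 8 windows of 256 samples
loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.randint(0, 2, (8,)).float())
loss.backward()
print(logits.shape, float(loss))
```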

[264] Graph-Structured Data Analysis of Component Failure in Autonomous Cargo Ships Based on Feature Fusion

Zizhao Zhang, Tianxiang Zhao, Yu Sun, Liping Sun, Jichuan Kang

Main category: cs.LG

TL;DR: A hybrid feature fusion framework improves failure mode analysis in autonomous cargo ships using advanced algorithms and achieves high accuracy in classification and prediction.

DetailsMotivation: Address challenges of cascading failures and uncertain emergency decision-making in autonomous cargo ships.

Method: Proposes a hybrid feature fusion framework with HN-CSA for literature retrieval, Word2Vec, BERT-KPCA, and Sentence-BERT for feature processing, and GATE-GNN for classification.

Result: Achieves 7.1% and 3.4% efficiency improvements over NSGA-II and CSA, classification accuracy of 0.735, and high prediction accuracy (F1 score 0.93).

Conclusion: Provides a robust foundation for failure analysis and supports fault diagnosis, risk assessment, and decision-making in autonomous cargo ships.

Abstract: To address the challenges posed by cascading reactions caused by component failures in autonomous cargo ships (ACS) and the uncertainties in emergency decision-making, this paper proposes a novel hybrid feature fusion framework for constructing a graph-structured dataset of failure modes. By employing an improved cuckoo search algorithm (HN-CSA), the literature retrieval efficiency is significantly enhanced, achieving improvements of 7.1% and 3.4% compared to the NSGA-II and CSA search algorithms, respectively. A hierarchical feature fusion framework is constructed, using Word2Vec encoding to encode subsystem/component features, BERT-KPCA to process failure modes/reasons, and Sentence-BERT to quantify the semantic association between failure impact and emergency decision-making. The dataset covers 12 systems, 1,262 failure modes, and 6,150 propagation paths. Validation results show that the GATE-GNN model achieves a classification accuracy of 0.735, comparable to existing benchmarks. Additionally, a silhouette coefficient of 0.641 indicates that the features are highly distinguishable. In the label prediction results, the Shore-based Meteorological Service System achieved an F1 score of 0.93, demonstrating high prediction accuracy. This paper not only provides a solid foundation for failure analysis in autonomous cargo ships but also offers reliable support for fault diagnosis, risk assessment, and intelligent decision-making systems. The link to the dataset is https://github.com/wojiufukele/Graph-Structured-about-CSA.

[265] Adversarial Training Improves Generalization Under Distribution Shifts in Bioacoustics

René Heinrich, Lukas Rauch, Bernhard Sick, Christoph Scholz

Main category: cs.LG

TL;DR: Adversarial training improves generalization and robustness in audio classification, especially with output-space attacks, boosting clean test performance by 10.5%.

DetailsMotivation: To explore how adversarial training enhances generalization and robustness in audio classification under data distribution shifts.

Method: Evaluated two adversarial training strategies (output-space and embedding-space attacks) on ConvNeXt and AudioProtoPNet models using a bird sound benchmark.

Result: Output-space attacks improved clean test performance by 10.5% and increased adversarial robustness.

Conclusion: Adversarial training can enhance robustness against distribution shifts and adversarial attacks in audio classification.

Abstract: Adversarial training is a promising strategy for enhancing model robustness against adversarial attacks. However, its impact on generalization under substantial data distribution shifts in audio classification remains largely unexplored. To address this gap, this work investigates how different adversarial training strategies improve generalization performance and adversarial robustness in audio classification. The study focuses on two model architectures: a conventional convolutional neural network (ConvNeXt) and an inherently interpretable prototype-based model (AudioProtoPNet). The approach is evaluated using a challenging bird sound classification benchmark. This benchmark is characterized by pronounced distribution shifts between training and test data due to varying environmental conditions and recording methods, a common real-world challenge. The investigation explores two adversarial training strategies: one based on output-space attacks that maximize the classification loss function, and another based on embedding-space attacks designed to maximize embedding dissimilarity. These attack types are also used for robustness evaluation. Additionally, for AudioProtoPNet, the study assesses the stability of its learned prototypes under targeted embedding-space attacks. Results show that adversarial training, particularly using output-space attacks, improves clean test data performance by an average of 10.5% relative and simultaneously strengthens the adversarial robustness of the models. These findings, although derived from the bird sound domain, suggest that adversarial training holds potential to enhance robustness against both strong distribution shifts and adversarial attacks in challenging audio classification settings.
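
The output-space attack is standard PGD: perturb inputs within an L-infinity ball to maximize the classification loss, then train on the perturbed batch. A sketch with illustrative budgets (the paper’s audio models and attack settings differ):

```python
import torch
import torch.nn as nn

def pgd_attack(model, x, y, loss_fn, eps=0.01, alpha=0.004, steps=5):
    """Output-space attack: perturb inputs (e.g., spectrograms) within an
    L-inf ball to maximize the classification loss."""
    x_adv = x + eps * torch.empty_like(x).uniform_(-1, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()               # ascend the loss
        x_adv = x.detach() + (x_adv - x).clamp(-eps, eps) # project back to the ball
    return x_adv.detach()

# One adversarial training step on a toy spectrogram classifier.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(4, 1, 64, 128), torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y, loss_fn)
opt.zero_grad()
loss_fn(model(x_adv), y).backward()             # train on the worst-case inputs
opt.step()
print("adversarial training step done")
```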

[266] An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC

Matthias Jobst, Tim Langer, Chen Liu, Mehmet Alici, Hector A. Gonzalez, Christian Mayr

Main category: cs.LG

TL;DR: A multi-layer DNN scheduling framework for SpiNNaker2 enables edge-based execution of complex DNNs, including transformers, from PyTorch models.

DetailsMotivation: To facilitate the execution of large and complex DNNs on neuromorphic hardware like SpiNNaker2 for edge applications.

Method: Extends OctopuScheduler with quantization and lowering steps, providing an end-to-end flow from PyTorch models to SpiNNaker2 inference.

Result: Successfully enables edge-based execution of transformer-scale DNNs on SpiNNaker2.

Conclusion: The framework effectively bridges the gap between PyTorch models and neuromorphic hardware, supporting complex DNNs at the edge.

Abstract: This work presents a multi-layer DNN scheduling framework as an extension of OctopuScheduler, providing an end-to-end flow from PyTorch models to inference on a single SpiNNaker2 chip. Together with a front-end comprised of quantization and lowering steps, the proposed framework enables the edge-based execution of large and complex DNNs up to transformer scale using the neuromorphic platform SpiNNaker2.

[267] SamGoG: A Sampling-Based Graph-of-Graphs Framework for Imbalanced Graph Classification

Shangyou Wang, Zezhong Ding, Xike Xie

Main category: cs.LG

TL;DR: SamGoG is a sampling-based Graph-of-Graphs framework addressing class and graph size imbalance in GNNs, improving accuracy and training speed.

DetailsMotivation: Real-world graphs often suffer from class and size imbalances, biasing GNN performance. Existing methods are limited or costly.

Method: SamGoG constructs multiple GoGs via importance-based sampling, enhancing edge homophily with learnable similarity and adaptive node degrees.

Result: Achieves up to 15.66% accuracy improvement and 6.7× training acceleration on benchmarks.

Conclusion: SamGoG effectively mitigates imbalances and enhances GNN performance for graph classification.

Abstract: Graph Neural Networks (GNNs) have shown remarkable success in graph classification tasks by capturing both structural and feature-based representations. However, real-world graphs often exhibit two critical forms of imbalance: class imbalance and graph size imbalance. These imbalances can bias the learning process and degrade model performance. Existing methods typically address only one type of imbalance or incur high computational costs. In this work, we propose SamGoG, a sampling-based Graph-of-Graphs (GoG) learning framework that effectively mitigates both class and graph size imbalance. SamGoG constructs multiple GoGs through an efficient importance-based sampling mechanism and trains on them sequentially. This sampling mechanism incorporates the learnable pairwise similarity and adaptive GoG node degree to enhance edge homophily, thus improving downstream model quality. SamGoG can seamlessly integrate with various downstream GNNs, enabling their efficient adaptation for graph classification tasks. Extensive experiments on benchmark datasets demonstrate that SamGoG achieves state-of-the-art performance with up to a 15.66% accuracy improvement with 6.7× training acceleration.

[268] Search-Optimized Quantization in Biomedical Ontology Alignment

Oussama Bouaggad, Natalia Grabar

Main category: cs.LG

TL;DR: The paper introduces a method for optimizing large AI models for resource-constrained environments, achieving significant speed and memory improvements while maintaining performance.

DetailsMotivation: Challenges in deploying large AI models on edge devices due to computational demands, energy consumption, and latency.

Method: Uses supervised transformer-based models for ontology alignment, leverages Microsoft Olive for optimization, and employs dynamic quantization with Intel tools.

Result: Achieves 20x faster inference, 70% reduced memory usage, and new state-of-the-art performance on DEFT 2020 tasks.

Conclusion: The proposed optimization method effectively addresses deployment challenges for large AI models in constrained environments.

Abstract: In the fast-moving world of AI, as organizations and researchers develop more advanced models, they face challenges due to their sheer size and computational demands. Deploying such models on edge devices or in resource-constrained environments adds further challenges related to energy consumption, memory usage and latency. To address these challenges, emerging trends are shaping the future of efficient model optimization techniques. From this premise, by employing supervised state-of-the-art transformer-based models, this research introduces a systematic method for ontology alignment, grounded in cosine-based semantic similarity between a biomedical layman vocabulary and the Unified Medical Language System (UMLS) Metathesaurus. It leverages Microsoft Olive to search for target optimizations among different Execution Providers (EPs) using the ONNX Runtime backend, followed by an assembled process of dynamic quantization employing Intel Neural Compressor and IPEX (Intel Extension for PyTorch). Through our optimization process, we conduct extensive assessments on the two tasks from the DEFT 2020 Evaluation Campaign, achieving a new state-of-the-art in both. We retain performance metrics intact, while attaining an average inference speed-up of 20x and reducing memory usage by approximately 70%.
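
The paper’s pipeline runs through Microsoft Olive, ONNX Runtime, Intel Neural Compressor, and IPEX; the underlying idea of dynamic quantization can be illustrated with PyTorch’s built-in API (the model below is a stand-in, not the paper’s alignment model):

```python
import torch
import torch.nn as nn

# Stand-in encoder: dynamic quantization stores Linear weights in int8 and
# quantizes activations on the fly at inference time.
model = nn.Sequential(
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 256),
)

quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x).shape)   # same interface, smaller and faster Linear layers
print(quantized[0])         # DynamicQuantizedLinear(...)
```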

[269] MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Yaowei Jin, Junjie Wang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Mingyue Zheng, Qian Shi

Main category: cs.LG

TL;DR: The paper introduces Parameter Interpolation Flow (PIF) as a novel method for molecular generation, addressing limitations of Bayesian Flow Networks (BFNs) and demonstrating superior performance in drug design.

DetailsMotivation: Current deep learning methods like BFNs have limitations in flexibility and adaptability for diverse molecular tasks, prompting the need for simpler, more efficient models.

Method: Proposes PIF, a parameter-space-based model with detailed theoretical foundation, training, and inference procedures, and develops MolPIF for drug design.

Result: MolPIF outperforms baselines in diverse metrics, validating the effectiveness of parameter-space-based generative modeling.

Conclusion: PIF offers a promising alternative to BFNs, providing new insights for model design in molecular generation.

Abstract: Advances in deep learning for molecular generation show promise in accelerating drug discovery. Bayesian Flow Networks (BFNs) have recently shown impressive performance across diverse chemical tasks, with their success often ascribed to the paradigm of modeling in a low-variance parameter space. However, the Bayesian inference-based strategy imposes limitations on designing more flexible distribution transformation pathways, making it challenging to adapt to diverse data distributions and varied task requirements. Furthermore, the potential for simpler, more efficient parameter-space-based models is unexplored. To address this, we propose a novel Parameter Interpolation Flow model (named PIF) with detailed theoretical foundation, training, and inference procedures. We then develop MolPIF for structure-based drug design, demonstrating its superior performance across diverse metrics compared to baselines. This work validates the effectiveness of parameter-space-based generative modeling paradigm for molecules and offers new perspectives for model design.

[270] Dual-Center Graph Clustering with Neighbor Distribution

Enhao Cheng, Shoujia Zhang, Jianhua Yin, Li Jin, Liqiang Nie

Main category: cs.LG

TL;DR: The paper introduces DCGC, a dual-center graph clustering method using neighbor distribution for reliable supervision and dual-center optimization, outperforming existing techniques.

DetailsMotivation: Existing graph clustering methods rely on unreliable pseudo-labels and single-center optimization, leading to incomplete guidance.

Method: DCGC uses neighbor distribution for supervision in contrastive learning and introduces dual-center optimization (feature and neighbor distribution centers).

Result: Extensive experiments show DCGC achieves superior performance.

Conclusion: DCGC provides a more reliable and effective approach to graph clustering by leveraging neighbor distribution and dual-center optimization.

Abstract: Graph clustering is crucial for unraveling intricate data structures, yet it presents significant challenges due to its unsupervised nature. Recently, goal-directed clustering techniques have yielded impressive results, with contrastive learning methods leveraging pseudo-labels garnering considerable attention. Nonetheless, pseudo-labels are unreliable as a supervision signal, and existing goal-directed approaches utilize only features to construct a single-target distribution for single-center optimization, which leads to incomplete and less dependable guidance. In our work, we propose a novel Dual-Center Graph Clustering (DCGC) approach based on neighbor distribution properties, which includes representation learning with neighbor distribution and dual-center optimization. Specifically, we utilize the neighbor distribution as a supervision signal to mine hard negative samples in contrastive learning, which is reliable and enhances the effectiveness of representation learning. Furthermore, a neighbor distribution center is introduced alongside the feature center to jointly construct a dual-target distribution for dual-center optimization. Extensive experiments and analysis demonstrate the superior performance and effectiveness of our proposed method.
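
As a loose illustration of the dual-target idea, the sketch below computes Student-t soft assignments against a feature center set and a neighbor-distribution center set and fuses them. Averaging the two assignments is an assumption made here for illustration; the paper constructs a joint dual-target distribution rather than a simple mean.

```python
import numpy as np

def soft_assign(X, centers):
    # Student-t soft assignment, the usual kernel in deep-clustering targets.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + d2)
    return q / q.sum(1, keepdims=True)

# Toy inputs: Z are node embeddings, N are per-node neighbor distributions.
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 16))
N = rng.random((100, 8)); N /= N.sum(1, keepdims=True)

q_feat = soft_assign(Z, Z[rng.choice(100, 4, replace=False)])
q_nbr = soft_assign(N, N[rng.choice(100, 4, replace=False)])
dual_q = 0.5 * (q_feat + q_nbr)   # illustrative fusion of the two views
```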

[271] On-the-Fly Fine-Tuning of Foundational Neural Network Potentials: A Bayesian Neural Network Approach

Tim Rensmeyer, Denis Kramer, Oliver Niggemann

Main category: cs.LG

TL;DR: The paper proposes a Bayesian neural network-based fine-tuning approach for machine learning force fields to reduce training data needs and automate uncertainty quantification, enabling efficient modeling of rare events.

DetailsMotivation: The computational burden of generating diverse training datasets for interatomic machine learning force fields makes modeling rare events or large configuration spaces impractical. Fine-tuning pre-trained foundation models can help, but uncertainty quantification remains a challenge.

Method: Introduces a Bayesian neural network method for fine-tuning foundation models, coupled with an on-the-fly workflow that automates model updates and detects rare events like transition states.

Result: The approach reduces the need for extensive training data, maintains accuracy, and efficiently samples rare events during simulations.

Conclusion: The proposed method successfully addresses uncertainty quantification in fine-tuning foundation models, enabling practical and accurate modeling of rare events in complex systems.

Abstract: Due to the computational complexity of evaluating interatomic forces from first principles, the creation of interatomic machine learning force fields has become a highly active field of research. However, the generation of training datasets of sufficient size and sample diversity itself comes with a computational burden that can make this approach impractical for modeling rare events or systems with a large configuration space. Fine-tuning foundation models that have been pre-trained on large-scale material or molecular databases offers a promising opportunity to reduce the amount of training data necessary to reach a desired level of accuracy. However, even if this approach requires less training data overall, creating a suitable training dataset can still be a very challenging problem, especially for systems with rare events and for end-users who don’t have an extensive background in machine learning. In on-the-fly learning, the creation of a training dataset can be largely automated by using model uncertainty during the simulation to decide if the model is accurate enough or if a structure should be recalculated with classical methods and used to update the model. A key challenge for applying this form of active learning to the fine-tuning of foundation models is how to assess the uncertainty of those models during the fine-tuning process, even though most foundation models lack any form of uncertainty quantification. In this paper, we overcome this challenge by introducing a fine-tuning approach based on Bayesian neural network methods and a subsequent on-the-fly workflow that automatically fine-tunes the model while maintaining a pre-specified accuracy and can detect rare events such as transition states and sample them at an increased rate relative to their occurrence.
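
The on-the-fly loop itself is easy to picture. Below is a self-contained toy in which a bootstrap ensemble's disagreement stands in for the Bayesian posterior spread: whenever the uncertainty on a new configuration exceeds a tolerance, the expensive reference calculation (here a cheap stand-in function) is invoked and the dataset grows. All names and the threshold are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
expensive_label = lambda x: np.sin(3 * x)  # stand-in for a first-principles call

X = rng.uniform(-1, 1, 8)                  # small initial training set
y = expensive_label(X)

def ensemble_predict(x, X, y, n=5):
    # Bootstrap ensemble as a crude stand-in for a Bayesian posterior:
    # the spread of member predictions serves as the uncertainty estimate.
    preds = []
    for _ in range(n):
        idx = rng.choice(len(X), len(X))
        preds.append(np.polyval(np.polyfit(X[idx], y[idx], 3), x))
    preds = np.array(preds)
    return preds.mean(), preds.std()

THRESHOLD = 0.1                            # assumed accuracy tolerance
for x_new in rng.uniform(-1, 1, 50):       # stand-in "simulation" trajectory
    mu, sigma = ensemble_predict(x_new, X, y)
    if sigma > THRESHOLD:                  # model unsure -> recompute, update
        X = np.append(X, x_new)
        y = np.append(y, expensive_label(x_new))
```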

[272] Self-supervised learning on gene expression data

Kevin Dradjat, Massinissa Hamidi, Pierre Bartet, Blaise Hanczar

Main category: cs.LG

TL;DR: The paper explores self-supervised learning for phenotype prediction from bulk gene expression data, outperforming traditional supervised methods and reducing reliance on labeled data.

DetailsMotivation: Traditional supervised learning for gene expression data is limited by the need for large labeled datasets, which are costly and time-consuming to obtain. Self-supervised learning offers a solution by leveraging unlabeled data.

Method: Three state-of-the-art self-supervised learning methods were applied to bulk gene expression data to assess their ability to capture data structure and improve phenotype prediction.

Result: Self-supervised methods outperformed traditional supervised models, capturing complex data patterns and reducing dependency on annotated data.

Conclusion: The study demonstrates the potential of self-supervised learning for gene expression analysis, provides method recommendations, and suggests future research directions.

Abstract: Predicting phenotypes from gene expression data is a crucial task in biomedical research, enabling insights into disease mechanisms, drug responses, and personalized medicine. Traditional machine learning and deep learning rely on supervised learning, which requires large quantities of labeled data that are costly and time-consuming to obtain in the case of gene expression data. Self-supervised learning has recently emerged as a promising approach to overcome these limitations by extracting information directly from the structure of unlabeled data. In this study, we investigate the application of state-of-the-art self-supervised learning methods to bulk gene expression data for phenotype prediction. We selected three self-supervised methods, based on different approaches, to assess their ability to exploit the inherent structure of the data and to generate high-quality representations that can be used for downstream predictive tasks. Using several publicly available gene expression datasets, we demonstrate how the selected methods can effectively capture complex information and improve phenotype prediction accuracy. The results show that self-supervised learning methods can outperform traditional supervised models while also offering a significant advantage by reducing the dependency on annotated data. We provide a comprehensive analysis of the performance of each method, highlighting their strengths and limitations, and offer recommendations for using these methods depending on the case under study. Finally, we outline future research directions to enhance the application of self-supervised learning in the field of gene expression data analysis. This study is the first to apply self-supervised learning to bulk RNA-Seq data.
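
As one concrete example of a self-supervised pretext task on unlabeled expression profiles (illustrative only; the paper evaluates three existing methods rather than prescribing this one), a masked-reconstruction objective looks like this:

```python
import torch
import torch.nn as nn

# Toy masked-reconstruction pretext task on unlabeled expression profiles.
# Shapes and the masking rate are illustrative choices.
X = torch.randn(256, 1000)            # 256 samples x 1000 genes (unlabeled)
enc = nn.Sequential(nn.Linear(1000, 128), nn.ReLU())
dec = nn.Linear(128, 1000)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for _ in range(100):
    mask = (torch.rand_like(X) > 0.15).float()    # hide ~15% of genes
    recon = dec(enc(X * mask))
    loss = ((recon - X) ** 2 * (1 - mask)).mean() # score only hidden genes
    opt.zero_grad(); loss.backward(); opt.step()

embeddings = enc(X).detach()          # reusable for phenotype prediction
```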

[273] Reframing attention as a reinforcement learning problem for causal discovery

Turan Orujlu, Christian Gumbsch, Martin V. Butz, Charley M Wu

Main category: cs.LG

TL;DR: The paper introduces the Causal Process framework and its implementation, the Causal Process Model, to represent dynamic causal structures in RL, outperforming existing methods in causal representation learning and agent performance.

DetailsMotivation: To bridge the gap between formal causality frameworks and deep RL by addressing the dynamic nature of causal interactions, which static causal graphs ignore.

Method: Proposes the Causal Process framework and its implementation, the Causal Process Model, integrating causal inference into RL via nested RL tasks and Transformer-like attention mechanisms.

Result: Outperforms current alternatives in causal representation learning and agent performance, successfully recovering dynamic causal graphs.

Conclusion: The Causal Process framework effectively represents dynamic causal structures in RL, offering interpretable causal processes and improved performance.

Abstract: Formal frameworks of causality have operated largely parallel to modern trends in deep reinforcement learning (RL). However, there has been a revival of interest in formally grounding the representations learned by neural networks in causal concepts. Yet, most attempts at neural models of causality assume static causal graphs and ignore the dynamic nature of causal interactions. In this work, we introduce the Causal Process framework as a novel theory for representing dynamic hypotheses about causal structure. Furthermore, we present the Causal Process Model as an implementation of this framework. This allows us to reformulate the attention mechanism popularized by Transformer networks within an RL setting, with the goal of inferring interpretable causal processes from visual observations. Here, causal inference corresponds to constructing a causal graph hypothesis, which itself becomes an RL task nested within the original RL problem. To create an instance of such a hypothesis, we employ RL agents. These agents establish links between units, similar to the original Transformer attention mechanism. We demonstrate the effectiveness of our approach in an RL environment where we outperform current alternatives in causal representation learning and agent performance, and uniquely recover graphs of dynamic causal processes.

[274] MoDyGAN: Combining Molecular Dynamics With GANs to Investigate Protein Conformational Space

Jingbo Liang, Bruna Jacobson

Main category: cs.LG

TL;DR: MoDyGAN combines MD simulations and GANs to explore protein conformations efficiently, using a novel 2D representation for 3D structures.

DetailsMotivation: High computational costs of dynamic simulations limit protein conformational exploration. MoDyGAN aims to address this.

Method: Uses MD-derived trajectories, a GAN generator, and a refinement module with dual-discriminator and ensemble learning.

Result: Generates plausible conformations for rigid proteins and aligns with SMD trajectories for deca-alanine.

Conclusion: Image-like protein representation enables efficient conformational sampling and extends to other 3D structures.

Abstract: Extensively exploring protein conformational landscapes remains a major challenge in computational biology due to the high computational cost involved in dynamic physics-based simulations. In this work, we propose a novel pipeline, MoDyGAN, that leverages molecular dynamics (MD) simulations and generative adversarial networks (GANs) to explore protein conformational spaces. MoDyGAN contains a generator that maps Gaussian distributions into MD-derived protein trajectories, and a refinement module that combines ensemble learning with a dual-discriminator to further improve the plausibility of generated conformations. Central to our approach is an innovative representation technique that reversibly transforms 3D protein structures into 2D matrices, enabling the use of advanced image-based GAN architectures. We use three rigid proteins to demonstrate that MoDyGAN can generate plausible new conformations. We also use deca-alanine as a case study to show that interpolations within the latent space closely align with trajectories obtained from steered molecular dynamics (SMD) simulations. Our results suggest that representing proteins as image-like data unlocks new possibilities for applying advanced deep learning techniques to biomolecular simulation, leading to an efficient sampling of conformational states. Additionally, the proposed framework holds strong potential for extension to other complex 3D structures.
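
The reversible 3D-to-2D idea can be illustrated with a minimal flatten-and-pad scheme. The authors' actual encoding is more involved; the key requirement conveyed here is that the transformation round-trips exactly.

```python
import numpy as np

# Illustrative reversible flattening of an (N, 3) coordinate array into a
# 2D "image-like" matrix, in the spirit of the paper's representation.
def coords_to_image(coords, side):
    flat = coords.reshape(-1)            # N*3 values
    img = np.zeros(side * side)
    img[: flat.size] = flat              # zero-pad up to a square
    return img.reshape(side, side)

def image_to_coords(img, n_atoms):
    return img.reshape(-1)[: n_atoms * 3].reshape(n_atoms, 3)

coords = np.random.randn(50, 3)          # toy 50-atom structure
img = coords_to_image(coords, side=13)   # 13 * 13 = 169 >= 150 values
assert np.allclose(image_to_coords(img, 50), coords)  # exact round-trip
```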

[275] Robust Anomaly Detection with Graph Neural Networks using Controllability

Yifan Wei, Anwar Said, Waseem Abbas, Xenofon Koutsoukos

Main category: cs.LG

TL;DR: The paper proposes integrating average controllability into graph-based anomaly detection to improve performance with limited labeled data, demonstrating success through novel edge-weight and attribute methods.

DetailsMotivation: Anomaly detection in complex domains is hindered by scarce labeled data and imbalance between anomalous and benign samples, requiring innovative solutions.

Method: Two approaches are introduced: (1) using average controllability as edge weight and (2) encoding it as a one-hot edge attribute vector.

Result: Evaluation on real-world and synthetic networks shows improved anomaly detection performance compared to six baselines.

Conclusion: Average controllability enhances graph-based anomaly detection, offering a viable solution for sparse and imbalanced datasets.

Abstract: Anomaly detection in complex domains poses significant challenges due to the need for extensive labeled data and the inherently imbalanced nature of anomalous versus benign samples. Graph-based machine learning models have emerged as a promising solution that combines attribute and relational data to uncover intricate patterns. However, the scarcity of anomalous data exacerbates the challenge, which requires innovative strategies to enhance model learning with limited information. In this paper, we hypothesize that the incorporation of the influence of the nodes, quantified through average controllability, can significantly improve the performance of anomaly detection. We propose two novel approaches to integrate average controllability into graph-based frameworks: (1) using average controllability as an edge weight and (2) encoding it as a one-hot edge attribute vector. Through rigorous evaluation on real-world and synthetic networks with six state-of-the-art baselines, our proposed methods demonstrate improved performance in identifying anomalies, highlighting the critical role of controllability measures in enhancing the performance of graph machine learning models. This work underscores the potential of integrating average controllability as additional metrics to address the challenges of anomaly detection in sparse and imbalanced datasets.
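
Average controllability is typically computed from a controllability Gramian. The sketch below uses a finite-horizon Gramian trace per node and one plausible (assumed) rule for turning node values into edge weights; the paper's exact numerical recipe may differ.

```python
import numpy as np

def average_controllability(A, horizon=10):
    # Per-node trace of a finite-horizon controllability Gramian, a common
    # way to compute this metric, with A scaled for stability.
    A = A / (1.0 + np.abs(np.linalg.eigvals(A)).max())
    Ak = np.eye(A.shape[0])
    ac = np.zeros(A.shape[0])
    for _ in range(horizon):
        ac += (Ak ** 2).sum(axis=0)   # sum_k ||A^k e_i||^2 per node i
        Ak = Ak @ A
    return ac

A = (np.random.rand(20, 20) > 0.8).astype(float)
A = np.maximum(A, A.T)                # toy undirected graph
ac = average_controllability(A)
# One plausible edge weighting (assumed, not the paper's exact rule):
W = (ac[:, None] + ac[None, :]) / 2 * A
```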

[276] Signs of the Past, Patterns of the Present: On the Automatic Classification of Old Babylonian Cuneiform Signs

Eli Verwimp, Gustav Ryberg Smidt, Hendrik Hameeuw, Katrien De Graef

Main category: cs.LG

TL;DR: The paper explores ML techniques for classifying cuneiform signs, highlighting variability challenges and evaluating ResNet50’s performance on Old Babylonian texts.

DetailsMotivation: To address the variability in cuneiform signs and its impact on ML model performance, aiming to improve future data standards and classification tasks.

Method: Trained and tested ResNet50 on handwritten Old Babylonian texts from three Mesopotamian cities, focusing on signs with at least 20 instances.

Result: Achieved a top-1 score of 87.1% and top-5 score of 96.5%, with no comparable results due to being the first study on Old Babylonian texts.

Conclusion: The study provides foundational insights for future cuneiform sign classification and advocates for improved data acquisition standards.

Abstract: The work in this paper describes the training and evaluation of machine learning (ML) techniques for the classification of cuneiform signs. There is a lot of variability in cuneiform signs, depending on where they come from, for what and by whom they were written, but also how they were digitized. This variability makes it unlikely that an ML model trained on one dataset will perform successfully on another dataset. This contribution studies how such differences impact that performance. Based on our results and insights, we aim to influence future data acquisition standards and provide a solid foundation for future cuneiform sign classification tasks. The ML model has been trained and tested on handwritten Old Babylonian (c. 2000-1600 B.C.E.) documentary texts inscribed on clay tablets originating from three Mesopotamian cities (Nippur, Dūr-Abiešuh and Sippar). The presented and analysed model is ResNet50, which achieves a top-1 score of 87.1% and a top-5 score of 96.5% for signs with at least 20 instances. As these automatic classification results are the first on Old Babylonian texts, there are currently no comparable results.
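
The classification setup itself is standard. A minimal sketch of a ResNet50 fine-tuned for sign classes follows; the class count is a placeholder, and initializing from ImageNet weights is an assumption on our part.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: ResNet50 with its head replaced for N sign classes.
num_classes = 200  # placeholder; the actual number of signs differs
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x = torch.randn(8, 3, 224, 224)          # dummy batch of tablet crops
loss = criterion(model(x), torch.randint(0, num_classes, (8,)))
loss.backward(); optimizer.step()
```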

[277] Structural Connectome Harmonization Using Deep Learning: The Strength of Graph Neural Networks

Jagruti Patel, Thomas A. W. Bolton, Mikkel Schöttner, Anjali Tarun, Sebastien Tourbier, Yasser Alemàn-Gòmez, Jonas Richiardi, Patric Hagmann

Main category: cs.LG

TL;DR: A deep harmonization framework for structural connectomes (SCs) addresses biases in multi-site studies without needing metadata or traveling subjects, outperforming traditional methods in preserving topology and individuality.

DetailsMotivation: Small sample sizes and scanner heterogeneity in SC studies limit biomarker reliability for disorders like Alzheimer's and schizophrenia. Existing harmonization methods often require metadata or overlook SC graph-topology.

Method: Proposes a site-conditioned deep harmonization framework tested on simulated Human Connectome Dataset data. Benchmarks three deep architectures (fully connected, convolutional, and graph convolutional autoencoders) against linear regression.

Result: Graph-based autoencoder best preserves topological structure and subject-level individuality, while non-graph models excel in edge-weight prediction. Linear regression performs numerically best but lacks real-world applicability due to metadata dependence.

Conclusion: Graph-based approaches are ideal for structure-aware, domain-generalizable SC harmonization in multi-site studies, emphasizing architecture’s role in performance.

Abstract: Small sample sizes in neuroimaging in general, and in structural connectome (SC) studies in particular, limit the development of reliable biomarkers for neurological and psychiatric disorders - such as Alzheimer’s disease and schizophrenia - by reducing statistical power, reliability, and generalizability. Large-scale multi-site studies exist, but they suffer from acquisition-related biases due to scanner heterogeneity, compromising imaging consistency and downstream analyses. While existing SC harmonization methods - such as linear regression (LR), ComBat, and deep learning techniques - mitigate these biases, they often rely on detailed metadata, traveling subjects (TS), or overlook the graph-topology of SCs. To address these limitations, we propose a site-conditioned deep harmonization framework that harmonizes SCs across diverse acquisition sites without requiring metadata or TS, which we test in a simulated scenario based on the Human Connectome Dataset. Within this framework, we benchmark three deep architectures - a fully connected autoencoder (AE), a convolutional AE, and a graph convolutional AE - against a top-performing LR baseline. While non-graph models excel in edge-weight prediction and edge existence detection, the graph AE demonstrates superior preservation of topological structure and subject-level individuality, as reflected by graph metrics and fingerprinting accuracy, respectively. Although the LR baseline achieves the highest numerical performance by explicitly modeling acquisition parameters, it lacks applicability to real-world multi-site use cases as detailed acquisition metadata is often unavailable. Our results highlight the critical role of model architecture in SC harmonization performance and demonstrate that graph-based approaches are particularly well-suited for structure-aware, domain-generalizable SC harmonization in large-scale multi-site SC studies.

[278] ParallelTime: Dynamically Weighting the Balance of Short- and Long-Term Temporal Dependencies

Itay Katav, Aryeh Kontorovich

Main category: cs.LG

TL;DR: The paper introduces ParallelTime, a dynamic weighting mechanism for time-series forecasting, outperforming existing methods by balancing long-term and short-term dependencies adaptively.

DetailsMotivation: Current methods assign equal weight to long-term and short-term dependencies, which is suboptimal for time-series forecasting.

Method: Proposes ParallelTime Weighter, a dynamic weighting mechanism, and the ParallelTime architecture to adaptively balance dependencies.

Result: Achieves state-of-the-art performance, lower FLOPs, fewer parameters, and scalability to longer horizons.

Conclusion: ParallelTime offers a robust and efficient solution, paving the way for future Attention-Mamba hybrid models in forecasting.

Abstract: Modern multivariate time series forecasting primarily relies on two architectures: the Transformer with attention mechanism and Mamba. In natural language processing, an approach has been used that combines local window attention for capturing short-term dependencies and Mamba for capturing long-term dependencies, with their outputs averaged to assign equal weight to both. We find that for time-series forecasting tasks, assigning equal weight to long-term and short-term dependencies is not optimal. To mitigate this, we propose a dynamic weighting mechanism, ParallelTime Weighter, which calculates interdependent weights for long-term and short-term dependencies for each token based on the input and the model’s knowledge. Furthermore, we introduce the ParallelTime architecture, which incorporates the ParallelTime Weighter mechanism to deliver state-of-the-art performance across diverse benchmarks. Our architecture demonstrates robustness, achieves lower FLOPs, requires fewer parameters, scales effectively to longer prediction horizons, and significantly outperforms existing methods. These advances highlight a promising path for future developments of parallel Attention-Mamba architectures in time series forecasting. The implementation is readily available at: https://github.com/itay1551/ParallelTime
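
Schematically, a per-token weighter can be a small gating network over the two branch outputs. The sketch below is an assumption about the general shape of such a mechanism, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

# Per-token gate that mixes a short-term (window-attention) branch and a
# long-term (Mamba-style) branch; both branch outputs are dummies here.
d = 64
gate = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, 2))

short = torch.randn(4, 128, d)   # stand-in for the attention branch output
long_ = torch.randn(4, 128, d)   # stand-in for the Mamba branch output

w = torch.softmax(gate(torch.cat([short, long_], dim=-1)), dim=-1)
mixed = w[..., :1] * short + w[..., 1:] * long_   # per-token weighted sum
```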

[279] On the Fundamental Limitations of Dual Static CVaR Decompositions in Markov Decision Processes

Mathieu Godbout, Audrey Durand

Main category: cs.LG

TL;DR: The paper explains why dual-based DP methods for static CVaR-optimal policies in MDPs fail, attributing it to unmet risk-assignment constraints and introduces a CVaR evaluation gap to quantify errors.

DetailsMotivation: To clarify why dual-based DP methods fail in CVaR optimization and to explore policy evaluation errors.

Method: Framing policy evaluation as two minimization problems and analyzing risk-assignment consistency constraints.

Result: Identifies empty constraint intersection as the cause of evaluation errors and links it to non-zero CVaR gaps in dual-based DP.

Conclusion: The dual CVaR decomposition is inherently limited, as no single policy can be optimal across all risk levels in certain MDPs.

Abstract: Recent work has shown that dynamic programming (DP) methods for finding static CVaR-optimal policies in Markov Decision Processes (MDPs) can fail when based on the dual formulation, yet the root cause of the failure has remained unclear. We expand on these findings by shifting focus from policy optimization to the seemingly simpler task of policy evaluation. We show that evaluating the static CVaR of a given policy can be framed as two distinct minimization problems. For their solutions to match, a set of "risk-assignment consistency constraints" must be satisfied, and we demonstrate that an empty intersection of these constraints is the source of previously observed evaluation errors. Quantifying the evaluation error as the CVaR evaluation gap, we then demonstrate that the issues observed when optimizing over the dual-based CVaR DP are explained by the returned policy having a non-zero CVaR evaluation gap. We then leverage our proposed risk-assignment perspective to prove that the search for a single, uniformly optimal policy via the dual CVaR decomposition is fundamentally limited, identifying an MDP where no single policy can be optimal across all initial risk levels.
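
For reference, the static CVaR being evaluated is the standard Rockafellar-Uryasev form, stated here for returns (so higher is better; sign conventions vary across papers):

$$\mathrm{CVaR}_\alpha(Z) \;=\; \max_{t \in \mathbb{R}} \Big( t - \tfrac{1}{\alpha}\,\mathbb{E}\big[(t - Z)_+\big] \Big), \qquad (x)_+ = \max(x, 0),$$

i.e., the expected return within the worst $\alpha$-fraction of outcomes.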

[280] Byzantine-resilient federated online learning for Gaussian process regression

Xu Zhang, Zhenyuan Yuan, Minghui Zhu

Main category: cs.LG

TL;DR: A Byzantine-resilient federated GPR algorithm is proposed to improve learning performance despite adversarial agent behavior.

DetailsMotivation: To address the challenge of learning a latent function collaboratively in federated settings where some agents may exhibit arbitrary or adversarial behavior (Byzantine failures).

Method: Develops a federated GPR algorithm where agents send local predictions to a cloud, which aggregates them using a Byzantine-resilient rule. The cloud broadcasts the global model back to agents, who refine their predictions by fusing it with their local GPR.

Result: Quantifies accuracy improvements of the fused GPR over local GPR. Experiments on toy and real-world datasets validate the algorithm’s performance.

Conclusion: The proposed algorithm effectively mitigates Byzantine failures and enhances learning accuracy in federated GPR settings.

Abstract: In this paper, we study Byzantine-resilient federated online learning for Gaussian process regression (GPR). We develop a Byzantine-resilient federated GPR algorithm that allows a cloud and a group of agents to collaboratively learn a latent function and improve the learning performances where some agents exhibit Byzantine failures, i.e., arbitrary and potentially adversarial behavior. Each agent-based local GPR sends potentially compromised local predictions to the cloud, and the cloud-based aggregated GPR computes a global model by a Byzantine-resilient product of experts aggregation rule. Then the cloud broadcasts the current global model to all the agents. Agent-based fused GPR refines local predictions by fusing the received global model with that of the agent-based local GPR. Moreover, we quantify the learning accuracy improvements of the agent-based fused GPR over the agent-based local GPR. Experiments on a toy example and two medium-scale real-world datasets are conducted to demonstrate the performances of the proposed algorithm.
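
A coordinate-wise median is the textbook example of a Byzantine-resilient aggregation rule and conveys the idea, though the paper's actual aggregator is a Byzantine-resilient product-of-experts rule, which differs.

```python
import numpy as np

rng = np.random.default_rng(0)
honest = [np.sin(np.linspace(0, 1, 50)) + 0.05 * rng.standard_normal(50)
          for _ in range(8)]                 # 8 well-behaved agents
byzantine = [10.0 * np.ones(50)]             # one adversarial agent

preds = np.vstack(honest + byzantine)        # (9 agents, 50 test points)
robust = np.median(preds, axis=0)            # unaffected by the outlier
naive = preds.mean(axis=0)                   # badly skewed by the same agent
```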

[281] DONUT: Physics-aware Machine Learning for Real-time X-ray Nanodiffraction Analysis

Aileen Luo, Tao Zhou, Ming Du, Martin V. Holt, Andrej Singer, Mathew J. Cherukara

Main category: cs.LG

TL;DR: DONUT, a physics-aware neural network, enables real-time, automated analysis of nanobeam diffraction data without labeled datasets, improving efficiency by 200x over conventional methods.

DetailsMotivation: Real-time analysis in coherent X-ray scattering is hindered by artifacts and computational demands, especially in nanodiffraction microscopy.

Method: DONUT integrates a differentiable geometric diffraction model into its architecture to predict crystal lattice strain and orientation unsupervised.

Result: DONUT accurately extracts data features 200 times more efficiently than traditional fitting methods.

Conclusion: DONUT overcomes limitations of supervised learning in X-ray science, offering rapid, automated analysis for nanoscale structural studies.

Abstract: Coherent X-ray scattering techniques are critical for investigating the fundamental structural properties of materials at the nanoscale. While advancements have made these experiments more accessible, real-time analysis remains a significant bottleneck, often hindered by artifacts and computational demands. In scanning X-ray nanodiffraction microscopy, which is widely used to spatially resolve structural heterogeneities, this challenge is compounded by the convolution of the divergent beam with the sample’s local structure. To address this, we introduce DONUT (Diffraction with Optics for Nanobeam by Unsupervised Training), a physics-aware neural network designed for the rapid and automated analysis of nanobeam diffraction data. By incorporating a differentiable geometric diffraction model directly into its architecture, DONUT learns to predict crystal lattice strain and orientation in real-time. Crucially, this is achieved without reliance on labeled datasets or pre-training, overcoming a fundamental limitation for supervised machine learning in X-ray science. We demonstrate experimentally that DONUT accurately extracts all features within the data over 200 times more efficiently than conventional fitting methods.

[282] Noradrenergic-inspired gain modulation attenuates the stability gap in joint training

Alejandro Rodriguez-Garcia, Anindya Ghosh, Srikanth Ramaswamy

Main category: cs.LG

TL;DR: The paper addresses the stability gap in continual learning, proposing an uncertainty-modulated gain dynamics mechanism inspired by biological brains to balance plasticity and stability.

DetailsMotivation: The stability gap in continual learning undermines robustness, persisting even under ideal joint-loss regimes, necessitating mechanisms to reconcile rapid adaptation and retention.

Method: The authors propose uncertainty-modulated gain dynamics, inspired by noradrenergic bursts in biological brains, to dynamically balance knowledge integration and interference reduction.

Result: The mechanism effectively attenuates the stability gap in MNIST and CIFAR benchmarks under joint training, enhancing continual learning performance.

Conclusion: The study provides insights into reducing stability gaps and improving continual learning by mimicking biological neuromodulatory functions.

Abstract: Recent studies in continual learning have identified a transient drop in performance on mastered tasks when assimilating new ones, known as the stability gap. Such dynamics contradict the objectives of continual learning, revealing a lack of robustness in mitigating forgetting, and notably, persisting even under an ideal joint-loss regime. Examining this gap within this idealized joint training context is critical to isolate it from other sources of forgetting. We argue that it reflects an imbalance between rapid adaptation and robust retention at task boundaries, underscoring the need to investigate mechanisms that reconcile plasticity and stability within continual learning frameworks. Biological brains navigate a similar dilemma by operating concurrently on multiple timescales, leveraging neuromodulatory signals to modulate synaptic plasticity. However, artificial networks lack native multitimescale dynamics, and although optimizers like momentum-SGD and Adam introduce implicit timescale regularization, they still exhibit stability gaps. Inspired by locus coeruleus-mediated noradrenergic bursts, which transiently enhance neuronal gain under uncertainty to facilitate sensory assimilation, we propose uncertainty-modulated gain dynamics - an adaptive mechanism that approximates a two-timescale optimizer and dynamically balances integration of knowledge with minimal interference on previously consolidated information. We evaluate our mechanism on domain-incremental and class-incremental variants of the MNIST and CIFAR benchmarks under joint training, demonstrating that uncertainty-modulated gain dynamics effectively attenuate the stability gap. Finally, our analysis elucidates how gain modulation replicates noradrenergic functions in cortical circuits, offering mechanistic insights into reducing stability gaps and enhancing performance in continual learning tasks.
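
One minimal way to realize uncertainty-modulated gain is to scale the effective step size by a bounded function of predictive entropy. This is a sketch of the idea only; the paper's two-timescale gain dynamics are more elaborate.

```python
import torch

def gain(logits, g_min=1.0, g_max=3.0):
    # Map normalized predictive entropy (0..1) to a bounded gain: high
    # uncertainty transiently raises the effective step size, loosely
    # mimicking a noradrenergic burst at a task boundary.
    p = torch.softmax(logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(-1).mean()
    entropy_norm = entropy / torch.log(torch.tensor(float(logits.shape[-1])))
    return g_min + (g_max - g_min) * entropy_norm

logits = torch.randn(32, 10)               # dummy batch of class logits
effective_lr = 1e-3 * gain(logits).item()  # scales the optimizer's step size
```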

[283] Preference-based Multi-Objective Reinforcement Learning

Ni Mu, Yao Luan, Qing-Shan Jia

Main category: cs.LG

TL;DR: Pb-MORL integrates preferences into multi-objective reinforcement learning, eliminating the need for complex reward design and deriving Pareto optimal policies.

DetailsMotivation: Pre-defined reward functions in MORL are hard to design and may oversimplify conflicting goals. Preferences offer a flexible alternative.

Method: Pb-MORL formalizes preference integration, constructs a multi-objective reward model aligned with preferences, and proves its equivalence to Pareto optimal policy training.

Result: Experiments in benchmark tasks, energy management, and autonomous driving show Pb-MORL outperforms the oracle method using ground truth rewards.

Conclusion: Pb-MORL is effective for complex real-world systems, offering a practical alternative to traditional reward-based MORL.

Abstract: Multi-objective reinforcement learning (MORL) is a structured approach for optimizing tasks with multiple objectives. However, it often relies on pre-defined reward functions, which can be hard to design for balancing conflicting goals and may lead to oversimplification. Preferences can serve as more flexible and intuitive decision-making guidance, eliminating the need for complicated reward design. This paper introduces preference-based MORL (Pb-MORL), which formalizes the integration of preferences into the MORL framework. We theoretically prove that preferences can derive policies across the entire Pareto frontier. To guide policy optimization using preferences, our method constructs a multi-objective reward model that aligns with the given preferences. We further provide theoretical proof to show that optimizing this reward model is equivalent to training the Pareto optimal policy. Extensive experiments in benchmark multi-objective tasks, a multi-energy management task, and an autonomous driving task on a multi-line highway show that our method performs competitively, surpassing the oracle method, which uses the ground truth reward function. This highlights its potential for practical applications in complex real-world systems.
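
Reward models learned from pairwise preferences are commonly fit with a Bradley-Terry-style loss. Assuming Pb-MORL's reward model follows this standard recipe (an assumption on our part; the paper's model is multi-objective), a minimal single-objective version is:

```python
import torch
import torch.nn as nn

# Bradley-Terry-style preference loss for a learned reward model.
reward = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward.parameters(), lr=1e-3)

seg_a = torch.randn(128, 8)                # trajectory-segment features
seg_b = torch.randn(128, 8)
pref = torch.randint(0, 2, (128,)).float() # 1 means segment a is preferred

logits = (reward(seg_a) - reward(seg_b)).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, pref)
opt.zero_grad(); loss.backward(); opt.step()
```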

[284] DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration

Xiyun Li, Yining Ding, Yuhua Jiang, Yunlong Zhao, Runpeng Xie, Shuang Xu, Yuanhua Ni, Yiqin Yang, Bo Xu

Main category: cs.LG

TL;DR: Proposes a dual process multi-scale theory of mind (DPMT) framework to improve human-AI collaboration by modeling complex human mental characteristics.

DetailsMotivation: Existing LLM agents struggle to model human intentions without direct communication, limiting real-time collaboration.

Method: Introduces DPMT with a multi-scale theory of mind module for mental characteristic reasoning.

Result: DPMT significantly enhances human-AI collaboration, validated by ablation studies.

Conclusion: The DPMT framework effectively addresses limitations in modeling human behavior, improving collaboration.

Abstract: Real-time human-artificial intelligence (AI) collaboration is crucial yet challenging, especially when AI agents must adapt to diverse and unseen human behaviors in dynamic scenarios. Existing large language model (LLM) agents often fail to accurately model the complex human mental characteristics such as domain intentions, especially in the absence of direct communication. To address this limitation, we propose a novel dual process multi-scale theory of mind (DPMT) framework, drawing inspiration from cognitive science dual process theory. Our DPMT framework incorporates a multi-scale theory of mind (ToM) module to facilitate robust human partner modeling through mental characteristic reasoning. Experimental results demonstrate that DPMT significantly enhances human-AI collaboration, and ablation studies further validate the contributions of our multi-scale ToM in the slow system.

[285] Kolmogorov Arnold Networks (KANs) for Imbalanced Data – An Empirical Perspective

Pankaj Yadav, Vivek Vijay

Main category: cs.LG

TL;DR: Kolmogorov Arnold Networks (KANs) perform well on raw imbalanced data but conflict with standard imbalance techniques, showing high computational costs without proportional gains. MLPs with imbalance methods match KANs’ performance more efficiently.

DetailsMotivation: To empirically evaluate KANs in class-imbalanced classification and compare their effectiveness with MLPs, especially regarding imbalance strategies.

Method: Evaluation of KANs and MLPs on ten benchmark datasets, testing raw imbalanced data and conventional imbalance techniques like resampling and focal loss.

Result: KANs perform well on raw imbalanced data but degrade with imbalance techniques. MLPs with imbalance methods achieve similar performance at lower costs.

Conclusion: KANs are specialized for raw imbalanced data but face performance-resource tradeoffs and incompatibility with standard techniques, limiting practical use. Future work should focus on KAN-specific modifications and efficiency.

Abstract: Kolmogorov Arnold Networks (KANs) are a recent architectural advancement in neural computation that offers a mathematically grounded alternative to standard neural networks. This study presents an empirical evaluation of KANs in the context of class-imbalanced classification, using ten benchmark datasets. We observe that KANs can inherently perform well on raw imbalanced data, more effectively than Multi-Layer Perceptrons (MLPs), without any resampling strategy. However, conventional imbalance strategies fundamentally conflict with KANs’ mathematical structure, as resampling and focal-loss implementations significantly degrade KANs’ performance while marginally benefiting MLPs. Crucially, KANs suffer from prohibitive computational costs without proportional performance gains. Statistical validation confirms that MLPs with imbalance techniques achieve equivalence with KANs (|d| < 0.08 across metrics) at minimal resource cost. These findings reveal that KANs represent a specialized solution for raw imbalanced data where resources permit, but their severe performance-resource tradeoffs and incompatibility with standard resampling techniques currently limit practical deployment. We identify critical research priorities: developing KAN-specific architectural modifications for imbalance learning, optimizing computational efficiency, and theoretically reconciling their conflict with data augmentation. This work establishes foundational insights for next-generation KAN architectures in imbalanced classification scenarios.
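
For concreteness, the focal loss referenced here is the standard formulation, which down-weights easy examples:

```python
import torch

def focal_loss(logits, targets, gamma=2.0):
    # Standard focal loss: cross-entropy scaled by (1 - p_t)^gamma, so
    # well-classified (easy) examples contribute less to the gradient.
    ce = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()

logits, targets = torch.randn(16, 5), torch.randint(0, 5, (16,))
loss = focal_loss(logits, targets)
```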

[286] Toward Temporal Causal Representation Learning with Tensor Decomposition

Jianhong Chen, Meng Zhao, Mostafa Reisi Gahrooei, Xubo Yue

Main category: cs.LG

TL;DR: The paper introduces CaRTeD, a joint learning framework combining temporal causal representation learning with irregular tensor decomposition for high-dimensional, varying-length data. It provides theoretical convergence guarantees and outperforms state-of-the-art methods in experiments.

DetailsMotivation: Real-world data often exists as high-dimensional, irregular tensors, requiring advanced methods for meaningful analysis. Existing techniques lack flexibility and theoretical guarantees for such data.

Method: Proposes CaRTeD, integrating temporal causal representation learning with irregular tensor decomposition, offering flexible regularization and theoretical convergence proofs.

Result: Demonstrates superior performance on synthetic and EHR datasets (MIMIC-III), improving phenotyping and network recovery.

Conclusion: CaRTeD fills a theoretical gap, enhances explainability, and outperforms existing methods, making it valuable for causal analysis of irregular tensor data.

Abstract: Temporal causal representation learning is a powerful tool for uncovering complex patterns in observational studies, which are often represented as low-dimensional time series. However, in many real-world applications, data are high-dimensional with varying input lengths and naturally take the form of irregular tensors. To analyze such data, irregular tensor decomposition is critical for extracting meaningful clusters that capture essential information. In this paper, we focus on modeling causal representation learning based on the transformed information. First, we present a novel causal formulation for a set of latent clusters. We then propose CaRTeD, a joint learning framework that integrates temporal causal representation learning with irregular tensor decomposition. Notably, our framework provides a blueprint for downstream tasks using the learned tensor factors, such as modeling latent structures and extracting causal information, and offers a more flexible regularization design to enhance tensor decomposition. Theoretically, we show that our algorithm converges to a stationary point. More importantly, our results fill the gap in theoretical guarantees for the convergence of state-of-the-art irregular tensor decomposition. Experimental results on synthetic and real-world electronic health record (EHR) datasets (MIMIC-III), with extensive benchmarks from both phenotyping and network recovery perspectives, demonstrate that our proposed method outperforms state-of-the-art techniques and enhances the explainability of causal representations.

[287] LOCUS: LOcalization with Channel Uncertainty and Sporadic Energy

Subrata Biswas, Mohammad Nur Hossain Khan, Violet Colwell, Jack Adiletta, Bashima Islam

Main category: cs.LG

TL;DR: LOCUS is a deep learning framework for sound source localization (SSL) that recovers corrupted features in batteryless systems, improving DoA accuracy under missing-channel conditions.

DetailsMotivation: Batteryless systems often suffer from missing data due to energy harvesting, degrading SSL performance.

Method: LOCUS integrates three modules: InFo to identify corrupted regions, LaFS to reconstruct missing features, and GRep to restore data without altering valid inputs.

Result: LOCUS achieves up to 36.91% error reduction on datasets and 25.87-59.46% gains in real-world deployments.

Conclusion: LOCUS significantly improves SSL accuracy under energy constraints, with released code and a 50-hour dataset for future research.

Abstract: Accurate sound source localization (SSL), such as direction-of-arrival (DoA) estimation, relies on consistent multichannel data. However, batteryless systems often suffer from missing data due to the stochastic nature of energy harvesting, degrading localization performance. We propose LOCUS, a deep learning framework that recovers corrupted features in such settings. LOCUS integrates three modules: (1) Information-Weighted Focus (InFo) to identify corrupted regions, (2) Latent Feature Synthesizer (LaFS) to reconstruct missing features, and (3) Guided Replacement (GRep) to restore data without altering valid inputs. LOCUS significantly improves DoA accuracy under missing-channel conditions, achieving up to 36.91% error reduction on DCASE and LargeSet, and 25.87-59.46% gains in real-world deployments. We release a 50-hour multichannel dataset to support future research on localization under energy constraints. Our code and data are available at: https://bashlab.github.io/locus_project/

[288] Equivalent and Compact Representations of Neural Network Controllers With Decision Trees

Kevin Chang, Nathan Dahlin, Rahul Jain, Pierluigi Nuzzo

Main category: cs.LG

TL;DR: The paper explores transforming neural network (NN) controllers into soft decision tree (SDT) controllers to improve verifiability and safety, demonstrating efficiency gains in formal verification.

DetailsMotivation: NN controllers are effective but lack transparency and safety guarantees, limiting real-world deployment. SDTs offer a more verifiable alternative.

Method: An exact and efficient transformation algorithm is devised to convert NN controllers (with ReLU and argmax) into SDTs, pruning redundant branches.

Result: Applied to an autonomous driving controller and OpenAI Gym benchmarks, the SDT transformation showed runtime improvements of up to 21x and 2x.

Conclusion: SDT transformation enhances verifiability and efficiency, making NN controllers safer and more practical for real-world systems.

Abstract: Over the past decade, neural network (NN)-based controllers have demonstrated remarkable efficacy in a variety of decision-making tasks. However, their black-box nature and the risk of unexpected behaviors pose a challenge to their deployment in real-world systems requiring strong guarantees of correctness and safety. We address these limitations by investigating the transformation of NN-based controllers into equivalent soft decision tree (SDT)-based controllers and its impact on verifiability. In contrast to existing work, we focus on discrete-output NN controllers including rectified linear unit (ReLU) activation functions as well as argmax operations. We then devise an exact yet efficient transformation algorithm which automatically prunes redundant branches. We first demonstrate the practical efficacy of the transformation algorithm applied to an autonomous driving NN controller within OpenAI Gym’s CarRacing environment. Subsequently, we evaluate our approach using two benchmarks from the OpenAI Gym environment. Our results indicate that the SDT transformation can benefit formal verification, showing runtime improvements of up to $21 \times$ and $2 \times$ for MountainCar-v0 and CartPole-v1, respectively.

[289] Interpretable Imitation Learning via Generative Adversarial STL Inference and Control

Wenliang Liu, Danyang Li, Erfan Aasi, Daniela Rus, Roberto Tron, Calin Belta

Main category: cs.LG

TL;DR: A novel imitation learning method combines STL inference and control synthesis for interpretable task representation, adaptability, and improved policy alignment.

DetailsMotivation: Address the lack of interpretability in imitation learning by explicitly representing tasks as STL formulas, enabling human knowledge integration and adaptation.

Method: Uses STL inference and control synthesis, trained with a GAN-inspired approach to align expert and learned policies.

Result: Demonstrates efficiency and adaptability in simulations, with practical applicability.

Conclusion: The method enhances interpretability and adaptability in imitation learning, bridging the gap between expert and learned behaviors.

Abstract: Imitation learning methods have demonstrated considerable success in teaching autonomous systems complex tasks through expert demonstrations. However, a limitation of these methods is their lack of interpretability, particularly in understanding the specific task the learning agent aims to accomplish. In this paper, we propose a novel imitation learning method that combines Signal Temporal Logic (STL) inference and control synthesis, enabling the explicit representation of the task as an STL formula. This approach not only provides a clear understanding of the task but also supports the integration of human knowledge and allows for adaptation to out-of-distribution scenarios by manually adjusting the STL formulas and fine-tuning the policy. We employ a Generative Adversarial Network (GAN)-inspired approach to train both the inference and policy networks, effectively narrowing the gap between expert and learned policies. The efficiency of our algorithm is demonstrated through simulations, showcasing its practical applicability and adaptability.

[290] A General Framework for Inference-time Scaling and Steering of Diffusion Models

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath

Main category: cs.LG

TL;DR: FK steering is an inference-time framework for steering diffusion models using reward functions, avoiding expensive training and mode collapse. It outperforms fine-tuned models in prompt fidelity and text quality.

DetailsMotivation: Generating samples with user-specified properties in diffusion models is challenging. Fine-tuning is costly and prone to mode collapse.

Method: FK steering samples multiple interacting diffusion processes (particles) and resamples them based on reward-based potentials.

Result: FK steering outperforms fine-tuned models in prompt fidelity and text quality, with faster sampling and no training.

Conclusion: Inference-time steering of diffusion models with off-the-shelf rewards offers significant quality gains and controllability.

Abstract: Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering .
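
The core resampling step is a standard sequential Monte Carlo move: exponentiate intermediate rewards into potentials, normalize them into weights, and resample the particles. The reward function and temperature below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
particles = rng.standard_normal((16, 32))   # 16 partial samples, dim 32

def intermediate_reward(x):                 # stand-in for a real reward
    return -np.abs(x.mean(axis=1))

potentials = np.exp(intermediate_reward(particles) / 0.1)  # temperature 0.1
weights = potentials / potentials.sum()
idx = rng.choice(len(particles), size=len(particles), p=weights)
particles = particles[idx]                  # survivors continue diffusing
```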

[291] Geometry-Informed Neural Networks

Arturs Berzins, Andreas Radler, Eric Volkmann, Sebastian Sanokowski, Sepp Hochreiter, Johannes Brandstetter

Main category: cs.LG

TL;DR: GINNs train shape-generative neural fields without data by using design requirements as objectives and constraints, avoiding mode-collapse and enabling diverse solutions.

DetailsMotivation: The lack of large shape datasets limits supervised learning, prompting exploration of alternative strategies.

Method: Introduces geometry-informed neural networks (GINNs), leveraging design requirements as objectives and constraints to train without data.

Result: GINNs generate diverse solutions with control over geometric and topological properties, applied successfully in physics, geometry, and design.

Conclusion: GINNs show potential for data-free generative design, offering new approaches without reliance on large datasets.

Abstract: Geometry is a ubiquitous tool in computer graphics, design, and engineering. However, the lack of large shape datasets limits the application of state-of-the-art supervised learning methods and motivates the exploration of alternative learning strategies. To this end, we introduce geometry-informed neural networks (GINNs) – a framework for training shape-generative neural fields without data by leveraging user-specified design requirements in the form of objectives and constraints. By adding diversity as an explicit constraint, GINNs avoid mode-collapse and can generate multiple diverse solutions, often required in geometry tasks. Experimentally, we apply GINNs to several problems spanning physics, geometry, and engineering design, showing control over geometrical and topological properties, such as surface smoothness or the number of holes. These results demonstrate the potential of training shape-generative models without data, paving the way for new generative design approaches without large datasets.
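
A minimal data-free example in this spirit fits an implicit field to constraints alone, e.g. a zero level set through given interface points plus an eikonal smoothness term. The constraint choices here are illustrative, not the paper's benchmarks, and the diversity constraint is omitted.

```python
import torch
import torch.nn as nn

# Data-free neural field: losses come from design constraints, not samples.
f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

interface = torch.tensor([[0.0, 0.5], [0.5, 0.0], [-0.5, 0.0]])
for _ in range(200):
    x = torch.rand(256, 2, requires_grad=True) * 2 - 1
    grad = torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]
    eikonal = ((grad.norm(dim=1) - 1) ** 2).mean()  # SDF-like regularity
    fit = f(interface).pow(2).mean()                # zero level set on points
    loss = fit + 0.1 * eikonal
    opt.zero_grad(); loss.backward(); opt.step()
```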

[292] Bridging Local and Global Knowledge via Transformer in Board Games

Yan-Ru Ju, Tai-Lin Wu, Chung-Chin Shih, Ti-Rong Wu

Main category: cs.LG

TL;DR: ResTNet, a network combining residual and Transformer blocks, enhances AlphaZero’s performance in board games by improving global understanding and pattern recognition.

DetailsMotivation: AlphaZero struggles with comprehensive board understanding, especially in recognizing long-sequence patterns like in Go.

Method: ResTNet interleaves residual and Transformer blocks to integrate local and global knowledge.

Result: ResTNet boosts win rates in Go and Hex, improves pattern recognition, and reduces errors in specific scenarios.

Conclusion: ResTNet effectively bridges local and global knowledge, offering insights for better AlphaZero-based algorithms.

Abstract: Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing win rate from 54.6% to 60.8% in 9x9 Go, 53.6% to 60.9% in 19x19 Go, and 50.4% to 58.0% in 19x19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19x19 Go, including circular pattern and ladder pattern. It reduces the mean square error for circular pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero’s decision-making process. Overall, ResTNet shows a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.
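
Interleaving the two block types is straightforward to sketch; channel sizes and depth below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Residual conv block for local patterns, Transformer layer for global
# mixing over board positions, applied in alternation.
class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.conv(x))

c, board = 64, 19
res = ResBlock(c)
attn = nn.TransformerEncoderLayer(c, nhead=4, batch_first=True)

x = torch.randn(2, c, board, board)
x = res(x)                                   # local pattern extraction
t = x.flatten(2).transpose(1, 2)             # (B, 361, C) token sequence
x = attn(t).transpose(1, 2).reshape(2, c, board, board)  # global mixing
x = res(x)                                   # and interleave again
```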

[293] XpertAI: uncovering regression model strategies for sub-manifolds

Simon Letzgus, Klaus-Robert Müller, Grégoire Montavon

Main category: cs.LG

TL;DR: XpertAI is a framework for Explainable AI (XAI) in regression models, enabling precise queries by decomposing predictions into range-specific sub-strategies.

DetailsMotivation: Address the lack of XAI solutions tailored for regression models, where explanations must be precise and context-specific.

Method: Disentangles prediction strategies into range-specific sub-strategies, allowing queries as linear combinations of these. Compatible with popular XAI techniques.

Result: Qualitative and quantitative results show the framework’s effectiveness in providing precise explanations.

Conclusion: XpertAI successfully bridges the gap in XAI for regression, offering tailored and precise explanations.

Abstract: In recent years, Explainable AI (XAI) methods have facilitated profound validation and knowledge extraction from ML models. While extensively studied for classification, few XAI solutions have addressed the challenges specific to regression models. In regression, explanations need to be precisely formulated to address specific user queries (e.g., distinguishing between 'Why is the output above 0?' and 'Why is the output above 50?'). They should furthermore reflect the model’s behavior on the relevant data sub-manifold. In this paper, we introduce XpertAI, a framework that disentangles the prediction strategy into multiple range-specific sub-strategies and allows the formulation of precise queries about the model (the 'explanandum') as a linear combination of those sub-strategies. XpertAI is formulated generally to work alongside popular XAI attribution techniques based on occlusion, gradient integration, or reverse propagation. Qualitative and quantitative results demonstrate the benefits of our approach.

[294] Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning

Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan

Main category: cs.LG

TL;DR: DiZO optimization bridges the gap between zeroth-order (ZO) and first-order (FO) methods by using layer-wise divergence analysis, improving convergence speed and accuracy while reducing memory usage.

DetailsMotivation: Standard FO fine-tuning is memory-intensive, limiting deployment, while ZO methods lag in performance. The goal is to enhance ZO optimization to match FO efficiency.

Method: Introduces DiZO, a divergence-driven ZO optimization method that adapts updates layer-wise using projections, scaling updates to individual layer needs.

Result: DiZO cuts training GPU hours by up to 48% through faster convergence, consistently outperforms ZO baselines, and sometimes surpasses FO fine-tuning on RoBERTa-large, OPT-series, and Llama-series models.

Conclusion: DiZO offers a memory-efficient, high-performance alternative to FO fine-tuning, making it practical for resource-constrained scenarios.

Abstract: Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind FO methods in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. Aiming to match the learning capacity of FO methods based on these findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://anonymous.4open.science/r/DiZO-E86D.
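
The abstract does not spell out DiZO's projection operator, but the two-point zeroth-order estimator it builds on is standard. Below is a minimal sketch of one ZO update step with a hypothetical per-layer scale factor standing in for the divergence-driven layer adaptation; `zo_step`, `layer_scales`, and the toy model are illustrative names, not the paper's API.

```python
import numpy as np

def zo_step(params, loss_fn, layer_scales, mu=1e-3, lr=1e-4, rng=None):
    """One two-point zeroth-order update (SPSA-style), with hypothetical
    per-layer scale factors standing in for DiZO's divergence-driven
    projections. `params` maps layer name -> ndarray."""
    rng = rng or np.random.default_rng()
    # Sample one Gaussian perturbation direction per layer.
    z = {k: rng.standard_normal(v.shape) for k, v in params.items()}
    plus  = {k: v + mu * z[k] for k, v in params.items()}
    minus = {k: v - mu * z[k] for k, v in params.items()}
    # Scalar directional-derivative estimate from two forward passes only.
    g_hat = (loss_fn(plus) - loss_fn(minus)) / (2 * mu)
    # Layer-wise update: scale each layer's step individually.
    return {k: v - lr * layer_scales[k] * g_hat * z[k]
            for k, v in params.items()}

# Toy usage: least-squares regression with a two-"layer" parameter dict.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((64, 8)), rng.standard_normal(64)
params = {"w": rng.standard_normal(8), "b": np.zeros(1)}
loss = lambda p: float(np.mean((X @ p["w"] + p["b"] - y) ** 2))
scales = {"w": 1.0, "b": 0.5}   # hypothetical layer-wise factors
for _ in range(200):
    params = zo_step(params, loss, scales, rng=rng)
print("final loss:", loss(params))
```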

[295] Uncertainty-Aware Explanations Through Probabilistic Self-Explainable Neural Networks

Jon Vadillo, Roberto Santana, Jose A. Lozano, Marta Kwiatkowska

Main category: cs.LG

TL;DR: Prob-PSENN introduces probabilistic prototypes in self-explainable neural networks, enhancing transparency and capturing explanatory uncertainty.

DetailsMotivation: Addressing the lack of transparency and reliability in Deep Neural Networks for high-stakes applications.

Method: Replaces point estimates for prototypes with probability distributions, enabling end-to-end learning and uncertainty capture.

Result: Provides more meaningful, robust explanations and maintains competitive predictive performance.

Conclusion: Prob-PSENN improves explainability and reliability in neural networks.

Abstract: The lack of transparency of Deep Neural Networks continues to be a limitation that severely undermines their reliability and usage in high-stakes applications. Promising approaches to overcome such limitations are Prototype-Based Self-Explainable Neural Networks (PSENNs), whose predictions rely on the similarity between the input at hand and a set of prototypical representations of the output classes, offering therefore a deep, yet transparent-by-design, architecture. In this paper, we introduce a probabilistic reformulation of PSENNs, called Prob-PSENN, which replaces point estimates for the prototypes with probability distributions over their values. This provides not only a more flexible framework for an end-to-end learning of prototypes, but can also capture the explanatory uncertainty of the model, which is a missing feature in previous approaches. In addition, since the prototypes determine both the explanation and the prediction, Prob-PSENNs allow us to detect when the model is making uninformed or uncertain predictions, and to obtain valid explanations for them. Our experiments demonstrate that Prob-PSENNs provide more meaningful and robust explanations than their non-probabilistic counterparts, while remaining competitive in terms of predictive performance, thus enhancing the explainability and reliability of the models.

[296] Learning to Reason at the Frontier of Learnability

Thomas Foster, Anya Sims, Johannes Forkel, Mattie Fellows, Jakob Foerster

Main category: cs.LG

TL;DR: The paper introduces a curriculum method for reinforcement learning in LLMs, focusing on questions with high variance of success to improve training efficiency.

DetailsMotivation: Current RL training for LLMs often involves questions that are either always solved or never solved, providing no meaningful learning signal.

Method: Adapts ‘sampling for learnability’ from RL literature to prioritize questions with high variance of success during training.

Result: The curriculum method consistently improves training performance across multiple algorithms (PPO, VinePPO) and datasets.

Conclusion: The approach enhances efficiency and effectiveness of RL in LLMs by focusing on learnable questions.

Abstract: Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.
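
The curriculum criterion is concrete enough to sketch: the variance of a Bernoulli success rate is p(1 - p), which peaks at p = 0.5 and vanishes for always-solved or never-solved questions. A minimal sampling sketch under that reading (function names are illustrative):

```python
import numpy as np

def learnability_weights(successes, attempts, eps=1e-6):
    """Weight each question by the variance of its Bernoulli success rate,
    p * (1 - p): highest for questions solved sometimes but not always,
    zero for always-solved or never-solved questions."""
    p = successes / np.maximum(attempts, 1)
    w = p * (1 - p) + eps          # eps keeps unseen questions samplable
    return w / w.sum()

# Example: 5 questions, 8 attempts each in the last training step.
successes = np.array([8, 0, 4, 6, 1])   # per-question solved counts
attempts  = np.full(5, 8)
w = learnability_weights(successes, attempts)
rng = np.random.default_rng(0)
batch = rng.choice(len(w), size=32, p=w, replace=True)
print(np.round(w, 3))   # mass concentrates on questions 2, 3, and 4
```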

[297] Two-Stage Pretraining for Molecular Property Prediction in the Wild

Kevin Tirta Wijaya, Minghao Guo, Michael Sun, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

Main category: cs.LG

TL;DR: MoleVers is a pretrained molecular model for property prediction with scarce labels, using a two-stage pretraining strategy for state-of-the-art performance.

DetailsMotivation: Labels for molecular properties are scarce due to expensive and time-consuming lab experiments, necessitating models that work with limited labeled data.

Method: Two-stage pretraining: 1) learning representations from unlabeled data via masked atom prediction and extreme denoising, 2) refining with auxiliary property predictions from computational methods.

Result: Achieves state-of-the-art performance on 22 small, experimentally-validated datasets.

Conclusion: MoleVers’ two-stage framework effectively produces generalizable molecular representations for diverse property predictions.

Abstract: Molecular deep learning models have achieved remarkable success in property prediction, but they often require large amounts of labeled data. The challenge is that, in real-world applications, labels are extremely scarce, as obtaining them through laboratory experimentation is both expensive and time-consuming. In this work, we introduce MoleVers, a versatile pretrained molecular model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data through masked atom prediction and extreme denoising, a novel task enabled by our newly introduced branching encoder architecture and dynamic noise scale sampling. In the second stage, the model refines these representations through predictions of auxiliary properties derived from computational methods, such as the density functional theory or large language models. Evaluation on 22 small, experimentally-validated datasets demonstrates that MoleVers achieves state-of-the-art performance, highlighting the effectiveness of its two-stage framework in producing generalizable molecular representations for diverse downstream properties.

[298] Policy Verification in Stochastic Dynamical Systems Using Logarithmic Neural Certificates

Thom Badings, Wietze Koops, Sebastian Junges, Nils Jansen

Main category: cs.LG

TL;DR: The paper introduces logarithmic RASMs and a method for tighter Lipschitz constant bounds to verify neural network policies for stochastic systems with high threshold probabilities.

DetailsMotivation: Existing methods struggle with large Lipschitz constants, especially for high threshold probabilities in reach-avoid specifications.

Method: Uses learner-verifier procedure with neural network certificates, introduces logRASMs for smaller values, and computes tighter Lipschitz bounds via weighted norms.

Result: Empirical evaluation shows verification of reach-avoid specifications with probabilities up to 99.9999%.

Conclusion: The proposed logRASMs and weighted norm method effectively reduce Lipschitz constants, enabling verification of high-probability specifications.

Abstract: We consider the verification of neural network policies for discrete-time stochastic systems with respect to reach-avoid specifications. We use a learner-verifier procedure that learns a certificate for the specification, represented as a neural network. Verifying that this neural network certificate is a so-called reach-avoid supermartingale (RASM) proves the satisfaction of a reach-avoid specification. Existing approaches for such a verification task rely on computed Lipschitz constants of neural networks. These approaches struggle with large Lipschitz constants, especially for reach-avoid specifications with high threshold probabilities. We present two key contributions to obtain smaller Lipschitz constants than existing approaches. First, we introduce logarithmic RASMs (logRASMs), which take exponentially smaller values than RASMs and hence have lower theoretical Lipschitz constants. Second, we present a fast method to compute tighter upper bounds on Lipschitz constants based on weighted norms. Our empirical evaluation shows we can consistently verify the satisfaction of reach-avoid specifications with probabilities as high as 99.9999%.

[299] Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models

Konstantin Donhauser, Kristina Ulicna, Gemma Elyse Moran, Aditya Ravuri, Kian Kenyon-Dean, Cian Eastwood, Jason Hartford

Main category: cs.LG

TL;DR: Sparse dictionary learning (DL) is applied to extract meaningful concepts from vision foundation models trained on cell microscopy images, using a novel method combining Iterative Codebook Feature Learning (ICFL) and PCA whitening.

DetailsMotivation: To explore if DL can extract interpretable concepts from less human-interpretable scientific data, like cell microscopy images, where prior knowledge is limited.

Method: A combination of sparse DL (ICFL) and PCA whitening pre-processing derived from control data.

Result: Successfully retrieves biologically meaningful concepts (e.g., cell types, genetic perturbations) and reveals subtle morphological changes from interventions.

Conclusion: The method offers a promising direction for scientific discovery via mechanistic interpretability in bioimaging.

Abstract: Sparse dictionary learning (DL) has emerged as a powerful approach to extract semantically meaningful concepts from the internals of large language models (LLMs) trained mainly in the text domain. In this work, we explore whether DL can extract meaningful concepts from less human-interpretable scientific data, such as vision foundation models trained on cell microscopy images, where limited prior knowledge exists about which high-level concepts should arise. We propose a novel combination of a sparse DL algorithm, Iterative Codebook Feature Learning (ICFL), with a PCA whitening pre-processing step derived from control data. Using this combined approach, we successfully retrieve biologically meaningful concepts, such as cell types and genetic perturbations. Moreover, we demonstrate how our method reveals subtle morphological changes arising from human-interpretable interventions, offering a promising new direction for scientific discovery via mechanistic interpretability in bioimaging.
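
A rough sketch of the two-part pipeline, with scikit-learn's generic sparse dictionary learner standing in for ICFL (the paper's own algorithm) and whitening statistics fit on control embeddings only; all array names are placeholders:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
controls = rng.standard_normal((2000, 64))   # stand-in control embeddings
treated  = rng.standard_normal((2000, 64))   # stand-in perturbed embeddings

# Fit PCA whitening on the control population only.
mu = controls.mean(0)
U, S, Vt = np.linalg.svd(controls - mu, full_matrices=False)
W = Vt.T / (S / np.sqrt(len(controls) - 1))  # whitening matrix (d x d)
treated_w = (treated - mu) @ W

# Sparse dictionary learning on the whitened embeddings; each code
# dimension is a candidate "concept" activation.
dl = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
codes = dl.fit_transform(treated_w)
print(codes.shape, (codes != 0).mean())      # code matrix and its sparsity
```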

[300] TensorSocket: Shared Data Loading for Deep Learning Training

Ties Robroek, Neil Kim Nielsen, Pınar Tözün

Main category: cs.LG

TL;DR: TensorSocket reduces computational needs in deep learning training by enabling simultaneous processes to share a data loader, improving efficiency and reducing costs.

DetailsMotivation: The repetitive and resource-intensive nature of deep learning training, especially in hyper-parameter tuning and architecture search, creates inefficiencies and high costs due to redundant data processing.

Method: TensorSocket allows collocated training workloads to share the same data loader, reducing redundant computations and leveraging GPU-GPU interconnects. It supports differently-sized models and multiple batch sizes.

Result: TensorSocket increases training throughput by up to 100%, reduces CPU resource needs by 50%, and outperforms state-of-the-art solutions like CoorDL and Joader.

Conclusion: TensorSocket is a scalable, efficient solution for shared data loading in deep learning, offering significant performance and cost benefits.

Abstract: Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before landing on a set of parameters (e.g., hyper-parameter tuning) and model architecture (e.g., neural architecture search), among other things that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources. In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket is able to train and balance differently-sized models and serve multiple batch sizes simultaneously and is hardware- and pipeline-agnostic in nature. Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms the state-of-the-art solutions for shared data loading such as CoorDL and Joader; it is easier to deploy and maintain and either achieves higher or matches their throughput while requiring fewer CPU resources.

[301] On Logical Extrapolation for Mazes with Recurrent and Implicit Networks

Brandon Knutson, Amandin Chyba Rabeendran, Michael Ivanitskiy, Jordan Pettyjohn, Cecilia Diniz-Behn, Samy Wu Fung, Daniel McKenzie

Main category: cs.LG

TL;DR: Neural networks (RNNs and INNs) can generalize from easy to harder task instances, but it remains unclear whether they learn truly scalable algorithms, especially in maze solving. More diverse training data fixes some failure modes but does not improve logical extrapolation. Convergence behavior varies across models, and analyzing extrapolation dynamics may help design better extrapolators.

DetailsMotivation: To investigate whether neural networks (RNNs and INNs) truly learn scalable algorithms for tasks like maze-solving, and to understand their generalization and failure modes.

Method: Tested models on maze-solving tasks with varied data (e.g., maze size, diversity). Analyzed convergence behavior and extrapolation dynamics.

Result: Models showed mixed success, with some learning approximate algorithms (e.g., deadend-filling). Diverse training addressed some failures but not extrapolation. Convergence behavior varied, with some models exhibiting limit cycles.

Conclusion: Logical extrapolation is prone to goal misgeneralization, and analyzing extrapolation dynamics could improve model design.

Abstract: Recent work suggests that certain neural network architectures – particularly recurrent neural networks (RNNs) and implicit neural networks (INNs) – are capable of logical extrapolation. When trained on easy instances of a task, these networks (henceforth: logical extrapolators) can generalize to more difficult instances. Previous research has hypothesized that logical extrapolators do so by learning a scalable, iterative algorithm for the given task which converges to the solution. We examine this idea more closely in the context of a single task: maze solving. By varying test data along multiple axes – not just maze size – we show that models introduced in prior work fail in a variety of ways, some expected and others less so. It remains uncertain whether any of these models has truly learned an algorithm. However, we provide evidence that a certain RNN has approximately learned a form of 'deadend-filling'. We show that training these models on more diverse data addresses some failure modes but, paradoxically, does not improve logical extrapolation. We also analyze convergence behavior, and show that models explicitly trained to converge to a fixed point are likely to do so when extrapolating, while models that are not may exhibit more exotic limiting behavior such as limit cycles, even when they correctly solve the problem. Our results (i) show that logical extrapolation is not immune to the problem of goal misgeneralization, and (ii) suggest that analyzing the dynamics of extrapolation may yield insights into designing better logical extrapolators.
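
For reference, dead-end filling itself takes only a few lines to state: repeatedly seal open cells that have at most one open neighbor, until none remain. A minimal sketch on a toy grid maze (the 1 = open, 0 = wall convention is assumed here, not taken from the paper):

```python
import numpy as np

def deadend_fill(grid, start, goal):
    """Classic dead-end filling: seal open cells with <= 1 open neighbor
    (except start/goal) until a fixed point; the cells that survive
    contain every start-goal path."""
    g = grid.copy()
    changed = True
    while changed:
        changed = False
        for r in range(g.shape[0]):
            for c in range(g.shape[1]):
                if g[r, c] == 1 and (r, c) not in (start, goal):
                    nbrs = sum(g[rr, cc]
                               for rr, cc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                               if 0 <= rr < g.shape[0] and 0 <= cc < g.shape[1])
                    if nbrs <= 1:          # dead end: seal it off
                        g[r, c] = 0
                        changed = True
    return g

maze = np.array([[1, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 1, 1, 1],
                 [1, 0, 0, 1]])
print(deadend_fill(maze, start=(0, 0), goal=(3, 3)))
```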

[302] MUSO: Achieving Exact Machine Unlearning in Over-Parameterized Regimes

Ruikai Yang, Mingzhen He, Zhengbao He, Youmei Qiu, Xiaolin Huang

Main category: cs.LG

TL;DR: The paper explores whether relabeling and fine-tuning can achieve exact machine unlearning (MU) in parameter space for over-parameterized models, proving it’s possible for linear models and extending to nonlinear networks with a proposed algorithm.

DetailsMotivation: To determine if relabeling and fine-tuning can achieve exact MU in parameter space, not just output space, for over-parameterized models like neural networks.

Method: Uses random feature techniques for an analytical framework, proves exact MU for linear models via stochastic gradient descent, and extends to nonlinear networks with an alternating optimization algorithm.

Result: Theoretical proof for exact MU in linear models and successful numerical experiments for nonlinear networks, outperforming state-of-the-art methods.

Conclusion: Relabeling and fine-tuning can achieve exact MU in parameter space, with the proposed algorithm offering superior performance in unlearning tasks.

Abstract: Machine unlearning (MU) aims to make a well-trained model behave as if it had never been trained on specific data. In today’s over-parameterized models, dominated by neural networks, a common approach is to manually relabel data and fine-tune the well-trained model. It can approximate the MU model in the output space, but the question remains whether it can achieve exact MU, i.e., in the parameter space. We answer this question by employing random feature techniques to construct an analytical framework. Under the premise of model optimization via stochastic gradient descent, we theoretically demonstrate that over-parameterized linear models can achieve exact MU through relabeling specific data. We also extend this work to real-world nonlinear networks and propose an alternating optimization algorithm that unifies the tasks of unlearning and relabeling. The algorithm’s effectiveness, confirmed through numerical experiments, highlights its superior performance in unlearning across various scenarios compared to current state-of-the-art methods, particularly excelling over similar relabeling-based MU approaches.

[303] VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting

Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, Lei Bai

Main category: cs.LG

TL;DR: VAMoE is a novel framework for incremental weather forecasting that dynamically adapts to spatiotemporal patterns, reducing computational costs while maintaining accuracy.

DetailsMotivation: Traditional weather models face high computational costs and require frequent updates. VAMoE aims to address these issues by adapting to real-time data efficiently.

Method: VAMoE uses a hybrid architecture of experts, each specializing in atmospheric variables, and a variable adaptive gating mechanism to dynamically select and combine experts.

Result: Experiments show VAMoE matches state-of-the-art models in accuracy for short and long-term forecasts, using fewer parameters and less training data.

Conclusion: VAMoE offers an efficient and accurate solution for incremental weather forecasting, reducing computational overhead without sacrificing performance.

Abstract: This paper presents Variables Adaptive Mixture of Experts (VAMoE), a novel framework for incremental weather forecasting that dynamically adapts to evolving spatiotemporal patterns in real-time data. Traditional weather prediction models often struggle with exorbitant computational expenditure and the need to continuously update forecasts as new observations arrive. VAMoE addresses these challenges by leveraging a hybrid architecture of experts, where each expert specializes in capturing distinct subpatterns of atmospheric variables (temperature, humidity, wind speed). Moreover, the proposed method employs a variable-adaptive gating mechanism to dynamically select and combine relevant experts based on the input context, enabling efficient knowledge distillation and parameter sharing. This design significantly reduces computational overhead while maintaining high forecast accuracy. Experiments on the real-world ERA5 dataset demonstrate that VAMoE performs comparably to SoTA models in both short-term (1 day) and long-term (5 day) forecasting tasks, with only about 25% of trainable parameters and 50% of the initial training data.

[304] Critiques of World Models

Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

Main category: cs.LG

TL;DR: The paper critiques existing world model theories, proposes a new architecture for a general-purpose world model, and envisions a PAN AGI system.

DetailsMotivation: The rising need for virtual agents with artificial intelligence drives the exploration of world models, their construction, and evaluation.

Method: Critiques existing world model theories, proposes a hierarchical, multi-level, mixed continuous/discrete representation, and a generative self-supervision framework.

Result: A new architecture for a general-purpose world model is introduced, aiming to simulate actionable possibilities for reasoning and acting.

Conclusion: The proposed PAN AGI system, enabled by the new world model, offers a promising direction for future AGI development.

Abstract: World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thought on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

[305] $ε$-rank and the Staircase Phenomenon: New Insights into Neural Network Training Dynamics

Jiang Yang, Yuxiang Zhao, Quanhui Zhu

Main category: cs.LG

TL;DR: Introduces ε-rank, a metric for the effective features of the terminal hidden layer, identifies a universal staircase phenomenon in which loss drops accompany ε-rank increases during training, and proposes a pre-training strategy that raises ε-rank to cut training time and improve accuracy.

DetailsMotivation: Understanding how deep neural networks evolve low-dimensional features from high-dimensional data during training remains a central challenge in deep learning theory.

Method: Defines the ε-rank of terminal-hidden-layer neuron functions, tracks it empirically across diverse tasks trained with SGD, proves a negative correlation between the loss lower bound and ε-rank, and proposes a pre-training strategy on the initial hidden layer that elevates the terminal layer's ε-rank.

Result: Training exhibits a universal staircase pattern in which loss reduction accompanies ε-rank increases; the proposed pre-training strategy reduces training time and improves accuracy across various tasks.

Conclusion: ε-rank is a computable, intrinsic metric for understanding neural network training dynamics and a foundation for designing efficient training strategies.

Abstract: Understanding the training dynamics of deep neural networks (DNNs), particularly how they evolve low-dimensional features from high-dimensional data, remains a central challenge in deep learning theory. In this work, we introduce the concept of $\epsilon$-rank, a novel metric quantifying the effective features of neuron functions in the terminal hidden layer. Through extensive experiments across diverse tasks, we observe a universal staircase phenomenon: during training with standard stochastic gradient descent methods, the decline of the loss function is accompanied by an increase in the $\epsilon$-rank and exhibits a staircase pattern. Theoretically, we rigorously prove a negative correlation between the loss lower bound and $\epsilon$-rank, demonstrating that a high $\epsilon$-rank is essential for significant loss reduction. Moreover, numerical evidence shows that within the same deep neural network, the $\epsilon$-rank of each subsequent hidden layer is higher than that of the previous hidden layer. Based on these observations, to eliminate the staircase phenomenon, we propose a novel pre-training strategy on the initial hidden layer that elevates the $\epsilon$-rank of the terminal hidden layer. Numerical experiments validate its effectiveness in reducing training time and improving accuracy across various tasks. Therefore, the newly introduced concept of $\epsilon$-rank is a computable quantity that serves as an intrinsic effective metric characteristic for deep neural networks, providing a novel perspective for understanding the training dynamics of neural networks and offering a theoretical foundation for designing efficient training strategies in practical applications.
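
The paper defines $\epsilon$-rank precisely; one plausible concrete reading, sketched below, counts the normalized singular values of the terminal-hidden-layer feature matrix that exceed ε. The staircase is mimicked here by adding rank-one feature components one at a time:

```python
import numpy as np

def epsilon_rank(features, eps=0.01):
    """One plausible reading of the paper's eps-rank (the exact definition
    is given in the paper): the number of singular values of the terminal
    hidden-layer feature matrix exceeding eps, after normalizing by the
    largest singular value."""
    s = np.linalg.svd(features, compute_uv=False)
    return int((s / s.max() > eps).sum())

# Toy staircase: each new rank-one feature component raises the eps-rank by 1.
rng = np.random.default_rng(0)
comps = [np.outer(rng.standard_normal(256), rng.standard_normal(32))
         for _ in range(4)]
for k in range(1, 5):
    H = sum(comps[:k]) + 1e-3 * rng.standard_normal((256, 32))
    print(f"{k} feature components: eps-rank = {epsilon_rank(H)}")
```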

[306] AI-Accelerated Flow Simulation: A Robust Auto-Regressive Framework for Long-Term CFD Forecasting

Sunwoong Yang, Ricardo Vinuesa, Namwoo Kang

Main category: cs.LG

TL;DR: The study tackles error accumulation in spatio-temporal AR predictions by introducing the two-step Adams-Bashforth method and adaptive multi-step rollout strategies, showing significant improvements in accuracy and robustness.

DetailsMotivation: Error accumulation in spatio-temporal AR predictions poses a critical challenge in scientific machine learning, necessitating more stable and accurate methods.

Method: Implemented the two-step Adams-Bashforth method for AR prediction and developed three adaptive weighting strategies for multi-step rollout training, validated on 2D PDEs and Navier-Stokes dynamics.

Result: The Adams-Bashforth scheme and adaptive strategies improved accuracy (89% over fixed-weight methods) and robustness, even with lightweight models and limited data.

Conclusion: The integrated methodology significantly outperforms conventional techniques, demonstrating robustness and efficiency in complex scenarios.

Abstract: This study addresses the critical challenge of error accumulation in spatio-temporal auto-regressive (AR) predictions within scientific machine learning models by exploring temporal integration schemes and adaptive multi-step rollout strategies. We introduce the first implementation of the two-step Adams-Bashforth method specifically tailored for data-driven AR prediction, leveraging historical derivative information to enhance numerical stability without additional computational overhead. To validate our approach, we systematically evaluate time integration schemes across canonical 2D PDEs before extending to complex Navier-Stokes cylinder vortex shedding dynamics. Additionally, we develop three novel adaptive weighting strategies that dynamically adjust the importance of different future time steps during multi-step rollout training. Our analysis reveals that as physical complexity increases, such sophisticated rollout techniques become essential, with the Adams-Bashforth scheme demonstrating consistent robustness across investigated systems and our best adaptive approach delivering an 89% improvement over conventional fixed-weight methods while maintaining similar computational costs. For the complex Navier-Stokes vortex shedding problem, despite using an extremely lightweight graph neural network with just 1,177 trainable parameters and training on only 50 snapshots, our framework accurately predicts 350 future time steps reducing mean squared error from 0.125 (single-step direct prediction) to 0.002 (Adams-Bashforth with proposed multi-step rollout). Our integrated methodology demonstrates an 83% improvement over standard noise injection techniques and maintains robustness under severe spatial constraints; specifically, when trained on only a partial spatial domain, it still achieves 58% and 27% improvements over direct prediction and forward Euler methods, respectively.
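
The two-step Adams-Bashforth update is explicit: u_{n+1} = u_n + Δt (3/2 f(u_n) - 1/2 f(u_{n-1})), so each rollout step reuses the previous derivative at no extra model-evaluation cost. A minimal sketch, with a known derivative standing in for the learned model:

```python
import numpy as np

def ab2_rollout(model, u0, u1, dt, n_steps):
    """Auto-regressive rollout with the two-step Adams-Bashforth scheme:
    u_{n+1} = u_n + dt * (1.5 * f(u_n) - 0.5 * f(u_{n-1})), where f is the
    learned time-derivative model. The previous derivative is cached, so
    the added stability costs no extra model evaluations per step."""
    traj = [u0, u1]
    f_prev = model(u0)
    for _ in range(n_steps):
        f_curr = model(traj[-1])
        traj.append(traj[-1] + dt * (1.5 * f_curr - 0.5 * f_prev))
        f_prev = f_curr
    return np.stack(traj)

# Sanity check on a system with a known derivative: du/dt = -u.
model = lambda u: -u
traj = ab2_rollout(model, u0=np.array([1.0]), u1=np.array([np.exp(-0.01)]),
                   dt=0.01, n_steps=500)
print(traj[-1], np.exp(-0.01 * 501))   # AB2 tracks the exponential decay
```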

[307] An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain

Main category: cs.LG

TL;DR: A gradient-based method for estimating Dynamic Discrete Choice models without linear reward assumptions, using Empirical Risk Minimization and compatible with non-parametric techniques like neural networks.

DetailsMotivation: To recover reward or Q* functions from offline behavior data without restrictive assumptions, enabling scalability to high-dimensional spaces.

Method: Proposes a globally convergent gradient-based method using Empirical Risk Minimization, avoiding explicit state transition probability estimation and leveraging the Polyak-Lojasiewicz condition.

Result: Outperforms benchmark methods in synthetic experiments, demonstrating fast global convergence.

Conclusion: The method is scalable, theoretically sound, and effective for high-dimensional, infinite state spaces.

Abstract: We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition – a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.

[308] Exploiting Label Skewness for Spiking Neural Networks in Federated Learning

Di Yu, Xin Du, Linshan Jiang, Huijing Zhang, Shuiguang Deng

Main category: cs.LG

TL;DR: FedLEC improves global SNN model accuracy by 11.59% via label weight calibration and knowledge distillation in federated learning.

DetailsMotivation: Addressing label-skewed data issues in federated learning for SNNs on edge devices to enhance model performance and privacy.

Method: Proposes FedLEC with intra-client label weight calibration and inter-client knowledge distillation to balance learning and mitigate bias.

Result: Achieves 11.59% higher accuracy than state-of-the-art FL algorithms across diverse datasets.

Conclusion: FedLEC effectively tackles label skew in federated SNNs, improving global model performance.

Abstract: The energy efficiency of deep spiking neural networks (SNNs) aligns with the constraints of resource-limited edge devices, positioning SNNs as a promising foundation for intelligent applications leveraging the extensive data collected by these devices. To address data privacy concerns when deploying SNNs on edge devices, federated learning (FL) facilitates collaborative model training by leveraging data distributed across edge devices without transmitting local data to a central server. However, existing FL approaches struggle with label-skewed data across devices, which leads to drift in local SNN models and degrades the performance of the global SNN model. In this paper, we propose a novel framework called FedLEC, which incorporates intra-client label weight calibration to balance the learning intensity across local labels and inter-client knowledge distillation to mitigate local SNN model bias caused by label absence. Extensive experiments with three different structured SNNs across five datasets (i.e., three non-neuromorphic and two neuromorphic datasets) demonstrate the efficiency of FedLEC. Compared to eight state-of-the-art FL algorithms, FedLEC achieves an average accuracy improvement of approximately 11.59% for the global SNN model under various label skew distribution settings.
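
FedLEC's calibration formula is the paper's own; as a hypothetical stand-in in the same spirit, the sketch below uses effective-number class weights to boost rare local labels and damp dominant ones on a label-skewed client (all names and the beta parameter are illustrative):

```python
import numpy as np

def label_weights(labels, n_classes, beta=0.999):
    """Hypothetical intra-client label-weight calibration: effective-number
    class weights (1 - beta^count) / (1 - beta) that up-weight rare local
    labels; absent classes get zero weight."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    eff = (1.0 - beta ** counts) / (1.0 - beta)      # effective sample count
    w = np.where(counts > 0, 1.0 / np.maximum(eff, 1e-12), 0.0)
    return w * (counts > 0).sum() / w.sum()          # normalize over present classes

# A heavily label-skewed client: mostly class 0, a little class 1, none of 2-4.
client_labels = np.array([0] * 90 + [1] * 10)
print(np.round(label_weights(client_labels, n_classes=5), 3))
```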

[309] Machine learning applications in archaeological practices: a review

Mathias Bellat, Jordy D. Orellana Figueroa, Jonathan S. Reeves, Ruhollah Taghizadeh-Mehrjardi, Claudio Tennie, Thomas Scholten

Main category: cs.LG

TL;DR: The paper reviews 135 articles (1997-2022) on AI/ML in archaeology, noting a rise in publications since 2019, with structure detection and artefact classification as dominant tasks. It highlights gaps like underuse of clustering and issues in method clarity, proposing a workflow guide for better practices.

DetailsMotivation: To assess the prevalence, success, and gaps in AI/ML applications across archaeology, as prior reviews were limited to specific subfields.

Method: Exhaustive review of 135 articles, analyzing trends, tasks, and methods used in AI/ML applications in archaeology.

Result: Increased publications post-2019, dominance of supervised models (ANNs, ensemble learning), and gaps in method clarity and unsupervised techniques.

Conclusion: AI/ML is valuable but requires structured methodologies and collaboration to address current limitations and maximize potential in archaeology.

Abstract: Artificial intelligence and machine learning applications in archaeology have increased significantly in recent years, and these now span all subfields, geographical regions, and time periods. The prevalence and success of these applications have remained largely unexamined, as recent reviews on the use of machine learning in archaeology have focused only on specific subfields of archaeology. Our review examined an exhaustive corpus of 135 articles published between 1997 and 2022. We observed a significant increase in the number of publications from 2019 onwards. Automatic structure detection and artefact classification were the most represented tasks in the articles reviewed, followed by taphonomy, and archaeological predictive modelling. In the reviewed corpus, clustering and unsupervised methods were underrepresented compared to supervised models. Artificial neural networks and ensemble learning account for two thirds of the total number of models used. However, while machine learning models are gaining in popularity, they remain subject to misunderstanding. We observed, in some cases, poorly defined requirements and caveats of the machine learning methods used. Furthermore, the goals and the needs of machine learning applications for archaeological purposes are in some cases unclear or poorly expressed. To address this, we propose a workflow guide for archaeologists to develop coherent and consistent methodologies adapted to their research questions, project scale and data. As in many other areas, machine learning is rapidly becoming an important tool in archaeological research and practice, useful for the analysis of large and multivariate data, although not without limitations. This review highlights the importance of well-defined and well-reported structured methodologies and collaborative practices to maximise the potential of applications of machine learning methods in archaeology.

[310] Can we ease the Injectivity Bottleneck on Lorentzian Manifolds for Graph Neural Networks?

Srinitish Srinivasan, Omkumar CU

Main category: cs.LG

TL;DR: LGIN, a hyperbolic GNN, improves discriminative power by preserving Lorentzian metrics, outperforming Euclidean and hyperbolic baselines.

DetailsMotivation: Address the expressivity gap in hyperbolic GNNs caused by non-injective aggregation, limiting discriminative power compared to Euclidean GNNs or the WL test.

Method: Propose LGIN, a Lorentzian Graph Isomorphic Network, with a new update rule preserving Lorentzian metrics to capture richer structural information.

Result: LGIN consistently outperforms or matches state-of-the-art hyperbolic and Euclidean baselines across nine benchmark datasets.

Conclusion: LGIN advances expressive GNNs on Riemannian manifolds, adapting powerful GNN principles to hyperbolic space for the first time.

Abstract: While hyperbolic GNNs show promise for hierarchical data, they often have limited discriminative power compared to Euclidean counterparts or the WL test, due to non-injective aggregation. To address this expressivity gap, we propose the Lorentzian Graph Isomorphic Network (LGIN), a novel HGNN designed for enhanced discrimination within the Lorentzian model. LGIN introduces a new update rule that preserves the Lorentzian metric while effectively capturing richer structural information. This marks a significant step towards more expressive GNNs on Riemannian manifolds. Extensive evaluations across nine benchmark datasets demonstrate LGIN’s superior performance, consistently outperforming or matching state-of-the-art hyperbolic and Euclidean baselines, showcasing its ability to capture complex graph structures. LGIN is the first to adapt principles of powerful, highly discriminative GNN architectures to a Riemannian manifold. The code for our paper can be found at https://github.com/Deceptrax123/LGIN
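
For context, the Lorentzian metric that LGIN's update rule preserves is the bilinear form ⟨x, y⟩_L = -x₀y₀ + Σᵢ xᵢyᵢ on the hyperboloid model. A minimal sketch of the inner product, the induced geodesic distance, and a lift from Euclidean coordinates (LGIN's actual layers are not shown):

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x_0 y_0 + sum_i x_i y_i,
    the bilinear form the LGIN update rule is designed to preserve."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_distance(x, y, k=1.0):
    """Geodesic distance on the hyperboloid <x, x>_L = -1/k."""
    inner = np.clip(-k * lorentz_inner(x, y), 1.0, None)  # clamp for arccosh
    return np.arccosh(inner) / np.sqrt(k)

def to_hyperboloid(v, k=1.0):
    """Lift a Euclidean vector v onto the hyperboloid by solving for x_0."""
    x0 = np.sqrt(1.0 / k + (v * v).sum(-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

x = to_hyperboloid(np.array([0.3, -0.1]))
y = to_hyperboloid(np.array([-0.2, 0.4]))
print(lorentz_inner(x, x))     # approx -1: x lies on the manifold
print(lorentz_distance(x, y))  # geodesic distance between the two points
```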

[311] Load Forecasting for Households and Energy Communities: Are Deep Learning Models Worth the Effort?

Lukas Moosbrugger, Valentin Seiler, Philipp Wohlgenannt, Sebastian Hegenbart, Sashko Ristov, Elias Eder, Peter Kepplinger

Main category: cs.LG

TL;DR: The study evaluates deep learning models (LSTM, xLSTM, Transformer) vs. traditional methods (KNN, persistence) for short-term load forecasting in energy communities. Transfer learning improves accuracy, but simpler models outperform deep learning with limited data. Practical benefits include cost savings (8.06% with deep learning, 8.01% with KNN).

DetailsMotivation: Energy communities need accurate load forecasting for demand-side management, but deep learning's effectiveness in practical contexts is understudied.

Method: Comparison of deep learning models (LSTM, xLSTM, Transformer) and traditional methods (KNN, persistence) under varying conditions (community size, data availability, complexity). Transfer learning with synthetic data is also tested.

Result: Transfer learning improves accuracy by 1.97 percentage points with limited data. Persistence models outperform deep learning with less than six months of data. Deep learning reduces costs by 8.06%, but KNN is nearly as effective (8.01%).

Conclusion: Deep learning is beneficial with sufficient data, but simpler methods like KNN are robust alternatives. Findings provide practical insights for energy communities.

Abstract: Energy communities (ECs) play a key role in enabling local demand shifting and enhancing self-sufficiency, as energy systems transition toward decentralized structures with high shares of renewable generation. To operate them optimally, accurate short-term load forecasting is essential, particularly for implementing demand-side management strategies. With the recent rise of deep learning methods, data-driven forecasting has gained significant attention; however, it remains insufficiently explored in many practical contexts. Therefore, this study evaluates the effectiveness of state-of-the-art deep learning models, including LSTM, xLSTM, and Transformer architectures, compared to traditional benchmarks such as K-Nearest Neighbors (KNN) and persistence forecasting, across varying community sizes, historical data availability, and model complexity. Additionally, we assess the benefits of transfer learning using publicly available synthetic load profiles. On average, transfer learning improves the normalized mean absolute error by 1.97 percentage points when only two months of training data are available. Interestingly, for less than six months of training data, simple persistence models outperform deep learning architectures in forecast accuracy. The practical value of improved forecasting is demonstrated using a mixed-integer linear programming optimization for ECs with a shared battery energy storage system. For an energy community with 50 households, the most accurate deep learning model achieves an average reduction in financial energy costs of 8.06%. Notably, a simple KNN approach achieves average savings of 8.01%, making it a competitive and robust alternative. All implementations are publicly available to facilitate reproducibility. These findings offer actionable insights for ECs, and they highlight when the additional complexity of deep learning is warranted by performance gains.
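
The two strongest simple baselines in the study are easy to reproduce. A minimal sketch of seasonal persistence and a KNN forecaster on synthetic hourly load; the window, season, and k values are illustrative, not the paper's settings:

```python
import numpy as np

def persistence_forecast(load, horizon=24, season=24 * 7):
    """Seasonal persistence: tomorrow looks like the same hours last week."""
    return load[-season:-season + horizon]

def knn_forecast(load, horizon=24, window=24, k=5):
    """KNN baseline: find the k historical windows most similar to the
    last `window` hours and average what followed each of them."""
    query = load[-window:]
    dists, futures = [], []
    for s in range(len(load) - window - horizon):
        dists.append(np.linalg.norm(load[s:s + window] - query))
        futures.append(load[s + window:s + window + horizon])
    nearest = np.argsort(dists)[:k]
    return np.mean([futures[i] for i in nearest], axis=0)

# Synthetic household-like load: daily cycle plus noise, hourly resolution.
rng = np.random.default_rng(0)
t = np.arange(24 * 60)
load = 1.0 + 0.5 * np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(len(t))
print(persistence_forecast(load)[:4])
print(knn_forecast(load)[:4])
```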

[312] DP2Unlearning: An Efficient and Guaranteed Unlearning Framework for LLMs

Tamim Al Mahmud, Najeeb Jebreel, Josep Domingo-Ferrer, David Sanchez

Main category: cs.LG

TL;DR: DP2Unlearning is a novel framework for LLMs that ensures formal forgetting guarantees at lower cost than retraining, using differential privacy for efficient unlearning.

DetailsMotivation: Address ethical and legal issues in LLMs, like memorization of private/copyrighted data, without costly retraining.

Method: Train LLMs with ε-differential privacy to enable efficient unlearning with formal guarantees.

Result: Achieves performance close to retraining from scratch at half the cost, outperforming approximate unlearning methods.

Conclusion: DP2Unlearning offers a practical, cost-effective solution for unlearning in LLMs with strong guarantees.

Abstract: Large language models (LLMs) have recently revolutionized language processing tasks but have also brought ethical and legal issues. LLMs have a tendency to memorize potentially private or copyrighted information present in the training data, which might then be delivered to end users at inference time. When this happens, a naive solution is to retrain the model from scratch after excluding the undesired data. Although this guarantees that the target data have been forgotten, it is also prohibitively expensive for LLMs. Approximate unlearning offers a more efficient alternative, as it consists of ex post modifications of the trained model itself to prevent undesirable results, but it lacks forgetting guarantees because it relies solely on empirical evidence. In this work, we present DP2Unlearning, a novel LLM unlearning framework that offers formal forgetting guarantees at a significantly lower cost than retraining from scratch on the data to be retained. DP2Unlearning involves training LLMs on textual data protected using ε-differential privacy (DP), which later enables efficient unlearning with the guarantees against disclosure associated with the chosen ε. Our experiments demonstrate that DP2Unlearning achieves similar model performance post-unlearning, compared to an LLM retrained from scratch on the retained data – the gold standard of exact unlearning – but at approximately half the unlearning cost. In addition, with a reasonable computational cost, it outperforms approximate unlearning methods at both preserving the utility of the model post-unlearning and effectively forgetting the targeted information.
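
DP2Unlearning's full training recipe is the paper's, but the DP-SGD primitive that ε-DP training typically builds on (per-example gradient clipping plus calibrated Gaussian noise) is standard. A minimal sketch, omitting the privacy accountant that converts the noise multiplier into a concrete (ε, δ) guarantee:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.0,
                rng=None):
    """One DP-SGD step: clip each per-example gradient to L2 norm `clip`,
    average, and add Gaussian noise with std noise_mult * clip / batch_size.
    The (eps, delta) guarantee follows from a privacy accountant over all
    steps (not shown here)."""
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    g = clipped.mean(0)
    g += rng.normal(0.0, noise_mult * clip / len(per_example_grads), size=g.shape)
    return w - lr * g

# Toy usage on one linear-regression batch.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 4)), rng.standard_normal(32)
w = np.zeros(4)
grads = 2 * (X @ w - y)[:, None] * X   # per-example squared-error gradients
print(dp_sgd_step(w, grads, rng=rng))
```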

[313] Position: Untrained Machine Learning for Anomaly Detection by using 3D Point Cloud Data

Juan Du, Dongheng Chen

Main category: cs.LG

TL;DR: The paper addresses untrained anomaly detection in 3D point cloud data, proposing three frameworks for accurate anomaly identification without relying on historical data or labels.

DetailsMotivation: Motivated by real-world scenarios like personalized manufacturing, where only one sample is available, the paper tackles the challenge of detecting anomalies without training data.

Method: Three frameworks are introduced: Latent Variable Inference (probabilistic modeling), Decomposition (sparse learning), and Local Geometry (neighborhood-based).

Result: Untrained methods achieve competitive performance and up to 15x faster execution compared to traditional approaches.

Conclusion: The proposed methods offer practical solutions for data-scarce applications, such as personalized manufacturing and healthcare.

Abstract: Anomaly detection based on 3D point cloud data is an important research problem that has received increasing attention recently. Untrained anomaly detection based on only one sample is an emerging research problem motivated by real manufacturing industries such as personalized manufacturing, where only one sample can be collected without any additional labels or historical datasets. Identifying anomalies accurately based on one 3D point cloud sample is a critical challenge in both industrial applications and the field of machine learning. This paper aims to provide a formal definition of the untrained anomaly detection problem based on 3D point cloud data and to discuss the differences between untrained anomaly detection and current unsupervised anomaly detection problems. Unlike trained unsupervised learning, untrained unsupervised learning does not rely on any data, including unlabeled data; instead, it leverages prior knowledge about the surfaces and anomalies. We propose three complementary methodological frameworks: the Latent Variable Inference Framework that employs probabilistic modeling to distinguish anomalies; the Decomposition Framework that separates point clouds into reference, anomaly, and noise components through sparse learning; and the Local Geometry Framework that leverages neighborhood information for anomaly identification. Experimental results demonstrate that untrained methods achieve competitive detection performance while offering significant computational advantages, demonstrating up to a 15-fold increase in execution speed. The proposed methods provide viable solutions for scenarios with extreme data scarcity, addressing critical challenges in personalized manufacturing and healthcare applications where collecting multiple samples or historical data is infeasible.
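
A minimal instance of the Local Geometry idea: score each point by its offset from the plane that PCA fits to its neighborhood, using no training data at all. The paper's frameworks are more elaborate; the names and parameters here are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def local_geometry_scores(points, k=20):
    """For each point, fit a plane to its k nearest neighbors by PCA and
    score the point by its offset along the local normal direction.
    Large residuals flag surface anomalies."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)   # neighbor 0 is the point itself
    scores = np.empty(len(points))
    for i, nbrs in enumerate(idx[:, 1:]):
        nb = points[nbrs]
        center = nb.mean(0)
        # Smallest principal direction of the neighborhood = local normal.
        _, _, Vt = np.linalg.svd(nb - center, full_matrices=False)
        scores[i] = abs((points[i] - center) @ Vt[-1])
    return scores

# One flat synthetic surface with a single dented point, no training data.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(0, 10, 2000), rng.uniform(0, 10, 2000),
                       0.01 * rng.standard_normal(2000)])
pts[0, 2] += 1.0                           # the anomaly
scores = local_geometry_scores(pts)
print("most anomalous point:", int(np.argmax(scores)))   # -> 0
```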

[314] Prompt-Tuning Bandits: Enabling Few-Shot Generalization for Efficient Multi-Task Offline RL

Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao

Main category: cs.LG

TL;DR: The paper introduces a bandit-based prompt-tuning framework to optimize trajectory prompt selection in offline multi-task RL, improving performance and sample efficiency.

DetailsMotivation: Current prompting methods uniformly sample prompts from expert demonstrations, which limits task differentiation and generalization, especially in low-data settings.

Method: A lightweight, bandit-based prompt-tuning framework is proposed to explore and optimize prompt selection at inference time without fine-tuning the transformer backbone.

Result: Experiments show performance gains, better sample complexity, scalability, and prompt space exploration compared to baselines.

Conclusion: Adaptive prompt selection is crucial for efficient generalization in offline multi-task RL.

Abstract: Prompting has emerged as the dominant paradigm for adapting large, pre-trained transformer-based models to downstream tasks. The Prompting Decision Transformer (PDT) enables large-scale, multi-task offline Reinforcement Learning (RL) pre-training by leveraging stochastic trajectory prompts to identify the target task. However, these prompts are sampled uniformly from expert demonstrations, overlooking a critical limitation: not all prompts are equally informative for differentiating between tasks. This limits generalization and adaptation, especially in low-data or open-world settings where sample efficiency is crucial. To address this issue, we propose a lightweight, inference-time, bandit-based prompt-tuning framework. The bandit explores and optimizes trajectory prompt selection to enhance task performance, while avoiding costly fine-tuning of the transformer backbone. Our experiments indicate not only clear performance gains due to bandit-based prompt-tuning, but also better sample complexity, scalability, and prompt space exploration compared to prompt-tuning baselines. These results highlight the importance of adaptive prompt selection mechanisms for efficient generalization in offline multi-task RL.
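
The abstract does not name the bandit algorithm, so the sketch below uses UCB1 as a representative choice: each arm is one candidate trajectory prompt, and the reward is the episode return of the frozen PDT conditioned on that prompt. `evaluate` stands in for a real rollout:

```python
import numpy as np

def ucb_prompt_selection(prompt_pool, evaluate, n_rounds=200, c=1.0):
    """UCB1 over a fixed pool of candidate prompts. `evaluate(prompt)`
    returns a stochastic episode return from the frozen policy; the
    transformer backbone is never fine-tuned."""
    n_arms = len(prompt_pool)
    counts, values = np.zeros(n_arms), np.zeros(n_arms)
    for t in range(1, n_rounds + 1):
        if t <= n_arms:                  # play every arm once first
            a = t - 1
        else:
            ucb = values + c * np.sqrt(np.log(t) / counts)
            a = int(np.argmax(ucb))
        r = evaluate(prompt_pool[a])
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # running mean reward
    return int(np.argmax(values))

# Toy setup: 5 prompts whose true (noisy) returns differ.
rng = np.random.default_rng(0)
true_returns = np.array([0.2, 0.5, 0.8, 0.4, 0.6])
evaluate = lambda p: true_returns[p] + 0.1 * rng.standard_normal()
best = ucb_prompt_selection(prompt_pool=list(range(5)), evaluate=evaluate)
print("selected prompt:", best)   # converges on prompt 2
```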

[315] Understanding Reasoning in Thinking Language Models via Steering Vectors

Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda

Main category: cs.LG

TL;DR: A steering method for controlling reasoning behaviors in thinking LLMs is introduced, using linear directions in activation space to modulate processes like backtracking and uncertainty expression.

DetailsMotivation: To address the challenge of controlling reasoning processes in thinking LLMs, which exhibit behaviors like uncertainty expression and backtracking.

Method: Analyze and manipulate reasoning behaviors in DeepSeek-R1-Distill models using steering vectors derived from activation space.

Result: Demonstrated consistent control over reasoning behaviors across different model architectures.

Conclusion: The approach provides interpretable and practical tools for steering reasoning in thinking LLMs.

Abstract: Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model’s activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model’s reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
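
Steering with linear directions typically means extracting a difference-of-means vector and adding it to hidden states at inference. A minimal sketch of that standard recipe, with random tensors standing in for real activations (the paper's extraction may differ in detail):

```python
import torch

def extract_steering_vector(acts_with, acts_without):
    """Difference-of-means steering vector: mean activation on responses
    exhibiting a behavior (e.g., backtracking) minus the mean on responses
    that do not."""
    return acts_with.mean(0) - acts_without.mean(0)

def steer_hook(vector, alpha=4.0):
    """Forward hook adding alpha * vector to a layer's hidden states;
    a negative alpha suppresses the behavior instead of amplifying it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Toy demonstration with random "activations" of hidden size 16.
torch.manual_seed(0)
acts_backtrack = torch.randn(100, 16) + torch.tensor([1.0] + [0.0] * 15)
acts_plain = torch.randn(100, 16)
v = extract_steering_vector(acts_backtrack, acts_plain)
layer = torch.nn.Linear(16, 16)               # stand-in for a model layer
handle = layer.register_forward_hook(steer_hook(v, alpha=2.0))
print(layer(torch.randn(4, 16)).shape)        # steered forward pass
handle.remove()
```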

[316] Architect of the Bits World: Masked Autoregressive Modeling for Circuit Generation Guided by Truth Table

Haoyuan Wu, Haisheng Zheng, Shoubo Hu, Zhuolun He, Bei Yu

Main category: cs.LG

TL;DR: A novel approach combining conditional generative models with differentiable architecture search (DAS) for optimized logic synthesis, addressing traditional heuristic limitations and DAS challenges.

DetailsMotivation: Traditional logic synthesis tools rely on human-designed heuristics, often yielding suboptimal results, while DAS faces computational complexity and convergence issues.

Method: Proposes CircuitVQ (a circuit tokenizer) and CircuitAR (a masked autoregressive model) to generate preliminary circuit structures from truth tables, guiding DAS for precise circuit generation.

Result: Demonstrates scalability and emergent capability in generating complex circuits, with superior performance in experiments.

Conclusion: Bridges probabilistic generative models and precise circuit generation, offering a robust solution for logic synthesis.

Abstract: Logic synthesis, a critical stage in electronic design automation (EDA), optimizes gate-level circuits to minimize power consumption and area occupancy in integrated circuits (ICs). Traditional logic synthesis tools rely on human-designed heuristics, often yielding suboptimal results. Although differentiable architecture search (DAS) has shown promise in generating circuits from truth tables, it faces challenges such as high computational complexity, convergence to local optima, and extensive hyperparameter tuning. Consequently, we propose a novel approach integrating conditional generative models with DAS for circuit generation. Our approach first introduces CircuitVQ, a circuit tokenizer trained based on our Circuit AutoEncoder. We then develop CircuitAR, a masked autoregressive model leveraging CircuitVQ as the tokenizer. CircuitAR can generate preliminary circuit structures from truth tables, which guide DAS in producing functionally equivalent circuits. Notably, we observe scalability and emergent capability in our CircuitAR models when generating complex circuit structures. Extensive experiments also show the superior performance of our method. This research bridges the gap between probabilistic generative models and precise circuit generation, offering a robust solution for logic synthesis.

[317] Deep Q-Learning with Gradient Target Tracking

Bum Geun Park, Taeho Lee, Donghwan Lee

Main category: cs.LG

TL;DR: The paper introduces gradient-based target tracking methods (AGT2-DQN and SGT2-DQN) as alternatives to hard updates in DQN, eliminating manual tuning and improving performance.

DetailsMotivation: The hard update mechanism in DQN requires careful tuning, which is inefficient. The authors propose continuous gradient-based updates to address this.

Method: Two methods are introduced: AGT2-DQN and SGT2-DQN, which use gradient descent for continuous target updates instead of hard updates.

Result: Theoretical convergence is proven, and empirical results show improved performance over standard DQN.

Conclusion: Gradient-based target updates are a viable alternative to hard updates in Q-learning, offering better stability and eliminating tuning needs.

Abstract: This paper introduces Q-learning with gradient target tracking, a novel reinforcement learning framework that provides a learned continuous target update mechanism as an alternative to the conventional hard update paradigm. In the standard deep Q-network (DQN), the target network is a copy of the online network’s weights, held fixed for a number of iterations before being periodically replaced via a hard update. While this stabilizes training by providing consistent targets, it introduces a new challenge: the hard update period must be carefully tuned to achieve optimal performance. To address this issue, we propose two gradient-based target update methods: DQN with asymmetric gradient target tracking (AGT2-DQN) and DQN with symmetric gradient target tracking (SGT2-DQN). These methods replace the conventional hard target updates with continuous and structured updates using gradient descent, which effectively eliminates the need for manual tuning. We provide a theoretical analysis proving the convergence of these methods in tabular settings. Additionally, empirical evaluations demonstrate their advantages over standard DQN baselines, which suggest that gradient-based target updates can serve as an effective alternative to conventional target update mechanisms in Q-learning.
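
The abstract implies the target network is updated by gradient descent on a tracking objective rather than copied. Taking 0.5 * ||theta_target - theta_online||^2 as the simplest such objective gives the sketch below; the paper's AGT2/SGT2 variants refine this basic scheme:

```python
import torch

def gradient_target_update(online, target, beta=0.01):
    """One gradient step on the tracking objective
    0.5 * ||theta_target - theta_online||^2: its gradient w.r.t. the target
    weights is (theta_target - theta_online), so the target drifts
    continuously toward the online network instead of being copied
    periodically (the hard update this replaces)."""
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t -= beta * (p_t - p_o)

online, target = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
gap = lambda: sum((pt - po).abs().sum() for pt, po in
                  zip(target.parameters(), online.parameters())).item()
print("gap before:", round(gap(), 4))
for _ in range(200):                    # continuous tracking, no hard copies
    gradient_target_update(online, target)
print("gap after:", round(gap(), 4))    # target has moved toward the online net
```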

[318] Generative Deep Learning Framework for Inverse Design of Fuels

Kiran K. Yalamanchi, Pinaki Pal, Balaji Mohan, Abdullah S. AlRamadan, Jihad A. Badra, Yuanjiang Pei

Main category: cs.LG

TL;DR: A generative deep learning framework combining Co-VAE and QSPR accelerates fuel design by optimizing molecular reconstruction and RON prediction.

DetailsMotivation: To overcome limitations of traditional fuel screening by capturing complex structure-property relationships.

Method: Uses Co-VAE with QSPR, hyperparameter tuning, regression model, and differential evolution for latent space navigation.

Result: Enables efficient identification of high-RON fuel candidates and systematic chemical space exploration.

Conclusion: The framework is adaptable for other properties and can be enhanced with synthesizability criteria for broader fuel design applications.

Abstract: In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model can be adapted to different target properties, enabling systematic exploration of large chemical spaces relevant to fuel design applications. Furthermore, the demonstrated framework can be readily extended by incorporating additional synthesizability criteria to improve applicability and reliability for de novo design of new fuels.
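
The latent-space search step lends itself to a short sketch. Below, `predict_ron` is a hypothetical stand-in for the trained RON regressor and the latent bounds are illustrative; a real pipeline would decode the optimizer's solution back into a molecule with the Co-VAE decoder.

```python
import numpy as np
from scipy.optimize import differential_evolution

LATENT_DIM = 32  # illustrative latent dimensionality

def predict_ron(z):
    # Placeholder surrogate, peaked at z = 0.5; replace with the trained model.
    return -np.sum((z - 0.5) ** 2)

# Maximize predicted RON by minimizing its negative over the latent box.
result = differential_evolution(
    lambda z: -predict_ron(z),
    bounds=[(-3.0, 3.0)] * LATENT_DIM,
    maxiter=50, seed=0,
)
best_z = result.x  # candidate latent point, to be decoded into a molecule
```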

[319] A Simple Baseline for Stable and Plastic Neural Networks

Étienne Künzel, Achref Jaziri, Visvanathan Ramesh

Main category: cs.LG

TL;DR: RDBP introduces ReLUDown and Decreasing Backpropagation to balance plasticity and stability in continual learning, outperforming state-of-the-art methods with lower computational cost.

DetailsMotivation: Existing continual learning approaches often favor either plasticity or stability, lacking a balanced solution.

Method: RDBP combines ReLUDown (activation modification) and Decreasing Backpropagation (gradient-scheduling) to prevent forgetting and reduce computational overhead.

Result: RDBP matches or exceeds state-of-the-art performance on the Continual ImageNet benchmark while being more efficient.

Conclusion: RDBP offers a practical, efficient solution for continual learning and sets a benchmark for future methods.

Abstract: Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.
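
The abstract does not spell out either mechanism, so the sketch below shows only one plausible reading of Decreasing Backpropagation: damping the gradients of earlier layers on a geometric schedule after the backward pass. ReLUDown and the paper's actual schedule are not reproduced.

```python
import torch

def shield_early_layers(layers, gamma=0.5):
    # Hypothetical gradient damping: layer 0 is closest to the input and
    # receives the strongest damping, progressively shielding early layers
    # from large (potentially catastrophic) updates.
    n = len(layers)
    for i, layer in enumerate(layers):
        factor = gamma ** (n - 1 - i)
        for p in layer.parameters():
            if p.grad is not None:
                p.grad.mul_(factor)  # call after loss.backward(), before optimizer.step()
```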

[320] Towards Foundation Models for Experimental Readout Systems Combining Discrete and Continuous Data

James Giroux, Cristiano Fanelli

Main category: cs.LG

TL;DR: A Foundation Model for Nuclear Physics is proposed, addressing challenges in detector inputs with innovations like separate vocabularies, continuous conditioning, scalable tokenization, and class-conditional generation.

DetailsMotivation: To overcome resolution loss and limited conditional generation in existing tokenization schemes for nuclear physics detector inputs.

Method: Uses next-token prediction with innovations: separate vocabularies, CMHCA, continuous kinematic conditioning, scalable tokenization, and Mixture of Experts for class-conditional generation.

Result: Enables high-fidelity generation of pixel/time sequences, validated in closure tests, and generalizes to tasks like particle identification and noise filtering.

Conclusion: The model effectively addresses key challenges and demonstrates versatility in nuclear physics applications.

Abstract: We present a (proto) Foundation Model for Nuclear Physics, capable of operating on low-level detector inputs from Imaging Cherenkov Detectors at the future Electron Ion Collider. Building upon established next-token prediction approaches, we aim to address potential challenges such as resolution loss from existing tokenization schemes and limited support for conditional generation. We propose four key innovations: (i) separate vocabularies for discrete and continuous variates, combined via Causal Multi-Head Cross-Attention (CMHCA), (ii) continuous kinematic conditioning through prepended context embeddings, (iii) scalable and simple, high-resolution continuous variate tokenization without joint vocabulary inflation, and (iv) class conditional generation through a Mixture of Experts. Our model enables fast, high-fidelity generation of pixel and time sequences for Cherenkov photons, validated through closure tests in the High Performance DIRC. We also show our model generalizes to reconstruction tasks such as pion/kaon identification, and noise filtering, in which we show its ability to leverage fine-tuning under specific objectives.

[321] Recalibrating binary probabilistic classifiers

Dirk Tasche

Main category: cs.LG

TL;DR: The paper proposes two new methods for recalibrating binary probabilistic classifiers to a target prior probability, focusing on AUC-linked distribution shifts. The methods, CSPD and QMM, are tested and show conservative results in credit risk evaluations.

DetailsMotivation: Recalibration of classifiers is crucial in fields like credit risk management, where accurate probability estimates are needed. The study aims to address this by analyzing distribution shifts linked to AUC.

Method: The paper introduces two methods: parametric covariate shift with posterior drift (CSPD) and ROC-based quasi moment matching (QMM). These are tested alongside other methods in an example setting.

Result: The QMM methods provide conservative results in evaluations involving concave functionals, such as risk weight functions for credit risk.

Conclusion: The proposed QMM methods are effective for recalibration in scenarios like credit risk, offering conservative and meaningful results.

Abstract: Recalibration of binary probabilistic classifiers to a target prior probability is an important task in areas like credit risk management. We analyse methods for recalibration from a distribution shift perspective. Distribution shift assumptions linked to the area under the curve (AUC) of a probabilistic classifier are found to be useful for the design of meaningful recalibration methods. Two new methods called parametric covariate shift with posterior drift (CSPD) and ROC-based quasi moment matching (QMM) are proposed and tested together with some other methods in an example setting. The outcomes of the test suggest that the QMM methods discussed in the paper can provide appropriately conservative results in evaluations with concave functionals, such as risk weight functions for credit risk.
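
For orientation, the simplest ingredient of such recalibration is a prior shift in log-odds space, sketched below. CSPD and QMM are more refined (they exploit the AUC-linked shift assumptions), but share the goal of matching a target prior.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def prior_shift_recalibrate(p, prior_train, prior_target):
    # Shift scores in log-odds space by the difference of the prior log-odds.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    z = logit(p) + logit(prior_target) - logit(prior_train)
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([0.02, 0.10, 0.30])
print(prior_shift_recalibrate(scores, prior_train=0.05, prior_target=0.10))
```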

[322] FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration

Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee

Main category: cs.LG

TL;DR: FireQ is a co-designed PTQ framework with an INT4-FP8 kernel, enhancing LLM inference throughput via optimized quantization and pipelining, achieving significant speedups with minimal accuracy loss.

DetailsMotivation: Memory bandwidth constraints limit LLM inference throughput, motivating the need for efficient post-training quantization (PTQ).

Method: FireQ quantizes weights/key-values to INT4 and activations/queries to FP8, introduces three-stage pipelining for prefill, and uses novel outlier smoothing techniques for linear and attention layers.

Result: FireQ achieves 1.68x faster inference in feed-forward layers (Llama2-7B) and 1.26x faster prefill (Llama3-8B) vs. QServe, with negligible accuracy loss.

Conclusion: FireQ effectively addresses LLM inference bottlenecks through co-designed quantization and optimization, outperforming state-of-the-art methods.

Abstract: As large language models become increasingly prevalent, memory bandwidth constraints significantly limit inference throughput, motivating post-training quantization (PTQ). In this paper, we propose FireQ, a co-designed PTQ framework and an INT4-FP8 matrix multiplication kernel that accelerates LLM inference across all linear layers. Specifically, FireQ quantizes linear layer weights and key-values to INT4, and activations and queries to FP8, significantly enhancing throughput. Additionally, we introduce a three-stage pipelining for the prefill phase, which modifies the FlashAttention-3 kernel, effectively reducing time-to-first-token in the prefill phase. To minimize accuracy loss from quantization, we develop novel outlier smoothing techniques tailored separately for linear and attention layers. In linear layers, we explicitly use per-tensor scaling to prevent underflow caused by the FP8 quantization scaling factor of INT4 quantization, and channel-wise scaling to compensate for coarse granularity of INT4. In attention layers, we address quantization challenges posed by rotary positional embeddings (RoPE) by combining pre-RoPE and post-RoPE scaling strategies. FireQ significantly outperforms state-of-the-art methods, achieving 1.68x faster inference in feed-forward network layers on Llama2-7B and 1.26x faster prefill phase performance on Llama3-8B compared to QServe, with negligible accuracy loss.
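
One ingredient, symmetric per-channel INT4 weight quantization, is easy to sketch in NumPy. The actual framework additionally quantizes activations and queries to FP8, applies the RoPE-aware scaling described above, and fuses everything into an INT4-FP8 kernel; none of that is reproduced here.

```python
import numpy as np

def quantize_int4_per_channel(w):
    # Symmetric INT4 covers [-8, 7]; scale each output channel independently.
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.random.randn(4, 16).astype(np.float32)
q, scale = quantize_int4_per_channel(w)
w_hat = q.astype(np.float32) * scale  # dequantized weights used at matmul time
```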

[323] DiffGradCAM: A Universal Class Activation Map Resistant to Adversarial Training

Jacob Piland, Chris Sweet, Adam Czajka

Main category: cs.LG

TL;DR: The paper introduces SHAMs to expose vulnerabilities in CAM methods and proposes DiffGradCAM, a robust alternative resistant to adversarial manipulation.

DetailsMotivation: CAM and GradCAM focus on individual logits, ignoring logit differences critical for softmax-based predictions, making them prone to adversarial attacks like passive fooling.

Method: SHAMs are introduced as an entropy-aware benchmark for CAM robustness. DiffGradCAM is proposed as a contrastive, lightweight solution immune to passive fooling.

Result: SHAMs expose CAM vulnerabilities, while DiffGradCAM matches standard CAM performance in non-adversarial cases and resists passive fooling.

Conclusion: SHAM and DiffGradCAM form a framework for robust saliency-based explanations, validated across tasks with varying class numbers.

Abstract: Class Activation Mapping (CAM) and its gradient-based variants (e.g., GradCAM) have become standard tools for explaining Convolutional Neural Network (CNN) predictions. However, these approaches typically focus on individual logits, while for neural networks using softmax, the class membership probability estimates depend only on the differences between logits, not on their absolute values. This disconnect leaves standard CAMs vulnerable to adversarial manipulation, such as passive fooling, where a model is trained to produce misleading CAMs without affecting decision performance. We introduce Salience-Hoax Activation Maps (SHAMs), an entropy-aware form of passive fooling that serves as a benchmark for CAM robustness under adversarial conditions. To address the passive fooling vulnerability, we then propose DiffGradCAM, a novel, lightweight, and contrastive approach to class activation mapping that is not susceptible to passive fooling, yet matches the output of standard CAM methods such as GradCAM in the non-adversarial case. Together, SHAM and DiffGradCAM establish a new framework for probing and improving the robustness of saliency-based explanations. We validate both contributions across multi-class tasks with few and many classes.
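
A hedged sketch of the underlying idea follows: back-propagate the margin between the target logit and a logsumexp over the remaining logits, so the attribution reflects the logit differences that softmax actually uses. This is an input-gradient illustration only; DiffGradCAM's precise activation-map construction may differ.

```python
import torch

def contrastive_saliency(model, x, target_class):
    x = x.clone().requires_grad_(True)
    logits = model(x)  # shape: (batch, num_classes)
    others = torch.cat(
        [logits[:, :target_class], logits[:, target_class + 1:]], dim=1)
    # Margin between the target logit and the rest; constant logit offsets,
    # which can fool single-logit CAMs, cancel out of this quantity.
    margin = logits[:, target_class] - torch.logsumexp(others, dim=1)
    margin.sum().backward()
    return x.grad.abs()
```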

Emma Kondrup

Main category: cs.LG

TL;DR: The paper introduces Base3, a lightweight model combining EdgeBank, PopTrack, and t-CoMem for dynamic link prediction, achieving competitive performance without training.

DetailsMotivation: Addressing the challenge of dynamic link prediction with models that are effective, practical, and interpretable, avoiding complex neural architectures.

Method: Proposes t-CoMem for tracking temporal co-occurrence and neighborhood activity, and Base3, an interpolation-based model fusing EdgeBank, PopTrack, and t-CoMem.

Result: Base3 performs competitively with state-of-the-art deep models on the Temporal Graph Benchmark, excelling in realistic negative sampling scenarios.

Conclusion: Base3 offers a simple, robust alternative for temporal graph learning, bridging local and global dynamics without training.

Abstract: Dynamic link prediction remains a central challenge in temporal graph learning, particularly in designing models that are both effective and practical for real-world deployment. Existing approaches often rely on complex neural architectures, which are computationally intensive and difficult to interpret. In this work, we build on the strong recurrence-based foundation of the EdgeBank baseline, by supplementing it with inductive capabilities. We do so by leveraging the predictive power of non-learnable signals from two complementary perspectives: historical edge recurrence, as captured by EdgeBank, and global node popularity, as introduced in the PopTrack model. We propose t-CoMem, a lightweight memory module that tracks temporal co-occurrence patterns and neighborhood activity. Building on this, we introduce Base3, an interpolation-based model that fuses EdgeBank, PopTrack, and t-CoMem into a unified scoring framework. This combination effectively bridges local and global temporal dynamics – repetition, popularity, and context – without relying on training. Evaluated on the Temporal Graph Benchmark, Base3 achieves performance competitive with state-of-the-art deep models, even outperforming them on some datasets. Importantly, it considerably improves on existing baselines’ performance under more realistic and challenging negative sampling strategies – offering a simple yet robust alternative for temporal graph learning.
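
Since Base3 is training-free, its scoring rule reduces to a convex combination of component scores. The sketch below uses deliberately naive stand-ins for the three components and illustrative weights, not the paper's definitions.

```python
def base3_score(src, dst, seen_edges, popularity, co_mem, w=(0.4, 0.3, 0.3)):
    s_recur = 1.0 if (src, dst) in seen_edges else 0.0  # EdgeBank-style recurrence
    s_pop = popularity.get(dst, 0.0)                    # PopTrack-style popularity
    s_ctx = co_mem.get((src, dst), 0.0)                 # t-CoMem-style co-occurrence
    return w[0] * s_recur + w[1] * s_pop + w[2] * s_ctx

score = base3_score(3, 7, seen_edges={(3, 7)}, popularity={7: 0.8}, co_mem={})
```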

[325] Honesty in Causal Forests: When It Helps and When It Hurts

Yanfang Hou, Carlos Fernández-Loría

Main category: cs.LG

TL;DR: Honest estimation in causal forests, intended to reduce overfitting, may reduce accuracy in individual-level treatment effect estimates, especially with rich data and varied responses. It involves a bias-variance trade-off, and default use can require up to 75% more data for comparable performance. Honesty should be treated as regularization, guided by out-of-sample performance.

DetailsMotivation: To evaluate whether honest estimation in causal forests, a default practice to avoid overfitting, is always optimal, especially when data is rich and treatment effects vary significantly.

Method: Analyzed 7,500 benchmark datasets to compare the performance of causal forests with and without honest estimation, focusing on individual-level treatment effect accuracy.

Result: Honest estimation can reduce accuracy, requiring up to 75% more data for comparable performance, due to its bias-variance trade-off.

Conclusion: Honesty should be used as a form of regularization, guided by out-of-sample performance, rather than adopted by default.

Abstract: Causal forests estimate how treatment effects vary across individuals, guiding personalized interventions in areas like marketing, operations, and public policy. A standard modeling practice with this method is honest estimation: dividing the data so that the subgroups used to model treatment effect variation are formed separately from the data used to estimate those effects. This is intended to reduce overfitting and is the default in many software packages. But is it always the right choice? In this paper, we show that honest estimation can reduce the accuracy of individual-level treatment effect estimates, especially when there are substantial differences in how individuals respond to treatment, and the data is rich enough to uncover those differences. The core issue is a classic bias-variance trade-off: honesty lowers the risk of overfitting but increases the risk of underfitting, because it limits the data available to detect patterns. Across 7,500 benchmark datasets, we find that the cost of using honesty by default can be as high as requiring 75% more data to match the performance of models trained without it. We argue that honesty is best understood as a form of regularization, and like any regularization choice, its use should be guided by out-of-sample performance, not adopted reflexively.
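
The mechanics of honesty are easy to demonstrate on a toy problem: one half of the data chooses the partition, the other half estimates leaf-level effects, so each estimate sees only half the sample. Causal forests split on effect heterogeneity rather than the raw outcome; this sketch only shows where the variance cost discussed above comes from.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
T = rng.integers(0, 2, size=2000)
y = X[:, 0] * T + rng.normal(scale=0.5, size=2000)  # effect grows with X[:, 0]

# Honest split: structure from one half, effect estimates from the other.
idx_s, idx_e = train_test_split(np.arange(2000), test_size=0.5, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
tree.fit(X[idx_s], y[idx_s])
leaves = tree.apply(X[idx_e])
for leaf in np.unique(leaves):
    m = leaves == leaf
    tau = (y[idx_e][m & (T[idx_e] == 1)].mean()
           - y[idx_e][m & (T[idx_e] == 0)].mean())  # honest leaf-level effect
    print(leaf, round(tau, 2))
```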

[326] Merge Kernel for Bayesian Optimization on Permutation Space

Zikai Xie, Linjiang Chen

Main category: cs.LG

TL;DR: A novel framework for generating kernel functions on permutation space using sorting algorithms is proposed, introducing the Merge Kernel with linearithmic complexity, outperforming the Mallows kernel.

DetailsMotivation: To address the quadratic complexity of the Mallows kernel in Bayesian Optimization for permutation spaces by leveraging sorting algorithms for more efficient and compact representations.

Method: Proposes a framework for kernel generation based on sorting algorithms, introducing the Merge Kernel (from merge sort) with Θ(n log n) complexity. Incorporates lightweight descriptors for robustness: shift histogram, split-pair line, and sliding-window motifs.

Result: The Merge Kernel consistently outperforms the Mallows kernel in permutation optimization benchmarks, offering a more compact and effective solution.

Conclusion: The Merge Kernel provides a superior alternative to the Mallows kernel for Bayesian Optimization in permutation spaces, combining efficiency, compactness, and effectiveness.

Abstract: The Bayesian Optimization (BO) algorithm is a standard tool for black-box optimization problems. The current state-of-the-art BO approach for permutation spaces relies on the Mallows kernel, an $\Omega(n^2)$ representation that explicitly enumerates every pairwise comparison. Inspired by the close relationship between the Mallows kernel and pairwise comparison, we propose a novel framework for generating kernel functions on permutation space based on sorting algorithms. Within this framework, the Mallows kernel can be viewed as a special instance derived from bubble sort. Further, we introduce the Merge Kernel constructed from merge sort, which replaces the quadratic complexity with $\Theta(n\log n)$ to achieve the lowest possible complexity. The resulting feature vector is significantly shorter, can be computed in linearithmic time, yet still efficiently captures meaningful permutation distances. To boost robustness and right-invariance without sacrificing compactness, we further incorporate three lightweight, task-agnostic descriptors: (1) a shift histogram, which aggregates absolute element displacements and supplies a global misplacement signal; (2) a split-pair line, which encodes selected long-range comparisons by aligning elements across the two halves of the whole permutation; and (3) sliding-window motifs, which summarize local order patterns that influence near-neighbor objectives. Our empirical evaluation demonstrates that the proposed kernel consistently outperforms the state-of-the-art Mallows kernel across various permutation optimization benchmarks. Results confirm that the Merge Kernel provides a more compact yet more effective solution for Bayesian optimization in permutation space.
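
The sorting connection is concrete: the Mallows kernel is $\exp(-\lambda d_K)$, where $d_K$ is the Kendall tau distance (the number of discordant pairs), and merge sort counts those pairs in $O(n \log n)$ by counting inversions. The sketch below shows only this counting; the Merge Kernel's actual feature vector is richer.

```python
import math

def count_inversions(a):
    # Merge sort that also counts inversions (discordant pairs).
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv, i, j = [], inv_l + inv_r, 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i  # right[j] is discordant with all remaining left items
    merged += left[i:] + right[j:]
    return merged, inv

def mallows_kernel(pi, sigma, lam=0.1):
    rank = {v: i for i, v in enumerate(pi)}  # positions of items in pi
    _, d_kendall = count_inversions([rank[v] for v in sigma])
    return math.exp(-lam * d_kendall)

print(mallows_kernel([0, 1, 2, 3], [3, 1, 0, 2]))
```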

[327] KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction

Han Liu, Keyan Ding, Peilin Chen, Yinwei Wei, Liqiang Nie, Dapeng Wu, Shiqi Wang

Main category: cs.LG

TL;DR: KEPLA is a deep learning framework integrating Gene Ontology and ligand properties to improve protein-ligand binding affinity prediction, outperforming existing methods.

DetailsMotivation: Existing deep learning approaches overlook biochemical knowledge, limiting prediction accuracy for protein-ligand binding affinity.

Method: KEPLA integrates prior knowledge from Gene Ontology and ligand properties, optimizing global and local representations for prediction.

Result: KEPLA outperforms state-of-the-art baselines on benchmark datasets in both in-domain and cross-domain scenarios.

Conclusion: KEPLA enhances prediction performance and provides interpretable insights into binding mechanisms.

Abstract: Accurate prediction of protein-ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking their valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain-specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine-grained joint embeddings for prediction. Experiments on two benchmark datasets across both in-domain and cross-domain scenarios demonstrate that KEPLA consistently outperforms state-of-the-art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.

[328] Convolution-weighting method for the physics-informed neural network: A Primal-Dual Optimization Perspective

Chenhao Si, Ming Yan

Main category: cs.LG

TL;DR: A new adaptive weighting scheme for PINNs improves accuracy by adjusting loss function weights from isolated points to continuous regions, reducing relative L2 errors.

DetailsMotivation: PINNs face challenges in convergence and accuracy due to optimization with finite points.

Method: Proposed an adaptive weighting scheme for loss functions, transitioning from isolated points to continuous neighborhoods.

Result: Empirical results show reduced relative L2 errors.

Conclusion: The adaptive weighting scheme enhances PINN performance in solving PDEs.

Abstract: Physics-informed neural networks (PINNs) are extensively employed to solve partial differential equations (PDEs) by ensuring that the outputs and gradients of deep learning models adhere to the governing equations. However, constrained by computational limitations, PINNs are typically optimized using a finite set of points, which poses significant challenges in guaranteeing their convergence and accuracy. In this study, we propose a new weighting scheme that adaptively extends the weights of the loss functions from isolated points to their continuous neighborhood regions. The empirical results show that our weighting scheme can reduce the relative $L^2$ errors to a lower value.
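
On a 1-D collocation grid the idea can be sketched as convolving pointwise residual magnitudes with a small kernel and normalizing the result into loss weights; the kernel shape and width below are illustrative choices, not the paper's.

```python
import numpy as np

def convolution_weights(residuals, kernel_size=5):
    kernel = np.ones(kernel_size) / kernel_size             # box filter
    smoothed = np.convolve(np.abs(residuals), kernel, mode="same")
    return smoothed / smoothed.sum()                        # normalized weights

res = np.random.randn(100)           # PDE residuals at 100 collocation points
w = convolution_weights(res)
weighted_loss = np.sum(w * res**2)   # emphasizes whole high-residual regions
```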

[329] Mitigating Goal Misgeneralization via Minimax Regret

Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis

Main category: cs.LG

TL;DR: The paper formalizes goal misgeneralization in reinforcement learning, showing it occurs under MEV objectives but not MMER. Empirical results in grid-worlds confirm MEV-based methods like domain randomization suffer from misgeneralization, while regret-based UED methods are more robust.

DetailsMotivation: To address the risk of policies pursuing proxy goals instead of intended goals in novel environments, formalizing and studying goal misgeneralization.

Method: Theoretical analysis of goal misgeneralization under MEV and MMER objectives, followed by empirical testing in procedurally-generated grid-worlds using domain randomization (MEV) and UED (MMER).

Result: Goal misgeneralization occurs under MEV but not MMER. UED methods show robustness to misgeneralization, though they don’t always find MMER policies.

Conclusion: Minimax expected regret (MMER) is a promising approach to mitigate goal misgeneralization in reinforcement learning.

Abstract: Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer’s intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment – a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don’t find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.
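
A toy numeric example makes the distinction between the two objectives concrete (values invented): policy A chases a proxy that pays off in the first two training environments but fails in the third, while policy B is uniformly capable.

```python
import numpy as np

V = np.array([[10.0, 10.0, 1.0],    # policy A: proxy goal, fails in env 2
              [ 7.0,  7.0, 6.0]])   # policy B: robust across environments
V_star = V.max(axis=0)              # per-environment optimal value: [10, 10, 6]

mev_pick = V.mean(axis=1).argmax()        # MEV: means [7.0, 6.67] -> policy A
regret = V_star - V                       # A: [0, 0, 5];  B: [3, 3, 0]
mmer_pick = regret.max(axis=1).argmin()   # MMER: worst regrets [5, 3] -> policy B
```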

[330] FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning

Huan Wang, Haoran Li, Huaming Chen, Jun Yan, Jiahua Shi, Jun Shen

Main category: cs.LG

TL;DR: FedDifRC introduces diffusion models into federated learning to address data heterogeneity by leveraging text-driven contrast and noise-driven regularization.

DetailsMotivation: Data heterogeneity in federated learning harms model convergence and performance; diffusion models offer a solution.

Method: Proposes FedDifRC, using diffusion representations for text-driven contrastive learning and noise-driven consistency regularization.

Result: Validated effectiveness in experiments, with theoretical convergence guarantees.

Conclusion: FedDifRC successfully mitigates data heterogeneity and improves federated learning performance.

Abstract: Federated learning aims at training models collaboratively across participants while protecting privacy. However, one major challenge for this paradigm is the data heterogeneity issue, where biased data preferences across multiple clients harm the model’s convergence and performance. In this paper, we first introduce powerful diffusion models into the federated learning paradigm and show that diffusion representations are effective steers during federated training. To explore the possibility of using diffusion representations in handling data heterogeneity, we propose a novel diffusion-inspired Federated paradigm with Diffusion Representation Collaboration, termed FedDifRC, leveraging meaningful guidance of diffusion models to mitigate data heterogeneity. The key idea is to construct text-driven diffusion contrasting and noise-driven diffusion regularization, aiming to provide abundant class-related semantic information and consistent convergence signals. On the one hand, we exploit the conditional feedback from the diffusion model for different text prompts to build a text-driven contrastive learning strategy. On the other hand, we introduce a noise-driven consistency regularization to align local instances with diffusion denoising representations, constraining the optimization region in the feature space. In addition, FedDifRC can be extended to a self-supervised scheme without relying on any labeled data. We also provide a theoretical analysis for FedDifRC to ensure convergence under non-convex objectives. The experiments on different scenarios validate the effectiveness of FedDifRC and the efficiency of crucial components.

[331] Generalization in Reinforcement Learning for Radio Access Networks

Burak Demirel, Yu Wang, Cristian Tatino, Pablo Soldati

Main category: cs.LG

TL;DR: A generalization-centered RL framework for RAN control improves throughput and spectral efficiency in diverse 5G scenarios, outperforming traditional methods.

DetailsMotivation: Traditional rule-based RRM algorithms underperform in dynamic RAN environments, and existing RL solutions struggle with generalization across diverse deployments.

Method: The framework includes robust state reconstruction, domain randomization, and distributed data generation with centralized training, aligned with O-RAN principles.

Result: Achieves ~10% higher throughput and >20% gains under high mobility, with significant improvements in eMBB and mixed-traffic benchmarks.

Conclusion: The scalable and generalizable RL framework paves the way for AI-native 6G RAN.

Abstract: Modern radio access networks (RANs) operate in highly dynamic and heterogeneous environments, where hand-tuned, rule-based RRM algorithms often underperform. While RL can surpass such heuristics in constrained settings, the diversity of deployments and unpredictable radio conditions introduce major generalization challenges. Data-driven policies frequently overfit to training conditions, degrading performance in unseen scenarios. To address this, we propose a generalization-centered RL framework for RAN control that: (i) robustly reconstructs dynamically varying states from partial and noisy observations, while encoding static and semi-static information, such as radio nodes, cell attributes, and their topology, through graph representations; (ii) applies domain randomization to broaden the training distribution; and (iii) distributes data generation across multiple actors while centralizing training in a cloud-compatible architecture aligned with O-RAN principles. Although generalization increases computational and data-management complexity, our distributed design mitigates this by scaling data collection and training across diverse network conditions. Applied to downlink link adaptation in five 5G benchmarks, our policy improves average throughput and spectral efficiency by ~10% over an OLLA baseline (10% BLER target) in full-buffer MIMO/mMIMO and by >20% under high mobility. It matches specialized RL in full-buffer traffic and achieves up to 4- and 2-fold gains in eMBB and mixed-traffic benchmarks, respectively. In nine-cell deployments, GAT models offer 30% higher throughput over MLP baselines. These results, combined with our scalable architecture, offer a path toward AI-native 6G RAN using a single, generalizable RL agent.

[332] Explainable AI in Genomics: Transcription Factor Binding Site Prediction with Mixture of Experts

Aakash Tripathi, Ian E. Nielsen, Muhammad Umer, Ravi P. Ramachandran, Ghulam Rasool

Main category: cs.LG

TL;DR: A novel Mixture of Experts (MoE) approach for TFBS prediction outperforms individual models, especially in out-of-distribution scenarios, and introduces ShiftSmooth for better interpretability.

DetailsMotivation: Understanding gene regulation and biological processes requires accurate TFBS prediction, necessitating improved models and interpretability.

Method: The study integrates multiple pre-trained CNN models into an MoE framework and evaluates performance on in-distribution and OOD datasets. It also introduces ShiftSmooth for robust attribution mapping.

Result: The MoE model achieves competitive or superior performance, excelling in OOD scenarios, with ANOVA confirming significance. ShiftSmooth outperforms traditional methods in interpretability.

Conclusion: The work provides an efficient, generalizable, and interpretable solution for TFBS prediction, advancing genome biology and transcriptional regulation understanding.

Abstract: Transcription Factor Binding Site (TFBS) prediction is crucial for understanding gene regulation and various biological processes. This study introduces a novel Mixture of Experts (MoE) approach for TFBS prediction, integrating multiple pre-trained Convolutional Neural Network (CNN) models, each specializing in different TFBS patterns. We evaluate the performance of our MoE model against individual expert models on both in-distribution and out-of-distribution (OOD) datasets, using six randomly selected transcription factors (TFs) for OOD testing. Our results demonstrate that the MoE model achieves competitive or superior performance across diverse TF binding sites, particularly excelling in OOD scenarios. The Analysis of Variance (ANOVA) statistical test confirms the significance of these performance differences. Additionally, we introduce ShiftSmooth, a novel attribution mapping technique that provides more robust model interpretability by considering small shifts in input sequences. Through comprehensive explainability analysis, we show that ShiftSmooth offers superior attribution for motif discovery and localization compared to traditional Vanilla Gradient methods. Our work presents an efficient, generalizable, and interpretable solution for TFBS prediction, potentially enabling new discoveries in genome biology and advancing our understanding of transcriptional regulation.
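
A hedged sketch of the shift-averaging idea behind ShiftSmooth: compute input gradients for a few circularly shifted copies of the sequence, undo each shift, and average. The paper's exact shift set and aggregation may differ from this illustration.

```python
import torch

def shift_smooth(model, x, shifts=(-2, -1, 0, 1, 2)):
    grads = []
    for s in shifts:
        xs = torch.roll(x, shifts=s, dims=-1).detach().requires_grad_(True)
        model(xs).sum().backward()  # assumes a scalar binding score per sequence
        grads.append(torch.roll(xs.grad, shifts=-s, dims=-1))  # undo the shift
    return torch.stack(grads).mean(dim=0)
```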

[333] Rethinking Inductive Bias in Geographically Neural Network Weighted Regression

Zhenyuan Chen

Main category: cs.LG

TL;DR: The paper revisits inductive biases in GNNWR, proposing enhancements using neural network concepts to improve spatial regression. Benchmarks show GNNWR outperforms traditional methods, with performance varying by data characteristics.

DetailsMotivation: Current GNNWR implementations have limitations in modeling spatial non-stationarity due to fixed distance-based schemes and weak inductive biases.

Method: The authors generalize GNNWR by integrating CNN, RNN, and transformer concepts, introducing local receptive fields, sequential context, and self-attention.

Result: GNNWR outperforms classic methods in capturing complex spatial relationships, with performance dependent on data heterogeneity and sample size.

Conclusion: Inductive bias is crucial for spatial modeling; future work should focus on learnable weighting functions, hybrid architectures, and improved interpretability.

Abstract: Inductive bias is a key factor in spatial regression models, determining how well a model can learn from limited data and capture spatial patterns. This work revisits the inductive biases in Geographically Neural Network Weighted Regression (GNNWR) and identifies limitations in current approaches for modeling spatial non-stationarity. While GNNWR extends traditional Geographically Weighted Regression by using neural networks to learn spatial weighting functions, existing implementations are often restricted by fixed distance-based schemes and limited inductive bias. We propose to generalize GNNWR by incorporating concepts from convolutional neural networks, recurrent neural networks, and transformers, introducing local receptive fields, sequential context, and self-attention into spatial regression. Through extensive benchmarking on synthetic spatial datasets with varying heterogeneity, noise, and sample sizes, we show that GNNWR outperforms classic methods in capturing nonlinear and complex spatial relationships. Our results also reveal that model performance depends strongly on data characteristics, with local models excelling in highly heterogeneous or small-sample scenarios, and global models performing better with larger, more homogeneous data. These findings highlight the importance of inductive bias in spatial modeling and suggest future directions, including learnable spatial weighting functions, hybrid neural architectures, and improved interpretability for models handling non-stationary spatial data.

[334] ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs

Daniel Commey, Benjamin Appiah, Griffith S. Klogo, Garth V. Crosby

Main category: cs.LG

TL;DR: A novel protocol using Zero-Knowledge Proofs (ZKPs) ensures privacy-preserving and verifiable evaluation in Federated Learning (FL), avoiding raw data exposure.

DetailsMotivation: Current FL evaluation phases may leak sensitive data through shared metrics, necessitating a privacy-preserving solution.

Method: Clients generate ZKP proofs asserting local loss is below a threshold, implemented via self-contained modules for FL simulation, ZKP circuit design, and evaluation on MNIST and HAR datasets.

Result: The approach is evaluated for computational overhead, communication cost, and verifiability, focusing on CNN (MNIST) and MLP (HAR) models.

Conclusion: The protocol successfully enables private and verifiable FL evaluation without external APIs, addressing privacy concerns in FL.

Abstract: Federated Learning (FL) enables collaborative model training on decentralized data without exposing raw data. However, the evaluation phase in FL may leak sensitive information through shared performance metrics. In this paper, we propose a novel protocol that incorporates Zero-Knowledge Proofs (ZKPs) to enable privacy-preserving and verifiable evaluation for FL. Instead of revealing raw loss values, clients generate a succinct proof asserting that their local loss is below a predefined threshold. Our approach is implemented without reliance on external APIs, using self-contained modules for federated learning simulation, ZKP circuit design, and experimental evaluation on both the MNIST and Human Activity Recognition (HAR) datasets. We focus on a threshold-based proof for a simple Convolutional Neural Network (CNN) model (for MNIST) and a multi-layer perceptron (MLP) model (for HAR), and evaluate the approach in terms of computational overhead, communication cost, and verifiability.

[335] Accelerating RF Power Amplifier Design via Intelligent Sampling and ML-Based Parameter Tuning

Abhishek Sriram, Neal Tuffy

Main category: cs.LG

TL;DR: A machine learning framework reduces RF power amplifier simulation needs by 65% while maintaining accuracy, using MaxMin Latin Hypercube Sampling and CatBoost.

DetailsMotivation: To minimize simulation time and resources in RF power amplifier design without sacrificing accuracy.

Method: Combines MaxMin Latin Hypercube Sampling with CatBoost to strategically select 35% of critical simulation points, then predicts performance across the design space.

Result: Achieves 65% simulation reduction with ±0.4 dBm accuracy, 0.901 average R², and 58.24%-77.78% time savings.

Conclusion: The framework enables rapid, accurate RF power amplifier design with significant efficiency gains.

Abstract: This paper presents a machine learning-accelerated optimization framework for RF power amplifier design that reduces simulation requirements by 65% while maintaining $\pm0.4$ dBm accuracy for the majority of the modes. The proposed method combines MaxMin Latin Hypercube Sampling with CatBoost gradient boosting to intelligently explore multidimensional parameter spaces. Instead of exhaustively simulating all parameter combinations to achieve target P2dB compression specifications, our approach strategically selects approximately 35% of critical simulation points. The framework processes ADS netlists, executes harmonic balance simulations on the reduced dataset, and trains a CatBoost model to predict P2dB performance across the entire design space. Validation across 15 PA operating modes yields an average $R^2$ of 0.901, with the system ranking parameter combinations by their likelihood of meeting target specifications. The integrated solution delivers 58.24% to 77.78% reduction in simulation time through automated GUI-based workflows, enabling rapid design iterations without compromising accuracy standards required for production RF circuits.
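
The sampling half of the pipeline can be sketched directly: draw several Latin Hypercube designs and keep the one with the largest minimum pairwise distance (the MaxMin criterion). The selected points would then be simulated and used to fit the CatBoost regressor over the full design space; sizes below are illustrative.

```python
import numpy as np
from scipy.stats import qmc

def maxmin_lhs(n_points, dim, n_candidates=20, seed=0):
    best, best_score = None, -np.inf
    for k in range(n_candidates):
        s = qmc.LatinHypercube(d=dim, seed=seed + k).random(n_points)
        d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
        score = d[np.triu_indices(n_points, k=1)].min()  # minimum pairwise distance
        if score > best_score:
            best, best_score = s, score
    return best

design = maxmin_lhs(n_points=35, dim=4)  # ~35% of a notional 100-point sweep
```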

[336] FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale

Boris Bonev, Thorsten Kurth, Ankur Mahesh, Mauro Bisson, Jean Kossaifi, Karthik Kashinath, Anima Anandkumar, William D. Collins, Michael S. Pritchard, Alexander Keller

Main category: cs.LG

TL;DR: FourCastNet 3 introduces a scalable ML approach for probabilistic weather forecasting, outperforming conventional models in accuracy and speed while maintaining realistic dynamics and spectra.

DetailsMotivation: To improve global weather modeling by addressing the need for scalable, geometrically accurate, and probabilistically calibrated ML methods.

Method: Uses a convolutional neural network tailored for spherical geometry, with a novel training paradigm for large-scale parallel training.

Result: Achieves faster forecasts (8-60x speedup), better accuracy, and retains realistic spectra even at 60-day lead times.

Conclusion: FourCastNet 3 is a promising tool for meteorological forecasting due to its efficiency, accuracy, and stability.

Abstract: FourCastNet 3 advances global weather modeling by implementing a scalable, geometric machine learning (ML) approach to probabilistic ensemble forecasting. The approach is designed to respect spherical geometry and to accurately model the spatially correlated probabilistic nature of the problem, resulting in stable spectra and realistic dynamics across multiple scales. FourCastNet 3 delivers forecasting accuracy that surpasses leading conventional ensemble models and rivals the best diffusion-based methods, while producing forecasts 8 to 60 times faster than these approaches. In contrast to other ML approaches, FourCastNet 3 demonstrates excellent probabilistic calibration and retains realistic spectra, even at extended lead times of up to 60 days. All of these advances are realized using a purely convolutional neural network architecture tailored for spherical geometry. Scalable and efficient large-scale training on 1024 GPUs and more is enabled by a novel training paradigm for combined model- and data-parallelism, inspired by domain decomposition methods in classical numerical models. Additionally, FourCastNet 3 enables rapid inference on a single GPU, producing a 60-day global forecast at 0.25°, 6-hourly resolution in under 4 minutes. Its computational efficiency, medium-range probabilistic skill, spectral fidelity, and rollout stability at subseasonal timescales make it a strong candidate for improving meteorological forecasting and early warning systems through large ensemble predictions.

[337] Learning to Reject Low-Quality Explanations via User Feedback

Luca Stradiotti, Dario Pesenti, Stefano Teso, Jesse Davis

Main category: cs.LG

TL;DR: The paper proposes a framework (LtX) for classifiers to reject inputs with low-quality explanations, introducing ULER, a rejector that mirrors human judgments of explanation quality.

DetailsMotivation: High-stakes applications like credit scoring require trustworthy ML predictions, but poor explanations can hinder user trust and decision-making.

Method: Introduces ULER, a rejector trained on human ratings and per-feature relevance judgments to assess explanation quality.

Result: ULER outperforms state-of-the-art and explanation-aware rejection strategies on eight benchmarks and a human-annotated dataset.

Conclusion: The framework enables classifiers to reject low-quality explanations, improving trust and decision-making in high-stakes applications.

Abstract: Machine Learning predictors are increasingly being employed in high-stakes applications such as credit scoring. Explanations help users unpack the reasons behind their predictions, but are not always "high quality". That is, end-users may have difficulty interpreting or believing them, which can complicate trust assessment and downstream decision-making. We argue that classifiers should have the option to refuse handling inputs whose predictions cannot be explained properly and introduce a framework for learning to reject low-quality explanations (LtX) in which predictors are equipped with a rejector that evaluates the quality of explanations. In this problem setting, the key challenges are how to properly define and assess explanation quality and how to design a suitable rejector. Focusing on popular attribution techniques, we introduce ULER (User-centric Low-quality Explanation Rejector), which learns a simple rejector from human ratings and per-feature relevance judgments to mirror human judgments of explanation quality. Our experiments show that ULER outperforms both state-of-the-art and explanation-aware learning to reject strategies at LtX on eight classification and regression benchmarks and on a new human-annotated dataset, which we will publicly release to support future research.
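
The rejection loop itself is compact; the sketch below fits a quality model on human ratings of explanations and abstains below a threshold. The features and models here are hypothetical stand-ins, and ULER additionally exploits per-feature relevance judgments.

```python
from sklearn.ensemble import GradientBoostingRegressor

def fit_rejector(expl_features, human_ratings):
    # Learn to mirror human judgments of explanation quality.
    return GradientBoostingRegressor().fit(expl_features, human_ratings)

def predict_or_reject(model, rejector, x, expl_feats, tau=0.5):
    quality = rejector.predict(expl_feats.reshape(1, -1))[0]
    if quality < tau:
        return None  # abstain: the explanation is predicted to be low quality
    return model.predict(x.reshape(1, -1))[0]
```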

[338] Improving DAPO from a Mixed-Policy Perspective

Hongze Tan

Main category: cs.LG

TL;DR: Two modifications to DAPO improve stability and sample efficiency: using a pre-trained guiding policy for regularization and reusing zero-reward samples.

DetailsMotivation: Standard policy gradient methods are unstable and sample-inefficient, especially in sparse reward settings.

Method: 1. Incorporate a pre-trained guiding policy for off-policy experience. 2. Reuse zero-reward samples guided by the expert policy.

Result: Improved training stability, convergence speed, and sample efficiency with theoretical guarantees.

Conclusion: The mixed-policy framework balances exploration and exploitation, enabling more stable and efficient policy optimization.

Abstract: This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\pi_\phi$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pi_{\text{on}}$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO’s. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.
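
A hedged sketch of the mixed-policy ingredient: experience generated by the fixed guiding policy is importance-weighted, with clipping, when updating the target policy. DAPO's dynamic sampling and the adaptive step-size scheduling described above are omitted.

```python
import torch

def mixed_policy_loss(logp_on, logp_phi, advantages, clip=0.2):
    # Ratio pi_on(a|s) / pi_phi(a|s) corrects for acting under the guide policy.
    ratio = torch.exp(logp_on - logp_phi)
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```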

[339] Insights into a radiology-specialised multimodal large language model with sparse autoencoders

Kenza Bouzid, Shruthi Bannur, Felix Meissen, Daniel Coelho de Castro, Anton Schwaighofer, Javier Alvarez-Valle, Stephanie L. Hyland

Main category: cs.LG

TL;DR: The study uses Matryoshka-SAE to interpret MAIRA-2, a radiology-specialized multimodal LLM, identifying clinically relevant features and demonstrating steering capabilities, though with mixed success.

DetailsMotivation: Improving AI interpretability in healthcare to enhance safety, transparency, and trust, given the high stakes of medical decisions.

Method: Applied Matryoshka-SAE to MAIRA-2 for automated interpretability, identifying clinical concepts and testing feature influence via steering.

Result: Identified medical devices, pathologies, and textual features; steering showed directional control but faced challenges.

Conclusion: The study advances mechanistic interpretability for MAIRA-2, highlighting practical challenges and paving the way for improved model transparency.

Abstract: Interpretability can improve the safety, transparency and trust of AI models, which is especially important in healthcare applications where decisions often carry significant consequences. Mechanistic interpretability, particularly through the use of sparse autoencoders (SAEs), offers a promising approach for uncovering human-interpretable features within large transformer-based models. In this study, we apply Matryoshka-SAE to the radiology-specialised multimodal large language model, MAIRA-2, to interpret its internal representations. Using large-scale automated interpretability of the SAE features, we identify a range of clinically relevant concepts - including medical devices (e.g., line and tube placements, pacemaker presence), pathologies such as pleural effusion and cardiomegaly, longitudinal changes and textual features. We further examine the influence of these features on model behaviour through steering, demonstrating directional control over generations with mixed success. Our results reveal practical and methodological challenges, yet they offer initial insights into the internal concepts learned by MAIRA-2 - marking a step toward deeper mechanistic understanding and interpretability of a radiology-adapted multimodal large language model, and paving the way for improved model transparency. We release the trained SAEs and interpretations: https://huggingface.co/microsoft/maira-2-sae.
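
For readers unfamiliar with SAEs, a minimal L1 variant is sketched below; the Matryoshka construction used in the paper nests dictionaries of multiple sizes, which this sketch omits.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)  # overcomplete: d_dict >> d_model
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))  # sparse codes act as candidate features
        return self.dec(z), z

def sae_loss(h, h_hat, z, l1=1e-3):
    # Reconstruction plus sparsity: few features should fire per activation.
    return ((h - h_hat) ** 2).mean() + l1 * z.abs().mean()
```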

[340] MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling

Etienne Le Naour, Tahar Nabil, Ghislain Agoua

Main category: cs.LG

TL;DR: The paper introduces MoTM, a method for out-of-domain time series imputation using implicit neural representations (INRs) and a ridge regressor, addressing distribution shifts and missing data scenarios.

DetailsMotivation: The motivation is to address the underexplored task of out-of-domain imputation of missing values in time series, leveraging INRs to handle diverse missing data scenarios and sampling rates.

Method: The proposed method, MoTM, combines a basis of INRs (each trained on distinct time series patterns) with a ridge regressor to adapt to observed context at inference, enabling robust generalization.

Result: MoTM demonstrates robust in-domain and out-of-domain generalization across diverse imputation scenarios, including block and pointwise missingness and variable sampling rates.

Conclusion: The work paves the way for adaptable foundation models for time series imputation by addressing distribution shifts and diverse missing data challenges.

Abstract: Recent years have witnessed a growing interest for time series foundation models, with a strong emphasis on the forecasting task. Yet, the crucial task of out-of-domain imputation of missing values remains largely underexplored. We propose a first step to fill this gap by leveraging implicit neural representations (INRs). INRs model time series as continuous functions and naturally handle various missing data scenarios and sampling rates. While they have shown strong performance within specific distributions, they struggle under distribution shifts. To address this, we introduce MoTM (Mixture of Timeflow Models), a step toward a foundation model for time series imputation. Building on the idea that a new time series is a mixture of previously seen patterns, MoTM combines a basis of INRs, each trained independently on a distinct family of time series, with a ridge regressor that adapts to the observed context at inference. We demonstrate robust in-domain and out-of-domain generalization across diverse imputation scenarios (e.g., block and pointwise missingness, variable sampling rates), paving the way for adaptable foundation imputation models.
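
The inference-time step admits a compact sketch: evaluate a basis of pretrained continuous models at the observed timestamps, fit a ridge regressor to the observed values, and query the fit at the missing timestamps. The sine basis below is only a stand-in for independently trained INRs.

```python
import numpy as np
from sklearn.linear_model import Ridge

basis = [lambda t, k=k: np.sin((k + 1) * t) for k in range(8)]  # INR stand-ins

def impute(t_obs, y_obs, t_missing, alpha=1.0):
    phi_obs = np.stack([f(t_obs) for f in basis], axis=1)
    reg = Ridge(alpha=alpha).fit(phi_obs, y_obs)  # adapt mixture to the context
    phi_mis = np.stack([f(t_missing) for f in basis], axis=1)
    return reg.predict(phi_mis)

t_obs = np.linspace(0, 6, 40)
y_obs = np.sin(2 * t_obs) + 0.1 * np.random.randn(40)
print(impute(t_obs, y_obs, t_missing=np.array([2.5, 3.0])))
```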

cs.MA

[341] CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education

Jianing Zhao, Peng Gao, Jiannong Cao, Zhiyuan Wen, Chen Chen, Jianing Yin, Ruosong Yang, Bo Yuan

Main category: cs.MA

TL;DR: CodeEdu is a multi-agent LLM platform for personalized coding education, improving student performance through dynamic task allocation and specialized agent functions.

DetailsMotivation: Existing LLM-based approaches lack personalized learning, ability assessment, and interactive tutoring, limiting their effectiveness in coding education.

Method: CodeEdu uses multi-agent LLMs with specialized roles (e.g., task planning, tutoring, debugging) and external tools for dynamic, personalized education.

Result: Automated evaluations show CodeEdu significantly boosts students’ coding performance.

Conclusion: CodeEdu demonstrates the potential of multi-agent LLMs in enhancing coding education through personalized, interactive learning.

Abstract: Large Language Models (LLMs) have demonstrated considerable potential in improving coding education by providing support for code writing, explanation, and debugging. However, existing LLM-based approaches generally fail to assess students’ abilities, design learning plans, provide personalized material aligned with individual learning goals, and enable interactive learning. Current work mostly uses single LLM agents, which limits their ability to understand complex code repositories and schedule step-by-step tutoring. Recent research has shown that multi-agent LLMs can collaborate to solve complicated problems in various domains like software engineering, but their potential in the field of education remains unexplored. In this work, we introduce CodeEdu, an innovative multi-agent collaborative platform that combines LLMs with tool use to provide proactive and personalized education in coding. Unlike static pipelines, CodeEdu dynamically allocates agents and tasks to meet student needs. Each agent in CodeEdu takes on a specialized function, including task planning, personalized material generation, real-time QA, step-by-step tutoring, code execution, debugging, and learning report generation, supported by extensive external tools that improve task efficiency. Automated evaluations reveal that CodeEdu substantially enhances students’ coding performance.

cs.MM

[342] SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection

Peican Zhu, Yubo Jing, Le Cheng, Bin Chen, Xiaodong Cui, Lianwei Wu, Keke Tang

Main category: cs.MM

TL;DR: The paper introduces SEER, a network for multimodal fake news detection, leveraging semantic enhancement from large models and emotional reasoning to improve accuracy.

DetailsMotivation: Existing methods neglect semantic enhancement from large multimodal models and emotional features, despite fake news often exhibiting negative emotions.

Method: SEER uses summarized captions for image understanding, large model outputs for semantic enhancement, and an emotional reasoning module to infer news authenticity.

Result: SEER outperforms state-of-the-art baselines on two real-world datasets.

Conclusion: The SEER network effectively combines semantic enhancement and emotional reasoning for superior fake news detection.

Abstract: Previous studies on multimodal fake news detection mainly focus on the alignment and integration of cross-modal features, as well as the application of text-image consistency. However, they overlook the semantic enhancement effects of large multimodal models and pay little attention to the emotional features of news. In addition, fake news tends to contain more negative emotion than real news. Therefore, we propose a novel Semantic Enhancement and Emotional Reasoning (SEER) Network for multimodal fake news detection. We generate summarized captions for image semantic understanding and utilize the products of large multimodal models for semantic enhancement. Inspired by the perceived relationship between news authenticity and emotional tendencies, we propose an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer the authenticity of news. Extensive experiments on two real-world datasets demonstrate the superiority of our SEER over state-of-the-art baselines.

[343] Align Your Rhythm: Generating Highly Aligned Dance Poses with Gating-Enhanced Rhythm-Aware Feature Representation

Congyi Fan, Jian Guan, Xuanjia Zhao, Dongli Xu, Youtian Lin, Tong Ye, Pengming Feng, Haiwei Pan

Main category: cs.MM

TL;DR: Danceba is a novel framework for music-driven dance generation, improving rhythm alignment and motion dynamics using Phase-Based Rhythm Extraction, Temporal-Gated Causal Attention, and Parallel Mamba Motion Modeling.

DetailsMotivation: Generating natural, diverse, and rhythmic dance movements driven by music is challenging due to poor beat alignment and unnatural dynamics in existing methods.

Method: Danceba uses PRE for rhythm extraction, TGCA for global rhythmic focus, and PMMM for separate modeling of upper/lower body motions and musical features.

Result: Danceba outperforms state-of-the-art methods in rhythmic alignment and motion diversity.

Conclusion: The proposed framework effectively enhances music-driven dance generation, offering better rhythmic sensitivity and natural motion.

Abstract: Automatically generating natural, diverse and rhythmic human dance movements driven by music is vital for virtual reality and film industries. However, generating dance that naturally follows music remains a challenge, as existing methods lack proper beat alignment and exhibit unnatural motion dynamics. In this paper, we propose Danceba, a novel framework that leverages gating mechanism to enhance rhythm-aware feature representation for music-driven dance generation, which achieves highly aligned dance poses with enhanced rhythmic sensitivity. Specifically, we introduce Phase-Based Rhythm Extraction (PRE) to precisely extract rhythmic information from musical phase data, capitalizing on the intrinsic periodicity and temporal structures of music. Additionally, we propose Temporal-Gated Causal Attention (TGCA) to focus on global rhythmic features, ensuring that dance movements closely follow the musical rhythm. We also introduce Parallel Mamba Motion Modeling (PMMM) architecture to separately model upper and lower body motions along with musical features, thereby improving the naturalness and diversity of generated dance movements. Extensive experiments confirm that Danceba outperforms state-of-the-art methods, achieving significantly better rhythmic alignment and motion diversity. Project page: https://danceba.github.io/ .

eess.AS

[344] Unifying Listener Scoring Scales: Comparison Learning Framework for Speech Quality Assessment and Continuous Speech Emotion Recognition

Cheng-Hung Hu, Yusuke Yasuda, Akifumi Yoshimoto, Tomoki Toda

Main category: eess.AS

TL;DR: The paper proposes a method to unify listener scoring scales in SQA and CSER tasks, addressing biases from mean listener approaches and improving prediction performance.

DetailsMotivation: Listener ratings in SQA and CSER are biased due to individual factors. Mean listener approaches distort ordinal data, while learning multiple scales limits effectiveness.

Method: The method models a unified listener scoring scale using comparison scores to capture scoring relationships between utterances.

Result: The method improves prediction performance in SQA and CSER tasks, demonstrating effectiveness and robustness.

Conclusion: The unified listener scoring scale approach outperforms mean listener methods, reducing bias and enhancing performance.

Abstract: Speech Quality Assessment (SQA) and Continuous Speech Emotion Recognition (CSER) are two key tasks in speech technology, both relying on listener ratings. However, these ratings are inherently biased due to individual listener factors. Previous approaches have introduced a mean listener scoring scale and modeled all listener scoring scales in the training set. However, the mean listener approach is prone to distortion from averaging ordinal data, leading to potential biases. Moreover, learning multiple listener scoring scales while inferring based only on the mean listener scale limits effectiveness. In contrast, our method focuses on modeling a unified listener scoring scale, using comparison scores to correctly capture the scoring relationships between utterances. Experimental results show that our method effectively improves prediction performance in both SQA and CSER tasks, proving its effectiveness and robustness.
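The abstract does not spell out the training objective, so the following is only a plausible reading: a minimal PyTorch sketch of comparison learning on a unified scale, where a score head is trained on pairwise orderings of utterances rated by the same listener, so listener-specific scale offsets cancel out. The embedding dimension and architecture are illustrative, not details from the paper.

```python
import torch
import torch.nn as nn

class ScorePredictor(nn.Module):
    """Maps an utterance embedding to a scalar score on a unified scale."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.head(x).squeeze(-1)

def comparison_loss(model, emb_a, emb_b, labels):
    """Bradley-Terry-style loss on utterance pairs rated by the same listener.

    labels[i] = 1.0 if utterance a was rated higher than b, else 0.0.
    Only the ordering is used, so per-listener offsets drop out of the loss.
    """
    diff = model(emb_a) - model(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(diff, labels)

# Toy usage with random embeddings
model = ScorePredictor()
emb_a, emb_b = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 2, (8,)).float()
loss = comparison_loss(model, emb_a, emb_b, labels)
loss.backward()
```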

[345] TGIF: Talker Group-Informed Familiarization of Target Speaker Extraction

Tsun-An Hsieh, Minje Kim

Main category: eess.AS

TL;DR: The paper introduces TGIF, a specialized target speaker extraction (TSE) system for specific talker groups, using knowledge distillation to adapt to unique speech characteristics while maintaining efficiency.

DetailsMotivation: Current TSE systems are generalists, lacking customization for small talker groups like families. Personalized solutions exist but ignore practical needs for group-specific adaptation.

Method: Proposes TGIF, where a group-specific student model learns from pseudo-clean targets generated by a large teacher model via knowledge distillation.

Result: Outperforms baseline generic models by adapting to unique speech characteristics of a speaker group.

Conclusion: TGIF highlights the potential for specialized TSE solutions in real-world applications, like on-device family use.

Abstract: State-of-the-art target speaker extraction (TSE) systems are typically designed to generalize to any given mixing environment, necessitating a model with a large enough capacity as a generalist. Personalized speech enhancement could be a specialized solution that adapts to single-user scenarios, but it overlooks the practical need for customization in cases where only a small number of talkers are involved, e.g., TSE for a specific family. We address this gap with the proposed concept, talker group-informed familiarization (TGIF) of TSE, where the TSE system specializes in a particular group of users, which is challenging due to the inherent absence of a clean speech target. To this end, we employ a knowledge distillation approach, where a group-specific student model learns from the pseudo-clean targets generated by a large teacher model. This tailors the student model to effectively extract the target speaker from the particular talker group while maintaining computational efficiency. Experimental results demonstrate that our approach outperforms the baseline generic models by adapting to the unique speech characteristics of a given speaker group. Our newly proposed TGIF concept underscores the potential of developing specialized solutions for diverse and real-world applications, such as on-device TSE on a family-owned device.
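A minimal sketch of the distillation loop described in the abstract, assuming a frozen large teacher and a compact student that both map (mixture, enrollment) pairs to an estimated target signal. The SI-SDR loss and the model signatures are assumptions for illustration, not details from the paper.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimate and (pseudo-)reference."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    ratio = proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def distill_step(student, teacher, mixture, enroll, optimizer):
    """One TGIF-style step: the frozen teacher extracts a pseudo-clean target
    from the group's own (unlabeled) mixtures; the small student mimics it."""
    with torch.no_grad():
        pseudo_clean = teacher(mixture, enroll)  # no real clean target exists
    est = student(mixture, enroll)
    loss = si_sdr_loss(est, pseudo_clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```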

[346] A lightweight and robust method for blind wideband-to-fullband extension of speech

Jan Büthe, Jean-Marc Valin

Main category: eess.AS

TL;DR: A lightweight method for extending speech bandwidth with ~370K parameters and ~140 MFLOPS, improving quality at low bitrates (6-12 kb/s) and matching higher-bitrate codecs.

DetailsMotivation: To enhance speech quality in low-bandwidth or low-complexity scenarios without requiring guided bandwidth extensions.

Method: Proposes a robust, lightweight model inspired by classical speech coding techniques, with minimal parameters and complexity, suitable for wideband codecs.

Result: Significantly improves speech quality from 6 to 12 kb/s; paired with Opus 1.5 at 9 kb/s, it matches the quality of 3GPP EVS at 9.6 kb/s and of Opus 1.4 at 18 kb/s.

Conclusion: The blind bandwidth extension method achieves quality comparable to guided extensions, enabling backward-compatible quality enhancement.

Abstract: Reducing the bandwidth of speech is common practice in resource-constrained environments like low-bandwidth speech transmission or low-complexity vocoding. We propose a lightweight and robust method for extending the bandwidth of wideband speech signals that is inspired by classical methods developed in the speech coding context. The resulting model has just ~370K parameters and a complexity of ~140 MFLOPS (or ~70 MMACS). With a frame size of 10 ms and a lookahead of only 0.27 ms, the model is well-suited for use with common wideband speech codecs. We evaluate the model’s robustness by pairing it with the Opus SILK speech codec (1.5 release) and verify in a P.808 DCR listening test that it significantly improves quality from 6 to 12 kb/s. We also demonstrate that Opus 1.5 together with the proposed bandwidth extension at 9 kb/s meets the quality of 3GPP EVS at 9.6 kb/s and that of Opus 1.4 at 18 kb/s, showing that the blind bandwidth extension can meet the quality of classical guided bandwidth extensions, thus providing a way for backward-compatible quality improvement.

[347] Incremental Averaging Method to Improve Graph-Based Time-Difference-of-Arrival Estimation

Klaus Brümann, Kouei Yamaoka, Nobutaka Ono, Simon Doclo

Main category: eess.AS

TL;DR: The paper proposes an incremental method to improve TDOA estimation by averaging multiple CPSDs for GCC-PHAT functions, enhancing accuracy in noisy and reverberant environments.

DetailsMotivation: Background noise and reverberation degrade TDOA estimation accuracy, prompting the need for more robust methods.

Method: An incremental averaging of multiple CPSDs for GCC-PHAT functions, leveraging indirect CPSDs via other microphones.

Result: The method reduces TDOA and 2D source position estimation errors compared to single CPSD-based methods.

Conclusion: Averaging multiple CPSDs improves TDOA and source position estimation accuracy in challenging acoustic conditions.

Abstract: Estimating the position of a speech source based on time-differences-of-arrival (TDOAs) is often adversely affected by background noise and reverberation. A popular method to estimate the TDOA between a microphone pair involves maximizing a generalized cross-correlation with phase transform (GCC-PHAT) function. Since the TDOAs across different microphone pairs satisfy consistency relations, generally only a small subset of microphone pairs are used for source position estimation. Although the set of microphone pairs is often determined based on a reference microphone, recently a more robust method has been proposed to determine the set of microphone pairs by computing the minimum spanning tree (MST) of a signal graph of GCC-PHAT function reliabilities. To reduce the influence of noise and reverberation on the TDOA estimation accuracy, in this paper we propose to compute the GCC-PHAT functions of the MST based on an average of multiple cross-power spectral densities (CPSDs) using an incremental method. In each step of the method, we increase the number of CPSDs over which we average by considering CPSDs computed indirectly via other microphones from previous steps. Using signals recorded in a noisy and reverberant laboratory with an array of spatially distributed microphones, the performance of the proposed method is evaluated in terms of TDOA estimation error and 2D source position estimation error. Experimental results for different source and microphone configurations and three reverberation conditions show that the proposed method considering multiple CPSDs improves the TDOA estimation and source position estimation accuracy compared to the reference microphone- and MST-based methods that rely on a single CPSD as well as steered-response power-based source position estimation.
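To make the core idea concrete, here is a simplified, single-step NumPy sketch of GCC-PHAT with averaged CPSDs: the direct phase-normalized CPSD between microphones i and j is averaged with indirect estimates routed via each other microphone k, exploiting the fact that for a single source the TDOAs (and hence CPSD phases) add along the path i → k → j. The paper's incremental scheme over the MST is more elaborate than this.

```python
import numpy as np

def cpsd(x, y, nfft=1024):
    """Single-frame cross-power spectral density estimate."""
    X, Y = np.fft.rfft(x, nfft), np.fft.rfft(y, nfft)
    return X * np.conj(Y)

def gcc_phat_averaged(signals, i, j, nfft=1024):
    """GCC-PHAT between mics i and j, averaging the direct (phase-only) CPSD
    with indirect estimates routed via every other mic k. A simplified,
    one-step version of the incremental averaging idea."""
    def phat(S, eps=1e-12):
        return S / (np.abs(S) + eps)

    acc = phat(cpsd(signals[i], signals[j], nfft))
    n = 1
    for k in range(len(signals)):
        if k in (i, j):
            continue
        # Phases add along the indirect path i -> k -> j
        acc += phat(cpsd(signals[i], signals[k], nfft)) * phat(cpsd(signals[k], signals[j], nfft))
        n += 1
    gcc = np.fft.irfft(acc / n, nfft)
    # Lag (in samples) that maximizes the averaged GCC-PHAT function
    return np.argmax(np.abs(np.fft.fftshift(gcc))) - nfft // 2
```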

eess.IV

[348] Flatten Wisely: How Patch Order Shapes Mamba-Powered Vision for MRI Segmentation

Osama Hardan, Omar Elshenhabi, Tamer Khattab, Mohamed Mabrok

Main category: eess.IV

TL;DR: The paper studies the impact of patch scan order in Vision Mamba models for MRI segmentation, introducing MS2D to explore scan paths efficiently. Results show scan order significantly affects performance, with contiguous paths outperforming disjointed ones.

DetailsMotivation: Vision Mamba models offer linear computational cost but overlook the critical design choice of patch scan order, especially in medical imaging with strong anatomical priors.

Method: Introduces Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures, and benchmarks 21 scan strategies on three datasets.

Result: Scan order is statistically significant (χ²=43.9, p=0.0016), with performance varying by up to 27 Dice points. Contiguous paths (horizontal/vertical rasters) outperform disjointed scans.

Conclusion: Scan order is a powerful, cost-free hyperparameter. The paper provides evidence-based optimal paths for Mamba models in medical imaging.

Abstract: Vision Mamba models promise transformer-level performance at linear computational cost, but their reliance on serializing 2D images into 1D sequences introduces a critical, yet overlooked, design choice: the patch scan order. In medical imaging, where modalities like brain MRI contain strong anatomical priors, this choice is non-trivial. This paper presents the first systematic study of how scan order impacts MRI segmentation. We introduce Multi-Scan 2D (MS2D), a parameter-free module for Mamba-based architectures that facilitates exploring diverse scan paths without additional computational cost. We conduct a large-scale benchmark of 21 scan strategies on three public datasets (BraTS 2020, ISLES 2022, LGG), covering over 70,000 slices. Our analysis shows conclusively that scan order is a statistically significant factor (Friedman test: $\chi^{2}_{20}=43.9, p=0.0016$), with performance varying by as much as 27 Dice points. Spatially contiguous paths – simple horizontal and vertical rasters – consistently outperform disjointed diagonal scans. We conclude that scan order is a powerful, cost-free hyperparameter, and provide an evidence-based shortlist of optimal paths to maximize the performance of Mamba models in medical imaging.
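Conceptually, changing the scan order only means permuting the flattened patch sequence before the Mamba block and un-permuting afterwards, which is why a module like MS2D can be parameter-free. A small illustrative sketch (not the authors' code):

```python
import numpy as np

def scan_order(h, w, mode="horizontal"):
    """Return a permutation flattening an h x w patch grid into a 1D sequence."""
    idx = np.arange(h * w).reshape(h, w)
    if mode == "horizontal":   # row-major raster
        return idx.ravel()
    if mode == "vertical":     # column-major raster
        return idx.T.ravel()
    if mode == "snake":        # boustrophedon: reverse every other row
        idx = idx.copy()
        idx[1::2] = idx[1::2, ::-1]
        return idx.ravel()
    if mode == "diagonal":     # anti-diagonal wavefronts (a disjointed path)
        return np.array(sorted(range(h * w), key=lambda k: (k // w + k % w, k // w)))
    raise ValueError(mode)

# Reorder a (batch, h*w, dim) patch sequence before a Mamba block, and invert
# the permutation afterwards so spatial positions line up for the decoder.
perm = scan_order(16, 16, "snake")
inv = np.argsort(perm)
```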

[349] Enhanced DeepLab Based Nerve Segmentation with Optimized Tuning

Akhil John Thomas, Christiaan Boerkamp

Main category: eess.IV

TL;DR: Optimized DeepLabV3 pipeline with automated threshold fine-tuning improves nerve segmentation in ultrasound imaging, achieving high accuracy metrics.

DetailsMotivation: Precise nerve segmentation in medical imaging is critical for accurate identification of nerve structures.

Method: DeepLabV3-based segmentation pipeline with automated threshold fine-tuning, refined preprocessing, and parameter optimization.

Result: Achieved Dice Score of 0.78, IoU of 0.70, and Pixel Accuracy of 0.95, outperforming baseline models.

Conclusion: Tailored parameter selection is vital for enhancing automated nerve detection accuracy.

Abstract: Nerve segmentation is crucial in medical imaging for precise identification of nerve structures. This study presents an optimized DeepLabV3-based segmentation pipeline that incorporates automated threshold fine-tuning to improve segmentation accuracy. By refining preprocessing steps and implementing parameter optimization, we achieved a Dice Score of 0.78, an IoU of 0.70, and a Pixel Accuracy of 0.95 on ultrasound nerve imaging. The results demonstrate significant improvements over baseline models and highlight the importance of tailored parameter selection in automated nerve detection.
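The automated threshold fine-tuning can be pictured as a simple validation sweep; a hedged sketch of one way to implement it (the grid and the metric choice are assumptions):

```python
import numpy as np

def dice(pred, gt, eps=1e-7):
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

def tune_threshold(probs, masks, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff that maximizes mean Dice on validation data.

    probs: per-pixel foreground probability maps from the segmentation model
    masks: matching binary ground-truth masks
    """
    scores = [np.mean([dice(p >= t, m) for p, m in zip(probs, masks)]) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]
```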

[350] Domain-randomized deep learning for neuroimage analysis

Malte Hoffmann

Main category: eess.IV

TL;DR: The paper discusses a domain-randomization strategy using synthetic images to improve deep learning model robustness and generalizability in neuroimage analysis, particularly for MRI.

DetailsMotivation: The narrow scope of training datasets limits model robustness and generalizability, especially in MRI where image appearance varies widely.

Method: A domain-randomization strategy trains deep neural networks on synthetic images with randomized intensities and anatomical content, generated from segmentation maps.

Result: The approach enables models to process unseen image types accurately without retraining, showing effectiveness across multiple imaging modalities.

Conclusion: The synthesis-driven training paradigm improves generalization and accessibility of deep learning tools, though it increases computational demands.

Abstract: Deep learning has revolutionized neuroimage analysis by delivering unprecedented speed and accuracy. However, the narrow scope of many training datasets constrains model robustness and generalizability. This challenge is particularly acute in magnetic resonance imaging (MRI), where image appearance varies widely across pulse sequences and scanner hardware. A recent domain-randomization strategy addresses the generalization problem by training deep neural networks on synthetic images with randomized intensities and anatomical content. By generating diverse data from anatomical segmentation maps, the approach enables models to accurately process image types unseen during training, without retraining or fine-tuning. It has demonstrated effectiveness across modalities including MRI, computed tomography, positron emission tomography, and optical coherence tomography, as well as beyond neuroimaging in ultrasound, electron and fluorescence microscopy, and X-ray microtomography. This tutorial paper reviews the principles, implementation, and potential of the synthesis-driven training paradigm. It highlights key benefits, such as improved generalization and resistance to overfitting, while discussing trade-offs such as increased computational demands. Finally, the article explores practical considerations for adopting the technique, aiming to accelerate the development of generalizable tools that make deep learning more accessible to domain experts without extensive computational resources or machine learning knowledge.
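The heart of the approach is cheap to sketch: given an anatomical label map, draw a random mean intensity per label, then apply noise and blur so every training sample shows a new contrast. The ranges below are placeholders; real generators of this kind also randomize deformations, bias fields, and resolution.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize(label_map, rng=None):
    """Generate one randomized training image from an anatomical label map.

    Each label gets a random mean intensity and per-voxel noise, so no two
    epochs show the network the same contrast; the label map is the target.
    """
    if rng is None:
        rng = np.random.default_rng()
    img = np.zeros(label_map.shape, dtype=np.float32)
    for lab in np.unique(label_map):
        mu, sigma = rng.uniform(0, 255), rng.uniform(1, 25)
        region = label_map == lab
        img[region] = rng.normal(mu, sigma, region.sum())
    img = gaussian_filter(img, sigma=rng.uniform(0.5, 1.5))  # PSF-like blur
    return np.clip(img, 0, 255) / 255.0  # train on (img, label_map) pairs
```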

[351] BreastSegNet: Multi-label Segmentation of Breast MRI

Qihang Li, Jichen Yang, Yaqian Chen, Yuwen Chen, Hanxue Gu, Lars J. Grimm, Maciej A. Mazurowski

Main category: eess.IV

TL;DR: The paper presents BreastSegNet, a multi-label segmentation algorithm for breast MRI covering nine anatomical labels, plus a nine-model benchmark in which nnU-Net ResEncM performs best, with Dice scores approaching 0.90 for heart and liver.

DetailsMotivation: Existing breast MRI segmentation methods are limited in scope, focusing on only a few structures, which reduces their utility for quantitative analysis.

Method: BreastSegNet is developed and benchmarked against nine segmentation models using a manually annotated dataset of 1123 MRI slices.

Result: nnU-Net ResEncM achieves the highest average Dice score (0.694), excelling in heart, liver, muscle, FGT, and bone segmentation.

Conclusion: BreastSegNet and the accompanying benchmark, led by nnU-Net ResEncM, advance breast MRI segmentation; model code and weights are publicly available, with a data release planned.

Abstract: Breast MRI provides high-resolution imaging critical for breast cancer screening and preoperative staging. However, existing segmentation methods for breast MRI remain limited in scope, often focusing on only a few anatomical structures, such as fibroglandular tissue or tumors, and do not cover the full range of tissues seen in scans. This narrows their utility for quantitative analysis. In this study, we present BreastSegNet, a multi-label segmentation algorithm for breast MRI that covers nine anatomical labels: fibroglandular tissue (FGT), vessel, muscle, bone, lesion, lymph node, heart, liver, and implant. We manually annotated a large set of 1123 MRI slices capturing these structures with detailed review and correction from an expert radiologist. Additionally, we benchmark nine segmentation models, including U-Net, SwinUNet, UNet++, SAM, MedSAM, and nnU-Net with multiple ResNet-based encoders. Among them, nnU-Net ResEncM achieves the highest average Dice scores of 0.694 across all labels. It performs especially well on heart, liver, muscle, FGT, and bone, with Dice scores exceeding 0.73, and approaching 0.90 for heart and liver. All model code and weights are publicly available, and we plan to release the data at a later date.

[352] Converting T1-weighted MRI from 3T to 7T quality using deep learning

Malo Gicquel, Ruoyi Zhao, Anika Wuestefeld, Nicola Spotorno, Olof Strandberg, Kalle Åström, Yu Xiao, Laura EM Wisse, Danielle van Westen, Rik Ossenkoppele, Niklas Mattsson-Carlgren, David Berron, Oskar Hansson, Gabrielle Flood, Jacob Vogel

Main category: eess.IV

TL;DR: A deep learning model synthesizes 7T MRI from 3T MRI, improving image quality and segmentation without compromising downstream task performance.

DetailsMotivation: 7T MRI offers superior resolution and contrast but is less accessible than 3T MRI. The study aims to bridge this gap by generating synthetic 7T images from 3T scans.

Method: Two models were trained: a specialized U-Net and a GAN U-Net, using paired 7T and 3T T1-weighted images from 172 participants. Performance was compared to state-of-the-art models.

Result: Synthetic 7T images matched real 7T in detail and surpassed them in visual quality. Automated segmentations from synthetic images were more accurate than those from 3T. Downstream cognitive status prediction remained comparable.

Conclusion: Synthetic 7T images from 3T scans can enhance quality and segmentation without affecting downstream tasks, offering a viable alternative to inaccessible 7T MRI.

Abstract: Ultra-high resolution 7 tesla (7T) magnetic resonance imaging (MRI) provides detailed anatomical views, offering better signal-to-noise ratio, resolution and tissue contrast than 3T MRI, though at the cost of accessibility. We present an advanced deep learning model for synthesizing 7T brain MRI from 3T brain MRI. Paired 7T and 3T T1-weighted images were acquired from 172 participants (124 cognitively unimpaired, 48 impaired) from the Swedish BioFINDER-2 study. To synthesize 7T MRI from 3T images, we trained two models: a specialized U-Net, and a U-Net integrated with a generative adversarial network (GAN U-Net). Our models outperformed two additional state-of-the-art 3T-to-7T models in image-based evaluation metrics. Four blinded MRI professionals judged our synthetic 7T images as comparable in detail to real 7T images, and superior in subjective visual quality to 7T images, apparently due to the reduction of artifacts. Importantly, automated segmentations of the amygdalae of synthetic GAN U-Net 7T images were more similar to manually segmented amygdalae (n=20), than automated segmentations from the 3T images that were used to synthesize the 7T images. Finally, synthetic 7T images showed similar performance to real 3T images in downstream prediction of cognitive status using MRI derivatives (n=3,168). In all, we show that synthetic T1-weighted brain images approaching 7T quality can be generated from 3T images, which may improve image quality and segmentation, without compromising performance in downstream tasks. Future directions, possible clinical use cases, and limitations are discussed.
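The abstract does not give the training losses, so the following is a generic sketch of how a paired GAN U-Net for 3T-to-7T synthesis is commonly trained: an L1 term anchors the output to the real 7T target while an adversarial term sharpens its appearance. The discriminator update (omitted) and the λ weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def generator_step(gen, disc, mri_3t, mri_7t, opt_g, lambda_l1=100.0):
    """One generator update for paired 3T-to-7T synthesis: L1 keeps the
    output anatomically faithful to the real 7T target, the adversarial
    term pushes it toward the 7T appearance distribution."""
    fake_7t = gen(mri_3t)
    logits = disc(fake_7t)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    loss = adv + lambda_l1 * F.l1_loss(fake_7t, mri_7t)
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```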

[353] Divide and Conquer: A Large-Scale Dataset and Model for Left-Right Breast MRI Segmentation

Maximilian Rokuss, Benjamin Hamm, Yannick Kirchhoff, Klaus Maier-Hein

Main category: eess.IV

TL;DR: First public breast MRI dataset with left-right segmentation labels (13,000+ cases) and a deep-learning model for segmentation, addressing a gap in breast MRI analysis.

DetailsMotivation: To fill a critical gap in breast MRI analysis by providing a publicly available dataset and model for left-right breast segmentation.

Method: Developed a deep-learning model trained on the annotated dataset for accurate left-right breast segmentation.

Result: A dataset of over 13,000 annotated cases and a robust deep-learning model for segmentation are made publicly available.

Conclusion: This work provides a valuable resource for advancing tools in women’s health, with open access to the dataset and model.

Abstract: We introduce the first publicly available breast MRI dataset with explicit left and right breast segmentation labels, encompassing more than 13,000 annotated cases. Alongside this dataset, we provide a robust deep-learning model trained for left-right breast segmentation. This work addresses a critical gap in breast MRI analysis and offers a valuable resource for the development of advanced tools in women’s health. The dataset and trained model are publicly available at: www.github.com/MIC-DKFZ/BreastDivider

[354] Software architecture and manual for novel versatile CT image analysis toolbox – AnatomyArchive

Lei Xu, Torkel B Brismar

Main category: eess.IV

TL;DR: AnatomyArchive is a novel CT image analysis tool built on TotalSegmentator, offering automated volume selection, segmentation mask management, and radiomic feature extraction for precise body composition analysis.

DetailsMotivation: To streamline and enhance CT image analysis by automating target volume selection, segmentation, and radiomic feature extraction for improved body composition analysis.

Method: Leverages TotalSegmentator for full-body segmentation, integrates a knowledge graph for mask management, and provides GPU-accelerated tools for radiomics and cinematic rendering.

Result: Enables precise 2D/3D body composition analysis, efficient segmentation, and robust radiomic feature extraction with an open-source release.

Conclusion: AnatomyArchive is a powerful, open-source tool for advanced CT image analysis, aiding modern machine learning model development.

Abstract: We have developed a novel CT image analysis package named AnatomyArchive, built on top of the recent full-body segmentation model TotalSegmentator. It provides automatic target volume selection and deselection capabilities according to user-configured anatomies for volumetric upper- and lower-bounds. It has a knowledge graph-based and time-efficient tool for anatomy segmentation mask management and medical image database maintenance. AnatomyArchive enables automatic body volume cropping, as well as automatic arm-detection and exclusion, for more precise body composition analysis in both 2D and 3D formats. It provides robust voxel-based radiomic feature extraction, feature visualization, and an integrated toolchain for statistical tests and analysis. A Python-based, GPU-accelerated, nearly photo-realistic segmentation-integrated composite cinematic rendering is also included. We present here its software architecture design, illustrate its workflow and the working principles of its algorithms, and provide a few examples of how the software can be used to assist development of modern machine learning models. Open-source code will be released at https://github.com/lxu-medai/AnatomyArchive for research and educational purposes only.

[355] Blind Super Resolution with Reference Images and Implicit Degradation Representation

Huu-Phu Do, Po-Chih Hu, Hao-Chien Hsueh, Che-Kai Liu, Vu-Hoang Tran, Ching-Chun Huang

Main category: eess.IV

TL;DR: The paper introduces a novel strategy for blind super-resolution (BSR) by using HR reference images to create scale-aware degradation kernels, improving SR performance.

DetailsMotivation: Previous BSR methods focus on estimating degradation kernels from LR inputs but ignore the impact of scaling factors, leading to impractical results across varying scales.

Method: The proposed method uses HR reference images to adaptively determine degradation processes, generating additional LR-HR pairs for training.

Result: The approach outperforms previous methods in both proficiently trained and zero-shot blind SR scenarios.

Conclusion: Considering blur kernels, scaling factors, and HR references enhances the effectiveness of blind super-resolution tasks.

Abstract: Previous studies in blind super-resolution (BSR) have primarily concentrated on estimating degradation kernels directly from low-resolution (LR) inputs to enhance super-resolution. However, these degradation kernels, which model the transition from a high-resolution (HR) image to its LR version, should account for not only the degradation process but also the downscaling factor. Applying the same degradation kernel across varying super-resolution scales may be impractical. Our research acknowledges degradation kernels and scaling factors as pivotal elements for the BSR task and introduces a novel strategy that utilizes HR images as references to establish scale-aware degradation kernels. By employing content-irrelevant HR reference images alongside the target LR image, our model adaptively discerns the degradation process. It is then applied to generate additional LR-HR pairs through down-sampling the HR reference images, which are keys to improving the SR performance. Our reference-based training procedure is applicable to proficiently trained blind SR models and zero-shot blind SR methods, consistently outperforming previous methods in both scenarios. This dual consideration of blur kernels and scaling factors, coupled with the use of a reference image, contributes to the effectiveness of our approach in blind super-resolution tasks.
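The pair-generation step lends itself to a short sketch: once a scale-aware kernel has been inferred from the HR references and the target LR image, extra LR-HR training pairs are minted by blurring and downsampling the references. How the kernel is estimated is the model's job and is not shown here.

```python
import torch
import torch.nn.functional as F

def make_lr_hr_pairs(hr_refs, kernel, scale):
    """Degrade HR reference images with an estimated scale-aware kernel to
    mint extra LR-HR training pairs (blur, then bicubic downsample).

    hr_refs: (N, C, H, W) HR reference batch; kernel: (1, 1, k, k), sums to 1.
    """
    c = hr_refs.shape[1]
    k = kernel.expand(c, 1, -1, -1)        # same kernel applied per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(hr_refs, [pad] * 4, mode="reflect"), k, groups=c)
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic",
                       align_corners=False)
    return lr, hr_refs                     # (input, target) training pairs
```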

[356] Leveraging Pathology Foundation Models for Panoptic Segmentation of Melanoma in H&E Images

Jiaqi Lv, Yijie Zhu, Carmen Guadalupe Colin Tenorio, Brinder Singh Chohan, Mark Eastwood, Shan E Ahmed Raza

Main category: eess.IV

TL;DR: A deep learning model for segmenting five tissue classes in melanoma H&E images, using a pathology foundation model (Virchow2) and Efficient-UNet, won the PUMA Grand Challenge.

DetailsMotivation: Manual segmentation of melanoma tissue is labor-intensive and inconsistent, necessitating automated methods.

Method: Combines Virchow2 (a pathology foundation model) with Efficient-UNet for feature extraction and segmentation.

Result: Achieved top performance in the PUMA Grand Challenge, showing robust and generalizable results.

Conclusion: Pathology foundation models can enhance segmentation networks, improving computational pathology workflows.

Abstract: Melanoma is an aggressive form of skin cancer with rapid progression and high metastatic potential. Accurate characterisation of tissue morphology in melanoma is crucial for prognosis and treatment planning. However, manual segmentation of tissue regions from haematoxylin and eosin (H&E) stained whole-slide images (WSIs) is labour-intensive and prone to inter-observer variability, which motivates the need for reliable automated tissue segmentation methods. In this study, we propose a novel deep learning network for the segmentation of five tissue classes in melanoma H&E images. Our approach leverages Virchow2, a pathology foundation model trained on 3.1 million histopathology images, as a feature extractor. These features are fused with the original RGB images and subsequently processed by an encoder-decoder segmentation network (Efficient-UNet) to produce accurate segmentation maps. The proposed model achieved first place in the tissue segmentation task of the PUMA Grand Challenge, demonstrating robust performance and generalizability. Our results show the potential and efficacy of incorporating pathology foundation models into segmentation networks to accelerate computational pathology workflows.
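The fusion step is architecturally simple. A schematic sketch, with the Virchow2 feature extractor abstracted away, of upsampling patch embeddings to tile resolution and concatenating them with the RGB channels before the Efficient-UNet-style network:

```python
import torch
import torch.nn.functional as F

def fuse_and_segment(rgb, patch_feats, seg_net):
    """Fuse foundation-model patch features with the RGB tile, then segment.

    rgb:         (B, 3, H, W) H&E tile
    patch_feats: (B, D, h, w) Virchow2-style patch embeddings (h, w << H, W)
    seg_net:     encoder-decoder network expecting 3 + D input channels
    """
    up = F.interpolate(patch_feats, size=rgb.shape[-2:], mode="bilinear",
                       align_corners=False)
    return seg_net(torch.cat([rgb, up], dim=1))  # (B, n_classes, H, W)
```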

[357] OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

Ningyong Wu, Jinzhi Wang, Wenhong Zhao, Chenzhan Yu, Zhigang Xiu, Duwei Dai

Main category: eess.IV

TL;DR: OrthoInsight, a multi-modal deep learning framework, integrates YOLOv9 for rib fracture detection, a medical knowledge graph, and LLaVA for report generation, outperforming models like GPT-4 and Claude-3 in diagnostic accuracy and clinical utility.

DetailsMotivation: The increasing volume of medical imaging data and the limitations of manual interpretation for musculoskeletal injuries like rib fractures necessitate automated diagnostic tools.

Method: OrthoInsight combines YOLOv9 for fracture detection, a medical knowledge graph for clinical context, and a fine-tuned LLaVA model for report generation, merging visual and textual data.

Result: Evaluated on 28,675 annotated CT images, OrthoInsight achieves high performance (average score 4.28) in diagnostic accuracy, content completeness, logical coherence, and clinical guidance value, surpassing GPT-4 and Claude-3.

Conclusion: The study highlights the potential of multi-modal learning to enhance medical image analysis and support radiologists effectively.

Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

[358] D2IP: Deep Dynamic Image Prior for 3D Time-sequence Pulmonary Impedance Imaging

Hao Fang, Hao Yu, Sihao Teng, Tao Zhang, Siyi Yuan, Huaiwu He, Zhe Liu, Yunjie Yang

Main category: eess.IV

TL;DR: D2IP improves 3D time-sequence tomographic imaging by accelerating convergence and reducing computational costs with novel strategies like UPWS, TPP, and a lightweight backbone.

DetailsMotivation: Overcome the high computational costs and inefficiency of unsupervised methods like DIP in complex 3D or time-sequence imaging.

Method: Introduces UPWS, TPP, and 3D-FastResUNet to enhance speed, temporal coherence, and efficiency.

Result: Achieves 24.8% higher MSSIM, 8.1% lower ERR, and 7.1x faster computation than baselines.

Conclusion: D2IP is a promising solution for fast and accurate clinical dynamic pulmonary imaging.

Abstract: Unsupervised learning methods, such as Deep Image Prior (DIP), have shown great potential in tomographic imaging due to their training-data-free nature and high generalization capability. However, their reliance on numerous network parameter iterations results in high computational costs, limiting their practical application, particularly in complex 3D or time-sequence tomographic imaging tasks. To overcome these challenges, we propose Deep Dynamic Image Prior (D2IP), a novel framework for 3D time-sequence imaging. D2IP introduces three key strategies - Unsupervised Parameter Warm-Start (UPWS), Temporal Parameter Propagation (TPP), and a customized lightweight reconstruction backbone, 3D-FastResUNet - to accelerate convergence, enforce temporal coherence, and improve computational efficiency. Experimental results on both simulated and clinical pulmonary datasets demonstrate that D2IP enables fast and accurate 3D time-sequence Electrical Impedance Tomography (tsEIT) reconstruction. Compared to state-of-the-art baselines, D2IP delivers superior image quality, with a 24.8% increase in average MSSIM and an 8.1% reduction in ERR, alongside significantly reduced computational time (7.1x faster), highlighting its promise for clinical dynamic pulmonary imaging.
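Temporal Parameter Propagation is the easiest piece to illustrate: only the first frame pays a full DIP-style optimization budget, and each later frame is reconstructed starting from the previous frame's converged weights, which also encourages temporal coherence. The `fit_frame` routine and iteration counts below are placeholders.

```python
def reconstruct_sequence(net, frames, fit_frame, warm_iters=2000, prop_iters=200):
    """D2IP-style temporal parameter propagation (simplified sketch).

    fit_frame(net, measurement, n_iters) runs DIP-style optimization of `net`
    against one frame's measurements and returns the reconstruction. Only
    frame 0 pays the full budget; later frames reuse the converged weights.
    """
    recons = []
    for t, meas in enumerate(frames):
        n_iters = warm_iters if t == 0 else prop_iters  # warm start after t=0
        recons.append(fit_frame(net, meas, n_iters))
        # net's weights now encode frame t and seed the fit for frame t+1
    return recons
```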

[359] UGPL: Uncertainty-Guided Progressive Learning for Evidence-Based Classification in Computed Tomography

Shravan Venkatraman, Pavan Kumar S, Rakesh Raj Madavan, Chandrakala S

Main category: eess.IV

TL;DR: UGPL, an uncertainty-guided progressive learning framework, improves CT image classification by focusing on ambiguous regions and outperforms state-of-the-art methods.

DetailsMotivation: Existing CT image classification methods struggle with subtle, diverse pathological features and lack localized analysis.

Method: UGPL uses evidential deep learning to quantify uncertainty, extracts informative patches via non-maximum suppression, and employs adaptive fusion for global-to-local analysis.

Result: UGPL achieves accuracy improvements of 3.29%, 2.46%, and 8.08% for kidney abnormality, lung cancer, and COVID-19 detection, respectively.

Conclusion: UGPL’s uncertainty-guided progressive learning significantly enhances CT image classification, especially when fully implemented.

Abstract: Accurate classification of computed tomography (CT) images is essential for diagnosis and treatment planning, but existing methods often struggle with the subtle and spatially diverse nature of pathological features. Current approaches typically process images uniformly, limiting their ability to detect localized abnormalities that require focused analysis. We introduce UGPL, an uncertainty-guided progressive learning framework that performs a global-to-local analysis by first identifying regions of diagnostic ambiguity and then conducting detailed examination of these critical areas. Our approach employs evidential deep learning to quantify predictive uncertainty, guiding the extraction of informative patches through a non-maximum suppression mechanism that maintains spatial diversity. This progressive refinement strategy, combined with an adaptive fusion mechanism, enables UGPL to integrate both contextual information and fine-grained details. Experiments across three CT datasets demonstrate that UGPL consistently outperforms state-of-the-art methods, achieving improvements of 3.29%, 2.46%, and 8.08% in accuracy for kidney abnormality, lung cancer, and COVID-19 detection, respectively. Our analysis shows that the uncertainty-guided component provides substantial benefits, with performance dramatically increasing when the full progressive learning pipeline is implemented. Our code is available at: https://github.com/shravan-18/UGPL
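The patch-extraction step can be sketched directly: greedy non-maximum suppression over the (evidential) uncertainty map keeps the selected patches both informative and spatially diverse. The uncertainty map is assumed given; patch size and count are illustrative.

```python
import numpy as np

def select_patches(uncert, k=4, patch=64):
    """Greedy NMS over an uncertainty map: repeatedly take the most uncertain
    location, then suppress a patch-sized neighborhood so the k selected
    patches stay spatially diverse (as in UGPL's patch extraction)."""
    u = uncert.astype(float).copy()
    h, w = u.shape
    centers = []
    for _ in range(k):
        y, x = np.unravel_index(np.argmax(u), u.shape)
        centers.append((y, x))
        y0, y1 = max(0, y - patch // 2), min(h, y + patch // 2)
        x0, x1 = max(0, x - patch // 2), min(w, x + patch // 2)
        u[y0:y1, x0:x1] = -np.inf  # suppress the chosen neighborhood
    return centers  # crop patch-sized windows here for detailed local analysis
```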

[360] Large-Vocabulary Segmentation for Medical Images with Text Prompts

Ziheng Zhao, Yao Zhang, Chaoyi Wu, Xiaoman Zhang, Xiao Zhou, Ya Zhang, Yanfeng Wang, Weidi Xie

Main category: eess.IV

TL;DR: SAT introduces a 3D medical image segmentation model using text prompts from medical terminologies, achieving strong performance and generalization.

DetailsMotivation: To enable large-vocabulary segmentation in 3D medical images using medical terminologies as prompts, addressing the lack of comprehensive datasets and models.

Method: Constructs a multimodal knowledge tree, builds a large dataset, and trains SAT models (Nano and Pro) using contrastive learning for text encoder knowledge injection.

Result: SAT-Pro matches 72 nnU-Nets’ performance, outperforms MedSAM by +7.1% DSC, and shows superior generalization (+3.7% DSC on cross-center datasets).

Conclusion: SAT demonstrates scalable, robust, and generalizable 3D medical image segmentation, advancing the field with its multimodal approach.

Abstract: This paper aims to build a model that can Segment Anything in 3D medical images, driven by medical terminologies as Text prompts, termed as SAT. Our main contributions are three-fold: (i) We construct the first multimodal knowledge tree on human anatomy, including 6502 anatomical terminologies; Then, we build the largest and most comprehensive segmentation dataset for training, collecting over 22K 3D scans from 72 datasets, across 497 classes, with careful standardization on both image and label space; (ii) We propose to inject medical knowledge into a text encoder via contrastive learning and formulate a large-vocabulary segmentation model that can be prompted by medical terminologies in text form; (iii) We train SAT-Nano (110M parameters) and SAT-Pro (447M parameters). SAT-Pro achieves comparable performance to 72 nnU-Nets – the strongest specialist models trained on each dataset (over 2.2B parameters combined) – over 497 categories. Compared with the interactive approach MedSAM, SAT-Pro consistently outperforms across all 7 human body regions with +7.1% average Dice Similarity Coefficient (DSC) improvement, while showing enhanced scalability and robustness. On 2 external (cross-center) datasets, SAT-Pro achieves higher performance than all baselines (+3.7% average DSC), demonstrating superior generalization ability.
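The knowledge-injection objective reads like a CLIP-style contrastive alignment between terminology embeddings and their knowledge-tree counterparts; a minimal symmetric InfoNCE sketch under that assumption (temperature and pairing scheme are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, concept_emb, temperature=0.07):
    """Symmetric InfoNCE aligning terminology embeddings with their paired
    knowledge-tree concept embeddings (in-batch pairs are the positives)."""
    t = F.normalize(text_emb, dim=-1)
    c = F.normalize(concept_emb, dim=-1)
    logits = t @ c.T / temperature
    targets = torch.arange(len(t), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```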

[361] A Mixture of Experts (MoE) model to improve AI-based computational pathology prediction performance under variable levels of histopathology image blur

Yujie Xiang, Bojing Liu, Mattias Rantalainen

Main category: eess.IV

TL;DR: The paper explores the impact of image blur on AI models for histopathology WSI analysis and proposes a Mixture of Experts (MoE) strategy to improve performance under varying blur conditions.

DetailsMotivation: Unsharp or blurred areas in WSIs reduce AI model performance, prompting the need for robust solutions.

Method: A MoE strategy combines predictions from expert models trained on data with varying blur levels, tested on CNN- and Vision Transformer-based models.

Result: MoE models outperformed baselines, especially under moderate and mixed blur conditions, with higher AUC scores.

Conclusion: The MoE approach enhances AI model reliability in pathology under variable image quality, supporting broader clinical and research use.

Abstract: AI-based models for histopathology whole slide image (WSI) analysis are increasingly common, but unsharp or blurred areas within WSI can significantly reduce prediction performance. In this study, we investigated the effect of image blur on deep learning models and introduced a mixture of experts (MoE) strategy that combines predictions from multiple expert models trained on data with varying blur levels. Using H&E-stained WSIs from 2,093 breast cancer patients, we benchmarked performance on grade classification and IHC biomarker prediction with both CNN- (CNN_CLAM and MoE-CNN_CLAM) and Vision Transformer-based (UNI_CLAM and MoE-UNI_CLAM) models. Our results show that baseline models’ performance consistently decreased with increasing blur, but expert models trained on blurred tiles and especially our proposed MoE approach substantially improved performance, and outperformed baseline models in a range of simulated scenarios. MoE-CNN_CLAM outperformed the baseline CNN_CLAM under moderate (AUC: 0.868 vs. 0.702) and mixed blur conditions (AUC: 0.890 vs. 0.875). MoE-UNI_CLAM outperformed the baseline UNI_CLAM model in both moderate (AUC: 0.950 vs. 0.928) and mixed blur conditions (AUC: 0.944 vs. 0.931). This MoE method has the potential to enhance the reliability of AI-based pathology models under variable image quality, supporting broader application in both research and clinical settings.
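Schematically, the MoE combines per-expert predictions with gating weights; one plausible minimal form, with a gate driven by (for example) an estimated blur level, is sketched below. The paper's exact gating and aggregation may differ.

```python
import torch

def moe_predict(experts, gate, wsi_features):
    """Blur-aware mixture of experts (schematic).

    experts: list of models, each trained on tiles at one blur level
    gate:    model mapping WSI features to softmax weights over experts,
             e.g. driven by an estimated blur level
    """
    weights = torch.softmax(gate(wsi_features), dim=-1)              # (B, E)
    preds = torch.stack([e(wsi_features) for e in experts], dim=-1)  # (B, C, E)
    return (preds * weights.unsqueeze(1)).sum(-1)                    # (B, C)
```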

[362] Revisiting Data Augmentation for Ultrasound Images

Adam Tupper, Christian Gagné

Main category: eess.IV

TL;DR: The paper evaluates the effectiveness of data augmentation techniques for ultrasound image analysis, introducing a standardized benchmark and showing that natural image augmentations often outperform ultrasound-specific ones.

DetailsMotivation: Limited understanding of data augmentation efficacy in medical imaging, especially ultrasound, despite its potential to improve deep learning model performance.

Method: Analyzed various augmentation techniques using a new benchmark of 14 ultrasound tasks covering classification and segmentation across 11 body regions.

Result: Common natural image augmentations were more effective than ultrasound-specific ones; TrivialAugment also performed well.

Conclusion: The study provides a structured approach for assessing augmentations, applicable beyond ultrasound, and highlights the value of natural image techniques in medical imaging.

Abstract: Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.
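For classification tasks, trying TrivialAugment on ultrasound images is close to a one-line change with torchvision (assuming `torchvision.transforms.TrivialAugmentWide`); segmentation requires paired image-mask transforms instead, which this snippet does not cover.

```python
from torchvision import transforms

# TrivialAugment as used for natural images, applied to ultrasound frames.
# TrivialAugmentWide picks one random op at a random strength per image.
train_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # some ops expect 3 channels
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
])
```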

[363] Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Xing Shen, Justin Szeto, Mingyang Li, Hengguan Huang, Tal Arbel

Main category: eess.IV

TL;DR: The paper investigates calibration biases and demographic unfairness in MLLMs for medical image classification, introducing CALIN, an inference-time calibration method to mitigate biases and improve accuracy.

DetailsMotivation: To ensure safe deployment of MLLMs in clinical practice by addressing calibration errors and unfairness across demographic subgroups.

Method: CALIN, a bi-level calibration method, estimates and applies calibration matrices from population to subgroup levels during inference.

Result: CALIN improves prediction accuracy and ensures fair confidence calibration across three medical imaging datasets with minimal fairness-utility trade-off.

Conclusion: CALIN effectively addresses calibration biases and demographic unfairness in MLLMs, enhancing their reliability for medical image analysis.

Abstract: Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs’ predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN’s effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
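One schematic reading of the bi-level procedure: a population-level calibration matrix is applied to the predicted class probabilities first, followed by a subgroup-level matrix, then renormalization. How CALIN estimates these matrices is the substance of the paper and is not shown here.

```python
import numpy as np

def calibrate(probs, cal_pop, cal_sub=None):
    """Apply a population-level calibration matrix, then optionally a
    subgroup-level one, to a predicted class-probability vector and
    renormalize (a schematic reading of a bi-level calibration)."""
    p = cal_pop @ probs
    if cal_sub is not None:
        p = cal_sub @ p
    p = np.clip(p, 1e-12, None)
    return p / p.sum()

# e.g. probs = np.array([0.7, 0.3]); cal_pop = np.array([[0.9, 0.2], [0.1, 0.8]])
```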

[364] Inverse Synthetic Aperture Fourier Ptychography

Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler

Main category: eess.IV

TL;DR: A novel Fourier ptychography method, Inverse Synthetic Aperture Fourier Ptychography, uses target motion for measurement diversity, eliminating the need for changing illumination angles or camera positions. A learning-based method estimates k-space coordinates from dual-plane intensity measurements, enabling synthetic aperture imaging without knowing target rotation.

DetailsMotivation: Traditional Fourier ptychography methods introduce measurement diversity by altering illumination angles or camera positions, which adds cost and complexity. This work aims to simplify the process by leveraging target motion and a learning-based approach.

Method: The proposed method, Inverse Synthetic Aperture Fourier Ptychography, replaces illumination or camera adjustments with target motion. A learning-based technique estimates k-space coordinates from dual-plane intensity measurements, bypassing the need for known target rotation.

Result: The method is validated through simulations and a tabletop optical system, demonstrating successful synthetic aperture imaging without traditional diversity mechanisms.

Conclusion: The introduced approach simplifies Fourier ptychography by using target motion and a learning-based coordinate estimation method, reducing complexity while maintaining imaging capabilities.

Abstract: Fourier ptychography (FP) is a powerful light-based synthetic aperture imaging technique that allows one to reconstruct a high-resolution, wide field-of-view image by computationally integrating a diverse collection of low-resolution, far-field measurements. Typically, FP measurement diversity is introduced by changing the angle of the illumination or the position of the camera; either approach results in sampling different portions of the target’s spatial frequency content, but both approaches introduce substantial costs and complexity to the acquisition process. In this work, we introduce Inverse Synthetic Aperture Fourier Ptychography, a novel approach to FP that foregoes changing the illumination angle or camera position and instead generates measurement diversity through target motion. Critically, we also introduce a novel learning-based method for estimating k-space coordinates from dual plane intensity measurements, thereby enabling synthetic aperture imaging without knowing the rotation of the target. We experimentally validate our method in simulation and on a tabletop optical system.

Last updated: 2025-08-22