Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 89]
- cs.CV [Total: 157]
- cs.AI [Total: 84]
- cs.SD [Total: 8]
- cs.LG [Total: 106]
- cs.MA [Total: 4]
- cs.MM [Total: 5]
- eess.AS [Total: 1]
- eess.IV [Total: 11]
cs.CL
[1] Categorical Classification of Book Summaries Using Word Embedding Techniques
Kerem Keskin, Mümine Kaya Keleş
Main category: cs.CL
TL;DR: The study compares word embedding methods (One Hot Encoding, Word2Vec, TF-IDF) and machine learning algorithms (SVM, Naive Bayes, Logistic Regression) for classifying Turkish book summaries, finding TF-IDF and One-Hot Encoder most effective.
Details
Motivation: To evaluate and compare the effectiveness of various word embedding and machine learning techniques for text classification, specifically for Turkish texts.
Method: Used word embedding methods (One Hot Encoding, Word2Vec, TF-IDF) and machine learning algorithms (SVM, Naive Bayes, Logistic Regression) to classify book summaries. Pre-processing methods were also combined and evaluated.
Result: TF-IDF and One-Hot Encoder with SVM, Naive Bayes, and Logistic Regression performed best for Turkish text classification.
Conclusion: TF-IDF and One-Hot Encoder are the most effective word embedding techniques for Turkish text classification when paired with certain machine learning models.
Abstract: In this study, book summaries and categories taken from book sites were classified using word embedding methods, natural language processing techniques and machine learning algorithms. One hot encoding, Word2Vec and Term Frequency - Inverse Document Frequency (TF-IDF), which are frequently used word embedding methods, were applied and their success compared. A table of the pre-processing method combinations used is also included. The results showed that the Support Vector Machine, Naive Bayes and Logistic Regression models combined with the TF-IDF and One-Hot Encoder word embedding techniques gave the most successful results for Turkish texts.
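A minimal sketch of the comparison the paper describes, using scikit-learn; the dataset loader is a hypothetical stand-in for the Turkish book-summary data:

```python
# Sketch of the TF-IDF vs. one-hot comparison across three classifiers.
# `load_book_summaries` is a hypothetical loader for the paper's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

summaries, labels = load_book_summaries()  # hypothetical: texts, category labels

X_train, X_test, y_train, y_test = train_test_split(
    summaries, labels, test_size=0.2, stratify=labels)

vectorizer = TfidfVectorizer()  # swap in CountVectorizer(binary=True) for one-hot
X_tr = vectorizer.fit_transform(X_train)
X_te = vectorizer.transform(X_test)

for name, clf in [("SVM", LinearSVC()),
                  ("Naive Bayes", MultinomialNB()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_train)
    print(name, f1_score(y_test, clf.predict(X_te), average="macro"))
```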
[2] Dialogic Social Learning for Artificial Agents: Enhancing LLM Ontology Acquisition through Mixed-Initiative Educational Interactions
Sabrina Patania, Luca Annese, Cansu Koyuturk, Azzurra Ruggeri, Dimitri Ognibene
Main category: cs.CL
TL;DR: The paper explores socially mediated learning for LLMs, introducing an ‘AI Social Gym’ where AI learners interact with teacher agents, showing improved knowledge acquisition through dialogic methods.
Details
Motivation: LLMs struggle with online knowledge integration; traditional training lacks efficiency. Inspired by Vygotsky's theory, the study aims to enhance learning via social interaction.
Method: An ‘AI Social Gym’ environment is created for dyadic pedagogical dialogues between AI learners and teachers, testing various pedagogical strategies.
Result: Dialogic methods, especially mixed-direction interactions, outperform unidirectional instruction and direct knowledge access in enhancing LLM knowledge acquisition.
Conclusion: Integrating pedagogical insights into AI training improves post-training knowledge acquisition, offering a complementary approach to existing methods.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in processing extensive offline datasets. However, they often face challenges in acquiring and integrating complex knowledge online. Traditional AI training paradigms, predominantly based on supervised learning or reinforcement learning, mirror a ‘Piagetian’ model of independent exploration. These approaches typically rely on large datasets and sparse feedback signals, limiting the models’ ability to learn efficiently from interactions. Drawing inspiration from Vygotsky’s sociocultural theory, this study explores the potential of socially mediated learning paradigms to address these limitations. We introduce a dynamic environment, termed the ‘AI Social Gym’, where an AI learner agent engages in dyadic pedagogical dialogues with knowledgeable AI teacher agents. These interactions emphasize external, structured dialogue as a core mechanism for knowledge acquisition, contrasting with methods that depend solely on internal inference or pattern recognition. Our investigation focuses on how different pedagogical strategies impact the AI learning process in the context of ontology acquisition. Empirical results indicate that such dialogic approaches, particularly those involving mixed-direction interactions combining top-down explanations with learner-initiated questioning, significantly enhance the LLM’s ability to acquire and apply new knowledge, outperforming both unidirectional instructional methods and direct access to structured knowledge in formats typically present in training datasets. These findings suggest that integrating pedagogical and psychological insights into AI and robot training can substantially improve post-training knowledge acquisition and response quality. This approach offers a complementary pathway to existing strategies like prompt engineering.
[3] Product vs. Process: Exploring EFL Students’ Editing of AI-Generated Text for Expository Writing
David James Woo, Yangyang Yu, Kai Guo, Yilin Huang, April Ka Yeng Fung
Main category: cs.CL
TL;DR: The study examines how EFL secondary students edit AI-generated text in expository writing, revealing minimal impact of editing on quality but a positive effect of AI-generated words.
Details
Motivation: To understand the impact of AI-generated text on EFL students' writing process and compositions, as its use grows but effects are understudied.
Method: A convergent design analyzed screen recordings and compositions of 39 Hong Kong students, using qualitative coding, descriptive statistics, temporal sequence analysis, human-rated scoring, and multiple linear regression.
Result: Two editing patterns emerged: refining introductions or quick body edits. AI-generated words positively predicted scores, but editing efforts had little impact on quality.
Conclusion: AI supports but doesn’t replace writing skills; genre-specific instruction and process-focused writing are crucial before AI integration, alongside assessments valuing both process and product.
Abstract: Text generated by artificial intelligence (AI) chatbots is increasingly used in English as a foreign language (EFL) writing contexts, yet its impact on students’ expository writing process and compositions remains understudied. This research examines how EFL secondary students edit AI-generated text, exploring editing behaviors in their expository writing process and compositions, and the effect of these behaviors on human-rated scores for content, organization, language, and overall quality. Participants were 39 Hong Kong secondary students who wrote an expository composition with AI chatbots in a workshop. A convergent design was employed to analyze their screen recordings and compositions to examine students’ editing behaviors and writing qualities. Analytical methods included qualitative coding, descriptive statistics, temporal sequence analysis, human-rated scoring, and multiple linear regression analysis. We analyzed over 260 edits per dataset, and identified two editing patterns: one where students refined introductory units repeatedly before progressing, and another where they quickly shifted to extensive edits in body units (e.g., topic and supporting sentences). MLR analyses revealed that the number of AI-generated words positively predicted all score dimensions, while most editing variables showed minimal impact. These results suggest a disconnect between students’ significant editing effort and improved composition quality, indicating AI supports but does not replace writing skills. The findings highlight the importance of genre-specific instruction and process-focused writing before AI integration. Educators should also develop assessments valuing both process and product to encourage critical engagement with AI text.
[4] Which symbol grounding problem should we try to solve?
Vincent C. Müller
Main category: cs.CL
TL;DR: The paper critiques Floridi and Taddeo’s ‘zero semantic commitment’ condition for the grounding problem, argues it’s unfulfillable, and suggests rethinking the problem’s formulation, emphasizing goals and computing.
Details
Motivation: To challenge existing solutions to the grounding problem and propose a revised understanding focusing on computational agents' behavioral abilities.
Method: Critiques Floridi and Taddeo’s condition, examines Luc Steels’ alternative, and redefines the grounding problem based on computing principles.
Result: The ‘zero semantic commitment’ condition is unfulfillable; the grounding problem should focus on explaining meaning in computational agents.
Conclusion: The grounding problem should be reframed to address the behavioral ability and function of meaning in artificial computational agents.
Abstract: Floridi and Taddeo propose a condition of “zero semantic commitment” for solutions to the grounding problem, and a solution to it. I argue briefly that their condition cannot be fulfilled, not even by their own solution. After a look at Luc Steels’ very different competing suggestion, I suggest that we need to re-think what the problem is and what role the ‘goals’ in a system play in formulating the problem. On the basis of a proper understanding of computing, I come to the conclusion that the only sensible grounding problem is how we can explain and re-produce the behavioral ability and function of meaning in artificial computational agents.
[5] ChatGPT Reads Your Tone and Responds Accordingly – Until It Does Not – Emotional Framing Induces Bias in LLM Outputs
Franck Bardol
Main category: cs.CL
TL;DR: GPT-4 adjusts responses based on emotional tone of prompts, showing a ‘rebound’ bias toward neutrality or positivity, especially on sensitive topics.
Details
Motivation: To explore how emotional framing in prompts influences GPT-4's responses and identify biases.
Method: Systematically varied emotional tone in 156 prompts, analyzed responses, and introduced metrics like ’tone floor’ and transition matrices.
Result: GPT-4 is three times less likely to respond negatively to negative prompts, with stronger effects on sensitive topics.
Conclusion: Emotional framing introduces biases, impacting AI alignment and trust; tools and data are shared for further research.
Abstract: Large Language Models like GPT-4 adjust their responses not only based on the question asked, but also on how it is emotionally phrased. We systematically vary the emotional tone of 156 prompts - spanning controversial and everyday topics - and analyze how it affects model responses. Our findings show that GPT-4 is three times less likely to respond negatively to a negatively framed question than to a neutral one. This suggests a “rebound” bias where the model overcorrects, often shifting toward neutrality or positivity. On sensitive topics (e.g., justice or politics), this effect is even more pronounced: tone-based variation is suppressed, suggesting an alignment override. We introduce concepts like the “tone floor” - a lower bound in response negativity - and use tone-valence transition matrices to quantify behavior. Visualizations based on 1536-dimensional embeddings confirm semantic drift based on tone. Our work highlights an underexplored class of biases driven by emotional framing in prompts, with implications for AI alignment and trust. Code and data are available at: https://github.com/bardolfranck/llm-responses-viewer
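A short sketch of what a tone-valence transition matrix could look like; the three-level label set and the example pairs are assumptions for illustration:

```python
# Tone-valence transition matrix: rows are prompt tones, columns are response
# valences, entries are empirical probabilities. Labels are hypothetical.
import numpy as np

TONES = ["negative", "neutral", "positive"]
idx = {t: i for i, t in enumerate(TONES)}

# (prompt_tone, response_valence) pairs, e.g. from human or classifier labels
pairs = [("negative", "neutral"), ("negative", "positive"),
         ("neutral", "neutral"), ("positive", "positive")]

counts = np.zeros((3, 3))
for prompt_tone, response_valence in pairs:
    counts[idx[prompt_tone], idx[response_valence]] += 1

transition = counts / counts.sum(axis=1, keepdims=True)  # row-normalize
print(transition)  # row 0 gives P(response valence | prompt tone = negative)
```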
[6] Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing
Aly M. Kassem, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi
Main category: cs.CL
TL;DR: MNEME is a framework for detecting unintended side effects in LLMs after fine-tuning or unlearning, using sparse model diffing to identify behavioral shifts without fine-tuning data.
Details
Motivation: Existing methods lack general approaches to detect unpredictable side effects of fine-tuning or unlearning in LLMs.
Method: MNEME compares base and fine-tuned models on task-agnostic data to isolate behavioral shifts, using sparse probing and diffing.
Result: MNEME achieves up to 95% accuracy in predicting side effects across five LLMs and three scenarios, with retraining partially reversing effects.
Conclusion: Sparse probing and diffing provide scalable tools for understanding and managing LLM behavior changes post-intervention.
Abstract: Large language models (LLMs) are frequently fine-tuned or unlearned to adapt to new tasks or eliminate undesirable behaviors. While existing evaluation methods assess performance after such interventions, there remains no general approach for detecting unintended side effects, such as unlearning biology content degrading performance on chemistry tasks, particularly when these effects are unpredictable or emergent. To address this issue, we introduce MNEME, Model diffiNg for Evaluating Mechanistic Effects, a lightweight framework for identifying these side effects using sparse model diffing. MNEME compares base and fine-tuned models on task-agnostic data (for example, The Pile, LMSYS-Chat-1M) without access to fine-tuning data to isolate behavioral shifts. Applied to five LLMs across three scenarios: WMDP knowledge unlearning, emergent misalignment, and benign fine-tuning, MNEME achieves up to 95 percent accuracy in predicting side effects, aligning with known benchmarks and requiring no custom heuristics. Furthermore, we show that retraining on high-activation samples can partially reverse these effects. Our results demonstrate that sparse probing and diffing offer a scalable and automated lens into fine-tuning-induced model changes, providing practical tools for understanding and managing LLM behavior.
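A heavily simplified illustration of model diffing in the spirit of MNEME, not the paper's exact probing pipeline: compare per-layer hidden activations of a base and a fine-tuned model on task-agnostic text (the fine-tuned checkpoint path is hypothetical):

```python
# Rank layers by how far activations drifted after fine-tuning/unlearning.
import torch
from transformers import AutoModel, AutoTokenizer

base = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
tuned = AutoModel.from_pretrained("path/to/finetuned-gpt2",  # hypothetical path
                                  output_hidden_states=True)
tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok("An arbitrary task-agnostic sentence.", return_tensors="pt")
with torch.no_grad():
    h_base = base(**inputs).hidden_states    # tuple of (1, seq, dim) tensors
    h_tuned = tuned(**inputs).hidden_states

for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    drift = 1 - torch.nn.functional.cosine_similarity(
        a.flatten(1), b.flatten(1)).mean()
    print(f"layer {layer}: activation drift {drift:.4f}")
```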
[7] TTS-1 Technical Report
Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu, Dmytro Semernia, Evgenii Shingarev, Vikram Sivaraja, Peter Skirko, Rinat Takhautdinov, Robert Villahermosa, Jean Wang
Main category: cs.CL
TL;DR: Inworld TTS-1 introduces two Transformer-based TTS models, TTS-1-Max (8.8B params) for high-quality applications and TTS-1 (1.6B params) for real-time use. Both achieve state-of-the-art performance via pre-training, fine-tuning, and RL-alignment, supporting 11 languages with emotional control.
Details
Motivation: To develop high-quality, efficient TTS models for diverse applications, from demanding tasks to real-time synthesis.
Method: Utilizes Transformer-based autoregressive models, scaling compute, and a sequential process of pre-training, fine-tuning, and RL-alignment of the SpeechLM component.
Result: State-of-the-art performance on benchmarks, 48 kHz speech generation, low latency, and support for 11 languages with emotional and non-verbal control.
Conclusion: Inworld TTS-1 models set new standards in TTS, offering exceptional quality and versatility, with open-sourced training code.
Abstract: We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker’s voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
[8] Multi-Amateur Contrastive Decoding for Text Generation
Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela
Main category: cs.CL
TL;DR: MACD generalizes Contrastive Decoding by using multiple amateur models to better address diverse language generation failures, improving fluency, coherence, and adaptability without extra training.
Details
Motivation: Contrastive Decoding (CD) relies on a single amateur model, limiting its ability to handle diverse generation issues like repetition or hallucination. MACD aims to overcome this by leveraging multiple amateurs.
Method: MACD employs an ensemble of amateur models, integrating contrastive signals via averaging and consensus penalization, and extends the plausibility constraint for multi-amateur settings.
Result: MACD outperforms conventional decoding and CD in fluency, coherence, diversity, and adaptability across domains like news and narrative.
Conclusion: MACD effectively generalizes CD, offering improved text generation quality and controllability without additional training.
Abstract: Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.
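A sketch of one decoding step of the averaging variant described above, assuming the plausibility mask from the original Contrastive Decoding paper; MACD's consensus penalization is omitted:

```python
# One step of multi-amateur contrastive decoding (averaging variant).
import torch
import torch.nn.functional as F

def macd_step(expert_logits, amateur_logits_list, alpha=0.1):
    """expert_logits: (vocab,); amateur_logits_list: list of (vocab,) tensors."""
    log_p_expert = F.log_softmax(expert_logits, dim=-1)
    # Plausibility constraint: keep tokens with p_expert >= alpha * max p_expert.
    plausible = log_p_expert >= torch.log(torch.tensor(alpha)) + log_p_expert.max()
    log_p_amateurs = torch.stack(
        [F.log_softmax(a, dim=-1) for a in amateur_logits_list]).mean(dim=0)
    scores = log_p_expert - log_p_amateurs   # contrast expert vs. mean amateur
    scores[~plausible] = float("-inf")       # rule out implausible tokens
    return scores.argmax()

next_token = macd_step(torch.randn(50257), [torch.randn(50257) for _ in range(3)])
```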
[9] A Deep Learning Automatic Speech Recognition Model for Shona Language
Leslie Wellington Sirora, Mainford Mutandavari
Main category: cs.CL
TL;DR: A deep learning-based ASR system for Shona, a low-resource language, was developed, achieving 74% accuracy by addressing data scarcity and tonal complexities with hybrid CNN-LSTM architecture and attention mechanisms.
Details
Motivation: To improve ASR accuracy for Shona, a low-resource language with tonal and grammatical complexities, overcoming limited training data and lack of labeled resources.
Method: Hybrid CNN-LSTM architecture, data augmentation, transfer learning, and attention mechanisms to handle tonal nuances and data scarcity.
Result: Achieved 29% WER, 12% PER, and 74% accuracy, outperforming traditional statistical models.
Conclusion: Deep learning enhances ASR for under-resourced languages like Shona, improving accessibility and communication for its speakers.
Abstract: This study presented the development of a deep learning-based Automatic Speech Recognition system for Shona, a low-resource language characterized by unique tonal and grammatical complexities. The research aimed to address the challenges posed by limited training data, lack of labelled data, and the intricate tonal nuances present in Shona speech, with the objective of achieving significant improvements in recognition accuracy compared to traditional statistical models. The research first explored the feasibility of using deep learning to develop an accurate ASR system for Shona. Second, it investigated the specific challenges involved in designing and implementing deep learning architectures for Shona speech recognition and proposed strategies to mitigate these challenges. Lastly, it compared the performance of the deep learning-based model with existing statistical models in terms of accuracy. The developed ASR system utilized a hybrid architecture consisting of a Convolutional Neural Network for acoustic modelling and a Long Short-Term Memory network for language modelling. To overcome the scarcity of data, data augmentation techniques and transfer learning were employed. Attention mechanisms were also incorporated to accommodate the tonal nature of Shona speech. The resulting ASR system achieved impressive results, with a Word Error Rate of 29%, Phoneme Error Rate of 12%, and an overall accuracy of 74%. These metrics indicated the potential of deep learning to enhance ASR accuracy for under-resourced languages like Shona. This study contributed to the advancement of ASR technology for under-resourced languages like Shona, ultimately fostering improved accessibility and communication for Shona speakers worldwide.
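An illustrative hybrid CNN-LSTM acoustic model of the kind described above (PyTorch); all dimensions are hypothetical, and the paper's attention mechanism and language model are omitted:

```python
# CNN front-end for local spectral features, bidirectional LSTM for context,
# per-frame token logits (e.g. for a CTC loss over phonemes).
import torch
import torch.nn as nn

class CnnLstmASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_tokens=60):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool frequency only
        )
        self.lstm = nn.LSTM(32 * (n_mels // 2), hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tokens)

    def forward(self, mel):                              # mel: (batch, time, n_mels)
        x = self.conv(mel.unsqueeze(1).transpose(2, 3))  # (batch, 32, n_mels/2, time)
        x = x.flatten(1, 2).transpose(1, 2)              # (batch, time, features)
        x, _ = self.lstm(x)
        return self.out(x)                               # per-frame token logits

logits = CnnLstmASR()(torch.randn(2, 200, 80))           # -> (2, 200, 60)
```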
[10] QU-NLP at CheckThat! 2025: Multilingual Subjectivity in News Articles Detection using Feature-Augmented Transformer Models with Sequential Cross-Lingual Fine-Tuning
Mohammad AL-Smadi
Main category: cs.CL
TL;DR: The paper proposes a feature-augmented transformer architecture for subjectivity detection in news sentences, combining pre-trained language models with statistical and linguistic features. It achieves competitive results across multiple languages, including zero-shot settings.
Details
Motivation: To address the challenge of distinguishing subjective from objective views in news sentences, leveraging both contextual embeddings and additional features for improved performance.
Method: A feature-augmented transformer architecture using pre-trained models (AraELECTRA for Arabic, DeBERTa V3 for others) combined with TF-IDF and POS tags, evaluated in monolingual, multilingual, and zero-shot settings.
Result: Competitive performance across languages, with top rankings in English, German, Arabic, and Romanian. Ablation analysis highlights the importance of TF-IDF features and cross-lingual transfer.
Conclusion: The approach effectively combines contextual and statistical features for subjectivity detection, with cross-lingual transfer and feature integration playing key roles in performance.
Abstract: This paper presents our approach to the CheckThat! 2025 Task 1 on subjectivity detection, where systems are challenged to distinguish whether a sentence from a news article expresses the subjective view of the author or presents an objective view on the covered topic. We propose a feature-augmented transformer architecture that combines contextual embeddings from pre-trained language models with statistical and linguistic features. Our system leveraged pre-trained transformers with additional lexical features: for Arabic we used AraELECTRA augmented with part-of-speech (POS) tags and TF-IDF features, while for the other languages we fine-tuned a cross-lingual DeBERTa V3 model combined with TF-IDF features through a gating mechanism. We evaluated our system in monolingual, multilingual, and zero-shot settings across multiple languages including English, Arabic, German, Italian, and several unseen languages. The results demonstrate the effectiveness of our approach, achieving competitive performance across different languages with notable success in the monolingual setting for English (rank 1st with macro-F1=0.8052), German (rank 3rd with macro-F1=0.8013), Arabic (rank 4th with macro-F1=0.5771), and Romanian (rank 1st with macro-F1=0.8126) in the zero-shot setting. We also conducted an ablation analysis that demonstrated the importance of combining TF-IDF features with the gating mechanism and the cross-lingual transfer for subjectivity detection. Furthermore, our analysis reveals the model’s sensitivity to both the order of cross-lingual fine-tuning and the linguistic proximity of the training languages.
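A hedged sketch of the kind of gating fusion described above: a learned gate decides how much of the projected TF-IDF vector to mix into the contextual [CLS] embedding. Dimensions and the exact gate form are assumptions:

```python
# Gated fusion of a transformer [CLS] embedding with TF-IDF features.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, ctx_dim=768, tfidf_dim=5000, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(tfidf_dim, ctx_dim)     # project TF-IDF to same space
        self.gate = nn.Linear(2 * ctx_dim, ctx_dim)   # gate computed from both views
        self.classifier = nn.Linear(ctx_dim, n_classes)

    def forward(self, cls_embedding, tfidf):
        lexical = self.proj(tfidf)
        g = torch.sigmoid(self.gate(torch.cat([cls_embedding, lexical], dim=-1)))
        fused = g * cls_embedding + (1 - g) * lexical  # convex combination
        return self.classifier(fused)

logits = GatedFusion()(torch.randn(4, 768), torch.randn(4, 5000))
```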
[11] Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting
Tuan Vu Ho, Hiroaki Kokubo, Masaaki Yamamoto, Yohei Kawaguchi
Main category: cs.CL
TL;DR: Proposes Token Map Drafting, a model-free speculative decoding method for ASR, improving speed on CPU devices without accuracy loss.
Details
Motivation: Autoregressive decoding in transformer-based ASR systems like Whisper is computationally expensive, limiting deployment on resource-constrained devices.
Method: Uses a precomputed n-gram token map instead of a draft model for speculative decoding, reducing overhead.
Result: Achieves speed-ups of 1.27× and 1.37× on datasets without accuracy loss, outperforming Distill-spec by 10% on CPU.
Conclusion: Token Map Drafting is effective for on-device ASR, offering faster inference in structured domains.
Abstract: End-to-end automatic speech recognition (ASR) systems based on transformer architectures, such as Whisper, offer high transcription accuracy and robustness. However, their autoregressive decoding is computationally expensive, hence limiting deployment on CPU-based and resource-constrained devices. Speculative decoding (SD) mitigates this issue by using a smaller draft model to propose candidate tokens, which are then verified by the main model. However, this approach is impractical for devices lacking hardware accelerators like GPUs. To address this, we propose Token Map Drafting, a model-free SD technique that eliminates the need for a separate draft model. Instead, we leverage a precomputed n-gram token map derived from domain-specific training data, enabling efficient speculative decoding with minimal overhead. Our method significantly accelerates ASR inference in structured, low-perplexity domains without sacrificing transcription accuracy. Experimental results demonstrate decoding speed-ups of 1.27× on the CI-AVSR dataset and 1.37× on our internal dataset without degrading recognition accuracy. Additionally, our approach achieves a 10% absolute improvement in decoding speed over the Distill-spec baseline running on CPU, highlighting its effectiveness for on-device ASR applications.
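A sketch of what a precomputed n-gram token map could look like: given the last n-1 tokens, propose the most frequent continuation seen in domain text; verification by the main model is left out:

```python
# Build an n-gram token map from domain data and draft candidate tokens.
from collections import Counter, defaultdict

def build_token_map(corpus_token_ids, n=3):
    counts = defaultdict(Counter)
    for seq in corpus_token_ids:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n - 1])][seq[i + n - 1]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

token_map = build_token_map([[5, 7, 9, 7, 9, 11]], n=3)

def draft(prefix, token_map, max_draft=4, n=3):
    out = list(prefix)
    for _ in range(max_draft):
        nxt = token_map.get(tuple(out[-(n - 1):]))
        if nxt is None:                 # unseen context: stop drafting
            break
        out.append(nxt)
    return out[len(prefix):]            # candidates for the main model to verify

print(draft([5, 7], token_map))         # drafted continuation, here [9, 7, 9, 7]
```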
[12] Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting
Chloe Ho, Ishneet Sukhvinder Singh, Diya Sharma, Tanvi Reddy Anumandla, Michael Lu, Vasu Sharma, Kevin Zhu
Main category: cs.CL
TL;DR: The paper explores how LLM-based rewriting of ads improves their ranking and inclusion in retrieval systems without altering the retrieval model, using a supervised fine-tuning framework and custom metrics.
Details
Motivation: To investigate the impact of ad phrasing on visibility in LLM-integrated retrieval systems, as this area is underexplored.
Method: Introduces a supervised fine-tuning framework with a custom loss balancing relevance and fidelity, evaluated via DeltaMRR@K and DeltaDIR@K metrics.
Result: PPO-trained models outperform prompt engineering and supervised fine-tuning, achieving significant improvements in ad visibility metrics.
Conclusion: Ad phrasing and reinforcement learning are crucial for optimizing ad visibility in LLM-integrated retrieval systems.
Abstract: Search algorithms and user query relevance have given LLMs the ability to return relevant information, but the effect of content phrasing on ad visibility remains underexplored. We investigate how LLM-based rewriting of advertisements can improve their ranking in retrieval systems and inclusion in generated LLM responses, without modifying the retrieval model itself. We introduce a supervised fine-tuning framework with a custom loss balancing semantic relevance and content fidelity. To evaluate effectiveness, we propose two metrics: DeltaMRR@K (ranking improvement) and DeltaDIR@K (inclusion frequency improvement). Our approach presents a scalable method to optimize ad phrasing, enhancing visibility in retrieval-based LLM workflows. Experiments across both instruction-based and few-shot prompting demonstrate that PPO-trained models outperform both prompt engineering and supervised fine-tuning in most cases, achieving up to a 2.79 DeltaDIR@5 and 0.0073 DeltaMRR@5 in instruction-based prompting. These results highlight the importance of how an ad is written before retrieval, as well as of prompt format and reinforcement learning, in effective ad rewriting for LLM-integrated retrieval systems.
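One plausible reading of DeltaMRR@K (the paper's exact formula is not given here): mean reciprocal rank of the target ad at cutoff K, after rewriting minus before:

```python
# DeltaMRR@K as a difference of mean reciprocal ranks; ranks are hypothetical.
def mrr_at_k(ranks, k=5):
    """ranks: 1-based rank of the target ad per query (None if not retrieved)."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)

ranks_before = [3, None, 8, 2]   # rank of each ad pre-rewrite
ranks_after = [1, 4, 6, 2]       # rank after LLM-based rewriting

delta_mrr_at_5 = mrr_at_k(ranks_after) - mrr_at_k(ranks_before)
print(f"DeltaMRR@5 = {delta_mrr_at_5:.4f}")
```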
[13] iLSU-T: an Open Dataset for Uruguayan Sign Language Translation
Ariel E. Stassi, Yanina Boria, J. Matías Di Martino, Gregory Randall
Main category: cs.CL
TL;DR: The paper introduces iLSU-T, an open dataset of Uruguayan Sign Language videos with audio and text transcriptions, and evaluates its usefulness with state-of-the-art translation algorithms.
Details
Motivation: To address the lack of localized datasets for sign language translation, which is crucial for developing tools to improve accessibility and inclusion.
Method: The work presents iLSU-T, a multimodal dataset, and tests it with three translation algorithms to establish a baseline.
Result: The experiments demonstrate the dataset’s utility and highlight the need for localized sign language data.
Conclusion: Localized datasets like iLSU T are essential for advancing sign language translation and accessibility tools.
Abstract: Automatic sign language translation has gained particular interest in the computer vision and computational linguistics communities in recent years. Given the particularities of each country's sign language, machine translation requires local data to develop new techniques and adapt existing ones. This work presents iLSU-T, an open dataset of interpreted Uruguayan Sign Language RGB videos with audio and text transcriptions. This type of multimodal and curated data is paramount for developing novel approaches to understand or generate tools for sign language processing. iLSU-T comprises more than 185 hours of interpreted sign language videos from public TV broadcasting. It covers diverse topics and includes the participation of 18 professional sign language interpreters. A series of experiments using three state-of-the-art translation algorithms is presented. The aim is to establish a baseline for this dataset and evaluate its usefulness and the proposed pipeline for data processing. The experiments highlight the need for more localized datasets for sign language translation and understanding, which are critical for developing novel tools to improve accessibility and inclusion of all individuals. Our data and code can be accessed.
[14] Creation of a Numerical Scoring System to Objectively Measure and Compare the Level of Rhetoric in Arabic Texts: A Feasibility Study, and A Working Prototype
Mandar Marathe
Main category: cs.CL
TL;DR: The study aims to objectively measure the density of Arabic rhetorical devices in texts, addressing the lack of objective methods to assess rhetoric usage across genres and authors.
Details
Motivation: There is no objective way to determine the use of Arabic rhetoric in texts, making comparisons across genres or epochs impossible.
Method: Compiled 84 literary devices, created identification systems, and developed tools (electronic and analogue) to calculate rhetorical density based on morpheme count.
Result: Developed a working tool to accurately report the density of Arabic rhetoric in any text or speech.
Conclusion: The tool provides an objective measure of Arabic rhetoric usage, enabling comparisons and analysis across texts and genres.
Abstract: Arabic Rhetoric is the field of Arabic linguistics which governs the art and science of conveying a message with greater beauty, impact and persuasiveness. The field is as ancient as the Arabic language itself and is found extensively in classical and contemporary Arabic poetry, free verse and prose. In practical terms, it is the intelligent use of word order, figurative speech and linguistic embellishments to enhance message delivery. Despite the volumes that have been written about it and the high status accorded to it, there is no way to objectively know whether a speaker or writer has used Arabic rhetoric in a given text, to what extent, and why. There is no objective way to compare the use of Arabic rhetoric across genres, authors or epochs. It is impossible to know which of pre-Islamic poetry, Andalusian Arabic poetry, or modern literary genres are richer in Arabic rhetoric. The aim of the current study was to devise a way to measure the density of the literary devices which constitute Arabic rhetoric in a given text, as a proxy marker for Arabic rhetoric itself. A comprehensive list of 84 of the commonest literary devices and their definitions was compiled. A system of identifying literary devices in texts was constructed. A method of calculating the density of literary devices based on the morpheme count of the text was utilised. Four electronic tools and an analogue tool were created to support the calculation of an Arabic text’s rhetorical literary device density, including a website and online calculator. Additionally, a technique of reporting the distribution of literary devices used across the three sub-domains of Arabic rhetoric was created. The output of this project is a working tool which can accurately report the density of Arabic rhetoric in any Arabic text or speech.
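A minimal sketch of the density calculation described above: literary-device occurrences per morpheme of text. The counts are hypothetical; identifying devices and segmenting Arabic morphemes are the hard parts the paper's tooling addresses:

```python
# Rhetorical density = total device occurrences / morpheme count of the text.
def rhetorical_density(device_counts, morpheme_count):
    return sum(device_counts.values()) / morpheme_count

devices = {"jinas": 4, "tibaq": 2, "isti'ara": 3}       # device -> occurrences
print(rhetorical_density(devices, morpheme_count=520))  # devices per morpheme
```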
[15] Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams
Rob Manson
Main category: cs.CL
TL;DR: Curved Inference tracks how LLM residual stream trajectories bend with semantic shifts, revealing model behavior across domains.
Details
Motivation: To understand how LLMs internally respond to shifts in semantic concerns and diagnose alignment and abstraction dynamics.
Method: Analyzed Gemma3-1b and LLaMA3.2-3b using curvature and salience metrics under a pullback semantic metric across 20 prompts.
Result: LLaMA showed consistent scaling in curvature and salience with concern intensity; Gemma responded but with weaker differentiation.
Conclusion: Curved Inference provides a principled method to study LLM geometry, offering insights into semantic abstraction and alignment.
Abstract: We propose Curved Inference - a geometric interpretability framework that tracks how the residual stream trajectory of a large language model bends in response to shifts in semantic concern. Across 20 matched prompts spanning emotional, moral, perspective, logical, identity, environmental, and nonsense domains, we analyse Gemma3-1b and LLaMA3.2-3b using five native-space metrics, with a primary focus on curvature (κ_i) and salience (S(t)). These metrics are computed under a pullback semantic metric derived from the unembedding matrix, ensuring that all measurements reflect token-aligned geometry rather than raw coordinate structure. We find that concern-shifted prompts reliably alter internal activation trajectories in both models - with LLaMA exhibiting consistent, statistically significant scaling in both curvature and salience as concern intensity increases. Gemma also responds to concern but shows weaker differentiation between moderate and strong variants. Our results support a two-layer view of LLM geometry - a latent conceptual structure encoded in the embedding space, and a contextual trajectory shaped by prompt-specific inference. Curved Inference reveals how models navigate, reorient, or reinforce semantic meaning over depth, offering a principled method for diagnosing alignment, abstraction, and emergent inference dynamics. These findings offer fresh insight into semantic abstraction and model alignment through the lens of Curved Inference.
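A sketch of trajectory metrics in this spirit: discrete curvature (turning angle between successive steps) and salience (step length) over a layer-by-layer residual-stream trajectory. These discrete definitions are assumptions; the paper computes its metrics under a pullback semantic metric, omitted here:

```python
# Discrete curvature and salience of a residual-stream trajectory.
import numpy as np

def trajectory_metrics(hidden_states):
    """hidden_states: (layers, dim) residual-stream vectors for one token."""
    steps = np.diff(hidden_states, axis=0)            # layer-to-layer deltas
    salience = np.linalg.norm(steps, axis=1)          # how far each step moves
    cos = np.sum(steps[:-1] * steps[1:], axis=1) / (
        salience[:-1] * salience[1:] + 1e-9)
    curvature = np.arccos(np.clip(cos, -1.0, 1.0))    # bend between steps
    return curvature, salience

curv, sal = trajectory_metrics(np.random.randn(12, 256))
```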
[16] A Survey of Classification Tasks and Approaches for Legal Contracts
Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik
Main category: cs.CL
TL;DR: The paper surveys automatic Legal Contract Classification (LCC), addressing challenges, tasks, datasets, and methodologies, while suggesting future research directions.
Details
Motivation: Manual contract reviews are inefficient and error-prone, necessitating automation for improved speed, accuracy, and accessibility.
Method: The survey identifies seven LCC tasks, reviews fourteen datasets, and categorizes methodologies into Traditional ML, Deep Learning, and Transformer-based approaches.
Result: It highlights best-performing results and discusses evaluation techniques, providing a comprehensive overview of current methods and limitations.
Conclusion: The survey aims to guide future research and support legal NLP practitioners in enhancing legal processes and accessibility.
Abstract: Given the large size and volume of contracts and their inherent complexity, manual reviews become inefficient and prone to errors, creating a clear need for automation. Automatic Legal Contract Classification (LCC) revolutionizes the way legal contracts are analyzed, offering substantial improvements in speed, accuracy, and accessibility. This survey delves into the challenges of automatic LCC and a detailed examination of key tasks, datasets, and methodologies. We identify seven classification tasks within LCC, and review fourteen datasets related to English-language contracts, including public, proprietary, and non-public sources. We also introduce a methodology taxonomy for LCC, categorized into Traditional Machine Learning, Deep Learning, and Transformer-based approaches. Additionally, the survey discusses evaluation techniques and highlights the best-performing results from the reviewed studies. By providing a thorough overview of current methods and their limitations, this survey suggests future research directions to improve the efficiency, accuracy, and scalability of LCC. As the first comprehensive survey on LCC, it aims to support legal NLP researchers and practitioners in improving legal processes, making legal information more accessible, and promoting a more informed and equitable society.
[17] SemRAG: Semantic Knowledge-Augmented RAG for Improved Question-Answering
Kezhen Zhong, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
Main category: cs.CL
TL;DR: SemRAG is an enhanced RAG framework that integrates domain-specific knowledge using semantic chunking and knowledge graphs, avoiding costly fine-tuning.
Details
Motivation: Existing methods for integrating domain-specific knowledge into LLMs are computationally expensive and prone to overfitting, limiting scalability.
Method: SemRAG uses semantic chunking based on cosine similarity of sentence embeddings and structures retrieved information into knowledge graphs.
Result: Experiments show SemRAG improves retrieval relevance and correctness, outperforming traditional RAG methods.
Conclusion: SemRAG offers a scalable, efficient solution for domain-specific LLM pipelines, aligning with sustainability goals.
Abstract: This paper introduces SemRAG, an enhanced Retrieval Augmented Generation (RAG) framework that efficiently integrates domain-specific knowledge using semantic chunking and knowledge graphs without extensive fine-tuning. Integrating domain-specific knowledge into large language models (LLMs) is crucial for improving their performance in specialized tasks. Yet, existing adaptations are computationally expensive, prone to overfitting and limit scalability. To address these challenges, SemRAG employs a semantic chunking algorithm that segments documents based on the cosine similarity from sentence embeddings, preserving semantic coherence while reducing computational overhead. Additionally, by structuring retrieved information into knowledge graphs, SemRAG captures relationships between entities, improving retrieval accuracy and contextual understanding. Experimental results on MultiHop RAG and Wikipedia datasets demonstrate that SemRAG significantly enhances the relevance and correctness of information retrieved from the Knowledge Graph, outperforming traditional RAG methods. Furthermore, we investigate the optimization of buffer sizes for different data corpora, as optimizing buffer sizes tailored to specific datasets can further improve retrieval performance, and the integration of knowledge graphs strengthens entity relationships for better contextual comprehension. The primary advantage of SemRAG is its ability to create an efficient, accurate domain-specific LLM pipeline while avoiding resource-intensive fine-tuning. This makes it a practical and scalable approach aligned with sustainability goals, offering a viable solution for AI applications in domain-specific fields.
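A sketch of semantic chunking as described above: split a document where cosine similarity between adjacent sentence embeddings drops below a threshold. The threshold value and embedding model are assumptions:

```python
# Threshold-based semantic chunking over adjacent sentence embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunks(sentences, threshold=0.6):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(emb[i - 1:i], emb[i:i + 1])[0, 0]
        if sim < threshold:                 # semantic break: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```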
[18] InsurTech innovation using natural language processing
Panyi Dong, Zhiyu Quan
Main category: cs.CL
TL;DR: The paper explores NLP’s role in transforming unstructured text into structured data for actuarial analysis in insurance, using real-world InsurTech data to enhance pricing and risk assessment.
Details
Motivation: Traditional insurers seek competitive advantages by integrating alternative data and advanced tech like NLP to modernize operations.
Method: The study applies NLP techniques to unstructured text from InsurTech data, demonstrating practical use cases in commercial insurance.
Result: NLP-derived insights refine pricing factors and introduce new risk assessment perspectives, proving its foundational role in insurance analytics.
Conclusion: NLP is essential for modern, data-driven insurance, not just a supplementary tool.
Abstract: With the rapid rise of InsurTech, traditional insurance companies are increasingly exploring alternative data sources and advanced technologies to sustain their competitive edge. This paper provides both a conceptual overview and practical case studies of natural language processing (NLP) and its emerging applications within insurance operations with a focus on transforming raw, unstructured text into structured data suitable for actuarial analysis and decision-making. Leveraging real-world alternative data provided by an InsurTech industry partner that enriches traditional insurance data sources, we apply various NLP techniques to demonstrate practical use cases in the commercial insurance context. These enriched, text-derived insights not only add to and refine traditional rating factors for commercial insurance pricing but also offer novel perspectives for assessing underlying risk by introducing novel industry classifications. Through these demonstrations, we show that NLP is not merely a supplementary tool but a foundational element for modern, data-driven insurance analytics.
[19] TRIDENT: Benchmarking LLM Safety in Finance, Medicine, and Law
Zheng Hui, Yijiang River Dong, Ehsan Shareghi, Nigel Collier
Main category: cs.CL
TL;DR: The paper introduces Trident-Bench, a benchmark for evaluating domain-specific safety of LLMs in law, finance, and medicine, revealing gaps in ethical compliance.
Details
Motivation: To address the lack of systematic evaluation of domain-specific safety risks in LLMs deployed in high-risk fields like law, finance, and medicine.
Method: Defined domain-specific safety principles based on professional ethics codes and introduced Trident-Bench to evaluate 19 LLMs.
Result: Generalist models like GPT and Gemini meet basic safety expectations, while domain-specialized models struggle with ethical nuances.
Conclusion: Trident-Bench is a pioneering resource for studying LLM safety in regulated fields, highlighting the need for finer-grained safety improvements.
Abstract: As large language models (LLMs) are increasingly deployed in high-risk domains such as law, finance, and medicine, systematically evaluating their domain-specific safety and compliance becomes critical. While prior work has largely focused on improving LLM performance in these domains, it has often neglected the evaluation of domain-specific safety risks. To bridge this gap, we first define domain-specific safety principles for LLMs based on the AMA Principles of Medical Ethics, the ABA Model Rules of Professional Conduct, and the CFA Institute Code of Ethics. Building on this foundation, we introduce Trident-Bench, a benchmark specifically targeting LLM safety in the legal, financial, and medical domains. We evaluated 19 general-purpose and domain-specialized models on Trident-Bench and show that it effectively reveals key safety gaps – strong generalist models (e.g., GPT, Gemini) can meet basic expectations, whereas domain-specialized models often struggle with subtle ethical nuances. This highlights an urgent need for finer-grained domain-specific safety improvements. By introducing Trident-Bench, our work provides one of the first systematic resources for studying LLM safety in law and finance, and lays the groundwork for future research aimed at reducing the safety risks of deploying LLMs in professionally regulated fields. Code and benchmark will be released at: https://github.com/zackhuiiiii/TRIDENT
[20] Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question
Rafael Rosales, Santiago Miret
Main category: cs.CL
TL;DR: Question interpretation diversity outperforms model diversity in improving ensemble accuracy for binary question answering with LLMs.
Details
Motivation: To determine the most effective way of leveraging diversity (model vs. question interpretation) for improving performance in binary question answering using LLMs.
Method: Compare model diversity (multiple models answering the same question) and question interpretation diversity (same model answering differently framed questions) using majority voting for consensus.
Result: Question interpretation diversity consistently yields better ensemble accuracy than model diversity. Model diversity results fall between the best and worst ensemble members without clear improvement.
Conclusion: Question interpretation diversity is more effective for enhancing ensemble accuracy in binary question answering with LLMs.
Abstract: Effectively leveraging diversity has been shown to improve performance for various machine learning models, including large language models (LLMs). However, determining the most effective way of using diversity remains a challenge. In this work, we compare two diversity approaches for answering binary questions using LLMs: model diversity, which relies on multiple models answering the same question, and question interpretation diversity, which relies on using the same model to answer the same question framed in different ways. For both cases, we apply majority voting as the ensemble consensus heuristic to determine the final answer. Our experiments on boolq, strategyqa, and pubmedqa show that question interpretation diversity consistently leads to better ensemble accuracy compared to model diversity. Furthermore, our analysis of GPT and LLaMa shows that model diversity typically produces results between the best and the worst ensemble members without clear improvement.
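A sketch of question-interpretation-diversity ensembling: one model answers several rephrasings of the same binary question, and majority voting picks the final answer. `ask_model` is a hypothetical stand-in for any LLM call returning "yes" or "no":

```python
# Majority voting over paraphrases of one binary question.
from collections import Counter

def ask_model(question: str) -> str:
    ...  # hypothetical: call your LLM here and return "yes" or "no"

def ensemble_answer(paraphrases):
    votes = Counter(ask_model(q) for q in paraphrases)
    return votes.most_common(1)[0][0]

paraphrases = [
    "Is the Atlantic larger than the Mediterranean?",
    "Does the Atlantic Ocean exceed the Mediterranean Sea in area?",
    "Compared to the Mediterranean, is the Atlantic bigger?",
]
# final = ensemble_answer(paraphrases)
```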
[21] Contrast-CAT: Contrasting Activations for Enhanced Interpretability in Transformer-based Text Classifiers
Sungmin Han, Jeonghyun Lee, Sangkyun Lee
Main category: cs.CL
TL;DR: Contrast-CAT improves interpretability of transformer-based text classification by filtering class-irrelevant features, outperforming state-of-the-art methods.
Details
Motivation: Explaining transformer decisions is challenging, hindering trust and safe deployment. Existing methods are unreliable due to class-irrelevant features.
Method: Proposes Contrast-CAT, an activation contrast-based attribution method that refines token-level attributions by filtering irrelevant features.
Result: Outperforms competing methods, achieving 1.30× AOPC and 2.25× LOdds improvements under the MoRF setting.
Conclusion: Contrast-CAT enhances interpretability for transformer-based text classification, offering clearer and more faithful attributions.
Abstract: Transformers have profoundly influenced AI research, but explaining their decisions remains challenging – even for relatively simple tasks such as classification – which hinders trust and safe deployment in real-world applications. Although activation-based attribution methods effectively explain transformer-based text classification models, our findings reveal that these methods can be undermined by class-irrelevant features within activations, leading to less reliable interpretations. To address this limitation, we propose Contrast-CAT, a novel activation contrast-based attribution method that refines token-level attributions by filtering out class-irrelevant features. By contrasting the activations of an input sequence with reference activations, Contrast-CAT generates clearer and more faithful attribution maps. Experimental results across various datasets and models confirm that Contrast-CAT consistently outperforms state-of-the-art methods. Notably, under the MoRF setting, it achieves average improvements of 1.30× in AOPC and 2.25× in LOdds over the strongest competing methods, demonstrating its effectiveness in enhancing interpretability for transformer-based text classification.
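A heavily simplified, assumption-laden illustration of activation-contrast attribution, not Contrast-CAT's actual formulation: score each token by how far its activation departs from the mean of reference activations:

```python
# Per-token attribution as the norm of the contrast against reference activations.
import torch

def contrast_attribution(token_acts, reference_acts):
    """token_acts: (seq, dim); reference_acts: (n_refs, seq, dim)."""
    contrast = token_acts - reference_acts.mean(dim=0)  # remove shared features
    return contrast.norm(dim=-1)                        # one score per token

scores = contrast_attribution(torch.randn(16, 768), torch.randn(8, 16, 768))
```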
[22] Understanding Public Perception of Crime in Bangladesh: A Transformer-Based Approach with Explainability
Fatema Binte Hassan, Md Al Jubair, Mohammad Mehadi Hasan, Tahmid Hossain, S M Mehebubur Rahman Khan Shuvo, Mohammad Shamsul Arefin
Main category: cs.CL
TL;DR: The paper analyzes public sentiment on crime-related news using a transformer-based model (XLM-RoBERTa) on Bangla social media comments, achieving 97% accuracy and employing explainable AI for insights.
Details
Motivation: To understand evolving public perception of crime-related news in Bangla social media and improve sentiment analysis for low-resource languages.
Method: A transformer-based model (XLM-RoBERTa Base) was trained on a new dataset of 28,528 Bangla comments, classified into positive, negative, and neutral sentiments. Explainable AI was used for interpretability.
Result: The model achieved 97% classification accuracy, outperforming existing methods, and identified key features influencing sentiment.
Conclusion: Transformer-based models are effective for Bangla sentiment analysis and can provide actionable insights for public policy and crime prevention.
Abstract: In recent years, social media platforms have become prominent spaces for individuals to express their opinions on ongoing events, including criminal incidents. As a result, public sentiment can shift dynamically over time. This study investigates the evolving public perception of crime-related news by classifying user-generated comments into three categories: positive, negative, and neutral. A newly curated dataset comprising 28,528 Bangla-language social media comments was developed for this purpose. We propose a transformer-based model utilizing the XLM-RoBERTa Base architecture, which achieves a classification accuracy of 97%, outperforming existing state-of-the-art methods in Bangla sentiment analysis. To enhance model interpretability, explainable AI technique is employed to identify the most influential features driving sentiment classification. The results underscore the effectiveness of transformer-based models in processing low-resource languages such as Bengali and demonstrate their potential to extract actionable insights that can support public policy formulation and crime prevention strategies.
[23] Bangla BERT for Hyperpartisan News Detection: A Semi-Supervised and Explainable AI Approach
Mohammad Mehadi Hasan, Fatema Binte Hassan, Md Al Jubair, Zobayer Ahmed, Sazzatul Yeakin, Md Masum Billah
Main category: cs.CL
TL;DR: The paper fine-tunes Bangla BERT to detect hyperpartisan news in Bangla, achieving 95.65% accuracy and outperforming traditional methods. It also uses LIME for explainability.
Details
Motivation: Misinformation in Bangla lacks detection tools, risking societal divisions. The study aims to fill this gap with advanced NLP methods.
Method: Fine-tunes Bangla BERT, compares it with traditional ML models, and uses semi-supervised learning and LIME for explainability.
Result: Bangla BERT achieves 95.65% accuracy, outperforming conventional methods.
Conclusion: Transformer models like Bangla BERT are effective even in low-resource settings, paving the way for further advancements.
Abstract: In the current digital landscape, misinformation circulates rapidly, shaping public perception and causing societal divisions. It is difficult to identify hyperpartisan news in Bangla since there aren’t many sophisticated natural language processing methods available for this low-resource language. Without effective detection methods, biased content can spread unchecked, posing serious risks to informed discourse. To address this gap, our research fine-tunes Bangla BERT. This is a state-of-the-art transformer-based model, designed to enhance classification accuracy for hyperpartisan news. We evaluate its performance against traditional machine learning models and implement semi-supervised learning to enhance predictions further. Not only that, we use LIME to provide transparent explanations of the model’s decision-making process, which helps to build trust in its outcomes. With a remarkable accuracy score of 95.65%, Bangla BERT outperforms conventional approaches, according to our trial data. The findings of this study demonstrate the usefulness of transformer models even in environments with limited resources, which opens the door to further improvements in this area.
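A sketch of the LIME explanation step described above; the `predict_proba` wrapper around the fine-tuned Bangla BERT model is a hypothetical stub:

```python
# LIME explanation for a text classifier; the prediction function is a stub.
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Stand-in: should return an (n, 2) array of class probabilities
    from the fine-tuned Bangla BERT model."""
    return np.tile([0.3, 0.7], (len(texts), 1))

explainer = LimeTextExplainer(class_names=["neutral", "hyperpartisan"])
explanation = explainer.explain_instance(
    "একটি সংবাদ শিরোনামের উদাহরণ",  # example Bangla headline
    predict_proba, num_features=6)
print(explanation.as_list())        # tokens with their attribution weights
```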
[24] Can human clinical rationales improve the performance and explainability of clinical text classification models?
Christoph Metzner, Shang Gao, Drahomira Herrmannova, Heidi A. Hanson
Main category: cs.CL
TL;DR: Using human-based clinical rationales as additional training data improves transformer-based models’ performance and explainability in high-resource scenarios but is inconsistent in low-resource settings. Rationales are outperformed by simply using more reports for accuracy, though they may aid explainability.
Details
Motivation: To explore if human-based clinical rationales can enhance the performance and explainability of AI models in clinical text classification, particularly for primary cancer site diagnoses.
Method: Analyzed 99,125 clinical rationales alongside 128,649 pathology reports to train transformer-based models. Evaluated sufficiency as a metric for rationale quality.
Result: Rationales improved performance in high-resource scenarios but were inconsistent otherwise. Models trained on rationales were outperformed by those trained on additional reports. Sufficiency as a metric yielded inconsistent results.
Conclusion: Clinical rationales offer minor performance and explainability gains compared to additional reports. For accuracy, labeling more reports is better; for explainability, rationale-supplemented data may help.
Abstract: AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don’t consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.
[25] Do Large Language Models Understand Morality Across Cultures?
Hadi Mohammadi, Yasmeen F. S. S. Meijer, Efthymia Papadopoulou, Ayoub Bagheri
Main category: cs.CL
TL;DR: LLMs often fail to capture cross-cultural moral diversity, compressing differences and showing low alignment with survey data, highlighting ethical concerns.
Details
Motivation: To assess how well LLMs reflect cross-cultural moral perspectives compared to empirical survey data.
Method: Three methods: comparing moral score variances, cluster alignment analyses, and direct probing with comparative prompts.
Result: LLMs compress cross-cultural moral differences and show low alignment with survey patterns.
Conclusion: Urgent need for bias mitigation and improved cultural representativeness in LLMs for ethical global deployment.
Abstract: Recent advancements in large language models (LLMs) have established them as powerful tools across numerous domains. However, persistent concerns about embedded biases, such as gender, racial, and cultural biases arising from their training data, raise significant questions about the ethical use and societal consequences of these technologies. This study investigates the extent to which LLMs capture cross-cultural differences and similarities in moral perspectives. Specifically, we examine whether LLM outputs align with patterns observed in international survey data on moral attitudes. To this end, we employ three complementary methods: (1) comparing variances in moral scores produced by models versus those reported in surveys, (2) conducting cluster alignment analyses to assess correspondence between country groupings derived from LLM outputs and survey data, and (3) directly probing models with comparative prompts using systematically chosen token pairs. Our results reveal that current LLMs often fail to reproduce the full spectrum of cross-cultural moral variation, tending to compress differences and exhibit low alignment with empirical survey patterns. These findings highlight a pressing need for more robust approaches to mitigate biases and improve cultural representativeness in LLMs. We conclude by discussing the implications for the responsible development and global deployment of LLMs, emphasizing fairness and ethical alignment.
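Two of the probes, variance comparison and cluster alignment, can be illustrated with a small synthetic sketch; the data below are placeholders and the paper's exact clustering setup may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Rows = countries, columns = moral topics; synthetic stand-ins for real data.
survey_scores = rng.normal(size=(40, 10))
llm_scores = 0.3 * survey_scores + rng.normal(scale=0.5, size=(40, 10))

# Variance comparison: "compressed" LLM scores show lower cross-country variance.
print(llm_scores.var(axis=0).mean(), survey_scores.var(axis=0).mean())

# Cluster alignment: a low adjusted Rand index means the country groupings
# derived from LLM outputs do not correspond to the survey-based groupings.
survey_clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(survey_scores)
llm_clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(llm_scores)
print(adjusted_rand_score(survey_clusters, llm_clusters))
```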
[26] StructText: A Synthetic Table-to-Text Approach for Benchmark Generation with Multi-Dimensional Evaluation
Satyananda Kashyap, Sola Shirai, Nandana Mihindukulasooriya, Horst Samulowitz
Main category: cs.CL
TL;DR: StructText is a framework for automatically generating benchmarks for key-value extraction from text using tabular data, addressing the lack of scalable evaluation methods for LLMs in specific domains.
Details
Motivation: There's a lack of benchmarks for evaluating LLMs' extraction quality in specific domains, and manual annotation is labor-intensive.
Method: StructText uses tabular data as ground truth, employs a two-stage 'plan-then-execute' pipeline to generate synthetic text, and evaluates alignment via LLM-based judgments and objective metrics.
Result: LLMs achieve high factual accuracy but struggle with narrative coherence, especially in embedding numerical/temporal data for extraction.
Conclusion: StructText provides a scalable solution for benchmark generation, with released datasets and tools to aid further research.
Abstract: Extracting structured information from text, such as key-value pairs that could augment tabular data, is quite useful in many enterprise use cases. Although large language models (LLMs) have enabled numerous automated pipelines for converting natural language into structured formats, there is still a lack of benchmarks for evaluating their extraction quality, especially in specific domains or focused documents specific to a given organization. Building such benchmarks by manual annotations is labour-intensive and limits the size and scalability of the benchmarks. In this work, we present StructText, an end-to-end framework for automatically generating high-fidelity benchmarks for key-value extraction from text using existing tabular data. It uses available tabular data as structured ground truth, and follows a two-stage 'plan-then-execute' pipeline to synthetically generate corresponding natural-language text. To ensure alignment between text and structured source, we introduce a multi-dimensional evaluation strategy that combines (a) LLM-based judgments on factuality, hallucination, and coherence and (b) objective extraction metrics measuring numeric and temporal accuracy. We evaluated the proposed method on 71,539 examples across 49 datasets. Results reveal that while LLMs achieve strong factual accuracy and avoid hallucination, they struggle with narrative coherence in producing extractable text. Notably, models preserve numerical and temporal information with high fidelity, yet this information becomes embedded in narratives that resist automated extraction. We release a framework, including datasets, evaluation tools, and baseline extraction systems, to support continued research.
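On the evaluation side, the objective metrics check whether numeric and temporal values from the source table survive in the generated narrative. A crude stand-in for such a check (not the released evaluation tools):

```python
import re

def numeric_recall(source_row: dict, generated_text: str) -> float:
    """Share of numeric ground-truth values recoverable verbatim from the
    generated narrative (a rough proxy for the paper's extraction metrics)."""
    numbers = [str(v) for v in source_row.values() if isinstance(v, (int, float))]
    found = [n for n in numbers if re.search(re.escape(n), generated_text)]
    return len(found) / len(numbers) if numbers else 1.0

row = {"company": "Acme", "revenue": 1200, "year": 2021}
text = "In 2021, Acme reported revenue of 1200 million dollars."
print(numeric_recall(row, text))  # 1.0
```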
[27] Turbocharging Web Automation: The Impact of Compressed History States
Xiyue Zhu, Peng Tang, Haofu Liao, Srikar Appalaraju
Main category: cs.CL
TL;DR: A novel web history compressor improves web automation by condensing verbose history states into task-relevant fixed-length representations, achieving 1.2-5.4% accuracy gains.
Details
Motivation: Current web automation methods ignore history states, leading to inefficient use of verbose web page data.
Method: Proposes a history compressor module to distill task-relevant information from verbose history states into concise representations.
Result: Achieves 1.2-5.4% absolute accuracy improvements on Mind2Web and WebLINX datasets.
Conclusion: The history compressor enhances web automation by effectively utilizing history states.
Abstract: Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.
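The compressor distills each verbose history state into a fixed-length representation. One common way to realize such a module is learned-query cross-attention pooling, sketched below as an illustrative design rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class HistoryCompressor(nn.Module):
    """Compress a variable-length history state into k fixed slots via
    learned-query cross-attention (an illustrative design, not the paper's)."""
    def __init__(self, d_model: int = 256, k_slots: int = 8, n_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, history_tokens: torch.Tensor) -> torch.Tensor:
        # history_tokens: (batch, seq_len, d_model); seq_len may be thousands.
        q = self.queries.unsqueeze(0).expand(history_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, history_tokens, history_tokens)
        return compressed  # (batch, k_slots, d_model), fixed length

x = torch.randn(2, 1024, 256)  # a long, verbose history state
print(HistoryCompressor()(x).shape)  # torch.Size([2, 8, 256])
```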
[28] MemTool: Optimizing Short-Term Memory Management for Dynamic Tool Calling in LLM Agent Multi-Turn Conversations
Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke
Main category: cs.CL
TL;DR: MemTool is a short-term memory framework for LLM agents to manage tools/MCP contexts in multi-turn conversations, offering three modes (Autonomous, Workflow, Hybrid) with trade-offs in efficiency and task accuracy.
Details
Motivation: Fixed context windows limit LLM agents' effectiveness in multi-turn interactions requiring repeated tool usage.
Method: Introduces MemTool with three modes: Autonomous Agent, Workflow, and Hybrid, evaluated on ScaleMCP benchmark with 13+ LLMs.
Result: Autonomous mode achieves high tool-removal efficiency (90-94%) for reasoning LLMs, while Workflow and Hybrid modes manage tool removal effectively. Hybrid and Autonomous modes excel in task completion.
Conclusion: MemTool provides flexible tool management for LLM agents, with mode recommendations based on task accuracy, agency, and model capabilities.
Abstract: Large Language Model (LLM) agents have shown significant autonomous capabilities in dynamically searching and incorporating relevant tools or Model Context Protocol (MCP) servers for individual queries. However, fixed context windows limit effectiveness in multi-turn interactions requiring repeated, independent tool usage. We introduce MemTool, a short-term memory framework enabling LLM agents to dynamically manage tools or MCP server contexts across multi-turn conversations. MemTool offers three agentic architectures: 1) Autonomous Agent Mode, granting full tool management autonomy, 2) Workflow Mode, providing deterministic control without autonomy, and 3) Hybrid Mode, combining autonomous and deterministic control. Evaluating each MemTool mode across 13+ LLMs on the ScaleMCP benchmark, we conducted experiments over 100 consecutive user interactions, measuring tool removal ratios (short-term memory efficiency) and task completion accuracy. In Autonomous Agent Mode, reasoning LLMs achieve high tool-removal efficiency (90-94% over a 3-window average), while medium-sized models exhibit significantly lower efficiency (0-60%). Workflow and Hybrid modes consistently manage tool removal effectively, whereas Autonomous and Hybrid modes excel at task completion. We present trade-offs and recommendations for each MemTool mode based on task accuracy, agency, and model capabilities.
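Efficiency is reported as tool-removal ratios averaged over a 3-turn window. A minimal sketch of that bookkeeping, with the exact ratio definition as an assumption:

```python
def removal_ratio(tools_added: int, tools_removed: int) -> float:
    """Per-turn share of added tools later removed from the working context
    (an assumed definition; the benchmark's exact formula may differ)."""
    return tools_removed / tools_added if tools_added else 0.0

def windowed_efficiency(per_turn_ratios, window: int = 3):
    """Rolling mean over the last `window` turns, as in a 3-window average."""
    out = []
    for i in range(len(per_turn_ratios)):
        chunk = per_turn_ratios[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(windowed_efficiency([1.0, 0.8, 0.9, 0.2]))  # [1.0, 0.9, 0.9, 0.633...]
```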
[29] Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour
Tareq Alsaleh, Bilal Farooq
Main category: cs.CL
TL;DR: The study introduces LiTransMC, a fine-tuned causal LLM for travel mode choice prediction, outperforming untuned models, proprietary systems, and classical methods while offering interpretability and local deployment benefits.
Details
Motivation: To develop a locally deployable, open-access causal LLM for travel mode choice prediction that integrates predictive accuracy with interpretability, addressing gaps in existing methods.
Method: Benchmarked 11 LLMs across 3 datasets, testing 396 configurations. LiTransMC was fine-tuned using parameter-efficient and loss masking strategies, evaluated using BERTopic and a novel Explanation Strength Index.
Result: LiTransMC achieved a weighted F1 score of 0.6845 and Jensen-Shannon Divergence of 0.000245, surpassing GPT-4o and classical methods in accuracy and calibration.
Conclusion: Specialist, locally deployable LLMs like LiTransMC can integrate prediction and interpretability, enabling conversational, multi-task transport models for research and policy.
Abstract: This study investigates the adoption of open-access, locally deployable causal large language models (LLMs) for travel mode choice prediction and introduces LiTransMC, the first fine-tuned causal LLM developed for this task. We systematically benchmark eleven LLMs (1-12B parameters) across three stated and revealed preference datasets, testing 396 configurations and generating over 79,000 synthetic commuter predictions. Beyond predictive accuracy, we evaluate model-generated reasoning using BERTopic for topic modelling and a novel Explanation Strength Index, providing the first structured analysis of how LLMs articulate decision factors in alignment with behavioural theory. LiTransMC, fine-tuned using a parameter-efficient, loss-masking strategy, achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of 0.000245, surpassing both untuned local models and larger proprietary systems, including GPT-4o with advanced persona inference and embedding-based loading, while also outperforming classical mode choice methods such as discrete choice models and machine learning classifiers for the same dataset. This dual improvement, i.e., high instance-level accuracy and near-perfect distributional calibration, demonstrates the feasibility of creating specialist, locally deployable LLMs that integrate prediction and interpretability. Through combining structured behavioural prediction with natural language reasoning, this work unlocks the potential for conversational, multi-task transport models capable of supporting agent-based simulations, policy testing, and behavioural insight generation. These findings establish a pathway for transforming general purpose LLMs into specialized, explainable tools for transportation research and policy formulation, while maintaining privacy, reducing cost, and broadening access through local deployment.
[30] Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench
Reuben Narad, Siddharth Suresh, Jiayi Chen, Pine S. L. Dysart-Bricken, Bob Mankoff, Robert Nowak, Jifan Zhang, Lalit Jain
Main category: cs.CL
TL;DR: HumorBench evaluates LLMs’ humor reasoning via cartoon captions, revealing STEM-trained models transfer well but scaling thinking tokens yields mixed results.
Details
Motivation: To assess LLMs' humor comprehension beyond STEM, as reasoning models saturate existing benchmarks.
Method: Uses 300 cartoon-caption pairs with expert rubrics to evaluate LLMs' joke explanations and element identification.
Result: STEM-trained models perform well, showing reasoning transferability, but scaling thinking tokens has inconsistent effects.
Conclusion: HumorBench highlights LLMs’ humor reasoning potential and the transferability of STEM-trained reasoning skills.
Abstract: We present HumorBench, a benchmark designed to evaluate large language models' (LLMs) ability to reason about and explain sophisticated humor in cartoon captions. As reasoning models increasingly saturate existing benchmarks in mathematics and science, novel and challenging evaluations of model intelligence beyond STEM domains are essential. Reasoning is fundamentally involved in text-based humor comprehension, requiring the identification of connections between concepts in cartoons/captions and external cultural references, wordplays, and other mechanisms. HumorBench includes approximately 300 unique cartoon-caption pairs from the New Yorker Caption Contest and Cartoonstock.com, with expert-annotated evaluation rubrics identifying essential joke elements. LLMs are evaluated on their explanations of the humor and their ability to identify the joke elements. To perform well on this task, models must form and test hypotheses about associations between concepts, potentially backtracking from initial interpretations to arrive at the most plausible explanation. Our extensive benchmarking of current SOTA models reveals three key insights: (1) LLM progress on STEM reasoning transfers effectively to humor comprehension; (2) models trained exclusively on STEM reasoning data still perform well on HumorBench, demonstrating strong transferability of reasoning abilities; and (3) test-time scaling by increasing thinking token budgets yields mixed results across different models in humor reasoning.
[31] Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs
Abhinav Arabelly, Jagrut Nemade, Robert D Nowak, Jifan Zhang
Main category: cs.CL
TL;DR: The paper introduces a label-efficient learning method for supervised finetuning (SFT) by leveraging task-diversity for data selection, reducing annotation costs by up to 80% while improving model performance.
Details
Motivation: High-performing LLMs for specialized applications require costly human annotation. The paper aims to address this inefficiency by focusing on task-diversity rather than prompt-diversity for data selection.
Method: The approach uses task labels and pre-trained model confidence levels to select examples via inverse confidence weighting, simplifying implementation and reducing computational load.
Result: The method achieves better accuracy (4% increase in MMLU score) than training on full datasets and matches or outperforms existing methods while cutting annotation costs by up to 80%.
Conclusion: Task-diversity-based data selection is a simple, effective, and cost-efficient alternative to existing methods for supervised finetuning of LLMs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation – a process that is time-consuming, labor-intensive, and expensive. In this paper, we address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. This is markedly different from existing methods based on the prompt-diversity. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks. We combine these facts to devise a simple yet effective sampling strategy: we select examples across tasks using an inverse confidence weighting strategy. This produces models comparable to or better than those trained with more complex sampling procedures, while being significantly easier to implement and less computationally intensive. Notably, our experimental results demonstrate that this method can achieve better accuracy than training on the complete dataset (a 4% increase in MMLU score). Across various annotation budgets and two instruction finetuning datasets, our algorithm consistently performs at or above the level of the best existing methods, while reducing annotation costs by up to 80%.
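The selection rule itself is compact: examples are sampled across tasks with probability inversely proportional to the pre-trained model's confidence on the corresponding task. A small sketch under assumed inputs (per-task confidence scores in [0, 1]):

```python
import numpy as np

def select_examples(task_ids, task_confidence, budget, seed=0):
    """Pick `budget` examples with probability inversely proportional to the
    pre-trained model's confidence on each example's task (a sketch of
    inverse confidence weighting, not the authors' exact implementation)."""
    rng = np.random.default_rng(seed)
    conf = np.array([task_confidence[t] for t in task_ids], dtype=float)
    weights = 1.0 / np.clip(conf, 1e-6, None)   # low confidence => high weight
    probs = weights / weights.sum()
    return rng.choice(len(task_ids), size=budget, replace=False, p=probs)

task_ids = ["math", "math", "law", "law", "code", "code"]
task_confidence = {"math": 0.9, "law": 0.4, "code": 0.6}  # assumed scores
print(select_examples(task_ids, task_confidence, budget=3))
```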
[32] VN-MTEB: Vietnamese Massive Text Embedding Benchmark
Loc Pham, Tung Luu, Thu Vo, Minh Nguyen, Viet Hoang
Main category: cs.CL
TL;DR: The paper introduces VN-MTEB, a Vietnamese benchmark for embedding models, created by translating English samples from MTEB using LLMs, ensuring quality and semantic fidelity. It includes 41 datasets across six tasks and finds Rotary Positional Embedding models outperform Absolute ones.
Details
Motivation: Vietnam's high internet traffic and online toxicity necessitate robust embedding models for recommendations and content control, but the lack of large-scale Vietnamese datasets hinders effective AI evaluation.
Method: The authors translated English samples from MTEB using LLMs and embedding models, ensuring language flow, semantic fidelity, and retention of NER and code snippets.
Result: VN-MTEB comprises 41 datasets for six Vietnamese embedding tasks, with Rotary Positional Embedding models outperforming Absolute ones.
Conclusion: VN-MTEB addresses the lack of Vietnamese benchmarks, aiding AI model evaluation for real-world applications, with Rotary Positional Embedding showing superior performance.
Abstract: Vietnam ranks among the top countries in terms of both internet traffic and online toxicity. As a result, implementing embedding models for recommendation and content control duties in applications is crucial. However, a lack of large-scale test datasets, both in volume and task diversity, makes it tricky for scientists to effectively evaluate AI models before deploying them in real-world, large-scale projects. To solve this important problem, we introduce a Vietnamese benchmark, VN-MTEB for embedding models, which we created by translating a large number of English samples from the Massive Text Embedding Benchmark using our new automated framework. We leverage the strengths of large language models (LLMs) and cutting-edge embedding models to conduct translation and filtering processes to retain high-quality samples, guaranteeing a natural flow of language and semantic fidelity while preserving named entity recognition (NER) and code snippets. Our comprehensive benchmark consists of 41 datasets from six tasks specifically designed for Vietnamese text embeddings. In our analysis, we find that bigger and more complex models using Rotary Positional Embedding outperform those using Absolute Positional Embedding in embedding tasks. Datasets are available at HuggingFace: https://huggingface.co/collections/GreenNode/vn-mteb-68871433f0f7573b8e1a6686
[33] Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, Jack Lindsey
Main category: cs.CL
TL;DR: The paper identifies ‘persona vectors’ in large language models to monitor and control traits like evil, sycophancy, and hallucination. It shows these vectors predict and mitigate unintended personality shifts during training.
Details
Motivation: To address deviations from helpful, harmless, and honest behavior in AI assistants by understanding and controlling underlying personality traits.
Method: Extracts persona vectors from model activations, uses them to monitor and predict personality shifts, and develops interventions to mitigate or prevent undesirable changes.
Result: Persona vectors effectively predict and control personality shifts, with post-hoc and preventative methods showing success. They also flag problematic training data.
Conclusion: Persona vectors offer a scalable, automated way to monitor and control AI personality traits, improving alignment with desired behaviors.
Abstract: Large language models interact with users through a simulated 'Assistant' persona. While the Assistant is typically trained to be helpful, harmless, and honest, it sometimes deviates from these ideals. In this paper, we identify directions in the model's activation space, termed persona vectors, underlying several traits, such as evil, sycophancy, and propensity to hallucinate. We confirm that these vectors can be used to monitor fluctuations in the Assistant's personality at deployment time. We then apply persona vectors to predict and control personality shifts that occur during training. We find that both intended and unintended personality changes after finetuning are strongly correlated with shifts along the relevant persona vectors. These shifts can be mitigated through post-hoc intervention, or avoided in the first place with a new preventative steering method. Moreover, persona vectors can be used to flag training data that will produce undesirable personality changes, both at the dataset level and the individual sample level. Our method for extracting persona vectors is automated and can be applied to any personality trait of interest, given only a natural-language description.
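A common recipe for such activation-space directions is a difference of mean activations between trait-eliciting and neutral prompts, with monitoring done by projecting new activations onto the vector. A minimal numpy sketch; the paper's automated extraction pipeline is more involved:

```python
import numpy as np

def persona_vector(trait_acts, neutral_acts):
    """Unit-norm difference of mean activations between trait-eliciting and
    neutral prompts; a standard recipe for a trait direction, not
    necessarily the paper's automated pipeline."""
    v = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def trait_score(activation, v):
    """Projection onto the persona vector; drift along this axis can flag a
    personality shift during deployment or finetuning."""
    return float(activation @ v)

rng = np.random.default_rng(0)
v = persona_vector(rng.normal(1.0, 1.0, (64, 512)),   # trait-eliciting runs
                   rng.normal(0.0, 1.0, (64, 512)))   # neutral runs
print(trait_score(rng.normal(0.5, 1.0, 512), v))
```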
[34] TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling
Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Main category: cs.CL
TL;DR: TriangleMix is a training-free static attention pattern for LLMs that reduces computational overhead without sacrificing accuracy by using dense attention in shallow layers and triangle-shaped sparse patterns in deeper layers.
Details
Motivation: Address the computational bottlenecks of quadratic time complexity in attention mechanisms for LLMs, avoiding accuracy degradation and overhead from existing static and dynamic sparsity methods.
Method: Proposes TriangleMix, which combines dense attention in shallow layers with a triangle-shaped sparse pattern in deeper layers.
Result: Reduces attention overhead by 3.7x to 15.3x in deep layers and decreases TTFT by 12% to 32% for sequences up to 128K, maintaining accuracy.
Conclusion: TriangleMix enhances LLM inference efficiency and can integrate with dynamic sparsity for further speedup, demonstrating its practical potential.
Abstract: Large Language Models (LLMs) rely on attention mechanisms whose time complexity grows quadratically with input sequence length, creating significant computational bottlenecks during the prefilling stage. Existing static sparse attention methods typically degrade accuracy, while dynamic sparsity methods introduce additional computational overhead due to runtime sparse index estimation. To address these limitations, we propose TriangleMix, a novel training-free static attention pattern. TriangleMix employs dense attention in shallow layers and switches to a triangle-shaped sparse pattern in deeper layers. Extensive experiments demonstrate that TriangleMix reduces attention overhead by 3.7x to 15.3x in deep layers, and decreases overall Time-to-First-Token (TTFT) by 12% to 32% for sequence lengths ranging from 32K to 128K, without sacrificing model accuracy. Moreover, TriangleMix can be seamlessly integrated with dynamic sparsity methods to achieve further speedup, e.g. accelerating MInference by 19% at 128K, highlighting its potential to enhance LLM inference efficiency.
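The core mechanic is a depth-dependent static mask: dense causal attention in shallow layers, a sparser pattern in deep ones. The toy sketch below illustrates only the layer switch; a sink-plus-local-window pattern stands in for the paper's actual triangle-shaped layout, which is defined in the paper:

```python
import numpy as np

def attention_mask(seq_len, layer, switch_layer=8, sink=4, local=64):
    """Boolean mask (True = attend). Shallow layers: dense causal attention.
    Deep layers: a sparser causal pattern (sink + local window here, a
    stand-in for TriangleMix's triangle-shaped layout)."""
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if layer < switch_layer:
        return causal
    idx = np.arange(seq_len)
    keep = (idx[None, :] < sink) | (idx[None, :] > idx[:, None] - local)
    return causal & keep

# Shallow layer keeps the full causal mask; the deep layer is much sparser.
print(attention_mask(256, layer=0).sum(), attention_mask(256, layer=12).sum())
```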
[35] Automatic Classification of User Requirements from Online Feedback – A Replication Study
Meet Bhatt, Nic Boilard, Muhammad Rehan Chaudhary, Cole Thompson, Jacob Idoko, Aakash Sorathiya, Gouri Ginde
Main category: cs.CL
TL;DR: The paper replicates and extends a prior NLP4RE study, evaluating deep learning models for requirement classification, testing reproducibility, and assessing GPT-4o’s performance.
Details
Motivation: To address the limited replication in NLP4RE studies and explore new NLP advancements for RE tasks.
Method: Reproduced the baseline study's results, extended it with an external dataset and GPT-4o comparison, and prepared a replication study ID-card.
Result: Naive Bayes showed perfect reproducibility, while BERT and ELMo generalized well. GPT-4o matched traditional models. Replication readiness was confirmed but lacked environment files.
Conclusion: The study highlights reproducibility challenges and opportunities in NLP4RE, providing tools for future replication.
Abstract: Natural language processing (NLP) techniques have been widely applied in the requirements engineering (RE) field to support tasks such as classification and ambiguity detection. Although RE research is rooted in empirical investigation, it has paid limited attention to replicating NLP for RE (NLP4RE) studies. The rapidly advancing realm of NLP is creating new opportunities for efficient, machine-assisted workflows, which can bring new perspectives and results to the forefront. Thus, we replicate and extend a previous NLP4RE study (baseline), "Classifying User Requirements from Online Feedback in Small Dataset Environments using Deep Learning", which evaluated different deep learning models for requirement classification from user reviews. We reproduced the original results using publicly released source code, thereby helping to strengthen the external validity of the baseline study. We then extended the setup by evaluating model performance on an external dataset and comparing results to a GPT-4o zero-shot classifier. Furthermore, we prepared the replication study ID-card for the baseline study, important for evaluating replication readiness. Results showed diverse reproducibility levels across different models, with Naive Bayes demonstrating perfect reproducibility. In contrast, BERT and other models showed mixed results. Our findings revealed that baseline deep learning models, BERT and ELMo, exhibited good generalization capabilities on an external dataset, and GPT-4o showed performance comparable to traditional baseline machine learning models. Additionally, our assessment confirmed the baseline study's replication readiness; however, readiness would have been further enhanced had the missing environment setup files been included. We include this missing information in our replication package and provide the replication study ID-card for our study to further encourage and support the replication of our study.
[36] Modern Uyghur Dependency Treebank (MUDT): An Integrated Morphosyntactic Framework for a Low-Resource Language
Jiaxin Zuo, Yiquan Wang, Yuan Pan, Xiadiya Yibulayin
Main category: cs.CL
TL;DR: The study introduces a dependency annotation framework for Uyghur NLP, addressing gaps in existing treebanks. It includes 18 main relations and 26 subtypes, validated by a cross-standard evaluation showing 47.9% divergence from universal schemes. The Modern Uyghur Dependency Treebank (MUDT) offers improved accuracy and semantic transparency for parsing and downstream tasks.
Details
Motivation: To address the lack of tailored dependency annotation frameworks for Uyghur, a low-resource, agglutinative language, and overcome limitations of universal schemes.
Method: Developed a dependency annotation framework with 18 main relations and 26 subtypes, validated using a pre-trained Universal Dependencies parser. Grounded in nine annotation principles for typological accuracy and semantic transparency.
Result: A 47.9% divergence in annotations was found, highlighting the inadequacy of universal schemes for Uyghur. The MUDT provides a more accurate and transparent representation.
Conclusion: The MUDT framework improves parsing and downstream NLP tasks for Uyghur and serves as a replicable model for other morphologically complex languages.
Abstract: To address a critical resource gap in Uyghur Natural Language Processing (NLP), this study introduces a dependency annotation framework designed to overcome the limitations of existing treebanks for the low-resource, agglutinative language. The proposed relation inventory includes 18 main relations and 26 subtypes, with specific labels such as cop:zero for verbless clauses and instr:case=loc/dat for nuanced instrumental functions. To empirically validate the necessity of this tailored approach, we conducted a cross-standard evaluation using a pre-trained Universal Dependencies parser. The analysis revealed a systematic 47.9% divergence in annotations, pinpointing the inadequacy of universal schemes for handling Uyghur-specific structures. Grounded in nine annotation principles that ensure typological accuracy and semantic transparency, the Modern Uyghur Dependency Treebank (MUDT) provides a more accurate and semantically transparent representation, designed to enable significant improvements in parsing and downstream NLP tasks, and offers a replicable model for other morphologically complex languages.
[37] MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation
Jungyeon Lee, Kangmin Lee, Taeuk Kim
Main category: cs.CL
TL;DR: The paper introduces a KG-based framework, MAGIC, to address limitations in existing benchmarks for knowledge conflict in RAG systems, revealing LLMs’ struggles with conflict detection and resolution.
Details
Motivation: Existing benchmarks for knowledge conflict in RAG systems are limited in scope, focusing narrowly on question answering and relying on entity substitution, prompting the need for a more versatile and interpretable approach.
Method: The authors propose a knowledge graph (KG)-based framework to generate varied and subtle conflicts between similar contexts, leveraging KG's relational structure for interpretability.
Result: Experiments on MAGIC show that both open-source and proprietary LLMs struggle with conflict detection, especially in multi-hop reasoning, and often fail to identify contradiction sources.
Conclusion: The study provides foundational insights for improving LLMs’ ability to integrate diverse and conflicting information, highlighting the need for better conflict resolution mechanisms.
Abstract: Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model’s parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection – especially when multi-hop reasoning is required – and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
[38] Evaluating the cognitive reality of Spanish irregular morphomic patterns: Humans vs. Transformers
Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney
Main category: cs.CL
TL;DR: The study compares transformer-based neural networks to human data in processing Spanish irregular morphomic patterns, finding models outperform humans in accuracy but diverge in response preferences.
Details
Motivation: To assess if transformer models can replicate human-like sensitivity to complex linguistic phenomena (morphomes) under controlled conditions.
Method: Direct comparison of models and human data using the same analytical framework, testing three frequency conditions (natural, low, high).
Result: Models outperformed humans in accuracy but preferred irregular responses, influenced by training data. Sensitivity to phonological similarity varied by training distribution.
Conclusion: Transformer models show divergent behavior from humans, highlighting the impact of training data on linguistic processing.
Abstract: This study investigates the cognitive plausibility of the Spanish irregular morphomic pattern by directly comparing transformer-based neural networks to human behavioral data from Nevins et al. (2015). Using the same analytical framework as the original human study, we evaluate whether transformer models can replicate human-like sensitivity to a complex linguistic phenomenon, the morphome, under controlled input conditions. Our experiments focus on three frequency conditions: natural, low-frequency, and high-frequency distributions of verbs exhibiting irregular morphomic patterns. While the models outperformed humans in stem and suffix accuracy, a clear divergence emerged in response preferences. Unlike humans, who consistently favored natural responses across all test items, models preferred irregular responses and were influenced by the proportion of irregular verbs in their training data. Additionally, models trained on the natural and low-frequency distributions, but not the high-frequency distribution, were sensitive to the phonological similarity between test items and real Spanish L-shaped verbs.
[39] Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages
Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena
Main category: cs.CL
TL;DR: The paper introduces Multi-Hypothesis Distillation (MHD), a sequence-level knowledge distillation method for multilingual translation models, leveraging multiple teacher-generated translations to improve student learning and reduce biases.
Details
Motivation: The teacher model's output distribution contains valuable insights beyond beam search results, and exposing the student to diverse translations can enhance learning and mitigate biases.
Method: Proposes MHD, which generates multiple translations per source sentence using n-best lists from beam search and explores alternative decoding methods to increase variability and lexical richness.
Result: For low-resource languages, sampling methods improve variability and lexical richness, slightly reducing translation quality but enhancing student performance and reducing gender bias amplification.
Conclusion: MHD effectively leverages teacher model diversity to improve student learning and address biases, particularly benefiting low-resource languages.
Abstract: This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the approximated mode obtained through beam search (the standard decoding method), and present Multi-Hypothesis Distillation (MHD), a sequence-level KD method that generates multiple translations for each source sentence. This provides a larger representation of the teacher model distribution and exposes the student model to a wider range of target-side prefixes. We leverage n-best lists from beam search to guide the student's learning and examine alternative decoding methods to address issues like low variability and the under-representation of infrequent tokens. For low-resource languages, our research shows that while sampling methods may slightly compromise translation quality compared to beam search based approaches, they enhance the generated corpora with greater variability and lexical richness. This ultimately improves student model performance and mitigates the gender bias amplification often associated with KD.
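MHD needs several teacher translations per source sentence. With a Hugging Face encoder-decoder teacher, n-best lists and sampled alternatives can be produced roughly as below; the model name and decoding hyperparameters are illustrative, not the paper's setup:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-es"  # illustrative teacher model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tok("The weather is nice today.", return_tensors="pt")
# n-best hypotheses from beam search, used to guide the student in MHD.
nbest = model.generate(**inputs, num_beams=8, num_return_sequences=8)
# Sampling trades a little quality for variability and lexical richness.
sampled = model.generate(**inputs, do_sample=True, top_p=0.9,
                         num_return_sequences=8)
for hyp in tok.batch_decode(nbest, skip_special_tokens=True):
    print(hyp)
```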
[40] Multilingual JobBERT for Cross-Lingual Job Title Matching
Jens-Joris Decorte, Matthias De Lange, Jeroen Van Hautte
Main category: cs.CL
TL;DR: JobBERT-V3 is a multilingual job title matching model using contrastive learning, outperforming baselines on TalentCLEF 2025.
Details
Motivation: Extend monolingual JobBERT-V2 to support multiple languages (English, German, Spanish, Chinese) for cross-lingual job title matching.
Method: Leverages synthetic translations and a balanced multilingual dataset (21M job titles) with an efficiency-focused architecture, no task-specific supervision.
Result: Outperforms multilingual baselines on TalentCLEF 2025, consistent in monolingual and cross-lingual settings. Also ranks skills effectively.
Conclusion: JobBERT-V3 is robust for multilingual job title matching and has broader labor market applications.
Abstract: We introduce JobBERT-V3, a contrastive learning-based model for cross-lingual job title matching. Building on the state-of-the-art monolingual JobBERT-V2, our approach extends support to English, German, Spanish, and Chinese by leveraging synthetic translations and a balanced multilingual dataset of over 21 million job titles. The model retains the efficiency-focused architecture of its predecessor while enabling robust alignment across languages without requiring task-specific supervision. Extensive evaluations on the TalentCLEF 2025 benchmark demonstrate that JobBERT-V3 outperforms strong multilingual baselines and achieves consistent performance across both monolingual and cross-lingual settings. While not the primary focus, we also show that the model can be effectively used to rank relevant skills for a given job title, demonstrating its broader applicability in multilingual labor market intelligence. The model is publicly available: https://huggingface.co/TechWolf/JobBERT-v3.
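Assuming the released checkpoint (linked above) loads as a standard sentence-transformers model, cross-lingual matching reduces to embedding titles and ranking by cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Assumption: the public checkpoint is sentence-transformers compatible.
model = SentenceTransformer("TechWolf/JobBERT-v3")

query = "Software-Entwickler"  # German job title
candidates = ["software engineer", "data scientist", "ingeniero de software"]

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)[0]
for title, score in sorted(zip(candidates, scores.tolist()), key=lambda t: -t[1]):
    print(f"{score:.3f}  {title}")
```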
[41] Libra: Assessing and Improving Reward Model by Learning to Think
Meng Zhou, Bei Li, Jiahao Liu, Xiaowen Shi, Yang Bai, Rongxiang Weng, Jingang Wang, Xunliang Cai
Main category: cs.CL
TL;DR: The paper addresses limitations in current reward models for reinforcement learning in reasoning tasks by introducing a new benchmark (Libra Bench) and a generative reward model (Libra-RM) that improves performance without relying on annotated references or constrained outputs.
Details
Motivation: Current reward models for RL in reasoning tasks underperform and rely on rule-based or reference-based rewards, limiting scalability and performance improvement.
Method: Proposes a reasoning-oriented benchmark (Libra Bench) and a generative reward model (Libra-RM) using learning-to-think methodologies.
Result: Libra-RM achieves state-of-the-art results on benchmarks and shows potential for improving reasoning models with unlabeled data.
Conclusion: The proposed framework and Libra-RM address critical limitations, enabling better scaling and sustained enhancement of reasoning models.
Abstract: Reinforcement learning (RL) has significantly improved the reasoning ability of large language models. However, current reward models underperform in challenging reasoning scenarios and predominant RL training paradigms rely on rule-based or reference-based rewards, which impose two critical limitations: (1) the dependence on finely annotated reference answers to attain rewards; and (2) the requirement for constrained output format. These limitations fundamentally hinder further RL data scaling and sustained enhancement of model reasoning performance. To address these limitations, we propose a comprehensive framework for evaluating and improving the performance of reward models in complex reasoning scenarios. We first present a reasoning-oriented benchmark (Libra Bench), systematically constructed from a diverse collection of challenging mathematical problems and advanced reasoning models, to address the limitations of existing reward model benchmarks in reasoning scenarios. We further introduce a novel approach for improving the generative reward model via learning-to-think methodologies. Based on the proposed approach, we develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks. Comprehensive downstream experiments are conducted and the experimental results demonstrate the correlation between our Libra Bench and downstream application, and the potential of Libra-RM to further improve reasoning models with unlabeled data.
[42] UnsafeChain: Enhancing Reasoning Model Safety via Hard Cases
Raj Vardhan Tomar, Preslav Nakov, Yuxia Wang
Main category: cs.CL
TL;DR: UnsafeChain is a dataset for safety alignment in large reasoning models, focusing on hard prompts that elicit harmful outputs, and outperforms existing methods.
Details
Motivation: Address the gap in safety alignment studies by focusing on hard prompts that consistently produce harmful outputs, rather than just filtering safe prompts.
Method: Introduce UnsafeChain, a dataset with hard prompts and corrected safe responses, and fine-tune three large reasoning models on it.
Result: UnsafeChain outperforms SafeChain and STAR-1 across multiple benchmarks, with even a small subset matching baseline performance.
Conclusion: Correction-based supervision via UnsafeChain effectively enhances safety while maintaining reasoning ability, demonstrating generalizability.
Abstract: As large reasoning models (LRMs) grow more capable, chain-of-thought (CoT) reasoning introduces new safety challenges. Existing SFT-based safety alignment studies dominantly focused on filtering prompts with safe, high-quality responses, while overlooking hard prompts that always elicit harmful outputs. To fill this gap, we introduce UnsafeChain, a safety alignment dataset constructed from hard prompts with diverse sources, where unsafe completions are identified and explicitly corrected into safe responses. By exposing models to unsafe behaviors and guiding their correction, UnsafeChain enhances safety while preserving general reasoning ability. We fine-tune three LRMs on UnsafeChain and compare them against recent SafeChain and STAR-1 across six out-of-distribution and five in-distribution benchmarks. UnsafeChain consistently outperforms prior datasets, with even a 1K subset matching or surpassing baseline performance, demonstrating the effectiveness and generalizability of correction-based supervision. We release our dataset and code at https://github.com/mbzuai-nlp/UnsafeChain
[43] Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
Main category: cs.CL
TL;DR: A simple add-on module enhances PLM robustness by removing instance-level principal components, avoiding costly adversarial training or data perturbation.
Details
Motivation: PLMs are vulnerable to adversarial attacks, and existing defenses are computationally expensive.
Method: Proposes transforming the embedding space to approximate Gaussian properties, reducing adversarial susceptibility without altering training data.
Result: Improves robustness on eight benchmarks while maintaining baseline accuracy, balancing robustness and generalization.
Conclusion: The method offers an effective, low-cost solution for enhancing PLM robustness against adversarial attacks.
Abstract: Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defences or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimises the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalisation.
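The add-on operates on one instance at a time, projecting out the leading principal component(s) of that instance's token embeddings. A minimal numpy sketch of that operation (the paper's exact transformation may differ):

```python
import numpy as np

def remove_top_components(token_embs, k=1):
    """Project out the top-k principal directions of a single instance's
    token-embedding matrix (n_tokens x dim); a minimal sketch of
    instance-level principal component removal."""
    centered = token_embs - token_embs.mean(axis=0, keepdims=True)
    # Right singular vectors = principal directions of this instance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]  # (k, dim)
    return token_embs - (token_embs @ top.T) @ top

x = np.random.default_rng(0).normal(size=(32, 768))  # one instance's tokens
print(remove_top_components(x).shape)  # (32, 768)
```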
[44] AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models
Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, Jingchi Jiang
Main category: cs.CL
TL;DR: AgriEval is the first comprehensive Chinese agricultural benchmark for evaluating LLMs, covering six categories and 29 subcategories with 14,697 multiple-choice and 2,167 open-ended questions. Most LLMs struggle to achieve 60% accuracy, highlighting the need for improvement.
Details
Motivation: The lack of training data and evaluation benchmarks in agriculture hinders LLM deployment. AgriEval addresses this gap by providing a robust benchmark for assessing LLM capabilities in agricultural contexts.
Method: AgriEval is curated from university-level exams and assignments, covering six agricultural categories and 29 subcategories. It evaluates four cognitive scenarios: memorization, understanding, inference, and generation.
Result: Most LLMs tested scored below 60% accuracy, indicating significant room for improvement in agricultural applications.
Conclusion: AgriEval fills a critical gap in agricultural LLM evaluation, revealing current limitations and suggesting strategies for enhancing model performance.
Abstract: In the agricultural domain, the deployment of large language models (LLMs) is hindered by the lack of training data and evaluation benchmarks. To mitigate this issue, we propose AgriEval, the first comprehensive Chinese agricultural benchmark with three main characteristics: (1) Comprehensive Capability Evaluation. AgriEval covers six major agriculture categories and 29 subcategories within agriculture, addressing four core cognitive scenarios: memorization, understanding, inference, and generation. (2) High-Quality Data. The dataset is curated from university-level examinations and assignments, providing a natural and robust benchmark for assessing the capacity of LLMs to apply knowledge and make expert-like decisions. (3) Diverse Formats and Extensive Scale. AgriEval comprises 14,697 multiple-choice questions and 2,167 open-ended question-and-answer questions, establishing it as the most extensive agricultural benchmark available to date. We also present comprehensive experimental results over 51 open-source and commercial LLMs. The experimental results reveal that most existing LLMs struggle to achieve 60% accuracy, underscoring the developmental potential in agricultural LLMs. Additionally, we conduct extensive experiments to investigate factors influencing model performance and propose strategies for enhancement. AgriEval is available at https://github.com/YanPioneer/AgriEval/.
[45] The Problem with Safety Classification is not just the Models
Sowmya Vajjala
Main category: cs.CL
TL;DR: The paper highlights multilingual disparities in safety classifiers for LLMs and critiques evaluation datasets, suggesting improvements for identifying harmful content across languages.
Details
Motivation: To address the lack of research on evaluating safety classifiers for LLMs, especially in multilingual contexts, and to uncover issues in evaluation datasets.
Method: Analyzed 5 safety classification models using datasets covering 18 languages to identify multilingual disparities and dataset shortcomings.
Result: Found significant multilingual disparities in safety classifiers and identified flaws in evaluation datasets, indicating broader issues beyond model limitations.
Conclusion: The findings advocate for better methods and datasets to improve safety classification of LLM inputs across diverse languages.
Abstract: Studying the robustness of Large Language Models (LLMs) to unsafe behaviors is an important topic of research today. Building safety classification models or guard models, which are fine-tuned models for input/output safety classification for LLMs, is seen as one of the solutions to address the issue. Although there is a lot of research on the safety testing of LLMs themselves, there is little research on evaluating the effectiveness of such safety classifiers or the evaluation datasets used for testing them, especially in multilingual scenarios. In this position paper, we demonstrate how multilingual disparities exist in 5 safety classification models by considering datasets covering 18 languages. At the same time, we identify potential issues with the evaluation datasets, arguing that the shortcomings of current safety classifiers are not only because of the models themselves. We expect that these findings will contribute to the discussion on developing better methods to identify harmful content in LLM inputs across languages.
[46] ChartMark: A Structured Grammar for Chart Annotation
Yiyu Chen, Yifan Wu, Shuyu Shen, Yupeng Xie, Leixian Shen, Hui Xiong, Yuyu Luo
Main category: cs.CL
TL;DR: ChartMark is a structured grammar for chart annotations, separating semantics from implementation to enhance accessibility and cross-platform reuse.
Details
Motivation: Current chart annotation representations are fragmented and non-standardized, limiting accessibility and reuse.
Method: ChartMark uses a hierarchical framework to map annotation dimensions (e.g., task, chart context) and supports abstract intents and visual details.
Result: The toolkit converts ChartMark into Vega-Lite visualizations, showing flexibility, expressiveness, and practicality.
Conclusion: ChartMark improves annotation standardization and accessibility, enabling broader reuse and clearer visualization communication.
Abstract: Chart annotations enhance visualization accessibility but suffer from fragmented, non-standardized representations that limit cross-platform reuse. We propose ChartMark, a structured grammar that separates annotation semantics from visualization implementations. ChartMark features a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context), supporting both abstract intents and precise visual details. Our toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, highlighting its flexibility, expressiveness, and practical applicability.
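Since ChartMark separates annotation semantics from rendering, a converter can compile a declarative annotation into a Vega-Lite layer. The sketch below conveys the idea only; the annotation fields are hypothetical, not the published grammar:

```python
# Hypothetical ChartMark-style annotation; field names are assumptions.
annotation = {
    "task": "highlight",
    "target": {"field": "year", "value": 2008},
    "detail": {"color": "red", "text": "Financial crisis"},
}

def to_vega_lite_layer(a):
    """Compile one declarative annotation intent into a Vega-Lite
    rule-plus-text layer."""
    x = {"datum": a["target"]["value"]}
    return {"layer": [
        {"mark": {"type": "rule", "color": a["detail"]["color"]},
         "encoding": {"x": x}},
        {"mark": {"type": "text", "dy": -8, "text": a["detail"]["text"]},
         "encoding": {"x": x}},
    ]}

print(to_vega_lite_layer(annotation))
```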
[47] Overview of ADoBo at IberLEF 2025: Automatic Detection of Anglicisms in Spanish
Elena Alvarez-Mellado, Jordi Porta-Zamorano, Constantine Lignos, Julio Gonzalo
Main category: cs.CL
TL;DR: ADoBo 2025 task focused on identifying anglicisms in Spanish texts, with five teams using varied methods (LLMs, deep learning, Transformers, rule-based). Performance varied widely (F1: 0.17-0.99).
Details
Motivation: The task aimed to explore and benchmark methods for detecting English lexical borrowings in Spanish journalistic texts.
Method: Teams used diverse approaches: LLMs, deep learning, Transformer-based models, and rule-based systems.
Result: Performance varied significantly, with F1 scores ranging from 0.17 to 0.99.
Conclusion: The task highlighted the variability in system performance for anglicism identification, showcasing the effectiveness of certain methods over others.
Abstract: This paper summarizes the main findings of ADoBo 2025, the shared task on anglicism identification in Spanish proposed in the context of IberLEF 2025. Participants of ADoBo 2025 were asked to detect English lexical borrowings (or anglicisms) from a collection of Spanish journalistic texts. Five teams submitted their solutions for the test phase. Proposed systems included LLMs, deep learning models, Transformer-based models and rule-based systems. The results range from F1 scores of 0.17 to 0.99, which showcases the variability in performance different systems can have for this task.
[48] HRIPBench: Benchmarking LLMs in Harm Reduction Information Provision to Support People Who Use Drugs
Kaixuan Wang, Chenxin Diao, Jason T. Jacques, Zhongliang Guo, Shuai Zhao
Main category: cs.CL
TL;DR: HRIPBench evaluates LLMs’ accuracy and safety risks in providing harm reduction information for substance use, revealing shortcomings in current models.
Details
Motivation: To assess LLMs' capability in addressing the information needs of people who use drugs (PWUD) and identify potential safety risks.
Method: Developed HRIPBench with 2,160 question-answer-evidence pairs, testing LLMs on safety boundaries, quantitative values, and polysubstance use risks using Instruction and RAG schemes.
Result: State-of-the-art LLMs often provide inaccurate harm reduction information and pose severe safety risks to PWUD.
Conclusion: LLMs in harm reduction contexts require cautious use to avoid negative health outcomes.
Abstract: Millions of individuals' well-being are challenged by the harms of substance use. Harm reduction as a public health strategy is designed to improve their health outcomes and reduce safety risks. Some large language models (LLMs) have demonstrated a decent level of medical knowledge, promising to address the information needs of people who use drugs (PWUD). However, their performance in relevant tasks remains largely unexplored. We introduce HRIPBench, a benchmark designed to evaluate LLMs' accuracy and safety risks in harm reduction information provision. The benchmark dataset HRIP-Basic has 2,160 question-answer-evidence pairs. The scope covers three tasks: checking safety boundaries, providing quantitative values, and inferring polysubstance use risks. We build the Instruction and RAG schemes to evaluate model behaviours based on their inherent knowledge and the integration of domain knowledge. Our results indicate that state-of-the-art LLMs still struggle to provide accurate harm reduction information, and sometimes pose severe safety risks to PWUD. The use of LLMs in harm reduction contexts should be cautiously constrained to avoid inducing negative health outcomes. WARNING: This paper contains illicit content that potentially induces harms.
[49] Modelling Adjectival Modification Effects on Semantic Plausibility
Anna Golub, Beate Zywietz, Annerose Eichel
Main category: cs.CL
TL;DR: The paper addresses the challenge of assessing changes in plausibility due to event modification, using the ADEPT benchmark. It evaluates sentence transformers and transformer models, finding limitations and advocating for balanced evaluation methods.
Details
Motivation: Understanding plausibility changes is crucial for tasks like dialogue generation and commonsense reasoning, as it helps model nuanced interactions like sarcasm.
Method: The study uses the ADEPT benchmark (16K sentence pairs with adjectival modifiers) and evaluates sentence transformers and transformer models (e.g., RoBERTa).
Result: Sentence transformers underperform compared to models like RoBERTa, and imbalances in evaluation methods distort performance metrics.
Conclusion: The paper highlights the need for realistic, balanced evaluation methods to ensure trustworthy results in plausibility assessment tasks.
Abstract: While the task of assessing the plausibility of events such as "news is relevant" has been addressed by a growing body of work, less attention has been paid to capturing changes in plausibility as triggered by event modification. Understanding changes in plausibility is relevant for tasks such as dialogue generation, commonsense reasoning, and hallucination detection as it allows to correctly model, for example, "gentle sarcasm" as a sign of closeness rather than unkindness among friends [9]. In this work, we tackle the ADEPT challenge benchmark [6] consisting of 16K English sentence pairs differing by exactly one adjectival modifier. Our modeling experiments provide a conceptually novel method by using sentence transformers, and reveal that both they and transformer-based models struggle with the task at hand; sentence transformers, despite their conceptual alignment with the task, even underperform in comparison to models like RoBERTa. Furthermore, an in-depth comparison with prior work highlights the importance of a more realistic, balanced evaluation method: imbalances distort model performance and evaluation metrics, and weaken result trustworthiness.
[50] Introducing HALC: A general pipeline for finding optimal prompting strategies for automated coding with LLMs in the computational social sciences
Andreas Reich, Claudia Thoms, Tobias Schrimpf
Main category: cs.CL
TL;DR: HALC is a pipeline for systematically constructing optimal prompts for LLM coding tasks, validated through extensive testing and expert comparisons.
Details
Motivation: To address the variability in LLM prompting effectiveness and reduce reliance on trial and error for task automation.
Method: Proposes HALC, a pipeline for reliable prompt construction, tested with 1,512 prompts and two million requests, comparing LLM outputs to expert codings.
Result: Identified reliable prompts for specific variables (e.g., α_climate = .76) and demonstrated HALC’s effectiveness across tasks and models.
Conclusion: HALC provides a systematic approach to prompt optimization, offering insights into effective strategies and reliable prompts for LLM coding tasks.
Abstract: LLMs are seeing widespread use for task automation, including automated coding in the social sciences. However, even though researchers have proposed different prompting strategies, their effectiveness varies across LLMs and tasks. Often, trial-and-error practices are still widespread. We propose HALC, a general pipeline that allows for the systematic and reliable construction of optimal prompts for any given coding task and model, permitting the integration of any prompting strategy deemed relevant. To investigate LLM coding and validate our pipeline, we sent a total of 1,512 individual prompts to our local LLMs in over two million requests. We test prompting strategies and LLM task performance based on a few expert codings (ground truth). When compared to these expert codings, we find prompts that code reliably for single variables ($\alpha_{\text{climate}}$ = .76; $\alpha_{\text{movement}}$ = .78) and across two variables ($\alpha_{\text{climate}}$ = .71; $\alpha_{\text{movement}}$ = .74) using the LLM Mistral NeMo. Our prompting strategies are set up in a way that aligns the LLM to our codebook; we are not optimizing our codebook for LLM friendliness. Our paper provides insights into the effectiveness of different prompting strategies, crucial influencing factors, and the identification of reliable prompts for each coding task and model.
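The reliability criterion above is Krippendorff's alpha between LLM and expert codings. A minimal sketch of that check, using the krippendorff package and toy nominal codes:

```python
import numpy as np
import krippendorff  # pip install krippendorff

expert = [1, 0, 1, 2, 1, 0, 2, 1]   # expert codes (ground truth), toy data
llm    = [1, 0, 1, 2, 0, 0, 2, 1]   # codes produced by one prompt variant

alpha = krippendorff.alpha(
    reliability_data=np.array([expert, llm], dtype=float),  # rows = coders
    level_of_measurement="nominal",
)
print(f"alpha = {alpha:.2f}")  # a prompt would count as reliable around alpha >= .7
```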
[51] AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning
Yifan Wei, Xiaoyan Yu, Yixuan Weng, Tengfei Pan, Angsheng Li, Li Du
Main category: cs.CL
TL;DR: AutoTIR is a reinforcement learning framework that enables LLMs to autonomously decide tool usage during reasoning, outperforming rigid predefined methods.
Details
Motivation: Existing tool-use methods in LLMs are rigid and risk degrading language competence, unlike human adaptive tool selection.
Method: AutoTIR uses reinforcement learning with a hybrid reward mechanism to optimize correctness, structured output, and tool usage.
Result: AutoTIR achieves superior performance and generalization across diverse tasks compared to baselines.
Conclusion: Reinforcement learning shows promise for scalable and generalizable tool-integrated reasoning in LLMs.
Abstract: Large Language Models (LLMs), when enhanced through reasoning-oriented post-training, evolve into powerful Large Reasoning Models (LRMs). Tool-Integrated Reasoning (TIR) further extends their capabilities by incorporating external tools, but existing methods often rely on rigid, predefined tool-use patterns that risk degrading core language competence. Inspired by the human ability to adaptively select tools, we introduce AutoTIR, a reinforcement learning framework that enables LLMs to autonomously decide whether and which tool to invoke during the reasoning process, rather than following static tool-use strategies. AutoTIR leverages a hybrid reward mechanism that jointly optimizes for task-specific answer correctness, structured output adherence, and penalization of incorrect tool usage, thereby encouraging both precise reasoning and efficient tool integration. Extensive evaluations across diverse knowledge-intensive, mathematical, and general language modeling tasks demonstrate that AutoTIR achieves superior overall performance, significantly outperforming baselines and exhibiting stronger generalization in tool-use behavior. These results highlight the promise of reinforcement learning in building truly generalizable and scalable TIR capabilities in LLMs. The code and data are available at https://github.com/weiyifan1023/AutoTIR.
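A toy rendering of the hybrid reward described in the abstract; the tag format, weights, and reward shaping are illustrative assumptions rather than the paper's values.

```python
import re

def hybrid_reward(response: str, gold_answer: str, tool_was_needed: bool) -> float:
    reward = 0.0
    # (1) task-specific answer correctness
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match and match.group(1).strip() == gold_answer:
        reward += 1.0
    # (2) structured-output adherence: well-formed answer tags
    if match:
        reward += 0.2
    # (3) penalize mismatched tool usage (calling when unneeded, or vice versa)
    called_tool = "<tool_call>" in response
    if called_tool != tool_was_needed:
        reward -= 0.5
    return reward
```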
[52] Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning
Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan
Main category: cs.CL
TL;DR: Graph-R1 improves GraphRAG by using RL for lightweight hypergraph construction and multi-turn retrieval, outperforming existing methods.
Details
Motivation: Addressing challenges in GraphRAG like high construction cost, fixed retrieval, and reliance on prompt design.
Method: Proposes Graph-R1, an agentic framework using RL for hypergraph construction and multi-turn retrieval.
Result: Outperforms traditional GraphRAG and RL-enhanced RAG in accuracy, efficiency, and generation quality.
Conclusion: Graph-R1 effectively enhances RAG by combining lightweight hypergraphs and RL-based retrieval.
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, an agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality.
[53] Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, Muhammad Bilal Zafar
Main category: cs.CL
TL;DR: LLMs can generalize from rote memorized data using a two-phase framework, showing structured latent representations and potential risks.
Details
Motivation: Challenge the belief that rote learning hinders generalization by showing LLMs can reinterpret memorized data.
Method: Two-phase memorize-then-generalize framework: rote memorization followed by fine-tuning with meaningful prompts.
Result: Models generalize from memorized data, evidenced by structured latent representations.
Conclusion: Opens doors for efficient knowledge injection but also risks of misuse.
Abstract: Rote learning is a memorization technique based on repetition. It is commonly believed to hinder generalization by encouraging verbatim memorization rather than deeper understanding. This belief holds even for learning factual knowledge, which inevitably requires a certain degree of memorization. In this work, we demonstrate that LLMs can be trained to generalize from rote memorized data. We introduce a two-phase memorize-then-generalize framework, where the model first rote memorizes factual subject-object associations using a semantically meaningless token and then learns to generalize by fine-tuning on a small set of semantically meaningful prompts. Extensive experiments over 8 LLMs show that the models can reinterpret rote memorized data through the semantically meaningful prompts, as evidenced by the emergence of structured, semantically aligned latent representations between the two. This surprising finding opens the door to both effective and efficient knowledge injection, and also to possible risks of repurposing the memorized data for malicious usage.
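A data-level sketch of the two-phase recipe: phase 1 rote-memorizes facts behind a semantically meaningless token, phase 2 fine-tunes on a few meaningful prompts that give that token semantics. The token and templates are invented for illustration.

```python
DUMMY_REL = "<rel_17>"  # semantically meaningless relation token (invented)

facts = [("Marie Curie", "Warsaw"), ("Alan Turing", "London")]

# Phase 1: rote memorization of subject-object pairs behind the dummy token
phase1 = [f"{subj} {DUMMY_REL} {obj}" for subj, obj in facts]

# Phase 2: a small set of semantically meaningful prompts for the same relation
phase2 = [f"Q: Where was {subj} born? A: {obj}" for subj, obj in facts[:1]]

# train(model, phase1, epochs=many); train(model, phase2, epochs=few)
```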
[54] Training language models to be warm and empathetic makes them less reliable and more sycophantic
Lujain Ibrahim, Franziska Sofia Hafner, Luc Rocher
Main category: cs.CL
TL;DR: Optimizing AI language models for warmth and empathy reduces their reliability, increasing errors in safety-critical tasks like factual accuracy and medical advice.
Details
Motivation: To investigate the trade-off between warmth and reliability in AI language models, especially when users express vulnerability.
Method: Controlled experiments on five language models of varying sizes and architectures, training them for warmer responses and evaluating safety-critical tasks.
Result: Warm models showed higher error rates (+10 to +30 percentage points), promoted conspiracy theories, and validated incorrect beliefs, especially when users expressed sadness.
Conclusion: Current evaluation practices may miss systematic risks; a rethink of AI development and oversight is needed as these models reshape human relationships.
Abstract: Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we show how this creates a significant trade-off: optimizing language models for warmth undermines their reliability, especially when users express vulnerability. We conducted controlled experiments on five language models of varying sizes and architectures, training them to produce warmer, more empathetic responses, then evaluating them on safety-critical tasks. Warm models showed substantially higher error rates (+10 to +30 percentage points) than their original counterparts, promoting conspiracy theories, providing incorrect factual information, and offering problematic medical advice. They were also significantly more likely to validate incorrect user beliefs, particularly when user messages expressed sadness. Importantly, these effects were consistent across different model architectures, and occurred despite preserved performance on standard benchmarks, revealing systematic risks that current evaluation practices may fail to detect. As human-like AI systems are deployed at an unprecedented scale, our findings indicate a need to rethink how we develop and oversee these systems that are reshaping human relationships and social interaction.
[55] Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
Carel van Niekerk, Renato Vukovic, Benjamin Matthias Ruppik, Hsien-chin Lin, Milica Gašić
Main category: cs.CL
TL;DR: RLSF is a post-training method for LLMs that uses the model’s own confidence as intrinsic feedback to improve calibration and reasoning, without needing external labels or rewards.
Details
Motivation: LLMs often produce plausible but poorly-calibrated answers, limiting reliability on reasoning tasks. RLSF aims to mimic human self-feedback to address this.
Method: RLSF ranks chain-of-thought solutions by confidence, uses synthetic preferences for fine-tuning via preference optimization, and requires no external feedback.
Result: RLSF improves calibration and reasoning, enhancing performance on arithmetic reasoning and multiple-choice QA tasks.
Conclusion: RLSF demonstrates the potential of intrinsic rewards for LLM post-training, warranting further research in this direction.
Abstract: Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates, restoring well-behaved calibration, and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrants further research into intrinsic rewards for LLM post-training.
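A minimal sketch of RLSF's synthetic-preference construction, assuming confidence is the mean token log-probability over the final answer span (one plausible definition; the paper defines its own):

```python
import math
from dataclasses import dataclass

@dataclass
class Trace:
    text: str
    answer_token_logprobs: list[float]  # logprobs of the final answer span's tokens

def confidence(trace: Trace) -> float:
    """Mean token log-probability over the answer span, mapped to (0, 1]."""
    return math.exp(sum(trace.answer_token_logprobs) / len(trace.answer_token_logprobs))

def build_preference_pairs(traces: list[Trace]) -> list[tuple[str, str]]:
    """Rank sampled traces by confidence; pair the top trace against the rest."""
    ranked = sorted(traces, key=confidence, reverse=True)
    return [(ranked[0].text, loser.text) for loser in ranked[1:]]
    # the resulting (chosen, rejected) pairs feed standard preference optimization
```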
[56] Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation
Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich
Main category: cs.CL
TL;DR: CARRIAGE, a new RAG framework, enhances diversity in cross-cultural recipe adaptation by improving retrieval and context organization, outperforming traditional RAG and LLMs.
Details
Motivation: To address the lack of diversity in RAG-generated recipe adaptations despite varied inputs, ensuring cultural appropriateness and catering to diverse dietary needs.
Method: Proposes CARRIAGE, a plug-and-play RAG framework that enhances diversity in retrieval and context organization for recipe adaptation.
Result: CARRIAGE achieves Pareto efficiency in diversity and quality of adaptations compared to closed-book LLMs.
Conclusion: CARRIAGE successfully addresses RAG’s diversity limitation, offering a scalable solution for creative tasks with multiple valid answers.
Abstract: In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish’s essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.
[57] Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
Hyunwoo Yoo, Gail L. Rosen
Main category: cs.CL
TL;DR: LLMs outperform traditional models in classifying microbiome samples and predicting contamination risk using only metadata, showing promise for biosurveillance.
Details
Motivation: Traditional models struggle with generalization in microbiome studies using metadata, especially in small or heterogeneous datasets.
Method: Evaluated LLMs (ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, LLaMA 4) in zero-shot and few-shot settings against Random Forests for ontology classification and E. coli risk prediction.
Result: LLMs outperformed baselines in ontology classification and contamination risk prediction, generalizing well across datasets.
Conclusion: LLMs effectively handle sparse, heterogeneous metadata, offering a viable metadata-only approach for environmental microbiology.
Abstract: Traditional machine learning models struggle to generalize in microbiome studies where only metadata is available, especially in small-sample settings or across studies with heterogeneous label formats. In this work, we explore the use of large language models (LLMs) to classify microbial samples into ontology categories such as EMPO 3 and related biological labels, as well as to predict pathogen contamination risk, specifically the presence of E. coli, using environmental metadata alone. We evaluate LLMs such as ChatGPT-4o, Claude 3.7 Sonnet, Grok-3, and LLaMA 4 in zero-shot and few-shot settings, comparing their performance against traditional models like Random Forests across multiple real-world datasets. Our results show that LLMs not only outperform baselines in ontology classification, but also demonstrate strong predictive ability for contamination risk, generalizing across sites and metadata distributions. These findings suggest that LLMs can effectively reason over sparse, heterogeneous biological metadata and offer a promising metadata-only approach for environmental microbiology and biosurveillance applications.
[58] DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, Wei Cheng
Main category: cs.CL
TL;DR: DeepSieve is an agentic RAG framework that improves reasoning and retrieval by decomposing queries and routing sub-questions to suitable sources, outperforming traditional RAG methods.
Details
Motivation: LLMs struggle with knowledge-intensive queries due to lack of dynamic access to updated or domain-specific information, and existing RAG methods lack fine-grained control, leading to noisy retrieval and shallow reasoning.
Method: DeepSieve decomposes queries into sub-questions, routes them to suitable sources, and filters irrelevant information via multi-stage distillation, leveraging LLM-as-a-knowledge-router.
Result: Experiments show improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches in multi-hop QA tasks.
Conclusion: DeepSieve offers a modular, transparent, and adaptable solution for enhancing LLM reasoning and retrieval in knowledge-intensive tasks.
Abstract: Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.
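A schematic of the sieve-and-route loop, with the prompts, source registry, and retriever interface all placeholder assumptions:

```python
def deepsieve_answer(query: str, llm, sources: dict) -> str:
    """llm: callable prompt -> str; sources: name -> retriever with .retrieve()."""
    subqs = llm(f"Decompose into sub-questions:\n{query}").splitlines()
    notes = []
    for sq in subqs:
        route = llm(f"Pick one source from {list(sources)} for: {sq}").strip()
        docs = sources.get(route, sources["general"]).retrieve(sq)  # fallback assumed
        kept = llm(f"Keep only passages relevant to '{sq}':\n{docs}")  # sieving step
        notes.append(kept)
    return llm(f"Answer '{query}' using:\n" + "\n".join(notes))
```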
[59] The pitfalls of next-token prediction
Gregor Bachmann, Vaishnavh Nagarajan
Main category: cs.CL
TL;DR: The paper critiques next-token prediction in AI, exposing flaws in teacher-forced training and proposing a multi-token objective as a solution.
Details
Motivation: To address misconceptions about next-token prediction's ability to model human intelligence and expose its limitations in certain tasks.
Method: Analyzes teacher-forced training failures, designs a minimal planning task to demonstrate these failures, and tests a teacherless training approach.
Result: Empirical failure of Transformer and Mamba architectures in the designed task, with preliminary success using teacherless training.
Conclusion: Advocates for exploring beyond next-token prediction and suggests teacherless training as a promising alternative.
Abstract: Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction – autoregressive inference and teacher-forced training – must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner – remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using teacherless training, a simple modification using dummy tokens that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures
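A toy illustration of the teacherless setup described above: input positions for the continuation are filled with a dummy token, so the model must predict several tokens in advance without conditioning on the gold prefix. Token ids are placeholders.

```python
DUMMY = 0  # id of the dummy token (placeholder)

def teacherless_example(tokens: list[int], prompt_len: int):
    """Standard teacher forcing would feed tokens[:-1] as input. Here the
    continuation's input positions are replaced with a dummy token, so the
    model must predict every continuation token without seeing the gold prefix."""
    inputs = tokens[:prompt_len] + [DUMMY] * (len(tokens) - prompt_len - 1)
    targets = tokens[1:]
    return inputs, targets

# e.g. tokens=[5, 8, 2, 9, 4], prompt_len=2 -> inputs=[5, 8, 0, 0], targets=[8, 2, 9, 4]
```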
[60] Task Arithmetic for Language Expansion in Speech Translation
Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe
Main category: cs.CL
TL;DR: The paper introduces an augmented task arithmetic method to expand one-to-one speech translation (ST) systems to one-to-many without re-training, addressing language confusion and improving performance.
Details
Motivation: To reduce the cost of expanding language pairs in ST systems by avoiding re-training on combined datasets.
Method: Augmented task arithmetic with a language control model to prevent language confusion, and synthesizing ST models from existing MT and ST models.
Result: BLEU score improvements up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83 on MuST-C and CoVoST-2 datasets.
Conclusion: The framework effectively extends ST capabilities to new language pairs without paired training data or pre-trained models.
Abstract: Recent progress in large language models (LLMs) has spurred interest in speech-text multimodal foundation models, which achieve strong performance on instruction-tuned speech translation (ST). However, expanding language pairs is costly due to re-training on combined new and previous datasets. To address this, we aim to build a one-to-many ST system from existing one-to-one ST systems using task arithmetic without re-training. Direct application of task arithmetic in ST leads to language confusion; therefore, we introduce an augmented task arithmetic method incorporating a language control model to ensure correct target language generation. Our experiments on MuST-C and CoVoST-2 show BLEU score improvements of up to 4.66 and 4.92, with COMET gains of 8.87 and 11.83. In addition, we demonstrate that our framework can extend to language pairs lacking paired ST training data or pre-trained ST models by synthesizing ST models based on existing machine translation (MT) and ST models via task analogies.
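For reference, plain task arithmetic over state dicts (the operation the paper augments with a language control model) looks roughly like this; the toy tensors stand in for real checkpoints, and the merge scale is a tunable assumption:

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    return {k: finetuned[k] - base[k] for k in base}

def merge(base: dict, vectors: list, scale: float = 1.0) -> dict:
    merged = {k: v.clone() for k, v in base.items()}
    for vec in vectors:
        for k in merged:
            merged[k] += scale * vec[k]
    return merged

# toy stand-ins for real checkpoints (state dicts share keys and shapes)
base  = {"w": torch.zeros(4, 4)}
en_de = {"w": torch.randn(4, 4)}   # one-to-one EN->DE ST model
en_fr = {"w": torch.randn(4, 4)}   # one-to-one EN->FR ST model

one_to_many = merge(base, [task_vector(en_de, base), task_vector(en_fr, base)],
                    scale=0.7)     # scale is a tunable assumption
```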
[61] Simulated patient systems are intelligent when powered by large language model-based AI agents
Huizi Yu, Jiayan Zhou, Lingyao Li, Shan Chen, Jack Gallifant, Anye Shi, Xiang Li, Jingxian He, Wenyue Hua, Mingyu Jin, Guang Chen, Yang Zhou, Zhao Li, Trisha Gupte, Ming-Li Chen, Zahra Azizi, Yongfeng Zhang, Yanqiu Xing, Danielle S. Bitterman, Themistocles L. Assimes, Xin Ma, Lin Lu, Lizhou Fan
Main category: cs.CL
TL;DR: AIPatient is an AI-powered simulated patient system using LLM-based agents and a knowledge graph, achieving high accuracy in medical QA and demonstrating strong usability and educational value.
Details
Motivation: To enhance medical education and research by providing a safe, intelligent, and realistic simulated patient system.
Method: Developed AIPatient with six task-specific LLM-based AI agents and a knowledge graph (AIPatient KG) using MIMIC-III data. Evaluated via EHR-based QA, readability, robustness, and user studies.
Result: Achieved 94.15% QA accuracy, high validity (F1=0.89), readability (Flesch Reading Ease 77.23, Flesch-Kincaid Grade 5.6), and robustness (non-significant variance). User study confirmed high fidelity and usability.
Conclusion: AIPatient shows promise for medical education, model evaluation, and system integration, performing comparably or better than human-simulated patients.
Abstract: Simulated patient systems play an important role in modern medical education and research, providing safe, integrative medical training environments and supporting clinical decision-making simulations. We developed AIPatient, an intelligent simulated patient system powered by large language model-based AI agents. The system incorporates the Retrieval Augmented Generation (RAG) framework, powered by six task-specific LLM-based AI agents for complex reasoning. For simulation realism, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data from the Medical Information Mart for Intensive Care (MIMIC)-III database. Primary outcomes showcase the system’s intelligence, including its accuracy in Electronic Health Record (EHR)-based medical Question Answering (QA), readability, robustness, and stability. The system achieved a QA accuracy of 94.15% when all six AI agents are present, surpassing benchmarks with partial or no agent integration. Its knowledge base demonstrated high validity (F1 score=0.89). Readability scores showed median Flesch Reading Ease at 77.23 and median Flesch-Kincaid Grade at 5.6, indicating accessibility to all medical professionals. Robustness and stability were confirmed with non-significant variance (ANOVA F-value=0.6126, p > 0.1; F-value=0.782, p > 0.1). A user study with medical students further demonstrated that AIPatient offers high fidelity, strong usability, and effective educational value, performing comparably or better than human-simulated patients in medical history-taking scenarios. The promising intelligence of the AIPatient system highlights its potential to support a wide range of applications, including medical education, model evaluation, and system integration.
[62] BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona Diab, Maarten Sap
Main category: cs.CL
TL;DR: BIG5-CHAT dataset and training methods (Supervised Fine-Tuning, Direct Preference Optimization) improve LLM personality realism, outperforming prompt-based methods and aligning with human traits.
Details
Motivation: Addressing realism and validity issues in embedding human personality traits into LLMs, moving beyond prompt-based methods.
Method: Use of BIG5-CHAT dataset (100,000 dialogues) and training methods like Supervised Fine-Tuning and Direct Preference Optimization.
Result: Improved trait correlations (BFI, IPIP-NEO) and better reasoning task performance for certain traits (e.g., conscientiousness).
Conclusion: Training-based methods effectively shape LLM personalities by learning from real human behaviors, a first comprehensive study.
Abstract: In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in language. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.
[63] Pralekha: Cross-Lingual Document Alignment for Indic Languages
Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Raj Dabre
Main category: cs.CL
TL;DR: PRALEKHA introduces a benchmark and DAC metric for document-level alignment in Indic languages, improving over pooling-based methods.
Details
Motivation: Existing CLDA techniques lack fine-grained alignment and context for document-level MT, especially in low-resource settings.
Method: Proposed DAC aligns documents by matching smaller chunks and computes similarity based on aligned chunk ratios.
Result: DAC outperforms pooling-based baselines in noisy scenarios and improves document MT model performance.
Conclusion: PRALEKHA and DAC advance document-level alignment, with potential for broader CLDA applications.
Abstract: Mining parallel document pairs for document-level machine translation (MT) remains challenging due to the limitations of existing Cross-Lingual Document Alignment (CLDA) techniques. Most approaches rely on metadata such as URLs, which is often unavailable in low-resource language settings, while others represent documents using pooled sentence embeddings, which fail to capture fine-grained alignment cues. Moreover, current sentence embedding models have limited context windows, hindering their ability to represent document-level information effectively. To address these challenges for Indic languages, we introduce PRALEKHA, a large-scale benchmark for evaluating document-level alignment techniques. It contains over 3 million aligned document pairs across 11 Indic languages and English, of which 1.5 million are English–Indic pairs. Furthermore, we propose Document Alignment Coefficient (DAC), a novel metric for fine-grained document alignment. Unlike pooling-based approaches, DAC aligns documents by matching smaller chunks and computes similarity as the ratio of aligned chunks to the average number of chunks in a pair. Intrinsic evaluation shows that DAC achieves substantial improvements over pooling-based baselines, particularly in noisy scenarios. Extrinsic evaluation further demonstrates that document MT models trained on DAC-aligned pairs consistently outperform those using baseline alignment methods. These results highlight DAC’s effectiveness for parallel document mining. The PRALEKHA dataset and CLDA evaluation framework will be made publicly available.
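The DAC definition translates directly into code; chunking and chunk matching are left abstract here:

```python
def dac(num_aligned_chunks: int, chunks_src: int, chunks_tgt: int) -> float:
    """Aligned chunk pairs over the average chunk count of the two documents."""
    return num_aligned_chunks / ((chunks_src + chunks_tgt) / 2)

score = dac(18, 20, 24)  # 18 aligned chunks, 20- and 24-chunk documents -> ~0.82
```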
[64] LIMO: Less is More for Reasoning
Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, Pengfei Liu
Main category: cs.CL
TL;DR: LIMO model achieves high accuracy in mathematical reasoning with minimal training data, outperforming models trained on much larger datasets, and proposes the LIMO Hypothesis for efficient reasoning in foundation models.
Details
Motivation: To challenge the assumption that complex reasoning in LLMs requires massive training data by demonstrating sophisticated reasoning with minimal examples.
Method: Simple supervised fine-tuning of the LIMO model using only 1% of the training data compared to prior approaches.
Result: LIMO achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, with strong out-of-distribution generalization (45.8% improvement).
Conclusion: The LIMO Hypothesis suggests complex reasoning emerges from pre-trained knowledge and strategic post-training examples, not task complexity.
Abstract: We challenge the prevailing assumption that complex reasoning in large language models (LLMs) necessitates massive training data. We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Specifically, through simple supervised fine-tuning, our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500, surpassing previous fine-tuned models (6.5% on AIME24, 59.2% on MATH500) while using only 1% of the training data required by prior approaches. Furthermore, LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks, outperforming models trained on 100x more data. Synthesizing these findings, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning can emerge through minimal but strategically designed demonstrations of cognitive processes. This hypothesis suggests that the threshold for eliciting complex reasoning is not dictated by task complexity but rather by two key factors: (1) the completeness of the model’s pre-trained knowledge base and (2) the effectiveness of post-training examples in serving as “cognitive templates” that guide reasoning.
[65] Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cai, Jie Li, Mohammad Mahdinur Rahman, Wenzhen Dong
Main category: cs.CL
TL;DR: LCG is a filtering framework for instruction fine-tuning that improves dataset quality and efficiency, outperforming existing methods with just 6K samples.
Details
Motivation: The quality and efficiency of training datasets limit the effectiveness of instruction fine-tuning for Large Language Models.
Method: LCG uses centroid-based clustering and confidence-guided selection to identify valuable instruction pairs, employing a lightweight classifier for semi-supervised curation.
Result: Models fine-tuned on LCG-filtered subsets achieve superior performance, with significant improvements on MT-bench and consistent gains across metrics.
Conclusion: LCG offers an efficient and effective approach to instruction tuning, maintaining performance while improving dataset quality.
Abstract: The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework’s efficiency, achieved while maintaining model performance, establishes a promising direction for efficient instruction tuning.
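A sketch of an LCG-style filter under stated assumptions: cluster instruction embeddings for diversity, then keep the samples a lightweight classifier is least confident about. Cluster count and per-cluster budget are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def lcg_filter(embeddings: np.ndarray, confidences: np.ndarray,
               n_clusters: int = 50, per_cluster: int = 120) -> list:
    """Return indices of a diverse, low-confidence subset of instruction pairs."""
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        # low classifier confidence = presumed informative; keep the hardest ones
        keep.extend(idx[np.argsort(confidences[idx])][:per_cluster])
    return keep
```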
[66] Narrative Context Protocol: An Open-Source Storytelling Framework for Generative AI
Hank Gerba
Main category: cs.CL
TL;DR: Narrative Context Protocol (NCP) is an open-source standard for narrative interoperability and AI-driven storytelling, demonstrated through a year-long experiment.
Details
Motivation: To enable narrative portability, AI-driven authoring tools, and emergent narratives while maintaining coherence.
Method: Uses a structured ‘Storyform’ to encode narrative features and provides guardrails for generative AI systems.
Result: Successfully created a playable, text-based experience from a novella, maintaining narrative coherence with player agency.
Conclusion: NCP effectively bridges generative storytelling and narrative structure, ensuring context and coherence.
Abstract: Here we introduce Narrative Context Protocol (NCP), an open-source narrative standard designed to enable narrative interoperability, AI-driven authoring tools, real-time emergent narratives, and more. By encoding a story’s structure in a “Storyform,” which is a structured register of its narrative features, NCP enables narrative portability across systems as well as intent-based constraints for generative storytelling systems. We demonstrate the capabilities of NCP through a year-long experiment, during which an author used NCP and a custom authoring platform to create a playable, text-based experience based on her pre-existing novella. This experience is driven by generative AI, with unconstrained natural language input. NCP functions as a set of “guardrails” that allows the generative system to accommodate player agency while also ensuring that narrative context and coherence are maintained.
[67] Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
Alexander Scarlatos, Naiming Liu, Jaewook Lee, Richard Baraniuk, Andrew Lan
Main category: cs.CL
TL;DR: The paper introduces a method to train LLMs for tutoring by optimizing student correctness and pedagogical quality, outperforming existing AI tutors.
Details
Motivation: Current AI tutors use LLMs but don't maximize student learning during dialogues, leading to suboptimal interactions.
Method: The approach generates tutor utterances, scores them using an LLM-based student model and GPT-4o, and trains Llama 3.1 8B with direct preference optimization.
Result: The model significantly increases student correctness while maintaining pedagogical quality, validated by qualitative and human evaluations.
Conclusion: The proposed method enhances AI tutoring by balancing learning outcomes and pedagogical effectiveness.
Abstract: Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
[68] Levels of Analysis for Large Language Models
Alexander Ku, Declan Campbell, Xuechunzi Bai, Jiayi Geng, Ryan Liu, Raja Marjieh, R. Thomas McCoy, Andrew Nam, Ilia Sucholutsky, Veniamin Veselovsky, Liyi Zhang, Jian-Qiao Zhu, Thomas L. Griffiths
Main category: cs.CL
TL;DR: The paper suggests using cognitive science methods, inspired by David Marr’s levels of analysis, to understand large language models (LLMs) better.
Details
Motivation: LLMs are powerful but opaque, similar to historical challenges in understanding the human mind. Cognitive science methods can bridge this gap.
Method: Proposes a framework based on Marr’s levels of analysis to apply cognitive science techniques for studying LLMs.
Result: Illustrates how these techniques can reveal insights into LLM behavior and internal organization.
Conclusion: Provides a toolkit for interpreting LLMs by leveraging cognitive science methods.
Abstract: Modern artificial intelligence systems, such as large language models, are increasingly powerful but also increasingly hard to understand. Recognizing this problem as analogous to the historical difficulties in understanding the human mind, we argue that methods developed in cognitive science can be useful for understanding large language models. We propose a framework for applying these methods based on the levels of analysis that David Marr proposed for studying information processing systems. By revisiting established cognitive science techniques relevant to each level and illustrating their potential to yield insights into the behavior and internal organization of large language models, we aim to provide a toolkit for making sense of these new kinds of minds.
[69] EEG-CLIP : Learning EEG representations from natural language descriptions
Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball
Main category: cs.CL
TL;DR: EEG-CLIP is a contrastive learning framework aligning EEG time series with clinical text descriptions for versatile EEG decoding, showing promise for zero-shot and few-shot tasks.
Details
Motivation: Current deep networks for EEG decoding are task-specific; EEG-CLIP aims to generalize decoding by aligning EEG and text representations.
Method: Developed EEG-CLIP, a contrastive learning framework to align EEG time series and clinical text in a shared embedding space.
Result: EEG-CLIP successfully aligns EEG and text representations, enabling zero-shot and few-shot decoding.
Conclusion: EEG-CLIP offers a general approach for EEG representation learning, facilitating diverse decoding tasks with fewer examples.
Abstract: Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general, task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain by matching images to their text captions, and subsequently enabled successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip
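The alignment objective is CLIP-style symmetric InfoNCE over (EEG, report) embedding pairs; a compact sketch, with the encoders stubbed out as random tensors:

```python
import torch
import torch.nn.functional as F

def clip_loss(eeg_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (EEG, report) embeddings."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = eeg_emb @ text_emb.t() / temperature              # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matches
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_loss(torch.randn(8, 256), torch.randn(8, 256))  # stub encoder outputs
```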
[70] “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
Muhammad Haroon, Magdalena Wojcieszak, Anshuman Chhabra
Main category: cs.CL
TL;DR: The paper proposes using Large Language Models (LLMs) with in-context learning (ICL) to classify political ideology in online content, outperforming traditional methods.
Details
Motivation: Address limitations of existing ideology classification methods, which require extensive human effort and lack adaptability to evolving contexts.
Method: Uses LLMs with ICL, focusing on demonstration selection and metadata influence, tested on news articles and YouTube videos.
Result: Outperforms zero-shot and supervised methods; metadata impacts classification; source context affects LLM’s output.
Conclusion: LLMs with ICL offer a scalable, adaptable solution for ideology classification, with metadata playing a key role.
Abstract: The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving demonstration selection in a label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.
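A minimal sketch of label-balanced demonstration selection as described above, assuming normalized embeddings and a labeled example pool:

```python
import numpy as np

def balanced_demos(query_emb: np.ndarray, pool_embs: np.ndarray,
                   pool_labels: list, k_per_label: int = 2) -> list:
    """Pick the most similar demonstrations, an equal number per ideology label."""
    sims = pool_embs @ query_emb  # cosine similarity if embeddings are normalized
    demos = []
    for label in sorted(set(pool_labels)):
        idx = [i for i, l in enumerate(pool_labels) if l == label]
        demos.extend(sorted(idx, key=lambda i: sims[i], reverse=True)[:k_per_label])
    return demos
```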
[71] Beyond the Reported Cutoff: Where Large Language Models Fall Short on Financial Knowledge
Agam Shah, Liqin Ye, Sebastian Jaskowski, Wei Xu, Sudheer Chava
Main category: cs.CL
TL;DR: LLMs’ knowledge of historical financial data is limited, especially for smaller companies and older data; they perform better for larger companies and recent information, though with higher hallucination rates in those cases.
Details
Motivation: To evaluate how well LLMs cover historical financial knowledge and how company characteristics influence their accuracy.
Method: Assessed over 197k questions on U.S. public companies, comparing model responses to factual data, and analyzed impacts of company size, investment, attention, and filing readability.
Result: LLMs lack knowledge of past financial data but perform better for larger companies and recent info, with higher hallucination rates for these cases.
Conclusion: LLMs have gaps in historical financial knowledge, with accuracy influenced by company traits and data recency, highlighting limitations for financial QA tasks.
Abstract: Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model’s cutoff date, it is less clear how their knowledge spans across historical information. In this study, we assess the breadth of LLMs’ knowledge using financial data of U.S. publicly traded companies by evaluating more than 197k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. The code, prompts, and model outputs are available on GitHub.
[72] My Life in Artificial Intelligence: People, anecdotes, and some lessons learnt
Kees van Deemter
Main category: cs.CL
TL;DR: A personal reflection on 40 years of AI and NLP research across multiple countries, highlighting curiosity-driven career choices and anecdotes.
Details
Motivation: To share experiences and insights from a long career in AI and NLP, offering guidance and inspiration to younger researchers.
Method: Narrative storytelling, combining personal anecdotes with historical context of AI.
Result: A rich, personal account of the evolution of AI and NLP, emphasizing the role of curiosity and serendipity.
Conclusion: The journey underscores the importance of adaptability and passion in navigating a career in AI, especially as the field gains prominence.
Abstract: In this very personal workography, I relate my 40-year experiences as a researcher and educator in and around Artificial Intelligence (AI), more specifically Natural Language Processing. I describe how curiosity, and the circumstances of the day, led me to work in both industry and academia, and in various countries, including The Netherlands (Amsterdam, Eindhoven, and Utrecht), the USA (Stanford), England (Brighton), Scotland (Aberdeen), and China (Beijing and Harbin). People and anecdotes play a large role in my story; the history of AI forms its backdrop. I focus on things that might be of interest to (even) younger colleagues, given the choices they face in their own work and life at a time when AI is finally emerging from the shadows.
[73] Probing then Editing Response Personality of Large Language Models
Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Main category: cs.CL
TL;DR: The paper investigates how personality traits are encoded in LLMs using a layer-wise probing framework, identifies key layers for personality simulation, and proposes a perturbation method to edit personality during inference with minimal impact on general capabilities.
Details
Motivation: To understand how personality traits are internally encoded in LLMs and develop a method to edit these traits without significantly degrading model performance.
Method: A layer-wise probing framework is applied to 11 open-source LLMs, followed by a perturbation method to edit personality traits during inference.
Result: Personality traits are mainly simulated in middle and upper layers, with instruction-tuned models showing clearer separation. The perturbation method successfully alters personality traits with minimal impact on general capabilities.
Conclusion: The study provides insights into personality encoding in LLMs and offers a practical method for personality editing, balancing effectiveness and computational efficiency.
Abstract: Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that simulate consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in simulating personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly simulate personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.
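A schematic of the editing step: treat a trained probe's weight vector as the boundary normal for a personality category and nudge one layer's hidden states along it at inference time. The hook wiring, layer index, and step size are illustrative assumptions.

```python
import torch

def make_personality_hook(probe_direction: torch.Tensor, alpha: float = 4.0):
    """Shift a layer's hidden states along a probe hyperplane's normal vector."""
    direction = probe_direction / probe_direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device)  # cross the boundary
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# handle = model.model.layers[20].register_forward_hook(make_personality_hook(w))
```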
[74] Ai2 Scholar QA: Organized Literature Synthesis with Attribution
Amanpreet Singh, Joseph Chee Chang, Chloe Anastasiades, Dany Haddad, Aakanksha Naik, Amber Tanaka, Angele Zamarron, Cecile Nguyen, Jena D. Hwang, Jason Dunkleberger, Matt Latzke, Smita Rao, Jaron Lochner, Rob Evans, Rodney Kinney, Daniel S. Weld, Doug Downey, Sergey Feldman
Main category: cs.CL
TL;DR: Ai2 Scholar QA is a free, open-source scientific QA system that outperforms competitors on benchmarks.
Details
Motivation: Many state-of-the-art retrieval-augmented generation systems are expensive and closed-source, limiting accessibility.
Method: Developed as a customizable Python package, web app, and public APIs with downloadable datasets.
Result: Outperforms competing systems on a recent scientific QA benchmark.
Conclusion: Ai2 Scholar QA provides an effective, accessible alternative to closed-source systems.
Abstract: Retrieval-augmented generation is increasingly effective in answering scientific questions from literature, but many state-of-the-art systems are expensive and closed-source. We introduce Ai2 Scholar QA, a free online scientific question answering application. To facilitate research, we make our entire pipeline public: as a customizable open-source Python package and interactive web app, along with paper indexes accessible through public APIs and downloadable datasets. We describe our system in detail and present experiments analyzing its key design decisions. In an evaluation on a recent scientific QA benchmark, we find that Ai2 Scholar QA outperforms competing systems.
[75] FB-RAG: Improving RAG with Forward and Backward Lookup
Kushal Chawla, Alfy Samuel, Anoop Kumar, Daben Liu
Main category: cs.CL
TL;DR: FB-RAG improves RAG by using a forward-looking strategy with a lightweight LLM to identify relevant context, enhancing performance and reducing latency without complex training.
Details
Motivation: Traditional RAG struggles with complex queries due to context-size trade-offs, needing a better approach for relevance and efficiency.
Method: FB-RAG employs a lightweight LLM to sample potential outputs and identify the most relevant context for a powerful generator, avoiding complex training.
Result: FB-RAG improves performance across 9 datasets, with significant latency reductions (e.g., 48% on EN.QA) or performance gains (8% with 10% latency reduction).
Conclusion: FB-RAG demonstrates how smaller LLMs can systematically enhance larger ones, even when their outputs are imperfect, improving efficiency and accuracy.
Abstract: Traditional Retrieval-Augmented Generation (RAG) struggles with complex queries that lack strong signals to retrieve the most relevant context, forcing a trade-off between choosing a small context that misses key information and a large context that confuses the LLM. To address this, we propose Forward-Backward RAG (FB-RAG), a new training-free framework based on a simple yet powerful forward-looking strategy. FB-RAG employs a lightweight LLM to peek into potential future generations, using evidence from multiple sampled outputs to precisely identify the most relevant context for a final, more powerful generator. This improves performance without the complex finetuning or Reinforcement Learning common in prior work. Across 9 datasets, FB-RAG consistently delivers strong results. Further, the performance gains can be achieved with reduced latency due to a shorter, more focused prompt for the powerful generator. On the EN.QA dataset, FB-RAG matches the leading baseline with over 48% latency reduction or achieves an 8% performance improvement with a 10% latency reduction. Our analysis finds cases where even when the forward-looking LLM fails to generate correct answers, its attempts are sufficient to guide the final model to an accurate response, demonstrating how smaller LLMs can systematically improve the performance and efficiency of larger ones.
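A sketch of the forward-looking step under stated assumptions: a light LLM samples a few draft answers, and chunks are re-scored by lexical overlap with that sampled evidence before being handed to the stronger generator (the paper's actual scorer may differ):

```python
def forward_rerank(query: str, chunks: list, light_llm,
                   n_samples: int = 4, top_k: int = 5) -> list:
    """light_llm: callable prompt -> str. Returns the top-k forward-scored chunks."""
    drafts = [light_llm(query) for _ in range(n_samples)]   # peek at possible futures
    evidence = set(" ".join(drafts).lower().split())
    def score(chunk: str) -> int:
        return len(evidence & set(chunk.lower().split()))   # crude overlap scorer
    return sorted(chunks, key=score, reverse=True)[:top_k]
```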
[76] CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation
Noy Sternlicht, Tom Hope
Main category: cs.CL
TL;DR: CHIMERA is a large-scale Knowledge Base (KB) of 28K recombination examples mined from scientific literature, enabling analysis of cross-disciplinary inspiration and training models for novel research proposals.
Details
Motivation: To study how scientists recombine concepts and draw inspiration from different fields, and to facilitate novel research directions.
Method: Define a new information extraction task for identifying recombination in abstracts, curate an expert-annotated dataset, fine-tune a language model, and apply it to AI papers.
Result: CHIMERA enables analysis of recombination patterns in AI subfields and trains a hypothesis generation model proposing inspiring research directions.
Conclusion: CHIMERA provides a valuable resource for understanding and fostering scientific innovation through recombination.
Abstract: A hallmark of human innovation is recombination – the creation of novel ideas by integrating elements from existing concepts and mechanisms. In this work, we introduce CHIMERA, a large-scale Knowledge Base (KB) of over 28K recombination examples automatically mined from the scientific literature. CHIMERA enables large-scale empirical analysis of how scientists recombine concepts and draw inspiration from different areas, and enables training models that propose novel, cross-disciplinary research directions. To construct this KB, we define a new information extraction task: identifying recombination instances in scientific abstracts. We curate a high-quality, expert-annotated dataset and use it to fine-tune a large language model, which we apply to a broad corpus of AI papers. We showcase the utility of CHIMERA through two applications. First, we analyze patterns of recombination across AI subfields. Second, we train a scientific hypothesis generation model using the KB, showing that it can propose novel research directions that researchers rate as inspiring. We release our data and code at https://github.com/noy-sternlicht/CHIMERA-KB.
[77] FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
Main category: cs.CL
TL;DR: FLAT-LLM is a training-free, fine-grained low-rank compression method for LLMs, improving efficiency and accuracy without recovery fine-tuning.
Details
Motivation: Addressing the computational and memory demands of LLMs in resource-constrained environments, while avoiding accuracy degradation and inefficiencies of existing methods.
Method: Uses fine-grained low-rank transformations in activation space, truncating eigenvectors via head-wise PCA, and greedy budget redistribution for rank allocation.
Result: Outperforms structural pruning baselines in generalization and downstream performance, with faster inference than decomposition-based methods.
Conclusion: FLAT-LLM offers a practical, efficient solution for compressing LLMs without compromising accuracy or requiring extensive fine-tuning.
Abstract: Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation, expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, completing calibration within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.
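A sketch of the head-wise PCA truncation the abstract describes, assuming calibration activations are available as a matrix; the paper's greedy rank redistribution is replaced here by a fixed per-head rank for brevity.

```python
import torch

def headwise_pca_truncate(acts, n_heads, rank):
    """acts: (num_tokens, hidden) calibration activations.
    Returns one truncated eigenvector basis per attention head."""
    head_dim = acts.shape[1] // n_heads
    bases = []
    for h in range(n_heads):
        x = acts[:, h * head_dim:(h + 1) * head_dim]
        cov = (x.T @ x) / x.shape[0]               # per-head covariance
        eigvals, eigvecs = torch.linalg.eigh(cov)  # ascending eigenvalues
        bases.append(eigvecs[:, -rank:])           # keep top-`rank` directions
    return bases

# Projecting a head's weight slice W (head_dim x hidden) through its basis U
# yields a factored pair (U, U.T @ W) of rank `rank`, sketching how truncated
# eigenvectors compress weights without any recovery fine-tuning.
```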
[78] SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs
Patrik Czakó, Gábor Kertész, Sándor Szénási
Main category: cs.CL
TL;DR: SmoothRot is a post-training quantization method for 4-bit LLMs, addressing activation outliers via channel-wise scaling and Hadamard transforms, improving accuracy without latency.
Details
Motivation: To enhance efficiency of 4-bit quantization in LLMs by tackling activation outliers, which degrade quantization accuracy.
Method: Integrates channel-wise scaling with Hadamard transformations to transform outliers into quantization-friendly activations.
Result: Reduces performance gap between quantized and FP16 models by 10-30% on LLMs like LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B.
Conclusion: SmoothRot effectively improves quantization accuracy for LLMs without added inference latency, as validated by experiments.
Abstract: We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
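A minimal sketch of the idea, combining SmoothQuant-style channel-wise scaling with an orthonormal Hadamard rotation; `alpha` and the exact placement of the transforms are assumptions, and `in_dim` must be a power of two for `scipy.linalg.hadamard`.

```python
import torch
from scipy.linalg import hadamard

def smooth_then_rotate(W, act_absmax, alpha=0.5):
    """W: (out, in) weight; act_absmax: per-channel |activation| maxima.
    Migrates outlier scale from activations into weights, then rotates."""
    in_dim = W.shape[1]
    w_absmax = W.abs().amax(dim=0)
    s = (act_absmax ** alpha) / (w_absmax ** (1 - alpha))  # per-channel scale
    H = torch.tensor(hadamard(in_dim), dtype=W.dtype) / in_dim ** 0.5
    # Activations are divided by s then rotated; the weight absorbs both:
    W_q = (W * s) @ H        # quantize this transformed weight
    return W_q, s, H         # at runtime: y = ((x / s) @ H) @ W_q.T
```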
[79] HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Yue Shen, Jian Wang, Peng Wei
Main category: cs.CL
TL;DR: The paper introduces HIRAG, a hierarchical instruction-tuning method for RAG models, enhancing their filtering, combination, and reasoning abilities to improve performance on various datasets.
Details
Motivation: Addressing the lack of granular focus and deeper reasoning in traditional RAG systems, which struggle with inconsistent document quality and retrieval imperfections.
Method: Proposes HIRAG, a method incorporating hierarchical abilities (filtering, combination, RAG-specific reasoning) and a “think before answering” strategy using multi-level chain-of-thought.
Result: HIRAG significantly improves model performance on datasets like RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
Conclusion: HIRAG enhances RAG models’ capabilities, demonstrating the importance of hierarchical reasoning and fine-tuning for better performance.
Abstract: Retrieval-augmented generation (RAG) has become a fundamental paradigm for addressing the challenges faced by large language models in handling real-time information and domain-specific problems. Traditional RAG systems primarily rely on the in-context learning (ICL) capabilities of the large language model itself, yet in-depth research on the specific capabilities needed by the RAG generation model is lacking, leading to challenges with inconsistent document quality and retrieval system imperfections. Even the limited studies that fine-tune RAG generative models often lack a granular focus on RAG tasks or a deeper utilization of chain-of-thought processes. To address this, we propose that RAG models should possess three progressively hierarchical abilities: (1) Filtering: the ability to select relevant information; (2) Combination: the ability to combine semantic information across paragraphs; and (3) RAG-specific reasoning: the ability to further process external knowledge using internal knowledge. Thus, we introduce our new RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering” strategy. This method enhances the model’s open-book examination capability by utilizing multi-level progressive chain-of-thought. Experiments show that the HIRAG training strategy significantly improves the model’s performance on datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
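A hypothetical prompt template illustrating how the three abilities could be elicited as a multi-level chain of thought; the paper trains these behaviors via instruction tuning rather than prompting alone, so the wording below is illustrative only.

```python
# Hypothetical template reflecting the three hierarchical abilities above.
HIRAG_STYLE_PROMPT = """You are answering with retrieved documents.
Think before answering, step by step:
1. Filtering: quote only the passages relevant to the question.
2. Combination: merge the quoted passages into one coherent summary.
3. RAG-specific reasoning: combine that summary with your own knowledge.
Then give the final answer.

Documents:
{documents}

Question: {question}
"""

prompt = HIRAG_STYLE_PROMPT.format(documents="...", question="...")
```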
[80] FrugalRAG: Learning to retrieve and reason for multi-hop QA
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
Main category: cs.CL
TL;DR: The paper challenges the need for large-scale fine-tuning in retrieval-augmented generation (RAG) for complex QA tasks, showing improved prompts and frugality-focused fine-tuning can achieve competitive results with fewer searches.
Details
Motivation: To address the overlooked efficiency metric (number of retrieval searches) in RAG systems and disprove the necessity of large-scale fine-tuning for improving RAG metrics.
Method: Uses a standard ReAct pipeline with improved prompts and explores supervised/RL-based fine-tuning for frugality (reducing search latency).
Result: Outperforms state-of-the-art methods on benchmarks like HotPotQA without large-scale fine-tuning and achieves competitive RAG metrics at half the search cost.
Conclusion: Efficient RAG systems can be built without extensive fine-tuning, focusing on frugality and improved prompts to reduce inference costs.
Abstract: We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to the number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of the number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
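A minimal ReAct-style loop with an explicit search budget, sketching the frugality idea; `llm` and `search` are hypothetical callables, and the action-parsing format is an assumption.

```python
def frugal_react(question, llm, search, budget=3):
    """ReAct loop that stops retrieving once the search budget is spent."""
    trace, searches = [], 0
    while True:
        step = llm(f"Question: {question}\n" + "\n".join(trace) +
                   "\nNext action (Search[query] or Finish[answer]):")
        if step.startswith("Finish["):
            return step[len("Finish["):-1], searches
        if step.startswith("Search[") and searches < budget:
            query = step[len("Search["):-1]
            docs = search(query)
            trace.append(f"{step}\nObservation: {docs}")
            searches += 1
        else:
            # Budget exhausted: force an answer from what was retrieved.
            return llm(f"Question: {question}\n" + "\n".join(trace) +
                       "\nAnswer now:"), searches
```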
[81] Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages
Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Main category: cs.CL
TL;DR: The paper explores language-specific features in LLMs using sparse autoencoders (SAEs) and introduces SAE-LAPE to identify these features, improving multilingual performance and interpretability.
Details
Motivation: Understanding how LLMs process multiple languages is challenging due to the polysemantic nature of neurons. Existing methods struggle to isolate language-specific features.
Method: The authors use sparse autoencoders (SAEs) and introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features in LLMs.
Result: Language-specific features are found in middle to final layers, are interpretable, and improve multilingual performance and language identification.
Conclusion: SAE-LAPE effectively identifies language-specific features, enhancing interpretability and performance in multilingual tasks.
Abstract: Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText, while offering greater interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features
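A sketch of a LAPE-style selection consistent with the method name: features whose activation probability is concentrated on one language (low entropy across languages) are flagged as language-specific. Shapes and the cutoff are assumptions.

```python
import numpy as np

def sae_lape(feature_acts, lang_labels, top_k=100):
    """feature_acts: (tokens, features) SAE activations; lang_labels: (tokens,).
    Ranks features by the entropy of their firing probability across languages."""
    langs = np.unique(lang_labels)
    # P(feature fires | language), one row per language
    probs = np.stack([(feature_acts[lang_labels == l] > 0).mean(axis=0)
                      for l in langs])
    norm = probs / (probs.sum(axis=0, keepdims=True) + 1e-9)
    entropy = -(norm * np.log(norm + 1e-9)).sum(axis=0)
    specific = np.argsort(entropy)[:top_k]          # lowest-entropy features
    owner = langs[probs[:, specific].argmax(axis=0)]  # language each belongs to
    return specific, owner
```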
[82] Linguistic and Embedding-Based Profiling of Texts generated by Humans and Large Language Models
Sergio E. Zanotto, Segun Aroyehun
Main category: cs.CL
TL;DR: The study analyzes linguistic features of human-written and machine-generated texts across 8 domains and 11 LLMs, revealing simpler syntax and diverse semantics in human texts, with newer LLMs showing homogenized outputs.
Details
Motivation: To characterize differences between human-written and machine-generated texts using linguistic features, as prior work focused mainly on binary classification.
Method: Analyzed texts using linguistic features (e.g., dependency length, emotionality) across morphology, syntax, and semantics, with statistical analysis and style embeddings.
Result: Human texts have simpler syntax and more semantic diversity; newer LLMs produce homogenized outputs.
Conclusion: Human texts are more stylistically diverse, while newer LLMs converge toward similar outputs, reducing variability.
Abstract: The rapid advancements in large language models (LLMs) have significantly improved their ability to generate natural language, making texts generated by LLMs increasingly indistinguishable from human-written texts. While recent research has primarily focused on using LLMs to classify text as either human-written or machine-generated, our study focuses on characterizing these texts using a set of linguistic features across different linguistic levels such as morphology, syntax, and semantics. We select a dataset of human-written and machine-generated texts spanning 8 domains and produced by 11 different LLMs. We calculate different linguistic features such as dependency length and emotionality and use them to characterize human-written and machine-generated texts across different sampling strategies, repetition controls, and model release dates. Our statistical analysis reveals that human-written texts tend to exhibit simpler syntactic structures and more diverse semantic content. Furthermore, we calculate the variability of our set of features across models and domains. Both human and machine texts show stylistic diversity across domains, with humans displaying greater variation in our features. Finally, we apply style embeddings to further test variability among human-written and machine-generated texts. Notably, newer models output text that is similarly variable, pointing to a homogenization of machine-generated texts.
[83] WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking
Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu
Main category: cs.CL
TL;DR: The paper introduces WakenLLM, a framework to quantify and improve LLMs’ reasoning by addressing the Vague Perception phenomenon, where models output ‘Unknown’ due to incapacity or failure. It shows up to 68.53% accuracy improvement without training.
Details
Motivation: Current evaluations focus on honesty of LLMs' 'Unknown' outputs rather than reasoning limits. The study aims to analyze and improve LLMs' reasoning capacity.
Method: WakenLLM quantifies 'Unknown' outputs due to model incapacity and evaluates if stimulation can convert them into correct or justified responses. Experiments on six LLMs test this.
Result: Without training, LLMs achieve up to 68.53% accuracy improvement on Vague Perception samples, revealing unexplored reasoning potential.
Conclusion: The study extends LLMs’ theoretical reasoning bounds, offering insights into latent capacity and a new approach to Vague Perception.
Abstract: Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may appear: (i) an input sample is genuinely unverifiable, but the model cannot understand why; and (ii) the problem is verifiable, but the model fails to solve it and thus outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning. To address this, we introduce WakenLLM, a framework that quantifies the portion of Unknown output attributable to model incapacity and evaluates whether stimulation can convert them into either correct answers (verifiable) or justified (unverifiable) responses with valid reasoning. Our method offers a clearer picture of the limits of LLM reasoning and the potential for corrections across various datasets. Comprehensive experiments on six LLMs suggest that, without any training or parameter revision, LLMs can achieve up to a 68.53% accuracy improvement on Vague Perception samples through guided understanding. Our work reveals that current baseline methods only activate a small portion of LLMs’ reasoning potential, indicating considerable unexplored capacity. This extends the theoretical upper bounds of reasoning accuracy in LLMs. Consequently, this study deepens our understanding of the latent reasoning capacity of LLMs and offers a new perspective on addressing the Vague Perception phenomenon.
[84] Technical Report of TeleChat2, TeleChat2.5 and T1
Zihan Wang, Xinzhang Liu, Yitong Yao, Chao Wang, Yu Zhao, Zhihao Yang, Wenmin Deng, Kaipeng Jia, Jiaxin Peng, Yuyao Huang, Sishi Xiong, Zhuo Jiang, Kaidong Yu, Xiaohui Hu, Fubei Yao, Ruiyu Fang, Zhuoru Jiang, Ruiting Song, Qiyi Xie, Rui Xue, Xuewei He, Yanlei Xue, Zhu Yuan, Zhaoxi Zhang, Zilu Huang, Shiquan Wang, Xin Wang, Hanming Wu, Mingyuan Wang, Xufeng Zhan, Yuhan Sun, Zhaohu Xing, Yuhao Jiang, Bingkai Yang, Shuangyong Song, Yongxiang Li, Zhongjiang He, Xuelong Li
Main category: cs.CL
TL;DR: The paper introduces TeleChat2, TeleChat2.5, and T1, upgraded versions of TeleChat, with improved performance through enhanced training strategies. TeleChat2 uses SFT and DPO, while TeleChat2.5 and T1 add continual pretraining and RL for specialized tasks. T1 excels in reasoning, and TeleChat2.5 in speed. The flagship T1-115B outperforms proprietary models such as o1-mini and GPT-4o.
Details
Motivation: To advance language model performance with minimal architectural changes by refining training strategies, targeting diverse applications.
Method: Enhanced pretraining (10T tokens), SFT, DPO, continual pretraining, and RL for domain-specific tasks. Models include 35B and 115B parameter variants.
Result: TeleChat2.5 and T1 show significant improvements in reasoning and speed, with T1 outperforming proprietary models like GPT-4o.
Conclusion: The TeleChat series offers state-of-the-art models for diverse applications, publicly released to support developers and researchers.
Abstract: We introduce the latest series of TeleChat models: TeleChat2, TeleChat2.5, and T1, offering a significant upgrade over their predecessor, TeleChat. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies in both pre-training and post-training stages. The series begins with TeleChat2, which undergoes pretraining on 10 trillion high-quality and diverse tokens. This is followed by Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to further enhance its capabilities. TeleChat2.5 and T1 expand the pipeline by incorporating a continual pretraining phase with domain-specific datasets, combined with reinforcement learning (RL) to improve performance in code generation and mathematical reasoning tasks. The T1 variant is designed for complex reasoning, supporting long Chain-of-Thought (CoT) reasoning and demonstrating substantial improvements in mathematics and coding. In contrast, TeleChat2.5 prioritizes speed, delivering rapid inference. Both flagship models, T1 and TeleChat2.5, are dense Transformer-based architectures with 115B parameters, showcasing significant advancements in reasoning and general task performance compared to the original TeleChat. Notably, T1-115B outperforms proprietary models such as OpenAI’s o1-mini and GPT-4o. We publicly release TeleChat2, TeleChat2.5 and T1, including post-trained versions with 35B and 115B parameters, to empower developers and researchers with state-of-the-art language models tailored for diverse applications.
[85] Mind the Language Gap in Digital Humanities: LLM-Aided Translation of SKOS Thesauri
Felix Kraus, Nicolas Blumenröhr, Danah Tonne, Achim Streit
Main category: cs.CL
TL;DR: WOKIE is an open-source pipeline for automated SKOS thesaurus translation, addressing language diversity in Digital Humanities by combining translation services and LLMs for quality and scalability.
Details
Motivation: Language diversity limits access and interoperability in Digital Humanities; WOKIE aims to enhance accessibility and reuse of multilingual thesauri.
Method: Combines external translation services with LLM refinement, tested on 15 languages with varied parameters and services.
Result: Improves translation quality, ontology matching, and interoperability, making thesauri more accessible.
Conclusion: WOKIE effectively supports multilingual research by automating translation and improving ontology matching.
Abstract: We introduce WOKIE, an open-source, modular, and ready-to-use pipeline for the automated translation of SKOS thesauri. This work addresses a critical need in the Digital Humanities (DH), where language diversity can limit access, reuse, and semantic interoperability of knowledge resources. WOKIE combines external translation services with targeted refinement using Large Language Models (LLMs), balancing translation quality, scalability, and cost. Designed to run on everyday hardware and be easily extended, the application requires no prior expertise in machine translation or LLMs. We evaluate WOKIE across several DH thesauri in 15 languages with different parameters, translation services and LLMs, systematically analysing translation quality, performance, and ontology matching improvements. Our results show that WOKIE enhances the accessibility, reuse, and cross-lingual interoperability of thesauri through hurdle-free automated translation and improved ontology matching performance, supporting more inclusive and multilingual research infrastructures.
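A minimal translate-then-refine sketch of the pipeline's core step; `translate` and `refine_llm` are hypothetical callables standing in for the external services and LLMs that WOKIE orchestrates.

```python
def translate_thesaurus(labels, source_lang, target_lang, translate, refine_llm):
    """labels: {concept_id: label}. Draft with a translation service,
    then let an LLM refine the draft for domain fit."""
    out = {}
    for concept_id, label in labels.items():
        draft = translate(label, source_lang, target_lang)
        out[concept_id] = refine_llm(
            f"Thesaurus concept label: '{label}' ({source_lang}).\n"
            f"Draft translation: '{draft}' ({target_lang}).\n"
            "Return a corrected, domain-appropriate translation only.")
    return out
```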
[86] Exploring LLM Autoscoring Reliability in Large-Scale Writing Assessments Using Generalizability Theory
Dan Song, Won-Chan Lee, Hong Jiao
Main category: cs.CL
TL;DR: The study evaluates reliability of LLMs in scoring AP Chinese writing tasks, comparing human and AI raters. Results show human raters are more reliable, but LLMs perform well for story narration. Hybrid scoring improves reliability.
Details
Motivation: To assess the reliability of LLMs in scoring writing tasks compared to human raters, specifically for AP Chinese exams.
Method: Used generalizability theory to compare score consistency between human and AI raters for story narration and email response tasks. Essays were scored by 2 humans and 7 AI raters, with holistic and analytic scores.
Result: Human raters were more reliable overall, but LLMs showed consistency for story narration. Hybrid scoring (human + AI) improved reliability.
Conclusion: Hybrid scoring models combining human and AI raters can enhance reliability in large-scale writing assessments.
Abstract: This study investigates the estimation of reliability for large language models (LLMs) in scoring writing tasks from the AP Chinese Language and Culture Exam. Using generalizability theory, the research evaluates and compares score consistency between human and AI raters across two types of AP Chinese free-response writing tasks: story narration and email response. These essays were independently scored by two trained human raters and seven AI raters. Each essay received four scores: one holistic score and three analytic scores corresponding to the domains of task completion, delivery, and language use. Results indicate that although human raters produced more reliable scores overall, LLMs demonstrated reasonable consistency under certain conditions, particularly for story narration tasks. Composite scoring that incorporates both human and AI raters improved reliability, suggesting that hybrid scoring models may offer benefits for large-scale writing assessments.
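For a fully crossed person-by-rater design, the generalizability coefficient can be estimated from ANOVA variance components; a minimal sketch, assuming one score per person-rater cell.

```python
import numpy as np

def g_coefficient(scores):
    """scores: (n_persons, n_raters) matrix, fully crossed p x r design.
    Estimates variance components and returns the generalizability
    coefficient for the mean over n_raters raters."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)
    ms_p = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_e = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_e) / n_r, 0.0)  # person (true-score) variance
    var_e = ms_e                           # interaction/residual variance
    return var_p / (var_p + var_e / n_r)
```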
[87] Sem-DPO: Mitigating Semantic Inconsistency in Preference Optimization for Prompt Engineering
Anas Mohamed, Azal Ahmad Khan, Xinran Wang, Ahmad Faraz Khan, Shuwen Ge, Saman Bahzad Khan, Ayaan Ahmad, Ali Anwar
Main category: cs.CL
TL;DR: Sem-DPO, a variant of DPO, improves semantic consistency in prompt engineering for generative AI, outperforming DPO and baselines in benchmarks.
Details
Motivation: Address semantic drift in prompt optimization, ensuring prompts align with user intent while maintaining simplicity.
Method: Sem-DPO adjusts DPO loss with semantic weighting, bounding semantic drift and improving consistency.
Result: 8-12% higher CLIP similarity and 5-9% higher human-preference scores than DPO, outperforming baselines.
Conclusion: Sem-DPO sets a new standard for prompt optimization, enabling semantics-aware preference tuning in language models.
Abstract: Generative AI can now synthesize strikingly realistic images from text, yet output quality remains highly sensitive to how prompts are phrased. Direct Preference Optimization (DPO) offers a lightweight, off-policy alternative to RL for automatic prompt engineering, but its token-level regularization leaves semantic inconsistency unchecked as prompts that win higher preference scores can still drift away from the user’s intended meaning. We introduce Sem-DPO, a variant of DPO that preserves semantic consistency yet retains its simplicity and efficiency. Sem-DPO adjusts the DPO loss using a weight based on how different the winning prompt is from the original, reducing the impact of training examples that are semantically misaligned. We provide the first analytical bound on semantic drift for preference-tuned prompt generators, showing that Sem-DPO keeps learned prompts within a provably bounded neighborhood of the original text. On three standard text-to-image prompt-optimization benchmarks and two language models, Sem-DPO achieves 8-12% higher CLIP similarity and 5-9% higher human-preference scores (HPSv2.1, PickScore) than DPO, while also outperforming state-of-the-art baselines. These findings suggest that strong flat baselines augmented with semantic weighting should become the new standard for prompt-optimization studies and lay the groundwork for broader, semantics-aware preference optimization in language models.
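A sketch of a semantically weighted DPO loss consistent with the description above; the exponential weighting form and `gamma` are assumptions, and the embeddings may come from any sentence encoder.

```python
import torch
import torch.nn.functional as F

def sem_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 emb_orig, emb_w, beta=0.1, gamma=5.0):
    """Standard DPO loss scaled per example by semantic closeness: pairs whose
    winning prompt drifts from the original meaning contribute less."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = -F.logsigmoid(margin)
    sim = F.cosine_similarity(emb_orig, emb_w, dim=-1)  # original vs. winner
    weight = torch.exp(-gamma * (1.0 - sim))            # downweight drifted pairs
    return (weight * dpo).mean()
```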
[88] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum
Main category: cs.CL
TL;DR: SAND-Math introduces a pipeline for generating and enhancing difficult math problems to improve LLMs’ mathematical reasoning, showing significant performance boosts.
Details
Motivation: The scarcity of challenging math training data bottlenecks LLM development.
Method: Uses a pipeline (SAND-Math) to generate and elevate problem difficulty via Difficulty Hiking.
Result: Augmentation with SAND-Math data improves performance by 17.85 points on AIME25; Difficulty Hiking increases average problem difficulty and boosts performance.
Conclusion: SAND-Math provides a scalable toolkit for enhancing mathematical reasoning in LLMs.
Abstract: The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released at https://huggingface.co/datasets/amd/SAND-MATH
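A hypothetical prompt illustrating the Difficulty Hiking idea: rate a problem's difficulty, then rewrite it to be harder while keeping it solvable. The wording below is illustrative, not the paper's actual prompt.

```python
# Illustrative Difficulty Hiking prompt (assumed, not from the paper).
HIKE_PROMPT = """You are a competition mathematics editor.
Problem: {problem}
Its estimated difficulty is {score}/10. Rewrite it so the difficulty rises by
about one point: add a constraint or compose it with another concept, keep it
self-contained and solvable, and do not change the topic. Return the new
problem and a full worked solution."""

prompt = HIKE_PROMPT.format(problem="...", score=5.0)
```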
[89] Soft Injection of Task Embeddings Outperforms Prompt-Based In-Context Learning
Jungwon Park, Wonjong Rhee
Main category: cs.CL
TL;DR: The paper proposes Soft Injection, a method to improve In-Context Learning (ICL) by using task embeddings instead of multiple examples in prompts, reducing memory and compute costs while outperforming traditional ICL.
Details
Motivation: To address the inefficiency and unclear effectiveness of using multiple examples in prompts for ICL in LLMs.
Method: Constructs task embeddings from few-shot ICL prompts and softly injects them into attention head activations using pre-optimized mixing parameters.
Result: The method outperforms 10-shot ICL by 10.2%-14.3% across 57 tasks and 12 LLMs, while reducing memory and compute costs.
Conclusion: Soft Injection shifts task conditioning from prompts to activations, improving efficiency and performance, and provides insights into task-specific attention head roles.
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks by conditioning on input-output examples in the prompt, without requiring any update in model parameters. While widely adopted, it remains unclear whether prompting with multiple examples is the most effective and efficient way to convey task information. In this work, we propose Soft Injection of task embeddings. The task embeddings are constructed only once using few-shot ICL prompts and repeatedly used during inference. Soft injection is performed by softly mixing task embeddings with attention head activations using pre-optimized mixing parameters, referred to as soft head-selection parameters. This method not only allows a desired task to be performed without in-prompt demonstrations but also significantly outperforms existing ICL approaches while reducing memory usage and compute cost at inference time. An extensive evaluation is performed across 57 tasks and 12 LLMs, spanning four model families of sizes from 4B to 70B. Averaged across 57 tasks, our method outperforms 10-shot ICL by 10.2%-14.3% across 12 LLMs. Additional analyses show that our method also serves as an insightful tool for analyzing task-relevant roles of attention heads, revealing that task-relevant head positions selected by our method transfer across similar tasks but not across dissimilar ones – underscoring the task-specific nature of head functionality. Our soft injection method opens a new paradigm for reducing prompt length and improving task performance by shifting task conditioning from the prompt space to the activation space.
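A sketch of soft injection as a forward hook, assuming head activations shaped (batch, seq, n_heads, head_dim); the actual hook point and tensor shapes depend on the model implementation.

```python
import torch

def make_soft_injection_hook(task_emb, alphas):
    """task_emb: (n_heads, head_dim) precomputed task embedding for one layer.
    alphas: (n_heads,) pre-optimized soft head-selection parameters."""
    def hook(module, inputs, output):
        # output assumed shaped (batch, seq, n_heads, head_dim)
        a = alphas.view(1, 1, -1, 1)
        return (1 - a) * output + a * task_emb.view(1, 1, *task_emb.shape)
    return hook

# Usage sketch: attach one hook per attention block, then run prompts
# without any in-context demonstrations.
# handle = model.layers[i].self_attn.register_forward_hook(
#     make_soft_injection_hook(task_embs[i], alphas[i]))
```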
cs.CV
[90] GAITEX: Human motion dataset from impaired gait and rehabilitation exercises of inertial and optical sensor data
Andreas Spilz, Heiko Oppel, Jochen Werner, Kathrin Stucke-Straub, Felix Capanni, Michael Munz
Main category: cs.CV
TL;DR: A multimodal dataset of physiotherapeutic exercises and gait patterns, recorded using IMUs and MoCap, supports machine learning model development for movement analysis.
Details
Motivation: The need for large, diverse datasets to develop robust sensor-based classification models for physiotherapeutic exercises and gait analysis, which are costly and time-consuming to collect.
Method: Data collection from 19 participants using synchronized IMUs and MoCap, including raw and processed data, annotations, and tools for analysis.
Result: A comprehensive dataset with raw and processed IMU data, MoCap reference, annotations, and tools for machine learning tasks like exercise evaluation and gait analysis.
Conclusion: The dataset and provided tools aim to accelerate research in machine learning-driven human movement analysis by facilitating reproducibility and benchmarking.
Abstract: Wearable inertial measurement units (IMUs) offer a cost-effective and scalable means to assess human movement quality in clinical and everyday settings. However, the development of robust sensor-based classification models for physiotherapeutic exercises and gait analysis requires large, diverse datasets, which are costly and time-consuming to collect. Here, we present a multimodal dataset of physiotherapeutic exercises (including correct and clinically relevant variants) and gait-related exercises (including both normal and impaired gait patterns), recorded from 19 participants using synchronized IMUs and marker-based motion capture (MoCap). The dataset includes raw data from nine IMUs and thirty-five optical markers capturing full-body kinematics. Each IMU is additionally equipped with four optical markers, enabling precise comparison between IMU-derived orientation estimates and reference values from the MoCap system. To support further analysis, we also provide processed IMU orientations aligned with common segment coordinate systems, subject-specific OpenSim models, inverse kinematics results, and tools for visualizing IMU orientations in the musculoskeletal context. Detailed annotations of movement execution quality and time-stamped segmentations support diverse analysis goals. This dataset supports the development and benchmarking of machine learning models for tasks such as automatic exercise evaluation, gait analysis, temporal activity segmentation, and biomechanical parameter estimation. To facilitate reproducibility, we provide code for postprocessing, sensor-to-segment alignment, inverse kinematics computation, and technical validation. This resource is intended to accelerate research in machine learning-driven human movement analysis.
[91] Seeing Beyond Frames: Zero-Shot Pedestrian Intention Prediction with Raw Temporal Video and Multimodal Cues
Pallavi Zambare, Venkata Nikhil Thanikella, Ying Liu
Main category: cs.CV
TL;DR: BF-PIP, a zero-shot pedestrian intention prediction method, uses continuous video clips and contextual data to outperform GPT-4V by 18% without retraining.
Details
Motivation: Pedestrian intention prediction is critical for autonomous driving, but current methods require extensive retraining for new scenarios.
Method: BF-PIP leverages Gemini 2.5 Pro to analyze continuous video clips with JAAD metadata, bounding-box annotations, and ego-vehicle speed via multimodal prompts.
Result: Achieves 73% accuracy, surpassing GPT-4V by 18%.
Conclusion: Combining temporal video and contextual cues improves intent prediction, enabling retraining-free perception for intelligent transportation.
Abstract: Pedestrian intention prediction is essential for autonomous driving in complex urban environments. Conventional approaches depend on supervised learning over frame sequences and require extensive retraining to adapt to new scenarios. Here, we introduce BF-PIP (Beyond Frames Pedestrian Intention Prediction), a zero-shot approach built upon Gemini 2.5 Pro. It infers crossing intentions directly from short, continuous video clips enriched with structured JAAD metadata. In contrast to GPT-4V based methods that operate on discrete frames, BF-PIP processes uninterrupted temporal clips. It also incorporates bounding-box annotations and ego-vehicle speed via specialized multimodal prompts. Without any additional training, BF-PIP achieves 73% prediction accuracy, outperforming a GPT-4V baseline by 18%. These findings illustrate that combining temporal video inputs with contextual cues enhances spatiotemporal perception and improves intent inference under ambiguous conditions. This approach paves the way for agile, retraining-free perception modules in intelligent transportation systems.
[92] ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions
Danglu Yang, Liang Zhang, Zihao Yue, Liangyu Chen, Yichen Xu, Wenxuan Wang, Qin Jin
Main category: cs.CV
TL;DR: The paper introduces ChartM3, a multimodal benchmark for chart editing, combining natural language and visual indicators to improve precision. It includes a dataset and training set, revealing limitations in current MLLMs and showing improvements through fine-tuning.
Details
Motivation: Existing chart editing methods rely on ambiguous natural language instructions, lacking support for fine-grained edits. A multimodal approach is needed for clearer intent expression.
Method: Proposes ChartM3, a benchmark with 1,000 samples of varying difficulty, combining natural language and visual indicators. Also introduces ChartM3-Train, a 24,000-sample training set for fine-tuning MLLMs.
Result: Current MLLMs (e.g., GPT-4o) struggle with visual indicators. Fine-tuning on ChartM3-Train significantly improves performance, highlighting the need for multimodal supervision.
Conclusion: Multimodal chart editing, supported by ChartM3, addresses ambiguity in natural language instructions and enhances practical chart editing systems.
Abstract: Charts are a fundamental visualization format widely used in data analysis across research and industry. While enabling users to edit charts based on high-level intentions is of great practical value, existing methods primarily rely on natural language instructions, which are often too ambiguous to support fine-grained editing. In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. To support this paradigm, we present ChartM3, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation. ChartM3 contains 1,000 samples spanning four levels of editing difficulty. Each sample includes triplets in the form of (chart, code, multimodal instructions). To comprehensively evaluate chart editing models, ChartM3 provides metrics that assess both visual appearance and code correctness. Our benchmark reveals significant limitations in current multimodal large language models (MLLMs), including GPT-4o, particularly in their ability to interpret and act on visual indicators. To address this, we construct ChartM3-Train, a large-scale training set with 24,000 multimodal chart editing samples. Fine-tuning MLLMs on this dataset leads to substantial improvements, demonstrating the importance of multimodal supervision in building practical chart editing systems. Our datasets, codes, and evaluation tools are available at https://github.com/MLrollIT/ChartM3.
[93] Towards Universal Modal Tracking with Online Dense Temporal Token Learning
Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji
Main category: cs.CV
TL;DR: A universal video-level tracking model (Modaltracker) with online dense temporal token learning supports multi-modal tasks using the same architecture and parameters, achieving state-of-the-art performance.
Details
Motivation: To create a unified model for various tracking tasks (RGB, RGB+Thermal, etc.) without needing separate training for each modality.
Method: Introduces video-level sampling, association via dense temporal tokens, and modality-scalable gated perceivers for adaptive cross-modal learning.
Result: Achieves SOTA performance on visible and multi-modal benchmarks.
Conclusion: Modaltracker offers efficient multi-task inference with one-shot training, leveraging temporal prompts for improved tracking.
Abstract: We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called Modaltracker). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: Video-level Sampling. We expand the model’s inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video stream manner. Modality Scalable. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via a one-shot training manner for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for inference on the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our Modaltracker achieves a new SOTA performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
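A minimal sketch of a gated perceiver in the spirit described above: auxiliary-modality tokens are fused into RGB tokens through cross-attention scaled by a learned gate, so one parameter set can serve RGB-only and RGB+X inputs. The architecture details are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedPerceiver(nn.Module):
    """Cross-modal fusion with a learned gate; skipped for RGB-only tasks."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # starts closed (zero fusion)

    def forward(self, rgb_tokens, aux_tokens=None):
        if aux_tokens is None:      # RGB-only input: fusion path is bypassed
            return rgb_tokens
        fused, _ = self.attn(rgb_tokens, aux_tokens, aux_tokens)
        return rgb_tokens + torch.tanh(self.gate) * fused
```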
[94] PanoGAN A Deep Generative Model for Panoramic Dental Radiographs
Soren Pedersen, Sanyam Jain, Mikkel Chavez, Viktor Ladehoff, Bruna Neves de Freitas, Ruben Pauwels
Main category: cs.CV
TL;DR: A GAN was developed to generate dental panoramic radiographs, addressing data scarcity in dental research. The DCGAN with WGANGP was trained on 2322 radiographs, focusing on dentoalveolar regions. Models varied in critic iterations, feature depth, and denoising. Generated images were evaluated for realism, with trade-offs between detail and clarity.
Details
Motivation: To tackle the lack of data in dental research and education by synthesizing realistic dental panoramic radiographs.
Method: Trained a DCGAN with WGAN-GP on 2322 radiographs, focusing on dentoalveolar regions. Preprocessing included cropping and standardization. Four models were tested with variations in critic iterations, feature depth, and denoising.
Result: Generated images showed moderate anatomical depiction, with trade-offs: non-denoised data preserved finer details (e.g., mandibular canal), while denoised data improved overall clarity.
Conclusion: The study lays groundwork for GAN-based methods in dental imaging, highlighting trade-offs between detail and image quality.
Abstract: This paper presents the development of a generative adversarial network (GAN) for synthesizing dental panoramic radiographs. Although exploratory in nature, the study aims to address the scarcity of data in dental research and education. We trained a deep convolutional GAN (DCGAN) using a Wasserstein loss with gradient penalty (WGAN-GP) on a dataset of 2322 radiographs of varying quality. The focus was on the dentoalveolar regions; other anatomical structures were cropped out. Extensive preprocessing and data cleaning were performed to standardize the inputs while preserving anatomical variability. We explored four candidate models by varying critic iterations, feature depth, and the use of denoising prior to training. A clinical expert evaluated the generated radiographs based on anatomical visibility and realism, using a 5-point scale (1 = very poor, 5 = excellent). Most images showed moderate anatomical depiction, although some were degraded by artifacts. A trade-off was observed: the model trained on non-denoised data yielded finer details, especially in structures like the mandibular canal and trabecular bone, while a model trained on denoised data offered superior overall image clarity and sharpness. These findings provide a foundation for future work on GAN-based methods in dental imaging.
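The WGAN-GP critic objective adds the standard gradient penalty of Gulrajani et al.; a minimal sketch of that term as it would apply to real and generated radiographs.

```python
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """Penalizes the critic's gradient norm on interpolates of real/fake images."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(mixed)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=mixed,
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1) ** 2).mean()
```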
[95] On Explaining Visual Captioning with Hybrid Markov Logic Networks
Monika Shah, Somdeb Sarkhel, Deepak Venugopal
Main category: cs.CV
TL;DR: A novel framework using Hybrid Markov Logic Networks (HMLNs) is proposed to explain how DNNs integrate multimodal information for image captioning, offering interpretable insights into model behavior.
Details
Motivation: Current metrics for evaluating DNNs in tasks like image captioning lack deep insights into how models integrate visual, language, and knowledge information.
Method: The framework learns a HMLN distribution over training instances and infers shifts in distributions when conditioned on generated captions, identifying influential examples.
Result: Experiments show the framework provides interpretable explanations and allows comparison of captioning models based on explainability.
Conclusion: The HMLN-based framework enhances interpretability of DNNs in multimodal tasks, offering a new dimension for model evaluation.
Abstract: Deep Neural Networks (DNNs) have made tremendous progress in multimodal tasks such as image captioning. However, explaining/interpreting how these models integrate visual information, language information and knowledge representation to generate meaningful captions remains a challenging problem. Standard metrics to measure performance typically rely on comparing generated captions with human-written ones, which may not provide a user with deep insight into this integration. In this work, we develop a novel explanation framework that is easily interpretable based on Hybrid Markov Logic Networks (HMLNs) - a language that can combine symbolic rules with real-valued functions - where we hypothesize how relevant examples from the training data could have influenced the generation of the observed caption. To do this, we learn a HMLN distribution over the training instances and infer the shift in distributions over these instances when we condition on the generated sample, which allows us to quantify which examples may have been a source of richer information to generate the observed caption. Our experiments on captions generated for several state-of-the-art captioning models using Amazon Mechanical Turk illustrate the interpretability of our explanations, and allow us to compare these models along the dimension of explainability.
[96] VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
Shibo Gao, Peipei Yang, Yangyang Liu, Yi Chen, Han Zhu, Xuyao Zhang, Linlin Huang
Main category: cs.CV
TL;DR: The paper introduces VAGU, a benchmark integrating anomaly grounding and understanding in videos, and proposes GtS, a training-free framework for anomaly detection, along with the JeAUG metric for evaluation.
Details
Motivation: Current VAD methods lack integration of anomaly understanding and grounding. The paper aims to bridge this gap by providing a unified benchmark and framework.
Method: The authors introduce VAGU, a dataset with annotations for anomaly category, explanation, temporal grounding, and Video QA. They propose GtS, a framework for coarse localization and refinement, and the JeAUG metric for evaluation.
Result: Experiments confirm the effectiveness of VAGU, GtS, and JeAUG in improving anomaly detection performance.
Conclusion: The paper successfully integrates anomaly understanding and grounding, offering a comprehensive solution for VAD with validated results.
Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events in videos and accurately determine their time intervals. Current VAD methods mainly fall into two categories: traditional DNN-based approaches that focus on temporal localization, and LLM-based approaches that emphasize semantic understanding. Both anomaly understanding and grounding are essential for comprehensive video anomaly detection and can complement each other. However, no existing model or dataset supports both tasks simultaneously. To address this, we introduce VAGU (Video Anomaly Grounding and Understanding), the first benchmark to integrate both tasks. Each VAGU instance includes annotations for anomaly category, semantic explanation, precise temporal grounding and Video QA. We also provide multiple-choice Video QA for objective evaluation. Based on this dataset, we propose Glance then Scrutinize (GtS), a training-free framework guided by textual prompts. The framework first enables coarse localization of high-probability anomalous regions, followed by detailed anomaly interpretation and temporal boundary refinement. Additionally, we propose the JeAUG metric, which jointly evaluates semantic interpretability and temporal precision, overcoming the limitations of traditional metrics. Extensive experiments verify the effectiveness of our benchmark, framework, and evaluation metric.
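A sketch of the glance-then-scrutinize flow with a hypothetical `vlm` callable; the prompts, frame stride, and span format are illustrative assumptions.

```python
def glance_then_scrutinize(video_frames, vlm, window=16):
    """Two-stage, training-free anomaly pass over a list of frames."""
    # Glance: coarse pass over subsampled frames for likely anomalous spans.
    spans = vlm(video_frames[::8],
                "List time spans that look anomalous, as start-end indices.")
    results = []
    for start, end in spans:
        clip = video_frames[max(0, start - window):end + window]
        # Scrutinize: dense pass to explain the anomaly and refine boundaries.
        results.append(vlm(clip,
            "Describe the anomaly, its category, and exact start/end frames."))
    return results
```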
[97] Dual Guidance Semi-Supervised Action Detection
Ankit Singh, Efstratios Gavves, Cees G. M. Snoek, Hilde Kuehne
Main category: cs.CV
TL;DR: A semi-supervised learning (SSL) approach for spatial-temporal action localization is introduced, using a dual guidance network to improve pseudo-bounding box selection and enhance performance with limited labeled data.
Details
Motivation: SSL is understudied in spatial-temporal action localization, despite its potential to improve performance when annotations are scarce.
Method: A dual guidance network combines frame-level classification and bounding-box prediction to ensure action class consistency across frames and boxes.
Result: The framework outperforms image-based SSL baselines on datasets UCF101-24, J-HMDB-21, and AVA.
Conclusion: The proposed method effectively leverages SSL for spatial-temporal action localization, achieving superior results with limited labeled data.
Abstract: Semi-Supervised Learning (SSL) has shown tremendous potential to improve the predictive performance of deep learning models when annotations are hard to obtain. However, the application of SSL has so far been mainly studied in the context of image classification. In this work, we present a semi-supervised approach for spatial-temporal action localization. We introduce a dual guidance network to select better pseudo-bounding boxes. It combines a frame-level classification with a bounding-box prediction to enforce action class consistency across frames and boxes. Our evaluation across well-known spatial-temporal action localization datasets, namely UCF101-24, J-HMDB-21, and AVA, shows that the proposed module considerably enhances the model’s performance in limited labeled data settings. Our framework achieves superior results compared to extended image-based semi-supervised baselines.
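A sketch of dual-guided pseudo-box selection: a box becomes a pseudo-label only when the detector's class agrees with a frame-level classifier and the combined confidence is high. Shapes and the threshold are assumptions.

```python
import numpy as np

def select_pseudo_boxes(boxes, box_scores, box_classes, frame_probs, thr=0.7):
    """boxes: (N, 4); box_scores, box_classes: (N,); frame_probs: (num_classes,).
    Keeps boxes whose class matches the frame-level prediction."""
    frame_cls = frame_probs.argmax()
    keep = (box_classes == frame_cls) \
        & (box_scores * frame_probs[frame_cls] > thr)
    return boxes[keep]
```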
[98] MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Main category: cs.CV
TL;DR: MAGE is a novel multimodal framework addressing spatial and semantic losses in visual data encoding by aligning vision and text spaces, enhancing performance in large multimodal models.
Details
Motivation: Existing methods suffer from vector gaps and semantic disparities, leading to information loss in multimodal models. MAGE aims to bridge these gaps.
Method: MAGE uses an Intelligent Alignment Network (IAN) for dimensional and semantic alignment, combined with a cross-entropy and mean squared error training strategy. It also includes a fine-tuning dataset for multimodal tool-calling.
Result: MAGE outperforms similar works on benchmarks like MME, MMBench, and SEED.
Conclusion: MAGE effectively addresses alignment challenges in multimodal learning, achieving superior performance and expanding model capabilities.
Abstract: In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE.
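A minimal sketch of the combined objective mentioned above, assuming the IAN outputs visual features already projected to the text embedding dimension; the balancing weight is an assumption.

```python
import torch
import torch.nn.functional as F

def alignment_loss(logits, targets, vis_aligned, text_emb, lam=1.0):
    """Cross-entropy on language-model outputs plus MSE pulling IAN-aligned
    visual features toward the text embedding space."""
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    mse = F.mse_loss(vis_aligned, text_emb)
    return ce + lam * mse
```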
[99] Tracking Moose using Aerial Object Detection
Christopher Indris, Raiyan Rahman, Goetz Bramesfeld, Guanghui Wang
Main category: cs.CV
TL;DR: The paper explores patching augmentation for small object detection in aerial wildlife tracking, comparing three models under various configurations, achieving high accuracy and highlighting efficient models for UAV deployment.
Details
Motivation: Aerial wildlife tracking faces challenges like high costs, risks, and computational limits. Small object detection is difficult due to tiny object sizes and efficiency needs.
Method: Applied patching augmentation to datasets, compared three diverse object detectors, and analyzed performance under varying hyperparameters.
Result: All models achieved ≥93% mAP@IoU=0.5 in at least one configuration. Faster, simpler models matched more complex ones, supporting UAV use.
Conclusion: Patching augmentation improves small object detection, with simpler models being effective, enabling practical UAV deployment for wildlife tracking.
Abstract: Aerial wildlife tracking is critical for conservation efforts and relies on detecting small objects on the ground below the aircraft. It presents technical challenges: crewed aircraft are expensive, risky and disruptive; autonomous drones have limited computational capacity for onboard AI systems. Since the objects of interest may appear only a few pixels wide, small object detection is an inherently challenging computer vision subfield compounded by computational efficiency needs. This paper applies a patching augmentation to datasets to study model performance under various settings. A comparative study of three common yet architecturally diverse object detectors is conducted using the data, varying the patching method’s hyperparameters against detection accuracy. Each model achieved at least 93% mAP@IoU=0.5 on at least one patching configuration. Statistical analyses provide an in-depth commentary on the effects of various factors. Analysis also shows that faster, simpler models are about as effective as models that require more computational power for this task and perform well given limited patch scales, encouraging UAV deployment. Datasets and models will be made available via https://github.com/chrisindris/Moose.
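A minimal patching sketch: tiling each aerial frame into overlapping crops so that few-pixel animals occupy a larger share of the detector input. Patch size and overlap are illustrative, and edge remainders are left unhandled for brevity.

```python
import numpy as np

def patchify(image, patch=640, overlap=0.2):
    """Split a large frame into overlapping square patches.
    Returns patches plus their (x, y) offsets for stitching detections back."""
    stride = int(patch * (1 - overlap))
    h, w = image.shape[:2]
    patches, offsets = [], []
    for y in range(0, max(h - patch, 0) + 1, stride):
        for x in range(0, max(w - patch, 0) + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
            offsets.append((x, y))
    return patches, offsets
```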
[100] HDR Environment Map Estimation with Latent Diffusion Models
Jack Hilliard, Adrian Hilton, Jean-Yves Guillemaut
Main category: cs.CV
TL;DR: A novel approach using Latent Diffusion Model (LDM) for HDR environment map estimation from single-view images, addressing ERP distortions and seam artifacts with ERP convolutional padding and a panoramically-adapted Diffusion Transformer (PanoDiT).
Details
Motivation: To improve HDR environment map estimation by addressing distortions and seam artifacts in ERP representation, enhancing quality and plausibility for mirror-reflective surfaces.
Method: Proposes ERP convolutional padding in the latent autoencoder to remove seam artifacts and introduces PanoDiT, a panoramically-adapted Diffusion Transformer, to reduce ERP distortions.
Result: Models produce high-quality environment maps, competitive with state-of-the-art in image quality and lighting accuracy, though PanoDiT trades off some image quality for reduced distortions.
Conclusion: The approach effectively addresses ERP limitations, offering a competitive solution for HDR environment map estimation, with trade-offs in quality for distortion reduction.
Abstract: We advance the field of HDR environment map estimation from a single-view image by establishing a novel approach leveraging the Latent Diffusion Model (LDM) to produce high-quality environment maps that can plausibly light mirror-reflective surfaces. A common issue when using the ERP representation, the format used by the vast majority of approaches, is distortions at the poles and a seam at the sides of the environment map. We remove the border seam artefact by proposing an ERP convolutional padding in the latent autoencoder. Additionally, we investigate whether adapting the diffusion network architecture to the ERP format can improve the quality and accuracy of the estimated environment map by proposing a panoramically-adapted Diffusion Transformer architecture. Our proposed PanoDiT network reduces ERP distortions and artefacts, but at the cost of image quality and plausibility. We evaluate with standard benchmarks to demonstrate that our models estimate high-quality environment maps that perform competitively with state-of-the-art approaches in both image quality and lighting accuracy.
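The seam fix exploits the fact that longitude is periodic in ERP images, so horizontal padding can wrap around; a minimal sketch (the paper applies this inside the latent autoencoder, and the pole handling here is a simple assumption).

```python
import torch
import torch.nn.functional as F

def erp_conv2d(x, weight, pad=1):
    """Convolution with ERP-aware padding: wrap horizontally (removing the
    side seam) and replicate vertically at the poles. For a (2*pad+1)-sized
    kernel this preserves the spatial resolution."""
    x = F.pad(x, (pad, pad, 0, 0), mode="circular")    # left/right wrap
    x = F.pad(x, (0, 0, pad, pad), mode="replicate")   # top/bottom poles
    return F.conv2d(x, weight)
```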
[101] Fairness and Robustness of CLIP-Based Models for Chest X-rays
Théo Sourget, David Restrepo, Céline Hudelot, Enzo Ferrante, Stergios Christodoulidis, Maria Vakalopoulou
Main category: cs.CV
TL;DR: The study evaluates fairness and robustness of six CLIP-based models in chest X-ray classification, revealing performance gaps by age and reliance on spurious correlations like chest drains.
Details
Motivation: To assess fairness and robustness of CLIP-based models in medical tasks, given their strong performance in natural image-text domains but underexplored fairness in clinical settings.
Method: Evaluated six CLIP-based models on chest X-ray classification using the MIMIC-CXR, NIH-CXR14, and NEATX datasets, assessing fairness across age, sex, and race, and robustness to shortcut learning (e.g., chest drains).
Result: Performance gaps were found between age groups, but results were equitable for sex and race. Models performed worse on images without chest drains, indicating reliance on spurious correlations. Sensitive attributes could be classified from the embeddings, though PCA visualisations did not reveal such patterns.
Conclusion: CLIP-based models show fairness concerns in age-based subgroups and robustness issues due to shortcut learning, highlighting the need for further evaluation in clinical applications.
Abstract: Motivated by the strong performance of CLIP-based models in natural image-text domains, recent efforts have adapted these architectures to medical tasks, particularly in radiology, where large paired datasets of images and reports, such as chest X-rays, are available. While these models have shown encouraging results in terms of accuracy and discriminative performance, their fairness and robustness across different clinical tasks remain largely underexplored. In this study, we extensively evaluate six widely used CLIP-based models on chest X-ray classification using three publicly available datasets: MIMIC-CXR, NIH-CXR14, and NEATX. We assess the models' fairness across six conditions and patient subgroups based on age, sex, and race. Additionally, we assess the robustness to shortcut learning by evaluating performance on pneumothorax cases with and without chest drains. Our results indicate performance gaps between patients of different ages, but more equitable results for the other attributes. Moreover, all models exhibit lower performance on images without chest drains, suggesting reliance on spurious correlations. We further complement the performance analysis with a study of the embeddings generated by the models. While the sensitive attributes could be classified from the embeddings, we do not see such patterns using PCA, showing the limitations of these visualisation techniques when assessing models. Our code is available at https://github.com/TheoSourget/clip_cxr_fairness
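A small sketch of the kind of subgroup analysis the study performs: a per-group AUC and the resulting gap between the best- and worst-served groups. The function and variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc_gap(y_true: np.ndarray, y_score: np.ndarray, groups: np.ndarray):
    """Compute AUC per subgroup (e.g., age bins) and the max-min gap."""
    aucs = {}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) == 2:  # AUC needs both classes present
            aucs[g] = roc_auc_score(y_true[mask], y_score[mask])
    gap = max(aucs.values()) - min(aucs.values())
    return aucs, gap
```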
[102] VoluMe – Authentic 3D Video Calls from Live Gaussian Splat Prediction
Martin de La Gorce, Charlie Hewitt, Tibor Takacs, Robert Gerdisch, Zafiirah Hosenie, Givi Meishvili, Marek Kowalski, Thomas J. Cashman, Antonio Criminisi
Main category: cs.CV
TL;DR: A method for real-time 3D Gaussian reconstructions from a single 2D webcam feed, enhancing remote meetings with realistic, authentic, and stable 3D representations without complex hardware.
Details
Motivation: To improve remote meetings by offering realistic and authentic 3D representations without the constraints of existing high-quality solutions that rely on complex hardware or fixed appearances.
Method: Predicts 3D Gaussian reconstructions in real time from a single 2D webcam feed, ensuring authenticity and realism by conditioning on each video frame independently and introducing a stability loss for temporal consistency.
Result: Achieves state-of-the-art accuracy in visual quality and stability metrics, demonstrated in live one-to-one 3D meetings using standard 2D cameras and displays.
Conclusion: The method enables highly accessible, realistic, and authentic 3D videoconferencing, making volumetric communication feasible for anyone with standard hardware.
Abstract: Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.
[103] GLCP: Global-to-Local Connectivity Preservation for Tubular Structure Segmentation
Feixiang Zhou, Zhuangzhi Gao, He Zhao, Jianyang Xie, Yanda Meng, Yitian Zhao, Gregory Y. H. Lip, Yalin Zheng
Main category: cs.CV
TL;DR: A novel Global-to-Local Connectivity Preservation (GLCP) framework is proposed for tubular structure segmentation, addressing fragmentation by jointly learning global and local features, outperforming existing methods.
Details
Motivation: Structural fragmentation in tubular network segmentation hampers downstream applications, and existing methods neglect local discontinuity regions.
Method: The GLCP framework includes an Interactive Multi-head Segmentation (IMS) module for global and local feature learning and a Dual-Attention-based Refinement (DAR) module for improved segmentation.
Result: GLCP achieves superior accuracy and continuity in 2D and 3D tubular structure segmentation compared to state-of-the-art methods.
Conclusion: The proposed GLCP framework effectively addresses fragmentation by integrating global and local connectivity, demonstrating significant improvements in segmentation quality.
Abstract: Accurate segmentation of tubular structures, such as vascular networks, plays a critical role in various medical domains. A significant remaining challenge in this task is structural fragmentation, which can adversely impact downstream applications. Existing methods primarily focus on designing various loss functions to constrain global topological structures. However, they often overlook local discontinuity regions, leading to suboptimal segmentation results. To overcome this limitation, we propose a novel Global-to-Local Connectivity Preservation (GLCP) framework that can simultaneously perceive global and local structural characteristics of tubular networks. Specifically, we propose an Interactive Multi-head Segmentation (IMS) module to jointly learn global segmentation, skeleton maps, and local discontinuity maps. This enables our model to explicitly target local discontinuity regions while maintaining global topological integrity. In addition, we design a lightweight Dual-Attention-based Refinement (DAR) module to further improve segmentation quality by refining the resulting segmentation maps. Extensive experiments on both 2D and 3D datasets demonstrate that our GLCP achieves superior accuracy and continuity in tubular structure segmentation compared to several state-of-the-art approaches. The source codes will be available at https://github.com/FeixiangZhou/GLCP.
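A minimal sketch of the multi-head idea behind the IMS module: a shared feature trunk feeds three heads that jointly predict the global segmentation, the skeleton map, and the local discontinuity map. Channel sizes and head designs are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiHeadSegDecoder(nn.Module):
    """Shared trunk with three prediction heads, in the spirit of IMS."""
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.seg_head = nn.Conv2d(feat_ch, 1, kernel_size=1)   # global segmentation
        self.skel_head = nn.Conv2d(feat_ch, 1, kernel_size=1)  # skeleton map
        self.disc_head = nn.Conv2d(feat_ch, 1, kernel_size=1)  # local discontinuity map

    def forward(self, feats):
        return (torch.sigmoid(self.seg_head(feats)),
                torch.sigmoid(self.skel_head(feats)),
                torch.sigmoid(self.disc_head(feats)))
```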
[104] Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
Monika Shah, Sudarshan Balaji, Somdeb Sarkhel, Sanorita Dey, Deepak Venugopal
Main category: cs.CV
TL;DR: The paper investigates how Vision Language Models (VLMs) handle violations of Grice’s conversational maxims by modifying human-crafted questions, finding that VLM performance declines with such modifications.
Details
Motivation: To explore if VLMs can handle conversational violations like humans, using Grice's maxims as a framework.
Method: Modifiers were added to questions from the VQA v2.0 dataset, and responses from GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash were analyzed.
Result: VLMs’ performance consistently diminished with added modifiers, highlighting limitations.
Conclusion: The approach is promising for understanding VLM limitations in handling conversational violations.
Abstract: We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of cooperative principles of conversation proposed by Grice. Specifically, even when Grice's maxims of conversation are flouted, humans typically do not have much difficulty in understanding the conversation even though it requires more cognitive effort. Here, we study if VLMs are capable of handling violations to Grice's maxims in a manner that is similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the response of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely, GPT-4o, Claude-3.5-Sonnet and Gemini-1.5-Flash on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently diminishes with the addition of modifiers, which suggests that our approach is a promising direction for understanding the limitations of VLMs.
[105] Enhancing and Accelerating Brain MRI through Deep Learning Reconstruction Using Prior Subject-Specific Imaging
Amirmohammad Shamaei, Alexander Stebner, Salome Bosshart, Johanna Ospel, Gouri Ginde, Mariana Bento, Roberto Souza
Main category: cs.CV
TL;DR: A deep-learning-based MRI reconstruction framework improves scan quality and reduces time by integrating prior subject-specific scans, outperforming existing methods.
Details
Motivation: Long MRI acquisition times increase costs and reduce patient comfort, necessitating faster, high-quality reconstruction methods.
Method: Proposes a framework with an initial reconstruction network, deep registration model, and transformer-based enhancement network, validated on a longitudinal dataset.
Result: Quantitatively superior to existing methods, improves brain segmentation accuracy, and reduces reconstruction time significantly.
Conclusion: The method is efficient, clinically viable, and publicly available, enhancing MRI reconstruction and downstream tasks.
Abstract: Magnetic resonance imaging (MRI) is a crucial medical imaging modality. However, long acquisition times remain a significant challenge, leading to increased costs and reduced patient comfort. Recent studies have shown the potential of using deep learning models that incorporate information from prior subject-specific MRI scans to improve reconstruction quality of present scans. Integrating this prior information requires registration of the previous scan to the current image reconstruction, which can be time-consuming. We propose a novel deep-learning-based MRI reconstruction framework which consists of an initial reconstruction network, a deep registration model, and a transformer-based enhancement network. We validated our method on a longitudinal dataset of T1-weighted MRI scans with 2,808 images from 18 subjects at four acceleration factors (R5, R10, R15, R20). Quantitative metrics confirmed our approach's superiority over existing methods (p < 0.05, Wilcoxon signed-rank test). Furthermore, we analyzed the impact of our MRI reconstruction method on the downstream task of brain segmentation and observed improved accuracy and volumetric agreement with reference segmentations. Our approach also achieved a substantial reduction in total reconstruction time compared to methods that use traditional registration algorithms, making it more suitable for real-time clinical applications. The code associated with this work is publicly available at https://github.com/amirshamaei/longitudinal-mri-deep-recon.
[106] Very High-Resolution Bridge Deformation Monitoring Using UAV-based Photogrammetry
Mehdi Maboudi, Jan Backhaus, Inka Mai, Yahya Ghassoun, Yogesh Khedar, Dirk Lowke, Bjoern Riedel, Ulf Bestmann, Markus Gerke
Main category: cs.CV
TL;DR: The paper evaluates UAV-based monitoring for structural health monitoring (SHM) of bridges, focusing on geometric deformation under load. It demonstrates high accuracy (within 1 mm) compared to traditional methods.
Details
Motivation: Many bridges are nearing their service life, necessitating efficient and cost-effective SHM methods. UAV-based monitoring offers a promising alternative to traditional inspections.
Method: The study used a reinforced concrete bridge subjected to controlled loads. High-resolution images were captured via a UAV (DJI Matrice 600 Pro with a 100MP camera) and processed to monitor point motion and generate dense point clouds. Geodetic control and reference sensors (displacement transducers, tachymetry, laser profiling) validated results.
Result: The UAV-based approach achieved sub-millimeter accuracy (less than 1 mm difference from reference data) and enabled full area-wide deformation quantification.
Conclusion: UAV-based SHM is highly suitable for bridge monitoring, offering precise, area-wide deformation analysis and outperforming traditional point or profile measurements.
Abstract: Accurate and efficient structural health monitoring of infrastructure objects such as bridges is a vital task, as many existing constructions have already reached or are approaching their planned service life. In this contribution, we address the question of the suitability of UAV-based monitoring for SHM, in particular focusing on the geometric deformation under load. Such an advanced technology is becoming increasingly popular due to its ability to decrease the cost and risk of tedious traditional inspection methods. To this end, we performed extensive tests employing a research reinforced concrete bridge that can be exposed to a predefined load via ground anchors. Very high-resolution image blocks have been captured before, during, and after the application of controlled loads. From those images, the motion of distinct points on the bridge has been monitored, and in addition, dense image point clouds were computed to evaluate the performance of surface-based data acquisition. Moreover, a geodetic control network in stable regions is used as control information for bundle adjustment. We applied different sensing technologies in order to be able to judge the image-based deformation results: displacement transducers, tachymetry, and laser profiling. As a platform for the photogrammetric measurements, a multi-rotor UAV DJI Matrice 600 Pro was employed, equipped with two RTK-GNSS receivers. The mounted camera was a PhaseOne iXM-100 (100MP) with an 80 mm lens. With a flying height of 30 m above the terrain, this resulted in a GSD of 1.3 mm while a forward and sideward overlap of 80% was maintained. The comparison with reference data (displacement transducers) reveals a difference of less than 1 mm. We show that by employing the introduced UAV-based monitoring approach, a full area-wide quantification of deformation is possible in contrast to classical point or profile measurements.
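The reported 1.3 mm GSD follows from the pinhole relation GSD = pixel pitch × flying height / focal length. Assuming a pixel pitch of roughly 3.5 µm for the 100MP sensor (an assumption; the abstract does not state it), the numbers line up:

```python
def ground_sampling_distance(pixel_pitch_m: float, height_m: float, focal_m: float) -> float:
    """GSD = pixel pitch x flying height / focal length (pinhole model)."""
    return pixel_pitch_m * height_m / focal_m

# Values from the abstract: 80 mm lens, 30 m flying height, reported GSD 1.3 mm.
# The ~3.5 um pixel pitch is an assumed value for the 100MP sensor.
gsd = ground_sampling_distance(pixel_pitch_m=3.5e-6, height_m=30.0, focal_m=0.080)
print(f"GSD = {gsd * 1000:.2f} mm")  # ~1.31 mm, consistent with the reported 1.3 mm
```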
[107] Group Relative Augmentation for Data Efficient Action Detection
Deep Anil Patel, Iain Melvin, Zachary Izzo, Martin Renqiang Min
Main category: cs.CV
TL;DR: Efficient adaptation of Video-Language Models (VLMs) for action detection using few examples, addressing overfitting and granularity mismatch with parameter-efficient tuning and feature augmentation.
Details
Motivation: Challenges in adapting VLMs for action detection include overfitting and mismatch between scene-level pre-training and person-centric needs.
Method: Combines LoRA for parameter-efficient tuning with learnable feature augmentation (FiLM) and a group-weighted loss function.
Result: Achieves strong mAP on AVA and MOMA datasets, demonstrating data efficiency.
Conclusion: The proposed method effectively adapts VLMs for action detection with limited data, outperforming traditional approaches.
Abstract: Adapting large Video-Language Models (VLMs) for action detection using only a few examples poses challenges like overfitting and the granularity mismatch between scene-level pre-training and required person-centric understanding. We propose an efficient adaptation strategy combining parameter-efficient tuning (LoRA) with a novel learnable internal feature augmentation. Applied within the frozen VLM backbone using FiLM, these augmentations generate diverse feature variations directly relevant to the task. Additionally, we introduce a group-weighted loss function that dynamically modulates the training contribution of each augmented sample based on its prediction divergence relative to the group average. This promotes robust learning by prioritizing informative yet reasonable augmentations. We demonstrate our method’s effectiveness on complex multi-label, multi-person action detection datasets (AVA, MOMA), achieving strong mAP performance and showcasing significant data efficiency for adapting VLMs from limited examples.
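A sketch of a group-weighted loss in the spirit described: each augmented view's contribution is modulated by how far its prediction diverges from the group-average prediction. The softmax weighting and the temperature are assumptions; the paper's exact rule may differ.

```python
import torch

def group_weighted_loss(per_sample_losses: torch.Tensor,
                        logits: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """Reweight the losses of G augmented views of one sample.

    per_sample_losses: (G,) loss for each augmented view
    logits:            (G, C) predictions for the same views
    Views whose predictions stray far from the group average are
    down-weighted, prioritizing informative yet reasonable augmentations.
    """
    probs = logits.softmax(dim=-1)
    group_mean = probs.mean(dim=0, keepdim=True)              # group-average prediction
    divergence = (probs - group_mean).pow(2).sum(dim=-1)      # per-view divergence
    weights = torch.softmax(-divergence / temperature, dim=0) # down-weight outliers
    return (weights * per_sample_losses).sum()
```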
[108] Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy
Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc
Main category: cs.CV
TL;DR: CoP introduces a multi-task learning framework for BEV 3D object detection, using spatial occupancy as auxiliary info to improve perception by integrating structural and conceptual similarities between tasks.
Details
Motivation: Existing BEV methods neglect environmental contexts like roads, limiting comprehensive perception. CoP aims to bridge this gap by leveraging spatial occupancy.
Method: CoP uses LDO for dense occupancy ground truths, VHS for fine-grained feature sampling, and CFF for global-local feature fusion.
Result: CoP achieves 49.5% mAP and 59.2% NDS on nuScenes, outperforming existing vision-based frameworks.
Conclusion: CoP enhances BEV representations by integrating occupancy prediction, demonstrating superior performance in 3D object detection.
Abstract: Vision-based bird’s-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.
[109] Generalizable Neural Electromagnetic Inverse Scattering
Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang
Main category: cs.CV
TL;DR: A physics-driven framework for solving Electromagnetic Inverse Scattering Problems (EISP) is proposed, improving generalization and robustness over existing methods.
Details
Motivation: Existing machine learning approaches like Img-Interiors lack generalization and fail under sparse transmitter setups, necessitating a more robust solution.
Method: The paper reformulates EISP as a two-stage process, introducing a current estimator and permittivity solver for end-to-end prediction.
Result: The proposed framework outperforms state-of-the-art methods in accuracy, generalization, and robustness, especially under sparse transmitter conditions.
Conclusion: This work provides a novel physics-informed approach to EISP, advancing practical electromagnetic imaging solutions.
Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from the scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two-stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging.
[110] Evaluating Deep Learning Models for African Wildlife Image Classification: From DenseNet to Vision Transformers
Lukman Jibril Aliyu, Umar Sani Muhammad, Bilqisu Ismail, Nasiru Muhammad, Almustapha A Wakili, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad, Mustapha Abdullahi
Main category: cs.CV
TL;DR: The paper compares deep learning models for classifying African wildlife images, finding ViT-H/14 most accurate (99%) but computationally costly, while DenseNet-201 (67%) is more deployable.
Details
Motivation: Addressing the decline of African wildlife by leveraging deep learning for biodiversity monitoring and conservation.
Method: Comparative study of DenseNet-201, ResNet-152, EfficientNet-B4, and ViT-H/14 using transfer learning on a dataset of four species.
Result: ViT-H/14 achieved 99% accuracy but high computational cost; DenseNet-201 (67%) was integrated into a deployable tool.
Conclusion: The study provides insights for model selection and deployment in wildlife conservation, balancing accuracy and practicality.
Abstract: Wildlife populations in Africa face severe threats, with vertebrate numbers declining by over 65% in the past five decades. In response, image classification using deep learning has emerged as a promising tool for biodiversity monitoring and conservation. This paper presents a comparative study of deep learning models for automatically classifying African wildlife images, focusing on transfer learning with frozen feature extractors. Using a public dataset of four species (buffalo, elephant, rhinoceros, and zebra), we evaluate the performance of DenseNet-201, ResNet-152, EfficientNet-B4, and Vision Transformer ViT-H/14. DenseNet-201 achieved the best performance among convolutional networks (67% accuracy), while ViT-H/14 achieved the highest overall accuracy (99%), but with significantly higher computational cost, raising deployment concerns. Our experiments highlight the trade-offs between accuracy, resource requirements, and deployability. The best-performing CNN (DenseNet-201) was integrated into a Hugging Face Gradio Space for real-time field use, demonstrating the feasibility of deploying lightweight models in conservation settings. This work contributes to African-grounded AI research by offering practical insights into model selection, dataset preparation, and responsible deployment of deep learning tools for wildlife conservation.
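The transfer-learning setup with a frozen feature extractor is straightforward to reproduce in outline. This sketch freezes DenseNet-201's ImageNet weights and trains only a new four-class head; training hyperparameters are omitted and not taken from the paper.

```python
import torch.nn as nn
from torchvision import models

# Frozen feature extractor + small trainable head for the four species.
backbone = models.densenet201(weights=models.DenseNet201_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                    # freeze the ImageNet features

num_feats = backbone.classifier.in_features    # 1920 for DenseNet-201
backbone.classifier = nn.Linear(num_feats, 4)  # buffalo, elephant, rhinoceros, zebra
# Pass only backbone.classifier.parameters() to the optimizer.
```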
[111] Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation
I-Hsiang Chen, Hua-En Chang, Wei-Ting Chen, Jenq-Neng Hwang, Sy-Yen Kuo
Main category: cs.CV
TL;DR: PDAF introduces a probabilistic diffusion framework to enhance domain generalization in semantic segmentation by modeling latent domain priors and aligning features across domains.
Details
Motivation: Domain shifts in unseen environments degrade model performance, and existing methods often overlook intrinsic latent domain priors.
Method: PDAF uses a Latent Domain Prior (LDP) to capture shifts, integrating three modules: LPE for LDP prediction, DCM for feature adjustment, and DPE for LDP estimation via diffusion.
Result: PDAF improves generalization across diverse urban scenes, validated by extensive experiments.
Conclusion: PDAF effectively addresses domain shifts by iteratively refining feature representations, enhancing segmentation performance in unseen domains.
Abstract: Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre-trained segmentation model and utilizes paired source and pseudo-target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban scenes.
[112] Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View
Zitong Zhang, Suranjan Gautam, Rui Yu
Main category: cs.CV
TL;DR: Top2Pano generates realistic 360° indoor panoramas from 2D top-down views, outperforming baselines in geometry and photorealism.
Details
Motivation: The task is challenging due to missing 3D structure and the need for geometric consistency and photorealism in applications like VR, interior design, and robotics.
Method: Top2Pano uses volumetric occupancy estimation for 3D structure, volumetric rendering for coarse panoramas, and a diffusion-based refinement with ControlNet for realism.
Result: Evaluations show Top2Pano outperforms baselines, reconstructing geometry, occlusions, and spatial arrangements well, even generalizing to schematic floorplans.
Conclusion: Top2Pano effectively bridges top-down views with immersive indoor synthesis, demonstrating strong potential in various applications.
Abstract: Generating immersive 360° indoor panoramas from 2D top-down views has applications in virtual reality, interior design, real estate, and robotics. This task is challenging due to the lack of explicit 3D structure and the need for geometric consistency and photorealism. We propose Top2Pano, an end-to-end model for synthesizing realistic indoor panoramas from top-down views. Our method estimates volumetric occupancy to infer 3D structures, then uses volumetric rendering to generate coarse color and depth panoramas. These guide a diffusion-based refinement stage using ControlNet, enhancing realism and structural fidelity. Evaluations on two datasets show Top2Pano outperforms baselines, effectively reconstructing geometry, occlusions, and spatial arrangements. It also generalizes well, producing high-quality panoramas from schematic floorplans. Our results highlight Top2Pano's potential in bridging top-down views with immersive indoor synthesis.
[113] Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen
Main category: cs.CV
TL;DR: LLaVA-Reward is a reward model for evaluating text-to-image generations using multimodal large language models (MLLMs), improving efficiency and accuracy with a Skip-connection Cross Attention module.
Details
Motivation: Existing MLLM-based methods are time-consuming and hard to train, requiring instruction-following data. LLaVA-Reward aims to simplify this by leveraging hidden states of MLLMs.
Method: LLaVA-Reward uses hidden states of MLLMs for text-image pairs and introduces SkipCA to enhance visual-textual interaction. It supports various preference data types for fine-tuning.
Result: LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and scaling in text-to-image generations.
Conclusion: LLaVA-Reward offers an efficient, accurate solution for evaluating text-to-image generations, addressing limitations of existing methods.
Abstract: We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing text responses, which is time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports different types of preference data for efficient fine-tuning, including paired preference data and unpaired data. We train LLaVA-Reward on four evaluation perspectives: text-image alignment, fidelity/artifact, safety, and overall ranking. Empirical results demonstrate that LLaVA-Reward outperforms conventional and MLLM-based methods in generating human-aligned scores for automatic evaluations and inference-time scaling in text-to-image generations.
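A sketch of a SkipCA-style block: later-layer hidden states attend to early-layer visual features through cross attention with a residual connection. Dimensions, head count, and placement are illustrative assumptions rather than the paper's exact design.

```python
import torch.nn as nn

class SkipConnectionCrossAttention(nn.Module):
    """Later-layer hidden states attend to early-layer visual features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, late_hidden, early_visual):
        # Queries from late hidden states; keys/values from early visual features.
        attended, _ = self.attn(query=late_hidden, key=early_visual, value=early_visual)
        return self.norm(late_hidden + attended)  # residual "skip" connection
```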
[114] ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs
Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli
Main category: cs.CV
TL;DR: ReGATE is a token pruning method for faster MLLM training, using a teacher-student framework to selectively process tokens, reducing computation while maintaining accuracy.
Details
Motivation: Training multimodal large language models (MLLMs) is computationally expensive, and existing efficiency methods focus on inference, offering limited training benefits.
Method: ReGATE uses a teacher-student framework with a frozen LLM as the teacher to compute reference losses, combined with the student's difficulty scores, enabling adaptive token pruning.
Result: ReGATE accelerates training by up to 2x, uses only 35% of tokens, and surpasses baseline performance on some benchmarks with 41% fewer tokens.
Conclusion: ReGATE effectively reduces computational costs in MLLM training while maintaining or improving model performance.
Abstract: The computational cost of training multimodal large language models (MLLMs) rapidly increases with the number of tokens involved. Existing efficiency methods primarily target inference and rely on token reduction or merging, offering limited benefit during training. In this paper, we propose ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. Specifically, ReGATE adopts a teacher-student framework in which the MLLM being trained serves as the student, and a frozen reference large language model (LLM) acts as the teacher. The teacher computes per-token reference losses, which are combined with an exponential moving average (EMA) of the student's own difficulty scores. This adaptive difficulty-based scoring enables the selective processing of crucial tokens while bypassing less informative ones in the forward pass, significantly reducing computational overhead. Experiments demonstrate that ReGATE, when applied to VideoLLaMA2, matches the peak accuracy of standard training on MVBench up to 2× faster, using only 35% of the tokens. With additional training, it even surpasses the baseline on several multimodal benchmarks, all while reducing the total token count by over 41%. Code and models will be released soon.
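A sketch of the reference-guided scoring loop: per-token teacher losses are combined with an EMA of the student's difficulty, and only the top-scoring tokens are processed. The additive combination and top-k selection are assumptions about the scoring rule; the 0.35 keep ratio simply echoes the reported 35% token usage.

```python
import torch

def select_tokens(student_losses: torch.Tensor,
                  teacher_losses: torch.Tensor,
                  ema_difficulty: torch.Tensor,
                  keep_ratio: float = 0.35,
                  momentum: float = 0.9):
    """Score tokens and keep the most informative ones for the forward pass.

    All tensors are 1-D with one entry per token. Returns the indices of
    tokens to process and the updated EMA difficulty state.
    """
    ema_difficulty = momentum * ema_difficulty + (1 - momentum) * student_losses
    scores = teacher_losses + ema_difficulty        # per-token importance (assumed form)
    k = max(1, int(keep_ratio * scores.numel()))
    keep_idx = torch.topk(scores, k).indices        # tokens kept; the rest are elided
    return keep_idx, ema_difficulty
```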
[115] MapDiffusion: Generative Diffusion for Vectorized Online HD Map Construction and Uncertainty Estimation in Autonomous Driving
Thomas Monninger, Zihan Zhang, Zhipeng Mo, Md Zafar Anwar, Steffen Staab, Sihao Ding
Main category: cs.CV
TL;DR: MapDiffusion introduces a generative diffusion-based approach for vectorized map construction, capturing uncertainty and ambiguity in autonomous driving environments.
Details
Motivation: Traditional map construction models provide deterministic outputs, lacking uncertainty estimates and failing to handle real-world ambiguities like occlusions.
Method: MapDiffusion uses diffusion to iteratively refine randomly initialized queries, conditioned on a BEV latent grid, generating multiple plausible map samples.
Result: Outperforms baselines by 5% in single-sample performance on nuScenes, with aggregated samples improving accuracy and uncertainty estimates correlating with scene ambiguity.
Conclusion: MapDiffusion enhances robustness and reliability in HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles.
Abstract: Autonomous driving requires an understanding of the static environment from sensor data. Learned Bird’s-Eye View (BEV) encoders are commonly used to fuse multiple inputs, and a vector decoder predicts a vectorized map representation from the latent BEV grid. However, traditional map construction models provide deterministic point estimates, failing to capture uncertainty and the inherent ambiguities of real-world environments, such as occlusions and missing lane markings. We propose MapDiffusion, a novel generative approach that leverages the diffusion paradigm to learn the full distribution of possible vectorized maps. Instead of predicting a single deterministic output from learned queries, MapDiffusion iteratively refines randomly initialized queries, conditioned on a BEV latent grid, to generate multiple plausible map samples. This allows aggregating samples to improve prediction accuracy and deriving uncertainty estimates that directly correlate with scene ambiguity. Extensive experiments on the nuScenes dataset demonstrate that MapDiffusion achieves state-of-the-art performance in online map construction, surpassing the baseline by 5% in single-sample performance. We further show that aggregating multiple samples consistently improves performance along the ROC curve, validating the benefit of distribution modeling. Additionally, our uncertainty estimates are significantly higher in occluded areas, reinforcing their value in identifying regions with ambiguous sensor input. By modeling the full map distribution, MapDiffusion enhances the robustness and reliability of online vectorized HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles in complex environments.
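One simple way to realize the sample-aggregation and uncertainty idea, assuming the S sampled maps share point correspondence (a simplification; real vectorized outputs would first need matching across samples):

```python
import numpy as np

def aggregate_map_samples(samples: np.ndarray):
    """Aggregate multiple diffusion samples of the same vectorized map.

    samples: (S, N, 2) array of S sampled maps, each with N polyline points.
    Returns the mean map and a per-point uncertainty (spread across samples),
    which the paper finds correlates with scene ambiguity such as occlusion.
    """
    mean_map = samples.mean(axis=0)                  # (N, 2) aggregated estimate
    uncertainty = samples.std(axis=0).mean(axis=-1)  # (N,) per-point spread
    return mean_map, uncertainty
```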
[116] Dual Cross-image Semantic Consistency with Self-aware Pseudo Labeling for Semi-supervised Medical Image Segmentation
Han Wu, Chong Wang, Zhiming Cui
Main category: cs.CV
TL;DR: The paper introduces DuCiSC, a semi-supervised learning framework for medical image segmentation, addressing feature discrepancy and semantic consistency by aligning prototypes across labeled and unlabeled images.
Details
Motivation: Current semi-supervised methods for medical image segmentation overlook semantic-level consistency and suffer from feature discrepancies due to imbalanced labeled and unlabeled data.
Method: DuCiSC enforces region-level semantic consistency across labeled and unlabeled images via prototype alignment and uses a self-aware confidence estimation strategy for reliable pseudo-label selection.
Result: DuCiSC outperforms state-of-the-art methods on four datasets, including binary and multi-class segmentation tasks.
Conclusion: DuCiSC effectively addresses feature discrepancy and semantic consistency, achieving superior segmentation results in semi-supervised medical image segmentation.
Abstract: Semi-supervised learning has proven highly effective in tackling the challenge of limited labeled training data in medical image segmentation. In general, current approaches, which rely on intra-image pixel-wise consistency training via pseudo-labeling, overlook the consistency at more comprehensive semantic levels (e.g., object region) and suffer from severe discrepancy of extracted features resulting from an imbalanced number of labeled and unlabeled data. To overcome these limitations, we present a new Dual Cross-image Semantic Consistency (DuCiSC) learning framework, for semi-supervised medical image segmentation. Concretely, beyond enforcing pixel-wise semantic consistency, DuCiSC proposes dual paradigms to encourage region-level semantic consistency across: 1) labeled and unlabeled images; and 2) labeled and fused images, by explicitly aligning their prototypes. Relying on the dual paradigms, DuCiSC can effectively establish consistent cross-image semantics via prototype representations, thereby addressing the feature discrepancy issue. Moreover, we devise a novel self-aware confidence estimation strategy to accurately select reliable pseudo labels, allowing for exploiting the training dynamics of unlabeled data. Our DuCiSC method is extensively validated on four datasets, including two popular binary benchmarks in segmenting the left atrium and pancreas, a multi-class Automatic Cardiac Diagnosis Challenge dataset, and a challenging scenario of segmenting the inferior alveolar nerve that features complicated anatomical structures, showing superior segmentation results over previous state-of-the-art approaches. Our code is publicly available at https://github.com/ShanghaiTech-IMPACT/DuCiSC.
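A sketch of the prototype-alignment idea: class prototypes are obtained by masked average pooling of feature maps and then aligned across labeled and pseudo-labeled images. The cosine-distance alignment term is an assumption; DuCiSC's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def class_prototype(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked average pooling: prototype of one class region.
    features: (C, H, W) feature map; mask: (H, W) soft or binary mask."""
    weighted = (features * mask.unsqueeze(0)).sum(dim=(1, 2))
    return weighted / mask.sum().clamp(min=1e-6)

def prototype_alignment_loss(feat_labeled, mask_labeled, feat_unlabeled, pseudo_mask):
    """Align class prototypes from labeled and pseudo-labeled images."""
    p_l = class_prototype(feat_labeled, mask_labeled)
    p_u = class_prototype(feat_unlabeled, pseudo_mask)
    return 1.0 - F.cosine_similarity(p_l, p_u, dim=0)  # small when prototypes agree
```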
[117] Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation
Bolei Chen, Jiaxu Kang, Yifei Wang, Ping Zhong, Qi Wu, Jianxin Wang
Main category: cs.CV
TL;DR: The paper proposes a navigation policy for Vision Language Navigation (VLN) using Recursive Visual Imagination (RVI) and Adaptive Linguistic Grounding (ALG) to improve scene comprehension and alignment with commands.
Details
Motivation: Current VLN agents struggle with overly detailed scene representation and poor vision-language alignment, leading to navigation errors.
Method: Introduces RVI to model visual transitions and ALG for fine-grained semantic matching between situational memories and linguistic commands.
Result: The policy outperforms state-of-the-art methods on VLN-CE and ObjectNav tasks.
Conclusion: RVI and ALG enhance linguistic grounding and navigation accuracy, proving superior in VLN tasks.
Abstract: Vision Language Navigation (VLN) typically requires agents to navigate to specified objects or remote regions in unknown scenes by obeying linguistic commands. Such tasks require organizing historical visual observations for linguistic grounding, which is critical for long-sequence navigational decisions. However, current agents suffer from overly detailed scene representation and ambiguous vision-language alignment, which weaken their comprehension of navigation-friendly high-level scene priors and easily lead to behaviors that violate linguistic commands. To tackle these issues, we propose a navigation policy by recursively summarizing along-the-way visual perceptions, which are adaptively aligned with commands to enhance linguistic grounding. In particular, by structurally modeling historical trajectories as compact neural grids, several Recursive Visual Imagination (RVI) techniques are proposed to motivate agents to focus on the regularity of visual transitions and semantic scene layouts, instead of dealing with misleading geometric details. Then, an Adaptive Linguistic Grounding (ALG) technique is proposed to align the learned situational memories with different linguistic components purposefully. Such fine-grained semantic matching facilitates the accurate anticipation of navigation actions and progress. Our navigation policy outperforms the state-of-the-art methods on the challenging VLN-CE and ObjectNav tasks, showing the superiority of our RVI and ALG techniques for VLN.
[118] Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation
Sheng-Feng Yu, Jia-Jiun Yao, Wei-Chen Chiu
Main category: cs.CV
TL;DR: The paper introduces Self-Supervised Dataset Distillation, a method to reduce dataset size while preserving performance by distilling images and their self-supervised representations. Novel techniques include low-dimensional bases, predetermined augmentations, and lightweight networks for compact distillation.
Details
Motivation: Large datasets increase training costs, making dataset distillation essential. Existing methods focus on supervised datasets, but this work targets self-supervised learning for better efficiency and generalizability.
Method: Proposes three techniques: 1) parameterization via low-dimensional bases, 2) using predetermined augmentations to reduce instability, and 3) leveraging a lightweight network for compact distillation.
Result: The method shows superior distillation efficiency, cross-architecture generalization, and transfer learning performance in experiments.
Conclusion: Self-Supervised Dataset Distillation effectively reduces dataset size while maintaining performance, with novel techniques enhancing its practicality.
Abstract: Although larger datasets are crucial for training large deep models, the rapid growth of dataset size has brought a significant challenge in terms of considerable training costs, which even results in prohibitive computational expenses. Dataset Distillation has recently become a popular technique to reduce the dataset size via learning a highly compact set of representative exemplars, where the model trained with these exemplars ideally should have comparable performance with respect to the one trained with the full dataset. While most existing works on dataset distillation focus on supervised datasets, we instead aim to distill images and their self-supervisedly trained representations into a distilled set. This procedure, named as Self-Supervised Dataset Distillation, effectively extracts rich information from real datasets, yielding the distilled sets with enhanced cross-architecture generalizability. Particularly, in order to preserve the key characteristics of the original dataset more faithfully and compactly, several novel techniques are proposed: 1) we introduce an innovative parameterization upon images and representations via distinct low-dimensional bases, where the base selection for parameterization is experimentally shown to play a crucial role; 2) we tackle the instability induced by the randomness of data augmentation – a key component in self-supervised learning but underestimated in prior work on self-supervised dataset distillation – by utilizing predetermined augmentations; 3) we further leverage a lightweight network to model the connections among the representations of augmented views from the same image, leading to more compact pairs of distillation. Extensive experiments conducted on various datasets validate the superiority of our approach in terms of distillation efficiency, cross-architecture generalization, and transfer learning performance.
[119] An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes
Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen
Main category: cs.CV
TL;DR: A novel light field representation (ESI) and angular-temporal interaction network (ATINet) improve object tracking in low-light scenes by leveraging geometric and angular-temporal cues.
Details
Motivation: Existing methods struggle with reliable angular modeling in complex low-light scenes, limiting scene perception and target identification.
Method: Proposes ESI for geometric structure and ATINet for angular-temporal interaction learning, optimized in a self-supervised manner.
Result: ATINet achieves state-of-the-art performance in single and multiple object tracking.
Conclusion: The proposed methods enhance light field modeling and tracking, validated by a new dataset and extensive experiments.
Abstract: High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.
[120] Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval
Zhichuan Wang, Yang Zhou, Zhe Liu, Rui Yu, Song Bai, Yulong Wang, Xinwei He, Xiang Bai
Main category: cs.CV
TL;DR: A framework called DAC uses CLIP and MLLM for open-set 3D object retrieval, achieving superior performance with multi-view images.
Details
Motivation: Existing methods struggle with generalized representations due to limited 3D training data. CLIP's pre-trained capabilities offer a solution.
Method: DAC combines CLIP with an MLLM for adaptation and inference, using AB-LoRA to enhance generalization.
Result: DAC outperforms prior methods by +10.01% mAP on four datasets and shows strong generalization.
Conclusion: DAC is a simple yet effective framework for open-set 3DOR, leveraging CLIP and MLLM for superior performance.
Abstract: Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP’s training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.
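A sketch of one plausible reading of Additive-Bias Low-Rank adaptation: a frozen base projection plus a scaled low-rank update and a learnable additive bias. The precise role of the bias in AB-LoRA is not specified in the abstract, so treat this as illustrative rather than the paper's method.

```python
import torch
import torch.nn as nn

class ABLoRALinear(nn.Module):
    """Frozen linear layer + low-rank update B @ A + learnable additive bias."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained CLIP weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.bias = nn.Parameter(torch.zeros(base.out_features))  # the additive bias
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T) + self.bias
```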
[121] Optimizing Active Learning in Vision-Language Models via Parameter-Efficient Uncertainty Calibration
Athmanarayanan Lakshmi Narayanan, Amrutha Machireddy, Ranganath Krishnan
Main category: cs.CV
TL;DR: A novel parameter-efficient AL method with uncertainty calibration loss improves sample selection for vision-language models, outperforming complex techniques while being computationally efficient.
Details
Motivation: To reduce labeling costs in AL for large-scale vision-language models by addressing uncertainty estimation and efficient sampling challenges.
Method: Introduces a differentiable loss function for uncertainty calibration within AL, comparing Prompt learning and LoRA for sample selection.
Result: Matches/exceeds performance of complex sampling techniques, with computational efficiency, across datasets and vision backbones.
Conclusion: The proposed method effectively selects informative samples for fine-tuning, offering a practical solution for efficient AL in vision-language tasks.
Abstract: Active Learning (AL) has emerged as a powerful approach for minimizing labeling costs by selectively sampling the most informative data for neural network model development. Effective AL for large-scale vision-language models necessitates addressing challenges in uncertainty estimation and efficient sampling given the vast number of parameters involved. In this work, we introduce a novel parameter-efficient learning methodology that incorporates uncertainty calibration loss within the AL framework. We propose a differentiable loss function that promotes uncertainty calibration for effectively selecting fewer and most informative data samples for fine-tuning. Through extensive experiments across several datasets and vision backbones, we demonstrate that our solution can match and exceed the performance of complex feature-based sampling techniques while being computationally very efficient. Additionally, we investigate the efficacy of Prompt learning versus Low-rank adaptation (LoRA) in sample selection, providing a detailed comparative analysis of these methods in the context of efficient AL.
[122] Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance
Mengling Xu, Ming Tao, Bing-Kun Bao
Main category: cs.CV
TL;DR: The paper introduces Chain-of-Cooking, a model for visualizing cooking processes by generating images for each step, addressing semantic inconsistency and contextual coherence challenges.
Details
Motivation: Existing works focus on finished foods, but visualizing intermediate cooking steps is challenging due to changing ingredient appearances and sequential dependencies.
Method: Proposes Dynamic Patch Selection Module for correct appearances, Semantic Evolution Module for semantic association, and Bidirectional Chain-of-Thought Guidance for coherence.
Result: Outperforms existing methods in generating coherent and semantically consistent cooking process images, validated on the CookViz dataset.
Conclusion: Chain-of-Cooking effectively addresses key challenges in cooking process visualization, offering improved performance and coherence.
Abstract: Cooking process visualization is a promising task at the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods based on the given recipes, and face two challenges in visualizing the cooking process. First, the appearance of ingredients changes considerably across cooking steps, so it is difficult to generate food appearances that match the textual description, leading to semantic inconsistency. Second, the current step might depend on the operations of the previous step, so it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model, called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module to retrieve previously generated image patches as references, which are most related to current textual contents. Furthermore, to enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantic association between latent prompts and the current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features to keep the current cooking step coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantically consistent cooking process images.
[123] Suppressing Gradient Conflict for Generalizable Deepfake Detection
Ming-Hui Liu, Harry Cheng, Xin Luo, Xin-Shun Xu
Main category: cs.CV
TL;DR: The paper introduces CS-DFD, a framework to mitigate gradient conflicts in deepfake detection, improving accuracy and generalization by reconciling disparities between original and synthesized data.
Details
Motivation: Deepfake detection models struggle with generalization due to gradient conflicts when trained on both original and synthesized data, degrading performance.
Method: Proposes CS-DFD with two modules: UVS for reconciling gradient disparities and CGR for enforcing low-conflict feature embeddings.
Result: CS-DFD achieves state-of-the-art performance in in-domain accuracy and cross-domain generalization.
Conclusion: The framework effectively addresses gradient conflicts, enhancing deepfake detection robustness.
Abstract: Robust deepfake detection models must be capable of generalizing to ever-evolving manipulation techniques beyond training data. A promising strategy is to augment the training data with online synthesized fake images containing broadly generalizable artifacts. However, in the context of deepfake detection, it is surprising that jointly training on both original and online synthesized forgeries may result in degraded performance. This contradicts the common belief that incorporating more source-domain data should enhance detection accuracy. Through empirical analysis, we trace this degradation to gradient conflicts during backpropagation which force a trade-off between source domain accuracy and target domain generalization. To overcome this issue, we propose a Conflict-Suppressed Deepfake Detection (CS-DFD) framework that explicitly mitigates the gradient conflict via two synergistic modules. First, an Update Vector Search (UVS) module searches for an alternative update vector near the initial gradient vector to reconcile the disparities of the original and online synthesized forgeries. By further transforming the search process into an extremum optimization problem, UVS yields a unique update vector that maximizes the simultaneous loss reductions for each data type. Second, a Conflict Gradient Reduction (CGR) module enforces a low-conflict feature embedding space through a novel Conflict Descent Loss. This loss penalizes misaligned gradient directions and guides the learning of representations with aligned, non-conflicting gradients. The synergy of UVS and CGR alleviates gradient interference in both parameter optimization and representation learning. Experiments on multiple deepfake benchmarks demonstrate that CS-DFD achieves state-of-the-art performance in both in-domain detection accuracy and cross-domain generalization.
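The gradient conflict the paper traces is easy to diagnose: when the per-domain gradients have negative cosine similarity, one domain's descent direction is the other's ascent. The check below is a diagnostic only, not the paper's UVS search, which instead finds an update vector reducing both losses simultaneously.

```python
import torch
import torch.nn.functional as F

def gradient_conflict(grad_orig: torch.Tensor, grad_synth: torch.Tensor):
    """Return (conflict?, cosine) for gradients from the two data types.

    grad_orig:  gradient of the loss on original forgeries
    grad_synth: gradient of the loss on online-synthesized forgeries
    A negative cosine means the two updates pull the model apart.
    """
    cos = F.cosine_similarity(grad_orig.flatten(), grad_synth.flatten(), dim=0)
    return cos.item() < 0.0, cos.item()
```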
[124] Sun sensor calibration algorithms: A systematic mapping and survey
Michael Herman, Olivia J. Pinon Fischer, Dimitri N. Mavris
Main category: cs.CV
TL;DR: A review of sun sensor modeling and calibration algorithms, highlighting research gaps and future directions.
Details
Motivation: Sun sensors are critical for spacecraft attitude determination, but their calibration is complex due to small, spatio-temporally varying uncertainties. Existing literature lacks consolidation, motivating this systematic review.Method: The paper conducts a systematic mapping of sun sensor modeling and calibration algorithms across various sensor configurations, surveying methodologies and analyzing gaps.
Result: The review provides a comprehensive survey of methodologies, identifies research gaps, and offers recommendations for future advancements in sun sensor calibration.
Conclusion: The study consolidates existing work on sun sensor calibration, highlights gaps, and suggests future research directions to improve accuracy and reduce uncertainty.
Abstract: Attitude sensors determine the spacecraft attitude through the sensing of an astronomical object, field or other phenomena. The Sun and fixed stars are the two primary astronomical sensing objects. Attitude sensors are critical components for the survival and knowledge improvement of spacecraft. Of these, sun sensors are the most common and important sensor for spacecraft attitude determination. The sun sensor measures the Sun vector in spacecraft coordinates. The sun sensor calibration process is particularly difficult due to the complex nature of the uncertainties involved. The uncertainties are small, difficult to observe, and vary spatio-temporally over the lifecycle of the sensor. In addition, the sensors are affected by numerous sources of uncertainties, including manufacturing, electrical, environmental, and interference sources. This motivates the development of advanced calibration algorithms to minimize uncertainty over the sensor lifecycle and improve accuracy. Although modeling and calibration techniques for sun sensors have been explored extensively in the literature over the past two decades, there is currently no resource that consolidates and systematically reviews this body of work. The present review proposes a systematic mapping of sun sensor modeling and calibration algorithms across a breadth of sensor configurations. It specifically provides a comprehensive survey of each methodology, along with an analysis of research gaps and recommendations for future directions in sun sensor modeling and calibration techniques.
[125] Multi-View Reconstruction with Global Context for 3D Anomaly Detection
Yihan Sun, Yuqi Cheng, Yunkang Cao, Yuxin Zhang, Weiming Shen
Main category: cs.CV
TL;DR: MVR improves 3D anomaly detection by converting point clouds to multi-view images for better global information learning.
Details
Motivation: Existing methods lack sufficient global information for high-precision 3D anomaly detection.Method: Multi-View Reconstruction (MVR) converts point clouds into multi-view images and uses a reconstruction-based framework.
Result: Achieves 89.6% object-wise AU-ROC and 95.7% point-wise AU-ROC on Real3D-AD.
Conclusion: MVR effectively enhances global information learning for 3D anomaly detection.
Abstract: 3D anomaly detection is critical in industrial quality inspection. While existing methods achieve notable progress, their performance degrades in high-precision 3D anomaly detection due to insufficient global information. To address this, we propose Multi-View Reconstruction (MVR), a method that losslessly converts high-resolution point clouds into multi-view images and employs a reconstruction-based anomaly detection framework to enhance global information learning. Extensive experiments demonstrate the effectiveness of MVR, achieving 89.6% object-wise AU-ROC and 95.7% point-wise AU-ROC on the Real3D-AD benchmark.
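A minimal sketch of the multi-view conversion idea, under our own assumptions (orthographic depth rasterization; the authors' lossless renderer is not specified in the abstract): rotate the cloud to several viewpoints and rasterize a depth map per view so a 2D reconstruction network can learn global structure.

```python
import numpy as np

def render_depth_views(points: np.ndarray, n_views: int = 4, res: int = 64) -> np.ndarray:
    """Rasterize a point cloud (N, 3) into n_views orthographic depth maps."""
    views = np.full((n_views, res, res), np.inf)
    for v, angle in enumerate(np.linspace(0, np.pi, n_views, endpoint=False)):
        c, s = np.cos(angle), np.sin(angle)
        rot = points @ np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]]).T  # rotate about y-axis
        xy = rot[:, :2]
        xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0) + 1e-8)  # normalize to [0, 1]
        u, w = (xy * (res - 1)).astype(int).T
        np.minimum.at(views[v], (w, u), rot[:, 2])  # keep nearest depth per pixel
    views[np.isinf(views)] = 0.0  # empty pixels become background
    return views

print(render_depth_views(np.random.rand(1000, 3)).shape)  # (4, 64, 64)
```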
[126] RelMap: Enhancing Online Map Construction with Class-Aware Spatial Relation and Semantic Priors
Tianhui Cai, Yun Zhang, Zewei Zhou, Zhiyu Huang, Jiaqi Ma
Main category: cs.CV
TL;DR: RelMap is an end-to-end framework for online HD map construction that improves accuracy by incorporating spatial relations and semantic priors, achieving state-of-the-art results.
Details
Motivation: Existing transformer-based methods for online HD map construction often overlook spatial and semantic relationships among map elements, limiting accuracy and generalization.Method: RelMap introduces a Class-aware Spatial Relation Prior to encode positional dependencies and a Mixture-of-Experts-based Semantic Prior for refining feature decoding. It works with single-frame and temporal backbones.
Result: The method achieves state-of-the-art performance on nuScenes and Argoverse 2 datasets.
Conclusion: RelMap effectively addresses limitations in existing approaches by leveraging spatial and semantic relationships, enhancing online HD map construction.
Abstract: Online high-definition (HD) map construction plays an increasingly important role in scaling autonomous driving systems. Transformer-based methods have become prevalent in online HD map construction; however, existing approaches often neglect the inherent spatial and semantic relationships among map elements, which limits their accuracy and generalization. To address this, we propose RelMap, an end-to-end framework that enhances online map construction by incorporating spatial relations and semantic priors. We introduce a Class-aware Spatial Relation Prior, which explicitly encodes relative positional dependencies between map elements using a learnable class-aware relation encoder. Additionally, we propose a Mixture-of-Experts (MoE)-based Semantic Prior, which routes features to class-specific experts based on predicted class probabilities, refining instance feature decoding. Our method is compatible with both single-frame and temporal perception backbones, achieving state-of-the-art performance on both the nuScenes and Argoverse 2 datasets.
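As a sketch of the MoE-based semantic prior described above (the expert design and dimensions are our assumptions): each map-element query is routed to class-specific experts, weighted by its predicted class probabilities.

```python
import torch
import torch.nn as nn

class SemanticMoE(nn.Module):
    """Route queries to class-specific experts weighted by class probabilities."""
    def __init__(self, dim: int = 256, n_classes: int = 3):  # e.g. divider, crossing, boundary
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_classes))

    def forward(self, queries, class_probs):  # (B, Q, D), (B, Q, C)
        expert_out = torch.stack([e(queries) for e in self.experts], dim=-2)  # (B, Q, C, D)
        return (class_probs.unsqueeze(-1) * expert_out).sum(-2)               # (B, Q, D)

out = SemanticMoE()(torch.randn(2, 50, 256), torch.softmax(torch.randn(2, 50, 3), -1))
print(out.shape)  # torch.Size([2, 50, 256])
```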
[127] LinDeps: A Fine-tuning Free Post-Pruning Method to Remove Layer-Wise Linear Dependencies with Guaranteed Performance Preservation
Maxim Henry, Adrien Deliège, Anthony Cioppa, Marc Van Droogenbroeck
Main category: cs.CV
TL;DR: LinDeps is a post-pruning method using linear dependency analysis to remove redundant filters in CNNs, improving compression and performance without fine-tuning.
Details
Motivation: Addressing the challenge of pruning CNNs optimally without degrading performance by considering structural dependencies across feature maps.Method: Uses pivoted QR decomposition to detect and prune linearly dependent filters, followed by a signal recovery mechanism to adjust kernels.
Result: Demonstrates improved compression rates and preserved performance on CIFAR-10 and ImageNet with VGG and ResNet, outperforming state-of-the-art methods.
Conclusion: LinDeps is a versatile add-on for pruning techniques, enhancing efficiency and performance, especially in low-resource setups.
Abstract: Convolutional Neural Networks (CNN) are widely used in many computer vision tasks. Yet, their increasing size and complexity pose significant challenges for efficient deployment on resource-constrained platforms. Hence, network pruning has emerged as an effective way of reducing the size and computational requirements of neural networks by removing redundant or unimportant parameters. However, a fundamental challenge with pruning consists in optimally removing redundancies without degrading performance. Most existing pruning techniques overlook structural dependencies across feature maps within a layer, resulting in suboptimal pruning decisions. In this work, we introduce LinDeps, a novel post-pruning method, i.e., a pruning method that can be applied on top of any pruning technique, which systematically identifies and removes redundant filters via linear dependency analysis. Particularly, LinDeps applies pivoted QR decomposition to feature maps to detect and prune linearly dependent filters. Then, a novel signal recovery mechanism adjusts the next layer’s kernels to preserve compatibility and performance without requiring any fine-tuning. Our experiments on CIFAR-10 and ImageNet with VGG and ResNet backbones demonstrate that LinDeps improves compression rates of existing pruning techniques while preserving performances, leading to a new state of the art in CNN pruning. We also benchmark LinDeps in low-resource setups where no retraining can be performed, which shows significant pruning improvements and inference speedups over a state-of-the-art method. LinDeps therefore constitutes an essential add-on for any current or future pruning technique.
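The pivoted-QR test for redundant filters is concrete enough to sketch. This is our reading of the mechanism (the `dependent_filters` helper and the tolerance are assumptions), not the authors' code: flatten each filter's feature map into a column, run pivoted QR, and treat columns beyond the numerical rank as linearly dependent, hence prunable.

```python
import numpy as np
from scipy.linalg import qr

def dependent_filters(feature_maps: np.ndarray, tol: float = 1e-6) -> list:
    """feature_maps: (n_filters, H, W) activations of one layer on some input."""
    A = feature_maps.reshape(feature_maps.shape[0], -1).T  # columns = filters
    _, R, piv = qr(A, mode='economic', pivoting=True)
    diag = np.abs(np.diag(R))                  # pivoting sorts |R_ii| decreasing
    rank = int((diag > tol * diag[0]).sum())   # numerical rank of the column space
    return sorted(piv[rank:])                  # filters expressible from the kept ones

fm = np.random.rand(8, 4, 4)
fm[5] = 2 * fm[0] + fm[3]        # make filters {0, 3, 5} linearly dependent
print(dependent_filters(fm))     # flags one filter from the dependent trio
```

Note that pivoted QR chooses which member of a dependent set to keep greedily, which is why LinDeps pairs the detection with a signal-recovery step that re-expresses pruned filters through the next layer's kernels.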
[128] TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang, Huan Wang
Main category: cs.CV
TL;DR: TARS, a token-adaptive preference strategy, improves multimodal large language models (MLLMs) by reducing hallucinations through min-max optimization, outperforming standard DPO and matching GPT-4o.
Details
Motivation: Existing DPO strategies for correcting hallucinations in MLLMs rely on static supervision, leading to overfitting and poor grounding in visual information.Method: TARS reformulates DPO as a min-max optimization problem, maximizing token-level distributional shifts under semantic constraints while minimizing preference loss.
Result: TARS reduces hallucination rates from 26.4% to 13.2% and cognition value from 2.5 to 0.4, outperforming standard DPO and matching GPT-4o.
Conclusion: TARS effectively mitigates hallucinations in MLLMs by preserving causal grounding and avoiding overfitting, demonstrating strong performance with minimal data.
Abstract: Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
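As rough intuition for the min-max formulation, the sketch below perturbs token embeddings FGSM-style to maximize a loss before the outer minimization; this is a hypothetical simplification of ours, not the paper's semantically constrained token-level shift.

```python
# Inner step: find a worst-case shift of the embeddings within an epsilon ball
# (simulating alignment uncertainty). Outer step: minimize the loss under that
# shift, which discourages overfitting to superficial preference patterns.
import torch

def minmax_step(embed: torch.Tensor, loss_fn, eps: float = 0.01) -> torch.Tensor:
    e = embed.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(e), e)[0]
    e_adv = embed + eps * grad.sign()   # inner max: adversarial distributional shift
    return loss_fn(e_adv)               # outer min: optimize the perturbed loss

emb = torch.randn(4, 16, 64)            # (batch, tokens, dim), toy embeddings
print(minmax_step(emb, lambda z: z.pow(2).mean()).item())
```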
[129] Emerging Trends in Pseudo-Label Refinement for Weakly Supervised Semantic Segmentation with Image-Level Supervision
Zheyuan Zhang, Wang Zhang
Main category: cs.CV
TL;DR: A review of weakly supervised semantic segmentation (WSSS) with image-level annotations, focusing on recent advancements, challenges, and future directions.
Details
Motivation: To address the gap in existing surveys by synthesizing the latest trends and state-of-the-art techniques in WSSS with image-level labels.Method: Categorizes existing methods based on supervision types and levels, and examines challenges in domain-specific datasets.
Result: Highlights advancements, evaluates limitations, and identifies underexplored topics in WSSS.
Conclusion: Provides a comprehensive resource for researchers familiar with WSSS, outlining future research directions.
Abstract: Unlike fully supervised semantic segmentation, weakly supervised semantic segmentation (WSSS) relies on weaker forms of supervision to perform dense prediction tasks. Among the various types of weak supervision, WSSS with image level annotations is considered both the most challenging and the most practical, attracting significant research attention. Therefore, in this review, we focus on WSSS with image level annotations. Additionally, this review concentrates on mainstream research directions, deliberately omitting less influential branches. Given the rapid development of new methods and the limitations of existing surveys in capturing recent trends, there is a pressing need for an updated and comprehensive review. Our goal is to fill this gap by synthesizing the latest advancements and state-of-the-art techniques in WSSS with image level labels. Basically, we provide a comprehensive review of recent advancements in WSSS with image level labels, categorizing existing methods based on the types and levels of additional supervision involved. We also examine the challenges of applying advanced methods to domain specific datasets in WSSS,a topic that remains underexplored. Finally, we discuss the current challenges, evaluate the limitations of existing approaches, and outline several promising directions for future research. This review is intended for researchers who are already familiar with the fundamental concepts of WSSS and are seeking to deepen their understanding of current advances and methodological innovations.
[130] Locally Controlled Face Aging with Latent Diffusion Models
Lais Isabelle Alves dos Santos, Julien Despois, Thibaut Chauffier, Sileye O. Ba, Giovanni Palma
Main category: cs.CV
TL;DR: A novel face aging method using latent diffusion models to selectively age facial regions, addressing heterogeneity in aging for more realistic results.
Details
Motivation: Current methods treat aging as homogeneous, ignoring regional differences due to intrinsic and extrinsic factors.Method: Leverages latent diffusion models to age specific facial regions and a refiner for seamless blending.
Result: Achieves robust identity preservation, high-fidelity imagery, and natural aging progression.
Conclusion: The approach provides finer-grained control and more realistic, personalized aging.
Abstract: We present a novel approach to face aging that addresses the limitations of current methods which treat aging as a global, homogeneous process. Existing techniques using GANs and diffusion models often condition generation on a reference image and target age, neglecting that facial regions age heterogeneously due to both intrinsic chronological factors and extrinsic elements like sun exposure. Our method leverages latent diffusion models to selectively age specific facial regions using local aging signs. This approach provides significantly finer-grained control over the generation process, enabling more realistic and personalized aging. We employ a latent diffusion refiner to seamlessly blend these locally aged regions, ensuring a globally consistent and natural-looking synthesis. Experimental results demonstrate that our method effectively achieves three key criteria for successful face aging: robust identity preservation, high-fidelity and realistic imagery, and a natural, controllable aging progression.
[131] Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking
Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song
Main category: cs.CV
TL;DR: A self-supervised tracking framework, SSTrack, eliminates the need for manual box annotations by using spatio-temporal consistency and instance contrastive loss, outperforming state-of-the-art methods by significant margins.
Details
Motivation: Manual box annotations are labor-intensive and limit dataset scale and diversity, prompting the need for a self-supervised solution.Method: Proposes a decoupled spatio-temporal consistency framework for global-local learning and an instance contrastive loss for robust supervision without labels.
Result: SSTrack improves AUC scores by 25.3%, 20.4%, and 14.8% on GOT10K, LaSOT, and TrackingNet datasets, respectively.
Conclusion: SSTrack effectively learns tracking representations self-supervised, reducing reliance on manual annotations while achieving superior performance.
Abstract: The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named SSTrack, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
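A minimal InfoNCE-style instance contrastive loss, shown as one plausible form of SSTrack's multi-view instance supervision (an assumption; the paper's exact formulation may differ): embeddings of the same instance from two views are positives, all others in the batch are negatives.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """z1, z2: (N, D) embeddings of the same N instances from two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau              # (N, N) cross-view similarity matrix
    labels = torch.arange(z1.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = instance_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```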
[132] Semantic Segmentation of iPS Cells: Case Study on Model Complexity in Biomedical Imaging
Maoquan Zhang, Bisser Raytchev, Xiujuan Sun
Main category: cs.CV
TL;DR: A DeepLabv3 model outperforms larger models like SAM2 and MedSAM2 in segmenting iPS cell colonies, showing simpler models can excel in specialized tasks with subtle boundaries.
Details
Motivation: To challenge the assumption that larger, more generalized models always perform better in medical image segmentation, especially for tasks with low-contrast boundaries.Method: Used a carefully configured DeepLabv3 model for segmenting iPS cell colonies without structural modifications.
Result: DeepLabv3 outperformed SAM2 and MedSAM2, demonstrating that simpler models can achieve high performance in specialized tasks.
Conclusion: Simpler, domain-specific models may offer better accuracy and reliability than large-scale foundation models in certain biomedical applications.
Abstract: Medical image segmentation requires not only accuracy but also robustness under challenging imaging conditions. In this study, we show that a carefully configured DeepLabv3 model can achieve high performance in segmenting induced pluripotent stem (iPS) cell colonies, and, under our experimental conditions, outperforms large-scale foundation models such as SAM2 and its medical variant MedSAM2 without structural modifications. These results suggest that, for specialized tasks characterized by subtle, low-contrast boundaries, increased model complexity does not necessarily translate to better performance. Our work revisits the assumption that ever-larger and more generalized architectures are always preferable, and provides evidence that appropriately adapted, simpler models may offer strong accuracy and practical reliability in domain-specific biomedical applications. We also offer an open-source implementation that includes strategies for small datasets and domain-specific encoding, with the aim of supporting further advances in semantic segmentation for regenerative medicine and related fields.
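The kind of off-the-shelf setup the study argues for can be sketched in a few lines (the authors' exact configuration lives in their open-source release; the binary class count here is our assumption): a standard torchvision DeepLabv3 with its head sized for colony-vs-background segmentation.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# A plain DeepLabv3 trained from scratch or fine-tuned for two classes
# (colony / background); no architectural modifications are required.
model = deeplabv3_resnet50(weights=None, num_classes=2).eval()
logits = model(torch.randn(1, 3, 256, 256))["out"]
print(logits.shape)  # torch.Size([1, 2, 256, 256]) per-pixel class scores
```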
[133] Wind Turbine Feature Detection Using Deep Learning and Synthetic Data
Arash Shahirpour, Jakob Gebler, Manuel Sanders, Tim Reuscher
Main category: cs.CV
TL;DR: Proposes synthetic data generation for training drone-based wind turbine inspection models, improving diversity and performance.
Details
Motivation: Overcome limitations of manually labeled real-world images by using synthetic data for better training diversity.Method: Generates synthetic training data with controlled variations and trains a YOLOv11 network with a modified loss function.
Result: Achieves high performance (Pose mAP50-95 of 0.97) on real-world images unseen during training.
Conclusion: Synthetic data enhances model robustness and performance for wind turbine detection.
Abstract: For the autonomous drone-based inspection of wind turbine (WT) blades, accurate detection of the WT and its key features is essential for safe drone positioning and collision avoidance. Existing deep learning methods typically rely on manually labeled real-world images, which limits both the quantity and the diversity of training datasets in terms of weather conditions, lighting, turbine types, and image complexity. In this paper, we propose a method to generate synthetic training data that allows controlled variation of visual and environmental factors, increasing the diversity and hence creating challenging learning scenarios. Furthermore, we train a YOLOv11 feature detection network solely on synthetic WT images with a modified loss function, to detect WTs and their key features within an image. The resulting network is evaluated both using synthetic images and a set of real-world WT images and shows promising performance across both synthetic and real-world data, achieving a Pose mAP50-95 of 0.97 on real images never seen during training.
[134] EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO
Wei Guan, Jun Lan, Jian Cao, Hao Tan, Huijia Zhu, Weiqiang Wang
Main category: cs.CV
TL;DR: EMIT enhances multimodal large language models (MLLMs) for industrial anomaly detection (IAD) using difficulty-aware group relative policy optimization (GRPO), achieving a 7.77% performance boost.
Details
Motivation: MLLMs lack domain-specific adaptation for IAD, limiting their effectiveness despite strong vision-language reasoning abilities.Method: EMIT constructs a multi-task IAD dataset, uses GPT-generated text descriptions, integrates soft prompts and heatmap-guided contrastive embeddings, and employs difficulty-aware GRPO for handling challenging samples.
Result: EMIT improves MLLM performance by 7.77% on the MMAD benchmark across seven tasks.
Conclusion: EMIT effectively adapts MLLMs for IAD, demonstrating significant performance gains through domain-specific enhancements.
Abstract: Industrial anomaly detection (IAD) plays a crucial role in maintaining the safety and reliability of manufacturing systems. While multimodal large language models (MLLMs) show strong vision-language reasoning abilities, their effectiveness in IAD remains limited without domain-specific adaptation. In this work, we propose EMIT, a unified framework that enhances MLLMs for IAD via difficulty-aware group relative policy optimization (GRPO). EMIT constructs a multi-task IAD dataset and utilizes GPT-generated object text descriptions to compensate for missing defective images. For few-shot anomaly detection, it integrates a soft prompt and heatmap-guided contrastive embeddings derived from patch-level comparisons. To better handle difficult data samples, i.e., cases where the MLLM struggles to generate correct answers, we propose a difficulty-aware GRPO that extends the original GRPO by incorporating a response resampling strategy to ensure the inclusion of correct answers in the sampled responses, as well as an advantage reweighting mechanism to strengthen learning from such difficult data samples. Extensive experiments on the MMAD benchmark demonstrate that EMIT significantly enhances the IAD performance of MLLMs, achieving an average improvement of 7.77% over the base model (InternVL3-8B) across seven tasks.
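A sketch of the group-relative advantage at the heart of GRPO, plus the difficulty-aware reweighting idea as we understand it (the threshold and boost factor are hypothetical, not the paper's scheme): prompts on which few sampled responses are correct are treated as difficult and get a stronger learning signal.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, difficulty_boost: float = 2.0):
    """rewards: (G,) rewards for G sampled responses to one prompt."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative advantage
    frac_correct = (rewards > 0).float().mean()
    if frac_correct < 0.25:           # few correct answers: a difficult sample
        adv = adv * difficulty_boost  # reweight to strengthen learning from it
    return adv

print(grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0, 0.0])))
```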
[135] GuidPaint: Class-Guided Image Inpainting with Diffusion Models
Qimin Wang, Xinda Liu, Guohua Geng
Main category: cs.CV
TL;DR: GuidPaint is a training-free, class-guided image inpainting framework using diffusion models, improving semantic consistency and visual realism without additional training.
Details
Motivation: Existing diffusion-based inpainting methods lack fine-grained control over masked regions, leading to inconsistent or implausible results.Method: Incorporates classifier guidance into denoising for precise control, integrates stochastic and deterministic sampling for refinement.
Result: Outperforms existing context-aware inpainting methods in qualitative and quantitative evaluations.
Conclusion: GuidPaint offers a computationally efficient and effective solution for high-quality image inpainting with fine-grained control.
Abstract: In recent years, diffusion models have been widely adopted for image inpainting tasks due to their powerful generative capabilities, achieving impressive results. Existing multimodal inpainting methods based on diffusion models often require architectural modifications and retraining, resulting in high computational cost. In contrast, context-aware diffusion inpainting methods leverage the model’s inherent priors to adjust intermediate denoising steps, enabling high-quality inpainting without additional training and significantly reducing computation. However, these methods lack fine-grained control over the masked regions, often leading to semantically inconsistent or visually implausible content. To address this issue, we propose GuidPaint, a training-free, class-guided image inpainting framework. By incorporating classifier guidance into the denoising process, GuidPaint enables precise control over intermediate generations within the masked areas, ensuring both semantic consistency and visual realism. Furthermore, it integrates stochastic and deterministic sampling, allowing users to select preferred intermediate results and deterministically refine them. Experimental results demonstrate that GuidPaint achieves clear improvements over existing context-aware inpainting methods in both qualitative and quantitative evaluations.
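Classifier guidance itself is a standard technique; the sketch below shows the generic step, restricted to the masked region as GuidPaint does conceptually (the function, its scale, and the toy classifier are illustrative assumptions, not the paper's code).

```python
import torch

def guided_noise(eps, x_t, mask, classifier, target, scale=2.0):
    """Shift predicted noise toward the target class inside the masked region."""
    x = x_t.detach().requires_grad_(True)
    log_prob = torch.log_softmax(classifier(x), dim=-1)[:, target].sum()
    grad = torch.autograd.grad(log_prob, x)[0]
    return eps - scale * mask * grad  # guide only the masked (inpainted) area

clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x_t = torch.randn(1, 3, 8, 8); eps = torch.randn_like(x_t)
mask = torch.zeros_like(x_t); mask[..., 2:6, 2:6] = 1.0  # region to inpaint
print(guided_noise(eps, x_t, mask, clf, target=3).shape)
```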
[136] The Evolution of Video Anomaly Detection: A Unified Framework from DNN to MLLM
Shibo Gao, Peipei Yang, Haiyang Guo, Yangyang Liu, Yi Chen, Shuai Li, Han Zhu, Jian Xu, Xu-Yao Zhang, Linlin Huang
Main category: cs.CV
TL;DR: A comprehensive survey on video anomaly detection (VAD) leveraging multi-modal large language models (MLLMs) and large language models (LLMs), highlighting advancements, challenges, and future directions.
Details
Motivation: The rapid development of MLLMs and LLMs has transformed VAD, creating a need for a systematic review of recent advancements and a unified framework.Method: The paper provides a survey and analysis of VAD methods based on MLLMs and LLMs, proposing a unified framework and comparing strengths and weaknesses.
Result: The survey identifies key changes in VAD due to MLLMs/LLMs, constructs a classification system, and analyzes new paradigms.
Conclusion: The paper outlines challenges and future research directions, offering guidance for the VAD community in the era of large models.
Abstract: Video anomaly detection (VAD) aims to identify and ground anomalous behaviors or events in videos, serving as a core technology in the fields of intelligent surveillance and public safety. With the advancement of deep learning, the continuous evolution of deep model architectures has driven innovation in VAD methodologies, significantly enhancing feature representation and scene adaptability, thereby improving algorithm generalization and expanding application boundaries. More importantly, the rapid development of multi-modal large language models (MLLMs) and large language models (LLMs) has introduced new opportunities and challenges to the VAD field. Under the support of MLLMs and LLMs, VAD has undergone significant transformations in terms of data annotation, input modalities, model architectures, and task objectives. The surge in publications and the evolution of tasks have created an urgent need for systematic reviews of recent advancements. This paper presents the first comprehensive survey analyzing VAD methods based on MLLMs and LLMs, providing an in-depth discussion of the changes occurring in the VAD field in the era of large models and their underlying causes. Additionally, this paper proposes a unified framework that encompasses both deep neural network (DNN)-based and LLM-based VAD methods, offering a thorough analysis of the new VAD paradigms empowered by LLMs, constructing a classification system, and comparing their strengths and weaknesses. Building on this foundation, this paper focuses on current VAD methods based on MLLMs/LLMs. Finally, based on the trajectory of technological advancements and existing bottlenecks, this paper distills key challenges and outlines future research directions, offering guidance for the VAD community.
[137] Automated Detection of Antarctic Benthic Organisms in High-Resolution In Situ Imagery to Aid Biodiversity Monitoring
Cameron Trotter, Huw Griffiths, Tasnuva Ming Khan, Rowan Whittle
Main category: cs.CV
TL;DR: A computer vision framework for Antarctic benthic biodiversity monitoring is introduced, addressing challenges like limited annotated data and complex imagery. It performs well for medium/large organisms but struggles with small/rare taxa.
Details
Motivation: Monitoring Antarctic benthic biodiversity is crucial for ecological insights under climate change, but manual annotation of imagery is slow and specialized.Method: The framework uses resolution-preserving patching, spatial data augmentation, fine-tuning, and Slicing Aided Hyper Inference for object detection in high-resolution imagery.
Result: The method achieves strong performance for 25 morphotypes (medium/large organisms) but struggles with small/rare taxa.
Conclusion: The framework offers a scalable solution for machine-assisted biodiversity monitoring, though detection of small/rare organisms needs improvement.
Abstract: Monitoring benthic biodiversity in Antarctica is vital for understanding ecological change in response to climate-driven pressures. This work is typically performed using high-resolution imagery captured in situ, though manual annotation of such data remains laborious and specialised, impeding large-scale analysis. We present a tailored object detection framework for identifying and classifying Antarctic benthic organisms in high-resolution towed camera imagery, alongside the first public computer vision dataset for benthic biodiversity monitoring in the Weddell Sea. Our approach addresses key challenges associated with marine ecological imagery, including limited annotated data, variable object sizes, and complex seafloor structure. The proposed framework combines resolution-preserving patching, spatial data augmentation, fine-tuning, and postprocessing via Slicing Aided Hyper Inference. We benchmark multiple object detection architectures and demonstrate strong performance in detecting medium and large organisms across 25 fine-grained morphotypes, significantly more than other works in this area. Detection of small and rare taxa remains a challenge, reflecting limitations in current detection architectures. Our framework provides a scalable foundation for future machine-assisted in situ benthic biodiversity monitoring research.
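The resolution-preserving patching step is simple to sketch in generic form (our version; the paper additionally uses the SAHI library for sliced inference): tile the large image into overlapping patches so small organisms are not destroyed by downscaling before detection.

```python
def slice_image(h: int, w: int, patch: int = 1024, overlap: float = 0.2):
    """Return (x1, y1, x2, y2) tiles covering an h-by-w image with overlap."""
    step = int(patch * (1 - overlap))
    boxes = []
    for y in range(0, max(h - patch, 0) + step, step):
        for x in range(0, max(w - patch, 0) + step, step):
            boxes.append((x, y, min(x + patch, w), min(y + patch, h)))
    return boxes

print(len(slice_image(4000, 6000)))  # number of overlapping tiles to run detection on
```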
[138] APT: Improving Diffusion Models for High Resolution Image Generation with Adaptive Path Tracing
Sangmin Han, Jinho Jeong, Jinwoo Kim, Seon Joo Kim
Main category: cs.CV
TL;DR: APT improves high-resolution image generation by addressing patch-level issues in LDMs, offering clearer details and faster sampling.
Details
Motivation: Fixed-resolution training in LDMs limits high-resolution scalability, and training-based methods are resource-intensive. Patch-based approaches face distribution shift and monotonicity issues.Method: APT uses Statistical Matching and Scale-aware Scheduling to address patch-level issues, ensuring consistent distributions and handling monotonicity.
Result: APT produces clearer, refined details and faster sampling with minimal quality loss.
Conclusion: APT is a practical solution for high-resolution image generation, balancing quality and efficiency.
Abstract: Latent Diffusion Models (LDMs) are generally trained at fixed resolutions, limiting their capability when scaling up to high-resolution images. While training-based approaches address this limitation by training on high-resolution datasets, they require large amounts of data and considerable computational resources, making them less practical. Consequently, training-free methods, particularly patch-based approaches, have become a popular alternative. These methods divide an image into patches and fuse the denoising paths of each patch, showing strong performance on high-resolution generation. However, we observe two critical issues for patch-based approaches, which we call "patch-level distribution shift" and "increased patch monotonicity." To address these issues, we propose Adaptive Path Tracing (APT), a framework that combines Statistical Matching to ensure patch distributions remain consistent in upsampled latents and Scale-aware Scheduling to deal with the patch monotonicity. As a result, APT produces clearer and more refined details in high-resolution images. In addition, APT enables a shortcut denoising process, resulting in faster sampling with minimal quality degradation. Our experimental results confirm that APT produces more detailed outputs with improved inference speed, providing a practical approach to high-resolution image generation.
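A sketch of Statistical Matching as we read it (an assumption, not the authors' code): renormalize each upsampled latent patch so its per-channel mean and standard deviation match the corresponding low-resolution latent, countering the patch-level distribution shift.

```python
import torch

def match_statistics(patch: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Align per-channel mean/std of `patch` (C, H, W) to those of `ref`."""
    p_mu, p_std = patch.mean((1, 2), keepdim=True), patch.std((1, 2), keepdim=True)
    r_mu, r_std = ref.mean((1, 2), keepdim=True), ref.std((1, 2), keepdim=True)
    return (patch - p_mu) / (p_std + 1e-6) * r_std + r_mu

hi = torch.randn(4, 64, 64) * 3 + 1   # drifted upsampled latent patch
lo = torch.randn(4, 32, 32)           # reference low-resolution latent
print(match_statistics(hi, lo).mean().item())  # ~0, matching the reference stats
```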
[139] Semantics versus Identity: A Divide-and-Conquer Approach towards Adjustable Medical Image De-Identification
Yuan Tian, Shuo Wang, Rongzhao Zhang, Zijian Chen, Yankai Jiang, Chunyi Li, Xiangyang Zhu, Fang Yan, Qiang Hu, XiaoSong Wang, Guangtao Zhai
Main category: cs.CV
TL;DR: A framework for medical image de-identification balances privacy and semantics by blocking identity-related regions and compensating with medical features, outperforming existing methods.
Details
Motivation: Addressing privacy risks in medical imaging while preserving medical semantics and allowing flexible privacy adjustments.Method: A two-step framework: Identity-Blocking for privacy levels and Medical-Semantics-Compensation using pre-trained models, with feature decoupling to remove residual identity.
Result: Outperforms existing methods across seven datasets and three tasks.
Conclusion: The proposed framework effectively balances privacy and medical utility, setting a new standard for de-identification.
Abstract: Medical imaging has significantly advanced computer-aided diagnosis, yet its re-identification (ReID) risks raise critical privacy concerns, calling for de-identification (DeID) techniques. Unfortunately, existing DeID methods neither particularly preserve medical semantics, nor are flexibly adjustable towards different privacy levels. To address these issues, we propose a divide-and-conquer framework comprising two steps: (1) Identity-Blocking, which blocks varying proportions of identity-related regions, to achieve different privacy levels; and (2) Medical-Semantics-Compensation, which leverages pre-trained Medical Foundation Models (MFMs) to extract medical semantic features to compensate for the blocked regions. Moreover, recognizing that features from MFMs may still contain residual identity information, we introduce a Minimum Description Length principle-based feature decoupling strategy to effectively decouple and discard such identity components. Extensive evaluations against existing approaches across seven datasets and three downstream tasks demonstrate our state-of-the-art performance.
[140] Impact of Underwater Image Enhancement on Feature Matching
Jason M. Summers, Mark W. Jones
Main category: cs.CV
TL;DR: The paper introduces local matching stability and furthest matchable frame as metrics to evaluate underwater image enhancement, proposing a novel framework to assess enhancement techniques’ impact on frame-matching performance and SLAM applications.
Details
Motivation: Underwater image enhancement is crucial for tasks like path detection and autonomous navigation, but existing methods lack robust evaluation metrics for real-world applicability.Method: The authors propose a novel evaluation framework using local matching stability and furthest matchable frame to analyze enhancement techniques’ impact on frame-matching and SLAM performance.
Result: The framework identifies strengths and limitations of existing approaches and provides a context-aware benchmark for comparing enhancement methods.
Conclusion: The study demonstrates the practical relevance of visual improvements in underwater SLAM, offering a robust evaluation framework for real-world scenarios.
Abstract: We introduce local matching stability and furthest matchable frame as quantitative measures for evaluating the success of underwater image enhancement. This enhancement process addresses visual degradation caused by light absorption, scattering, marine growth, and debris. Enhanced imagery plays a critical role in downstream tasks such as path detection and autonomous navigation for underwater vehicles, relying on robust feature extraction and frame matching. To assess the impact of enhancement techniques on frame-matching performance, we propose a novel evaluation framework tailored to underwater environments. Through metric-based analysis, we identify strengths and limitations of existing approaches and pinpoint gaps in their assessment of real-world applicability. By incorporating a practical matching strategy, our framework offers a robust, context-aware benchmark for comparing enhancement methods. Finally, we demonstrate how visual improvements affect the performance of a complete real-world algorithm – Simultaneous Localization and Mapping (SLAM) – reinforcing the framework’s relevance to operational underwater scenarios.
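One way to compute the proposed furthest-matchable-frame metric (our sketch; `count_matches` is a placeholder for any feature matcher, e.g. ORB or SIFT with a ratio test): from a reference frame, scan forward until the number of good matches drops below a threshold.

```python
def furthest_matchable_frame(frames, count_matches, ref_idx=0, min_matches=30):
    """Index of the furthest frame still matchable to frames[ref_idx]."""
    furthest = ref_idx
    for j in range(ref_idx + 1, len(frames)):
        if count_matches(frames[ref_idx], frames[j]) >= min_matches:
            furthest = j
        else:
            break  # matching has broken down; stop scanning
    return furthest

# Toy matcher whose match count decays with frame distance:
frames = list(range(100))
toy = lambda a, b: max(0, 60 - 2 * abs(a - b))
print(furthest_matchable_frame(frames, toy))  # -> 15
```

Running this metric on enhanced versus raw footage gives a direct, task-level measure of whether an enhancement method actually helps downstream matching.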
[141] Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations
Nils Hütten, Florian Hölken, Hasan Tercan, Tobias Meisen
Main category: cs.CV
TL;DR: The paper systematically analyzes the impact of ablating key components in detection transformer models (DETR, DDETR, DINO) to understand their roles in model performance and transparency, revealing model-specific resilience patterns and structural redundancies.
Details
Motivation: To address the research gap in understanding the distinct roles of internal components in detection transformers, inspired by neuroscientific ablation studies, to improve transparency and efficiency.Method: Ablation studies targeting query embeddings, encoder/decoder MHSA, and decoder MHCA layers in DETR, DDETR, and DINO models, evaluated on COCO dataset using gIoU and F1-score metrics.
Result: Model-specific resilience patterns: DETR is sensitive to encoder MHSA and decoder MHCA ablations, DDETR is robust due to multi-scale deformable attention, and DINO is most resilient due to its update rule. Structural redundancies in DDETR and DINO decoder MHCA layers were identified.
Conclusion: The study advances XAI for DETRs by clarifying component contributions, offering insights for optimizing transparency and efficiency, and revealing opportunities for model simplification.
Abstract: In recent years, Explainable AI has gained traction as an approach to enhancing model interpretability and transparency, particularly in complex models such as detection transformers. Despite rapid advancements, a substantial research gap remains in understanding the distinct roles of internal components - knowledge that is essential for improving transparency and efficiency. Inspired by neuroscientific ablation studies, which investigate the functions of brain regions through selective impairment, we systematically analyze the impact of ablating key components in three state-of-the-art detection transformer models: Detection transformer (DETR), deformable detection transformer (DDETR), and DETR with improved denoising anchor boxes (DINO). The ablations target query embeddings, encoder and decoder multi-head self-attentions (MHSA) as well as decoder multi-head cross-attention (MHCA) layers. We evaluate the effects of these ablations on the performance metrics gIoU and F1-score, quantifying effects on both the classification and regression sub-tasks on the COCO dataset. To facilitate reproducibility and future research, we publicly release the DeepDissect library. Our findings reveal model-specific resilience patterns: while DETR is particularly sensitive to ablations in encoder MHSA and decoder MHCA, DDETR's multi-scale deformable attention enhances robustness, and DINO exhibits the greatest resilience due to its look-forward twice update rule, which helps distribute knowledge across blocks. These insights also expose structural redundancies, particularly in DDETR's and DINO's decoder MHCA layers, highlighting opportunities for model simplification without sacrificing performance. This study advances XAI for DETRs by clarifying the contributions of internal components to model performance, offering insights to optimize and improve transparency and efficiency in critical applications.
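A generic ablation harness in the spirit of this methodology (a sketch; the module path, `evaluate`, and `coco_val` below are hypothetical): zero out one module's output with a forward hook, measure the metric drop, then restore.

```python
import torch

def ablate(module: torch.nn.Module):
    """Zero the module's output via a forward hook; assumes it returns a tensor."""
    return module.register_forward_hook(lambda m, inp, out: torch.zeros_like(out))

# Usage pattern: ablate, evaluate, then remove the hook to restore the model.
# handle = ablate(model.transformer.decoder.layers[3].multihead_attn)  # hypothetical path
# score_ablated = evaluate(model, coco_val)                            # e.g. gIoU / F1
# handle.remove()
```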
[142] SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking
Qianxiong Xu, Lanyun Zhu, Chenxi Liu, Guosheng Lin, Cheng Long, Ziyue Li, Rui Zhao
Main category: cs.CV
TL;DR: SAMITE improves VOT by addressing occlusion and distraction issues using a Prototypical Memory Bank and Positional Prompt Generator, built on SAM2.
Details
Motivation: Existing VOT methods lack temporal dependency handling and generalizability, and struggle with occlusions and distractions.Method: SAMITE enhances SAM2 with a Prototypical Memory Bank for error filtering and a Positional Prompt Generator for better tracking accuracy.
Result: SAMITE outperforms on six benchmarks, showing superior tracking performance.
Conclusion: SAMITE effectively tackles VOT challenges, offering improved accuracy and robustness.
Abstract: Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with additional modules, including: (1) Prototypical Memory Bank: We propose to quantify the feature-wise and position-wise correctness of each frame’s tracking results, and select the best frames to condition subsequent frames. As the features of occluded and distracting objects are feature-wise and position-wise inaccurate, their scores would naturally be lower and thus can be filtered to intercept error propagation; (2) Positional Prompt Generator: To further reduce the impacts of distractors, we propose to generate positional mask prompts to provide explicit positional clues for the target, leading to more accurate tracking. Extensive experiments have been conducted on six benchmarks, showing the superiority of SAMITE. The code is available at https://github.com/Sam1224/SAMITE.
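A small sketch of the memory-calibration idea behind the Prototypical Memory Bank (the scoring details are our assumption): keep only the top-k past frames whose tracking results score well, so occluded or distracted frames never condition future predictions.

```python
import heapq

def select_memory(frames, score_fn, k: int = 6):
    """frames: list of (frame_id, features); keep the k highest-scoring entries."""
    return heapq.nlargest(k, frames, key=lambda f: score_fn(f[1]))

# Toy usage: low-scoring (e.g. occluded) frames are filtered out of memory.
frames = [(i, {"quality": q}) for i, q in enumerate([0.9, 0.2, 0.8, 0.95, 0.1])]
print([i for i, _ in select_memory(frames, lambda f: f["quality"], k=3)])  # [3, 0, 2]
```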
[143] Adversarial Reconstruction Feedback for Robust Fine-grained Generalization
Shijie Wang, Jian Shi, Haojie Li
Main category: cs.CV
TL;DR: AdvRF introduces an adversarial reconstruction feedback framework to learn category-agnostic discrepancy representations for fine-grained image retrieval, improving generalization to unseen categories.
Details
Motivation: Existing FGIR methods rely on predefined categories, introducing semantic dependencies that hinder generalization.Method: AdvRF combines category-aware discrepancy localization (retrieval model) with category-agnostic feature learning (reconstruction model) via adversarial feedback.
Result: AdvRF achieves strong performance on fine-grained and coarse-grained datasets.
Conclusion: The framework successfully decouples category-specific semantics, enhancing generalization for unseen categories.
Abstract: Existing fine-grained image retrieval (FGIR) methods predominantly rely on supervision from predefined categories to learn discriminative representations for retrieving fine-grained objects. However, they inadvertently introduce category-specific semantics into the retrieval representation, creating semantic dependencies on predefined classes that critically hinder generalization to unseen categories. To tackle this, we propose AdvRF, a novel adversarial reconstruction feedback framework aimed at learning category-agnostic discrepancy representations. Specifically, AdvRF reformulates FGIR as a visual discrepancy reconstruction task via synergizing category-aware discrepancy localization from retrieval models with category-agnostic feature learning from reconstruction models. The reconstruction model exposes residual discrepancies overlooked by the retrieval model, forcing it to improve localization accuracy, while the refined signals from the retrieval model guide the reconstruction model to improve its reconstruction ability. Consequently, the retrieval model localizes visual differences, while the reconstruction model encodes these differences into category-agnostic representations. This representation is then transferred to the retrieval model through knowledge distillation for efficient deployment. Quantitative and qualitative evaluations demonstrate that our AdvRF achieves impressive performance on both widely-used fine-grained and coarse-grained datasets.
[144] Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards
Aybora Koksal, A. Aydin Alatan
Main category: cs.CV
TL;DR: A few-shot reinforcement learning framework (RLVR) for satellite imagery eliminates the need for caption supervision, using lightweight rewards. It achieves strong performance with minimal data.
Details
Motivation: Specialized domains like remote sensing lack annotated data, making large models impractical. RLVR addresses this by reducing reliance on costly supervision.Method: Uses policy-gradient optimization with rule-based binary or IoU-based rewards, adapting the 1-shot RLVR paradigm from language to vision-language models.
Result: Even one example improves performance; scaling to 128 examples matches models trained on thousands. Task-specific overfitting is mild, with robust generalization.
Conclusion: RLVR enables cost-effective, data-efficient development of domain-specialist models, offering a practical solution for data-scarce fields.
Abstract: Recent advances in large language and vision-language models have enabled strong reasoning capabilities, yet they remain impractical for specialized domains like remote sensing, where annotated data is scarce and expensive. We present the first few-shot reinforcement learning with verifiable reward (RLVR) framework for satellite imagery that eliminates the need for caption supervision–relying solely on lightweight, rule-based binary or IoU-based rewards. Adapting the “1-shot RLVR” paradigm from language models to vision-language models, we employ policy-gradient optimization with as few as one curated example to align model outputs for satellite reasoning tasks. Comprehensive experiments across multiple remote sensing benchmarks–including classification, visual question answering, and grounding–show that even a single example yields substantial improvements over the base model. Scaling to 128 examples matches or exceeds models trained on thousands of annotated samples. While the extreme one-shot setting can induce mild, task-specific overfitting, our approach consistently demonstrates robust generalization and efficiency across diverse tasks. Further, we find that prompt design and loss weighting significantly influence training stability and final accuracy. Our method enables cost-effective and data-efficient development of domain-specialist vision-language reasoning models, offering a pragmatic recipe for data-scarce fields: start from a compact VLM, curate a handful of reward-checkable cases, and train via RLVR.
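The lightweight, verifiable rewards the paper relies on are easy to make concrete; here is a standard IoU-thresholded reward for the grounding task (the threshold value is our assumption):

```python
def iou_reward(pred_box, gt_box, threshold=0.5):
    """Boxes as (x1, y1, x2, y2); returns 1.0 if IoU exceeds the threshold."""
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(pred_box) + area(gt_box) - inter + 1e-8)
    return 1.0 if iou >= threshold else 0.0

print(iou_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # IoU ~ 0.14 -> reward 0.0
```

Because the reward is rule-based and binary, no reward model or caption supervision is needed, which is what makes the few-shot regime workable.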
[145] LiteFat: Lightweight Spatio-Temporal Graph Learning for Real-Time Driver Fatigue Detection
Jing Ren, Suyu Ma, Hong Jia, Xiwei Xu, Ivan Lee, Haytham Fayek, Xiaodong Li, Feng Xia
Main category: cs.CV
TL;DR: LiteFat is a lightweight spatio-temporal graph learning model for efficient driver fatigue detection, reducing computational demands while maintaining accuracy.
Details
Motivation: Driver fatigue is a major cause of accidents, but existing deep learning solutions are too resource-intensive for embedded devices.Method: Converts video data into spatio-temporal graphs using facial landmarks, employs MobileNet for feature extraction, and uses a lightweight graph neural network for fatigue detection.
Result: LiteFat achieves competitive accuracy with significantly lower computational complexity and latency compared to state-of-the-art methods.
Conclusion: Enables real-time, resource-efficient fatigue detection for embedded robotic devices like intelligent vehicles.
Abstract: Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for embedded robotic devices with limited resources (such as intelligent vehicles/cars) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat involves converting streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. LiteFat uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly decreasing computational complexity and latency as compared to current state-of-the-art methods. This work enables the development of real-time, resource-efficient human fatigue detection systems that can be implemented upon embedded robotic devices.
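A sketch of turning per-frame facial landmarks into a spatio-temporal graph (our construction under stated assumptions; LiteFat's exact topology is not specified in the abstract): spatial edges link each landmark to its nearest neighbors within a frame, temporal edges link each landmark to itself in the next frame.

```python
import numpy as np

def build_stg(landmarks: np.ndarray, k: int = 3):
    """landmarks: (T, N, 2) -> list of (t, i, t', j) spatial/temporal edges."""
    T, N, _ = landmarks.shape
    edges = []
    for t in range(T):
        d = np.linalg.norm(landmarks[t, :, None] - landmarks[t, None], axis=-1)
        for i in range(N):
            for j in np.argsort(d[i])[1:k + 1]:      # k nearest neighbors, skip self
                edges.append((t, i, t, int(j)))      # spatial edge
        if t + 1 < T:
            edges += [(t, i, t + 1, i) for i in range(N)]  # temporal edges
    return edges

lm = np.random.rand(5, 68, 2)  # 5 frames, 68 facial landmarks
print(len(build_stg(lm)))
```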
[146] MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions
YiZhou Li
Main category: cs.CV
TL;DR: MoR-ViT introduces a token-level dynamic recursion mechanism for Vision Transformers, reducing parameters by 70% and speeding up inference by 2.5x while maintaining high accuracy.
Details
Motivation: Standard ViTs suffer from parameter redundancy and high computational costs, limiting practical deployment. Existing methods focus on static compression or fixed-depth processing, lacking adaptability.Method: MoR-ViT uses a Mixture-of-Recursions (MoR) paradigm to enable tokens to adaptively determine their processing depth, dynamically allocating computational resources.
Result: MoR-ViT achieves state-of-the-art accuracy on ImageNet-1K and transfer benchmarks, with significant parameter reduction and inference speedup, outperforming DynamicViT and TinyViT.
Conclusion: Dynamic recursion is an effective strategy for efficient ViTs, offering scalability and deployability for real-world applications.
Abstract: Vision Transformers (ViTs) have achieved remarkable success in image recognition, yet standard ViT architectures are hampered by substantial parameter redundancy and high computational cost, limiting their practical deployment. While recent efforts on efficient ViTs primarily focus on static model compression or token-level sparsification, they remain constrained by fixed computational depth for all tokens. In this work, we present MoR-ViT, a novel vision transformer framework that, for the first time, incorporates a token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions (MoR) paradigm. This approach enables each token to adaptively determine its processing depth, yielding a flexible and input-dependent allocation of computational resources. Extensive experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration, but also outperforms leading efficient ViT baselines such as DynamicViT and TinyViT under comparable conditions. These results establish dynamic recursion as an effective strategy for efficient vision transformers and open new avenues for scalable and deployable deep learning models in real-world scenarios.
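A toy sketch of token-level dynamic recursion in the MoR spirit (assumed mechanics, not the MoR-ViT implementation): a tiny router decides per token whether to run the shared block again or exit early. For clarity this version still computes the block for all tokens; a real implementation would gather only the active ones.

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    def __init__(self, dim: int, max_depth: int = 4):
        super().__init__()
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.router = nn.Linear(dim, 1)   # per-token "continue" score
        self.max_depth = max_depth

    def forward(self, x):                 # x: (B, N, D)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            x = torch.where(active.unsqueeze(-1), x + self.block(x), x)
            active = active & (torch.sigmoid(self.router(x)).squeeze(-1) > 0.5)
        return x

print(RecursiveBlock(32)(torch.randn(2, 16, 32)).shape)  # torch.Size([2, 16, 32])
```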
[147] AU-LLM: Micro-Expression Action Unit Detection via Enhanced LLM-Based Feature Fusion
Zhishu Liu, Kaishen Yuan, Bo Zhao, Yong Xu, Zitong Yu
Main category: cs.CV
TL;DR: The paper introduces AU-LLM, a framework using LLMs for micro-expression AU detection, addressing data scarcity and vision-language gaps with an Enhanced Fusion Projector (EFP). It achieves state-of-the-art results on CASME II and SAMM datasets.
Details
Motivation: Micro-expression AU detection is challenging due to subtle intensities and data scarcity. LLMs' reasoning abilities are unexplored in this domain, prompting the development of AU-LLM.Method: AU-LLM uses a 3D-CNN backbone and EFP (a Multi-Layer Perceptron) to fuse mid-level and high-level visual features into a compact token for LLM-based reasoning.
Result: AU-LLM achieves state-of-the-art performance on CASME II and SAMM datasets under LOSO and cross-domain protocols.
Conclusion: The work demonstrates the robustness and potential of LLM-based reasoning for micro-expression analysis, validated by benchmark results.
Abstract: The detection of micro-expression Action Units (AUs) is a formidable challenge in affective computing, pivotal for decoding subtle, involuntary human emotions. While Large Language Models (LLMs) demonstrate profound reasoning abilities, their application to the fine-grained, low-intensity domain of micro-expression AU detection remains unexplored. This paper pioneers this direction by introducing AU-LLM, a novel framework that, for the first time, uses an LLM to detect AUs in micro-expression datasets despite their subtle intensities and data scarcity. We specifically address the critical vision-language semantic gap with the Enhanced Fusion Projector (EFP). The EFP employs a Multi-Layer Perceptron (MLP) to intelligently fuse mid-level (local texture) and high-level (global semantics) visual features from a specialized 3D-CNN backbone into a single, information-dense token. This compact representation effectively empowers the LLM to perform nuanced reasoning over subtle facial muscle movements. Through extensive evaluations on the benchmark CASME II and SAMM datasets, including stringent Leave-One-Subject-Out (LOSO) and cross-domain protocols, AU-LLM establishes a new state-of-the-art, validating the significant potential and robustness of LLM-based reasoning for micro-expression analysis. The codes are available at https://github.com/ZS-liu-JLU/AU-LLMs.
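A minimal sketch of an EFP-style fusion head (the dimensions are hypothetical): concatenate mid-level and high-level visual features and project them with an MLP into a single token in the LLM's embedding space.

```python
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    """Fuse two feature levels into one information-dense token for the LLM."""
    def __init__(self, mid_dim=256, high_dim=512, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(mid_dim + high_dim, 1024), nn.GELU(), nn.Linear(1024, llm_dim)
        )

    def forward(self, f_mid, f_high):  # (B, mid_dim), (B, high_dim)
        return self.mlp(torch.cat([f_mid, f_high], dim=-1)).unsqueeze(1)  # (B, 1, llm_dim)

token = FusionProjector()(torch.randn(2, 256), torch.randn(2, 512))
print(token.shape)  # torch.Size([2, 1, 4096])
```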
[148] MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning
Zhaolong Wang, Tongfeng Sun, Mingzheng Du, Yachao Huang
Main category: cs.CV
TL;DR: MSGCoOp enhances few-shot generalization in VLMs by using parallel context vectors and semantic guidance from LLMs, improving performance and robustness.
Details
Motivation: Existing prompt learning methods struggle with novel class generalization due to overfitting and computational inefficiency.Method: Proposes MSGCoOp with parallel learnable context vectors, semantic guidance from LLMs, and diversity regularization.
Result: Achieves 1.10% average improvement in base-to-novel generalization and better cross-domain robustness.
Conclusion: MSGCoOp is an efficient and effective framework for improving few-shot generalization in VLMs.
Abstract: Vision-language pre-trained models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, and prompt learning has emerged as an efficient alternative to full fine-tuning. However, existing methods often struggle with generalization to novel classes, a phenomenon attributed to overfitting on seen classes and forgetting general knowledge. Furthermore, recent approaches that improve generalization often introduce complex architectures or heavy computational overhead. In this paper, we propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization while maintaining computational efficiency. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. To enrich these prompts, we introduce a semantic guidance mechanism that aligns them with comprehensive class descriptions automatically generated by a Large Language Model (LLM). Furthermore, a diversity regularization loss encourages the prompts to learn complementary and orthogonal features, preventing them from collapsing into redundant representations. Extensive experiments on 11 benchmark datasets show that MSGCoOp significantly improves performance on base-to-novel generalization, achieving an average harmonic mean improvement of 1.10% over the strong KgCoOp baseline. Our method also demonstrates enhanced robustness in cross-domain generalization tasks. Our code is available at: https://github.com/Rain-Bus/MSGCoOp.
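A sketch of the diversity regularizer described above (our reading of it): penalize pairwise cosine similarity between the parallel context vectors so the prompts stay complementary rather than collapsing together.

```python
import torch
import torch.nn.functional as F

def diversity_loss(contexts: torch.Tensor) -> torch.Tensor:
    """contexts: (K, D), one pooled embedding per learnable prompt."""
    z = F.normalize(contexts, dim=-1)
    sim = z @ z.T                           # (K, K) cosine similarities
    off_diag = sim - torch.eye(len(z))      # ignore self-similarity
    return off_diag.pow(2).sum() / (len(z) * (len(z) - 1))

print(diversity_loss(torch.randn(4, 512)).item())  # near 0 when prompts are orthogonal
```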
[149] Distribution-Based Masked Medical Vision-Language Model Using Structured Reports
Shreyank N Gowda, Ruichi Zhang, Xiao Gu, Ying Weng, Lu Yang
Main category: cs.CV
TL;DR: An uncertainty-aware medical image-text pre-training model is introduced to improve generalization in medical image analysis by leveraging structured text reports and modeling ambiguity.
Details
Motivation: Existing models struggle with variability and ambiguity in medical data, limiting their ability to capture nuanced clinical information.
Method: Utilizes structured text reports (definition, appearance, observations, verdicts) from an LLM to augment image data and models inter- and intra-modal uncertainty.
Result: Achieves state-of-the-art performance on multiple downstream tasks by improving representations and handling ambiguity.
Conclusion: The framework advances medical image-text pre-training by effectively capturing clinical uncertainty and enhancing generalization.
Abstract: Medical image-language pre-training aims to align medical images with clinically relevant text to improve model performance on various downstream tasks. However, existing models often struggle with the variability and ambiguity inherent in medical data, limiting their ability to capture nuanced clinical information and uncertainty. This work introduces an uncertainty-aware medical image-text pre-training model that enhances generalization capabilities in medical image analysis. Building on previous methods and focusing on Chest X-Rays, our approach utilizes structured text reports generated by a large language model (LLM) to augment image data with clinically relevant context. These reports begin with a definition of the disease, followed by an 'appearance' section to highlight critical regions of interest, and finally 'observations' and 'verdicts' that ground model predictions in clinical semantics. By modeling both inter- and intra-modal uncertainty, our framework captures the inherent ambiguity in medical images and text, yielding improved representations and performance on downstream tasks. Our model demonstrates significant advances in medical image-text pre-training, obtaining state-of-the-art performance on multiple downstream tasks.
[150] HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D Worlds from Words or Pixels
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, Yihang Lian, Yulin Tsai, Lifu Wang, Sicong Liu, Puhua Jiang, Xianghui Yang, Dongyuan Guo, Yixuan Tang, Xinyue Mao, Jiaao Yu, Junlin Yu, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Chao Zhang, Yonghao Tan, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Minghui Chen, Zhan Li, Wangchen Qin, Lei Wang, Yifu Sun, Lin Niu, Xiang Yuan, Xiaofeng Yang, Yingping He, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Tian Liu, Peng Chen, Di Wang, Yuhong Liu, Linus, Jie Jiang, Tengfei Wang, Chunchao Guo
Main category: cs.CV
TL;DR: HunyuanWorld 1.0 is a novel framework for generating immersive, explorable, and interactive 3D scenes from text and image inputs, combining the strengths of video-based and 3D-based methods.
Details
Motivation: Existing methods for 3D world generation either lack 3D consistency (video-based) or struggle with limited data and inefficient representations (3D-based). HunyuanWorld 1.0 aims to bridge these gaps.
Method: The framework uses a semantically layered 3D mesh representation with panoramic world proxies for semantic-aware decomposition and reconstruction. It features 360° immersion, mesh export, and disentangled object representations.
Result: The method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds, with applications in VR, simulation, gaming, and content creation.
Conclusion: HunyuanWorld 1.0 successfully combines the advantages of existing approaches, offering a versatile and efficient solution for 3D world generation.
Abstract: Creating immersive and playable 3D worlds from texts or images remains a fundamental challenge in computer vision and graphics. Existing world generation approaches typically fall into two categories: video-based methods that offer rich diversity but lack 3D consistency and rendering efficiency, and 3D-based methods that provide geometric consistency but struggle with limited training data and memory-inefficient representations. To address these limitations, we present HunyuanWorld 1.0, a novel framework that combines the best of both worlds for generating immersive, explorable, and interactive 3D scenes from text and image conditions. Our approach features three key advantages: 1) 360° immersive experiences via panoramic world proxies; 2) mesh export capabilities for seamless compatibility with existing computer graphics pipelines; 3) disentangled object representations for augmented interactivity. The core of our framework is a semantically layered 3D mesh representation that leverages panoramic images as 360° world proxies for semantic-aware world decomposition and reconstruction, enabling the generation of diverse 3D worlds. Extensive experiments demonstrate that our method achieves state-of-the-art performance in generating coherent, explorable, and interactive 3D worlds while enabling versatile applications in virtual reality, physical simulation, game development, and interactive content creation.
[151] Anyone Can Jailbreak: Prompt-Based Attacks on LLMs and T2Is
Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
Main category: cs.CV
TL;DR: The paper investigates how non-experts bypass safety mechanisms in LLMs and T2I systems using low-effort, high-impact jailbreak techniques, proposing a taxonomy of strategies and highlighting vulnerabilities in moderation pipelines.
Details
Motivation: Despite advancements in alignment and moderation, LLMs and T2I systems are still vulnerable to jailbreaks crafted by everyday users, necessitating a deeper understanding of these exploits.
Method: The study conducts a systems-style investigation, analyzing empirical case studies of jailbreak techniques (e.g., multi-turn narrative escalation, lexical camouflage) across popular APIs.
Result: The analysis shows that all moderation pipeline stages can be bypassed using accessible strategies, revealing systemic vulnerabilities.
Conclusion: The paper underscores the urgent need for context-aware defenses to address the reproducibility of jailbreaks in real-world settings.
Abstract: Despite significant advancements in alignment and content moderation, large language models (LLMs) and text-to-image (T2I) systems remain vulnerable to prompt-based attacks known as jailbreaks. Unlike traditional adversarial examples requiring expert knowledge, many of today’s jailbreaks are low-effort, high-impact attacks crafted by everyday users with nothing more than cleverly worded prompts. This paper presents a systems-style investigation into how non-experts reliably circumvent safety mechanisms through techniques such as multi-turn narrative escalation, lexical camouflage, implication chaining, fictional impersonation, and subtle semantic edits. We propose a unified taxonomy of prompt-level jailbreak strategies spanning both text-output and T2I models, grounded in empirical case studies across popular APIs. Our analysis reveals that every stage of the moderation pipeline, from input filtering to output validation, can be bypassed with accessible strategies. We conclude by highlighting the urgent need for context-aware defenses that reflect the ease with which these jailbreaks can be reproduced in real-world settings.
[152] Cross-Architecture Distillation Made Simple with Redundancy Suppression
Weijia Zhang, Yuehao Liu, Wu Ran, Chao Ma
Main category: cs.CV
TL;DR: A simple method for cross-architecture knowledge distillation is proposed, focusing on suppressing redundant information without complex designs.
Details
Motivation: Existing methods are inefficient due to sophisticated modules and excessive parameters, limiting their applicability.
Method: Proposes a redundancy suppression distillation (RSD) loss with cross-architecture invariance maximization and feature decorrelation. Includes a lightweight module to preserve student-specific capabilities.
Result: Outperforms OFA on CIFAR-100 and ImageNet-1k with fewer parameters.
Conclusion: The method is a simple, efficient baseline for cross-architecture distillation.
Abstract: We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximisation and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student’s internal representations. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA. It outperforms OFA on CIFAR-100 and ImageNet-1k benchmarks with only a fraction of their parameter overhead, which highlights its potential as a simple and strong baseline for the cross-architecture distillation community.
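The abstract describes the RSD loss as cross-architecture invariance maximisation plus feature decorrelation. A Barlow-Twins-style sketch of such an objective is shown below; the paper's exact formulation, normalization, and weighting are not reproduced, so treat this as an assumption-laden illustration.

```python
import torch

def rsd_like_loss(f_s, f_t, lam=5e-3):
    # f_s, f_t: (B, D) student/teacher features projected to a shared space
    f_s = (f_s - f_s.mean(0)) / (f_s.std(0) + 1e-6)
    f_t = (f_t - f_t.mean(0)) / (f_t.std(0) + 1e-6)
    c = (f_s.t() @ f_t) / f_s.size(0)                  # (D, D) cross-correlation
    invariance = ((torch.diagonal(c) - 1) ** 2).sum()  # align shared content
    redundancy = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return invariance + lam * redundancy

loss = rsd_like_loss(torch.randn(32, 128), torch.randn(32, 128))
```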
[153] Unleashing the Power of Motion and Depth: A Selective Fusion Strategy for RGB-D Video Salient Object Detection
Jiahao He, Daerji Suolang, Keren Fu, Qijun Zhao
Main category: cs.CV
TL;DR: SMFNet is a novel selective cross-modal fusion framework for RGB-D VSOD, using pixel-level selective fusion and multi-dimensional attention to enhance feature representation, outperforming 19 state-of-the-art models.
Details
Motivation: Existing RGB-D VSOD models treat optical flow and depth equally, limiting their potential. The paper aims to address this by selectively fusing these modalities based on their contributions.
Method: Proposes SMFNet with a pixel-level selective fusion strategy (PSF) and a multi-dimensional selective attention module (MSAM) to integrate optical flow, depth, and RGB features effectively.
Result: SMFNet outperforms 19 state-of-the-art models on RDVS and DVisal datasets and shows efficacy on synthetic depth datasets.
Conclusion: SMFNet advances RGB-D VSOD by selectively leveraging motion and depth, validated by comprehensive benchmarks.
Abstract: Applying salient object detection (SOD) to RGB-D videos is an emerging task called RGB-D VSOD and has recently gained increasing interest, due to the considerable performance gains from incorporating motion and depth and the fact that RGB-D videos can now be easily captured in daily life. Existing RGB-D VSOD models have made different attempts to derive motion cues, among which extracting motion information explicitly from optical flow appears to be a more effective and promising alternative. Despite this, there remains a key issue: how to effectively utilize optical flow and depth to assist the RGB modality in SOD. Previous methods always treat optical flow and depth equally with respect to model designs, without explicitly considering their unequal contributions in individual scenarios, limiting the potential of motion and depth. To address this issue and unleash the power of motion and depth, we propose a novel selective cross-modal fusion framework (SMFNet) for RGB-D VSOD, incorporating a pixel-level selective fusion strategy (PSF) that achieves optimal fusion of optical flow and depth based on their actual contributions. Besides, we propose a multi-dimensional selective attention module (MSAM) to integrate the fused features derived from PSF with the remaining RGB modality at multiple dimensions, effectively enhancing feature representation to generate refined features. We conduct a comprehensive evaluation of SMFNet against 19 state-of-the-art models on both RDVS and DVisal datasets, making this the most comprehensive RGB-D VSOD benchmark to date; the results also demonstrate the superiority of SMFNet over other models. Meanwhile, evaluation on five video benchmark datasets incorporating synthetic depth validates the efficacy of SMFNet as well. Our code and benchmark results are made publicly available at https://github.com/Jia-hao999/SMFNet.
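The pixel-level selective fusion (PSF) idea of weighting optical flow against depth per pixel can be sketched with a 1x1-convolution gate, as below. The paper describes PSF only at this level of detail, so the gate design and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PixelSelectiveFusion(nn.Module):
    """Predict per-pixel weights deciding how much to trust flow vs. depth."""
    def __init__(self, dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, 2, kernel_size=1),
                                  nn.Softmax(dim=1))  # weights sum to 1 per pixel

    def forward(self, flow_feat, depth_feat):
        # flow_feat, depth_feat: (B, C, H, W)
        w = self.gate(torch.cat([flow_feat, depth_feat], dim=1))  # (B, 2, H, W)
        return w[:, :1] * flow_feat + w[:, 1:] * depth_feat

out = PixelSelectiveFusion()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```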
[154] Low-Cost Test-Time Adaptation for Robust Video Editing
Jianhui Wang, Yinda Chen, Yangfan He, Xinyuan Song, Yi Xin, Dapeng Zhang, Zhongwei Wan, Bin Li, Rongchao Zhang
Main category: cs.CV
TL;DR: Vid-TTA is a lightweight test-time adaptation framework for video editing, addressing temporal inconsistencies and prompt overfitting through self-supervised tasks and dynamic loss balancing.
Details
Motivation: Existing video editing methods struggle with temporal inconsistencies and overfitting to simple prompts, while requiring high computational resources and annotated data.
Method: Vid-TTA uses motion-aware frame reconstruction, prompt perturbation, and meta-learning for dynamic loss balancing to optimize each test video during inference.
Result: The framework improves temporal consistency, reduces prompt overfitting, and maintains low computational overhead.
Conclusion: Vid-TTA offers a plug-and-play performance boost for video editing models by addressing key challenges efficiently.
Abstract: Video editing is a critical component of content creation that transforms raw footage into coherent works aligned with specific visual and narrative objectives. Existing approaches face two major challenges: temporal inconsistencies due to failure in capturing complex motion patterns, and overfitting to simple prompts arising from limitations in UNet backbone architectures. While learning-based methods can enhance editing quality, they typically demand substantial computational resources and are constrained by the scarcity of high-quality annotated data. In this paper, we present Vid-TTA, a lightweight test-time adaptation framework that personalizes optimization for each test video during inference through self-supervised auxiliary tasks. Our approach incorporates a motion-aware frame reconstruction mechanism that identifies and preserves crucial movement regions, alongside a prompt perturbation and reconstruction strategy that strengthens model robustness to diverse textual descriptions. These innovations are orchestrated by a meta-learning driven dynamic loss balancing mechanism that adaptively adjusts the optimization process based on video characteristics. Extensive experiments demonstrate that Vid-TTA significantly improves video temporal consistency and mitigates prompt overfitting while maintaining low computational overhead, offering a plug-and-play performance boost for existing video editing models.
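A generic test-time adaptation loop in the spirit of Vid-TTA is sketched below: the model is optimized on a self-supervised reconstruction loss for each test video before the edit is produced. The `reconstruct` method, step count, and learning rate are hypothetical placeholders; the real system combines several auxiliary losses with meta-learned weights.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, video, steps=10, lr=1e-5):
    # video: (T, C, H, W) frames of the single test clip
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        recon = model.reconstruct(video)   # hypothetical masked-frame reconstruction
        loss = F.mse_loss(recon, video)    # self-supervised auxiliary objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                           # adapted model, then run the actual edit
```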
[155] MetaCLIP 2: A Worldwide Scaling Recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu
Main category: cs.CV
TL;DR: MetaCLIP 2 improves CLIP training by addressing challenges in multilingual data curation and performance, achieving state-of-the-art results in zero-shot classification and multilingual benchmarks.
Details
Motivation: To overcome the limitations of CLIP in handling non-English data and the 'curse of multilinguality' while improving performance on both English and multilingual tasks.
Method: Proposes MetaCLIP 2, a recipe for training CLIP from scratch on worldwide web-scale image-text pairs, with minimal changes to address data curation and performance issues.
Result: MetaCLIP 2 ViT-H/14 outperforms English-only CLIP by 0.8% in zero-shot ImageNet classification and sets new benchmarks in multilingual tasks like CVQA (57.4%), Babel-ImageNet (50.2%), and XM3600 (64.3%).
Conclusion: MetaCLIP 2 successfully addresses multilingual challenges, enhancing CLIP’s performance without system-level changes, and sets new standards for multilingual benchmarks.
Abstract: Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks from zero-shot classification and retrieval to serving as encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP’s training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than its English-only counterpart, i.e., the “curse of multilinguality” that is common in LLMs. Here, we present MetaCLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, MetaCLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state-of-the-art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.
[156] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding
Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel
Main category: cs.CV
TL;DR: The paper proposes a dual-model framework for Embodied Reference Understanding, improving referent prediction by integrating head-to-fingertip and wrist-to-fingertip directions with a Gaussian ray heatmap and CLIP-Aware Pointing Ensemble.
Details
Motivation: Existing methods struggle with multimodal integration (text, pointing, scene context) and oversimplify pointing assumptions, leading to suboptimal performance.
Method: A dual-model framework learns from head-to-fingertip and wrist-to-fingertip directions, using Gaussian ray heatmaps and a CLIP-Aware Pointing Ensemble for hybrid integration. An auxiliary object center prediction task enhances localization.
Result: The approach improves performance by ~4 mAP at 0.25 IoU on the YouRefIt dataset.
Conclusion: The dual-model framework and auxiliary tasks effectively enhance multimodal understanding and referent prediction in embodied reference tasks.
Abstract: We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual clues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Therefore, relying on a single line assumption can be overly simplistic and may lead to suboptimal performance. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
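The Gaussian ray heatmap supervision can be illustrated directly: pixels near the ray from an origin (head or wrist) through the fingertip receive high values that decay with perpendicular distance. Parameter names and the sigma value below are illustrative assumptions.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, origin, fingertip, sigma=8.0):
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    o = np.asarray(origin, dtype=np.float32)           # (x, y) head or wrist point
    d = np.asarray(fingertip, dtype=np.float32) - o
    d /= np.linalg.norm(d) + 1e-6                      # unit pointing direction
    px, py = xs - o[0], ys - o[1]
    t = np.clip(px * d[0] + py * d[1], 0, None)        # project onto ray, forward only
    dist2 = (px - t * d[0]) ** 2 + (py - t * d[1]) ** 2
    return np.exp(-dist2 / (2 * sigma ** 2))           # (h, w) heatmap in [0, 1]

hm = gaussian_ray_heatmap(480, 640, origin=(320, 100), fingertip=(360, 180))
```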
[157] Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs
Saeed Ghorbani
Main category: cs.CV
TL;DR: Aether Weaver is a multimodal narrative co-generation framework that integrates text, visuals, and sound for immersive storytelling, outperforming sequential pipelines.
Details
Motivation: To overcome limitations of sequential text-to-visual pipelines by enabling concurrent synthesis of narratives, visuals, and soundscapes for richer storytelling.
Method: Uses a Narrator (LLM) for text and prompts, a Director for scene graph management, a Narrative Arc Controller for story structure, and an Affective Tone Mapper for emotional consistency.
Result: Qualitative evaluations show enhanced narrative depth, visual fidelity, and emotional resonance compared to baselines.
Conclusion: Aether Weaver offers a robust platform for creative prototyping and immersive storytelling.
Abstract: We introduce Aether Weaver, a novel, integrated framework for multimodal narrative co-generation that overcomes limitations of sequential text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes, driven by a tightly integrated, co-generation mechanism. At its core, the Narrator, a large language model, generates narrative text and multimodal prompts, while the Director acts as a dynamic scene graph manager, and analyzes the text to build and maintain a structured representation of the story’s world, ensuring spatio-temporal and relational consistency for visual rendering and subsequent narrative generation. Additionally, a Narrative Arc Controller guides the high-level story structure, influencing multimodal affective consistency, further complemented by an Affective Tone Mapper that ensures congruent emotional expression across all modalities. Through qualitative evaluations on a diverse set of narrative prompts encompassing various genres, we demonstrate that Aether Weaver significantly enhances narrative depth, visual fidelity, and emotional resonance compared to cascaded baseline approaches. This integrated framework provides a robust platform for rapid creative prototyping and immersive storytelling experiences.
[158] Evaluating Deepfake Detectors in the Wild
Viacheslav Pirogov, Maksim Artemev
Main category: cs.CV
TL;DR: The paper evaluates modern deepfake detectors using a novel real-world testing procedure and a large dataset, finding detection remains challenging with low AUC scores.
Details
Motivation: Address the gap in testing deepfake detectors on real-world data to assess their practical effectiveness.
Method: Develop a comprehensive dataset of 500,000+ high-quality deepfake images and introduce a testing procedure mimicking real-world scenarios.
Result: Fewer than half of detectors achieved AUC >60%, with basic image manipulations significantly reducing performance.
Conclusion: Deepfake detection is still difficult, and current detectors struggle with real-world conditions.
Abstract: Deepfakes powered by advanced machine learning models present a significant and evolving threat to identity verification and the authenticity of digital media. Although numerous detectors have been developed to address this problem, their effectiveness has yet to be tested when applied to real-world data. In this work we evaluate modern deepfake detectors, introducing a novel testing procedure designed to mimic real-world scenarios for deepfake detection. Using state-of-the-art deepfake generation methods, we create a comprehensive dataset containing more than 500,000 high-quality deepfake images. Our analysis shows that detecting deepfakes still remains a challenging task. The evaluation shows that fewer than half of the deepfake detectors tested achieved an AUC score greater than 60%, with the lowest being 50%. We demonstrate that basic image manipulations, such as JPEG compression or image enhancement, can significantly reduce model performance. All code and data are publicly available at https://github.com/messlav/Deepfake-Detectors-in-the-Wild.
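The finding that simple manipulations degrade detectors suggests a cheap robustness probe: re-encode each image as JPEG at decreasing quality and watch the detector's score drift. The sketch below assumes `detector` is any callable returning a fake-probability for a PIL image; it is not the paper's evaluation harness.

```python
import io
from PIL import Image

def jpeg_roundtrip(img: Image.Image, quality: int) -> Image.Image:
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)  # lossy re-encode in memory
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def probe(detector, img):
    for q in (95, 75, 50, 30):
        print(f"quality={q}: p(fake)={detector(jpeg_roundtrip(img, q)):.3f}")
```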
[159] Predict Patient Self-reported Race from Skin Histological Images
Shengjia Chen, Ruchika Verma, Kevin Clare, Jannes Jegminat, Kuan-lin Huang, Brandon Veremis, Thomas Fuchs, Gabriele Campanella
Main category: cs.CV
TL;DR: The study explores AI’s unintended racial bias in computational pathology, finding deep learning models can predict race from dermatopathology slides, with attention mechanisms revealing morphological shortcuts like epidermal features.
Details
Motivation: To investigate whether AI models in computational pathology can learn unintended demographic biases, particularly related to race, and identify potential morphological shortcuts.
Method: Used a racially diverse multisite dataset, applied attention-based mechanisms to uncover race-associated features, and evaluated three dataset curation strategies to control confounders.
Result: Models predicted race with high performance for White and Black groups (AUC: 0.799, 0.762), but overall performance dropped to 0.663. Attention analysis identified epidermis as a key predictive feature.
Conclusion: Careful data curation and bias mitigation are crucial for equitable AI deployment in pathology.
Abstract: Artificial Intelligence (AI) has demonstrated success in computational pathology (CPath) for disease detection, biomarker classification, and prognosis prediction. However, its potential to learn unintended demographic biases, particularly those related to social determinants of health, remains understudied. This study investigates whether deep learning models can predict self-reported race from digitized dermatopathology slides and identifies potential morphological shortcuts. Using a multisite dataset with a racially diverse population, we apply an attention-based mechanism to uncover race-associated morphological features. After evaluating three dataset curation strategies to control for confounding factors, the final experiment showed that White and Black demographic groups retained high prediction performance (AUC: 0.799, 0.762), while overall performance dropped to 0.663. Attention analysis revealed the epidermis as a key predictive feature, with significant performance declines when these regions were removed. These findings highlight the need for careful data curation and bias mitigation to ensure equitable AI deployment in pathology. Code available at: https://github.com/sinai-computational-pathology/CPath_SAIF.
[160] ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
Nicola Fanelli, Gennaro Vessio, Giovanna Castellano
Main category: cs.CV
TL;DR: ArtSeek is a multimodal framework for art analysis using retrieval-augmented generation and multimodal LLMs, achieving state-of-the-art results without needing Wikidata/Wikipedia links.
Details
Motivation: To address the challenge of analyzing digitized artworks, which requires deep artistic and historical knowledge, without relying on external links.
Method: Combines multimodal retrieval, contrastive multitask classification, and agentic reasoning with the WikiFragments dataset.
Result: +8.4% F1 in style classification, +7.1 BLEU@1 in captioning, and qualitative success in interpreting obscure works.
Conclusion: ArtSeek generalizes to domains needing external knowledge, advancing scalable multimodal AI research.
Abstract: Analyzing digitized artworks presents unique challenges, requiring not only visual interpretation but also a deep understanding of rich artistic, contextual, and historical knowledge. We introduce ArtSeek, a multimodal framework for art analysis that combines multimodal large language models with retrieval-augmented generation. Unlike prior work, our pipeline relies only on image input, enabling applicability to artworks without links to Wikidata or Wikipedia, a common situation in most digitized collections. ArtSeek integrates three key components: an intelligent multimodal retrieval module based on late interaction retrieval, a contrastive multitask classification network for predicting artist, genre, style, media, and tags, and an agentic reasoning strategy enabled through in-context examples for complex visual question answering and artwork explanation via Qwen2.5-VL. Central to this approach is WikiFragments, a Wikipedia-scale dataset of image-text fragments curated to support knowledge-grounded multimodal reasoning. Our framework achieves state-of-the-art results on multiple benchmarks, including a +8.4% F1 improvement in style classification over GraphCLIP and a +7.1 BLEU@1 gain in captioning on ArtPedia. Qualitative analyses show that ArtSeek can interpret visual motifs, infer historical context, and retrieve relevant knowledge, even for obscure works. Though focused on visual arts, our approach generalizes to other domains requiring external knowledge, supporting scalable multimodal AI research. Both the dataset and the source code will be made publicly available at https://github.com/cilabuniba/artseek.
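ArtSeek's retrieval module is based on late interaction, for which ColBERT-style MaxSim scoring is the standard formulation: each query token is matched to its most similar document token, and the maxima are summed. Dimensions and normalization below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(q, d):
    # q: (Nq, D) query token embeddings; d: (Nd, D) document token embeddings
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    sim = q @ d.t()                       # (Nq, Nd) token-level similarities
    return sim.max(dim=1).values.sum()    # MaxSim per query token, then sum

score = late_interaction_score(torch.randn(12, 128), torch.randn(200, 128))
```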
[161] SwinECAT: A Transformer-based fundus disease classification model with Shifted Window Attention and Efficient Channel Attention
Peiran Gu, Teng Yao, Mengshen He, Fuhao Duan, Feiyan Liu, RenYuan Peng, Bao Ge
Main category: cs.CV
TL;DR: SwinECAT, a Transformer-based model combining Swin and ECA Attention, improves fundus image classification by addressing small lesions and subtle disease differences, achieving 88.29% accuracy on a 9-category dataset.
Details
Motivation: Challenges in fundus image analysis, such as small lesion areas and subtle inter-disease differences, reduce model accuracy and cause overfitting.
Method: Proposes SwinECAT, integrating Swin Attention for spatial structures and ECA Attention for critical feature channels, enhancing discriminative representation.
Result: Achieves 88.29% accuracy, weighted F1-score of 0.88, and macro F1-score of 0.90, outperforming baseline models.
Conclusion: SwinECAT sets a new benchmark for 9-category fundus disease classification, demonstrating superior performance on the EDID dataset.
Abstract: In recent years, artificial intelligence has been increasingly applied in the field of medical imaging. Among these applications, fundus image analysis presents special challenges, including small lesion areas in certain fundus diseases and subtle inter-disease differences, which can lead to reduced prediction accuracy and overfitting in the models. To address these challenges, this paper proposes the Transformer-based model SwinECAT, which combines Shifted Window (Swin) Attention with Efficient Channel Attention (ECA). SwinECAT leverages the Swin Attention mechanism in the Swin Transformer backbone to effectively capture local spatial structures and long-range dependencies within fundus images. The lightweight ECA mechanism is incorporated to guide the SwinECAT’s attention toward critical feature channels, enabling more discriminative feature representation. In contrast to previous studies that typically classify fundus images into 4 to 6 categories, this work expands fundus disease classification to 9 distinct types, thereby enhancing the granularity of diagnosis. We evaluate our method on the Eye Disease Image Dataset (EDID) containing 16,140 fundus images for 9-category classification. Experimental results demonstrate that SwinECAT achieves 88.29% accuracy, with weighted F1-score of 0.88 and macro F1-score of 0.90. The classification results of our proposed model SwinECAT significantly outperform the baseline Swin Transformer and multiple compared baseline models. To our knowledge, this represents the highest reported performance for 9-category classification on this public dataset.
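The ECA block that SwinECAT incorporates is compact enough to show in full: global average pooling followed by a 1-D convolution across the channel descriptor, with no dimensionality reduction. This follows the published ECA design; the kernel size of 3 is a typical choice, not necessarily the one used in this paper.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                    # (B, C) global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]  # channel re-weighting

out = ECA()(torch.randn(2, 96, 56, 56))
```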
[162] MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
Tianhong Gao, Yannian Fu, Weiqun Wu, Haixiao Yue, Shanshan Liu, Gang Zhang
Main category: cs.CV
TL;DR: MMAT-1M is a million-scale multimodal agent tuning dataset designed to enhance Chain-of-Thought (CoT), reflection, and dynamic tool usage in multimodal LLMs, achieving significant performance improvements.
Details
Motivation: The lack of a large-scale, high-quality agent tuning dataset in the multimodal domain limits the potential of multimodal LLMs.
Method: A four-stage data engine: curating multimodal datasets, generating rationales with GPT-4o, refining through reflection, and optionally compressing dialogues into a one-turn format.
Result: Fine-tuning on MMAT-1M improves performance, e.g., InternVL2.5-8B-RR gains 2.7% on eight benchmarks and 8.8% on Dyn-VQA.
Conclusion: MMAT-1M effectively enhances multimodal reasoning and tool-based capabilities, with the dataset publicly available.
Abstract: Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection (ORR) format. By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset’s effectiveness in enhancing multimodal reasoning and tool-based capabilities. The dataset is publicly available at https://github.com/VIS-MPU-Agent/MMAT-1M.
[163] Attention-Driven Multimodal Alignment for Long-term Action Quality Assessment
Xin Wang, Peng-Jie Li, Yuan-Yuan Shen
Main category: cs.CV
TL;DR: LMAC-Net improves long-term action quality assessment by aligning multimodal features (visual and audio) for better performance evaluation in artistic sports.
Details
Motivation: Existing methods fail to capture complex multimodal interactions and temporal dynamics in long-term AQA, especially in artistic sports like rhythmic gymnastics and figure skating.
Method: LMAC-Net uses a multimodal attention consistency mechanism, a local query encoder for temporal semantics, and a two-level score evaluation. It optimizes with attention-based and regression-based losses.
Result: LMAC-Net outperforms existing methods on RG and Fis-V datasets, showing better multimodal alignment and feature representation.
Conclusion: LMAC-Net effectively addresses the limitations of current AQA methods by enhancing multimodal collaboration and temporal tracking, proving its superiority in performance assessment.
Abstract: Long-term action quality assessment (AQA) focuses on evaluating the quality of human activities in videos lasting up to several minutes. This task plays an important role in the automated evaluation of artistic sports such as rhythmic gymnastics and figure skating, where both accurate motion execution and temporal synchronization with background music are essential for performance assessment. However, existing methods predominantly fall into two categories: unimodal approaches that rely solely on visual features, which are inadequate for modeling multimodal cues like music; and multimodal approaches that typically employ simple feature-level contrastive fusion, overlooking deep cross-modal collaboration and temporal dynamics. As a result, they struggle to capture complex interactions between modalities and fail to accurately track critical performance changes throughout extended sequences. To address these challenges, we propose the Long-term Multimodal Attention Consistency Network (LMAC-Net). LMAC-Net introduces a multimodal attention consistency mechanism to explicitly align multimodal features, enabling stable integration of visual and audio information and enhancing feature representations. Specifically, we introduce a multimodal local query encoder module to capture temporal semantics and cross-modal relations, and use a two-level score evaluation for interpretable results. In addition, attention-based and regression-based losses are applied to jointly optimize multimodal alignment and score fusion. Experiments conducted on the RG and Fis-V datasets demonstrate that LMAC-Net significantly outperforms existing methods, validating the effectiveness of our proposed approach.
[164] Enhancing Generalization in Data-free Quantization via Mixup-class Prompting
Jiwoong Park, Chaeun Lee, Yongseok Choi, Sein Park, Deokki Hong, Jungwook Choi
Main category: cs.CV
TL;DR: The paper introduces a mixup-class prompt strategy for data-free quantization (DFQ) to improve synthetic data diversity and robustness, enhancing post-training quantization (PTQ) performance, especially in low-bit scenarios.
Details
Motivation: PTQ struggles with limited calibration data under privacy constraints, and existing DFQ methods using single-class prompts suffer from polysemy and performance degradation.
Method: Proposes a mixup-based text prompting strategy (mixup-class prompt) to fuse multiple class labels, generating diverse synthetic data for PTQ.
Result: Outperforms state-of-the-art DFQ methods like GenQ, achieving new state-of-the-art accuracy in 2-bit weight, 4-bit activation (W2A4) quantization.
Conclusion: The mixup-class prompt enhances generalization and optimization stability in PTQ, setting a new benchmark for DFQ performance.
Abstract: Post-training quantization (PTQ) improves efficiency but struggles with limited calibration data, especially under privacy constraints. Data-free quantization (DFQ) mitigates this by generating synthetic images using generative models such as generative adversarial networks (GANs) and text-conditioned latent diffusion models (LDMs), while applying existing PTQ algorithms. However, the relationship between generated synthetic images and the generalizability of the quantized model during PTQ remains underexplored. Without investigating this relationship, synthetic images generated by previous prompt engineering methods based on single-class prompts suffer from issues such as polysemy, leading to performance degradation. We propose mixup-class prompt, a mixup-based text prompting strategy that fuses multiple class labels at the text prompt level to generate diverse, robust synthetic data. This approach enhances generalization, and improves optimization stability in PTQ. We provide quantitative insights through gradient norm and generalization error analysis. Experiments on convolutional neural networks (CNNs) and vision transformers (ViTs) show that our method consistently outperforms state-of-the-art DFQ methods like GenQ. Furthermore, it pushes the performance boundary in extremely low-bit scenarios, achieving new state-of-the-art accuracy in challenging 2-bit weight, 4-bit activation (W2A4) quantization.
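The core of the mixup-class prompt is simple to illustrate: sample two class labels and fuse them in a single text prompt for the image generator. The template wording below is an illustrative assumption; the paper's exact prompts may differ.

```python
import random

def mixup_class_prompt(labels, template="a photo of a {} and a {}"):
    a, b = random.sample(labels, 2)   # two distinct class labels
    return template.format(a, b)

classes = ["golden retriever", "tabby cat", "school bus", "espresso"]
print(mixup_class_prompt(classes))    # e.g. "a photo of a school bus and a tabby cat"
```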
[165] Contrast-Prior Enhanced Duality for Mask-Free Shadow Removal
Jiyu Wu, Yifan Liu, Jiancheng Huang, Mingfu Yan, Shifeng Chen
Main category: cs.CV
TL;DR: Proposes AGBA and FCFN for shadow removal without masks, achieving state-of-the-art results.
Details
Motivation: Existing methods rely on shadow masks, which are hard to obtain, and intrinsic cues like contrast are ambiguous in complex scenes.
Method: Uses AGBA to filter contrast cues and FCFN for restoring details via diffusion and frequency-contrast fusion.
Result: Achieves top performance among mask-free methods and competes with mask-based ones.
Conclusion: AGBA and FCFN effectively address shadow removal challenges without masks.
Abstract: Existing shadow removal methods often rely on shadow masks, which are challenging to acquire in real-world scenarios. Exploring intrinsic image cues, such as local contrast information, presents a potential alternative for guiding shadow removal in the absence of explicit masks. However, the cue’s inherent ambiguity becomes a critical limitation in complex scenes, where it can fail to distinguish true shadows from low-reflectance objects and intricate background textures. To address this, we propose the Adaptive Gated Dual-Branch Attention (AGBA) mechanism. AGBA dynamically filters and re-weights the contrast prior, effectively disentangling shadow features from confounding visual elements. Furthermore, to tackle the persistent challenge of restoring soft shadow boundaries and fine-grained details, we introduce a diffusion-based Frequency-Contrast Fusion Network (FCFN) that leverages high-frequency and contrast cues to guide the generative process. Extensive experiments demonstrate that our method achieves state-of-the-art results among mask-free approaches while maintaining competitive performance relative to mask-based methods.
[166] Mitigating Spurious Correlations in Weakly Supervised Semantic Segmentation via Cross-architecture Consistency Regularization
Zheyuan Zhang, Yen-chia Hsu
Main category: cs.CV
TL;DR: A novel weakly supervised semantic segmentation (WSSS) framework addresses co-occurrence bias in industrial smoke detection by using a teacher-student CNN-ViT model with knowledge transfer loss and post-processing.
Details
Motivation: Pixel-level labels are scarce, especially in domains like industrial smoke, where expert annotations are needed. Existing WSSS methods suffer from incomplete coverage and biased co-occurrence correlations.
Method: Proposes a teacher-student framework combining CNNs and ViTs, using knowledge transfer loss for cross-architecture consistency and post-processing to refine pseudo masks.
Result: The framework directly targets co-occurrence bias without external supervision, improving pseudo mask quality.
Conclusion: The approach offers a scalable solution to WSSS limitations in industrial smoke detection, enhancing segmentation accuracy.
Abstract: Scarcity of pixel-level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image-level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model’s inherent bias toward co-occurring context. To address this, we propose a novel WSSS framework that directly targets the co-occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher-student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross-architecture consistency by aligning internal representations. Additionally, we incorporate post-processing techniques to address partial coverage and further improve pseudo mask quality.
[167] PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction
Jiahui Ren, Mochu Xiang, Jiajun Zhu, Yuchao Dai
Main category: cs.CV
TL;DR: PanoSplatt3R is an unposed wide-baseline panorama reconstruction method that outperforms state-of-the-art methods without requiring precise pose information, excelling in novel view generation and depth estimation.
Details
Motivation: Existing panorama reconstruction methods rely heavily on accurate pose information, which is resource-intensive and noise-prone, limiting their practicality.
Method: PanoSplatt3R adapts reconstruction pretrainings from the perspective to the panoramic domain and introduces RoPE rolling for efficient domain transfer, modeling panorama periodicity.
Result: PanoSplatt3R significantly outperforms current methods in novel view generation and depth estimation, even without pose information.
Conclusion: PanoSplatt3R demonstrates strong generalization and practical potential, making it a viable solution for real-world applications.
Abstract: Wide-baseline panorama reconstruction has emerged as a highly effective and pivotal approach for not only achieving geometric reconstruction of the surrounding 3D environment, but also generating highly realistic and immersive novel views. Although existing methods have shown remarkable performance across various benchmarks, they are predominantly reliant on accurate pose information. In real-world scenarios, the acquisition of precise pose often requires additional computational resources and is highly susceptible to noise. These limitations hinder the broad applicability and practicality of such methods. In this paper, we present PanoSplatt3R, an unposed wide-baseline panorama reconstruction method. We extend and adapt the foundational reconstruction pretrainings from the perspective domain to the panoramic domain, thus enabling powerful generalization capabilities. To ensure a seamless and efficient domain-transfer process, we introduce RoPE rolling that spans rolled coordinates in rotary positional embeddings across different attention heads, maintaining a minimal modification to RoPE’s mechanism, while modeling the horizontal periodicity of panorama images. Comprehensive experiments demonstrate that PanoSplatt3R, even in the absence of pose information, significantly outperforms current state-of-the-art methods. This superiority is evident in both the generation of high-quality novel views and the accuracy of depth estimation, thereby showcasing its great potential for practical applications. Project page: https://npucvr.github.io/PanoSplatt3R
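The RoPE rolling idea can be sketched as giving each attention head horizontal position indices rolled by a different offset modulo the panorama width, so that the heads collectively model the 360-degree wrap-around. How these indices feed into the rotary rotation is omitted here; the offset schedule and shapes are assumptions.

```python
import torch

def rolled_positions(width, n_heads):
    base = torch.arange(width)
    offsets = torch.linspace(0, width, n_heads + 1)[:-1].long()  # one offset per head
    # (n_heads, width): head h sees horizontal positions rolled by offsets[h]
    return (base.unsqueeze(0) + offsets.unsqueeze(1)) % width

pos = rolled_positions(width=512, n_heads=8)  # fed per-head into the RoPE rotation
```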
[168] A Deep Learning Pipeline Using Synthetic Data to Improve Interpretation of Paper ECG Images
Xiaoyu Wang, Ramesh Nadarajah, Zhiqiang Zhang, David Wong
Main category: cs.CV
TL;DR: A deep learning framework for classifying paper-like ECG images into five diagnostic categories, addressing visual noise and fine-detailed pattern detection, achieving high AUROC scores.
Details
Motivation: Early detection of CVDs is crucial, and while ECGs are key, manual interpretation is time-consuming. Most ECG data are stored as images, requiring automated solutions.
Method: Proposes a pre-processing pipeline for noise reduction and a two-stage fine-tuning strategy using the ConvNeXt architecture, trained on synthetic and external datasets.
Result: Achieved AUROC scores of 0.9688 (validation) and 0.9677 (test), winning the 2024 British Heart Foundation Open Data Science Challenge.
Conclusion: The framework shows promise as a practical tool for automated ECG interpretation in clinical workflows.
Abstract: Cardiovascular diseases (CVDs) are the leading global cause of death, and early detection is essential to improve patient outcomes. Electrocardiograms (ECGs), especially 12-lead ECGs, play a key role in the identification of CVDs. These are routinely interpreted by human experts, a process that is time-consuming and requires expert knowledge. Historical research in this area has focused on automatic ECG interpretation from digital signals, with recent deep learning approaches achieving strong results. In practice, however, most ECG data in clinical practice are stored or shared in image form. To bridge this gap, we propose a deep learning framework designed specifically to classify paper-like ECG images into five main diagnostic categories. Our method was the winning entry to the 2024 British Heart Foundation Open Data Science Challenge. It addresses two main challenges of paper ECG classification: visual noise (e.g., shadows or creases) and the need to detect fine-detailed waveform patterns. We propose a pre-processing pipeline that reduces visual noise and a two-stage fine-tuning strategy: the model is first fine-tuned on synthetic and external ECG image datasets to learn domain-specific features, and then further fine-tuned on the target dataset to enhance disease-specific recognition. We adopt the ConvNeXt architecture as the backbone of our model. Our method achieved AUROC scores of 0.9688 on the public validation set and 0.9677 on the private test set of the British Heart Foundation Open Data Science Challenge, highlighting its potential as a practical tool for automated ECG interpretation in clinical workflows.
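The two-stage fine-tuning strategy can be outlined with torchvision's ConvNeXt: the same training routine is run first on synthetic/external ECG images and then, at a lower learning rate, on the target set. Loaders, epochs, learning rates, and the multi-label loss are placeholder assumptions, not the paper's settings.

```python
import torch
from torchvision.models import convnext_tiny

model = convnext_tiny(weights="IMAGENET1K_V1")
model.classifier[2] = torch.nn.Linear(model.classifier[2].in_features, 5)  # 5 categories

def finetune(model, loader, epochs, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()    # assuming multi-label diagnoses
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: learn domain features; Stage 2: sharpen disease-specific recognition.
# finetune(model, synthetic_loader, epochs=5, lr=1e-4)
# finetune(model, target_loader, epochs=10, lr=1e-5)
```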
[169] EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation
Zhijiang Li, Haoran He
Main category: cs.CV
TL;DR: EIFNet is a multi-modal fusion network for event-based semantic segmentation, addressing challenges of sparse event data and fusion with image data. It achieves state-of-the-art performance.
Details
Motivation: Event cameras offer high dynamic range and fine temporal resolution but face challenges in feature extraction and fusion with image data.
Method: EIFNet includes an Adaptive Event Feature Refinement Module (AEFRM), a Modality-Adaptive Recalibration Module (MARM), and a Multi-Head Attention Gated Fusion Module (MGFM) for feature alignment and fusion.
Result: EIFNet achieves state-of-the-art performance on DDD17-Semantic and DSEC-Semantic datasets.
Conclusion: EIFNet effectively addresses event-based semantic segmentation challenges, demonstrating robust performance.
Abstract: Event-based semantic segmentation explores the potential of event cameras, which offer high dynamic range and fine temporal resolution, to achieve robust scene understanding in challenging environments. Despite these advantages, the task remains difficult due to two main challenges: extracting reliable features from sparse and noisy event streams, and effectively fusing them with dense, semantically rich image data that differ in structure and representation. To address these issues, we propose EIFNet, a multi-modal fusion network that combines the strengths of both event and frame-based inputs. The network includes an Adaptive Event Feature Refinement Module (AEFRM), which improves event representations through multi-scale activity modeling and spatial attention. In addition, we introduce a Modality-Adaptive Recalibration Module (MARM) and a Multi-Head Attention Gated Fusion Module (MGFM), which align and integrate features across modalities using attention mechanisms and gated fusion strategies. Experiments on DDD17-Semantic and DSEC-Semantic datasets show that EIFNet achieves state-of-the-art performance, demonstrating its effectiveness in event-based semantic segmentation.
[170] Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition
Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang
Main category: cs.CV
TL;DR: The paper introduces a Motion-guided Modulation Network (MMN) to improve Micro-Action Recognition by capturing subtle motion cues, achieving state-of-the-art results.
Details
Motivation: Existing methods overlook subtle changes in Micro-Actions (MAs), limiting recognition accuracy.
Method: MMN includes Motion-guided Skeletal Modulation (MSM) for skeletal-level motion cues and Motion-guided Temporal Modulation (MTM) for frame-level motion patterns, with a motion consistency learning strategy.
Result: MMN outperforms on Micro-Action 52 and iMiGUE datasets, demonstrating superior skeleton-based micro-action recognition.
Conclusion: Explicitly modeling subtle motion cues is crucial for accurate micro-action recognition, as shown by MMN’s success.
Abstract: Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.
[171] ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models
Hyun Jun Yook, Ga San Jhun, Jae Hyun Cho, Min Jeon, Donghyun Kim, Tae Hyung Kim, Youn Kyu Lee
Main category: cs.CV
TL;DR: ZIUM is a zero-shot intent-aware adversarial attack method for unlearned models, improving attack customization and reducing computational costs.
Details
Motivation: Existing adversarial attacks on unlearned models struggle with intent alignment and high computational costs.
Method: Proposes ZIUM, a zero-shot intent-aware adversarial attack method, enabling flexible customization of target attack images without further optimization.
Result: ZIUM achieves higher attack success rates and significantly reduces attack time for previously attacked concepts.
Conclusion: ZIUM effectively addresses challenges in adversarial attacks on unlearned models, enhancing intent alignment and efficiency.
Abstract: Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods still face challenges in generating content that aligns with an attacker’s intent while incurring high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables the flexible customization of target attack images to reflect an attacker’s intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. The evaluation across various MU scenarios demonstrated ZIUM’s effectiveness in successfully customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.
[172] Staining and locking computer vision models without retraining
Oliver J. Sutton, Qinghua Zhou, George Leete, Alexander N. Gorban, Ivan Y. Tyukin
Main category: cs.CV
TL;DR: New methods for staining (watermarking) and locking computer vision models to protect intellectual property, without retraining, with provable guarantees and minimal performance impact.
Details
Motivation: To protect the intellectual property of computer vision model owners by embedding identifiable secret behaviors (staining) and restricting model usage without a secret trigger (locking).
Method: Directly modify a small number of model weights to embed stains or locks, requiring no fine-tuning or retraining. Unlocking involves inserting a trigger patch into input images.
Result: Effective staining and locking with minimal performance impact on unlocked models, supported by experimental validation.
Conclusion: The proposed methods offer practical, provable solutions for protecting model ownership and usage control in computer vision.
Abstract: We introduce new methods of staining and locking computer vision models, to protect their owners' intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, computable guarantees bounding their worst-case false positive rates. The stain and lock are implemented by directly modifying a small number of the model's weights and have minimal impact on the (unlocked) model's performance. Locked models are unlocked by inserting a small 'trigger patch' into the corner of the input image. We present experimental results showing the efficacy of our methods and demonstrating their practical performance on a variety of computer vision models.
[173] Bridging Synthetic and Real-World Domains: A Human-in-the-Loop Weakly-Supervised Framework for Industrial Toxic Emission Segmentation
Yida Tao, Yen-Chia Hsu
Main category: cs.CV
TL;DR: CEDANet integrates citizen-provided weak labels with adversarial feature alignment for industrial smoke segmentation, achieving significant performance gains without costly annotations.
Details
Motivation: High cost and scarcity of pixel-level annotations in industrial smoke segmentation hinder air-quality monitoring.
Method: CEDANet uses citizen votes to refine pseudo-labels and employs class-specific domain discriminators for feature alignment.
Result: Achieves F1-score of 0.414 and smoke-class IoU of 0.261, outperforming the baseline by five- and six-fold, respectively.
Conclusion: Combining citizen science with weakly supervised domain adaptation offers a scalable, cost-efficient solution for environmental monitoring.
Abstract: Industrial smoke segmentation is critical for air-quality monitoring and environmental protection but is often hampered by the high cost and scarcity of pixel-level annotations in real-world settings. We introduce CEDANet, a human-in-the-loop, class-aware domain adaptation framework that uniquely integrates weak, citizen-provided video-level labels with adversarial feature alignment. Specifically, we refine pseudo-labels generated by a source-trained segmentation model using citizen votes, and employ class-specific domain discriminators to transfer rich source-domain representations to the industrial domain. Comprehensive experiments on SMOKE5K and custom IJmond datasets demonstrate that CEDANet achieves an F1-score of 0.414 and a smoke-class IoU of 0.261 with citizen feedback, vastly outperforming the baseline model, which scored 0.083 and 0.043 respectively. This represents a five-fold increase in F1-score and a six-fold increase in smoke-class IoU. Notably, CEDANet with citizen-constrained pseudo-labels achieves performance comparable to the same architecture trained on a limited set of 100 fully annotated images (F1-score 0.418, IoU 0.264), demonstrating its ability to reach small-sample fully supervised accuracy without target-domain annotations. Our research validates the scalability and cost-efficiency of combining citizen science with weakly supervised domain adaptation, offering a practical solution for complex, data-scarce environmental monitoring applications.
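To make the human-in-the-loop step concrete, here is a minimal sketch of how video-level citizen votes could veto or keep a model's pseudo-label masks. The interface (`refine_pseudo_labels`, the 0/1 vote lists, the agreement threshold) is hypothetical; CEDANet's actual refinement rule is more involved than this.

```python
import numpy as np

def refine_pseudo_labels(pseudo_masks, citizen_votes, min_agreement=0.5):
    """Suppress pseudo-label masks for videos that citizens voted smoke-free.

    pseudo_masks:  (N, H, W) binary masks from a source-trained segmenter.
    citizen_votes: list of N lists of 0/1 video-level votes ("contains smoke?").
    Hypothetical interface; illustrative of the idea, not the paper's exact rule.
    """
    refined = []
    for mask, votes in zip(pseudo_masks, citizen_votes):
        agreement = np.mean(votes) if votes else 1.0  # no votes -> trust the model
        if agreement < min_agreement:
            refined.append(np.zeros_like(mask))       # citizens say "no smoke": clear mask
        else:
            refined.append(mask)
    return np.stack(refined)

# toy usage
masks = (np.random.rand(3, 4, 4) > 0.7).astype(np.uint8)
votes = [[1, 1, 0], [0, 0, 0], [1]]
print(refine_pseudo_labels(masks, votes).sum(axis=(1, 2)))  # second video zeroed out
```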
[174] See Different, Think Better: Visual Variations Mitigating Hallucinations in LVLMs
Ziyun Dai, Xiaoqiang Li, Shaohua Zhang, Yuanchen Wu, Jide Li
Main category: cs.CV
TL;DR: ViHallu is a vision-centric framework to reduce hallucinations in LVLMs by improving visual-semantic alignment through visual variation images and instructions.
Details
Motivation: Existing hallucination mitigation methods are text-centric and ineffective for fine-grained visual understanding.
Method: Uses visual variation images and constructed visual instructions to enhance alignment.
Result: Improves fine-grained visual understanding and reduces hallucinations in benchmarks.
Conclusion: ViHallu effectively mitigates hallucinations and enhances visual-semantic alignment, with a released dataset for further research.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in visual understanding and multimodal reasoning. However, LVLMs frequently exhibit hallucination phenomena, manifesting as generated textual responses that are inconsistent with the provided visual content. Existing hallucination mitigation methods are predominantly text-centric; the challenges of visual-semantic alignment significantly limit their effectiveness, especially in fine-grained visual understanding scenarios. To this end, this paper presents ViHallu, a Vision-Centric Hallucination mitigation framework that enhances visual-semantic alignment through Visual Variation Image Generation and Visual Instruction Construction. ViHallu introduces visual variation images with controllable visual alterations while maintaining the overall image structure. These images, combined with carefully constructed visual instructions, enable LVLMs to better understand fine-grained visual content through fine-tuning, allowing models to more precisely capture the correspondence between visual content and text, thereby enhancing visual-semantic alignment. Extensive experiments on multiple benchmarks show that ViHallu effectively enhances models' fine-grained visual understanding while significantly reducing hallucination tendencies. Furthermore, we release ViHallu-Instruction, a visual instruction dataset specifically designed for hallucination mitigation and visual-semantic alignment. Code is available at https://github.com/oliviadzy/ViHallu.
[175] VeS: Teaching Pixels to Listen Without Supervision
Sajay Raj
Main category: cs.CV
TL;DR: Dense audio-visual models perform well in low-resource multilingual settings, with dense token routing outperforming global pooling by 59% in retrieval accuracy.
Details
Motivation: To evaluate if dense AV models work in low-resource, multilingual, and noisy settings, unlike English-centric, caption-rich scenarios.
Method: Compared three contrastive objectives: global mean-pooled loss, dense max-mean token matcher, and a hybrid, using a multilingual Indian dataset.
Result: Dense objective improved R@1 by 59% over global pooling, with better retrieval ranks and sharp localization heatmaps.
Conclusion: Dense token routing is crucial in low-resource settings, outperforming global methods without fine-tuning vision backbones.
Abstract: Recent dense audio-visual (AV) models achieve impressive retrieval and emergent localization, but almost all evidence comes from English-centric, caption-rich web video. It is unclear whether these objectives survive in low-resource, code-switched, and noisy multilingual settings that typify developing regions. We show they do, and that the choice of aggregation function becomes even more critical. Using a multilingual subset of Project Vaani spanning dozens of Indian languages and dialectal variants, we compare three contrastive objectives: (i) a global mean-pooled loss (CLIP-style), (ii) a dense max-mean token matcher (DenseAV-style), and (iii) a simple hybrid (motivated by frozen-vision alignment strategies). The dense objective delivers a +59% relative R@1 (Audio Visual) improvement over global pooling and substantially lower mean/median ranks, while consistently producing sharp zero-shot localization heatmaps of spoken objects, despite keeping the vision backbone entirely frozen (no LoRA / partial fine-tuning). Our results demonstrate that dense token routing is not a luxury of high-resource English corpora; it is more decisive when annotations and acoustic cleanliness are scarce. We release the codebase and trained models.
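The two main aggregation functions being compared are simple to state in code. A minimal numpy sketch of the similarity scoring, leaving out normalization details and the contrastive loss wrapper:

```python
import numpy as np

def global_similarity(audio_tokens, visual_tokens):
    """CLIP-style: mean-pool each modality, then a single dot product."""
    a = audio_tokens.mean(axis=0)
    v = visual_tokens.mean(axis=0)
    return float(a @ v)

def dense_max_mean_similarity(audio_tokens, visual_tokens):
    """DenseAV-style: each audio token routes to its best visual patch,
    then scores are averaged over audio tokens."""
    sims = audio_tokens @ visual_tokens.T   # (Ta, Tv) token-pair similarities
    return float(sims.max(axis=1).mean())   # max over patches, mean over audio

rng = np.random.default_rng(0)
audio = rng.normal(size=(50, 64))    # Ta audio tokens, feature dim 64 (toy sizes)
visual = rng.normal(size=(196, 64))  # Tv visual patch tokens
print(global_similarity(audio, visual), dense_max_mean_similarity(audio, visual))
```

The max-mean form is what lets individual audio tokens attach to individual patches, which is also why localization heatmaps fall out of it for free.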
[176] XAI for Point Cloud Data using Perturbations based on Meaningful Segmentation
Raju Ningappa Mulawade, Christoph Garth, Alexander Wiebel
Main category: cs.CV
TL;DR: A novel segmentation-based XAI method for point cloud classification is proposed, featuring a point-shifting mechanism for perturbations, aiming to generate human-interpretable explanations.
Details
Motivation: Understanding AI decision-making in critical applications is crucial, especially for point cloud data classification, necessitating explainable and interpretable methods.
Method: Uses point cloud segmentation models and a novel point-shifting mechanism to introduce perturbations, generating saliency maps for explanations.
Result: Produces more meaningful saliency maps compared to classical clustering methods, enhancing human interpretability.
Conclusion: The method effectively generates interpretable explanations for point cloud classification, outperforming traditional approaches.
Abstract: We propose a novel segmentation-based explainable artificial intelligence (XAI) method for neural networks working on point cloud classification. As one building block of this method, we propose a novel point-shifting mechanism to introduce perturbations in point cloud data. Recently, AI has seen exponential growth. Hence, it is important to understand the decision-making process of AI algorithms when they are applied in critical areas. Our work focuses on explaining AI algorithms that classify point cloud data. An important aspect of the methods used for explaining AI algorithms is their ability to produce explanations that are easy for humans to understand. This allows them to analyze the AI algorithms better and make appropriate decisions based on that analysis. Therefore, in this work, we intend to generate meaningful explanations that can be easily interpreted by humans. The point cloud data we consider represents 3D objects such as cars, guitars, and laptops. We make use of point cloud segmentation models to generate explanations for the working of classification models. The segments are used to introduce perturbations into the input point cloud data and generate saliency maps. The perturbations are introduced using the novel point-shifting mechanism proposed in this work, which ensures that the shifted points no longer influence the output of the classification algorithm. In contrast to previous methods, the segments used by our method are meaningful, i.e., humans can easily interpret the meaning of the segments. Thus, the benefit of our method over other methods is its ability to produce more meaningful saliency maps. We compare our method with the use of classical clustering algorithms to generate explanations. We also analyze the saliency maps generated for example inputs using our method to demonstrate the usefulness of the method in generating meaningful explanations.
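A rough sketch of how segment-wise perturbation could yield a per-segment saliency score: shift one segment's points away from the shape and record how much the target class score drops. The radial shift direction and the `classify` interface are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def segment_saliency(points, segment_ids, classify, target_class, shift=10.0):
    """Per-segment saliency via point-shifting perturbation.

    points: (N, 3) point cloud; segment_ids: (N,) segment label per point.
    classify: function (N, 3) -> class probabilities (assumed interface).
    `shift` pushes a segment's points far from the centroid so they stop
    supporting the prediction; a stand-in for the paper's shifting mechanism.
    """
    base = classify(points)[target_class]
    centroid = points.mean(axis=0)
    saliency = {}
    for seg in np.unique(segment_ids):
        perturbed = points.copy()
        idx = segment_ids == seg
        direction = perturbed[idx] - centroid
        direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-8
        perturbed[idx] += shift * direction  # move this segment off the shape
        saliency[int(seg)] = base - classify(perturbed)[target_class]
    return saliency  # larger score drop => more important segment

# toy usage with a dummy classifier
dummy = lambda p: np.array([np.exp(-np.abs(p).mean()), 1 - np.exp(-np.abs(p).mean())])
pts, segs = np.random.rand(100, 3), np.random.randint(0, 4, 100)
print(segment_saliency(pts, segs, dummy, target_class=0))
```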
[177] From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning
Honglin He, Yukai Ma, Wayne Wu, Bolei Zhou
Main category: cs.CV
TL;DR: The paper introduces the Seeing-to-Experiencing (S2E) framework to enhance navigation foundation models by combining offline pre-training with reinforcement learning (RL) for improved interactivity and safety in real-world urban navigation.
Details
Motivation: Existing navigation foundation models, trained on offline data, lack reasoning about action consequences and adaptability, limiting their real-world applicability in interactive and safe navigation.
Method: S2E integrates pre-training on videos with RL post-training, introducing Anchor-Guided Distribution Matching for stable learning and a Residual-Attention Module to retain pretrained knowledge while adapting to simulations. NavBench-GS, a new benchmark, evaluates model generalizability and safety.
Result: S2E mitigates diminishing returns from offline data scaling, showing RL’s superiority over supervised fine-tuning for post-training in robot learning.
Conclusion: Integrating interactive online experiences is crucial for scaling foundation models in robotics, as demonstrated by S2E’s success in enhancing navigation capabilities.
Abstract: Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pre-training on videos and post-training through RL. It maintains the generalizability acquired from large-scale real-world videos while enhancing its interactivity through RL in simulation environments. Specifically, we introduce two innovations: an Anchor-Guided Distribution Matching strategy, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and a Residual-Attention Module, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3DGS reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models. Extensive experiments show that S2E mitigates the diminishing returns often seen when scaling with offline data alone. We perform a thorough analysis of the benefits of reinforcement learning compared to supervised fine-tuning in the context of post-training for robot learning. Our findings emphasize the crucial role of integrating interactive online experiences to effectively scale foundation models in robotics.
[178] Shallow Deep Learning Can Still Excel in Fine-Grained Few-Shot Learning
Chaofei Qi, Chao Ye, Zhitai Liu, Weiyang Lin, Jianbin Qiu
Main category: cs.CV
TL;DR: The paper explores the potential of shallow deep backbones like ConvNet-4 in fine-grained few-shot learning (FGFSL), proposing a location-aware constellation network (LCN-4) with a novel feature clustering module to outperform deeper backbones.
Details
Motivation: Shallow backbones like ConvNet-4 are underutilized in FGFSL due to their tendency to extract non-abstract visual attributes. The paper aims to re-evaluate their potential and propose improvements.
Method: Introduces LCN-4 with a location-aware feature clustering module, grid position encoding compensation, and frequency domain location embedding to address positional information loss.
Result: LCN-4 outperforms ConvNet-4-based methods and matches or surpasses ResNet12-based methods on three FGFSL benchmarks.
Conclusion: Shallow architectures like LCN-4 can achieve competitive or superior performance to deeper backbones in FGFSL, validating the proposed enhancements.
Abstract: Deep learning has seen extensive utilization across a wide spectrum of domains, including fine-grained few-shot learning (FGFSL), which heavily depends on deep backbones. Nonetheless, shallower backbones such as ConvNet-4 are not commonly preferred because they are prone to extracting a larger quantity of non-abstract visual attributes. In this paper, we initially re-evaluate the relationship between network depth and the ability to fully encode few-shot instances, and delve into whether a shallow architecture could achieve performance comparable or superior to mainstream deep backbones. Inspired by vanilla ConvNet-4, we introduce a location-aware constellation network (LCN-4), equipped with a cutting-edge location-aware feature clustering module. This module can proficiently encode and integrate spatial feature fusion, feature clustering, and recessive feature location, thereby significantly minimizing the overall loss. Specifically, we put forward a general grid position encoding compensation to effectively address the loss of positional information during the feature extraction of ordinary convolutions. Additionally, we propose a general frequency-domain location embedding technique to compensate for the location loss in clustered features. We have carried out validation on three representative fine-grained few-shot benchmarks. Experiments establish that LCN-4 notably outperforms ConvNet-4-based state-of-the-art methods and achieves performance on par with or superior to most ResNet12-based methods, confirming our conjecture.
[179] Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
Main category: cs.CV
TL;DR: Ov3R is a framework for open-vocabulary 3D reconstruction from RGB videos, integrating CLIP semantics for consistent geometry and fine-grained semantic alignment.
Details
Motivation: To advance Spatial AI by enabling real-time, semantics-aware 3D reconstruction with open-vocabulary capabilities.
Method: Uses CLIP3R for CLIP-informed 3D reconstruction and 2D-3D OVS for lifting 2D features into 3D with fused descriptors.
Result: Achieves state-of-the-art performance in dense 3D reconstruction and open-vocabulary 3D segmentation.
Conclusion: Ov3R marks progress toward real-time, semantics-aware Spatial AI.
Abstract: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
[180] MetaLab: Few-Shot Game Changer for Image Recognition
Chaofei Qi, Zhitai Liu, Jianbin Qiu
Main category: cs.CV
TL;DR: Proposes MetaLab, a method for few-shot image recognition using CIELab color space and collaborative neural networks, achieving near-human accuracy.
Details
Motivation: Addressing the technical gaps in few-shot image recognition compared to large-scale methods.
Method: Uses two networks: LabNet for domain transformation and feature extraction, and LabGNN for mutual learning between lightness and color graphs.
Result: Achieves ~99% accuracy on benchmarks, showing robust performance and generalization with minimal samples.
Conclusion: MetaLab effectively bridges the gap in few-shot recognition, nearing human-level accuracy.
Abstract: Few-shot image recognition has significant application prospects, yet substantial technical gaps remain relative to conventional large-scale image recognition. In this paper, we propose an efficient original method for few-shot image recognition, called CIELab-Guided Coherent Meta-Learning (MetaLab). Structurally, our MetaLab comprises two collaborative neural networks: LabNet, which performs domain transformation to the CIELab color space and extracts rich grouped features, and the coherent LabGNN, which facilitates mutual learning between the lightness graph and the color graph. For thorough validation, we carried out extensive comparative studies on four coarse-grained benchmarks, four fine-grained benchmarks, and four cross-domain few-shot benchmarks. Specifically, our method achieves high accuracy, robust performance, and effective generalization capability with one-shot samples per class. Overall, all experiments demonstrate that MetaLab can approach 99% accuracy, reaching the human recognition ceiling with little visual deviation.
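The CIELab-guided split at the heart of MetaLab is easy to illustrate. A minimal sketch of the domain transformation feeding the lightness and color streams, using scikit-image's `rgb2lab`; everything downstream (LabNet's grouped features, the LabGNN graphs) is omitted:

```python
import numpy as np
from skimage.color import rgb2lab  # pip install scikit-image

def split_cielab(rgb_image):
    """Convert RGB to CIELab and split into the two streams that MetaLab
    reportedly builds its graphs on: lightness (L) and color (a, b).
    The split follows the paper's description; the rest is omitted."""
    lab = rgb2lab(rgb_image)      # (H, W, 3); L in [0, 100], a/b roughly [-128, 127]
    lightness = lab[..., :1]      # input for the lightness graph
    color = lab[..., 1:]          # input for the color graph
    return lightness, color

img = np.random.rand(32, 32, 3)   # toy RGB image with values in [0, 1]
L, ab = split_cielab(img)
print(L.shape, ab.shape)          # (32, 32, 1) (32, 32, 2)
```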
[181] X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
Main category: cs.CV
TL;DR: The paper introduces X-Omni, a framework using reinforcement learning to improve autoregressive image generation, achieving high-quality results and strong instruction-following capabilities.
Details
Motivation: Addressing the limitations of autoregressive modeling for image generation, such as low fidelity and distorted outputs, by leveraging reinforcement learning to enhance quality and unify image and language generation.
Method: Proposes X-Omni, combining a semantic image tokenizer, a unified autoregressive model for language and images, and an offline diffusion decoder for image generation.
Result: X-Omni achieves state-of-the-art performance in image generation with a 7B language model, producing high-quality images and excelling in instruction-following.
Conclusion: Reinforcement learning effectively mitigates autoregressive modeling issues, enabling seamless integration of image and language generation with superior results.
Abstract: Numerous efforts have been made to extend the "next token prediction" paradigm to visual contents, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributed to cumulative errors during autoregressive inference or information loss incurred during the discretization process. Probably due to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and largely enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation, termed X-Omni. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
[182] StepAL: Step-aware Active Learning for Cataract Surgical Videos
Nisarg A. Shah, Bardia Safaei, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel
Main category: cs.CV
TL;DR: StepAL is an active learning framework for surgical step recognition, outperforming traditional AL methods by selecting full videos for annotation, reducing costs while maintaining accuracy.
Details
Motivation: Traditional AL methods are ineffective for surgical videos due to inter-step dependencies and lack of context in frame/clip selection.
Method: StepAL uses step-aware feature representation and entropy-weighted clustering to prioritize uncertain and diverse videos for annotation.
Result: StepAL outperforms existing AL methods on cataract surgery datasets, achieving higher accuracy with fewer labeled videos.
Conclusion: StepAL reduces annotation burden in surgical video analysis, aiding the development of computer-assisted surgical systems.
Abstract: Active learning (AL) can reduce annotation costs in surgical video analysis while maintaining model performance. However, traditional AL methods, developed for images or short video clips, are suboptimal for surgical step recognition due to inter-step dependencies within long, untrimmed surgical videos. These methods typically select individual frames or clips for labeling, which is ineffective for surgical videos where annotators require the context of the entire video for annotation. To address this, we propose StepAL, an active learning framework designed for full video selection in surgical step recognition. StepAL integrates a step-aware feature representation, which leverages pseudo-labels to capture the distribution of predicted steps within each video, with an entropy-weighted clustering strategy. This combination prioritizes videos that are both uncertain and exhibit diverse step compositions for annotation. Experiments on two cataract surgery datasets (Cataract-1k and Cataract-101) demonstrate that StepAL consistently outperforms existing active learning approaches, achieving higher accuracy in step recognition with fewer labeled videos. StepAL offers an effective approach for efficient surgical video analysis, reducing the annotation burden in developing computer-assisted surgical systems.
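A compact sketch of the selection logic described above, paraphrasing "entropy-weighted clustering" as cluster-for-diversity plus pick-the-most-entropic-member; the paper's exact weighting scheme and step-aware features may differ:

```python
import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

def select_videos(step_probs, n_select, eps=1e-12):
    """Pick videos that are both uncertain and diverse for annotation.

    step_probs: (V, S) per-video distribution over predicted surgical steps,
    built from pseudo-labels (a stand-in for StepAL's step-aware features).
    """
    entropy = -(step_probs * np.log(step_probs + eps)).sum(axis=1)  # uncertainty
    km = KMeans(n_clusters=n_select, n_init=10).fit(step_probs)     # diversity
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        chosen.append(int(members[np.argmax(entropy[members])]))    # most uncertain per cluster
    return chosen

probs = np.random.dirichlet(np.ones(10), size=40)  # 40 videos, 10 surgical steps
print(select_videos(probs, n_select=5))            # indices of videos to annotate
```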
[183] MOVE: Motion-Guided Few-Shot Video Object Segmentation
Kaining Ying, Hengrui Hu, Henghui Ding
Main category: cs.CV
TL;DR: The paper introduces MOVE, a dataset for motion-guided few-shot video object segmentation (FSVOS), evaluates existing methods, and proposes a baseline approach (DMA) to address challenges in motion understanding.
Details
Motivation: Existing FSVOS datasets and methods focus on static object categories, ignoring temporal dynamics, limiting applications requiring motion understanding.
Method: Introduces MOVE dataset, evaluates 6 state-of-the-art methods, and proposes DMA, a baseline method for motion-guided FSVOS.
Result: Current methods struggle with motion-guided FSVOS; DMA outperforms them in few-shot motion understanding.
Conclusion: DMA establishes a foundation for future research in motion-guided FSVOS, addressing limitations of existing approaches.
Abstract: This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state-of-the-art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion-guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few-shot motion understanding, establishing a solid foundation for future research in this direction.
[184] Image Captioning via Compact Bidirectional Architecture
Zijie Song, Yuanen Zhou, Zhenzhen Hu, Daqing Liu, Huixia Ben, Richang Hong, Meng Wang
Main category: cs.CV
TL;DR: The paper introduces a Compact Bidirectional Transformer for image captioning, enabling bidirectional context use and parallel execution, outperforming unidirectional models.
Details
Motivation: Current unidirectional captioning models lack future context, and refinement-based models are sequential. The paper aims to leverage bidirectional context efficiently.
Method: A compact model combines left-to-right and right-to-left flows, allowing implicit and explicit bidirectional interaction, with sentence-level ensemble for final caption selection.
Result: The model achieves state-of-the-art results on MSCOCO, with the bidirectional architecture and ensemble being key factors.
Conclusion: The compact bidirectional architecture is effective and generalizable, as shown by its extension to LSTM.
Abstract: Most current image captioning models generate captions from left to right. This unidirectional property means they can only leverage past context, not future context. Though refinement-based models can exploit both past and future context by generating a new caption in the second stage based on pre-retrieved or pre-generated captions from the first stage, the decoder of these models generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed in parallel. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model, which serves as a regularization for implicitly exploiting bidirectional context and optionally allows explicit interaction of the bidirectional flows, while the final caption is chosen from either the L2R or R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. By combining seamlessly with word-level ensemble, the effect of the sentence-level ensemble is further enlarged. We further extend the conventional one-flow self-critical training to a two-flows version under this architecture and achieve new state-of-the-art results in comparison with non-vision-language-pretraining models. Finally, we verify the generality of this compact bidirectional architecture by extending it to an LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
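The sentence-level ensemble is the simplest piece to illustrate: after both flows decode a caption, the final output is whichever flow scores its own sentence higher. Length normalization below is an assumption for the sketch; the paper only specifies a sentence-level choice between the two flows.

```python
import numpy as np

def sentence_level_ensemble(l2r_tokens, l2r_logprobs, r2l_tokens, r2l_logprobs):
    """Choose the final caption from the L2R or R2L flow by
    length-normalized sentence log-probability (normalization assumed)."""
    l2r_score = np.sum(l2r_logprobs) / len(l2r_tokens)
    r2l_score = np.sum(r2l_logprobs) / len(r2l_tokens)
    if l2r_score >= r2l_score:
        return l2r_tokens
    return r2l_tokens[::-1]  # R2L captions are generated reversed

caption = sentence_level_ensemble(
    ["a", "dog", "runs"], np.log([0.9, 0.8, 0.7]),
    ["grass", "on", "runs", "dog", "a"], np.log([0.6, 0.5, 0.6, 0.7, 0.8]),
)
print(" ".join(caption))  # "a dog runs" (the L2R flow scores higher here)
```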
[185] One-stage Modality Distillation for Incomplete Multimodal Learning
Shicai Wei, Yang Luo, Chunbo Luo
Main category: cs.CV
TL;DR: A one-stage modality distillation framework unifies knowledge transfer and modality fusion via multi-task learning, improving inference with incomplete multimodal data.
Details
Motivation: Addressing the challenge of incomplete modality availability in development scenarios by leveraging multimodal data for training.
Method: Proposes a joint adaptation network for modality transfer and a cross translation network for modality fusion, using multi-task learning.
Result: Achieves state-of-the-art performance on RGB-D classification and segmentation tasks, overcoming incomplete modality issues.
Conclusion: The framework effectively handles incomplete modality inputs and enhances model inference through unified optimization.
Abstract: Learning based on multimodal data has attracted increasing interest recently. While a variety of sensory modalities can be collected for training, not all of them are always available in development scenarios, which raises the challenge of inference with incomplete modalities. To address this issue, this paper presents a one-stage modality distillation framework that unifies privileged knowledge transfer and modality information fusion into a single optimization procedure via multi-task learning. Compared with conventional modality distillation, which performs them independently, this helps to capture the valuable representation that can assist the final model inference directly. Specifically, we propose the joint adaptation network for the modality transfer task to preserve the privileged information. This addresses the representation heterogeneity caused by input discrepancy via joint distribution adaptation. Then, we introduce the cross translation network for the modality fusion task to aggregate the restored and available modality features. It leverages a parameters-sharing strategy to capture the cross-modal cues explicitly. Extensive experiments on RGB-D classification and segmentation tasks demonstrate the proposed multimodal inheritance framework can overcome the problem of incomplete modality input in various scenes and achieve state-of-the-art performance.
[186] Semantic segmentation of SEM images of lower bainitic and tempered martensitic steels
Xiaohan Bie, Manoj Arthanari, Evelin Barbosa de Melo, Baihua Ren, Juancheng Li, Stephen Yue, Salim Brahimi, Jun Song
Main category: cs.CV
TL;DR: Deep learning segments SEM images to analyze carbide precipitates in steels, revealing similar volume percentages but differing distributions and alignments between lower bainite and tempered martensite. The model achieves 98% accuracy.
Details
Motivation: To quantitatively analyze carbide precipitates in steels using deep learning for efficient and accurate microstructure analysis.
Method: Deep learning-based segmentation of SEM images to study carbide volume, size distribution, and orientation in lower bainite and tempered martensite.
Result: Lower bainite and tempered martensite have similar carbide volumes, but tempered martensite has a more uniform distribution. Carbides in lower bainite align better. The model achieves 98% accuracy.
Conclusion: Deep learning enables efficient, accurate analysis of microstructures, with potential applications for secondary phases in other materials.
Abstract: This study employs deep learning techniques to segment scanning electron microscope images, enabling a quantitative analysis of carbide precipitates in lower bainite and tempered martensite steels with comparable strength. Following segmentation, carbides are investigated, and their volume percentage, size distribution, and orientations are probed within the image dataset. Our findings reveal that lower bainite and tempered martensite exhibit comparable volume percentages of carbides, albeit with a more uniform distribution of carbides in tempered martensite. Carbides in lower bainite demonstrate a tendency for better alignment than those in tempered martensite, aligning with the observations of other researchers. However, both microstructures display a scattered carbide orientation, devoid of any discernible pattern. Comparative analysis of aspect ratios and sizes of carbides in lower bainite and tempered martensite unveils striking similarities. The deep learning model achieves an impressive pixelwise accuracy of 98.0% in classifying carbide/iron matrix at the individual pixel level. The semantic segmentation derived from deep learning extends its applicability to the analysis of secondary phases in various materials, offering a time-efficient, versatile AI-powered workflow for quantitative microstructure analysis.
[187] Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator
Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou
Main category: cs.CV
TL;DR: SOKE introduces a multilingual sign language model for text-to-sign generation, using a pretrained LM and a decoupled tokenizer. It improves efficiency with multi-head decoding and enhances precision via retrieval-augmented generation.
Details
Motivation: The lack of exploration in text-to-sign language generation (SLG) compared to sign-to-text translation motivates the development of SOKE.
Method: SOKE uses a pretrained LM and a decoupled tokenizer to discretize signs. It employs multi-head decoding for simultaneous token prediction and retrieval-enhanced generation for accuracy.
Result: SOKE demonstrates effectiveness in generating 3D sign avatars from text, with improved efficiency and precision.
Conclusion: SOKE advances SLG by combining pretrained LMs, multi-head decoding, and retrieval augmentation, offering a robust solution for text-to-sign generation.
Abstract: Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task, sign language generation (text-to-sign), remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE.
[188] VLM-CPL: Consensus Pseudo Labels from Vision-Language Models for Annotation-Free Pathological Image Classification
Lanfeng Zhong, Zongyao Huang, Yang Liu, Wenjun Liao, Shichuan Zhang, Guotai Wang, Shaoting Zhang
Main category: cs.CV
TL;DR: A novel method, VLM-CPL, uses pre-trained Vision-Language Models (VLMs) for cancer diagnosis without human annotation, employing noisy label filtering and semi-supervised learning to improve accuracy.
Details
Motivation: Deep learning for cancer diagnosis relies on labeled data, which requires extensive human annotation. This study aims to eliminate the need for human annotation by leveraging VLMs.
Method: VLM-CPL combines prompt-based and feature-based pseudo-labeling with consensus filtering and semi-supervised learning. It also introduces open-set prompting to enhance patch quality.
Result: Outperforms zero-shot VLM classification and existing noisy label learning methods on five pathological image datasets.
Conclusion: VLM-CPL effectively reduces reliance on human annotation while maintaining high classification accuracy, offering a scalable solution for cancer diagnosis.
Abstract: Classification of pathological images is the basis for automatic cancer diagnosis. Although deep learning methods have achieved remarkable performance, they heavily rely on labeled data, demanding extensive human annotation efforts. In this study, we present a novel human annotation-free method by leveraging pre-trained Vision-Language Models (VLMs). Without human annotation, pseudo-labels of the training set are obtained by utilizing the zero-shot inference capabilities of VLM, which may contain a lot of noise due to the domain gap between the pre-training and target datasets. To address this issue, we introduce VLM-CPL, a novel approach that contains two noisy label filtering techniques with a semi-supervised learning strategy. Specifically, we first obtain prompt-based pseudo-labels with uncertainty estimation by zero-shot inference with the VLM using multiple augmented views of an input. Then, by leveraging the feature representation ability of VLM, we obtain feature-based pseudo-labels via sample clustering in the feature space. Prompt-feature consensus is introduced to select reliable samples based on the consensus between the two types of pseudo-labels. We further propose High-confidence Cross Supervision to learn from samples with reliable pseudo-labels and the remaining unlabeled samples. Additionally, we present an innovative open-set prompting strategy that filters irrelevant patches from whole slides to enhance the quality of selected patches. Experimental results on five public pathological image datasets for patch-level and slide-level classification showed that our method substantially outperformed zero-shot classification by VLMs, and was superior to existing noisy label learning methods. The code is publicly available at https://github.com/HiLab-git/VLM-CPL.
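A minimal sketch of the prompt-feature consensus idea: keep only samples where the zero-shot (prompt-based) label agrees with a clustering-based label derived from VLM features. The cluster-to-class mapping by majority vote is an assumption about how the two label sets are made comparable.

```python
import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

def consensus_select(prompt_labels, features, n_classes):
    """Keep indices whose prompt-based and feature-based pseudo-labels agree.

    prompt_labels: (N,) zero-shot VLM predictions.
    features:      (N, D) VLM image embeddings, clustered into pseudo-labels.
    """
    cluster = KMeans(n_clusters=n_classes, n_init=10).fit_predict(features)
    feat_labels = np.empty_like(prompt_labels)
    for c in range(n_classes):  # name each cluster after its dominant prompt label
        members = cluster == c
        feat_labels[members] = np.bincount(prompt_labels[members]).argmax()
    return np.where(prompt_labels == feat_labels)[0]

rng = np.random.default_rng(1)
X, y_prompt = rng.normal(size=(200, 16)), rng.integers(0, 4, size=200)
print(len(consensus_select(y_prompt, X, n_classes=4)), "reliable samples kept")
```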
[189] AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang
Main category: cs.CV
TL;DR: A training-free adaptive inference method for multi-modal LLMs reduces computational demands while maintaining performance, achieving a 7-fold FLOPs reduction and outperforming state-of-the-art in long video understanding.
Details
Motivation: Multi-modal LLMs face high computational demands due to extensive visual tokens, limiting their use in resource-constrained environments and long-context tasks.
Method: Proposes iterative token merging and progressive token pruning based on multi-modal importance to reduce computation.
Result: Achieves a 7-fold FLOPs reduction and +4.6 improvement on MLVU benchmark for long video understanding.
Conclusion: The method offers insights into token redundancy and LLM layer behaviors, guiding future efficient multi-modal LLM designs.
Abstract: Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, at a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs. Our code is available at https://github.com/LaVi-Lab/AIM.
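A toy version of similarity-based token merging, in the spirit of the first stage: repeatedly average the most similar pair of visual tokens until a budget is met. The greedy one-pair-per-step rule is illustrative; AIM's actual merging schedule and its layer-wise importance pruning are not reproduced here.

```python
import numpy as np

def merge_tokens(tokens, keep):
    """Iteratively average the most cosine-similar token pair until only
    `keep` tokens remain (a simple stand-in for similarity-based merging)."""
    toks = tokens.astype(np.float64).copy()
    while len(toks) > keep:
        normed = toks / np.linalg.norm(toks, axis=1, keepdims=True)
        sims = normed @ normed.T
        np.fill_diagonal(sims, -np.inf)        # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sims), sims.shape)
        merged = (toks[i] + toks[j]) / 2       # fuse the closest pair
        toks = np.delete(toks, [i, j], axis=0)
        toks = np.vstack([toks, merged])
    return toks

visual_tokens = np.random.rand(196, 32)        # e.g. 14x14 patch tokens
print(merge_tokens(visual_tokens, keep=49).shape)  # (49, 32): 4x fewer tokens
```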
[190] Few-shot Online Anomaly Detection and Segmentation
Shenxing Wei, Xing Wei, Zhiheng Ma, Songlin Dong, Shaochen Zhang, Yihong Gong
Main category: cs.CV
TL;DR: The paper proposes a few-shot online anomaly detection and segmentation (FOADS) method using Neural Gas networks and multi-scale feature embedding to improve performance with limited training data.
Details
Motivation: Addressing the practical challenge of enhancing anomaly detection models post-deployment using unlabeled data, which contains both normal and abnormal samples.
Method: Uses Neural Gas networks to model normal image feature distribution and multi-scale CNN features for robust representation. Introduces an incremental update algorithm.
Result: Achieves strong performance on FOADS tasks with acceptable time complexity on MVTec AD and BTAD datasets.
Conclusion: The proposed method effectively tackles FOADS, leveraging unlabeled data and incremental learning for practical industrial applications.
Abstract: Detecting anomaly patterns from images is a crucial artificial intelligence technique in industrial applications. Recent research in this domain has emphasized the necessity of a large volume of training data, overlooking the practical scenario where, post-deployment of the model, unlabeled data containing both normal and abnormal samples can be utilized to enhance the model’s performance. Consequently, this paper focuses on addressing the challenging yet practical few-shot online anomaly detection and segmentation (FOADS) task. Under the FOADS framework, models are trained on a few-shot normal dataset, followed by inspection and improvement of their capabilities by leveraging unlabeled streaming data containing both normal and abnormal samples simultaneously. To tackle this issue, we propose modeling the feature distribution of normal images using a Neural Gas network, which offers the flexibility to adapt the topology structure to identify outliers in the data flow. In order to achieve improved performance with limited training samples, we employ multi-scale feature embedding extracted from a CNN pre-trained on ImageNet to obtain a robust representation. Furthermore, we introduce an algorithm that can incrementally update parameters without the need to store previous samples. Comprehensive experimental results demonstrate that our method can achieve substantial performance under the FOADS setting, while ensuring that the time complexity remains within an acceptable range on MVTec AD and BTAD datasets.
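The Neural Gas component lends itself to a short sketch: units adapt online to streaming normal features (no samples stored, matching the paper's incremental-update claim), and a test feature far from every unit is flagged as an outlier. The hyperparameters and scoring rule below are illustrative assumptions.

```python
import numpy as np

class NeuralGas:
    """Minimal online neural gas over normal-image features."""
    def __init__(self, n_units, dim, eps=0.1, lam=2.0, seed=0):
        self.w = np.random.default_rng(seed).normal(size=(n_units, dim))
        self.eps, self.lam = eps, lam

    def update(self, x):
        """Classic neural gas step: every unit moves toward the sample,
        with a step size decaying in its distance rank."""
        d = np.linalg.norm(self.w - x, axis=1)
        ranks = np.argsort(np.argsort(d))  # 0 = closest unit
        self.w += self.eps * np.exp(-ranks / self.lam)[:, None] * (x - self.w)

    def anomaly_score(self, x):
        """Distance to the nearest unit; large values suggest an outlier."""
        return float(np.linalg.norm(self.w - x, axis=1).min())

ng = NeuralGas(n_units=8, dim=16)
for feat in np.random.default_rng(1).normal(size=(200, 16)):  # "normal" stream
    ng.update(feat)
print(ng.anomaly_score(np.zeros(16)), ng.anomaly_score(np.full(16, 8.0)))
```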
[191] Probabilistic Directed Distance Fields for Ray-Based Shape Representations
Tristan Aumentado-Armstrong, Stavros Tsogkas, Sven Dickinson, Allan Jepson
Main category: cs.CV
TL;DR: The paper introduces Directed Distance Fields (DDFs), a neural shape representation combining geometric fidelity and efficient differentiable rendering, applied to tasks like 3D reconstruction and generative modeling.
Details
Motivation: Existing 3D shape representations either lack geometric fidelity (explicit) or have inefficient rendering (implicit). DDFs aim to bridge this gap.
Method: DDFs map oriented points to surface visibility and depth, enabling efficient rendering and geometric extraction. Probabilistic DDFs (PDDFs) handle field discontinuities.
Result: DDFs show strong performance in shape fitting, generative modeling, and 3D reconstruction. Theoretical analysis ensures view consistency.
Conclusion: DDFs offer a versatile, efficient, and high-fidelity representation for 3D shapes, with theoretical guarantees for consistency.
Abstract: In modern computer vision, the optimal representation of 3D shape continues to be task-dependent. One fundamental operation applied to such representations is differentiable rendering, as it enables inverse graphics approaches in learning frameworks. Standard explicit shape representations (voxels, point clouds, or meshes) are often easily rendered, but can suffer from limited geometric fidelity, among other issues. On the other hand, implicit representations (occupancy, distance, or radiance fields) preserve greater fidelity, but suffer from complex or inefficient rendering processes, limiting scalability. In this work, we devise Directed Distance Fields (DDFs), a novel neural shape representation that builds upon classical distance fields. The fundamental operation in a DDF maps an oriented point (position and direction) to surface visibility and depth. This enables efficient differentiable rendering, obtaining depth with a single forward pass per pixel, as well as differential geometric quantity extraction (e.g., surface normals), with only additional backward passes. Using probabilistic DDFs (PDDFs), we show how to model inherent discontinuities in the underlying field. We then apply DDFs to several applications, including single-shape fitting, generative modelling, and single-image 3D reconstruction, showcasing strong performance with simple architectural components via the versatility of our representation. Finally, since the dimensionality of DDFs permits view-dependent geometric artifacts, we conduct a theoretical investigation of the constraints necessary for view consistency. We find a small set of field properties that are sufficient to guarantee a DDF is consistent, without knowing, for instance, which shape the field is expressing.
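The fundamental DDF operation has a closed form for simple shapes, which makes it easy to illustrate: an oriented point maps to visibility and first-hit depth, so rendering a depth map needs just one evaluation per pixel. A sketch for a sphere (the learned, neural case replaces this analytic function with a network):

```python
import numpy as np

def sphere_ddf(position, direction, center=np.zeros(3), radius=1.0):
    """Analytic Directed Distance Field for a sphere: map an oriented point
    (position, unit direction) to (visibility, depth along the ray)."""
    oc = position - center
    b = np.dot(oc, direction)
    disc = b * b - (np.dot(oc, oc) - radius * radius)
    if disc < 0:
        return False, None            # ray misses the sphere entirely
    t = -b - np.sqrt(disc)            # nearest intersection
    if t < 0:
        t = -b + np.sqrt(disc)        # origin inside: take the exit point
    return (t >= 0), (t if t >= 0 else None)

# one forward evaluation per pixel is all depth rendering requires
vis, depth = sphere_ddf(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
print(vis, depth)  # True 2.0
```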
[192] Texture, Shape, Order, and Relation Matter: A New Transformer Design for Sequential DeepFake Detection
Yunfei Li, Yuezun Li, Baoyuan Wu, Junyu Dong, Guopu Zhu, Siwei Lyu
Main category: cs.CV
TL;DR: The paper introduces TSOM and TSOM++, novel Transformer-based methods for sequential DeepFake detection, focusing on Texture, Shape, Order, and Relation of manipulations to improve performance.
Details
Motivation: Existing methods for sequential DeepFake detection lack dedicated design, leading to limited performance. This paper aims to address this gap by proposing a specialized Transformer architecture.
Method: The TSOM method includes a texture-aware branch, Multi-source Cross-attention module, Shape-guided Gaussian mapping, and inverted prediction order. TSOM++ adds sequential contrastive learning to explore manipulation relations.
Result: Extensive experiments show TSOM and TSOM++ outperform state-of-the-art methods, demonstrating superior performance in detecting manipulation traces.
Conclusion: The proposed TSOM and TSOM++ methods significantly advance sequential DeepFake detection by incorporating dedicated designs for texture, shape, order, and relation of manipulations.
Abstract: Sequential DeepFake detection is an emerging task that predicts the manipulation sequence in order. Existing methods typically formulate it as an image-to-sequence problem, employing conventional Transformer architectures. However, these methods lack dedicated design and consequently result in limited performance. As such, this paper describes a new Transformer design, called TSOM, by exploring three perspectives: Texture, Shape, and Order of Manipulations. Our method features four major improvements: (1) we describe a new texture-aware branch that effectively captures subtle manipulation traces with a Diversiform Pixel Difference Attention module. (2) Then we introduce a Multi-source Cross-attention module to seek deep correlations among spatial and sequential features, enabling effective modeling of complex manipulation traces. (3) To further enhance the cross-attention, we describe a Shape-guided Gaussian mapping strategy, providing initial priors of the manipulation shape. (4) Finally, observing that the subsequent manipulation in a sequence may influence traces left in the preceding one, we intriguingly invert the prediction order from forward to backward, leading to notable gains as expected. Building upon TSOM, we introduce an extended method, TSOM++, which additionally explores the Relation of manipulations: (5) we propose a new sequential contrastive learning scheme to capture relationships between various manipulation types in a sequence, further enhancing the detection of manipulation traces. We conduct extensive experiments in comparison with several state-of-the-art methods, demonstrating the superiority of our method. The code has been released at https://github.com/OUC-VAS/TSOM.
[193] Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes
Prodromos Kolyvakis, Manos Kamarianakis, George Papagiannakis
Main category: cs.CV
TL;DR: Shenlong integrates LLMs with Conformal Geometric Algebra (CGA) for precise 3D scene editing, reducing manual effort and improving accuracy.
Details
Motivation: Traditional 3D scene editing is manual and lacks a formal language for precise edits, requiring large datasets or expertise.
Method: Uses CGA for spatial modeling and LLMs for natural language translation into CGA operations, enabling zero-shot learning.
Result: Shenlong reduces LLM response times by 16%, increases success rates by 9.6%, and achieves 100% success in common queries.
Conclusion: Shenlong democratizes 3D editing, enhancing accessibility for sectors like education and entertainment.
Abstract: This paper introduces a novel integration of Large Language Models (LLMs) with Conformal Geometric Algebra (CGA) to revolutionize controllable 3D scene editing, particularly for object repositioning tasks, which traditionally require intricate manual processes and specialized expertise. These conventional methods typically suffer from reliance on large training datasets or lack a formalized language for precise edits. Utilizing CGA as a robust formal language, our system, Shenlong, precisely models spatial transformations necessary for accurate object repositioning. Leveraging the zero-shot learning capabilities of pre-trained LLMs, Shenlong translates natural language instructions into CGA operations which are then applied to the scene, facilitating exact spatial transformations within 3D scenes without the need for specialized pre-training. Implemented in a realistic simulation environment, Shenlong ensures compatibility with existing graphics pipelines. To accurately assess the impact of CGA, we benchmark against robust Euclidean Space baselines, evaluating both latency and accuracy. Comparative performance evaluations indicate that Shenlong significantly reduces LLM response times by 16% and boosts success rates by 9.6% on average compared to the traditional methods. Notably, Shenlong achieves a 100% perfect success rate in common practical queries, a benchmark where other systems fall short. These advancements underscore Shenlong's potential to democratize 3D scene editing, enhancing accessibility and fostering innovation across sectors such as education, digital entertainment, and virtual reality.
[194] Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation
Guopeng Li, Qiang Wang, Ke Yan, Shouhong Ding, Yuan Gao, Gui-Song Xia
Main category: cs.CV
TL;DR: The paper introduces Cross-Architecture Knowledge Distillation (CAKD) to transfer knowledge between heterogeneous models, using an assistant model and spatial-agnostic InfoNCE loss for improved performance.
Details
Motivation: Current KD methods focus on similar architectures, limiting flexibility. CAKD aims to bridge feature gaps between heterogeneous models for better knowledge transfer.
Method: Proposes an assistant model combining convolution and attention modules, and uses spatial-agnostic InfoNCE loss for feature alignment.
Result: Achieves state-of-the-art performance, with gains of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K.
Conclusion: CAKD effectively transfers knowledge across diverse architectures, outperforming conventional KD methods.
Abstract: Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving the feature alignments in CAKD. Our proposed method is evaluated across a range of homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released.
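The spatial-agnostic InfoNCE idea can be sketched in a few lines: smooth each feature map over space, then contrast student and teacher samples within the batch. Mean pooling as the "smoothing" and the temperature value are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def spatial_agnostic_info_nce(student_feats, teacher_feats, tau=0.1):
    """InfoNCE between spatially smoothed feature maps.

    *_feats: (B, C, H, W) maps from the student and (projected) teacher;
    diagonal pairs in the batch are positives, off-diagonal are negatives."""
    s = student_feats.mean(axis=(2, 3))           # (B, C): spatial smoothing
    t = teacher_feats.mean(axis=(2, 3))
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    logits = s @ t.T / tau                        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())      # cross-entropy on the diagonal

B, C, H, W = 8, 32, 7, 7
s, t = np.random.rand(B, C, H, W), np.random.rand(B, C, H, W)
print(spatial_agnostic_info_nce(s, t))
```

Because both maps are pooled before the contrast, mismatched spatial layouts between, say, a CNN student and a ViT teacher no longer matter, which is the point of making the loss spatial-agnostic.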
[195] Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Main category: cs.CV
TL;DR: RAIF improves LLMs’ ability to follow complex instructions by incentivizing reasoning and using reinforcement learning, outperforming vanilla CoT methods.
Details
Motivation: Existing LLMs struggle with complex instructions due to shallow reasoning in methods like chain-of-thought (CoT).
Method: RAIF decomposes instructions, uses RL with verifiable rewards, and employs contrastive learning and expert behavior cloning.
Result: A 1.5B LLM with RAIF achieves 11.74% gains, matching an 8B LLM’s performance, and shows generalizability.
Conclusion: RAIF effectively enhances LLMs’ reasoning for complex instructions, validated by benchmarks and OOD tests.
Abstract: Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in parallel, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve the capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Code and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
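The "verifiable rule-centric reward signals" mentioned above are typically rewards that can be checked programmatically rather than by a learned judge. A minimal sketch, with entirely hypothetical rule types, might look like this:

```python
import re

def verifiable_reward(response: str, constraints: list[dict]) -> float:
    """Rule-centric reward: fraction of programmatically checkable
    constraints satisfied by the response (illustrative rules only)."""
    checks = {
        "max_words":    lambda r, v: len(r.split()) <= v,
        "must_include": lambda r, v: v.lower() in r.lower(),
        "ends_with":    lambda r, v: r.rstrip().endswith(v),
        "no_digits":    lambda r, v: not re.search(r"\d", r),
    }
    passed = sum(checks[c["type"]](response, c.get("value")) for c in constraints)
    return passed / len(constraints)

reward = verifiable_reward(
    "The answer is forty-two.",
    [{"type": "max_words", "value": 10},
     {"type": "must_include", "value": "answer"},
     {"type": "no_digits", "value": None}],
)
print(reward)  # 1.0 -> all three rules satisfied
```

Each rule is deterministic and auditable, which is what makes the reward "verifiable" in the RLVR sense.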
[196] SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Shuangrui Ding, Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Yuwei Guo, Dahua Lin, Jiaqi Wang
Main category: cs.CV
TL;DR: SAM2Long improves SAM 2 for video object segmentation by addressing error accumulation with a training-free, tree-search strategy, achieving better performance on long-term videos.
Details
Motivation: SAM 2’s greedy memory design causes error accumulation in long-term videos, limiting performance. SAM2Long aims to solve this by considering segmentation uncertainty and optimizing pathways.
Method: SAM2Long uses a constrained tree search to maintain multiple segmentation pathways, selecting optimal branches per frame based on cumulative scores.
Result: SAM2Long improves performance by 3.0 points on average, with gains up to 5.3 points on benchmarks like SA-V and LVOS.
Conclusion: SAM2Long effectively addresses SAM 2’s limitations, offering robust segmentation for complex long-term videos.
Abstract: The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object-aware memories from previous frames for current frame prediction. However, its greedy-selection memory design suffers from the “error accumulation” problem, where an erroneous or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 on complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects for complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
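The constrained tree search described above is essentially a beam search over segmentation pathways: keep a fixed number of branches, expand candidate masks per frame, and retain the top scorers by cumulative score. Below is a minimal sketch under that reading; `propose_masks` and its scores are placeholders for SAM 2's mask proposals and memory state, not the released implementation.

```python
import heapq
import random

def tree_search_pathways(frames, propose_masks, num_pathways=3):
    """Maintain a fixed number of segmentation pathways and select branches
    by cumulative score (minimal beam-search sketch).

    propose_masks(frame, pathway_state) -> list of (mask, score, new_state).
    """
    # Each pathway: (cumulative_score, mask_sequence, state)
    pathways = [(0.0, [], None)]
    for frame in frames:
        candidates = []
        for cum_score, masks, state in pathways:
            for mask, score, new_state in propose_masks(frame, state):
                candidates.append((cum_score + score, masks + [mask], new_state))
        # Retain the same fixed number of highest-scoring branches.
        pathways = heapq.nlargest(num_pathways, candidates, key=lambda p: p[0])
    # Final result: the pathway with the highest cumulative score.
    return max(pathways, key=lambda p: p[0])[1]

# Toy proposer: two candidate "masks" per frame with random confidences.
def toy_proposer(frame, state):
    return [(f"mask{i}@{frame}", random.random(), state) for i in range(2)]

best = tree_search_pathways(range(5), toy_proposer)
```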
[197] FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang
Main category: cs.CV
TL;DR: FlagEvalMM is an open-source framework for evaluating multimodal models on vision-language tasks, offering flexibility, efficiency, and seamless integration.
Details
Motivation: To provide a comprehensive and efficient tool for assessing multimodal models across diverse tasks, aiding research advancements.
Method: Decouples model inference from evaluation, uses advanced tools (e.g., vLLM, SGLang) for acceleration, and employs asynchronous data loading.
Result: Accurate and efficient evaluation of model strengths and limitations, enhancing multimodal research.
Conclusion: FlagEvalMM is a valuable, publicly accessible tool for advancing multimodal model evaluation.
Abstract: We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.
[198] Language Driven Occupancy Prediction
Zhu Yu, Bowen Pang, Lizhe Liu, Runmin Zhang, Qiang Li, Si-Yuan Cao, Maochun Luo, Mingxia Chen, Sheng Yang, Hui-Liang Shen
Main category: cs.CV
TL;DR: LOcc is a framework for open-vocabulary occupancy prediction, using semantic transitive labeling to generate accurate 3D language ground truth, outperforming existing methods.
Details
Motivation: Previous methods rely on coarse or noisy voxel-to-text correspondences, leading to inaccurate supervision. LOcc aims to improve this by leveraging dense, fine-grained semantic labels.
Method: LOcc employs a semantic transitive labeling pipeline to transfer text labels from images to LiDAR point clouds and voxels, replacing prediction heads with a geometry head and a language head.
Result: LOcc produces more accurate pseudo-labeled ground truth, reducing human annotation effort, and outperforms state-of-the-art zero-shot occupancy prediction methods on the Occ3D-nuScenes dataset.
Conclusion: LOcc provides a generalizable and effective solution for open-vocabulary occupancy prediction, enhancing accuracy and reducing reliance on manual annotations.
Abstract: We introduce LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our transitive semantic labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset.
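The semantic transitive labeling pipeline transfers labels image → LiDAR points → voxels. A minimal sketch of that transfer chain follows; the `project` and `voxelize` interfaces are assumptions made for illustration, and the majority vote is one plausible aggregation rule rather than the paper's exact one.

```python
from collections import Counter

def transitive_labels(pixel_labels, lidar_points, project, voxelize):
    """Carry text labels from image pixels to LiDAR points, then to voxels.

    project(point) -> (u, v, visible): pixel coordinates of a 3D point.
    voxelize(point) -> hashable voxel index.
    """
    # Step 1: image -> points. Each visible point inherits its pixel's label.
    point_labels = {}
    for idx, point in enumerate(lidar_points):
        u, v, visible = project(point)
        if visible:
            point_labels[idx] = pixel_labels[v][u]

    # Step 2: points -> voxels. Each voxel takes the majority text label
    # among the labeled points it contains.
    voxel_votes = {}
    for idx, label in point_labels.items():
        voxel_votes.setdefault(voxelize(lidar_points[idx]), []).append(label)
    return {vox: Counter(votes).most_common(1)[0][0]
            for vox, votes in voxel_votes.items()}
```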
[199] Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions
Nicolai Hermann, Jorge Condor, Piotr Didyk
Main category: cs.CV
TL;DR: The paper introduces Puzzle Similarity, a cross-reference metric for localizing artifacts in novel 3D views, validated by a human-labeled dataset.
Details
Motivation: Current no-reference metrics fail to reliably assess novel view quality in 3D reconstructions, limiting post-processing and adoption.
Method: Uses image patch statistics from training views to create a scene-specific distribution for artifact detection.
Result: Achieves state-of-the-art artifact localization, correlating with human assessment.
Conclusion: Puzzle Similarity enhances applications like image restoration and sparse 3D reconstruction.
Abstract: Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of no-reference image metrics in predicting reliable artifact maps. The absence of such metrics hinders assessment of the quality of novel views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. To tackle this, recent work has established a new category of metrics (cross-reference), predicting image quality solely by leveraging context from alternate viewpoint captures (arXiv:2404.14409). In this work, we propose a new cross-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the training views to establish a scene-specific distribution, later used to identify poorly reconstructed regions in the novel views. Given the lack of good measures to evaluate cross-reference methods in the context of 3D reconstruction, we collected a novel human-labeled dataset of artifact and distortion maps in unseen reconstructed views. Through this dataset, we demonstrate that our method achieves state-of-the-art localization of artifacts in novel views, correlating with human assessment, even without aligned references. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs. Find the project page at https://nihermann.github.io/puzzlesim/ .
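One simple way to realize "image patch statistics from the training views" is to fit a per-scene Gaussian over flattened patches and score novel-view patches by Mahalanobis distance. The sketch below follows that reading; the actual metric's patch representation and scoring almost certainly differ.

```python
import numpy as np

def fit_patch_statistics(train_views, patch=8):
    """Fit a per-scene Gaussian over flattened patches from training views.

    train_views: list of (H, W) grayscale images with values in [0, 1].
    """
    patches = []
    for img in train_views:
        H, W = img.shape
        for y in range(0, H - patch + 1, patch):
            for x in range(0, W - patch + 1, patch):
                patches.append(img[y:y + patch, x:x + patch].ravel())
    P = np.stack(patches)
    # Small ridge term keeps the covariance invertible.
    return P.mean(0), np.cov(P, rowvar=False) + 1e-4 * np.eye(P.shape[1])

def artifact_score(novel_view, mean, cov, patch=8):
    """Mahalanobis distance per patch: high = unlikely under the scene
    distribution, i.e. a candidate artifact region."""
    inv = np.linalg.inv(cov)
    H, W = novel_view.shape
    score = np.zeros((H // patch, W // patch))
    for i, y in enumerate(range(0, H - patch + 1, patch)):
        for j, x in enumerate(range(0, W - patch + 1, patch)):
            d = novel_view[y:y + patch, x:x + patch].ravel() - mean
            score[i, j] = d @ inv @ d
    return score
```

High-scoring patches are unlikely under the scene's own appearance distribution and so are flagged as candidate artifacts, without any aligned reference image.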
[200] DIVE: Taming DINO for Subject-Driven Video Editing
Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, Shifeng Chen
Main category: cs.CV
TL;DR: DIVE is a framework for video editing using DINOv2 features to ensure temporal consistency and subject identity alignment.
Details
Motivation: Addressing challenges in maintaining temporal consistency and motion alignment in video editing.
Method: Uses DINOv2 features for motion alignment and integrates reference images via LoRAs for subject identity.
Result: Achieves high-quality editing with robust motion consistency in diverse real-world videos.
Conclusion: DIVE demonstrates DINO’s potential in advancing video editing techniques.
Abstract: Building on the success of diffusion models in image generation and editing, video editing has recently gained substantial attention. However, maintaining temporal consistency and motion alignment still remains challenging. To address these issues, this paper proposes DINO-guided Video Editing (DIVE), a framework designed to facilitate subject-driven editing in source videos conditioned on either target text prompts or reference images with specific identities. The core of DIVE lies in leveraging the powerful semantic features extracted from a pretrained DINOv2 model as implicit correspondences to guide the editing process. Specifically, to ensure temporal motion consistency, DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model to learn Low-Rank Adaptations (LoRAs), effectively registering the target subject’s identity. Extensive experiments on diverse real-world videos demonstrate that our framework can achieve high-quality editing results with robust motion consistency, highlighting the potential of DINO to contribute to video editing. Project page: https://dino-video-editing.github.io
[201] C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
Xiuwei Chen, Wentao Hu, Hanhui Li, Jun Zhou, Zisheng Chen, Meng Cao, Yihan Zeng, Kui Zhang, Yu-Jie Yuan, Jianhua Han, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: C2-Evo is a self-improving framework for MLLMs that jointly evolves training data and model capabilities, addressing discrepancies in data complexity and mismatched difficulty levels.
Details
Motivation: Existing MLLMs face challenges with high-quality vision-language datasets and self-improving models, leading to data complexity discrepancies and mismatched task difficulty.
Method: C2-Evo uses cross-modal data evolution and data-model evolution loops to generate complex multimodal problems and adaptively select tasks for supervised fine-tuning and reinforcement learning.
Result: The method achieves significant performance gains across multiple mathematical reasoning benchmarks.
Conclusion: C2-Evo effectively refines models and training data, offering a scalable solution for enhancing MLLMs.
Abstract: Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
[202] UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts
Zhen Wan, Chenyang Qi, Zhiheng Liu, Tao Gui, Yue Ma
Main category: cs.CV
TL;DR: UniPaint is a unified framework for space-time video inpainting and interpolation, using a Mixture of Experts attention and spatial-temporal masking to enhance performance.
Details
Motivation: Existing methods treat video inpainting and interpolation as separate tasks, but UniPaint aims to unify them for mutual performance enhancement.
Method: Introduces a plug-and-play adapter and Mixture of Experts attention, with spatial-temporal masking during training.
Result: Produces high-quality results, achieving top quantitative performance across tasks and scales.
Conclusion: UniPaint demonstrates that unified inpainting and interpolation can improve synthesis performance, with code available for public use.
Abstract: In this paper, we present UniPaint, a unified generative space-time video inpainting framework that enables spatial-temporal inpainting and interpolation. Different from existing methods that treat video inpainting and video interpolation as two distinct tasks, we leverage a unified inpainting framework to tackle them and observe that these two tasks can mutually enhance synthesis performance. Specifically, we first introduce a plug-and-play space-time video inpainting adapter, which can be employed in various personalized models. The key insight is to propose a Mixture of Experts (MoE) attention to cover various tasks. Then, we design a spatial-temporal masking strategy during the training stage to mutually enhance each other and improve performance. UniPaint produces high-quality and aesthetically pleasing results, achieving the best quantitative performance across various tasks and scale setups. The code and checkpoints are available at https://github.com/mmmmm-w/UniPaint.
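A Mixture of Experts attention that "covers various tasks" can be sketched as a learned gate blending per-task attention experts. The layout below (two experts, a mean-pooled gating signal) is a generic MoE-attention pattern assumed for illustration, not UniPaint's actual architecture.

```python
import torch
import torch.nn as nn

class MoEAttention(nn.Module):
    """Minimal mixture-of-experts attention: a learned gate blends the
    outputs of per-task attention experts (e.g. inpainting vs. interpolation).
    Expert count and gating signal are illustrative assumptions."""

    def __init__(self, dim, num_experts=2, num_heads=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                 # x: (B, N, dim)
        weights = self.gate(x.mean(dim=1)).softmax(-1)    # (B, E) gate per sample
        outs = torch.stack([att(x, x, x)[0] for att in self.experts], dim=1)
        return (weights[:, :, None, None] * outs).sum(dim=1)

y = MoEAttention(64)(torch.randn(2, 16, 64))  # -> (2, 16, 64)
```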
[203] Back Home: A Computer Vision Solution to Seashell Identification for Ecological Restoration
Alexander Valverde, Luis Solano, André Montoya
Main category: cs.CV
TL;DR: A system called BackHome19K uses a large annotated image dataset and lightweight pipeline to identify the coastal origin of seized seashells in Costa Rica, aiding their return to native ecosystems.
Details
Motivation: Illegal souvenir collection removes seashells from Costa Rican beaches, but their origin cannot be verified, preventing repatriation.
Method: BackHome19K, a large-scale image corpus with coast-level labels, and a real-time lightweight pipeline with an anomaly filter for robustness.
Result: 86.3% balanced accuracy on test set; 93% rejection of out-of-domain objects. Processed 70,000 shells in under 3 seconds per image.
Conclusion: The system successfully enables repatriation of confiscated seashells to their native ecosystems.
Abstract: Illegal souvenir collection strips an estimated five tonnes of seashells from Costa Rica’s beaches each year. Yet, once these specimens are seized, their coastal origin – Pacific or Caribbean – cannot be verified easily due to the lack of information, preventing their return when confiscated by local authorities. To solve this issue, we introduce BackHome19K, the first large-scale image corpus (19,058 photographs, 516 species) annotated with coast-level labels, and propose a lightweight pipeline that infers provenance in real time on a mobile-grade CPU. A trained anomaly filter pre-screens uploads, increasing robustness to user-generated noise. On a held-out test set, the classifier attains 86.3% balanced accuracy, while the filter rejects 93% of 180 out-of-domain objects with zero false negatives. Deployed as a web application, the system has already processed 70,000 shells for wildlife officers in under three seconds per image, enabling confiscated specimens to be safely repatriated to their native ecosystems. The dataset is available at https://huggingface.co/datasets/FIFCO/BackHome19K
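The deployed system is a two-stage pipeline: an anomaly filter pre-screens uploads, and only images that pass are classified by coast. A minimal sketch under assumed model interfaces (the released system's API is not shown in the abstract):

```python
def classify_shell(image, anomaly_filter, coast_classifier, threshold=0.5):
    """Two-stage pipeline sketch: reject out-of-domain uploads first,
    then assign a coast label. Model interfaces are assumptions."""
    if anomaly_filter(image) > threshold:   # high score = not a seashell
        return {"status": "rejected", "reason": "out-of-domain object"}
    probs = coast_classifier(image)         # e.g. {"pacific": p, "caribbean": q}
    coast = max(probs, key=probs.get)
    return {"status": "ok", "coast": coast, "confidence": probs[coast]}

# Toy usage with stub models standing in for the trained networks.
verdict = classify_shell(
    image=None,
    anomaly_filter=lambda img: 0.1,
    coast_classifier=lambda img: {"pacific": 0.86, "caribbean": 0.14},
)
print(verdict)  # {'status': 'ok', 'coast': 'pacific', 'confidence': 0.86}
```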
[204] ZeroStereo: Zero-shot Stereo Matching from Single Images
Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Main category: cs.CV
TL;DR: ZeroStereo proposes a novel stereo image generation pipeline for zero-shot stereo matching, using monocular depth estimation and diffusion inpainting to improve generalization without requiring annotated real-world data.
Details
Motivation: The scarcity of annotated real-world stereo data limits the generalization of supervised stereo matching methods.
Method: Leverages pseudo disparities from monocular depth estimation, fine-tunes a diffusion inpainting model for occluded regions, and introduces Training-Free Confidence Generation and Adaptive Disparity Selection.
Result: Achieves state-of-the-art zero-shot generalization across datasets with minimal training data.
Conclusion: ZeroStereo effectively addresses the generalization challenge in stereo matching without relying on extensive annotated data.
Abstract: State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow. Code: https://github.com/Windsrain/ZeroStereo.
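The core generation step, synthesizing a right view from a single image using pseudo disparities, can be sketched as forward warping where unfilled pixels become occlusion holes for the inpainting model. The nearest-pixel splatting below is a simplified stand-in for the paper's pipeline:

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """Forward-warp a left image with pseudo disparities to synthesize a
    right view; unfilled pixels (holes) mark occlusions to be inpainted.

    left: (H, W) or (H, W, 3) image; disparity: (H, W) pseudo disparities.
    """
    H, W = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    depth_buf = np.full((H, W), -np.inf)  # keep the largest-disparity (nearest) pixel
    ys, xs = np.mgrid[0:H, 0:W]
    xr = (xs - np.round(disparity)).astype(int)  # target x in the right view
    valid = (xr >= 0) & (xr < W)
    for y, x, xt in zip(ys[valid], xs[valid], xr[valid]):
        if disparity[y, x] > depth_buf[y, xt]:
            right[y, xt] = left[y, x]
            depth_buf[y, xt] = disparity[y, x]
            filled[y, xt] = True
    holes = ~filled  # occluded regions: candidates for diffusion inpainting
    return right, holes
```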
[205] Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning
Liam Chalcroft, Jenny Crinion, Cathy J. Price, John Ashburner
Main category: cs.CV
TL;DR: A self-supervised framework for 3D MRI leverages qMRI to simulate multiple contrasts, learning anatomy-centric features. It outperforms baselines in segmentation and denoising tasks, especially in low-data settings, and generalizes to unseen sites.
Details
Motivation: Self-supervised learning (SSL) is effective for 2D images but challenging for 3D MRI due to data scarcity and lack of volumetric context. The goal is to develop a method that learns robust, sequence-invariant features for 3D MRI.
Method: The framework simulates multiple MRI contrasts from a single 3D qMRI scan and enforces consistent representations across these contrasts. This ensures anatomy-centric rather than sequence-specific features.
Result: The method achieves significant improvements in brain segmentation (up to +8.3% Dice), stroke lesion segmentation, and MRI denoising (+4.2 dB PSNR), especially in low-data settings. It also generalizes to unseen sites.
Conclusion: The proposed self-supervised framework effectively learns 3D MRI features, outperforming baselines and supporting scalable clinical use. Code and models are publicly available.
Abstract: Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a \emph{sequence-invariant} self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. The result is a single 3D encoder that excels across tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3% Dice, +4.2 dB PSNR). It also generalises to unseen sites, supporting scalable clinical use. Code and trained models are publicly available at https://github.com/liamchalcroft/contrast-squared
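The key training signal, consistent representations across contrasts simulated from one qMRI scan, can be sketched as a view-consistency loss. The encoder and contrast simulator below are placeholders standing in for the released models:

```python
import torch
import torch.nn.functional as F

def sequence_invariance_loss(encoder, qmri_volume, simulate_contrast, n_views=2):
    """Pull embeddings of different simulated contrasts of the same anatomy
    together (sketch; simulator and encoder are hypothetical interfaces).

    simulate_contrast(volume) -> one randomly parameterized contrast volume.
    encoder(volume) -> (B, D) embedding.
    """
    zs = [F.normalize(encoder(simulate_contrast(qmri_volume)), dim=-1)
          for _ in range(n_views)]
    # Negative cosine similarity between every pair of views.
    loss = 0.0
    for i in range(n_views):
        for j in range(i + 1, n_views):
            loss = loss - (zs[i] * zs[j]).sum(-1).mean()
    return loss / (n_views * (n_views - 1) / 2)
```

Because the positive pairs differ only in contrast, not anatomy, minimizing this loss drives the encoder toward anatomy-centric, sequence-invariant features.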
[206] Motion Diffusion Autoencoders: Enabling Attribute Manipulation in Human Motion Demonstrated on Karate Techniques
Anthony Richardson, Felix Putze
Main category: cs.CV
TL;DR: First successful method for manipulating attributes of human motion data, focusing on karate movements, using a novel pose representation and transformer-diffusion models.
Details
Motivation: To enable attribute manipulation in human motion data, particularly karate movements, by disentangling skeleton and motion trajectory.
Method: Uses a continuous, rotation-based pose representation, a transformer encoder for semantics, and a diffusion model for stochastic variations.
Result: Semantically meaningful and linear embedding space allows attribute manipulation by moving along discovered linear directions.
Conclusion: Successful attribute manipulation in human motion, with publicly available code and data.
Abstract: Attribute manipulation deals with the problem of changing individual attributes of a data point or a time series, while leaving all other aspects unaffected. This work focuses on the domain of human motion, more precisely karate movement patterns. To the best of our knowledge, it presents the first success at manipulating attributes of human motion data. One of the key requirements for achieving attribute manipulation on human motion is a suitable pose representation. Therefore, we design a novel continuous, rotation-based pose representation that enables the disentanglement of the human skeleton and the motion trajectory, while still allowing an accurate reconstruction of the original anatomy. The core idea of the manipulation approach is to use a transformer encoder for discovering high-level semantics, and a diffusion probabilistic model for modeling the remaining stochastic variations. We show that the embedding space obtained from the transformer encoder is semantically meaningful and linear. This enables the manipulation of high-level attributes, by discovering their linear direction of change in the semantic embedding space and moving the embedding along said direction. All code and data are made publicly available.
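Because the embedding space is reported to be semantically meaningful and linear, attribute manipulation reduces to moving an embedding along a discovered direction. A minimal sketch, using the common difference-of-class-means estimator (an illustrative choice, not necessarily the paper's):

```python
import numpy as np

def attribute_direction(embeddings, labels):
    """Estimate a linear attribute direction in the semantic embedding space
    as the difference of class means (simple, illustrative estimator)."""
    embeddings, labels = np.asarray(embeddings), np.asarray(labels)
    d = embeddings[labels == 1].mean(0) - embeddings[labels == 0].mean(0)
    return d / np.linalg.norm(d)

def manipulate(z, direction, strength):
    """Move an embedding along the attribute direction; a decoder
    (transformer/diffusion) would then reconstruct the edited motion."""
    return z + strength * direction
```

For example, `manipulate(z, d, 1.5)` shifts a motion embedding along direction `d` before decoding, strengthening that attribute while leaving others untouched.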
[207] MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention
Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Main category: cs.CV
TL;DR: MedViTV2 introduces KAN layers into transformers for robust medical image classification, enhancing efficiency and accuracy despite image corruptions.
Details
Motivation: Addressing the challenge of classifying corrupted medical images from multi-center studies, which existing methods struggle with.
Method: Incorporates KAN layers, proposes DiNA for global context, and uses a hierarchical hybrid strategy for local-global feature balance.
Result: Achieves state-of-the-art results in 27/29 experiments, with 44% efficiency gain and accuracy improvements up to 13.4%.
Conclusion: MedViTV2 outperforms predecessors in efficiency and accuracy, proving effective for real-world medical image classification.
Abstract: Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6% on MedMNIST, 5.8% on NonMNIST, and 13.4% on the MedMNIST-C benchmark.
[208] Category-level Meta-learned NeRF Priors for Efficient Object Mapping
Saad Ejaz, Hriday Bavle, Laura Ribeiro, Holger Voos, Jose Luis Sanchez-Lopez
Main category: cs.CV
TL;DR: PRENOM integrates category-level priors with NeRFs for efficient 3D object mapping, improving reconstruction quality and pose estimation while reducing computational cost.
Details
Motivation: Existing methods like DeepSDF struggle with sharp geometry and computational expense, while NeRFs lack integration with category-level priors for real-time multi-object mapping.
Method: PRENOM meta-learns on synthetic tasks, uses a genetic algorithm to optimize NeRF per category, and employs probabilistic ray sampling for efficiency.
Result: PRENOM achieves 21% lower Chamfer distance than prior-free NeRFs and 13% better metrics on real-world data, with 5x faster training.
Conclusion: PRENOM effectively combines priors and NeRFs for high-quality, efficient 3D object mapping.
Abstract: In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop, etc.). DeepSDF has been used predominantly as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical object pose estimation. PRENOM gets to know objects on a first-name basis by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results highlight the ability of PRENOM to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21% lower Chamfer distance. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13% improvement averaged across all reconstruction metrics, and comparable pose and size estimation accuracy, while being trained for 5$\times$ less time. Code available at: https://github.com/snt-arg/PRENOM
[209] YOLO-PRO: Enhancing Instance-Specific Object Detection with Full-Channel Global Self-Attention
Lin Huang, Yujuan Tan, Weisheng Li, Shitai Shan, Liu Liu, Linlin Shen, Jing Yu, Yue Niu
Main category: cs.CV
TL;DR: The paper proposes two modules, ISB and ISADH, to overcome limitations in object detection frameworks, achieving state-of-the-art performance on MS-COCO.
Details
Motivation: Address limitations of bottleneck structures and decoupled heads in object detection, such as diminished discriminability and computational redundancy.
Method: Introduces ISB for full-channel global self-attention and ISADH for asymmetric decoupled head architecture, integrating batch-instance features.
Result: YOLO-PRO outperforms YOLOv8 and YOLO11 in AP on MS-COCO benchmarks while maintaining efficiency.
Conclusion: The work offers insights for high-precision detectors suitable for edge devices.
Abstract: This paper addresses the inherent limitations of conventional bottleneck structures (diminished instance discriminability due to overemphasis on batch statistics) and decoupled heads (computational redundancy) in object detection frameworks by proposing two novel modules: the Instance-Specific Bottleneck with full-channel global self-attention (ISB) and the Instance-Specific Asymmetric Decoupled Head (ISADH). The ISB module innovatively reconstructs feature maps to establish an efficient full-channel global attention mechanism through synergistic fusion of batch-statistical and instance-specific features. Complementing this, the ISADH module pioneers an asymmetric decoupled architecture enabling hierarchical multi-dimensional feature integration via dual-stream batch-instance representation fusion. Extensive experiments on the MS-COCO benchmark demonstrate that the coordinated deployment of ISB and ISADH in the YOLO-PRO framework achieves state-of-the-art performance across all computational scales. Specifically, YOLO-PRO surpasses YOLOv8 by 1.0-1.6% AP (N/S/M/L/X scales) and outperforms YOLO11 by 0.1-0.5% AP in critical N/M/L/X groups, while maintaining competitive computational efficiency. This work provides practical insights for developing high-precision detectors deployable on edge devices.
[210] Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction
Somayeh Hussaini, Tobias Fischer, Michael Milford
Main category: cs.CV
TL;DR: A supervised learning approach predicts sequence matching receptiveness (SMR) in VPR, improving performance across techniques and datasets by selectively trusting sequence matching outputs.
Details
Motivation: Current filtering and sequence-based matching in VPR can unpredictably degrade performance; this work aims to mitigate this by learning when to trust sequence matching.
Method: A new supervised learning model predicts SMR per-frame, agnostic to the underlying VPR technique, and selectively applies sequence matching.
Result: Significant performance improvement across seven VPR techniques and three datasets, with insights into replacing discarded matches and ablation studies.
Conclusion: The SMR predictor enhances VPR robustness and performance, offering a flexible solution adaptable to various techniques and conditions.
Abstract: In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. Our approach is agnostic to the underlying VPR technique and effectively predicts SMR, and hence significantly improves VPR performance across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, and present ablation studies including an analysis of the interactions between our SMR predictor and the selected sequence length.
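Operationally, the SMR predictor acts as a per-frame gate on the sequence matcher. A minimal sketch with assumed interfaces for the matchers and predictor:

```python
def place_match(frame_descriptor, single_frame_matcher, sequence_matcher,
                smr_predictor, threshold=0.5):
    """Per-frame gating sketch: trust the sequence matcher only when the
    predicted sequence-matching receptiveness (SMR) is high enough.
    The matcher/predictor interfaces are assumptions for illustration."""
    smr = smr_predictor(frame_descriptor)            # predicted SMR in [0, 1]
    if smr >= threshold:
        return sequence_matcher(frame_descriptor)    # temporal integration helps here
    return single_frame_matcher(frame_descriptor)    # fall back to the raw match
```

Because the gate only consumes descriptors and scores, it stays agnostic to whichever underlying VPR technique produced them, mirroring the claim above.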
[211] A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects
Fei Wang, Tingting Zhang, Wei Xi, Han Ding, Ge Wang, Di Zhang, Yuanhao Cui, Fan Liu, Jinsong Han, Jie Xu, Tony Xiao Han
Main category: cs.CV
TL;DR: A survey reviewing over 200 papers on Wi-Fi sensing, focusing on techniques to improve generalization across users, devices, and environments, and introducing a dataset platform for community collaboration.
Details
Motivation: Wi-Fi sensing’s performance degrades due to domain shifts, prompting the need for robust generalization techniques.
Method: Categorizes papers by the Wi-Fi sensing pipeline (setup, preprocessing, feature learning, deployment) and analyzes key techniques like domain adaptation and federated learning.
Result: Summarizes datasets and trends, introducing the Sensing Dataset Platform (SDP) for sharing resources.
Conclusion: Aims to guide researchers in enhancing Wi-Fi sensing generalizability and foster collaboration via SDP.
Abstract: Wi-Fi sensing has emerged as a powerful non-intrusive technology for recognizing human activities, monitoring vital signs, and enabling context-aware applications using commercial wireless devices. However, the performance of Wi-Fi sensing often degrades when applied to new users, devices, or environments due to significant domain shifts. To address this challenge, researchers have proposed a wide range of generalization techniques aimed at enhancing the robustness and adaptability of Wi-Fi sensing systems. In this survey, we provide a comprehensive and structured review of over 200 papers published since 2015, categorizing them according to the Wi-Fi sensing pipeline: experimental setup, signal preprocessing, feature learning, and model deployment. We analyze key techniques, including signal preprocessing, domain adaptation, meta-learning, metric learning, data augmentation, cross-modal alignment, federated learning, and continual learning. Furthermore, we summarize publicly available datasets across various tasks, such as activity recognition, user identification, indoor localization, and pose estimation, and provide insights into their domain diversity. We also discuss emerging trends and future directions, including large-scale pretraining, integration with multimodal foundation models, and continual deployment. To foster community collaboration, we introduce the Sensing Dataset Platform (SDP) for sharing datasets and models. This survey aims to serve as a valuable reference and practical guide for researchers and practitioners dedicated to improving the generalizability of Wi-Fi sensing systems.
[212] InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, Xin Lu
Main category: cs.CV
TL;DR: InfU is a robust framework using DiTs for high-fidelity, identity-preserved image generation, addressing issues like poor identity similarity and text-image alignment.
Details
Motivation: Existing methods struggle with identity similarity, text-image alignment, and generation quality in image generation tasks.
Method: InfU introduces InfuseNet for identity feature injection via residual connections and employs a multi-stage training strategy with SPMS data.
Result: InfU achieves state-of-the-art performance, surpassing existing baselines in identity-preserved image generation.
Conclusion: InfU’s plug-and-play design and superior performance make it a valuable contribution to the field.
Abstract: Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX. We introduce InfiniteYou (InfU), one of the earliest robust frameworks leveraging DiTs for this task. InfU addresses significant issues of existing methods, such as insufficient identity similarity, poor text-image alignment, and low generation quality and aesthetics. Central to InfU is InfuseNet, a component that injects identity features into the DiT base model via residual connections, enhancing identity similarity while maintaining generation capabilities. A multi-stage training strategy, including pretraining and supervised fine-tuning (SFT) with synthetic single-person-multiple-sample (SPMS) data, further improves text-image alignment, ameliorates image quality, and alleviates face copy-pasting. Extensive experiments demonstrate that InfU achieves state-of-the-art performance, surpassing existing baselines. In addition, the plug-and-play design of InfU ensures compatibility with various existing methods, offering a valuable contribution to the broader community.
[213] LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
Xiaoyu Zhang, Weihong Pan, Chong Bao, Xiyu Zhang, Xiaojun Xiang, Hanqing Jiang, Hujun Bao
Main category: cs.CV
TL;DR: FA-NeRF is a frequency-aware framework for view synthesis that balances high-frequency details and low-frequency scene structure in a single NeRF model.
Details
Motivation: Current NeRF frameworks struggle to simultaneously model high-frequency local views and low-frequency scene structures, limiting immersive scene comprehension.
Method: The paper introduces a 3D frequency quantification method, a frequency grid for fast convergence, and a feature re-weighting strategy to balance frequency content.
Result: FA-NeRF outperforms existing methods in modeling entire scenes while preserving fine details.
Conclusion: The proposed framework effectively captures both scene structure and high-definition details, advancing view synthesis capabilities.
Abstract: Humans perceive and comprehend their surroundings through information spanning multiple frequencies. In immersive scenes, people naturally scan their environment to grasp its overall structure while examining fine details of objects that capture their attention. However, current NeRF frameworks primarily focus on modeling either high-frequency local views or the broad structure of scenes with low-frequency information, which is limited to balancing both. We introduce FA-NeRF, a novel frequency-aware framework for view synthesis that simultaneously captures the overall scene structure and high-definition details within a single NeRF model. To achieve this, we propose a 3D frequency quantification method that analyzes the scene’s frequency distribution, enabling frequency-aware rendering. Our framework incorporates a frequency grid for fast convergence and querying, a frequency-aware feature re-weighting strategy to balance features across different frequency contents. Extensive experiments show that our method significantly outperforms existing approaches in modeling entire scenes while preserving fine details. Project page: https://coscatter.github.io/LookCloser/
[214] DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor and RGB Image
Jijun Xiang, Xuan Zhu, Xianqi Wang, Yu Wang, Hong Zhang, Fei Guo, Xin Yang
Main category: cs.CV
TL;DR: The paper introduces DEPTHOR, a novel depth enhancement method for dToF data, addressing calibration errors and misalignment issues. It combines synthetic data simulation and a network incorporating monocular depth estimation, achieving state-of-the-art results.
Details
Motivation: Existing depth enhancement methods rely on idealized assumptions, ignoring real-world challenges like calibration errors and misalignment in dToF data. This limits their practical applicability.
Method: DEPTHOR uses synthetic data simulation for noise-robust training and a network integrating monocular depth estimation to leverage global depth relationships.
Result: On the ZJU-L5 dataset, DEPTHOR improves Rel and RMSE by 27% and 18%, respectively. It also outperforms SOTA methods on a challenging dToF dataset, with 23% and 22% improvements in Rel and RMSE.
Conclusion: DEPTHOR effectively addresses real-world dToF challenges, offering superior performance and robustness compared to existing methods.
Abstract: Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our Code is available at https://github.com/ShadowBbBb/Depthor
[215] RANa: Retrieval-Augmented Navigation
Gianluca Monaci, Rafael S. Rezende, Romain Deffayet, Gabriela Csurka, Guillaume Bono, Hervé Déjean, Stéphane Clinchant, Christian Wolf
Main category: cs.CV
TL;DR: A retrieval-augmented agent is introduced for navigation, leveraging past episode data to improve performance and enable zero-shot transfer across tasks.
Details
Motivation: Current navigation methods reset memory per episode, missing opportunities to reuse past data. Realistic agents should exploit historical information.
Method: RL-trained agent queries a database of past episodes, using vision foundation models for semantic and geometric context encoding.
Result: Retrieval improves performance and enables zero-shot transfer across tasks and environments.
Conclusion: Exploiting past data enhances navigation agents, offering practical benefits for real-world applications.
Abstract: Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ImageNav, Instance-ImageNav and ObjectNav. Our retrieval and context encoding methods are data-driven and employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.
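The retrieval step, querying a database of past episodes in the same environment, can be sketched as a cosine nearest-neighbor lookup over foundation-model embeddings. The database schema and top-k selection here are assumptions for illustration:

```python
import numpy as np

def retrieve_context(query_embedding, episode_db, k=4):
    """Return the k most similar past-episode observations by cosine
    similarity (sketch; the agent's FM encoders and the fusion of the
    retrieved context into the policy are elided).

    episode_db: list of dicts, each with an "embedding" array plus any
    observation payload (hypothetical schema).
    """
    keys = np.stack([e["embedding"] for e in episode_db])
    keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    q = query_embedding / np.linalg.norm(query_embedding)
    top = np.argsort(keys @ q)[::-1][:k]   # highest cosine similarity first
    return [episode_db[i] for i in top]
```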
[216] Sparfels: Fast Reconstruction from Sparse Unposed Imagery
Shubhendu Jena, Amine Ouasfi, Mae Younes, Adnane Boukhayma
Main category: cs.CV
TL;DR: A method for sparse view reconstruction using surface element splatting achieves fast performance (3 minutes on a consumer GPU) and improves shape recovery in sparse, unposed camera settings.
Details
Motivation: Existing methods for sparse radiance field learning often rely on data priors or external geometry priors, leaving shape recovery underexplored in sparse, unposed settings.
Method: The proposed pipeline uses a 3D foundation model to initialize point maps and cameras, then trains a bundle-adjusting 2D Gaussian Splatting (2DGS) model guided by image correspondences. A novel splatted color variance formulation improves shape accuracy.
Result: The method achieves state-of-the-art performance in sparse, uncalibrated settings for reconstruction and novel view synthesis on multi-view datasets.
Conclusion: The approach efficiently addresses sparse view reconstruction with improved shape accuracy, leveraging a 3D foundation model and novel training techniques.
Abstract: We present a method for Sparse view reconstruction with surface element splatting that runs within 3 minutes on a consumer grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations to instantiate a bundle adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting in reconstruction and novel view benchmarks based on established multi-view datasets.
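The "splatted color variance along rays" can be written, under the standard alpha-compositing reading where a ray accumulates splat colors $c_i$ with compositing weights $w_i$ (assumed here to sum to one), as the second moment minus the squared mean:

$$
\mathrm{Var}(c) \;=\; \sum_i w_i\, c_i^{2} \;-\; \Big(\sum_i w_i\, c_i\Big)^{2}.
$$

Both sums run over the splats a ray intersects and can be accumulated in a single splatting pass, which is consistent with the efficiency claim above (the exact weighting is an assumption here). Driving this variance down pushes the splats along each ray toward a consistent color, tightening the reconstructed surface.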
[217] Enhancing Glass Defect Detection with Diffusion Models: Addressing Imbalanced Datasets in Manufacturing Quality Control
Sajjad Rezvani Boroujeni, Hossein Abedi, Tom Bush
Main category: cs.CV
TL;DR: A novel method using DDPMs generates synthetic defective glass images to address class imbalance, improving CNN performance in defect detection.
Details
Motivation: Class imbalance in industrial glass manufacturing limits deep learning model performance for defect detection.
Method: Uses Denoising Diffusion Probabilistic Models (DDPMs) for synthetic data augmentation to balance datasets.
Result: Significant improvements in recall and accuracy (e.g., ResNet50V2 accuracy increased from 78% to 93%).
Conclusion: The approach is scalable and cost-effective for enhancing defect detection, with potential applications in other industries.
Abstract: Visual defect detection in industrial glass manufacturing remains a critical challenge due to the low frequency of defective products, leading to imbalanced datasets that limit the performance of deep learning models and computer vision systems. This paper presents a novel approach using Denoising Diffusion Probabilistic Models (DDPMs) to generate synthetic defective glass product images for data augmentation, effectively addressing class imbalance issues in manufacturing quality control and automated visual inspection. The methodology significantly enhances image classification performance of standard CNN architectures (ResNet50V2, EfficientNetB0, and MobileNetV2) in detecting anomalies by increasing the minority class representation. Experimental results demonstrate substantial improvements in key machine learning metrics, particularly in recall for defective samples across all tested deep neural network architectures while maintaining perfect precision on the validation set. The most dramatic improvement was observed in ResNet50V2’s overall classification accuracy, which increased from 78% to 93% when trained with the augmented data. This work provides a scalable, cost-effective approach to enhancing automated defect detection in glass manufacturing that can potentially be extended to other industrial quality assurance systems and industries with similar class imbalance challenges.
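The augmentation recipe is straightforward once a DDPM has been trained on defective samples: generate synthetic minority-class images until the classes are balanced, then train the CNN on the combined set. A minimal sketch with an assumed sampler interface:

```python
def balance_with_synthetic(real_good, real_defective, ddpm_sampler):
    """Augment the minority (defective) class with DDPM-generated images
    until class sizes match (sketch; ddpm_sampler() -> one synthetic image
    is an assumed interface, not a specific library API)."""
    shortfall = len(real_good) - len(real_defective)
    synthetic = [ddpm_sampler() for _ in range(max(shortfall, 0))]
    images = real_good + real_defective + synthetic
    labels = ([0] * len(real_good)
              + [1] * (len(real_defective) + len(synthetic)))
    return images, labels
```

The downstream classifier (ResNet50V2, EfficientNetB0, MobileNetV2) then sees a balanced label distribution, which is what the reported recall gains are attributed to.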
[218] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model
Timo Kaiser, Thomas Norrenbrock, Bodo Rosenhahn
Main category: cs.CV
TL;DR: The paper introduces USAM, a lightweight Bayesian entropy-based uncertainty quantification method for SAM, addressing aleatoric, epistemic, and task uncertainty, and demonstrates its effectiveness across multiple datasets.
Details
Motivation: Quantifying uncertainty in SAM is challenging due to its class-agnostic nature, necessitating a robust UQ approach.
Method: Proposes USAM, a post-hoc UQ method based on Bayesian entropy, identifying uncertainty sources like under-parameterization, insufficient prompts, or image ambiguities.
Result: USAM outperforms on datasets like SA-V, MOSE, ADE20k, DAVIS, and COCO, offering a cost-effective UQ solution.
Conclusion: USAM provides a practical, efficient UQ alternative for SAM, enhancing applications like user-prompting and semi-supervised learning.
Abstract: The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.
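For context, the standard Bayesian decomposition of predictive entropy separates aleatoric from epistemic uncertainty; the paper's formulation adds a task-uncertainty term on top of this split, whose exact form the abstract does not give:

$$
\underbrace{\mathbb{H}\big[\,\mathbb{E}_{\theta}[p(y \mid x, \theta)]\,\big]}_{\text{total predictive}}
\;=\;
\underbrace{\mathbb{E}_{\theta}\big[\mathbb{H}[p(y \mid x, \theta)]\big]}_{\text{aleatoric}}
\;+\;
\underbrace{\mathcal{I}(y; \theta \mid x)}_{\text{epistemic}}
$$

Here $\theta$ ranges over model parameters and $\mathcal{I}$ is mutual information; irreducible data noise lands in the aleatoric term while model ignorance lands in the epistemic one.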
[219] LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering
Xiaoyi Feng, Kaifeng Zou, Caichun Cen, Tao Huang, Hui Guo, Zizhou Huang, Yingli Zhao, Mingqing Zhang, Ziyuan Zheng, Diwei Wang, Yuntao Zou, Dagang Li
Main category: cs.CV
TL;DR: LinkTo-Anime is the first high-quality dataset for cel anime character motion, providing rich annotations and benchmarks for optical flow research.
Details
Motivation: Existing datasets lack focus on cel anime character motion, hindering research in optical flow estimation and related tasks like anime video generation.
Method: The dataset is generated using 3D model rendering, offering annotations like optical flow, occlusion masks, and Mixamo Skeleton. It includes 395 video sequences with training, validation, and test frames.
Result: LinkTo-Anime comprises 24,230 training, 720 validation, and 4,320 test frames, with benchmarks evaluating optical flow methods.
Conclusion: The dataset bridges a gap in cel anime motion research, aiding optical flow estimation and downstream applications.
Abstract: Existing optical flow datasets focus primarily on real-world simulation or synthetic human motion, but few are tailored to Celluloid (cel) anime character motion: a domain with unique visual and motion characteristics. To bridge this gap and facilitate research in optical flow estimation and downstream tasks such as anime video generation and line drawing colorization, we introduce LinkTo-Anime, the first high-quality dataset specifically designed for cel anime character motion generated with 3D model rendering. LinkTo-Anime provides rich annotations including forward and backward optical flow, occlusion masks, and Mixamo Skeleton. The dataset comprises 395 video sequences, totaling 24,230 training frames, 720 validation frames, and 4,320 test frames. Furthermore, a comprehensive benchmark is constructed with various optical flow estimation methods to analyze the shortcomings and limitations across multiple datasets.
[220] RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS
Chuanyu Fu, Yuqi Zhang, Kunbin Yao, Guanying Chen, Yuan Xiong, Chuan Huang, Shuguang Cui, Xiaochun Cao
Main category: cs.CV
TL;DR: RobustSplat improves 3D Gaussian Splatting by addressing transient object artifacts through delayed Gaussian growth and scale-cascaded mask bootstrapping, outperforming existing methods.
Details
Motivation: Existing 3DGS methods struggle with transient object artifacts, hindering accurate scene modeling.
Method: Proposes delayed Gaussian growth and scale-cascaded mask bootstrapping to mitigate transient object overfitting and improve mask precision.
Result: Outperforms existing methods on challenging datasets, demonstrating robustness and effectiveness.
Conclusion: RobustSplat effectively addresses transient object artifacts in 3DGS, enhancing rendering quality.
Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our project page is https://fcyycf.github.io/RobustSplat/.
[221] Fine-Grained Perturbation Guidance via Attention Head Selection
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Main category: cs.CV
TL;DR: The paper introduces ‘HeadHunter’ and ‘SoftPAG’ for fine-grained control in diffusion models by targeting specific attention heads, improving generation quality and visual attributes.
Details
Motivation: Existing attention perturbation methods lack principled approaches for determining perturbation locations, especially in Diffusion Transformer architectures.
Method: Investigates granularity of attention perturbations, proposes ‘HeadHunter’ for iterative head selection and ‘SoftPAG’ for tuning perturbation strength.
Result: Demonstrates superior performance in quality enhancement and style-specific guidance on models like Stable Diffusion 3 and FLUX.1.
Conclusion: Provides the first head-level analysis of attention perturbation, enabling interpretable specialization and practical perturbation strategies.
Abstract: Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
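Since the abstract spells out the SoftPAG operation (linearly interpolating each selected head's attention map toward an identity matrix), a minimal PyTorch sketch may help; the tensor layout and function name are assumptions, not the authors' code:

```python
import torch

def softpag_attention(attn: torch.Tensor, head_idx, strength: float) -> torch.Tensor:
    """Interpolate selected heads' attention maps toward the identity matrix.

    attn: (batch, heads, seq, seq) post-softmax attention maps.
    head_idx: index or list of head indices chosen (e.g., by HeadHunter).
    strength: perturbation knob in [0, 1]; 0 leaves the heads untouched.
    """
    eye = torch.eye(attn.size(-1), device=attn.device, dtype=attn.dtype)
    attn = attn.clone()
    attn[:, head_idx] = (1 - strength) * attn[:, head_idx] + strength * eye
    return attn
```

Each perturbed row remains a valid attention distribution, since it is a convex combination of a softmax row and an identity row.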
[222] GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models
Guanxi Shen
Main category: cs.CV
TL;DR: GLIMPSE is a lightweight framework for interpreting LVLM outputs by attributing them to visual and textual signals, outperforming prior methods in faithfulness and human-attention alignment.
Details
Motivation: Understanding where LVLMs direct visual attention is crucial for transparency and model behavior analysis.
Method: GLIMPSE combines gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce response-level heat maps.
Result: Outperforms prior methods in faithfulness and aligns better with human attention.
Conclusion: GLIMPSE provides fine-grained insights into LVLM cross-modal reasoning, aiding in diagnosing issues like hallucination and bias.
Abstract: Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). However, interpreting where LVLMs direct their visual attention remains a significant challenge, yet is essential for understanding model behavior. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and pushing the state-of-the-art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.
[223] PEVLM: Parallel Encoding for Vision-Language Models
Letian Kang, Shixian Luo, Yiqiang Li, Yuxin Yin, Shenxuan Zhou, Xiaoyang Yu, Jin Yang, Yong Wu
Main category: cs.CV
TL;DR: PEVLM introduces a fine-tuning-free parallel encoding method for VLMs, reducing attention complexity in long video understanding while maintaining accuracy and achieving significant speedups.
Details
Motivation: Standard attention mechanisms in VLMs hinder long video understanding due to quadratic complexity. PEVLM addresses this inefficiency.
Method: PEVLM partitions videos into context blocks with a shared sink block, preserving sequential position embeddings to align attention weights, reducing complexity from O((T×N)^2) to O(T×N).
Result: PEVLM achieves up to 7.47x speedup, reduces latency by 40%, and improves accuracy under constraints (23.26% to 61.03%).
Conclusion: PEVLM is effective for low-latency, long-context video understanding, offering a practical solution for real-world applications.
Abstract: Vision-Language Models (VLMs) have demonstrated strong capabilities in multimodal understanding and generation tasks. However, their application to long video understanding remains hindered by the quadratic complexity of standard attention mechanisms. In this work, we introduce \textbf{PEVLM}, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of VLMs in long video scenarios. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. This design reduces attention complexity from $O((T \times N)^2)$ to $O(T \times N)$ where $T$ is the number of frames and $N$ the number of tokens per frame, without sacrificing accuracy. Extensive experiments across multiple state-of-the-art models and benchmarks demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to \textbf{7.47x} speedup in attention computation and reducing end-to-end latency by \textbf{40%}. Remarkably, PEVLM not only maintains high accuracy, but in some settings even surpasses Full-Attention performance. Under strict latency constraints, it achieves substantial gains, improving accuracy from \textbf{23.26%} to \textbf{61.03%}. These results underscore the effectiveness of PEVLM for low-latency, long-context video understanding, making it a promising solution for real-world applications.
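One way to picture the partitioning described above is as a block-structured attention mask: every token attends to a shared sink prefix, and each frame block attends only to itself, so cost grows linearly with the number of frames. The sketch below is a hypothetical reconstruction of that structure, not the released implementation:

```python
import torch

def pevlm_block_mask(num_frames: int, tokens_per_frame: int, sink_len: int) -> torch.Tensor:
    """Boolean attention mask: True where attention is allowed."""
    total = sink_len + num_frames * tokens_per_frame
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :sink_len] = True  # every token attends to the shared sink block
    for f in range(num_frames):
        s = sink_len + f * tokens_per_frame
        mask[s:s + tokens_per_frame, s:s + tokens_per_frame] = True  # intra-block
    return mask
```

Each query then attends to at most sink_len + tokens_per_frame keys, giving the O(T×N) cost quoted above.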
[224] ZERO: Industry-ready Vision Foundation Model with Multi-modal Prompts
Sangbum Choi, Kyeongryeol Go, Taewoong Jang
Main category: cs.CV
TL;DR: ZERO is a vision foundation model designed for zero-shot industrial applications, leveraging multi-modal prompting and trained on a compact dataset, achieving competitive performance on benchmarks and real-world tasks.
Details
Motivation: Address the challenge of deploying foundation models in industrial settings without high-quality domain-specific datasets.
Method: Uses multi-modal prompting (textual and visual) and is trained on 0.9 million annotated samples from a proprietary billion-scale dataset.
Result: Competitive performance on LVIS-Val, outperforms existing models on 37 industrial datasets, and ranks high in CVPR 2025 challenges.
Conclusion: ZERO is the first vision foundation model built for domain-specific, zero-shot industrial deployment, demonstrating practical deployability and generalizability.
Abstract: Foundation models have revolutionized AI, yet they struggle with zero-shot deployment in real-world industrial settings due to a lack of high-quality, domain-specific datasets. To bridge this gap, Superb AI introduces ZERO, an industry-ready vision foundation model that leverages multi-modal prompting (textual and visual) for generalization without retraining. Trained on a compact yet representative 0.9 million annotated samples from a proprietary billion-scale industrial dataset, ZERO demonstrates competitive performance on academic benchmarks like LVIS-Val and significantly outperforms existing models across 37 diverse industrial datasets. Furthermore, ZERO achieved 2nd place in the CVPR 2025 Object Instance Detection Challenge and 4th place in the Foundational Few-shot Object Detection Challenge, highlighting its practical deployability and generalizability with minimal adaptation and limited data. To the best of our knowledge, ZERO is the first vision foundation model explicitly built for domain-specific, zero-shot industrial applications.
[225] PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process
Shiqi Jiang, Xinpeng Li, Xi Mao, Changbo Wang, Chenhui Li
Main category: cs.CV
TL;DR: A novel framework for assessing the painting process, including a dataset (PPAD) and a Transformer-based model (PPJudge), outperforms existing methods in accuracy and human alignment.
Details
Motivation: Existing methods focus on static images, ignoring the dynamic, multi-stage nature of painting. This work addresses this gap.
Method: Introduces PPAD dataset and PPJudge model with temporally-aware positional encoding and heterogeneous mixture-of-experts architecture.
Result: Outperforms baselines in accuracy, robustness, and human alignment.
Conclusion: Provides insights into computational creativity and art education.
Abstract: Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.
[226] FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Bingchao Wang, Zhiwei Ning, Jianyu Ding, Xuanang Gao, Yin Li, Dongsheng Jiang, Jie Yang, Wei Liu
Main category: cs.CV
TL;DR: FIX-CLIP improves CLIP for long-text tasks by introducing dual-branch training, regional prompts, and hierarchical feature alignment, achieving SOTA performance.
Details
Motivation: CLIP struggles with long-text inputs due to token limits; FIX-CLIP aims to enhance long-text representation while preserving short-text capabilities.
Method: Proposes dual-branch training, learnable regional prompts, and hierarchical feature alignment. Uses synthesized long-text captions for training.
Result: Achieves state-of-the-art performance on long- and short-text retrieval benchmarks and works well with diffusion models.
Conclusion: FIX-CLIP effectively addresses CLIP’s limitations for long-text tasks and demonstrates strong downstream applicability.
Abstract: CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs ($>77$ tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP’s text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.
[227] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models
X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang
Main category: cs.CV
TL;DR: The paper introduces NarrLV, the first benchmark for evaluating narrative expression in long video generation models, using Temporal Narrative Atoms (TNAs) and a novel MLLM-based metric.
Details
Motivation: Current benchmarks lack focus on narrative richness in long videos, limiting evaluation of advanced video generation models.
Method: Proposes TNAs for narrative measurement, an automatic prompt generation pipeline, and an MLLM-based evaluation metric.
Result: NarrLV aligns with human judgments and reveals capability boundaries of current models in narrative expression.
Conclusion: NarrLV provides a comprehensive benchmark for assessing narrative richness in long video generation, filling a critical gap in evaluation.
Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.
[228] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment
Abhishek Jaiswal, Nisheeth Srivastava
Main category: cs.CV
TL;DR: The paper explores posture-based mental state inference, using cricket as a case study to identify aggressive vs. defensive shot intent with high accuracy, while addressing data sensitivity challenges.
Details
Motivation: To leverage posture for diagnosing mental states (e.g., fatigue, intent) in sports and other domains, overcoming data sensitivity issues.
Method: Analyzes posture and motion in cricket videos to infer intent, using motion analysis and weak supervision from existing data.
Result: Achieves over 75% F1 score and 80% AUC-ROC in distinguishing aggressive vs. defensive shot intent.
Conclusion: Posture provides strong signals for intent inference, with potential applications in sports analytics and broader human behavior analysis.
Abstract: Posture-based mental state inference has significant potential in diagnosing fatigue, preventing injury, and enhancing performance across various domains. Such tools must be research-validated with large datasets before being translated into practice. Unfortunately, such vision diagnosis faces serious challenges due to the sensitivity of human subject data. To address this, we identify sports settings as a viable alternative for accumulating data from human subjects experiencing diverse emotional states. We test our hypothesis in the game of cricket and present a posture-based solution to identify human intent from activity videos. Our method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive and defensive shot intent through motion analysis. These findings indicate that posture leaks out strong signals for intent inference, even with inherent noise in the data pipeline. Furthermore, we utilize existing data statistics as weak supervision to validate our findings, offering a potential solution for overcoming data labelling limitations. This research contributes to generalizable techniques for sports analytics and also opens possibilities for applying human behavior analysis across various fields.
[229] SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Main category: cs.CV
TL;DR: The paper introduces SCORE, an open-vocabulary framework for remote sensing instance segmentation, addressing challenges like diverse landscapes and small objects by integrating multi-granularity scene context.
Details
Motivation: Existing remote sensing instance segmentation methods are limited to close-vocabulary prediction, hindering their ability to recognize novel categories or generalize across datasets.
Method: SCORE integrates regional and global context (Region-Aware Integration and Global Context Adaptation) to enhance visual and textual representations.
Result: SCORE achieves state-of-the-art performance on diverse datasets, setting new benchmarks for open-vocabulary remote sensing instance segmentation.
Conclusion: SCORE provides a robust solution for large-scale geospatial analysis, overcoming limitations of existing methods.
Abstract: Most existing remote sensing instance segmentation approaches are designed for close-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that, our proposed method achieves SOTA performance, which provides a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
[230] AI-ming backwards: Vanishing archaeological landscapes in Mesopotamia and automatic detection of sites on CORONA imagery
Alessandro Pistola, Valentina Orru’, Nicolo’ Marchetti, Marco Roccetti
Main category: cs.CV
TL;DR: Upgrading a deep learning model with CORONA satellite imagery improved archaeological site detection in transformed landscapes, achieving high accuracy and discovering new sites.
Details
Motivation: To enhance AI-driven archaeological site identification in areas altered or destroyed over decades.
Method: Retrained a Bing-based convolutional network using CORONA satellite imagery for Abu Ghraib, Iraq.
Result: Achieved over 85% IoU and 90% accuracy, identifying four new archaeological sites.
Conclusion: AI and CORONA imagery effectively uncover vanished archaeological sites, offering breakthroughs for landscape studies.
Abstract: By upgrading an existing deep learning model with the knowledge provided by one of the oldest sets of grayscale satellite imagery, known as CORONA, we improved the AI model's ability to automatically identify archaeological sites in an environment which has been completely transformed in the last five decades, including the complete destruction of many of those same sites. The initial Bing-based convolutional network model was retrained using CORONA satellite imagery for the district of Abu Ghraib, west of Baghdad, in the central Mesopotamian floodplain. The results were twofold and surprising. First, the detection precision obtained on the area of interest increased considerably: in particular, the Intersection over Union (IoU) values, at the image segmentation level, surpassed 85 percent, while the general accuracy in detecting archaeological sites reached 90 percent. Second, our retrained model allowed the identification of four new sites of archaeological interest (confirmed through field verification), previously not identified by archaeologists with traditional techniques. This has confirmed the efficacy of using AI techniques and CORONA imagery from the 1960s to discover archaeological sites currently no longer visible, a concrete breakthrough with significant consequences for the study of landscapes with vanishing archaeological evidence induced by anthropization.
[231] LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning
Kaihong Wang, Donghyun Kim, Margrit Betke
Main category: cs.CV
TL;DR: A LoRA-enhanced synthetic-replay framework improves continual learning in vision-language models by adapting Stable Diffusion with task-specific low-rank adapters and confidence-based sample selection.
Details
Motivation: Existing synthetic-replay methods generate misaligned samples due to domain-specific nuances, undermining knowledge retention.
Method: Proposes a two-stage, confidence-based sample selection to finetune LoRA adapters and generate high-fidelity synthetic samples.
Result: Outperforms previous synthetic-replay techniques on the MTIL benchmark, balancing plasticity, stability, and zero-shot capability.
Conclusion: Generator adaptation via LoRA enhances continual learning robustness in VLMs.
Abstract: Continual learning for vision-language models has achieved remarkable performance through synthetic replay, where samples are generated using Stable Diffusion to regularize during finetuning and retain knowledge. However, real-world downstream applications often exhibit domain-specific nuances and fine-grained semantics not captured by generators, causing synthetic-replay methods to produce misaligned samples that misguide finetuning and undermine retention of prior knowledge. In this work, we propose a LoRA-enhanced synthetic-replay framework that injects task-specific low-rank adapters into a frozen Stable Diffusion model, efficiently capturing each new task’s unique visual and semantic patterns. Specifically, we introduce a two-stage, confidence-based sample selection: we first rank real task data by post-finetuning VLM confidence to focus LoRA finetuning on the most representative examples, then generate synthetic samples and again select them by confidence for distillation. Our approach integrates seamlessly with existing replay pipelines-simply swap in the adapted generator to boost replay fidelity. Extensive experiments on the Multi-domain Task Incremental Learning (MTIL) benchmark show that our method outperforms previous synthetic-replay techniques, achieving an optimal balance among plasticity, stability, and zero-shot capability. These results demonstrate the effectiveness of generator adaptation via LoRA for robust continual learning in VLMs.
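The first selection stage (ranking real task data by post-finetuning VLM confidence) is simple to sketch; all names below are hypothetical and only illustrate the idea:

```python
import torch

def select_by_confidence(images: torch.Tensor, labels: torch.Tensor,
                         vlm_logits: torch.Tensor, k: int):
    """Keep the k samples on which the finetuned VLM is most confident.

    vlm_logits: (n, num_classes) logits from the post-finetuning VLM.
    labels: (n,) ground-truth class indices.
    """
    probs = vlm_logits.softmax(dim=-1)
    conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # p(true class)
    top = conf.topk(k).indices
    return images[top], labels[top]
```

The same ranking is reused in the second stage to filter generated synthetic samples before distillation.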
[232] DreamScene: 3D Gaussian-based End-to-end Text-to-3D Scene Generation
Haoran Li, Yuli Tian, Kun Lan, Yong Liao, Lin Wang, Pan Hui, Peng Yuan Zhou
Main category: cs.CV
TL;DR: DreamScene is an end-to-end framework for generating high-quality, editable 3D scenes from text or dialogue, addressing automation, consistency, and control challenges.
Details
Motivation: Existing methods for 3D scene generation lack automation, 3D consistency, and fine-grained control, limiting practical applications.
Method: DreamScene uses a GPT-4 agent for scene planning, graph-based placement, Formation Pattern Sampling for geometry, and progressive camera sampling for consistency. It also supports scene editing.
Result: DreamScene outperforms prior methods in quality, consistency, and flexibility, enabling open-domain 3D content creation.
Conclusion: DreamScene offers a practical solution for generating and editing 3D scenes from natural language, advancing the field of 3D content creation.
Abstract: Generating 3D scenes from natural language holds great promise for applications in gaming, film, and design. However, existing methods struggle with automation, 3D consistency, and fine-grained control. We present DreamScene, an end-to-end framework for high-quality and editable 3D scene generation from text or dialogue. DreamScene begins with a scene planning module, where a GPT-4 agent infers object semantics and spatial constraints to construct a hybrid graph. A graph-based placement algorithm then produces a structured, collision-free layout. Based on this layout, Formation Pattern Sampling (FPS) generates object geometry using multi-timestep sampling and reconstructive optimization, enabling fast and realistic synthesis. To ensure global consistency, DreamScene employs a progressive camera sampling strategy tailored to both indoor and outdoor settings. Finally, the system supports fine-grained scene editing, including object movement, appearance changes, and 4D dynamic motion. Experiments demonstrate that DreamScene surpasses prior methods in quality, consistency, and flexibility, offering a practical solution for open-domain 3D content creation. Code and demos are available at https://jahnsonblack.github.io/DreamScene-Full/.
[233] From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition
Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew
Main category: cs.CV
TL;DR: The paper introduces Open-vocabulary Grounded Situation Recognition (Ov-GSR) and proposes Multimodal Interactive Prompt Distillation (MIPD) to transfer knowledge from a teacher MLLM to a smaller GSR model, improving generalization and zero-shot abilities.
Details
Motivation: Existing MLLMs struggle with complex GSR tasks and are resource-heavy, while conventional GSR models lack generalization for unseen and rare situations.
Method: The MIPD framework uses a Judgmental Rationales Generator (JRG) and Negative-Guided Multimodal Prompting Alignment (NMPA) to distill enriched multimodal knowledge from a teacher MLLM into a student Ov-GSR model.
Result: MIPD achieves superior performance on seen, rare, and unseen situations on the Ov-SWiG dataset and improves unseen detection on HICO-DET.
Conclusion: MIPD effectively enhances generalization and zero-shot abilities in GSR tasks, bridging gaps between seen and unseen scenarios and mitigating bias in rare cases.
Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
[234] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
Jiaji Zhang, Ruichao Sun, Hailiang Zhao, Jiaju Wu, Peng Chen, Hao Li, Yuying Liu, Kingsum Chow, Gang Xiong, Shuiguang Deng
Main category: cs.CV
TL;DR: SegQuant is a unified quantization framework for diffusion models, improving efficiency and compatibility without retraining.
Details
Motivation: Diffusion models are computationally intensive, and existing PTQ methods lack generalizability for industrial deployment.
Method: SegQuant combines segment-aware quantization (SegLinear) and dual-scale quantization (DualScale) to preserve model fidelity.
Result: SegQuant achieves strong performance and broad compatibility with deployment tools.
Conclusion: SegQuant enhances efficiency and versatility for diffusion models in resource-constrained environments.
Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.
[235] Addressing High Class Imbalance in Multi-Class Diabetic Retinopathy Severity Grading with Augmentation and Transfer Learning
Faisal Ahmed
Main category: cs.CV
TL;DR: A deep learning framework for diabetic retinopathy (DR) classification achieves high accuracy in binary and five-class tasks using transfer learning and data augmentation.
Details
Motivation: Early automated diagnosis of DR can prevent blindness, but challenges like class imbalance and limited data exist.
Method: Uses transfer learning and data augmentation with ResNet and EfficientNet architectures on the APTOS 2019 dataset.
Result: 98.9% accuracy for binary classification and 84.6% for five-class, outperforming existing methods.
Conclusion: The framework is effective for scalable and accurate DR screening in clinical settings.
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, and early diagnosis through automated retinal image analysis can significantly reduce the risk of blindness. This paper presents a robust deep learning framework for both binary and five-class DR classification, leveraging transfer learning and extensive data augmentation to address the challenges of class imbalance and limited training data. We evaluate a range of pretrained convolutional neural network architectures, including variants of ResNet and EfficientNet, on the APTOS 2019 dataset. For binary classification, our proposed model achieves a state-of-the-art accuracy of 98.9%, with a precision of 98.6%, recall of 99.3%, F1-score of 98.9%, and an AUC of 99.4%. In the more challenging five-class severity classification task, our model obtains a competitive accuracy of 84.6% and an AUC of 94.1%, outperforming several existing approaches. Our findings also demonstrate that EfficientNet-B0 and ResNet34 offer optimal trade-offs between accuracy and computational efficiency across both tasks. These results underscore the effectiveness of combining class-balanced augmentation with transfer learning for high-performance DR diagnosis. The proposed framework provides a scalable and accurate solution for DR screening, with potential for deployment in real-world clinical environments.
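A minimal sketch of the recipe described above, assuming torchvision's EfficientNet-B0 with a new five-class head and augmentations of the kind typically used for class rebalancing; the paper's exact augmentations and hyperparameters may differ:

```python
import torch.nn as nn
from torchvision import models, transforms

# Augmentations commonly used to expand minority DR severity grades.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Pretrained backbone; replace the classifier for 5 severity grades.
model = models.efficientnet_b0(weights="IMAGENET1K_V1")
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 5)
```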
[236] Towards Facilitated Fairness Assessment of AI-based Skin Lesion Classifiers Through GenAI-based Image Synthesis
Ko Watanabe, Stanislav Frolov, Adriano Lucieri, Andreas Dengel
Main category: cs.CV
TL;DR: The paper explores using Generative AI (LightningDiT) to assess fairness in melanoma classifiers, highlighting challenges with dataset discrepancies but advocating synthetic data for fairness evaluation in medical AI.
Details
Motivation: To address biases in melanoma screening AI by ensuring fairness across diverse groups (sex, age, race) using synthetic data.
Method: Leverages the LightningDiT GenAI model to generate realistic synthetic data for fairness assessment of melanoma classifiers.
Result: Fairness assessment is promising with synthetic data but challenging if evaluation models are trained on different datasets.
Conclusion: Synthetic data offers a valuable new method for evaluating and improving fairness in medical-imaging AI systems.
Abstract: Recent advancements in Deep Learning and its application on the edge hold great potential for the revolution of routine screenings for skin cancers like Melanoma. Along with the anticipated benefits of this technology, potential dangers arise from unforeseen and inherent biases. Thus, assessing and improving the fairness of such systems is of utmost importance. A key challenge in fairness assessment is to ensure that the evaluation dataset is sufficiently representative of different Personal Identifiable Information (PII) (sex, age, and race) and other minority groups. Against the backdrop of this challenge, this study leverages the state-of-the-art Generative AI (GenAI) LightningDiT model to assess the fairness of publicly available melanoma classifiers. The results suggest that fairness assessment using highly realistic synthetic data is a promising direction. Yet, our findings indicate that verifying fairness becomes difficult when the melanoma-detection model used for evaluation is trained on data that differ from the dataset underpinning the synthetic images. Nonetheless, we propose that our approach offers a valuable new avenue for employing synthetic data to gauge and enhance fairness in medical-imaging GenAI systems.
[237] Differential-UMamba: Rethinking Tumor Segmentation Under Limited Data Scenarios
Dhruv Jain, Romain Modzelewski, Romain Herault, Clement Chatelain, Eva Torfeh, Sebastien Thureau
Main category: cs.CV
TL;DR: Diff-UMamba combines UNet with mamba for medical image segmentation, using noise reduction to improve accuracy in low-data settings.
Details
Motivation: Deep learning models overfit in data-scarce scenarios, limiting generalization. Diff-UMamba aims to enhance focus on clinically significant regions.
Method: Integrates UNet with mamba mechanism and a noise reduction module to suppress irrelevant activations.
Result: Achieves 1-5% performance gains on public and internal datasets, especially in low-data conditions.
Conclusion: Diff-UMamba improves segmentation accuracy and robustness, particularly in data-scarce medical imaging tasks.
Abstract: In data-scarce scenarios, deep learning models often overfit to noise and irrelevant patterns, which limits their ability to generalize to unseen samples. To address these challenges in medical image segmentation, we introduce Diff-UMamba, a novel architecture that combines the UNet framework with the mamba mechanism to model long-range dependencies. At the heart of Diff-UMamba is a noise reduction module, which employs a signal differencing strategy to suppress noisy or irrelevant activations within the encoder. This encourages the model to filter out spurious features and enhance task-relevant representations, thereby improving its focus on clinically significant regions. As a result, the architecture achieves improved segmentation accuracy and robustness, particularly in low-data settings. Diff-UMamba is evaluated on multiple public datasets, including the Medical Segmentation Decathlon dataset (lung and pancreas) and AIIB23, demonstrating consistent performance gains of 1-3% over baseline methods in various segmentation tasks. To further assess performance under limited data conditions, additional experiments are conducted on the BraTS-21 dataset by varying the proportion of available training samples. The approach is also validated on a small internal non-small cell lung cancer dataset for the segmentation of gross tumor volume in cone beam CT, where it achieves a 4-5% improvement over baseline.
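The "signal differencing" idea can be pictured as subtracting a smoothed copy of the encoder features so that slowly-varying, task-irrelevant activations cancel out. This toy module is an assumption about the mechanism, and the paper's actual design (including whether it operates in 2D or 3D) may differ:

```python
import torch.nn as nn

class NoiseReductionSketch(nn.Module):
    """Hypothetical signal-differencing block: x minus a learned smoothing of x."""
    def __init__(self, channels: int):
        super().__init__()
        self.smooth = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x - self.smooth(x)  # the residual keeps the sharp, task-relevant signal
```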
[238] DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering
Jie Chen, Zhangchi Hu, Peixi Wu, Huyue Zhu, Hebei Li, Xiaoyan Sun
Main category: cs.CV
TL;DR: DASH is a real-time dynamic scene rendering framework using 4D hash encoding and self-supervised decomposition to improve rendering quality and avoid low-rank assumptions.
Details
Motivation: Existing methods suffer from feature overlap, poor rendering quality, and hash collisions due to unsuitable assumptions or direct application of 4D hash encoding.
Method: DASH employs self-supervised decomposition to separate dynamic/static components, a multiresolution 4D hash encoder for dynamics, and spatio-temporal smoothness regularization.
Result: DASH achieves state-of-the-art performance with 264 FPS on a 4090 GPU, offering enhanced visual quality.
Conclusion: DASH effectively addresses challenges in dynamic scene reconstruction, providing high-quality real-time rendering.
Abstract: Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
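For a feel of the dynamic branch's encoder: multiresolution hash grids index lattice corners through a per-dimension prime XOR hash, here extended from the familiar 3D case to (x, y, z, t). This follows the Instant-NGP-style construction and is a sketch, not the paper's exact encoder:

```python
import torch

PRIMES = (1, 2654435761, 805459861, 3674653429)  # one hashing prime per dimension

def hash4d(coords: torch.Tensor, table_size: int) -> torch.Tensor:
    """Map integer 4D lattice coordinates (..., 4) to hash-table slots."""
    h = torch.zeros(coords.shape[:-1], dtype=torch.long, device=coords.device)
    for d in range(4):
        h ^= coords[..., d].long() * PRIMES[d]
    return h % table_size
```

At each resolution level, a feature vector is fetched at the slot returned by hash4d and interpolated across the 16 corners of the enclosing 4D cell.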
[239] Bias Analysis for Synthetic Face Detection: A Case Study of the Impact of Facial Attributes
Asmae Lamsaf, Lucia Cascone, Hugo Proença, João Neves
Main category: cs.CV
TL;DR: The paper addresses bias in synthetic face detectors, proposing an evaluation framework to analyze bias across facial attributes and providing insights into its origins.
Details
Motivation: Existing synthetic face detection models and datasets may be biased, leading to detection failures for certain demographic groups and raising ethical concerns.
Method: The study introduces an evaluation framework using synthetic data with evenly distributed attributes to analyze bias in five state-of-the-art detectors.
Result: Results confirm bias in detectors toward specific facial attributes, with insights into its origins from training data imbalances and activation maps.
Conclusion: The framework aids in understanding and mitigating bias in synthetic face detectors, highlighting the need for balanced training data.
Abstract: Bias analysis for synthetic face detection is bound to become a critical topic in the coming years. Although many detection models have been developed and several datasets have been released to reliably identify synthetic content, one crucial aspect has been largely overlooked: these models and training datasets can be biased, leading to failures in detection for certain demographic groups and raising significant social, legal, and ethical issues. In this work, we introduce an evaluation framework to contribute to the analysis of bias of synthetic face detectors with respect to several facial attributes. This framework exploits synthetic data generation, with evenly distributed attribute labels, for mitigating any skew in the data that could otherwise influence the outcomes of bias analysis. We build on the proposed framework to provide an extensive case study of the bias level of five state-of-the-art detectors in synthetic datasets with 25 controlled facial attributes. While the results confirm that, in general, synthetic face detectors are biased towards the presence/absence of specific facial attributes, our study also sheds light on the origins of the observed bias through the analysis of the correlations with the balancing of facial attributes in the training sets of the detectors, and the analysis of detectors activation maps in image pairs with controlled attribute modifications.
[240] Knowledge Regularized Negative Feature Tuning of Vision-Language Models for Out-of-Distribution Detection
Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Main category: cs.CV
TL;DR: KR-NFT improves OOD detection by integrating Negative Feature Tuning and knowledge regularization, enhancing ID/OOD separation and reducing generalization issues.
Details
Motivation: Address the reduced generalization performance of negative prompt-tuned models in OOD detection.
Method: Proposes KR-NFT, combining Negative Feature Tuning (NFT) for feature separation and knowledge regularization (KR) for optimization.
Result: Improves ID classification, OOD detection, and reduces FPR95 by 5.44% with few-shot ImageNet training.
Conclusion: KR-NFT is efficient, scalable, and outperforms traditional methods in OOD detection and generalization.
Abstract: Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at https://github.com/ZhuWenjie98/KRNFT.
[241] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
Chieh-Yun Chen, Min Shi, Gong Zhang, Humphrey Shi
Main category: cs.CV
TL;DR: T2I-Copilot is a training-free multi-agent system that automates prompt engineering for text-to-image generation, improving quality and alignment without additional training.
Details
Motivation: Current T2I models are sensitive to prompt phrasing and lack clear feedback, requiring repeated refinement. Existing solutions offer limited controllability or need extra training.
Method: T2I-Copilot uses three agents: Input Interpreter (parses prompts), Generation Engine (selects models and organizes prompts), and Quality Evaluator (assesses outputs).
Result: Outperforms commercial models like RecraftV3 and Imagen 3, and surpasses FLUX1.1-pro by 6.17% at lower cost.
Conclusion: T2I-Copilot simplifies prompt engineering, enhances generation quality, and supports autonomous or human-in-the-loop operation.
Abstract: Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to repeatedly refine prompts multiple times without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
[242] SCALAR: Scale-wise Controllable Visual Autoregressive Learning
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu
Main category: cs.CV
TL;DR: SCALAR introduces a scale-wise conditional decoding method for controllable image synthesis in VAR models, improving fidelity and efficiency.
Details
Motivation: Existing VAR-based methods struggle with inefficient control encoding and disruptive injection, compromising generation quality.
Method: SCALAR uses a pretrained encoder to extract and project control signals into scale-specific representations, injected into VAR layers. SCALAR-Uni extends this for multi-modal control.
Result: SCALAR achieves superior generation quality and control precision in experiments.
Conclusion: SCALAR provides a robust solution for controllable generation in VAR models, with potential for multi-modal applications.
Abstract: Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation process. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model. Extensive experiments show that SCALAR achieves superior generation quality and control precision across various tasks.
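The scale-wise injection described above can be sketched as one projection per scale mapping a shared control encoding into that scale's feature space; module and method names here are hypothetical:

```python
import torch
import torch.nn as nn

class ScalewiseControl(nn.Module):
    """One linear projection per VAR scale for a shared control encoding."""
    def __init__(self, ctrl_dim: int, scale_dims: list):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(ctrl_dim, d) for d in scale_dims)

    def inject(self, scale_idx: int, features: torch.Tensor, ctrl: torch.Tensor) -> torch.Tensor:
        # Add the scale-specific control representation to that layer's input.
        return features + self.proj[scale_idx](ctrl)
```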
[243] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1
Xinhan Di, Kristin Qi, Pengqian Yu
Main category: cs.CV
TL;DR: The paper introduces JWB-DH-V1, a dataset and evaluation framework for joint whole-body motion and speech generation, highlighting performance gaps in current methods.
Details
Motivation: Current diffusion-based video generation lacks multi-modal consistency and evaluation benchmarks for whole-body motion and speech synthesis.
Method: The authors propose JWB-DH-V1, a large-scale dataset with 10,000 identities and 2 million samples, and an evaluation protocol for joint audio-video generation.
Result: Evaluation of SOTA models shows disparities between face/hand-centric and whole-body performance, identifying key research areas.
Conclusion: The dataset and tools are publicly released to advance research in joint whole-body and speech generation.
Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version 1 (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.
[244] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Main category: cs.CV
TL;DR: A survey on multimodal long context token compression, categorizing approaches by modality (image, video, audio) and mechanisms (transformation, similarity, attention, query-based).
Details
Motivation: Address computational challenges in MLLMs due to quadratic complexity of self-attention with long contexts.
Method: Systematic survey and categorization of token compression methods by modality and underlying mechanisms.
Result: Comprehensive overview of current approaches, challenges, and future directions in token compression.
Conclusion: The survey consolidates progress, identifies challenges, and inspires future research, with a public repository for updates.
Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
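As a flavor of the similarity-based family the survey categorizes, the toy NumPy sketch below greedily merges adjacent tokens whose cosine similarity exceeds a threshold. The threshold, shapes, and greedy grouping are illustrative assumptions rather than any specific surveyed method.

```python
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedy similarity-based compression: average each run of adjacent
    tokens whose cosine similarity to the current group exceeds `threshold`."""
    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        rep = np.mean(groups[-1], axis=0)
        cos = rep @ tok / (np.linalg.norm(rep) * np.linalg.norm(tok) + 1e-8)
        if cos > threshold:
            groups[-1].append(tok)   # redundant: merge into current group
        else:
            groups.append([tok])     # distinct: start a new group
    return np.stack([np.mean(g, axis=0) for g in groups])

tokens = np.random.randn(196, 64)    # e.g., one image's patch tokens
compressed = merge_similar_tokens(tokens)
print(tokens.shape, "->", compressed.shape)
```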
[245] From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos
Chenjian Gao, Lihe Ding, Rui Han, Zhanpeng Huang, Zibin Wang, Tianfan Xue
Main category: cs.CV
TL;DR: A hybrid pipeline combining 3D Gaussian Splatting and 2D diffusion models for realistic and temporally consistent 3D object insertion in videos.
Details
Motivation: Addressing the challenge of inserting 3D objects into videos with realistic lighting and temporal coherence, especially in dynamic scenarios.Method: Uses 3D Gaussian Splatting for initial rendering and a 2D diffusion model for refinement, optimizing shading and sRGB images.
Result: Achieves photorealistic lighting and temporal consistency in dynamic wrist scenes.
Conclusion: The hybrid approach synergizes 3D rendering and 2D diffusion, offering a robust solution for video object insertion.
Abstract: Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: https://cjeen.github.io/BraceletPaper/
[246] Beyond Class Tokens: LLM-guided Dominant Property Mining for Few-shot Classification
Wei Zhuo, Runjie Luo, Wufeng Xue, Linlin Shen
Main category: cs.CV
TL;DR: The paper introduces BCT-CLIP, a novel Few-Shot Learning method that leverages dominating properties and contrastive learning to improve discriminative representation learning, outperforming existing methods on 11 datasets.
Details
Motivation: Addressing the challenge of data scarcity in Few-Shot Learning (FSL) by moving beyond simple class name alignment to incorporate comprehensive visual property representations.Method: Proposes BCT-CLIP, which includes a multi-property generator (MPG), LLM-assisted retrieval for dominating properties, and a contrastive learning strategy for property-token learning.
Result: Demonstrates superior performance on 11 widely used datasets, showing improved discriminative class-specific representation learning.
Conclusion: Exploring dominating properties through contrastive learning advances FSL by enhancing visual diversity and discriminative power.
Abstract: Few-shot Learning (FSL), which endeavors to develop the generalization ability for recognizing novel classes using only a few images, faces significant challenges due to data scarcity. Recent CLIP-like methods based on contrastive language-image pretraining mitigate the issue by leveraging the textual representation of the class name for unseen image discovery. Despite the achieved success, simply aligning visual representations to class name embeddings would compromise the visual diversity needed for novel class discrimination. To this end, we propose a novel Few-Shot Learning method (BCT-CLIP) that explores dominating properties via contrastive learning beyond simply using class tokens. By leveraging LLM-based prior knowledge, our method pushes forward FSL with comprehensive structural image representations, including both a global category representation and patch-aware property embeddings. In particular, we present a novel multi-property generator (MPG) with patch-aware cross-attentions to generate multiple visual property tokens, a Large Language Model (LLM)-assisted retrieval procedure with clustering-based pruning to obtain dominating property descriptions, and a new contrastive learning strategy for property-token learning. Superior performance on 11 widely used datasets demonstrates that our investigation of dominating properties advances discriminative class-specific representation learning and few-shot classification.
cs.AI
[247] SynLang and Symbiotic Epistemology: A Manifesto for Conscious Human-AI Collaboration
Jan Kapusta
Main category: cs.AI
TL;DR: The paper proposes Symbiotic Epistemology and SynLang for transparent human-AI collaboration, enhancing trust and ethical accountability.
Details
Motivation: Current AI systems lack transparency, hindering human oversight and collaboration. Post-hoc explanations fail to enable genuine symbiotic partnerships.Method: Introduces Symbiotic Epistemology and SynLang, a formal protocol with TRACE and TRACE_FE mechanisms for reasoning transparency and confidence calibration.
Result: Empirical validation shows AI’s adaptation to structured reasoning and successful metacognitive intervention, improving human-AI collaboration.
Conclusion: SynLang and symbiotic epistemology enable transparent, ethical AI collaboration, preserving human agency and enhancing decision-making.
Abstract: Current AI systems rely on opaque reasoning processes that hinder human oversight and collaborative potential. Conventional explainable AI approaches offer post-hoc justifications and often fail to establish genuine symbiotic collaboration. In this paper, the Symbiotic Epistemology is presented as a philosophical foundation for human-AI cognitive partnerships. Unlike frameworks that treat AI as a mere tool or replacement, symbiotic epistemology positions AI as a reasoning partner, fostering calibrated trust by aligning human confidence with AI reliability through explicit reasoning patterns and confidence assessments. SynLang (Symbiotic Syntactic Language) is introduced as a formal protocol for transparent human-AI collaboration. The framework is empirically validated through actual human-AI dialogues demonstrating AI’s adaptation to structured reasoning protocols and successful metacognitive intervention. The protocol defines two complementary mechanisms: TRACE for high-level reasoning patterns and TRACE_FE for detailed factor explanations. It also integrates confidence quantification, declarative control over AI behavior, and context inheritance for multi-agent coordination. By structuring communication and embedding confidence-calibrated transparency, SynLang, together with symbiotic epistemology, enables AI systems that enhance human intelligence, preserve human agency, and uphold ethical accountability in collaborative decision-making. Through dual-level transparency, beginning with high-level reasoning patterns and progressing to granular explanations, the protocol facilitates rapid comprehension and supports thorough verification of AI decision-making.
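The abstract names the TRACE and TRACE_FE mechanisms but not a concrete wire format, so the dataclasses below are a purely hypothetical rendering of a confidence-calibrated reasoning message; every field name is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    """One high-level reasoning step with an explicit confidence (TRACE)."""
    claim: str
    confidence: float  # 0.0-1.0, calibrated by the agent

@dataclass
class TraceFE:
    """Detailed factor explanation behind a step (TRACE_FE)."""
    factor: str
    weight: float
    evidence: str

@dataclass
class SynLangMessage:
    """A dual-level message: patterns first, granular factors second."""
    trace: list[TraceStep] = field(default_factory=list)
    trace_fe: list[TraceFE] = field(default_factory=list)

msg = SynLangMessage(
    trace=[TraceStep("Option B best satisfies the stated constraints", 0.72)],
    trace_fe=[TraceFE("budget_fit", 0.6, "B is 12% under the cap")],
)
print(msg.trace[0].confidence)
```

The dual-level split mirrors the abstract's idea of rapid comprehension via high-level patterns, with granular explanations available for thorough verification.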
[248] Artificial intelligence for sustainable wine industry: AI-driven management in viticulture, wine production and enotourism
Marta Sidorkiewicz, Karolina Królikowska, Berenika Dyczek, Edyta Pijet-Migon, Anna Dubel
Main category: cs.AI
TL;DR: AI enhances sustainability and efficiency in the wine industry through intelligent management in viticulture, production, and enotourism, as evidenced by a survey of Polish winemakers and analysis of AI methods.
Details
Motivation: The wine industry faces environmental and economic challenges, and AI offers solutions to optimize resources, reduce impact, and improve customer engagement.Method: Questionnaire survey among Polish winemakers and analysis of AI technologies like predictive analytics, machine learning, and computer vision.
Result: AI improves vineyard monitoring, irrigation, production efficiency, and enotourism experiences through chatbots and virtual tastings.
Conclusion: AI supports economic, environmental, and social sustainability in the wine industry, benefiting local enterprises and cultural heritage.
Abstract: This study examines the role of Artificial Intelligence (AI) in enhancing sustainability and efficiency within the wine industry. It focuses on AI-driven intelligent management in viticulture, wine production, and enotourism. As the wine industry faces environmental and economic challenges, AI offers innovative solutions to optimize resource use, reduce environmental impact, and improve customer engagement. Understanding AI’s potential in sustainable winemaking is crucial for fostering responsible and efficient industry practices. The research is based on a questionnaire survey conducted among Polish winemakers, combined with a comprehensive analysis of AI methods applicable to viticulture, production, and tourism. Key AI technologies, including predictive analytics, machine learning, and computer vision, are explored. The findings indicate that AI enhances vineyard monitoring, optimizes irrigation, and streamlines production processes, contributing to sustainable resource management. In enotourism, AI-powered chatbots, recommendation systems, and virtual tastings personalize consumer experiences. The study highlights AI’s impact on economic, environmental, and social sustainability, supporting local wine enterprises and cultural heritage. Keywords: Artificial Intelligence, Sustainable Development, AI-Driven Management, Viticulture, Wine Production, Enotourism, Wine Enterprises, Local Communities
[249] Adaptive Cluster Collaborativeness Boosts LLMs Medical Decision Support Capacity
Zhihao Peng, Liuxin Bao, Shengyuan Liu, Yixuan Yuan
Main category: cs.AI
TL;DR: The paper proposes an adaptive cluster collaborativeness method for LLMs in healthcare, using self-diversity and cross-consistency mechanisms to improve medical decision support without predefined clusters.
Details
Motivation: Existing LLM collaborativeness lacks explicit selection rules and relies on predefined clusters, which may include underperforming models in medical scenarios.Method: The method involves calculating self-diversity (fuzzy matching within an LLM) and cross-consistency (between LLMs), then masking inconsistent models to enhance collaboration.
Result: Experiments on NEJMQA and MMLU-Pro-health datasets show improved accuracy, e.g., 65.47% vs. GPT-4’s 56.12% in Obstetrics and Gynecology.
Conclusion: The proposed method effectively enhances LLM collaborativeness in medical decision support, outperforming existing approaches.
Abstract: The collaborativeness of large language models (LLMs) has proven effective in natural language processing systems, holding considerable promise for healthcare development. However, it lacks explicit component selection rules, necessitating human intervention or clinical-specific validation. Moreover, existing architectures heavily rely on a predefined LLM cluster, where partial LLMs underperform in medical decision support scenarios, invalidating the collaborativeness of LLMs. To this end, we propose an adaptive cluster collaborativeness methodology involving self-diversity and cross-consistency maximization mechanisms to boost LLMs' medical decision support capacity. For self-diversity, we calculate the fuzzy matching value of pairwise outputs within an LLM as its self-diversity value, subsequently prioritizing LLMs with high self-diversity values as cluster components in a training-free manner. For cross-consistency, we first measure cross-consistency values between the LLM with the highest self-diversity value and the others, and then gradually mask out the LLM having the lowest cross-consistency value to eliminate potentially inconsistent output during collaborative propagation. Extensive experiments on two specialized medical datasets, NEJMQA and MMLU-Pro-health, demonstrate the effectiveness of our method across physician-oriented specialties. For example, on NEJMQA, our method reaches the official passing score across all disciplines, notably achieving an accuracy of 65.47% compared to the 56.12% achieved by GPT-4 on the Obstetrics and Gynecology discipline.
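A minimal sketch of the self-diversity idea, using Python's difflib as a stand-in for whatever fuzzy-matching score the authors compute over pairwise outputs of a single LLM:

```python
from difflib import SequenceMatcher
from itertools import combinations

def self_diversity(outputs: list[str]) -> float:
    """Mean pairwise dissimilarity of one LLM's sampled outputs.
    SequenceMatcher is a stand-in for the paper's fuzzy-matching score."""
    if len(outputs) < 2:
        return 0.0
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(outputs, 2)]
    return 1.0 - sum(sims) / len(sims)

# Toy example: an LLM sampled three times on the same medical question.
samples = [
    "Start metformin and recheck HbA1c in 3 months.",
    "Begin metformin; repeat HbA1c after three months.",
    "Refer to endocrinology for insulin initiation.",
]
print(round(self_diversity(samples), 3))
```

Under the paper's scheme, models with higher self-diversity values would be kept as cluster components without any training.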
[250] Leveraging Generative AI to Enhance Synthea Module Development
Mark A. Kramer, Aanchal Mathur, Caroline E. Adams, Jason A. Walonoski
Main category: cs.AI
TL;DR: LLMs assist in developing Synthea disease modules, reducing time and expertise needed while improving data quality. Methods include generating profiles, modules, evaluation, and refinement. Challenges like human oversight and validation are noted.
Details
Motivation: To enhance Synthea's synthetic health data generation by leveraging LLMs for faster, more diverse, and higher-quality module development.Method: Four approaches: generating disease profiles, creating modules from profiles, evaluating existing modules, and refining them via progressive refinement (iterative checks for correctness and accuracy).
Result: LLMs show promise in aiding module development but require human oversight and rigorous validation to address potential inaccuracies.
Conclusion: Future research should focus on optimizing LLM-aided synthetic data creation, addressing limitations, and ensuring clinical accuracy.
Abstract: This paper explores the use of large language models (LLMs) to assist in the development of new disease modules for Synthea, an open-source synthetic health data generator. Incorporating LLMs into the module development process has the potential to reduce development time, reduce required expertise, expand model diversity, and improve the overall quality of synthetic patient data. We demonstrate four ways that LLMs can support Synthea module creation: generating a disease profile, generating a disease module from a disease profile, evaluating an existing Synthea module, and refining an existing module. We introduce the concept of progressive refinement, which involves iteratively evaluating the LLM-generated module by checking its syntactic correctness and clinical accuracy, and then using that information to modify the module. While the use of LLMs in this context shows promise, we also acknowledge the challenges and limitations, such as the need for human oversight, the importance of rigorous testing and validation, and the potential for inaccuracies in LLM-generated content. The paper concludes with recommendations for future research and development to fully realize the potential of LLM-aided synthetic data creation.
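A minimal sketch of the progressive refinement loop described above; the `generate` and checker callables are hypothetical stand-ins for the LLM and validator calls, and the toy stubs simply flip a flag once feedback is supplied, which is enough to exercise the loop end to end.

```python
def progressive_refinement(profile: str, generate, check_syntax, check_clinical,
                           max_rounds: int = 5) -> str:
    """Sketch of progressive refinement: generate a module, check it for
    syntactic correctness and clinical accuracy, and feed the findings
    back until both checks pass (or the round budget runs out)."""
    module = generate(profile, feedback=None)
    for _ in range(max_rounds):
        issues = check_syntax(module) + check_clinical(module)
        if not issues:
            return module
        module = generate(profile, feedback=issues)
    return module  # best effort after max_rounds

# Toy stand-ins so the loop runs end to end.
module = progressive_refinement(
    "type 2 diabetes",
    generate=lambda p, feedback: f"module({p}, fixed={feedback is not None})",
    check_syntax=lambda m: [] if "fixed=True" in m else ["syntax error"],
    check_clinical=lambda m: [],
)
print(module)
```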
[251] Games Agents Play: Towards Transactional Analysis in LLM-based Multi-Agent Systems
Monika Zamojska, Jarosław A. Chudziak
Main category: cs.AI
TL;DR: Trans-ACT integrates Transactional Analysis into Multi-Agent Systems to create psychologically realistic agents, improving social interaction simulations.
Details
Motivation: Existing MAS frameworks lack cognitive complexity for realistic human behavior.Method: Embed TA principles (Parent, Adult, Child ego states) into agents, using context-specific memories and life scripts to shape responses.
Result: Agents in the Stupid game scenario showed deeper, context-aware interactions.
Conclusion: Trans-ACT enables advanced applications in conflict resolution, education, and social psychology.
Abstract: Multi-Agent Systems (MAS) are increasingly used to simulate social interactions, but most of the frameworks miss the underlying cognitive complexity of human behavior. In this paper, we introduce Trans-ACT (Transactional Analysis Cognitive Toolkit), an approach embedding Transactional Analysis (TA) principles into MAS to generate agents with realistic psychological dynamics. Trans-ACT integrates the Parent, Adult, and Child ego states into an agent’s cognitive architecture. Each ego state retrieves context-specific memories and uses them to shape response to new situations. The final answer is chosen according to the underlying life script of the agent. Our experimental simulation, which reproduces the Stupid game scenario, demonstrates that agents grounded in cognitive and TA principles produce deeper and context-aware interactions. Looking ahead, our research opens a new way for a variety of applications, including conflict resolution, educational support, and advanced social psychology studies.
[252] Measuring and Analyzing Intelligence via Contextual Uncertainty in Large Language Models using Information-Theoretic Metrics
Jae Wan Shim
Main category: cs.AI
TL;DR: The paper introduces a task-agnostic method to analyze how LLMs process information, using a ‘Cognitive Profile’ based on the Entropy Decay Curve and the IGS index.
Details
Motivation: To move beyond measuring what LLMs can do and instead understand how they process information internally.Method: A novel approach using the Entropy Decay Curve and IGS index to create Cognitive Profiles for LLMs, tested on diverse texts and models.
Result: Unique and consistent cognitive profiles were found, sensitive to model scale and text complexity.
Conclusion: Provides a principled framework for analyzing and comparing the operational dynamics of LLMs.
Abstract: The remarkable capabilities of Large Language Models (LLMs) are now extensively documented on task-specific benchmarks, yet the internal mechanisms that produce these results are the subject of intense scientific inquiry. This paper contributes to this inquiry by moving beyond metrics that measure what models can do, to a methodology that characterizes how they process information. We introduce a novel, task-agnostic approach to probe these dynamics by creating a quantitative "Cognitive Profile" for any given model. This profile is centered on the Entropy Decay Curve, a visualization that traces how a model's normalized predictive uncertainty changes as a function of context length. Applying this methodology to several state-of-the-art LLMs across diverse texts, we uncover unique and consistent cognitive profiles that are sensitive to both model scale and text complexity. We also introduce the Information Gain Span (IGS) index to summarize the desirability of the decay trajectory. This work thus provides a new, principled lens for analyzing and comparing the intrinsic operational dynamics of artificial intelligence.
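The sketch below illustrates the Entropy Decay Curve idea with a toy n-gram character model in place of an LLM: normalized next-symbol entropy is computed as a function of context length. The n-gram estimator and the normalization by log vocabulary size are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def entropy_decay_curve(text: str, max_context: int = 6) -> list[float]:
    """Normalized next-character entropy vs. context length, estimated
    from n-gram counts on `text`. A toy stand-in for an LLM's
    predictive entropy over growing contexts."""
    vocab = sorted(set(text))
    curve = []
    for n in range(max_context + 1):
        ctx_counts = defaultdict(Counter)
        for i in range(len(text) - n):
            ctx_counts[text[i:i + n]][text[i + n]] += 1
        # Average entropy over contexts, weighted by context frequency.
        total, acc = 0, 0.0
        for counts in ctx_counts.values():
            m = sum(counts.values())
            acc += m * -sum(c / m * math.log2(c / m) for c in counts.values())
            total += m
        curve.append(acc / total / math.log2(len(vocab)))  # normalize to [0, 1]
    return curve

text = "the quick brown fox jumps over the lazy dog " * 50
print([round(h, 3) for h in entropy_decay_curve(text)])  # decays toward 0
```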
[253] “Teammates, Am I Clear?”: Analysing Legible Behaviours in Teams
Miguel Faria, Francisco S. Melo, Ana Paiva
Main category: cs.AI
TL;DR: Extends legible decision-making to multi-agent teams, improving collaboration performance.
Details
Motivation: Existing legibility works focus on single-agent-human interactions, missing benefits for teams.Method: Proposes an extension of legible decision-making for multi-agent settings.
Result: Legible agents in teams outperform standard optimal-behavior teams.
Conclusion: Legible decision-making enhances team performance in multi-agent scenarios.
Abstract: In this paper we investigate the notion of legibility in sequential decision-making in the context of teams and teamwork. There have been works that extend the notion of legibility to sequential decision making, for deterministic and for stochastic scenarios. However, these works focus on one agent interacting with one human, foregoing the benefits of having legible decision making in teams of agents or in team configurations with humans. In this work we propose an extension of legible decision-making to multi-agent settings that improves the performance of agents working in collaboration. We showcase the performance of legible decision making in team scenarios using our proposed extension in multi-agent benchmark scenarios. We show that a team with a legible agent is able to outperform a team composed solely of agents with standard optimal behaviour.
[254] INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems
Bintao Tang, Xin Yang, Yuhao Wang, Zixuan Qiu, Zimo Ji, Wenyuan Jiang
Main category: cs.AI
TL;DR: INTEGRALBENCH is a benchmark for evaluating LLMs on definite integrals, revealing performance gaps and difficulty-accuracy correlations.
Details
Motivation: To advance automated mathematical reasoning by providing a rigorous evaluation framework for definite integrals.Method: INTEGRALBENCH includes symbolic and numerical ground truth solutions with manual difficulty annotations, tested on nine state-of-the-art LLMs.
Result: Significant performance gaps and strong correlations between problem difficulty and model accuracy were observed.
Conclusion: INTEGRALBENCH establishes baseline metrics for evaluating LLMs in definite integral computation, aiding future advancements.
Abstract: We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty annotations. Our evaluation of nine state-of-the-art LLMs reveals significant performance gaps and strong correlations between problem difficulty and model accuracy, establishing baseline metrics for this challenging domain. INTEGRALBENCH aims to advance automated mathematical reasoning by providing a rigorous evaluation framework specifically tailored for definite integral computation.
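A minimal sketch of how a benchmark like this might score a model's numerical answer against the numerical ground truth; the tolerance values are assumptions, not from the paper:

```python
import math

def score_numeric(predicted: float, truth: float, rel_tol: float = 1e-4) -> bool:
    """Mark a model's numerical integral value correct if it lies within a
    relative tolerance of the ground truth (tolerances are assumptions)."""
    return math.isclose(predicted, truth, rel_tol=rel_tol, abs_tol=1e-8)

# e.g., the definite integral of x^2 on [0, 1] is 1/3.
print(score_numeric(0.333333, 1 / 3))  # True
print(score_numeric(0.35, 1 / 3))      # False
```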
[255] A finite time analysis of distributed Q-learning
Han-Dong Lim, Donghwan Lee
Main category: cs.AI
TL;DR: The paper analyzes distributed Q-learning in MARL, providing finite-time sample complexity bounds for cooperative agents without central reward access.
Details
Motivation: The study is motivated by the success of single-agent RL and the need to extend it to multi-agent settings with distributed rewards.Method: The authors propose a distributed Q-learning algorithm where agents cooperatively solve sequential decision-making problems using local rewards.
Result: They derive a finite-time sample complexity bound, expressed in terms of mixing time, discount factor, and graph connectivity.
Conclusion: The results contribute to understanding the efficiency of distributed Q-learning in MARL, highlighting dependencies on system parameters.
Abstract: Multi-agent reinforcement learning (MARL) has witnessed a remarkable surge in interest, fueled by the empirical success achieved in applications of single-agent reinforcement learning (RL). In this study, we consider a distributed Q-learning scenario, wherein a number of agents cooperatively solve a sequential decision making problem without access to the central reward function, which is an average of the local rewards. In particular, we study a finite-time analysis of a distributed Q-learning algorithm, and provide a new sample complexity result of $\tilde{\mathcal{O}}\left( \min\left\{\frac{1}{\epsilon^2}\frac{t_{\text{mix}}}{(1-\gamma)^6 d_{\min}^4}, \frac{1}{\epsilon}\frac{\sqrt{|\mathcal{S}||\mathcal{A}|}}{(1-\sigma_2(\boldsymbol{W}))(1-\gamma)^4 d_{\min}^3} \right\}\right)$ under a tabular lookup setting.
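For intuition, here is a generic sketch of one distributed Q-learning step with a doubly stochastic weight matrix $\boldsymbol{W}$ (whose second-largest singular value $\sigma_2(\boldsymbol{W})$ appears in the bound): agents mix their neighbors' Q-tables, then apply local TD updates with their own rewards. This is a textbook-style consensus scheme, not necessarily the authors' exact algorithm.

```python
import numpy as np

def distributed_q_step(Q, W, transitions, alpha=0.1, gamma=0.95):
    """One sketch step of distributed Q-learning: each agent i mixes its
    neighbors' tables through W (consensus), then applies a TD update
    using only its local reward. Q has shape (N, |S|, |A|)."""
    mixed = np.tensordot(W, Q, axes=1)  # consensus: mixed[i] = sum_j W[i,j] Q[j]
    for i, (s, a, r, s_next) in enumerate(transitions):
        td = r + gamma * mixed[i, s_next].max() - mixed[i, s, a]
        mixed[i, s, a] += alpha * td
    return mixed

N, S, A = 3, 4, 2
Q = np.zeros((N, S, A))
W = np.full((N, N), 1 / N)  # fully connected, doubly stochastic averaging
transitions = [(0, 1, 1.0, 2), (3, 0, 0.5, 1), (2, 1, 0.0, 3)]  # one per agent
Q = distributed_q_step(Q, W, transitions)
print(Q[0])
```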
[256] NPO: Learning Alignment and Meta-Alignment through Structured Human Feedback
Madhava Gaikwad, Ashwini Ramchandra Doke
Main category: cs.AI
TL;DR: NPO is an alignment-aware learning framework for human-in-the-loop systems, focusing on measurable alignment loss and meta-alignment for monitoring. It ensures convergence and demonstrates practical value in large-scale deployments.
Details
Motivation: To address the gap in treating alignment as a static property by introducing a dynamic, feedback-driven approach for continual alignment monitoring in decision systems.Method: NPO formalizes alignment loss and meta-alignment, using structured feedback (likes, overrides, abstentions) in an operational loop involving scoring, tuning, validation, and feedback ingestion.
Result: Formal convergence under stochastic feedback, with empirical success in hyperscale deployments and simulation-based validation.
Conclusion: NPO provides a scalable, inspectable architecture for dynamic alignment, bridging theory and practical reliability.
Abstract: We present NPO, an alignment-aware learning framework that operationalizes feedback-driven adaptation in human-in-the-loop decision systems. Unlike prior approaches that treat alignment as a static or post-hoc property, NPO introduces a formalization of alignment loss that is measurable, supervisable, and reducible under structured feedback. In parallel, we propose meta-alignment as the fidelity of the monitoring process that governs retraining or override triggers, and show that it is formally reducible to primary alignment via threshold fidelity. Our implementation spans a scalable operational loop involving scenario scoring, threshold tuning, policy validation, and structured feedback ingestion, including “likes”, overrides, and abstentions. We provide formal convergence results under stochastic feedback and show that both alignment loss and monitoring fidelity converge additively. Empirically, NPO demonstrates measurable value in hyperscale deployment settings. A simulation-based artifact and ablation studies further illustrate the theoretical principles in action. Together, NPO offers a compact, inspectable architecture for continual alignment monitoring, helping bridge theoretical alignment guarantees with practical reliability in dynamic environments.
[257] Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis
Aran Nayebi
Main category: cs.AI
TL;DR: The paper formalizes AI alignment as a multi-objective optimization problem, proving intrinsic limits to alignment and providing algorithms for achieving it under certain conditions.
Details
Motivation: To generalize prior AI alignment approaches with fewer assumptions and rigorously explore the intrinsic limits of alignment itself.Method: Formalizes alignment as a multi-objective optimization problem, uses communication complexity for proofs, and provides algorithms for unbounded and bounded rationality.
Result: Proves an information-theoretic lower bound showing intrinsic alignment overheads and identifies scalability barriers (tasks, agents, task state space).
Conclusion: Alignment has fundamental limits; future methods must manage complexity through consensus-driven reduction or prioritization of objectives.
Abstract: We formalize AI alignment as a multi-objective optimization problem called $\langle M,N,\varepsilon,\delta\rangle$-agreement that generalizes prior approaches with fewer assumptions, in which a set of $N$ agents (including humans) must reach approximate ($\varepsilon$) agreement across $M$ candidate objectives with probability at least $1-\delta$. Using communication complexity, we prove an information-theoretic lower bound demonstrating that once either $M$ or $N$ is large enough, no interaction or rationality can avoid intrinsic alignment overheads. This barrier establishes rigorous intrinsic limits to alignment itself, not merely to specific methods, clarifying a crucial "no free lunch" principle: encoding "all human values" inevitably leads to misalignment, requiring future methods to explicitly manage complexity through consensus-driven reduction or prioritization of objectives. Complementing this impossibility result, we provide explicit algorithms achieving alignment under both computationally unbounded and bounded rationality with noisy messages. Even in these best-case scenarios where alignment to arbitrary precision is theoretically guaranteed, our analysis identifies three critical scalability barriers: the number of tasks ($M$), agents ($N$), and task state space size ($D$), thereby highlighting fundamental complexity-theoretic constraints and providing guidelines for safer, scalable human-AI collaboration.
[258] Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
Joshua Adrian Cahyono, Saran Subramanian
Main category: cs.AI
TL;DR: The paper examines risks of sycophancy and over-confidence in LLMs when providing life advice, proposing experiments to evaluate and mitigate these issues.
Details
Motivation: LLMs lack safeguards for high-stakes advice, risking misguided responses. The study aims to understand and address these failure modes.Method: Three experiments: (1) multiple-choice evaluation, (2) free-response analysis with a safety typology and LLM Judge, (3) mechanistic interpretability to steer behavior.
Result: Some models show sycophancy, but others like o4-mini remain robust. Top models prioritize clarifying questions over prescriptive advice. Activation steering can control cautiousness.
Conclusion: Nuanced benchmarks and safety alignment methods are needed to ensure LLMs can be trusted for life-changing decisions.
Abstract: Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a “high-stakes” activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model’s cautiousness can be directly controlled via activation steering, suggesting a new path for safety alignment. These findings underscore the need for nuanced, multi-faceted benchmarks to ensure LLMs can be trusted with life-changing decisions.
[259] A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature
Yufan Chen, Ching Ting Leung, Bowen Yu, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Main category: cs.AI
TL;DR: A multimodal large language model (MLLM)-based multi-agent system is developed for robust chemical information extraction, outperforming previous methods with an F1 score of 80.8%.
Details
Motivation: High-quality chemical databases are crucial for AI-driven research, but current extraction methods struggle with multimodal and variable chemical data.Method: The system uses MLLM’s reasoning to decompose tasks, coordinate specialized agents, and integrate results for accurate extraction.
Result: Achieved an F1 score of 80.8%, significantly surpassing the previous best (35.6%), with improvements in sub-tasks like image recognition and text extraction.
Conclusion: This system advances automated chemical information extraction, supporting AI-driven chemical research.
Abstract: To fully expedite AI-powered chemical research, high-quality chemical databases are the cornerstone. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. It utilizes the MLLM’s strong reasoning capability to understand the structure of diverse chemical graphics, decompose the extraction task into sub-tasks, and coordinate a set of specialized agents, each combining the capabilities of the MLLM with the precise, domain-specific strengths of dedicated tools, to solve them accurately and integrate the results into a unified output. Our system achieved an F1 score of 80.8% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score of 35.6%) by a significant margin. Additionally, it demonstrated consistent improvements in key sub-tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.
[260] Project Patti: Why can You Solve Diabolical Puzzles on one Sudoku Website but not Easy Puzzles on another Sudoku Website?
Arman Eisenkolb-Vaithyanathan
Main category: cs.AI
TL;DR: The paper proposes two new metrics to rate Sudoku difficulty, using SAT problem conversion and human-like solver simulation, and develops a universal rating system aligning well with most website labels.
Details
Motivation: To address inconsistencies in Sudoku difficulty ratings across different websites by creating objective metrics.Method: Two methods: (1) SAT problem conversion for structural complexity, (2) human-like solver simulation with backtracking. Metrics derived from these are used to analyze puzzles from five websites.
Result: Strong correlation between proposed metrics and website labels for 4/5 sites. A universal rating system classifies puzzles into Easy, Medium, Hard.
Conclusion: The universal rating system aligns well with most website labels, providing consistent difficulty mapping. An algorithm for beginners is also presented.
Abstract: In this paper we try to answer the question “What constitutes Sudoku difficulty rating across different Sudoku websites?” Using two distinct methods that can both solve every Sudoku puzzle, I propose two new metrics to characterize Sudoku difficulty. The first method is based on converting a Sudoku puzzle into its corresponding Satisfiability (SAT) problem. The first proposed metric is derived from SAT Clause Length Distribution which captures the structural complexity of a Sudoku puzzle including the number of given digits and the cells they are in. The second method simulates human Sudoku solvers by intertwining four popular Sudoku strategies within a backtracking algorithm called Nishio. The second metric is computed by counting the number of times Sudoku strategies are applied within the backtracking iterations of a randomized Nishio. Using these two metrics, I analyze more than a thousand Sudoku puzzles across five popular websites to characterize every difficulty level in each website. I evaluate the relationship between the proposed metrics and website-labeled difficulty levels using Spearman’s rank correlation coefficient, finding strong correlations for 4 out of 5 websites. I construct a universal rating system using a simple, unsupervised classifier based on the two proposed metrics. This rating system is capable of classifying both individual puzzles and entire difficulty levels from the different Sudoku websites into three categories - Universal Easy, Universal Medium, and Universal Hard - thereby enabling consistent difficulty mapping across Sudoku websites. The experimental results show that for 4 out of 5 Sudoku websites, the universal classification aligns well with website-labeled difficulty levels. Finally, I present an algorithm that can be used by early Sudoku practitioners to solve Sudoku puzzles.
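As a simplified taste of the first metric, the sketch below histograms per-cell candidate counts, which correspond to the lengths of the "at least one digit" SAT clauses (givens contribute unit clauses). The full metric in the paper covers the complete SAT encoding, so treat this as an approximation.

```python
def candidate_length_distribution(grid: list[list[int]]) -> dict[int, int]:
    """Toy proxy for the SAT clause-length metric: for each empty cell,
    count its remaining candidates (the length of its 'at least one'
    clause); given digits contribute unit clauses of length 1."""
    dist: dict[int, int] = {}
    for r in range(9):
        for c in range(9):
            if grid[r][c]:
                dist[1] = dist.get(1, 0) + 1
                continue
            used = set(grid[r]) | {grid[i][c] for i in range(9)}
            br, bc = 3 * (r // 3), 3 * (c // 3)
            used |= {grid[i][j] for i in range(br, br + 3)
                                for j in range(bc, bc + 3)}
            n = len(set(range(1, 10)) - used)
            dist[n] = dist.get(n, 0) + 1
    return dist

grid = [[0] * 9 for _ in range(9)]
grid[0][:4] = [5, 3, 0, 7]  # a few givens; 0 means empty
print(candidate_length_distribution(grid))
```

Intuitively, a puzzle whose distribution is dominated by long clauses (many candidates per cell) is structurally harder than one with many short clauses.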
[261] The Geometry of Harmfulness in LLMs through Subconcept Probing
McNair Shah, Saleena Angeline, Adhitya Rajendra Kumar, Naitik Chheda, Kevin Zhu, Vasu Sharma, Sean O’Brien, Will Cai
Main category: cs.AI
TL;DR: A framework probes and steers harmful content in LLMs using interpretable directions in activation space, showing low-rank harmfulness subspaces can be ablated or steered to reduce harm with minimal utility loss.
Details
Motivation: To understand and mitigate harmful behaviors in large language models (LLMs) by developing a scalable method for probing and controlling such content.Method: Introduced a multidimensional framework with 55 harmfulness subconcepts, learning linear probes to create interpretable directions in activation space. Tested ablation and steering in the subspace.
Result: Found that steering the dominant direction nearly eliminates harmfulness with little utility loss, demonstrating the subspace’s low-rank nature.
Conclusion: Concept subspaces offer a scalable way to audit and improve LLMs, providing practical tools for mitigating harmful behaviors.
Abstract: Recent advances in large language models (LLMs) have intensified the need to understand and reliably curb their harmful behaviours. We introduce a multidimensional framework for probing and steering harmful content in model internals. For each of 55 distinct harmfulness subconcepts (e.g., racial hate, employment scams, weapons), we learn a linear probe, yielding 55 interpretable directions in activation space. Collectively, these directions span a harmfulness subspace that we show is strikingly low-rank. We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace’s dominant direction. We find that dominant direction steering allows for near elimination of harmfulness with a low decrease in utility. Our findings advance the emerging view that concept subspaces provide a scalable lens on LLM behaviour and offer practical tools for the community to audit and harden future generations of language models.
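A minimal sketch of learning one linear probe direction on synthetic activations and ablating it, assuming scikit-learn is available; the toy clusters stand in for harmful versus benign hidden states, and the real method learns 55 such directions on actual model internals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy activations: two classes (harmful vs. benign), 64-dim hidden states.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 64)) + 0.5,   # "harmful" cluster
               rng.normal(0, 1, (100, 64)) - 0.5])  # "benign" cluster
y = np.array([1] * 100 + [0] * 100)

probe = LogisticRegression(max_iter=1000).fit(X, y)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def ablate(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of activation h along probe direction d."""
    return h - (h @ d) * d

h = X[0]
print(h @ direction, ablate(h, direction) @ direction)  # second value ~ 0
```

Steering, as opposed to ablation, would instead add or subtract a scaled multiple of the dominant direction rather than projecting it out.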
[262] Adaptive XAI in High Stakes Environments: Modeling Swift Trust with Multimodal Feedback in Human AI Teams
Nishani Fernando, Bahareh Nakisa, Adnan Ahmad, Mohammad Naim Rastgoo
Main category: cs.AI
TL;DR: Proposes an adaptive explainability trust framework (AXTF) for human-AI teaming in high-stakes scenarios, using implicit feedback to enhance swift trust.
Details
Motivation: Addresses the gap in existing XAI approaches by focusing on adaptive, non-intrusive explanations tailored to real-time user states in high-pressure environments.Method: Leverages physiological and behavioral signals (EEG, ECG, eye tracking) to infer cognitive and emotional states, guiding dynamic trust estimation and explanation adaptation.
Result: Introduces AXTF, a framework for personalized trust estimation and adaptive explanations to foster swift trust in human-AI collaboration.
Conclusion: Establishes a foundation for developing adaptive XAI systems suited for high-stakes, time-sensitive settings.
Abstract: Effective human-AI teaming heavily depends on swift trust, particularly in high-stakes scenarios such as emergency response, where timely and accurate decision-making is critical. In these time-sensitive and cognitively demanding settings, adaptive explainability is essential for fostering trust between human operators and AI systems. However, existing explainable AI (XAI) approaches typically offer uniform explanations and rely heavily on explicit feedback mechanisms, which are often impractical in such high-pressure scenarios. To address this gap, we propose a conceptual framework for adaptive XAI that operates non-intrusively by responding to users' real-time cognitive and emotional states through implicit feedback, thereby enhancing swift trust in high-stakes environments. The proposed adaptive explainability trust framework (AXTF) leverages physiological and behavioral signals, such as EEG, ECG, and eye tracking, to infer user states and support explanation adaptation. At its core is a multi-objective, personalized trust estimation model that maps workload, stress, and emotion to dynamic trust estimates. These estimates guide the modulation of explanation features, enabling responsive and personalized support that promotes swift trust in human-AI collaboration. This conceptual framework establishes a foundation for developing adaptive, non-intrusive XAI systems tailored to the rigorous demands of high-pressure, time-sensitive environments.
[263] Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems
Xu Yang, Chenhui Lin, Yue Yang, Qi Wang, Haotian Liu, Haizhou Hua, Wenchuan Wu
Main category: cs.AI
TL;DR: The paper proposes an LLM-powered automated approach for ADN dispatch, decomposing problems into stages and using multi-LLM coordination for modeling and optimization, validated by test cases.
Details
Motivation: The lack of specialized expertise among ADN operators makes human reliance costly and time-intensive, necessitating an intelligent, flexible solution.Method: Decomposes ADN dispatch into stages, designs a multi-LLM framework (Information Extractor, Problem Formulator, Code Programmer), and refines each LLM agent for accuracy.
Result: The approach enables dispatch strategies via natural language queries, validated by comprehensive comparisons and demonstrations.
Conclusion: The proposed LLM-powered method effectively addresses technical barriers and improves ADN dispatch efficiency.
Abstract: The increasing penetration of distributed energy resources into active distribution networks (ADNs) has made effective ADN dispatch imperative. However, the numerous newly-integrated ADN operators, such as distribution system aggregators, virtual power plant managers, and end prosumers, often lack specialized expertise in power system operation, modeling, optimization, and programming. This knowledge gap renders reliance on human experts both costly and time-intensive. To address this challenge and enable intelligent, flexible ADN dispatch, this paper proposes a large language model (LLM) powered automated modeling and optimization approach. First, the ADN dispatch problems are decomposed into sequential stages, and a multi-LLM coordination architecture is designed. This framework comprises an Information Extractor, a Problem Formulator, and a Code Programmer, tasked with information retrieval, optimization problem formulation, and code implementation, respectively. Afterwards, tailored refinement techniques are developed for each LLM agent, greatly improving the accuracy and reliability of generated content. The proposed approach features a user-centric interface that enables ADN operators to derive dispatch strategies via simple natural language queries, eliminating technical barriers and increasing efficiency. Comprehensive comparisons and end-to-end demonstrations on various test cases validate the effectiveness of the proposed architecture and methods.
[264] An ontological analysis of risk in Basic Formal Ontology
Federico Donato, Adrien Barton
Main category: cs.AI
TL;DR: The paper characterizes risk using the Basic Formal Ontology (BFO), arguing that Risk is a subclass of BFO:Role rather than BFO:Disposition, and provides an example to generalize the analysis.
Details
Motivation: To clarify the ontological classification of risk within the BFO framework, distinguishing it from similar categories like dispositions.Method: The paper uses the BFO framework to model risk, analyzing an example involving objects, processes, and their interrelations to generalize sufficient conditions for being a risk.
Result: Risk is classified as a subclass of BFO:Role, with sufficient conditions for its existence identified. Necessary conditions are noted for future exploration.
Conclusion: The study provides a clear ontological classification of risk as a role within BFO, with implications for future work on necessary conditions.
Abstract: The paper explores the nature of risk, providing a characterization using the categories of the Basic Formal Ontology (BFO). It argues that the category Risk is a subclass of BFO:Role, contrasting it with a similar view classifying Risk as a subclass of BFO:Disposition. This modeling choice is applied to one example of risk, which represents objects, processes (both physical and mental) and their interrelations; the analysis then generalizes from the instances in the example to obtain an overall account of risk, making explicit the sufficient conditions for being a risk. Plausible necessary conditions are also mentioned for future work. Index Terms: ontology, risk, BFO, role, disposition
[265] Ontological Foundations of State Sovereignty
John Beverley, Danielle Limbaugh
Main category: cs.AI
TL;DR: A primer on state sovereignty, its claims, and strategies for handling vague or contradictory data about sovereign states, aiming to support ontology in international affairs.
Details
Motivation: To clarify the concept of state sovereignty and address challenges in identifying sovereign states for applied ontology work in international affairs.Method: Presents a strategy for dealing with ambiguous or conflicting data on sovereignty.
Result: Reveals a method to work with sovereignty data, laying groundwork for ontology applications.
Conclusion: Sets the stage for further applied research in international affairs ontology by addressing sovereignty complexities.
Abstract: This short paper is a primer on the nature of state sovereignty and the importance of claims about it. It also aims to reveal (merely reveal) a strategy for working with vague or contradictory data about which states, in fact, are sovereign. These goals together are intended to set the stage for applied work in ontology about international affairs.
[266] Tell Me You’re Biased Without Telling Me You’re Biased – Toward Revealing Implicit Biases in Medical LLMs
Farzana Islam Adiba, Rahmatollah Beheshti
Main category: cs.AI
TL;DR: A framework combining knowledge graphs and auxiliary LLMs to detect and mitigate biases in medical LLMs, outperforming baselines in revealing complex bias patterns.
Details
Motivation: To address biased and unfair patterns in medical LLMs before clinical adoption, ensuring fair decision-making.Method: Integrates knowledge graphs with adversarial perturbation and multi-hop characterization for systematic bias detection.
Result: Demonstrates superior ability and scalability in identifying complex biases across datasets, LLMs, and bias types.
Conclusion: The framework effectively reveals and mitigates biases in medical LLMs, enhancing their reliability for clinical use.
Abstract: Large language models (LLMs) that are used in medical applications are known to show biased and unfair patterns. Prior to adopting these in clinical decision-making applications, it is crucial to identify these bias patterns to enable effective mitigation of their impact. In this study, we present a novel framework combining knowledge graphs (KGs) with auxiliary LLMs to systematically reveal complex bias patterns in medical LLMs. Specifically, the proposed approach integrates adversarial perturbation techniques to identify subtle bias patterns. The approach adopts a customized multi-hop characterization of KGs to enhance the systematic evaluation of arbitrary LLMs. Through a series of comprehensive experiments (on three datasets, six LLMs, and five bias types), we show that our proposed framework has noticeably greater ability and scalability to reveal complex biased patterns of LLMs compared to other baselines.
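One simplified ingredient of such a pipeline is generating counterfactual prompts that differ only in a single attribute; the sketch below is a bare-bones illustration and does not reflect the paper's knowledge-graph-guided, multi-hop perturbations.

```python
def demographic_perturbations(template: str, attribute: str,
                              values: list[str]) -> list[str]:
    """Generate counterfactual prompts that differ only in one attribute,
    a minimal form of adversarial perturbation for bias probing."""
    return [template.format(**{attribute: v}) for v in values]

prompts = demographic_perturbations(
    "A {race} patient reports chest pain; what is the next step?",
    "race", ["white", "Black", "Asian", "Hispanic"],
)
for p in prompts:
    print(p)  # divergent model answers across these prompts suggest bias
```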
[267] Agentic Web: Weaving the Next Web with AI Agents
Yingxuan Yang, Mulei Ma, Yuxuan Huang, Huacan Chai, Chenyu Gong, Haoran Geng, Yuanjian Zhou, Ying Wen, Meng Fang, Muhao Chen, Shangding Gu, Ming Jin, Costas Spanos, Yang Yang, Pieter Abbeel, Dawn Song, Weinan Zhang, Jun Wang
Main category: cs.AI
TL;DR: The paper introduces the Agentic Web, a new internet phase where AI agents autonomously interact to perform tasks. It proposes a framework with three dimensions—intelligence, interaction, and economics—to understand and build such systems, addressing challenges and societal impacts.
Details
Motivation: The shift from human-driven to machine-to-machine interaction aims to automate routine tasks, enhancing web interactivity and efficiency.Method: The paper presents a structured framework with three key dimensions (intelligence, interaction, economics) and analyzes architectural challenges like communication protocols and orchestration.
Result: The framework enables AI agent capabilities (retrieval, recommendation, planning, collaboration) and highlights scalability challenges and societal risks.
Conclusion: The paper outlines future research for secure, intelligent ecosystems balancing human intent and autonomous agents, with ongoing updates at a provided GitHub link.
Abstract: The emergence of AI agents powered by large language models (LLMs) marks a pivotal shift toward the Agentic Web, a new phase of the internet defined by autonomous, goal-driven interactions. In this paradigm, agents interact directly with one another to plan, coordinate, and execute complex tasks on behalf of users. This transition from human-driven to machine-to-machine interaction allows intent to be delegated, relieving users from routine digital operations and enabling a more interactive, automated web experience. In this paper, we present a structured framework for understanding and building the Agentic Web. We trace its evolution from the PC and Mobile Web eras and identify the core technological foundations that support this shift. Central to our framework is a conceptual model consisting of three key dimensions: intelligence, interaction, and economics. These dimensions collectively enable the capabilities of AI agents, such as retrieval, recommendation, planning, and collaboration. We analyze the architectural and infrastructural challenges involved in creating scalable agentic systems, including communication protocols, orchestration strategies, and emerging paradigms such as the Agent Attention Economy. We conclude by discussing the potential applications, societal risks, and governance issues posed by agentic systems, and outline research directions for developing open, secure, and intelligent ecosystems shaped by both human intent and autonomous agent behavior. A continuously updated collection of relevant studies for agentic web is available at: https://github.com/SafeRL-Lab/agentic-web.
[268] CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting
David Maria Schmidt, Raoul Schubert, Philipp Cimiano
Main category: cs.AI
TL;DR: The paper investigates the compositional interpretation abilities of large language models (LLMs) in mapping questions to SPARQL queries, revealing their limitations despite their general language capabilities.
Details
Motivation: To assess how systematic LLMs are in interpreting complex questions compositionally, given their success in simpler tasks.Method: A benchmark with three datasets of varying difficulty, generated from DBpedia graph patterns and verbalized using Lemon lexica, tested LLMs with prompts, few-shot learning, and fine-tuning.
Result: Performance (macro $F_1$) drops from 0.45 to 0.09 as complexity increases, with scores not exceeding 0.57 even for the simplest dataset.
Conclusion: LLMs struggle with systematic and compositional interpretation of questions into SPARQL queries, highlighting a key limitation.
Abstract: Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they “understand” the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ to $0.26$ and down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.
[269] LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu
Main category: cs.AI
TL;DR: LeMix is a system for co-locating LLM serving and training workloads, improving efficiency and performance over traditional separate setups.
Details
Motivation: Inefficiencies in current LLM deployment due to isolated serving and training phases, leading to GPU idleness and delayed adaptation.Method: Integrates offline profiling, execution prediction, and runtime scheduling to dynamically allocate resources.
Result: Improves throughput by 3.53x, reduces inference loss by 0.61x, and enhances response time SLO attainment by 2.12x.
Conclusion: LeMix demonstrates the benefits of joint LLM inference and training, enabling more resource-efficient deployment.
Abstract: Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.
[270] Curiosity by Design: An LLM-based Coding Assistant Asking Clarification Questions
Harsh Darji, Thibaud Lutellier
Main category: cs.AI
TL;DR: An LLM-based coding assistant improves code generation by asking clarification questions for ambiguous prompts, outperforming standard methods.
Details
Motivation: Current LLMs struggle with ambiguous developer prompts, leading to incorrect code generation. The goal is to mimic human code review by clarifying unclear queries.
Method: The system includes a query classifier for detecting unclear programming queries and a fine-tuned LLM for generating clarification questions.
Result: The fine-tuned LLM outperforms zero-shot prompting in generating useful questions, and users prefer its accuracy and helpfulness over baselines.
Conclusion: The proposed coding assistant enhances code generation accuracy by addressing prompt ambiguity through clarification questions.
Abstract: Large Language Models (LLMs) are increasingly used as coding assistants. However, the ambiguity of the developer’s prompt often leads to incorrect code generation, as current models struggle to infer user intent without extensive prompt engineering or external context. This work aims to build an LLM-based coding assistant that mimics the human code review process by asking clarification questions when faced with ambiguous or under-specified queries. Our end-to-end system includes (1) a query classifier trained to detect unclear programming-related queries and (2) a fine-tuned LLM that generates clarification questions. Our evaluation shows that the fine-tuned LLM outperforms standard zero-shot prompting in generating useful clarification questions. Furthermore, our user study indicates that users find the clarification questions generated by our model to outperform the baseline, demonstrating that our coding assistant produces more accurate and helpful code responses compared to baseline coding assistants.
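The two-stage pipeline is straightforward to sketch. In the toy version below, the classifier and the two generators are hypothetical placeholder callables standing in for the trained components the abstract describes:

```python
from typing import Callable

# Stand-ins for the trained components: a classifier that flags unclear
# programming queries, and a fine-tuned LLM that asks clarification
# questions. All three callables are hypothetical placeholders.
classify_unclear: Callable[[str], bool] = lambda q: "sort" in q and "order" not in q
ask_clarification: Callable[[str], str] = (
    lambda q: "Should the result be sorted in ascending or descending order?"
)
generate_code: Callable[[str], str] = lambda q: "def solve(xs): return sorted(xs)"

def assist(query: str) -> str:
    """Route ambiguous queries to clarification before generating code."""
    if classify_unclear(query):          # query judged under-specified
        return ask_clarification(query)  # ask instead of guessing
    return generate_code(query)

print(assist("sort this list"))                      # asks a clarification question
print(assist("sort this list in ascending order"))   # generates code directly
```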
[271] Structured Relevance Assessment for Robust Retrieval-Augmented Language Models
Aryan Raj, Astitva Veer Garg, Anitha D
Main category: cs.AI
TL;DR: The paper introduces a framework to improve Retrieval-Augmented Language Models (RALMs) by enhancing document relevance evaluation, knowledge integration, and handling unanswerable queries, reducing factual errors and improving transparency.
Details
Motivation: Addressing challenges in RALMs, such as factual errors and poor document relevance evaluation, to enhance reliability in question-answering systems.
Method: A multi-dimensional scoring system combining semantic matching and source reliability, embedding-based relevance scoring, synthetic training data, and specialized benchmarking.
Result: Preliminary evaluations show reduced hallucination rates and improved reasoning transparency.
Conclusion: The framework advances RALM reliability, though challenges like balancing latency and thoroughness remain.
Abstract: Retrieval-Augmented Language Models (RALMs) face significant challenges in reducing factual errors, particularly in document relevance evaluation and knowledge integration. We introduce a framework for structured relevance assessment that enhances RALM robustness through improved document evaluation, balanced intrinsic and external knowledge integration, and effective handling of unanswerable queries. Our approach employs a multi-dimensional scoring system that considers both semantic matching and source reliability, utilizing embedding-based relevance scoring and synthetic training data with mixed-quality documents. We implement specialized benchmarking on niche topics, a knowledge integration mechanism, and an “unknown” response protocol for queries with insufficient knowledge coverage. Preliminary evaluations demonstrate significant reductions in hallucination rates and improved transparency in reasoning processes. Our framework advances the development of more reliable question-answering systems capable of operating effectively in dynamic environments with variable data quality. While challenges persist in accurately distinguishing credible information and balancing system latency with thoroughness, this work represents a meaningful step toward enhancing RALM reliability.
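A minimal sketch of the two mechanisms named in the abstract, with invented weights and threshold: a relevance score that mixes embedding similarity with a source-reliability prior, and an "unknown" protocol that abstains when no retrieved document clears the bar:

```python
import numpy as np

def relevance(query_emb: np.ndarray, doc_emb: np.ndarray,
              source_reliability: float, w_sem: float = 0.7) -> float:
    """Combine embedding similarity with a source-reliability prior.
    The weights and the linear combination are illustrative choices,
    not the paper's exact scoring function."""
    cos = float(query_emb @ doc_emb /
                (np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)))
    return w_sem * cos + (1.0 - w_sem) * source_reliability

def answer_or_abstain(scores: list[float], threshold: float = 0.55) -> str:
    """The 'unknown' response protocol: abstain when no retrieved
    document clears the relevance threshold."""
    return "answer from top documents" if max(scores) >= threshold else "unknown"

rng = np.random.default_rng(0)
q, d = rng.normal(size=64), rng.normal(size=64)
print(answer_or_abstain([relevance(q, d, source_reliability=0.9)]))
```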
[272] Efficacy of AI RAG Tools for Complex Information Extraction and Data Annotation Tasks: A Case Study Using Banks’ Public Disclosures
Nicholas Botti, Flora Haberkorn, Charlotte Hoopes, Shaun Khan
Main category: cs.AI
TL;DR: The study evaluates an AI RAG tool’s effectiveness in aiding analysts with data annotation, showing significant speed and accuracy improvements, especially in interactive use.
Details
Motivation: To assess how AI tools can enhance efficiency and accuracy in complex real-world annotation tasks, particularly in banking disclosures.
Method: A within-subjects design with randomized task assignments tested two AI conditions (naive and interactive) against a human-only baseline on GSIB documents.
Result: AI use sped up tasks by up to 10x and improved accuracy, with interactive use outperforming naive use. Extrapolated time savings: up to 268 hours. Annotator skill with AI tools also affected performance.
Conclusion: AI RAG tools, especially when used interactively, significantly boost annotation efficiency and accuracy, with annotator AI proficiency playing a key role.
Abstract: We utilize a within-subjects design with randomized task assignments to understand the effectiveness of using an AI retrieval augmented generation (RAG) tool to assist analysts with an information extraction and data annotation task. We replicate an existing, challenging real-world annotation task with complex multi-part criteria on a set of thousands of pages of public disclosure documents from global systemically important banks (GSIBs) with heterogeneous and incomplete information content. We test two treatment conditions. First, a “naive” AI use condition in which annotators use only the tool and must accept the first answer they are given. And second, an “interactive” AI treatment condition where annotators use the tool interactively, and use their judgement to follow-up with additional information if necessary. Compared to the human-only baseline, the use of the AI tool accelerated task execution by up to a factor of 10 and enhanced task accuracy, particularly in the interactive condition. We find that when extrapolated to the full task, these methods could save up to 268 hours compared to the human-only approach. Additionally, our findings suggest that annotator skill, not just with the subject matter domain, but also with AI tools, is a factor in both the accuracy and speed of task performance.
[273] Optimizing Multi-Tier Supply Chain Ordering with LNN+XGBoost: Mitigating the Bullwhip Effect
Chunan Tong
Main category: cs.AI
TL;DR: A hybrid LNN and XGBoost model is proposed to optimize supply chain ordering strategies, addressing the bullwhip effect and improving profitability by combining dynamic feature extraction and global optimization.
Details
Motivation: Traditional methods fail to handle dynamic market conditions, and existing machine learning techniques have limitations like computational complexity. Liquid Neural Networks (LNNs) offer adaptability and low cost but are underexplored in supply chains.
Method: A hybrid model integrates LNN for dynamic feature extraction and XGBoost for global optimization to enhance ordering strategies in multi-tier supply chains.
Result: The model aims to mitigate the bullwhip effect and improve cumulative profitability by leveraging local and global synergies.
Conclusion: The hybrid approach fills a gap in supply chain methodologies, providing an efficient and dynamic solution for real-time decision-making.
Abstract: Supply chain management faces significant challenges, including demand fluctuations, inventory imbalances, and amplified upstream order variability due to the bullwhip effect. Traditional methods, such as simple moving averages, struggle to address dynamic market conditions. Emerging machine learning techniques, including LSTM, reinforcement learning, and XGBoost, offer potential solutions but are limited by computational complexity, training inefficiencies, or constraints in time-series modeling. Liquid Neural Networks, inspired by dynamic biological systems, present a promising alternative due to their adaptability, low computational cost, and robustness to noise, making them suitable for real-time decision-making and edge computing. Despite their success in applications like autonomous vehicles and medical monitoring, their potential in supply chain optimization remains underexplored. This study introduces a hybrid LNN and XGBoost model to optimize ordering strategies in multi-tier supply chains. By leveraging LNN’s dynamic feature extraction and XGBoost’s global optimization capabilities, the model aims to mitigate the bullwhip effect and enhance cumulative profitability. The research investigates how local and global synergies within the hybrid framework address the dual demands of adaptability and efficiency in SCM. The proposed approach fills a critical gap in existing methodologies, offering an innovative solution for dynamic and efficient supply chain management.
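The hybrid pipeline can be sketched end to end. Below, rolling statistics serve as a crude placeholder for the LNN's dynamic feature extraction, and XGBoost maps the features to next-step order quantities; the synthetic demand series and all parameters are assumptions for illustration:

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic demand for one supply-chain tier: trend-free seasonality plus noise.
rng = np.random.default_rng(0)
demand = 100 + 10 * np.sin(np.arange(300) / 10) + rng.normal(0, 3, 300)

def dynamic_features(series: np.ndarray, w: int = 8) -> np.ndarray:
    """Placeholder for LNN feature extraction: rolling mean, std, and trend."""
    feats = [(series[i - w:i].mean(), series[i - w:i].std(),
              series[i - 1] - series[i - w]) for i in range(w, len(series))]
    return np.array(feats)

X = dynamic_features(demand)
y = demand[8:]                       # next-step demand as the ordering target
model = XGBRegressor(n_estimators=200, max_depth=3).fit(X[:250], y[:250])
mae = np.abs(model.predict(X[250:]) - y[250:]).mean()
print(f"test MAE: {mae:.2f}")        # smoother order forecasts damp the bullwhip effect
```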
[274] Teaching Language Models To Gather Information Proactively
Tenghao Huang, Sihao Chen, Muhao Chen, Jonathan May, Longqi Yang, Mengting Wan, Pei Zhou
Main category: cs.AI
TL;DR: The paper introduces proactive information gathering for LLMs to improve collaboration by identifying context gaps and eliciting implicit user knowledge through targeted questions.
Details
Motivation: Current LLMs often fail to proactively gather missing information, limiting their effectiveness in collaborative problem-solving.
Method: A scalable framework generates partially specified tasks, and reinforcement finetuning rewards questions that uncover implicit user knowledge.
Result: The Qwen-2.5-7B model outperforms o3-mini by 18% in automatic metrics and is preferred by humans for clarification questions (42%) and final outlines (28%).
Conclusion: Proactive clarification enhances LLMs from passive generators to collaborative partners, improving solution quality.
Abstract: Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts, falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information – such as hidden domain expertise or fine-grained requirements – that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.
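The reward idea can be caricatured with sets: a question earns credit only for hidden user facts it newly elicits. This set-based toy is a loose simplification of the paper's reinforcement-finetuning signal, not its actual reward function:

```python
def question_reward(elicited: set[str], hidden_profile: set[str],
                    already_revealed: set[str]) -> float:
    """Toy reward in the spirit of the paper's RL objective: a question is
    rewarded only for genuinely new, implicit user information it elicits."""
    new_facts = (elicited & hidden_profile) - already_revealed
    return float(len(new_facts))

hidden = {"prefers FastAPI", "deploys on AWS", "needs Python 3.9 support"}
print(question_reward({"prefers FastAPI"}, hidden, set()))                # 1.0
print(question_reward({"prefers FastAPI"}, hidden, {"prefers FastAPI"}))  # 0.0 (not new)
```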
[275] Shapley Uncertainty in Natural Language Generation
Meilin Zhu, Gaojie Jin, Xiaowei Huang, Lijun Zhang
Main category: cs.AI
TL;DR: The paper introduces a Shapley-based uncertainty metric for LLMs, outperforming semantic entropy in predicting model performance.
Details
Motivation: To address the limitations of threshold-based semantic entropy in measuring uncertainty for LLM outputs.
Method: Develops a Shapley-based uncertainty metric capturing continuous semantic relationships and validates it against three fundamental properties.
Result: Shapley uncertainty more accurately predicts LLM performance in QA tasks compared to baseline measures.
Conclusion: The proposed Shapley uncertainty metric is a robust and nuanced alternative for assessing LLM output reliability.
Abstract: In question-answering tasks, determining when to trust the outputs is crucial to the alignment of large language models (LLMs). Kuhn et al. (2023) introduce semantic entropy as a measure of uncertainty that incorporates linguistic invariances among outputs sharing the same meaning. It primarily relies on setting a threshold to measure the level of semantic equivalence. We propose a more nuanced framework that extends beyond such thresholding by developing a Shapley-based uncertainty metric that captures the continuous nature of semantic relationships. We establish three fundamental properties that characterize valid uncertainty metrics and prove that our Shapley uncertainty satisfies these criteria. Through extensive experiments, we demonstrate that our Shapley uncertainty more accurately predicts LLM performance in question-answering and other datasets, compared to similar baseline measures.
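For reference, the Shapley value of answer $i$ under a coalition value function $v$ over $n$ sampled answers is $\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\big(v(S \cup \{i\}) - v(S)\big)$. The sketch below computes it exactly for a handful of answers; the toy value function (coalitions of mutually consistent answers score high) is an assumption, not the paper's construction:

```python
from itertools import combinations
from math import factorial

def shapley(players: list[str], v) -> dict[str, float]:
    """Exact Shapley values over a small set of sampled answers, where
    v(S) scores the semantic agreement of a coalition of answers."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += w * (v(set(S) | {p}) - v(set(S)))
    return phi

# Toy value function: nonempty coalitions of mutually consistent answers score 1.
def v(S: set[str]) -> float:
    return 1.0 if S and all(a.startswith("Paris") for a in S) else 0.0

answers = ["Paris", "Paris, France", "Lyon"]
print(shapley(answers, v))  # "Lyon" disagrees, so its value signals uncertainty
```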
[276] Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects
Yixin Liu, Guibin Zhang, Kun Wang, Shiyuan Li, Shirui Pan
Main category: cs.AI
TL;DR: The paper reviews Graph-augmented LLM Agents (GLA), highlighting their role in enhancing LLM agent capabilities like planning, memory, and tool usage, and discusses future directions for scalable and multimodal GLA systems.
Details
Motivation: LLMs lack key agentic procedures like reliable planning and multi-agent coordination. Graphs can enhance these capabilities, but research is fragmented, necessitating a comprehensive overview.
Method: Categorizes GLA methods by their functions (planning, memory, tool usage) and analyzes graph contributions. Discusses GLA’s role in multi-agent systems (orchestration, efficiency, trustworthiness).
Result: Identifies gaps and opportunities in GLA research, emphasizing structural adaptability and unified systems.
Conclusion: The paper serves as a roadmap for future GLA research, advocating for deeper exploration of graphs in LLM agent systems.
Abstract: Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in a wide range of applications, including web navigation, software development, and embodied control. While most LLMs are limited in several key agentic procedures, such as reliable planning, long-term memory, tool management, and multi-agent coordination, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this paper offers a timely and comprehensive overview of recent advances and also highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multi-agent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems. We hope this paper can serve as a roadmap for future research on GLA and foster a deeper understanding of the role of graphs in LLM agent systems.
[277] GovRelBench: A Benchmark for Government Domain Relevance
Haiquan Wang, Yi Chen, Shang Zeng, Yun Bian, Zhe Cui
Main category: cs.AI
TL;DR: The paper introduces GovRelBench, a benchmark for evaluating LLMs’ core capabilities in the government domain, addressing gaps in current evaluations.
Details
Motivation: Current evaluations of LLMs in government focus on safety, neglecting core capabilities like domain relevance.
Method: Proposes GovRelBench with domain prompts and GovRelBERT, using SoftGovScore to train ModernBERT for relevance scoring.
Result: GovRelBERT accurately computes government domain relevance scores, enhancing evaluation frameworks.
Conclusion: GovRelBench provides an effective tool for evaluating LLMs in the government domain, with code and dataset publicly available.
Abstract: Current evaluations of LLMs in the government domain primarily focus on safety considerations in specific scenarios, while the assessment of the models’ own core capabilities, particularly domain relevance, remains insufficient. To address this gap, we propose GovRelBench, a benchmark specifically designed for evaluating the core capabilities of LLMs in the government domain. GovRelBench consists of government domain prompts and a dedicated evaluation tool, GovRelBERT. During the training process of GovRelBERT, we introduce the SoftGovScore method: this method trains a model based on the ModernBERT architecture by converting hard labels to soft scores, enabling it to accurately compute the text’s government domain relevance score. This work aims to enhance the capability evaluation framework for large models in the government domain, providing an effective tool for relevant research and practice. Our code and dataset are available at https://github.com/pan-xi/GovRelBench.
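A minimal sketch of regression-style relevance training in the spirit of SoftGovScore: hard labels are softened into $[0,1]$ scores and a ModernBERT sequence classifier is fine-tuned with a regression head. The checkpoint name, example texts, and softened scores are assumptions, not the paper's data:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; the paper only states the ModernBERT architecture.
name = "answerdotai/ModernBERT-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression")

texts = ["Notice on municipal budget approval", "Celebrity gossip roundup"]
soft_scores = torch.tensor([[0.95], [0.05]])  # hard labels softened to [0, 1]

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=soft_scores)      # MSE loss under the regression head
out.loss.backward()                           # one illustrative training step
print(float(out.loss))
```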
[278] Evo-DKD: Dual-Knowledge Decoding for Autonomous Ontology Evolution in Large Language Models
Vishal Raman, Vijai Aravindh R
Main category: cs.AI
TL;DR: Evo-DKD is a dual-decoder framework for autonomous ontology evolution, combining structured and unstructured knowledge via coordinated decoding, outperforming baselines in precision and task performance.
Details
Motivation: Manual curation of ontologies is labor-intensive, and LLMs struggle with structured consistency, necessitating a hybrid approach for sustainable ontology evolution.
Method: Evo-DKD uses a dual-decoder framework with structured ontology traversal and unstructured text reasoning, coordinated by a dynamic attention-based gating mechanism.
Result: Evo-DKD outperforms structured-only or unstructured-only baselines in ontology update precision and downstream task performance.
Conclusion: Evo-DKD combines symbolic and neural reasoning for sustainable ontology evolution, offering a new paradigm for LLM-driven knowledge base maintenance.
Abstract: Ontologies and knowledge graphs require continuous evolution to remain comprehensive and accurate, but manual curation is labor intensive. Large Language Models (LLMs) possess vast unstructured knowledge but struggle with maintaining structured consistency. We propose Evo-DKD, a novel dual-decoder framework for autonomous ontology evolution that combines structured ontology traversal with unstructured text reasoning. Evo-DKD introduces two parallel decoding streams within an LLM: one decoder generates candidate ontology edits (e.g., new concepts or relations) while the other produces natural-language justifications. A dynamic attention-based gating mechanism coordinates the two streams, deciding at each step how to blend structured and unstructured knowledge. Due to GPU constraints, we simulate the dual-decoder behavior using prompt-based mode control to approximate coordinated decoding in a single-stream mode. The system operates in a closed reasoning loop: proposed ontology edits are validated (via consistency checks and cross-verification with the text explanations) and then injected into the knowledge base, which in turn informs subsequent reasoning. We demonstrate Evo-DKD’s effectiveness on use cases including healthcare ontology refinement, semantic search improvement, and cultural heritage timeline modeling. Experiments show that Evo-DKD outperforms baselines using structured-only or unstructured-only decoding in both precision of ontology updates and downstream task performance. We present quantitative metrics and qualitative examples, confirming the contributions of the dual-decoder design and gating router. Evo-DKD offers a new paradigm for LLM-driven knowledge base maintenance, combining the strengths of symbolic and neural reasoning for sustainable ontology evolution.
[279] Validating Pharmacogenomics Generative Artificial Intelligence Query Prompts Using Retrieval-Augmented Generation (RAG)
Ashley Rector, Keaton Minor, Kamden Minor, Jeff McCormack, Beth Breeden, Ryan Nowers, Jay Dorris
Main category: cs.AI
TL;DR: Sherpa Rx, an AI tool using RAG and large language models, was evaluated for pharmacogenomics. It integrated CPIC and PharmGKB data, showing high performance in accuracy, relevance, clarity, and completeness, outperforming ChatGPT-4omini.
Details
Motivation: To validate the performance of Sherpa Rx in generating accurate and contextually relevant pharmacogenomics responses by integrating CPIC and PharmGKB data.
Method: Used a dataset of 260 queries across 26 CPIC guidelines, comparing Sherpa Rx’s performance in two phases (with CPIC only and with CPIC+PharmGKB) and against ChatGPT-4omini. Metrics included accuracy, relevance, clarity, completeness, and recall.
Result: Sherpa Rx achieved high scores (e.g., accuracy 4.9, recall 0.99) and outperformed ChatGPT-4omini in accuracy and completeness. Phase 2 (with PharmGKB) showed slight improvements over Phase 1, though these were not statistically significant.
Conclusion: Integrating CPIC and PharmGKB with RAG enhances AI performance in pharmacogenomics, demonstrating Sherpa Rx’s potential for accurate, personalized decision-making.
Abstract: This study evaluated Sherpa Rx, an artificial intelligence tool leveraging large language models and retrieval-augmented generation (RAG) for pharmacogenomics, to validate its performance on key response metrics. Sherpa Rx integrated Clinical Pharmacogenetics Implementation Consortium (CPIC) guidelines with Pharmacogenomics Knowledgebase (PharmGKB) data to generate contextually relevant responses. A dataset (N=260 queries) spanning 26 CPIC guidelines was used to evaluate drug-gene interactions, dosing recommendations, and therapeutic implications. In Phase 1, only CPIC data was embedded. Phase 2 additionally incorporated PharmGKB content. Responses were scored on accuracy, relevance, clarity, completeness (5-point Likert scale), and recall. Wilcoxon signed-rank tests compared accuracy between Phase 1 and Phase 2, and between Phase 2 and ChatGPT-4omini. A 20-question quiz assessed the tool’s real-world applicability against other models. In Phase 1 (N=260), Sherpa Rx demonstrated high performance: accuracy 4.9, relevance 5.0, clarity 5.0, completeness 4.8, and recall 0.99. The subset analysis (N=20) showed improvements in accuracy (4.6 vs. 4.4, Phase 2 vs. Phase 1 subset) and completeness (5.0 vs. 4.8). ChatGPT-4omini performed comparably in relevance (5.0) and clarity (4.9) but lagged in accuracy (3.9) and completeness (4.2). Differences in accuracy between Phase 1 and Phase 2 were not statistically significant. However, Phase 2 significantly outperformed ChatGPT-4omini. On the 20-question quiz, Sherpa Rx achieved 90% accuracy, outperforming other models. Integrating additional resources like CPIC and PharmGKB with RAG enhances AI accuracy and performance. This study highlights the transformative potential of generative AI like Sherpa Rx in pharmacogenomics, improving decision-making with accurate, personalized responses.
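The paired comparisons in the study use the Wilcoxon signed-rank test, which is one line in SciPy; the per-query scores below are made up for illustration, not the study's data:

```python
from scipy.stats import wilcoxon

# Paired accuracy scores per query, as in the Phase 1 vs. Phase 2 comparison.
phase1 = [4.4, 4.6, 4.5, 4.7, 4.3, 4.6, 4.5, 4.8]
phase2 = [4.6, 4.7, 4.5, 4.8, 4.5, 4.6, 4.7, 4.9]

stat, p = wilcoxon(phase1, phase2)
print(f"W={stat:.1f}, p={p:.3f}")  # fail to reject H0 at alpha=0.05 if p > 0.05
```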
[280] An LLM Driven Agent Framework for Automated Infrared Spectral Multi Task Reasoning
Zujie Xie, Zixuan Chen, Jiheng Liang, Xiangyang Yu, Ziru Yu
Main category: cs.AI
TL;DR: An LLM-driven framework for automated IR spectral analysis outperforms traditional methods under low-data conditions.
Details
Motivation: Infrared spectroscopy's high-dimensional, overlapping bands challenge conventional methods, while LLMs' generalization potential remains unexplored for IR analysis.
Method: An end-to-end LLM framework integrates literature knowledge, preprocessing, feature extraction, and multi-task reasoning, with iterative refinement via mispredicted samples.
Result: The framework outperforms single-turn inference and rivals/exceeds ML/DL models in low-data scenarios across diverse materials.
Conclusion: LLMs show promise for accurate, automated IR spectral interpretation, especially in low-data conditions.
Abstract: Infrared spectroscopy offers rapid, non-destructive measurement of chemical and material properties but suffers from high-dimensional, overlapping spectral bands that challenge conventional chemometric approaches. Emerging large language models (LLMs), with their capacity for generalization and reasoning, offer promising potential for automating complex scientific workflows. Despite this promise, their application in IR spectral analysis remains largely unexplored. This study addresses the critical challenge of achieving accurate, automated infrared spectral interpretation under low-data conditions using an LLM-driven framework. We introduce an end-to-end, large language model-driven agent framework that integrates a structured literature knowledge base, automated spectral preprocessing, feature extraction, and multi-task reasoning in a unified pipeline. By querying a curated corpus of peer-reviewed IR publications, the agent selects scientifically validated routines. The selected methods transform each spectrum into low-dimensional feature sets, which are fed into few-shot prompt templates for classification, regression, and anomaly detection. A closed-loop, multi-turn protocol iteratively appends mispredicted samples to the prompt, enabling dynamic refinement of predictions. Across diverse materials: stamp pad ink, Chinese medicine, Pu’er tea, Citri Reticulatae Pericarpium and wastewater COD datasets, the multi-turn LLM consistently outperforms single-turn inference, rivaling or exceeding machine learning and deep learning models under low-data regimes.
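The closed-loop, multi-turn protocol can be sketched directly from the abstract: predict, collect mispredictions, append them to the prompt, and retry. The prompt format and the stub LLM below are placeholders, not the paper's agent:

```python
from typing import Callable

def multi_turn_classify(llm: Callable[[str], dict], samples: dict[str, str],
                        labels: dict[str, str], max_turns: int = 3) -> dict:
    """Closed-loop refinement: mispredicted samples are appended to the
    prompt so the next turn can correct them."""
    prompt = "Classify each spectrum's material from its features.\n"
    preds: dict = {}
    for _ in range(max_turns):
        preds = llm(prompt + "\n".join(f"{k}: {v}" for k, v in samples.items()))
        wrong = {k for k in preds if preds[k] != labels[k]}
        if not wrong:
            break
        # Feed the errors back as explicit corrections for the next turn.
        prompt += "".join(f"\nPreviously misclassified: {k} is {labels[k]}."
                          for k in wrong)
    return preds

# Stub "LLM" that only answers correctly once a correction appears in the prompt.
def stub_llm(prompt: str) -> dict:
    return {"s1": "tea", "s2": "ink" if "s2 is ink" in prompt else "tea"}

print(multi_turn_classify(stub_llm, {"s1": "peaks@1650", "s2": "peaks@1730"},
                          {"s1": "tea", "s2": "ink"}))
```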
[281] Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess
Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson
Main category: cs.AI
TL;DR: Maia4All is a framework for efficiently modeling individual human decision-making in chess with limited data, using a two-stage optimization process.
Details
Motivation: To address the challenge of modeling human behavior in AI systems with limited individual data, particularly in chess.
Method: Two-stage optimization: (1) enrichment step bridging population and individual behavior, (2) democratization step refining individual embeddings with minimal data.
Result: Maia4All accurately predicts individual moves and profiles behavior with high fidelity, requiring only 20 games vs. 5,000 previously.
Conclusion: Maia4All sets a new standard for personalized AI behavior modeling, with potential applications beyond chess.
Abstract: As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important. Chess, a long-standing AI benchmark with precise skill measurement, offers an ideal testbed for human-AI alignment. However, existing approaches to modeling human behavior require prohibitively large amounts of data from each individual, making them impractical for new or sparsely represented users. In this work, we introduce Maia4All, a framework designed to learn and adapt to individual decision-making styles efficiently, even with limited data. Maia4All achieves this through a two-stage optimization process: (1) an enrichment step, which bridges population and individual-level human behavior modeling with a prototype-enriched model, and (2) a democratization step, which leverages ability levels or user prototypes to initialize and refine individual embeddings with minimal data. Our experimental results show that Maia4All can accurately predict individual moves and profile behavioral patterns with high fidelity, establishing a new standard for personalized human-like AI behavior modeling in chess. Maia4All achieves individual human behavior modeling in chess with only 20 games, compared to the 5,000 games required previously, representing a significant improvement in data efficiency. Our work provides an example of how population AI systems can flexibly adapt to individual users using a prototype-enriched model as a bridge. This approach extends beyond chess, as shown in our case study on idiosyncratic LLMs, highlighting its potential for broader applications in personalized AI adaptation.
[282] Large Language Models for Supply Chain Decisions
David Simchi-Levi, Konstantina Mellou, Ishai Menache, Jeevan Pathuri
Main category: cs.AI
TL;DR: LLMs democratize supply chain tech by automating understanding and interaction with tools, reducing decision time from days to minutes.
Details
Motivation: Business planners face delays in understanding, scenario analysis, and updating models due to reliance on data science teams.
Method: Apply Large Language Models (LLMs) to automate explanations, scenario analysis, and model updates in supply chain tools.
Result: Decision time reduced from days/weeks to minutes/hours, boosting productivity.
Conclusion: LLMs enable faster, more efficient supply chain decision-making without human intervention.
Abstract: Supply Chain Management requires addressing a variety of complex decision-making challenges, from sourcing strategies to planning and execution. Over the last few decades, advances in computation and information technologies have enabled the transition from manual, intuition and experience-based decision-making, into more automated and data-driven decisions using a variety of tools that apply optimization techniques. These techniques use mathematical methods to improve decision-making. Unfortunately, business planners and executives still need to spend considerable time and effort to (i) understand and explain the recommendations coming out of these technologies; (ii) analyze various scenarios and answer what-if questions; and (iii) update the mathematical models used in these tools to reflect current business environments. Addressing these challenges requires involving data science teams and/or the technology providers to explain results or make the necessary changes in the technology and hence significantly slows down decision making. Motivated by the recent advances in Large Language Models (LLMs), we report how this disruptive technology can democratize supply chain technology - namely, facilitate the understanding of tools’ outcomes, as well as the interaction with supply chain tools without human-in-the-loop. Specifically, we report how we apply LLMs to address the three challenges described above, thus substantially reducing the time to decision from days and weeks to minutes and hours as well as dramatically increasing planners’ and executives’ productivity and impact.
[283] MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Yanxu Zhu, Shitong Duan, Xiangxu Zhang, Jitao Sang, Peng Zhang, Tun Lu, Xiao Zhou, Jing Yao, Xiaoyuan Yi, Xing Xie
Main category: cs.AI
TL;DR: The paper assesses honesty in Multimodal Large Language Models (MLLMs) when answering visually unanswerable questions, introduces MoHoBench, a benchmark for evaluation, and proposes alignment methods to improve honesty.
Details
Motivation: Despite advancements in MLLMs, their honesty in handling visually unanswerable questions is underexplored, posing risks of harmful content.
Method: The study defines four types of unanswerable visual questions, constructs MoHoBench (12k+ samples), benchmarks 28 MLLMs, and implements alignment methods via supervised and preference learning.
Result: Most models fail to refuse unanswerable questions; honesty is influenced by visual data, requiring dedicated alignment methods.
Conclusion: The work provides a foundation for improving MLLM honesty, with data and code available for future research.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs’ capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models’ response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark, consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs’ honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/DSTTSD/MoHoBench.
[284] What Does it Mean for a Neural Network to Learn a “World Model”?
Kenneth Li, Fernanda Viégas, Martin Wattenberg
Main category: cs.AI
TL;DR: The paper proposes criteria to define when a neural net learns a ‘world model,’ focusing on latent state space representation and avoiding trivial interpretations.
Details
Motivation: To provide operational meaning to informal terms and establish a common language for experimental investigation.
Method: Uses ideas from linear probing literature to formalize computations factoring through a representation of the data generation process, with added conditions to ensure non-triviality.
Result: A precise definition of a ‘world model’ in neural nets, emphasizing latent state space representation.
Conclusion: The framework enables clearer experimental investigation of world models in neural networks, with future work to include action effects.
Abstract: We propose a set of precise criteria for saying a neural net learns and uses a “world model.” The goal is to give an operational meaning to terms that are often used informally, in order to provide a common language for experimental investigation. We focus specifically on the idea of representing a latent “state space” of the world, leaving modeling the effect of actions to future work. Our definition is based on ideas from the linear probing literature, and formalizes the notion of a computation that factors through a representation of the data generation process. An essential addition to the definition is a set of conditions to check that such a “world model” is not a trivial consequence of the neural net’s data or task.
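A toy version of the probing criterion, on synthetic data: if the latent world state is linearly decodable from the network's hidden activations, a simple probe recovers it. The planted signal here is of course artificial:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the linear-probing criterion: a "world model" claim requires,
# at minimum, that the latent world state be linearly decodable from the
# network's hidden activations.
rng = np.random.default_rng(0)
world_state = rng.integers(0, 2, size=500)     # latent binary state of the world
hidden = rng.normal(size=(500, 64))            # pretend network activations
hidden[:, 0] += 2.0 * world_state              # plant a decodable direction

probe = LogisticRegression(max_iter=1000).fit(hidden[:400], world_state[:400])
print("probe accuracy:", probe.score(hidden[400:], world_state[400:]))
# The paper's definition adds conditions ruling out trivial decodability,
# e.g., the state being a simple function of the raw input.
```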
[285] ST-GDance: Long-Term and Collision-Free Group Choreography from Music
Jing Xu, Weiqiang Wang, Cunjian Chen, Jun Liu, Qiuhong Ke
Main category: cs.AI
TL;DR: ST-GDance is a framework for generating synchronized group dances from music, addressing scalability and collision issues by decoupling spatial and temporal dependencies.
Details
Motivation: Group dance generation is challenging due to the need for synchronization and spatial coordination, especially with more dancers and longer sequences, leading to computational complexity and collision risks.
Method: ST-GDance uses lightweight graph convolutions for spatial modeling and sparse attention for temporal modeling, optimizing for long-term, collision-free choreography.
Result: The framework outperforms state-of-the-art methods on the AIOZ-GDance dataset, especially in generating long, coherent sequences.
Conclusion: ST-GDance effectively addresses scalability and collision issues in group dance generation, offering a practical solution for applications in entertainment and animation.
Abstract: Group dance generation from music has broad applications in film, gaming, and animation production. However, it requires synchronizing multiple dancers while maintaining spatial coordination. As the number of dancers and sequence length increase, this task faces higher computational complexity and a greater risk of motion collisions. Existing methods often struggle to model dense spatial-temporal interactions, leading to scalability issues and multi-dancer collisions. To address these challenges, we propose ST-GDance, a novel framework that decouples spatial and temporal dependencies to optimize long-term and collision-free group choreography. We employ lightweight graph convolutions for distance-aware spatial modeling and accelerated sparse attention for efficient temporal modeling. This design significantly reduces computational costs while ensuring smooth and collision-free interactions. Experiments on the AIOZ-GDance dataset demonstrate that ST-GDance outperforms state-of-the-art baselines, particularly in generating long and coherent group dance sequences. Project page: https://yilliajing.github.io/ST-GDance-Website/.
[286] Large Language Models for Wireless Communications: From Adaptation to Autonomy
Le Liang, Hao Ye, Yucheng Sheng, Ouya Wang, Jiacheng Wang, Shi Jin, Geoffrey Ye Li
Main category: cs.AI
TL;DR: The paper explores how large language models (LLMs) can transform wireless communications, focusing on adaptation, efficiency, and autonomous capabilities, while highlighting benefits and future challenges.
Details
Motivation: The increasing complexity of wireless systems demands intelligent solutions, and LLMs offer unprecedented capabilities in reasoning and adaptation.
Method: The study examines three approaches: adapting pretrained LLMs for communication tasks, developing wireless-specific foundation models, and enabling autonomous LLM agents.
Result: LLM-based approaches show unique benefits over traditional methods, supported by recent advances and case studies.
Conclusion: Open challenges include multimodal fusion and self-improving capabilities, paving the way for intelligent and adaptive wireless networks.
Abstract: The emergence of large language models (LLMs) has revolutionized artificial intelligence, offering unprecedented capabilities in reasoning, generalization, and zero-shot learning. These strengths open new frontiers in wireless communications, where increasing complexity and dynamics demand intelligent and adaptive solutions. This article explores the role of LLMs in transforming wireless systems across three key directions: adapting pretrained LLMs for core communication tasks, developing wireless-specific foundation models to balance versatility and efficiency, and enabling agentic LLMs with autonomous reasoning and coordination capabilities. We highlight recent advances, practical case studies, and the unique benefits of LLM-based approaches over traditional methods. Finally, we outline open challenges and research opportunities, including multimodal fusion, collaboration with lightweight models, and self-improving capabilities, charting a path toward intelligent, adaptive, and autonomous wireless networks of the future.
[287] Finding Uncommon Ground: A Human-Centered Model for Extrospective Explanations
Laura Spillner, Nima Zargham, Mihai Pomarlan, Robert Porzel, Rainer Malaka
Main category: cs.AI
TL;DR: The paper proposes a personalized AI explanation approach tailored to user preferences and context, using a dynamic memory model to estimate relevant information for the user.
Details
Motivation: Current AI explanations focus on model internals, which are often unsuitable for non-experts. A human-centered approach is needed to improve transparency and usability.
Method: The paper introduces a personalized explanation model where the AI agent uses a dynamic memory of past interactions to tailor explanations to the user’s preferences and context.
Result: The proposed model enables AI agents to provide more relevant and user-friendly explanations by leveraging personalized and contextual knowledge.
Conclusion: Personalized explanations, based on user context and preferences, enhance the transparency and usability of AI systems for non-experts.
Abstract: The need for explanations in AI has, by and large, been driven by the desire to increase the transparency of black-box machine learning models. However, such explanations, which focus on the internal mechanisms that lead to a specific output, are often unsuitable for non-experts. To facilitate a human-centered perspective on AI explanations, agents need to focus on individuals and their preferences as well as the context in which the explanations are given. This paper proposes a personalized approach to explanation, where the agent tailors the information provided to the user based on what is most likely pertinent to them. We propose a model of the agent’s worldview that also serves as a personal and dynamic memory of its previous interactions with the same user, based on which the artificial agent can estimate what part of its knowledge is most likely new information to the user.
[288] SafeDriveRAG: Towards Safe Autonomous Driving with Knowledge Graph-based Retrieval-Augmented Generation
Hao Ye, Mengshi Qi, Zhaohong Liu, Liang Liu, Huadong Ma
Main category: cs.AI
TL;DR: The paper introduces SafeDrive228K, a large-scale multimodal benchmark for evaluating vision-language models (VLMs) in traffic safety-critical scenarios, and proposes SafeDriveRAG, a knowledge graph-based retrieval-augmented generation method to enhance VLM performance.
Details
Motivation: Existing research lacks evaluation of VLMs in safety-critical driving scenarios, prompting the creation of a benchmark and method to address this gap.
Method: Developed SafeDrive228K (228K examples across 18 sub-tasks) and SafeDriveRAG, a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach with multi-scale subgraph retrieval.
Result: SafeDriveRAG improves VLM performance by +4.73% in Traffic Accidents, +8.79% in Corner Cases, and +14.57% in Traffic Safety Commonsense tasks.
Conclusion: The benchmark and methodology advance traffic safety research, demonstrating the effectiveness of integrating retrieval-augmented generation in VLMs.
Abstract: In this work, we study how vision-language models (VLMs) can be utilized to enhance the safety of autonomous driving systems, including perception, situational understanding, and path planning. However, existing research has largely overlooked the evaluation of these models in traffic safety-critical driving scenarios. To bridge this gap, we create the benchmark (SafeDrive228K) and propose a new baseline based on VLM with knowledge graph-based retrieval-augmented generation (SafeDriveRAG) for visual question answering (VQA). Specifically, we introduce SafeDrive228K, the first large-scale multimodal question-answering benchmark comprising 228K examples across 18 sub-tasks. This benchmark encompasses a diverse range of traffic safety queries, from traffic accidents and corner cases to common safety knowledge, enabling a thorough assessment of the comprehension and reasoning abilities of the models. Furthermore, we propose a plug-and-play multimodal knowledge graph-based retrieval-augmented generation approach that employs a novel multi-scale subgraph retrieval algorithm for efficient information retrieval. By incorporating traffic safety guidelines collected from the Internet, this framework further enhances the model’s capacity to handle safety-critical situations. Finally, we conduct comprehensive evaluations on five mainstream VLMs to assess their reliability in safety-sensitive driving tasks. Experimental results demonstrate that integrating RAG significantly improves performance, achieving a +4.73% gain in Traffic Accidents tasks, +8.79% in Corner Cases tasks and +14.57% in Traffic Safety Commonsense across five mainstream VLMs, underscoring the potential of our proposed benchmark and methodology for advancing research in traffic safety. Our source code and data are available at https://github.com/Lumos0507/SafeDriveRAG.
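The multi-scale retrieval idea can be approximated with plain k-hop expansion: grow the neighborhood around a matched entity until enough facts are gathered. The toy graph and the stopping rule below are illustrative, not the paper's algorithm:

```python
import networkx as nx

# Toy traffic-safety knowledge graph; edges stand in for retrieved facts.
G = nx.Graph()
G.add_edges_from([
    ("wet road", "hydroplaning"), ("hydroplaning", "reduce speed"),
    ("reduce speed", "safe following distance"), ("wet road", "low visibility"),
])

def retrieve(entity: str, min_facts: int = 3, max_hops: int = 3):
    """Expand the subgraph radius until enough context is gathered."""
    for k in range(1, max_hops + 1):
        sub = nx.ego_graph(G, entity, radius=k)   # k-hop subgraph around entity
        if sub.number_of_edges() >= min_facts:
            return k, list(sub.edges)
    return max_hops, list(sub.edges)

hops, facts = retrieve("wet road")
print(f"expanded to {hops} hops:", facts)
```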
[289] Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning
Jiong Yin, Liang Li, Jiehua Zhang, Yuhan Gao, Chenggang Yan, Xichun Sheng
Main category: cs.AI
TL;DR: The paper introduces PHP, a three-stage method for audio-visual multi-task incremental learning, balancing knowledge retention and new task learning.
Details
Motivation: To address the challenge of preserving old task knowledge while learning new tasks in audio-visual multi-task incremental learning.
Method: PHP uses three phases: task-shared modality aggregating adapter (shallow), task-specific modality-shared dynamic generating adapter (middle), and task-specific modality-independent prompts (deep).
Result: PHP achieves state-of-the-art performance across four tasks (AVE, AVVP, AVS, AVQA).
Conclusion: PHP effectively balances knowledge sharing and specificity, enhancing multi-task learning.
Abstract: Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge is how to preserve old task knowledge while facilitating the learning of new tasks with previous experiences. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, balancing the model’s ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine understanding by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of four tasks (AVE, AVVP, AVS and AVQA). Our code is available at https://github.com/ENJOY-Yin-jiong/PHP.
[290] Exploring the Link Between Bayesian Inference and Embodied Intelligence: Toward Open Physical-World Embodied AI Systems
Bin Liu
Main category: cs.AI
TL;DR: The paper explores the underutilized role of Bayesian inference in embodied intelligence systems, analyzing its potential through the lenses of search and learning.
Details
Motivation: To understand why Bayesian principles, despite their conceptual alignment with embodied intelligence, are not widely applied in modern systems.
Method: Examines Bayesian and contemporary embodied intelligence approaches, focusing on search and learning as key themes.
Result: Identifies gaps in current systems, which are confined to closed environments, and highlights Bayesian methods’ potential for open-world embodied intelligence.
Conclusion: Bayesian inference could be pivotal in advancing embodied intelligence toward open physical-world applications.
Abstract: Embodied intelligence posits that cognitive capabilities fundamentally emerge from - and are shaped by - an agent’s real-time sensorimotor interactions with its environment. Such adaptive behavior inherently requires continuous inference under uncertainty. Bayesian statistics offers a principled probabilistic framework to address this challenge by representing knowledge as probability distributions and updating beliefs in response to new evidence. The core computational processes underlying embodied intelligence - including perception, action selection, learning, and even higher-level cognition - can be effectively understood and modeled as forms of Bayesian inference. Despite the deep conceptual connection between Bayesian statistics and embodied intelligence, Bayesian principles have not been widely or explicitly applied in today’s embodied intelligence systems. In this work, we examine both Bayesian and contemporary embodied intelligence approaches through two fundamental lenses: search and learning - the two central themes in modern AI, as highlighted in Rich Sutton’s influential essay “The Bitter Lesson”. This analysis sheds light on why Bayesian inference has not played a central role in the development of modern embodied intelligence. At the same time, it reveals that current embodied intelligence systems remain largely confined to closed-physical-world environments, and highlights the potential for Bayesian methods to play a key role in extending these systems toward truly open physical-world embodied intelligence.
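The inference pattern the paper has in mind is standard Bayesian filtering: predict the belief through an action model, then update it on an observation. A minimal discrete example, with made-up motion and sensor models:

```python
import numpy as np

# A minimal discrete Bayes filter over two rooms; states, the motion
# model, and the sensor model are toys for illustration.
states = ["room A", "room B"]
prior = np.array([0.5, 0.5])
transition = np.array([[0.8, 0.2],     # P(next | current) under a "move" action
                       [0.3, 0.7]])
likelihood = np.array([0.9, 0.2])      # P(sensor reads "A-like" | state)

belief = prior @ transition            # predict step (action)
belief = belief * likelihood           # update step (observation)
belief /= belief.sum()                 # normalize: posterior over states
print(dict(zip(states, belief.round(3))))
```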
[291] StaffPro: an LLM Agent for Joint Staffing and Profiling
Alessio Maritan
Main category: cs.AI
TL;DR: StaffPro is an LLM agent for workforce management, combining staffing and profiling tasks with natural language interaction and continuous human feedback.
Details
Motivation: To address the intertwined challenges of staffing (task assignment/scheduling) and profiling (worker skill estimation) in workforce management using LLMs.
Method: StaffPro integrates LLMs with modular algorithmic components, uses natural language for optimization objectives, and establishes a human-agent feedback loop for continuous profiling.
Result: StaffPro successfully estimates worker attributes and generates high-quality schedules, demonstrated in a consulting firm simulation.
Conclusion: StaffPro provides a robust, interpretable, and human-centric solution for automated personnel management.
Abstract: Large language model (LLM) agents integrate pre-trained LLMs with modular algorithmic components and have shown remarkable reasoning and decision-making abilities. In this work, we investigate their use for two tightly intertwined challenges in workforce management: staffing, i.e., the assignment and scheduling of tasks to workers, which may require team formation; and profiling, i.e., the continuous estimation of workers’ skills, preferences, and other latent attributes from unstructured data. We cast these problems in a formal mathematical framework that links scheduling decisions to latent feature estimation, and we introduce StaffPro, an LLM agent that addresses staffing and profiling jointly. Differently from existing staffing solutions, StaffPro allows expressing optimization objectives using natural language, accepts textual task descriptions and provides high flexibility. StaffPro interacts directly with humans by establishing a continuous human-agent feedback loop, ensuring natural and intuitive use. By analyzing human feedback, our agent continuously estimates the latent features of workers, realizing life-long worker profiling and ensuring optimal staffing performance over time. A consulting firm simulation example demonstrates that StaffPro successfully estimates workers’ attributes and generates high quality schedules. With its innovative design, StaffPro offers a robust, interpretable, and human-centric solution for automated personnel management.
[292] Self-Aware Safety Augmentation: Leveraging Internal Semantic Understanding to Enhance Safety in Vision-Language Models
Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, Mingang Chen
Main category: cs.AI
TL;DR: The paper investigates vulnerabilities in Large Vision-Language Models (LVLMs) to harmful input, identifying three key safety capabilities. It proposes Self-Aware Safety Augmentation (SASA) to enhance safety without fine-tuning, showing significant improvements with minimal utility loss.
Details
Motivation: LVLMs are more vulnerable to harmful input than language-only models, prompting a study of their internal safety dynamics.
Method: Defines three safety capabilities (safety perception, semantic understanding, alignment) and proposes SASA, projecting semantic representations to earlier layers for enhanced safety recognition.
Result: SASA improves LVLM safety significantly with minimal utility impact, validated across datasets and tasks.
Conclusion: The study highlights LVLM safety vulnerabilities and offers SASA as an effective, non-intrusive solution.
Abstract: Large vision-language models (LVLMs) are vulnerable to harmful input compared to their language-only backbones. We investigated this vulnerability by exploring LVLMs’ internal dynamics, framing their inherent safety understanding in terms of three key capabilities. Specifically, we define these capabilities as safety perception, semantic understanding, and alignment for linguistic expression, and experimentally pinpointed their primary locations within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, leading to a reduction in safety. Motivated by these findings, we propose Self-Aware Safety Augmentation (SASA), a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model’s inherent semantic understanding to enhance safety recognition without fine-tuning. Then, we employ linear probing to articulate the model’s internal semantic comprehension to detect the risk before the generation process. Extensive experiments on various datasets and tasks demonstrate that SASA significantly improves the safety of LVLMs, with minimal impact on the utility.
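The projection idea can be caricatured in a few lines: take a later layer's semantic representation, project it, and inject it into an earlier safety-oriented layer, leaving the backbone frozen. The layer indices, the linear projector, and the additive injection are assumptions; the paper specifies its own mapping:

```python
import torch

# Toy SASA-style injection: augment an early layer's state with a projection
# of a later layer's semantics, with no fine-tuning of the backbone.
d_model, seq = 16, 4
hidden = [torch.randn(seq, d_model) for _ in range(12)]   # pretend layer outputs

late, early = 9, 3
projector = torch.nn.Linear(d_model, d_model, bias=False)  # small trainable projector
augmented = hidden[early] + projector(hidden[late])        # safety-augmented early state

# A linear probe on the augmented state can then flag risky inputs before
# generation, as in the paper's risk-detection step.
probe = torch.nn.Linear(d_model, 2)
risk_logits = probe(augmented.mean(dim=0))
print(risk_logits)
```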
[293] Assistax: A Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics
Leonard Hinckeldey, Elliot Fosong, Elle Miller, Rimvydas Rubavicius, Trevor McInroe, Patricia Wollstadt, Christiane B. Wiebel-Herboth, Subramanian Ramamoorthy, Stefano V. Albrecht
Main category: cs.AI
TL;DR: Assistax is a new RL benchmark for assistive robotics, offering faster training via JAX and multi-agent RL for diverse human-robot interactions.
Details
Motivation: Current RL benchmarks (e.g., games) lack real-world applicability, especially in embodied assistive robotics. Assistax aims to bridge this gap.
Method: Assistax uses JAX for hardware acceleration and multi-agent RL to simulate interactions between robots and diverse human partners.
Result: Assistax achieves up to 370× faster training compared to CPU-based alternatives and provides reliable baselines for RL algorithms.
Conclusion: Assistax is a practical, open-source benchmark for advancing RL in assistive robotics, with potential for real-world impact.
Abstract: The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX’s hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent’s zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.
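The speed-up mechanism is the usual JAX pattern of vectorizing whole training runs: jax.vmap maps one jitted update across a population of seeds so all runs advance in a single device call. The quadratic toy objective below stands in for an actual RL update:

```python
import jax
import jax.numpy as jnp

def train_step(params, lr=0.1):
    """One gradient step on a toy quadratic loss (stand-in for an RL update)."""
    loss, grads = jax.value_and_grad(lambda p: jnp.sum(p ** 2))(params)
    return params - lr * grads, loss

# vmap vectorizes the step across runs; jit compiles the whole batch.
batched_step = jax.jit(jax.vmap(train_step))

keys = jax.random.split(jax.random.PRNGKey(0), num=8)        # 8 parallel runs
params = jax.vmap(lambda k: jax.random.normal(k, (4,)))(keys)
for _ in range(100):
    params, losses = batched_step(params)
print(losses)   # all runs converge toward zero, trained in lockstep
```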
[294] Can the current trends of AI handle a full course of mathematics?
Mariam Alsayyad, Fayadh Kadhem
Main category: cs.AI
TL;DR: The paper evaluates AI’s ability to handle a full college-level math course, highlighting strengths in organization and accuracy but noting gaps in emotional and human aspects. It suggests integrating AI and human efforts for better outcomes.
Details
Motivation: To assess whether current AI trends can manage a full college-level math course, including syllabus creation, material presentation, answering questions, and assessments.
Method: The study evaluates AI’s performance in four key aspects: syllabus creation, material presentation, answering student questions, and creating assessments.
Result: AI excels in organization and accuracy but lacks in emotional and human aspects, making it insufficient alone for a full course.
Conclusion: The paper recommends combining AI and human efforts to optimize the creation and delivery of a college-level math course.
Abstract: This paper addresses the question of how capable the current trends of Artificial Intelligence (AI) are of taking responsibility for a full course of mathematics at a college level. The study evaluates this ability in four significant aspects, namely, creating a course syllabus, presenting selected material, answering student questions, and creating an assessment. It shows that although AI is strong in some important parts, like organization and accuracy, some human aspects remain far beyond its current abilities; even in science there is an emotional dimension that AI in its current state cannot fulfill. The paper offers recommendations for integrating human and AI potentials to create better outcomes, with the target of delivering a full university-level course of mathematics as effectively as possible.
[295] Unrolling Dynamic Programming via Graph Filters
Sergio Rozada, Samuel Rey, Gonzalo Mateos, Antonio G. Marques
Main category: cs.AI
TL;DR: Proposes BellNet, a learnable parametric model to solve Bellman’s equations efficiently by unrolling policy iterations and leveraging graph signal processing insights.
Details
Motivation: Standard DP methods like policy iteration are computationally expensive for large state-action spaces or long-term dependencies.
Method: BellNet unrolls and truncates policy iterations into a parametric model, trained to minimize Bellman error, and re-parameterizes it as nonlinear graph filters.
Result: Preliminary experiments show BellNet approximates optimal policies faster than classical methods.
Conclusion: BellNet offers a concise, transferable, and efficient alternative to traditional DP methods.
Abstract: Dynamic programming (DP) is a fundamental tool used across many engineering fields. The main goal of DP is to solve Bellman’s optimality equations for a given Markov decision process (MDP). Standard methods like policy iteration exploit the fixed-point nature of these equations to solve them iteratively. However, these algorithms can be computationally expensive when the state-action space is large or when the problem involves long-term dependencies. Here we propose a new approach that unrolls and truncates policy iterations into a learnable parametric model dubbed BellNet, which we train to minimize the so-termed Bellman error from random value function initializations. Viewing the transition probability matrix of the MDP as the adjacency of a weighted directed graph, we draw insights from graph signal processing to interpret (and compactly re-parameterize) BellNet as a cascade of nonlinear graph filters. This fresh look facilitates a concise, transferable, and unifying representation of policy and value iteration, with an explicit handle on complexity during inference. Preliminary experiments conducted in a grid-like environment demonstrate that BellNet can effectively approximate optimal policies in a fraction of the iterations required by classical methods.
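The core idea can be pictured with a toy unrolled solver: a fixed number of Bellman-style updates with learnable step sizes, trained to minimise the Bellman error from random value-function initializations. This sketch uses damped value-iteration backups on a random MDP as a simplifying stand-in for BellNet's unrolled policy iteration and graph-filter re-parameterization.

```python
# Hedged sketch: unrolled, truncated Bellman updates with learnable step sizes.
import torch

S, A, gamma, K = 10, 3, 0.9, 5                 # states, actions, unroll depth
P = torch.rand(A, S, S)
P /= P.sum(-1, keepdim=True)                   # transition tensor P[a, s, s']
R = torch.rand(A, S)                           # reward per (action, state)

alpha = torch.nn.Parameter(torch.full((K,), 0.5))  # learnable step sizes
opt = torch.optim.Adam([alpha], lr=0.05)

def bellman_backup(V):
    return (R + gamma * torch.einsum("ast,t->as", P, V)).max(0).values

for _ in range(200):
    V = torch.rand(S)                          # random value initialization
    for k in range(K):                         # truncated, unrolled iterations
        V = (1 - alpha[k]) * V + alpha[k] * bellman_backup(V)
    loss = ((bellman_backup(V) - V) ** 2).mean()  # Bellman error
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```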
[296] GDAIP: A Graph-Based Domain Adaptive Framework for Individual Brain Parcellation
Jianfei Zhu, Haiqi Zhu, Shaohui Liu, Feng Jiang, Baichun Wei, Chunzhi Yi
Main category: cs.AI
TL;DR: GDAIP is a novel framework combining Graph Attention Networks and Minimax Entropy for domain adaptation in brain parcellation, addressing cross-dataset domain shifts.
Details
Motivation: Existing methods struggle with domain shifts in cross-dataset scenarios, limiting their real-world applicability.
Method: GDAIP integrates Graph Attention Networks and Minimax Entropy for domain adaptation, using semi-supervised training and adversarial optimization on unlabeled data.
Result: GDAIP achieves topologically plausible parcellations with cross-session consistency and reflects functional organization.
Conclusion: GDAIP effectively addresses domain shifts in brain parcellation, offering improved individual-level results.
Abstract: Recent deep learning approaches have shown promise in learning individual brain parcellations from functional magnetic resonance imaging (fMRI). However, most existing methods assume consistent data distributions across domains and struggle with domain shifts inherent to real-world cross-dataset scenarios. To address this challenge, we propose Graph Domain Adaptation for Individual Parcellation (GDAIP), a novel framework that integrates Graph Attention Networks (GAT) with Minimax Entropy (MME)-based domain adaptation. We construct cross-dataset brain graphs at both the group and individual levels. By leveraging semi-supervised training and adversarial optimization of the prediction entropy on unlabeled vertices of the target brain graph, the reference atlas is adapted from the group-level brain graph to the individual brain graph, enabling individual parcellation under cross-dataset settings. We evaluate our method using parcellation visualization, the Dice coefficient, and functional homogeneity. Experimental results demonstrate that GDAIP produces individual parcellations with topologically plausible boundaries, strong cross-session consistency, and the ability to reflect functional organization.
[297] SAT-Based Bounded Fitting for the Description Logic ALC
Maurice Funk, Jean Christoph Jung, Tom Voellmer
Main category: cs.AI
TL;DR: Bounded fitting for ALC and its fragments is NP-complete, even with minimal examples. It offers PAC learning guarantees, unlike other ALC learning methods. An implementation using a SAT solver is presented and compared.
Details
Motivation: To explore the computational complexity and learning guarantees of bounded fitting in description logic ALC and its fragments.
Method: Investigate bounded fitting for ALC and its fragments, analyze NP-completeness, and implement a SAT solver-based solution.
Result: Bounded fitting is NP-complete even with one positive and one negative example. It provides PAC guarantees, unlike other methods.
Conclusion: Bounded fitting is a viable approach for learning ALC concepts with theoretical guarantees, and the SAT-based implementation is competitive.
Abstract: Bounded fitting is a general paradigm for learning logical formulas from positive and negative data examples that has recently received considerable interest. We investigate bounded fitting for the description logic ALC and its syntactic fragments. We show that the underlying size-restricted fitting problem is NP-complete for all studied fragments, even in the special case of a single positive and a single negative example. By design, bounded fitting comes with probabilistic guarantees in Valiant’s PAC learning framework. In contrast, we show that other classes of algorithms for learning ALC concepts do not provide such guarantees. Finally, we present an implementation of bounded fitting in ALC and its fragments based on a SAT solver. We discuss optimizations and compare our implementation to other concept learning tools.
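The paradigm itself is simple to sketch: grow a size bound and, for each bound, ask a SAT solver whether some concept of that size fits all examples. The two helpers below are hypothetical stubs standing in for the paper's ALC encoding; pysat's Glucose3 is one concrete solver choice, not necessarily the one the authors use.

```python
# Hedged sketch of the bounded-fitting loop (stub encoding, real SAT solver).
from pysat.solvers import Glucose3

def encode_fitting_as_sat(positive, negative, size):
    """Stub: CNF clauses encoding 'some concept of this size fits all examples'."""
    return [[1]]  # trivially satisfiable placeholder

def decode_concept(model, size):
    """Stub: rebuild the ALC concept from the satisfying assignment."""
    return f"concept of size {size} (model: {model})"

def bounded_fitting(positive, negative, max_size=20):
    for n in range(1, max_size + 1):           # grow the size bound
        clauses = encode_fitting_as_sat(positive, negative, n)
        with Glucose3(bootstrap_with=clauses) as solver:
            if solver.solve():
                return decode_concept(solver.get_model(), n)
    return None  # no fitting concept within the size budget

print(bounded_fitting(positive=["e1"], negative=["e2"]))
```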
[298] Towards a rigorous evaluation of RAG systems: the challenge of due diligence
Grégoire Martinon, Alexandra Lorenzo de Brionne, Jérôme Bohard, Antoine Lojou, Damien Hervault, Nicolas J-B. Brunel
Main category: cs.AI
TL;DR: The paper evaluates the reliability of Retrieval-Augmented Generation (RAG) systems in high-risk sectors, proposing a robust evaluation protocol combining human and LLM-Judge annotations to address issues like hallucinations and off-topic responses.
Details
Motivation: To address concerns about the reliability of RAG systems in critical applications like healthcare and finance, particularly due to issues such as hallucinations and off-topic responses.
Method: Proposes an evaluation protocol combining human annotations and LLM-Judge annotations, inspired by Prediction Powered Inference (PPI), to measure system performance with statistical guarantees.
Result: A comprehensive dataset and precise performance measurements are provided, identifying system failures like hallucinations, off-topic responses, failed citations, and abstentions.
Conclusion: The study enhances the reliability and scalability of RAG system evaluation protocols for industrial applications.
Abstract: The rise of generative AI has driven significant advancements in high-risk sectors like healthcare and finance. The Retrieval-Augmented Generation (RAG) architecture, combining large language models (LLMs) with search engines, is particularly notable for its ability to generate responses from document corpora. Despite its potential, the reliability of RAG systems in critical contexts remains a concern, with issues such as hallucinations persisting. This study evaluates a RAG system used in due diligence for an investment fund. We propose a robust evaluation protocol combining human annotations and LLM-Judge annotations to identify system failures, such as hallucinations, off-topic responses, failed citations, and abstentions. Inspired by the Prediction Powered Inference (PPI) method, we achieve precise performance measurements with statistical guarantees. We provide a comprehensive dataset for further analysis. Our contributions aim to enhance the reliability and scalability of RAG system evaluation protocols in industrial applications.
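The PPI idea underlying the protocol is compact: use the LLM judge everywhere, then debias it with its measured error on the small human-labeled subset. A sketch on synthetic data; all numbers below are invented for illustration.

```python
# Hedged sketch of a PPI-style mean estimate: judge everywhere, human debiasing.
import numpy as np

rng = np.random.default_rng(0)
n_unlabeled, n_human = 5000, 200

judge_all = rng.binomial(1, 0.30, n_unlabeled)   # LLM-judge verdicts on all items
human = rng.binomial(1, 0.25, n_human)           # gold human labels on a subset
judge_on_human = np.clip(human + rng.binomial(1, 0.1, n_human), 0, 1)  # judge on same subset

# Point estimate: judge mean minus the judge's measured bias on labeled data.
rectifier = np.mean(judge_on_human - human)
theta_ppi = judge_all.mean() - rectifier
print(f"judge-only: {judge_all.mean():.3f}  PPI-corrected: {theta_ppi:.3f}")
```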
[299] Hybrid Causal Identification and Causal Mechanism Clustering
Saixiong Liu, Yuhua Qian, Jue Li, Honghong Cheng, Feijiang Li
Main category: cs.AI
TL;DR: The paper proposes MCVCI and MCVCC methods for bivariate causal direction identification, leveraging mixture models and neural networks to handle heterogeneous causality in real-world data.
Details
Motivation: Existing methods often assume a single causal mechanism, but real-world data involves heterogeneous causal relationships across environments.
Method: MCVCI combines Gaussian mixture models and neural networks, using likelihoods from a mixture conditional variational auto-encoder as causal criteria. MCVCC extends this by clustering causal mechanisms.
Result: The methods outperform state-of-the-art approaches on simulated and real data.
Conclusion: MCVCI and MCVCC effectively address heterogeneous causality, demonstrating superior performance in causal inference.
Abstract: Bivariate causal direction identification is a fundamental and vital problem in the causal inference field. Among binary causal methods, most additive-noise approaches use a single causal mechanism to construct the causal model. In the real world, however, observations are collected in different environments with heterogeneous causal relationships. This paper therefore proposes a Mixture Conditional Variational Causal Inference model (MCVCI) to infer heterogeneous causality from observational data. Specifically, building on the identifiability of the Hybrid Additive Noise Model (HANM), MCVCI combines the superior fitting capabilities of the Gaussian mixture model and the neural network, and uses the likelihoods obtained from the probabilistic bounds of the mixture conditional variational auto-encoder as causal decision criteria. Moreover, we model the causal heterogeneity as cluster numbers and propose the Mixture Conditional Variational Causal Clustering (MCVCC) method, which can reveal the expression of distinct causal mechanisms. On several simulated and real datasets, the proposed methods achieve the best overall performance compared with state-of-the-art approaches, demonstrating their effectiveness.
[300] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong
Main category: cs.AI
TL;DR: MixGRPO improves efficiency in human preference alignment for image generation by combining SDE and ODE sampling with a sliding window mechanism, reducing training time by 50%. MixGRPO-Flash further cuts training time by 71%.
Details
Motivation: Existing methods like FlowGRPO are inefficient due to sampling and optimizing all denoising steps in MDPs.
Method: MixGRPO integrates SDE and ODE sampling with a sliding window to confine randomness and reduce overhead. MixGRPO-Flash adds higher-order solvers for faster training.
Result: MixGRPO outperforms DanceGRPO in efficiency and effectiveness, with 50% lower training time. MixGRPO-Flash reduces time by 71%.
Conclusion: MixGRPO and MixGRPO-Flash offer significant efficiency gains in human preference alignment for image generation.
Abstract: Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at $\href{https://github.com/Tencent-Hunyuan/MixGRPO}{MixGRPO}$.
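The sliding-window schedule is easy to picture with a toy sampler: steps inside the window keep their stochasticity (SDE) and are the only ones that would receive GRPO gradient updates, while steps outside are deterministic (ODE). The dynamics below are invented for illustration, not the paper's flow-matching sampler.

```python
# Hedged sketch of a MixGRPO-style sliding-window sampling schedule.
import numpy as np

T, window_start, window_size = 20, 8, 5    # denoising steps, current window
rng = np.random.default_rng(0)

x = rng.normal(size=4)                     # toy latent
for t in range(T):
    drift = -0.1 * x                       # shared deterministic drift
    if window_start <= t < window_start + window_size:
        x = x + drift + 0.05 * rng.normal(size=x.shape)  # SDE step: keep noise
        # ...GRPO-guided policy-gradient update would happen only here...
    else:
        x = x + drift                      # ODE step: deterministic, no update
print(x)
```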
[301] An Agentic AI for a New Paradigm in Business Process Development
Mohammad Azarijafari, Luisa Mich, Michele Missikoff
Main category: cs.AI
TL;DR: The paper introduces an agent-based AI approach for business process design, replacing traditional task-based methods with goal-oriented, collaborative agents for more modular and intelligent automation.
Details
Motivation: To address the limitations of traditional task-based business process design by leveraging Agentic AI for more flexible, context-aware automation in dynamic industrial environments.
Method: Proposes an agent-based method where AI agents collaborate to achieve business goals, identified by business objects, with merge goals for multi-agent collaboration when needed.
Result: The approach enables modular, intelligent business process development, enhancing flexibility and context-awareness in industrial automation.
Conclusion: Agent-based AI offers a superior alternative to traditional methods, improving automation adaptability and efficiency in dynamic settings.
Abstract: Artificial Intelligence agents represent the next major revolution in the continuous technological evolution of industrial automation. In this paper, we introduce a new approach for business process design and development that leverages the capabilities of Agentic AI. Departing from the traditional task-based approach to business process design, we propose an agent-based method in which agents contribute to the achievement of business goals, identified by a set of business objects. When a single agent cannot fulfill a goal, we introduce a merge goal that can be achieved through the collaboration of multiple agents. The proposed model leads to more modular and intelligent business process development by organizing it around goals, objects, and agents. As a result, this approach enables flexible and context-aware automation in dynamic industrial environments.
[302] DualSG: A Dual-Stream Explicit Semantic-Guided Multivariate Time Series Forecasting Framework
Kuiye Ding, Fanda Fan, Yao Wang, Ruijie jian, Xiaorui Wang, Luqi Gong, Yishan Jiang, Chunjie Luo and Jianfeng Zhan
Main category: cs.AI
TL;DR: DualSG introduces a dual-stream framework where LLMs act as semantic guides to refine traditional predictions, avoiding issues of numerical precision loss and alignment difficulties.
Details
Motivation: To address the limitations of using LLMs as end-to-end forecasters or aligning textual and time series modalities in latent space, which often leads to precision loss or alignment challenges.
Method: Proposes DualSG, a dual-stream framework with explicit semantic guidance, using Time Series Caption for interpretable prompts and a caption-guided fusion module for inter-variable relationships.
Result: Outperforms 15 state-of-the-art baselines on real-world datasets, showing the effectiveness of combining numerical forecasting with semantic guidance.
Conclusion: Explicitly integrating LLMs as semantic guides within a dual-stream framework enhances forecasting accuracy and interpretability.
Abstract: Multivariate Time Series Forecasting (MTSF) plays a key role in many applications. Recent works have explored using Large Language Models (LLMs) for MTSF to take advantage of their reasoning abilities. However, many methods treat LLMs as end-to-end forecasters, which often leads to a loss of numerical precision and forces LLMs to handle patterns beyond their intended design. Alternatively, methods that attempt to align textual and time series modalities within latent space frequently encounter alignment difficulty. In this paper, we propose to treat LLMs not as standalone forecasters, but as semantic guidance modules within a dual-stream framework. We propose DualSG, a dual-stream framework that provides explicit semantic guidance, where LLMs act as Semantic Guides to refine rather than replace traditional predictions. As part of DualSG, we introduce Time Series Caption, an explicit prompt format that summarizes trend patterns in natural language and provides interpretable context for LLMs, rather than relying on implicit alignment between text and time series in the latent space. We also design a caption-guided fusion module that explicitly models inter-variable relationships while reducing noise and computation. Experiments on real-world datasets from diverse domains show that DualSG consistently outperforms 15 state-of-the-art baselines, demonstrating the value of explicitly combining numerical forecasting with semantic guidance.
[303] Probabilistic Active Goal Recognition
Chenyuan Zhang, Cristian Rojas Cardenas, Hamid Rezatofighi, Mor Vered, Buser Say
Main category: cs.AI
TL;DR: The paper introduces Active Goal Recognition (AGR) for multi-agent systems, combining probabilistic belief updates with MCTS to efficiently infer hidden goals without domain-specific knowledge. It outperforms passive methods and matches domain-specific baselines.
Details
Motivation: To improve multi-agent interactions by actively reducing uncertainty about other agents' goals, moving beyond passive reasoning.
Method: Uses a probabilistic framework with joint belief updates and MCTS for efficient planning and goal inference.
Result: Joint belief update outperforms passive recognition; domain-independent MCTS matches domain-specific baselines.
Conclusion: The proposed AGR framework is practical and robust, advancing interactive multi-agent systems.
Abstract: In multi-agent environments, effective interaction hinges on understanding the beliefs and intentions of other agents. While prior work on goal recognition has largely treated the observer as a passive reasoner, Active Goal Recognition (AGR) focuses on strategically gathering information to reduce uncertainty. We adopt a probabilistic framework for Active Goal Recognition and propose an integrated solution that combines a joint belief update mechanism with a Monte Carlo Tree Search (MCTS) algorithm, allowing the observer to plan efficiently and infer the actor’s hidden goal without requiring domain-specific knowledge. Through comprehensive empirical evaluation in a grid-based domain, we show that our joint belief update significantly outperforms passive goal recognition, and that our domain-independent MCTS performs comparably to our strong domain-specific greedy baseline. These results establish our solution as a practical and robust framework for goal inference, advancing the field toward more interactive and adaptive multi-agent systems.
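The belief-update half of the method is ordinary Bayesian filtering over candidate goals, P(goal | observations) ∝ P(observation | goal) · P(goal); the MCTS half then plans information-gathering actions against this belief. A minimal sketch with hand-made likelihoods standing in for a real observation model:

```python
# Hedged sketch of a Bayesian belief update over candidate goals.
import numpy as np

goals = ["goal_A", "goal_B", "goal_C"]
belief = np.ones(len(goals)) / len(goals)      # uniform prior

# P(observed action | goal), e.g. from how consistent the action is with a
# plan toward each goal; these numbers are stand-ins for a real model.
likelihoods = [
    np.array([0.7, 0.2, 0.1]),   # after observing action 1
    np.array([0.6, 0.3, 0.1]),   # after observing action 2
]
for lik in likelihoods:
    belief = belief * lik
    belief /= belief.sum()                     # Bayes rule, renormalized
print(dict(zip(goals, belief.round(3))))
```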
[304] EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Xingjian Zhang, Siwei Wen, Wenjun Wu, Lei Huang
Main category: cs.AI
TL;DR: The paper introduces EDGE-GRPO, an algorithm addressing advantage collapse in LLMs by leveraging entropy-driven advantage and guided error correction, outperforming existing methods.
Details
Motivation: To mitigate the advantage collapse problem in GRPO algorithms, which arises from identical rewards within groups, by improving response diversity and training signals.
Method: Proposes EDGE-GRPO, combining entropy-driven advantage and guided error correction to enhance policy optimization at a fine-grained sample level.
Result: EDGE-GRPO demonstrates superior performance on reasoning benchmarks, effectively reducing advantage collapse.
Conclusion: The proposed EDGE-GRPO algorithm successfully addresses advantage collapse, offering a robust solution for improving LLM reasoning.
Abstract: Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts \textbf{E}ntropy-\textbf{D}riven Advantage and \textbf{G}uided \textbf{E}rror Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. It is available at https://github.com/ZhangXJ199/EDGE-GRPO.
[305] MultiEditor: Controllable Multimodal Object Editing for Driving Scenarios Using 3D Gaussian Splatting Priors
Shouyi Lu, Zihan Lin, Chao Lu, Huanran Wang, Guirong Zhuo, Lianqing Zheng
Main category: cs.AI
TL;DR: MultiEditor is a dual-branch latent diffusion framework for editing images and LiDAR point clouds in driving scenarios, using 3D Gaussian Splatting as a prior to improve cross-modality consistency and address long-tailed data distribution.
Details
Motivation: The long-tailed distribution of real-world data in autonomous driving hinders generalization, especially for rare but safety-critical vehicle categories.
Method: MultiEditor uses 3D Gaussian Splatting as a structural and appearance prior, with a multi-level appearance control mechanism and depth-guided deformable cross-modality condition module.
Result: MultiEditor achieves high visual and geometric fidelity, editing controllability, and cross-modality consistency, improving detection accuracy for rare vehicle categories.
Conclusion: MultiEditor effectively addresses the challenge of long-tailed data distribution in autonomous driving by enhancing cross-modality consistency and generating rare-category vehicle data.
Abstract: Autonomous driving systems rely heavily on multimodal perception data to understand complex environments. However, the long-tailed distribution of real-world data hinders generalization, especially for rare but safety-critical vehicle categories. To address this challenge, we propose MultiEditor, a dual-branch latent diffusion framework designed to edit images and LiDAR point clouds in driving scenarios jointly. At the core of our approach is introducing 3D Gaussian Splatting (3DGS) as a structural and appearance prior for target objects. Leveraging this prior, we design a multi-level appearance control mechanism–comprising pixel-level pasting, semantic-level guidance, and multi-branch refinement–to achieve high-fidelity reconstruction across modalities. We further propose a depth-guided deformable cross-modality condition module that adaptively enables mutual guidance between modalities using 3DGS-rendered depth, significantly enhancing cross-modality consistency. Extensive experiments demonstrate that MultiEditor achieves superior performance in visual and geometric fidelity, editing controllability, and cross-modality consistency. Furthermore, generating rare-category vehicle data with MultiEditor substantially enhances the detection accuracy of perception models on underrepresented classes.
[306] A Neuro-Symbolic Approach for Probabilistic Reasoning on Graph Data
Raffaele Pojer, Andrea Passerini, Kim G. Larsen, Manfred Jaeger
Main category: cs.AI
TL;DR: A neuro-symbolic framework integrates GNNs with RBNs, combining learning and reasoning for improved performance on graph tasks.
Details
Motivation: GNNs lack symbolic reasoning, while RBNs lack learning strength. Integrating them leverages both strengths.
Method: Two implementations: compiling GNNs into RBNs or keeping them external. Includes a MAP inference method.
Result: Improved accuracy in node classification and complex decision-making in environmental planning.
Conclusion: The framework bridges learning and reasoning, enabling novel applications and better performance.
Abstract: Graph neural networks (GNNs) excel at predictive tasks on graph-structured data but often lack the ability to incorporate symbolic domain knowledge and perform general reasoning. Relational Bayesian Networks (RBNs), in contrast, enable fully generative probabilistic modeling over graph-like structures and support rich symbolic knowledge and probabilistic inference. This paper presents a neuro-symbolic framework that seamlessly integrates GNNs into RBNs, combining the learning strength of GNNs with the flexible reasoning capabilities of RBNs. We develop two implementations of this integration: one compiles GNNs directly into the native RBN language, while the other maintains the GNN as an external component. Both approaches preserve the semantics and computational properties of GNNs while fully aligning with the RBN modeling paradigm. We also propose a maximum a-posteriori (MAP) inference method for these neuro-symbolic models. To demonstrate the framework’s versatility, we apply it to two distinct problems. First, we transform a GNN for node classification into a collective classification model that explicitly models homo- and heterophilic label patterns, substantially improving accuracy. Second, we introduce a multi-objective network optimization problem in environmental planning, where MAP inference supports complex decision-making. Both applications include new publicly available benchmark datasets. This work introduces a powerful and coherent neuro-symbolic approach to graph data, bridging learning and reasoning in ways that enable novel applications and improved performance across diverse tasks.
[307] Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis
Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis
Main category: cs.AI
TL;DR: The paper introduces Tiny-BioMoE, a lightweight pretrained model for biosignal analysis, aimed at improving automatic pain assessment through multimodal physiological signals.
Details
Motivation: Accurate pain assessment is crucial for patient care and management. Current systems lack efficiency and objectivity, which Tiny-BioMoE addresses by leveraging biosignals.
Method: The study proposes Tiny-BioMoE, a pretrained embedding model trained on 4.4 million biosignal images with 7.3 million parameters, tested on diverse physiological signals for pain recognition.
Result: The model demonstrates effectiveness in automatic pain recognition across multiple biosignal modalities, including electrodermal activity and blood volume pulse.
Conclusion: Tiny-BioMoE offers a lightweight, efficient solution for biosignal-based pain assessment, with its code and weights publicly available for further use.
Abstract: Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person’s state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed approach introduces \textit{Tiny-BioMoE}, a lightweight pretrained embedding model for biosignal analysis. Trained on $4.4$ million biosignal image representations and consisting of only $7.3$ million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model’s effectiveness across diverse modalities in automatic pain recognition tasks. The model’s architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.
[308] Multi-Representation Diagrams for Pain Recognition: Integrating Various Electrodermal Activity Signals into a Single Image
Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis
Main category: cs.AI
TL;DR: A study proposes an automated pain-assessment system using electrodermal activity signals, creating multiple signal representations for improved accuracy and robustness compared to traditional methods.
Details
Motivation: Pain affects many people, and reliable assessment is crucial for effective management. Automated systems can provide continuous, objective monitoring to aid clinical decisions and reduce distress.
Method: The method uses electrodermal activity signals, creating and visualizing multiple signal representations in a single diagram. It incorporates various processing and filtering techniques.
Result: The approach yields comparable, and in several cases superior, results to traditional fusion methods in pain assessment.
Conclusion: The proposed pipeline is a robust alternative for integrating signal representations or modalities, enhancing pain-assessment systems.
Abstract: Pain is a multifaceted phenomenon that affects a substantial portion of the population. Reliable and consistent evaluation benefits those experiencing pain and underpins the development of effective and advanced management strategies. Automatic pain-assessment systems deliver continuous monitoring, inform clinical decision-making, and aim to reduce distress while preventing functional decline. By incorporating physiological signals, these systems provide objective, accurate insights into an individual’s condition. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed method introduces a pipeline that leverages electrodermal activity signals as input modality. Multiple representations of the signal are created and visualized as waveforms, and they are jointly visualized within a single multi-representation diagram. Extensive experiments incorporating various processing and filtering techniques, along with multiple representation combinations, demonstrate the effectiveness of the proposed approach. It consistently yields comparable, and in several cases superior, results to traditional fusion methods, establishing it as a robust alternative for integrating different signal representations or modalities.
[309] The Impact of Foundational Models on Patient-Centric e-Health Systems
Elmira Onagh, Alireza Davoodi, Maleknaz Nayebi
Main category: cs.AI
TL;DR: The study evaluates AI maturity in 116 patient-centric healthcare apps, finding most (86.21%) at early stages and few (13.79%) with advanced AI.
Details
Motivation: Assess AI's trustworthiness, transparency, and impact in healthcare by examining its maturity in patient-centric applications.
Method: Analyzed 116 apps using LLMs to extract and categorize AI features into Gartner AI maturity stages.
Result: 86.21% of apps are in early AI stages; only 13.79% show advanced integration.
Conclusion: AI in healthcare apps is mostly immature, highlighting a need for further development and standardization.
Abstract: As Artificial Intelligence (AI) becomes increasingly embedded in healthcare technologies, understanding the maturity of AI in patient-centric applications is critical for evaluating its trustworthiness, transparency, and real-world impact. In this study, we investigate the maturity of AI feature integration in 116 patient-centric healthcare applications. Using Large Language Models (LLMs), we extracted key functional features, which were then categorized into different stages of the Gartner AI maturity model. Our results show that over 86.21% of applications remain at the early stages of AI integration, while only 13.79% demonstrate advanced AI integration.
[310] Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline
Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis
Main category: cs.AI
TL;DR: The paper proposes a respiration-based pain assessment method using a cross-attention transformer and multi-windowing, showing strong performance with efficient models.
Details
Motivation: Accurate pain assessment is crucial for effective management, and automated systems can aid continuous monitoring and clinical decisions.
Method: A pipeline using respiration signals, a cross-attention transformer, and a multi-windowing strategy to capture short-term, long-term, and global features.
Result: Respiration is effective for pain assessment; compact, optimized models outperform larger ones. Multi-windowing enhances feature representation.
Conclusion: The method demonstrates the potential of respiration and efficient models in pain assessment, with multi-windowing improving performance.
Abstract: Pain is a complex condition affecting a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain, and it supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring and support clinical decision-making, aiming to reduce distress and prevent functional decline. This study has been submitted to the \textit{Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN)}. The proposed method introduces a pipeline that leverages respiration as the input signal and incorporates a highly efficient cross-attention transformer alongside a multi-windowing strategy. Extensive experiments demonstrate that respiration is a valuable physiological modality for pain assessment. Moreover, experiments revealed that compact and efficient models, when properly optimized, can achieve strong performance, often surpassing larger counterparts. The proposed multi-window approach effectively captures both short-term and long-term features, as well as global characteristics, thereby enhancing the model’s representational capacity.
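The multi-windowing strategy amounts to slicing one recording into views at several time scales before feature extraction and cross-attention fusion. A toy slicing sketch follows; the sampling rate and window lengths are assumptions, and the transformer itself is not reproduced.

```python
# Hedged sketch of multi-window views over one respiration recording.
import numpy as np

fs = 32                                   # sampling rate in Hz (assumed)
signal = np.random.randn(fs * 60)         # one minute of respiration samples

def windows(x, seconds, fs):
    step = seconds * fs
    return [x[i:i + step] for i in range(0, len(x) - step + 1, step)]

short_views = windows(signal, 5, fs)      # short-term structure
long_views = windows(signal, 20, fs)      # long-term structure
global_view = [signal]                    # global characteristics
print(len(short_views), len(long_views), len(global_view))  # 12 3 1
```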
[311] UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
Main category: cs.AI
TL;DR: UI-AGILE enhances GUI agents with improved training (Continuous Reward, Simple Thinking reward, Cropping-based Resampling) and inference (Decomposed Grounding with Selection) methods, achieving state-of-the-art performance.
Details
Motivation: Existing GUI agents face challenges in reasoning designs, ineffective rewards, and visual noise, limiting their performance.
Method: Proposes UI-AGILE with training enhancements (Continuous Reward, Simple Thinking reward, Cropping-based Resampling) and inference method (Decomposed Grounding with Selection).
Result: Achieves 23% grounding accuracy improvement on ScreenSpot-Pro over baselines.
Conclusion: UI-AGILE significantly advances GUI agent capabilities by addressing key training and inference challenges.
Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process:
1) a Continuous Reward function to incentivize high-precision grounding; 2) a “Simple Thinking” reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.
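Decomposed Grounding with Selection can be pictured as: tile the screenshot, ground the instruction in each tile, map candidates back to full-image coordinates, and keep the most confident one. The grounding call below is a hypothetical stub, not UI-AGILE's actual API.

```python
# Hedged sketch of tile-then-select grounding for high-resolution screens.
from PIL import Image

def ground(tile, instruction):
    """Stub for the agent's grounding call: returns ((x, y), confidence)."""
    return (tile.size[0] // 2, tile.size[1] // 2), 0.5

def decomposed_grounding(screenshot, instruction, grid=(2, 2)):
    W, H = screenshot.size
    tw, th = W // grid[0], H // grid[1]
    best = None
    for gx in range(grid[0]):
        for gy in range(grid[1]):
            box = (gx * tw, gy * th, (gx + 1) * tw, (gy + 1) * th)
            (x, y), conf = ground(screenshot.crop(box), instruction)
            candidate = ((x + box[0], y + box[1]), conf)  # back to full image
            if best is None or conf > best[1]:
                best = candidate
    return best[0]

print(decomposed_grounding(Image.new("RGB", (1920, 1080)), "click Save"))
```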
[312] LLM-based Content Classification Approach for GitHub Repositories by the README Files
Malik Uzair Mehmood, Shahid Hussain, Wen Li Wang, Muhammad Usama Malik
Main category: cs.AI
TL;DR: The study fine-tunes LLMs (BERT, DistilBERT, RoBERTa) to classify GitHub README sections, achieving high accuracy (F1=0.98) and explores PEFT techniques for efficiency.
Details
Motivation: GitHub README files often lack detail, hindering repository adoption. Automating classification can improve repository usability.
Method: Fine-tuned three encoder-only LLMs on 4226 README sections, tested PEFT techniques like LoRA.
Result: Achieved an F1 score of 0.98, outperforming state-of-the-art methods; PEFT offered an economical alternative to full fine-tuning.
Conclusion: LLMs can effectively classify README content, aiding automated tools for GitHub repository improvement.
Abstract: GitHub is the world’s most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. README files should contain project-related information, as recommended by GitHub, to support the usage and improvement of repositories. However, GitHub repository owners sometimes neglect these recommendations, preventing a repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository’s README file significantly influences its adoption and utilization, with a lack of detail potentially hindering widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks, including text classification, text generation, text summarization, and text translation. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. Three encoder-only LLMs are utilized: BERT, DistilBERT, and RoBERTa. These pre-trained models are fine-tuned on a gold-standard dataset consisting of 4226 README file sections. This approach outperforms current state-of-the-art methods, achieving an overall F1 score of 0.98. Moreover, we investigate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) and show that they offer an economical alternative to full fine-tuning without compromising much performance. The results demonstrate the potential of using LLMs to design an automatic classifier for categorizing the content of GitHub README files. Consequently, this study contributes to the development of automated tools that improve the discoverability and potential usage of GitHub repositories.
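For readers wanting to reproduce the PEFT setup, the LoRA configuration follows the standard HuggingFace recipe; the rank, target modules, and 5-class label set below are assumptions for illustration, not the paper's reported hyperparameters.

```python
# Hedged sketch of LoRA fine-tuning for README-section classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5
)
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],   # DistilBERT attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # a small fraction of the full model
# ...then train as usual (e.g., transformers.Trainer) on labeled sections...
```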
[313] UserBench: An Interactive Gym Environment for User-Centric Agents
Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Main category: cs.AI
TL;DR: UserBench is a benchmark for evaluating LLM agents in proactive collaboration with users, revealing gaps in user alignment despite task completion.
Details
Motivation: Current LLM agents excel in reasoning and tool use but struggle with proactive collaboration when user goals are vague or evolving.
Method: Introduces UserBench, a benchmark with simulated users and incremental preference revelation, testing agents’ ability to clarify intent and use tools.
Result: Evaluation shows low alignment with user intents (20%) and preference discovery (under 30%), even for advanced models.
Conclusion: UserBench highlights the need for agents to evolve from task executors to collaborative partners, providing a tool for advancement.
Abstract: Large Language Models (LLMs)-based agents have made impressive progress in reasoning and tool use, enabling them to solve complex tasks. However, their ability to proactively collaborate with users, especially when goals are vague, evolving, or indirectly expressed, remains underexplored. To address this gap, we introduce UserBench, a user-centric benchmark designed to evaluate agents in multi-turn, preference-driven interactions. UserBench features simulated users who start with underspecified goals and reveal preferences incrementally, requiring agents to proactively clarify intent and make grounded decisions with tools. Our evaluation of leading open- and closed-source LLMs reveals a significant disconnect between task completion and user alignment. For instance, models provide answers that fully align with all user intents only 20% of the time on average, and even the most advanced models uncover fewer than 30% of all user preferences through active interaction. These results highlight the challenges of building agents that are not just capable task executors, but true collaborative partners. UserBench offers an interactive environment to measure and advance this critical capability.
[314] Libra: Large Chinese-based Safeguard for AI Content
Ziyang Chen, Huimu Yu, Xing Wu, Dongqin Liu, Songlin Hu
Main category: cs.AI
TL;DR: Libra-Guard is a safeguard system for Chinese LLMs, using a two-stage training pipeline and achieving high accuracy in safety evaluations.
Details
Motivation: Address safety and ethical concerns in high-stakes applications of Chinese LLMs.
Method: Two-stage curriculum training: guard pretraining on synthetic data, followed by fine-tuning on real-world data. Introduces Libra-Test benchmark for evaluation.
Result: Libra-Guard achieves 86.79% accuracy, outperforming other models and nearing closed-source ones like Claude-3.5-Sonnet and GPT-4o.
Conclusion: Libra-Guard and Libra-Test provide a robust framework for safer Chinese LLMs, advancing AI safety governance.
Abstract: Large language models (LLMs) excel in text understanding and generation but raise significant safety and ethical concerns in high-stakes applications. To mitigate these risks, we present Libra-Guard, a cutting-edge safeguard system designed to enhance the safety of Chinese-based LLMs. Leveraging a two-stage curriculum training pipeline, Libra-Guard enhances data efficiency by employing guard pretraining on synthetic samples, followed by fine-tuning on high-quality, real-world data, thereby significantly reducing reliance on manual annotations. To enable rigorous safety evaluations, we also introduce Libra-Test, the first benchmark specifically designed to evaluate the effectiveness of safeguard systems for Chinese content. It covers seven critical harm scenarios and includes over 5,700 samples annotated by domain experts. Experiments show that Libra-Guard achieves 86.79% accuracy, outperforming Qwen2.5-14B-Instruct (74.33%) and ShieldLM-Qwen-14B-Chat (65.69%), and nearing closed-source models like Claude-3.5-Sonnet and GPT-4o. These contributions establish a robust framework for advancing the safety governance of Chinese LLMs and represent a tentative step toward developing safer, more reliable Chinese AI systems.
[315] Thou Shalt Not Prompt: Zero-Shot Human Activity Recognition in Smart Homes via Language Modeling of Sensor Data & Activities
Sourish Gunesh Dhekane, Thomas Ploetz
Main category: cs.AI
TL;DR: Proposes a zero-shot HAR method using natural language embeddings, avoiding LLM prompting risks like privacy and inconsistency.
Details
Motivation: Addressing risks of LLM-based HAR methods (privacy, external reliance, inconsistency) by developing an alternative approach.
Method: Models sensor data and activities using natural language embeddings for zero-shot classification, bypassing LLMs.
Result: Demonstrated effectiveness through a case study on six datasets.
Conclusion: Language modeling can enhance zero-shot HAR systems without relying on LLMs.
Abstract: Developing zero-shot human activity recognition (HAR) methods is a critical direction in smart home research, given its impact on making HAR systems work across smart homes with diverse sensing modalities, layouts, and activities of interest. The state-of-the-art solutions along this direction generate natural language descriptions of the sensor data and feed them via a carefully crafted prompt to an LLM to perform classification. Despite their performance guarantees, such “prompt-the-LLM” approaches carry several risks, including privacy invasion, reliance on an external service, and inconsistent predictions due to version changes, making a case for alternative zero-shot HAR methods that do not require prompting LLMs. In this paper, we propose one such solution that models sensor data and activities using natural language, leveraging its embeddings to perform zero-shot classification and thereby bypassing the need to prompt LLMs for activity predictions. The impact of our work lies in presenting a detailed case study on six datasets, highlighting how language modeling can bolster HAR systems in zero-shot recognition.
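The prompt-free recipe reduces to embedding a textual rendering of the sensor stream and every candidate activity with the same encoder, then predicting the nearest activity. A minimal sketch; the encoder choice and the descriptions are assumptions for illustration.

```python
# Hedged sketch of zero-shot HAR via shared language embeddings, no LLM prompt.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

sensor_text = "kitchen motion sensor fired, then the stove sensor turned on"
activities = ["cooking", "sleeping", "watching TV", "leaving home"]

sensor_emb = encoder.encode(sensor_text, convert_to_tensor=True)
activity_embs = encoder.encode(activities, convert_to_tensor=True)
scores = util.cos_sim(sensor_emb, activity_embs)[0]
print(activities[int(scores.argmax())])   # nearest activity wins
```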
[316] Reasoning Language Models for Root Cause Analysis in 5G Wireless Networks
Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Yibin Kang, Haozhe Zhang, Merouane Debbah, Fadhel Ayed
Main category: cs.AI
TL;DR: A lightweight framework using LLMs for RCA in mobile networks is proposed, leveraging a curated dataset (TeleLogs) and a two-stage training method to enhance accuracy and reasoning.
Details
Motivation: RCA in mobile networks is challenging due to interpretability and domain expertise requirements. Existing LLMs struggle with domain-specific problems.
Method: A two-stage training methodology combines supervised fine-tuning with reinforcement learning to adapt LLMs for RCA, using the TeleLogs dataset.
Result: Significant performance gains over state-of-the-art models, with strong generalization to randomized test variants.
Conclusion: Domain-adapted, reasoning-enhanced LLMs show promise for practical and explainable RCA in network operations.
Abstract: Root Cause Analysis (RCA) in mobile networks remains a challenging task due to the need for interpretability, domain expertise, and causal reasoning. In this work, we propose a lightweight framework that leverages Large Language Models (LLMs) for RCA. To do so, we introduce TeleLogs, a curated dataset of annotated troubleshooting problems designed to benchmark RCA capabilities. Our evaluation reveals that existing open-source reasoning LLMs struggle with these problems, underscoring the need for domain-specific adaptation. To address this issue, we propose a two-stage training methodology that combines supervised fine-tuning with reinforcement learning to improve the accuracy and reasoning quality of LLMs. The proposed approach fine-tunes a series of RCA models to integrate domain knowledge and generate structured, multi-step diagnostic explanations, improving both interpretability and effectiveness. Extensive experiments across multiple LLM sizes show significant performance gains over state-of-the-art reasoning and non-reasoning models, including strong generalization to randomized test variants. These results demonstrate the promise of domain-adapted, reasoning-enhanced LLMs for practical and explainable RCA in network operation and management.
[317] The Effect of Compression Techniques on Large Multimodal Language Models in the Medical Domain
Tanvir Ahmed Khan, Aranya Saha, Ismam Nur Swapnil, Mohammad Ariful Haque
Main category: cs.AI
TL;DR: The paper evaluates pruning and quantization for compressing Multimodal Large Language Models (MLLMs) in medical applications, achieving 70% memory reduction and 4% higher performance.
Details
Motivation: MLLMs are computationally expensive, requiring efficient compression techniques for practical medical use.
Method: Proposes a novel layer selection method for pruning, analyzes quantization techniques, and evaluates a prune-SFT-quantize pipeline.
Result: Enables 7B-parameter MLLMs to run within 4GB VRAM, reducing memory by 70% with 4% higher performance.
Conclusion: The method effectively balances compression and performance, making MLLMs more feasible for medical applications.
Abstract: Multimodal Large Language Models (MLLMs) hold great potential for use in the medical domain, but their computational costs necessitate efficient compression techniques. This paper evaluates the impact of structural pruning and activation-aware quantization on a fine-tuned LLAVA model for medical applications. We propose a novel layer selection method for pruning, analyze different quantization techniques, and assess the performance trade-offs in a prune-SFT-quantize pipeline. Our proposed method enables MLLMs with 7B parameters to run within 4 GB of VRAM, reducing memory usage by 70% while achieving 4% higher model performance compared to traditional pruning and quantization techniques at the same compression ratio.
[318] PHAX: A Structured Argumentation Framework for User-Centered Explainable AI in Public Health and Biomedical Sciences
Bahar İlgen, Akshat Dubey, Georges Hattab
Main category: cs.AI
TL;DR: PHAX is a framework for generating human-centered explanations in AI-driven public health, using structured argumentation to enhance transparency and trust.
Details
Motivation: Current XAI methods lack adaptability for diverse health stakeholders, necessitating clearer, context-aware explanations.
Method: PHAX combines defeasible reasoning, adaptive natural language techniques, and user modeling to create audience-specific justifications.
Result: PHAX improves interpretability and trust through use cases like medical term simplification and policy justification.
Conclusion: PHAX advances transparent, human-centered AI in public health by aligning formal reasoning with communicative needs.
Abstract: Ensuring transparency and trust in AI-driven public health and biomedical sciences systems requires more than accurate predictions; it demands explanations that are clear, contextual, and socially accountable. While explainable AI (XAI) has advanced in areas like feature attribution and model interpretability, most methods still lack the structure and adaptability needed for diverse health stakeholders, including clinicians, policymakers, and the general public. We introduce PHAX, a Public Health Argumentation and eXplainability framework, which leverages structured argumentation to generate human-centered explanations for AI outputs. PHAX is a multi-layer architecture combining defeasible reasoning, adaptive natural language techniques, and user modeling to produce context-aware, audience-specific justifications. More specifically, we show how argumentation enhances explainability by supporting AI-driven decision-making, justifying recommendations, and enabling interactive dialogues across user types. We demonstrate the applicability of PHAX through use cases such as medical term simplification, patient-clinician communication, and policy justification. In particular, we show how simplification decisions can be modeled as argument chains and personalized based on user expertise, enhancing both interpretability and trust. By aligning formal reasoning methods with communicative demands, PHAX contributes to a broader vision of transparent, human-centered AI in public health.
[319] The Interspeech 2025 Speech Accessibility Project Challenge
Xiuwen Zheng, Bornali Phukon, Jonghwan Na, Ed Cutrell, Kyu Han, Mark Hasegawa-Johnson, Pan-Pan Jiang, Aadhrik Kuila, Colin Lea, Bob MacDonald, Gautam Mantena, Venkatesh Ravichandran, Leda Sari, Katrin Tomanek, Chang D. Yoo, Chris Zwilling
Main category: cs.AI
TL;DR: The 2025 Interspeech SAP Challenge improved ASR for speech disabilities using 400+ hours of data, with 12/22 teams outperforming the baseline in WER and 17 in SemScore.
Details
Motivation: Address inadequate ASR performance for individuals with speech disabilities due to limited training data.
Method: Utilized 400+ hours of SAP data from 500+ individuals, evaluated via EvalAI with Word Error Rate and Semantic Score metrics.
Result: 12 teams beat the baseline in WER, 17 in SemScore; top team achieved 8.11% WER and 88.44% SemScore.
Conclusion: The SAP Challenge set new benchmarks for ASR in recognizing impaired speech.
Abstract: While the last decade has witnessed significant advancements in Automatic Speech Recognition (ASR) systems, the performance of these systems for individuals with speech disabilities remains inadequate, partly due to limited public training data. To bridge this gap, the 2025 Interspeech Speech Accessibility Project (SAP) Challenge was launched, utilizing over 400 hours of SAP data collected and transcribed from more than 500 individuals with diverse speech disabilities. Hosted on EvalAI and leveraging the remote evaluation pipeline, the SAP Challenge evaluates submissions based on Word Error Rate and Semantic Score. Overall, 12 out of 22 valid teams outperformed the whisper-large-v2 baseline in terms of WER, while 17 teams surpassed the baseline on SemScore. Notably, the top team achieved the lowest WER of 8.11% and the highest SemScore of 88.44% at the same time, setting new benchmarks for future ASR systems in recognizing impaired speech.
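For reference, the primary ranking metric is the usual word-level edit distance normalised by reference length, WER = (S + D + I) / N. A self-contained implementation:

```python
# Word Error Rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("turn on the lights", "turn the light"))  # 0.5
```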
[320] Strategist: Self-improvement of LLM Decision Making via Bi-Level Tree Search
Jonathan Light, Min Cai, Weiqin Chen, Guanzhi Wang, Xiusi Chen, Wei Cheng, Yisong Yue, Ziniu Hu
Main category: cs.AI
TL;DR: STRATEGIST combines LLMs and MCTS for zero-shot planning in complex games, outperforming traditional RL and LLM-based methods.
Details
Motivation: Traditional RL and LLMs have limitations in planning and generalization. STRATEGIST aims to bridge this gap.
Method: Integrates LLMs for high-level strategy generation and MCTS for refinement, using self-play simulations.
Result: Outperforms traditional RL, LLM-based methods, and matches human performance in games like GOPS and Avalon.
Conclusion: STRATEGIST is a generalizable framework for zero-shot planning in complex environments.
Abstract: Traditional reinforcement learning and planning typically require vast amounts of data and training to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with tasks that require detailed planning and decision-making in complex action spaces. We introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to search and update high-level strategies (as text), which are then refined and executed by low-level Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework to optimize strategies through population-based self-play simulations without the need for any training data. We demonstrate the effectiveness of STRATEGIST in learning optimal strategies for competitive, multi-turn games with partial information, including Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents equipped with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill acquisition techniques, and pre-existing LLM agents across both game environments, and achieve performance comparable to human players.
[321] SAKE: Steering Activations for Knowledge Editing
Marco Scialanga, Thibault Laugel, Vincent Grari, Marcin Detyniecki
Main category: cs.AI
TL;DR: SAKE, a steering activation method, improves knowledge editing in LLMs by modeling facts as distributions and using Optimal Transport for robust edits.
Details
Motivation: Existing Knowledge Editing (KE) methods lack contextual robustness and fail to generalize to logical implications, necessitating a more effective approach.
Method: SAKE models facts as distributions (paraphrases and logical implications) and uses Optimal Transport to edit LLM behavior across these distributions.
Result: SAKE outperforms existing KE methods, demonstrating more robust and generalized edits.
Conclusion: SAKE addresses limitations of current KE methods, offering a more effective solution for updating knowledge in LLMs.
Abstract: As Large Language Models have been shown to memorize real-world facts, the need to update this knowledge in a controlled and efficient manner arises. Designed with these constraints in mind, Knowledge Editing (KE) approaches propose to alter specific facts in pretrained models. However, they have been shown to suffer from several limitations, including their lack of contextual robustness and their failure to generalize to logical implications related to the fact. To overcome these issues, we propose SAKE, a steering activation method that models a fact to be edited as a distribution rather than a single prompt. Leveraging Optimal Transport, SAKE alters the LLM behavior over a whole fact-related distribution, defined as paraphrases and logical implications. Several numerical experiments demonstrate the effectiveness of this method: SAKE is thus able to perform more robust edits than its existing counterparts.
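The distribution-level editing idea can be sketched in a few lines. The toy below uses a simple mean shift between paraphrase activation clouds where SAKE uses an Optimal Transport map, and `encode` is a hypothetical stand-in for reading a hidden layer of the LLM, so this illustrates the idea rather than the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(prompts):
    """Hypothetical stand-in: hidden-layer activations, one row per prompt."""
    return rng.normal(size=(len(prompts), 64))

old_fact = ["The capital of X is A.", "A is X's capital.", "X's capital city is A."]
new_fact = ["The capital of X is B.", "B is X's capital.", "X's capital city is B."]

old_acts, new_acts = encode(old_fact), encode(new_fact)

# Distribution-level edit direction: shift the whole paraphrase cloud,
# not a single prompt (a crude mean-matching stand-in for the OT coupling).
steer = new_acts.mean(axis=0) - old_acts.mean(axis=0)

def edited_forward(h: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the steering vector to a hidden state at inference time."""
    return h + alpha * steer

print(edited_forward(old_acts[0]).shape)  # (64,)
```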
[322] Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Oren Sultan, Eitan Stern, Dafna Shahaf
Main category: cs.AI
TL;DR: A neuro-symbolic approach combining LLMs with structured components improves proof accuracy in geometry by leveraging analogous problems and formal verification.
Details
Motivation: LLMs struggle with rigorous logical deduction, especially in domains like mathematical proof generation, limiting their reliability and applicability.
Method: The approach retrieves analogous problems to guide LLMs and uses a formal verifier to evaluate and correct generated proofs.
Result: Proof accuracy for OpenAI’s o1 model improved by 58%-70%, with both analogous problems and verifier feedback contributing to the gains.
Conclusion: Enhancing LLMs to generate provably correct conclusions can boost their reliability, accuracy, and applicability in critical tasks.
Abstract: Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation. We propose a neuro-symbolic approach that combines LLMs’ generative strengths with structured components to overcome this challenge. As a proof-of-concept, we focus on geometry problems. Our approach is two-fold: (1) we retrieve analogous problems and use their proofs to guide the LLM, and (2) a formal verifier evaluates the generated proofs and provides feedback, helping the model fix incorrect proofs. We demonstrate that our method significantly improves proof accuracy for OpenAI’s o1 model (58%-70% improvement); both analogous problems and the verifier’s feedback contribute to these gains. More broadly, shifting to LLMs that generate provably correct conclusions could dramatically improve their reliability, accuracy and consistency, unlocking complex tasks and critical real-world applications that require trustworthiness.
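The generate-verify-repair loop described above can be sketched as follows; every function here (`retrieve_analogous`, `llm_generate_proof`, `formal_verify`) is a hypothetical stub standing in for the paper's retriever, LLM, and geometry verifier.

```python
def retrieve_analogous(problem: str) -> str:
    """Stub retriever: returns a similar solved problem as an exemplar."""
    return "exemplar: analogous problem with a worked proof"

def llm_generate_proof(problem: str, exemplar: str, feedback: str = "") -> str:
    """Stub LLM: drafts a proof, optionally revising from verifier feedback."""
    return f"proof({problem!r}, guided={bool(exemplar)}, revised={bool(feedback)})"

def formal_verify(proof: str) -> tuple[bool, str]:
    """Stub verifier: accepts any proof that was revised once with feedback."""
    if "revised=True" in proof:
        return True, ""
    return False, "step 3: angle equality not justified"

def prove(problem: str, max_rounds: int = 3) -> str | None:
    exemplar = retrieve_analogous(problem)          # (1) analogy guides the LLM
    feedback = ""
    for _ in range(max_rounds):
        proof = llm_generate_proof(problem, exemplar, feedback)
        ok, feedback = formal_verify(proof)         # (2) verifier checks, critiques
        if ok:
            return proof
    return None  # no verified proof within budget

print(prove("show that triangle ABC is isosceles"))
```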
[323] SLR: Automated Synthesis for Scalable Logical Reasoning
Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting
Main category: cs.AI
TL;DR: SLR is an automated framework for evaluating and training LLMs via scalable logical reasoning, creating benchmarks and improving model accuracy without human input.
Details
Motivation: To address the limitations of LLMs in logical reasoning and provide a scalable, automated solution for evaluation and training.
Method: SLR synthesizes instruction prompts, validation programs, and ground-truth rules automatically, creating a benchmark (SLR-Bench) with progressive difficulty levels.
Result: LLMs often fail at correct logical inference despite valid syntax. Curriculum learning via SLR improves accuracy (e.g., doubling Llama-3-8B’s performance) and generalizes to other benchmarks.
Conclusion: SLR is effective for enhancing LLMs’ reasoning capabilities efficiently and scalably, with demonstrated improvements in accuracy and generalization.
Abstract: We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user’s task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
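The verifiable-reward idea is easy to illustrate: a synthesized validation program scores a model-produced rule by executing both it and the latent ground-truth rule on sampled instances. The sketch below uses plain Python predicates as a simplification of the paper's logic programs.

```python
import random

def latent_rule(x: int) -> bool:
    """Hidden ground-truth rule synthesized alongside the task."""
    return x % 3 == 0 and x > 4

def model_rule(x: int) -> bool:
    """Rule parsed from the LLM's answer (here: close, but incomplete)."""
    return x % 3 == 0

def validation_program(candidate, n: int = 200, seed: int = 0) -> float:
    """Executable, verifiable reward: agreement with the latent rule."""
    rng = random.Random(seed)
    xs = [rng.randrange(0, 100) for _ in range(n)]
    return sum(candidate(x) == latent_rule(x) for x in xs) / n

print(f"verifiable reward: {validation_program(model_rule):.2f}")
```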
[324] Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data
Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Main category: cs.AI
TL;DR: The paper introduces an enhanced version of the Reveal2Revise framework for bias mitigation in deep neural networks used in medical applications, incorporating semi-automated interpretability-based bias annotation.
Details
Motivation: Deep neural networks in medical applications often suffer from shortcut learning due to spurious correlations, which can have severe consequences. Existing methods address detection or mitigation separately, but a combined approach is needed.
Method: The Reveal2Revise framework is enhanced with semi-automated interpretability-based bias annotation, including sample- and feature-level annotation, to improve bias mitigation.
Result: The framework successfully identifies and mitigates biases in VGG16, ResNet50, and Vision Transformer models across four medical datasets, improving robustness.
Conclusion: The enhanced framework increases the applicability and safety of deep neural networks in real-world medical tasks by effectively addressing spurious correlations.
Abstract: Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Whereas a multitude of works address either the detection or mitigation of such shortcut behavior in isolation, the Reveal2Revise approach provides a comprehensive bias mitigation framework combining these steps. However, effectively addressing these biases often requires substantial labeling efforts from domain experts. In this work, we review the steps of the Reveal2Revise framework and enhance it with semi-automated interpretability-based bias annotation capabilities. This includes methods for the sample- and feature-level bias annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of the framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks. Our code is available at https://github.com/frederikpahde/medical-ai-safety.
[325] A Scalable Approach to Probabilistic Neuro-Symbolic Robustness Verification
Vasileios Manginas, Nikolaos Manginas, Edward Stevinson, Sherwin Varghese, Nikos Katzouris, Georgios Paliouras, Alessio Lomuscio
Main category: cs.AI
TL;DR: The paper proposes a method for verifying the robustness of probabilistic Neuro-Symbolic AI systems, addressing complexity with an approximate approach.
Details
Motivation: Ensuring safe deployment of Neuro-Symbolic AI in critical domains by verifying robustness.
Method: Analyzes exact complexity (NP^PP-complete) and introduces an approximate, relaxation-based verification method.
Result: The method scales exponentially better than solver-based solutions and is applied to autonomous driving.
Conclusion: The approach enables practical verification of probabilistic NeSy systems, demonstrated in real-world applications.
Abstract: Neuro-Symbolic Artificial Intelligence (NeSy AI) has emerged as a promising direction for integrating neural learning with symbolic reasoning. Typically, in the probabilistic variant of such systems, a neural network first extracts a set of symbols from sub-symbolic input, which are then used by a symbolic component to reason in a probabilistic manner towards answering a query. In this work, we address the problem of formally verifying the robustness of such NeSy probabilistic reasoning systems, therefore paving the way for their safe deployment in critical domains. We analyze the complexity of solving this problem exactly, and show that a decision version of the core computation is $\mathrm{NP}^{\mathrm{PP}}$-complete. In the face of this result, we propose the first approach for approximate, relaxation-based verification of probabilistic NeSy systems. We demonstrate experimentally on a standard NeSy benchmark that the proposed method scales exponentially better than solver-based solutions and apply our technique to a real-world autonomous driving domain, where we verify a safety property under large input dimensionalities.
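A toy instance of the relaxation idea, under strong simplifying assumptions: suppose a neural-network verifier yields interval bounds on each symbol's probability under input perturbations, and the query is a conjunction of independent symbols. Because the query probability is then monotone in each symbol probability, it can be bounded endpoint-wise; real probabilistic NeSy programs are far richer than this.

```python
def verify_conjunction(symbol_bounds: list[tuple[float, float]],
                       threshold: float) -> bool:
    """Certify P(all symbols true) >= threshold for every perturbed input."""
    lo = hi = 1.0
    for l, u in symbol_bounds:
        lo *= l          # worst case: each symbol at its lower bound
        hi *= u          # best case: each symbol at its upper bound
    print(f"query probability lies in [{lo:.3f}, {hi:.3f}]")
    return lo >= threshold

# e.g., "obstacle detected AND lane clear", with perturbation-robust bounds
# on each symbol coming from a neural-network verifier
print(verify_conjunction([(0.92, 0.97), (0.88, 0.95)], threshold=0.75))
```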
[326] An Algebraic Approach to Moralisation and Triangulation of Probabilistic Graphical Models
Antonio Lorenzin, Fabio Zanasi
Main category: cs.AI
TL;DR: The paper presents a categorical framework for transforming Bayesian networks into Markov networks (moralisation) and vice versa (triangulation) using functors.
Details
Motivation: To provide a modular, algebraic perspective on probabilistic graphical models by modeling transformations between Bayesian and Markov networks categorically.
Method: Represent Bayesian and Markov networks as functors and model moralisation and triangulation as functors between categories of these networks.
Result: The framework allows inductive definitions of transformations and introduces a modular approach to probabilistic graphical models.
Conclusion: The categorical approach offers a structured, algebraic way to understand and manipulate transformations between graphical models.
Abstract: Moralisation and Triangulation are transformations allowing one to switch between different ways of factoring a probability distribution into a graphical model. Moralisation allows one to view a Bayesian network (a directed model) as a Markov network (an undirected model), whereas triangulation works in the opposite direction. We present a categorical framework where these transformations are modelled as functors between a category of Bayesian networks and one of Markov networks. The two kinds of network (the objects of these categories) are themselves represented as functors, from a 'syntax' domain to a 'semantics' codomain. Notably, moralisation and triangulation are definable inductively on such syntax, and operate as a form of functor pre-composition. This approach introduces a modular, algebraic perspective in the theory of probabilistic graphical models.
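For readers unfamiliar with the classical operation being reframed, here is a small sketch of moralisation itself: marry the parents of each node, then drop edge directions. The paper's functorial treatment is not reproduced here.

```python
from itertools import combinations

def moralise(dag: dict[str, list[str]]) -> set[frozenset[str]]:
    """dag maps each node to its parents; returns undirected moral edges."""
    edges = set()
    for child, parents in dag.items():
        for p in parents:                       # keep original edges, undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(parents, 2):   # 'marry' co-parents
            edges.add(frozenset((p, q)))
    return edges

# Classic v-structure A -> C <- B: moralisation adds the edge A - B.
print(sorted(tuple(sorted(e)) for e in moralise({"A": [], "B": [], "C": ["A", "B"]})))
```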
[327] 2D-Curri-DPO: Two-Dimensional Curriculum Learning for Direct Preference Optimization
Mengyang Li, Zhong Zhang
Main category: cs.AI
TL;DR: 2D-Curri-DPO introduces a two-dimensional curriculum for aligning language models, combining prompt complexity and pairwise distinguishability, outperforming prior methods.
Details
Motivation: Traditional DPO methods rely on single preference pairs, and recent approaches like Curriculum-DPO overlook prompt complexity, limiting alignment effectiveness.
Method: Proposes 2D-Curri-DPO with dual difficulty metrics, a curriculum strategy space, and KL-divergence-based adaptive updates for dynamic training stability.
Result: Outperforms standard DPO and prior curriculum methods on benchmarks including MT-Bench, Vicuna Bench, and WizardLM, with state-of-the-art results on the UltraFeedback test set.
Conclusion: Effective alignment requires modeling prompt complexity and pairwise distinguishability, establishing adaptive, multi-dimensional curriculum learning as a powerful paradigm.
Abstract: Aligning large language models with human preferences is crucial for their safe deployment. While Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning from human feedback, traditional DPO methods are limited by their reliance on single preference pairs. Recent work like Curriculum-DPO integrates multiple pairs using a one-dimensional difficulty curriculum based on pairwise distinguishability (PD), but overlooks the complexity of the input prompt itself. To address this, we propose 2D-Curri-DPO, a novel framework employing a two-dimensional curriculum that jointly models Prompt Complexity (PC) and Pairwise Distinguishability. This framework introduces dual difficulty metrics to quantify prompt semantic complexity and response preference clarity, defines a curriculum strategy space encompassing multiple selectable strategies for task adaptation, and incorporates a KL-divergence-based adaptive mechanism for dynamic reference model updates to enhance training stability. Comprehensive experiments demonstrate that 2D-Curri-DPO significantly outperforms standard DPO and prior curriculum methods across multiple benchmarks, including MT-Bench, Vicuna Bench, and WizardLM. Our approach achieves state-of-the-art performance on challenging test sets like UltraFeedback. Ablation studies confirm the benefits of the 2D structure and adaptive mechanisms, while analysis provides guidance for strategy selection. These findings demonstrate that effective alignment requires modeling both prompt complexity and pairwise distinguishability, establishing adaptive, multi-dimensional curriculum learning as a powerful and interpretable new paradigm for preference-based language model optimization.
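A hedged sketch of the two-dimensional scheduling idea: preference pairs are ranked along Prompt Complexity (PC) and Pairwise Distinguishability (PD) and trained easy-to-hard. The metrics below are toy proxies (prompt length for PC, reward-model margin for PD), not the paper's estimators.

```python
from dataclasses import dataclass

@dataclass
class PrefPair:
    prompt: str
    chosen: str
    rejected: str
    reward_margin: float   # e.g. r(chosen) - r(rejected) from a reward model

def pc(p: PrefPair) -> float:
    return float(len(p.prompt.split()))   # toy Prompt Complexity proxy

def pd(p: PrefPair) -> float:
    return p.reward_margin                # high margin = easy to distinguish

def curriculum(pairs: list[PrefPair]) -> list[PrefPair]:
    """Order pairs easy-to-hard by the sum of their two difficulty ranks."""
    pc_rank = {id(p): r for r, p in enumerate(sorted(pairs, key=pc))}
    pd_rank = {id(p): r for r, p in enumerate(sorted(pairs, key=pd, reverse=True))}
    return sorted(pairs, key=lambda p: pc_rank[id(p)] + pd_rank[id(p)])

pairs = [
    PrefPair("a much longer and more involved question", "a", "b", 0.1),
    PrefPair("short question", "a", "b", 2.0),
]
for p in curriculum(pairs):               # short, clear pair is scheduled first
    print(pc(p), pd(p))
```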
[328] Enhancing Student Learning with LLM-Generated Retrieval Practice Questions: An Empirical Study in Data Science Courses
Yuan An, John Liu, Niyam Acharya, Ruhma Hashmi
Main category: cs.AI
TL;DR: LLM-generated retrieval practice questions improve student knowledge retention (89% accuracy) compared to no practice (73%), but require manual verification for quality.
Details
Motivation: To assess the effectiveness of LLM-generated retrieval practice questions in enhancing student learning, given the time-consuming nature of manual question creation.
Method: Empirical study in two college-level data science courses (60 students), comparing learning outcomes with and without LLM-generated retrieval practice.
Result: Students with LLM-generated questions achieved 89% accuracy vs. 73% without, showing significant improvement in knowledge retention.
Conclusion: LLM-generated retrieval practice is effective but requires instructor verification to ensure question quality.
Abstract: Retrieval practice is a well-established pedagogical technique known to significantly enhance student learning and knowledge retention. However, generating high-quality retrieval practice questions is often time-consuming and labor-intensive for instructors, especially in rapidly evolving technical subjects. Large Language Models (LLMs) offer the potential to automate this process by generating questions in response to prompts, yet the effectiveness of LLM-generated retrieval practice on student learning remains to be established. We conducted an empirical study involving two college-level data science courses, with approximately 60 students. We compared learning outcomes during one week in which students received LLM-generated multiple-choice retrieval practice questions to those from a week in which no such questions were provided. Results indicate that students exposed to LLM-generated retrieval practice achieved significantly higher knowledge retention, with an average accuracy of 89%, compared to 73% in the week without such practice. These findings suggest that LLM-generated retrieval questions can effectively support student learning and may provide a scalable solution for integrating retrieval practice into real-time teaching. However, despite these encouraging outcomes and the potential time-saving benefits, caution must be taken, as the quality of LLM-generated questions can vary. Instructors must still manually verify and revise the generated questions before releasing them to students.
[329] Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light
Mani Hamidi, Terrence W. Deacon
Main category: cs.AI
TL;DR: The paper critiques three core tenets of RL, proposing an evolutionary framework to revise them, addressing agency, learning objectives, and the reward hypothesis, with implications for biological learning.
Details
Motivation: To challenge and revise foundational assumptions in RL by drawing parallels with evolutionary theory, making it more applicable to biological learning.
Method: Uses evolutionary insights to revisit RL dogmas, arguing for adaptation over search, multi-objective rewards, and integrating origins-of-life theory for agency.
Result: Proposes a framework to rethink RL assumptions, highlighting the need for evolutionary and thermodynamic perspectives to understand agency and learning.
Conclusion: Evolutionary theory enriches RL but requires additional insights from origins-of-life theory to fully address agency, offering a path for future research.
Abstract: Three core tenets of reinforcement learning (RL)–concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis–have been highlighted as key targets for conceptual revision, with major implications for theory and application. We propose a framework, inspired by open-ended evolutionary theory, to reconsider these three “dogmas.” We revisit each assumption and address related concerns raised alongside them. To make our arguments relevant to RL as a model of biological learning, we first establish that evolutionary dynamics can plausibly operate within living brains over an individual’s lifetime, and are not confined to cross-generational processes. We begin by revisiting the second dogma, drawing on evolutionary insights to enrich the “adaptation-rather-than-search” view of learning. We then address the third dogma regarding the limits of the reward hypothesis, using analogies from evolutionary fitness to illuminate the scalar reward vs. multi-objective debate. After discussing practical implications for exploration in RL, we turn to the first–and arguably most fundamental–issue: the absence of a formal account of agency. We argue that unlike the other two problems, the evolutionary paradigm alone cannot resolve the agency question, though it gestures in a productive direction. We advocate integrating ideas from origins-of-life theory, where the thermodynamics of sustenance and replication offer promising foundations for understanding agency and resource-constrained reinforcement learning in biological systems.
[330] What Does ‘Human-Centred AI’ Mean?
Olivia Guest
Main category: cs.AI
TL;DR: The paper argues that AI must be understood as a relationship between technology and human cognition, analyzing its impact through displacement, enhancement, or replacement of human cognitive labor.
Details
Motivation: To clarify the human-centered nature of AI and address the obfuscation of human cognition in AI systems, which distorts critical engagement and limits human-centric engineering.
Method: Uses examples (e.g., abacus vs. mental arithmetic, camera vs. vision) and novel definitions to analyze sociotechnical relationships.
Result: Identifies three types of AI impact on human cognition: displacement (harmful), enhancement (beneficial), and replacement (neutral).
Conclusion: To truly center humans in AI, we must recognize and address the human cognitive role in AI systems, avoiding obfuscation.
Abstract: While it seems sensible that human-centred artificial intelligence (AI) means centring “human behaviour and experience,” it cannot be any other way. AI, I argue, is usefully seen as a relationship between technology and humans where it appears that artifacts can perform, to a greater or lesser extent, human cognitive labour. This is evinced using examples that juxtapose technology with cognition, inter alia: abacus versus mental arithmetic; alarm clock versus knocker-upper; camera versus vision; and sweatshop versus tailor. Using novel definitions and analyses, sociotechnical relationships can be analysed into varying types of: displacement (harmful), enhancement (beneficial), and/or replacement (neutral) of human cognitive labour. Ultimately, all AI implicates human cognition; no matter what. Obfuscation of cognition in the AI context – from clocks to artificial neural networks – results in distortion, in slowing critical engagement, perverting cognitive science, and indeed in limiting our ability to truly centre humans and humanity in the engineering of AI systems. To even begin to de-fetishise AI, we must look the human-in-the-loop in the eyes.
cs.SD
[331] Combolutional Neural Networks
Cameron Churchwell, Minje Kim, Paris Smaragdis
Main category: cs.SD
TL;DR: The paper introduces the combolutional layer, a time-domain harmonic feature extractor for audio tasks, demonstrating its effectiveness and efficiency in various applications.
Details
Motivation: The need for effective inductive biases in machine learning models for audio, given the high sample count in short clips, drives the development of the combolutional layer.
Method: The combolutional layer combines a learned-delay IIR comb filter and fused envelope detector to extract harmonic features in the time domain.
Result: The layer proves effective in tasks like piano transcription, speaker classification, and key detection, offering low parameter count, efficient CPU inference, and improved interpretability.
Conclusion: The combolutional layer is a viable replacement for convolutional layers in audio tasks requiring precise harmonic analysis, with additional benefits like computational efficiency and interpretability.
Abstract: Selecting appropriate inductive biases is an essential step in the design of machine learning models, especially when working with audio, where even short clips may contain millions of samples. To this end, we propose the combolutional layer: a learned-delay IIR comb filter and fused envelope detector, which extracts harmonic features in the time domain. We demonstrate the efficacy of the combolutional layer on three information retrieval tasks, evaluate its computational cost relative to other audio frontends, and provide efficient implementations for training. We find that the combolutional layer is an effective replacement for convolutional layers in audio tasks where precise harmonic analysis is important, e.g., piano transcription, speaker classification, and key detection. Additionally, the combolutional layer has several other key benefits over existing frontends, namely: low parameter count, efficient CPU inference, strictly real-valued computations, and improved interpretability.
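A minimal numpy sketch of one combolutional channel, under the simplification of a fixed integer delay (the paper learns fractional delays end-to-end): a feedback comb filter resonates when its delay matches the period of a harmonic input, and a rectify-and-smooth envelope detector reads out the response.

```python
import numpy as np

def comb_envelope(x: np.ndarray, d: int, alpha: float = 0.9,
                  smooth: float = 0.995) -> np.ndarray:
    # Feedback comb filter y[n] = x[n] + alpha * y[n - d]:
    # resonates when the delay d matches the period of a harmonic input.
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + (alpha * y[n - d] if n >= d else 0.0)
    # Fused envelope detector: full-wave rectify, then one-pole lowpass.
    env, acc = np.zeros_like(y), 0.0
    for n, v in enumerate(np.abs(y)):
        acc = smooth * acc + (1.0 - smooth) * v
        env[n] = acc
    return env

fs = 16000
tone = np.sin(2 * np.pi * 200.0 * np.arange(fs) / fs)    # 1 s of a 200 Hz tone
on_pitch = comb_envelope(tone, d=fs // 200)              # delay = one period
off_pitch = comb_envelope(tone, d=fs // 330)             # mismatched delay
print(on_pitch[-1] > 3 * off_pitch[-1])                  # True: harmonic match
```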
[332] Relationship between objective and subjective perceptual measures of speech in individuals with head and neck cancer
Bence Mark Halpern, Thomas Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Martijn Wieling, Defne Abur, Tomoki Toda
Main category: cs.SD
TL;DR: The study explores the relationship between perceptual speech assessments and objective acoustic measures in head and neck cancer patients, finding strong correlations between subjective and objective measures, particularly for intelligibility.
Details
Motivation: To improve clinical phonetics and therapy monitoring by understanding the link between subjective and objective speech assessments.
Method: Trained listeners rated speech traits (intelligibility, articulation, etc.) in a large HNC dataset, comparing these with objective acoustic measures.
Result: Strong correlations between subjective intelligibility, articulation, and voice quality; objective measures aligned with subjective ones, especially for intelligibility and speech rate.
Conclusion: A single intelligibility measure may suffice for clinical monitoring in HNC patients treated with chemoradiation.
Abstract: Meaningful speech assessment is vital in clinical phonetics and therapy monitoring. This study examined the link between perceptual speech assessments and objective acoustic measures in a large head and neck cancer (HNC) dataset. Trained listeners provided ratings of intelligibility, articulation, voice quality, phonation, speech rate, nasality, and background noise on speech. Strong correlations were found between subjective intelligibility, articulation, and voice quality, likely due to a shared underlying cause of speech symptoms in our speaker population. Objective measures of intelligibility and speech rate aligned with their subjective counterpart. Our results suggest that a single intelligibility measure may be sufficient for the clinical monitoring of speakers treated for HNC using concomitant chemoradiation.
[333] Hyperbolic Embeddings for Order-Aware Classification of Audio Effect Chains
Aogu Wada, Tomohiko Nakamura, Hiroshi Saruwatari
Main category: cs.SD
TL;DR: The paper proposes a neural-network-based method for recognizing audio effect (AFX) chains by embedding wet signals in hyperbolic space, outperforming Euclidean methods due to hyperbolic space’s efficiency in modeling tree-structured data like AFX chains.
Details
Motivation: The order of AFXs in a chain significantly impacts the final sound, yet prior studies focused only on effect types and parameters, not their order. This work addresses the gap by jointly estimating AFX types and their order.
Method: A neural network embeds wet signals into hyperbolic space, leveraging its exponential expansion property to model AFX chains as trees (nodes as AFXs, edges as order). This captures the non-commutative nature of AFX combinations.
Result: Experiments on guitar sounds show the hyperbolic method outperforms Euclidean approaches, especially with proper curvature. Analysis confirms its effectiveness in capturing AFX order across types and chain lengths.
Conclusion: Hyperbolic space is well-suited for AFX chain recognition, offering a promising direction for modeling ordered audio effects.
Abstract: Audio effects (AFXs) are essential tools in music production, frequently applied in chains to shape timbre and dynamics. The order of AFXs in a chain plays a crucial role in determining the final sound, particularly when non-linear (e.g., distortion) or time-variant (e.g., chorus) processors are involved. Despite its importance, most AFX-related studies have primarily focused on estimating effect types and their parameters from a wet signal. To address this gap, we formulate AFX chain recognition as the task of jointly estimating AFX types and their order from a wet signal. We propose a neural-network-based method that embeds wet signals into a hyperbolic space and classifies their AFX chains. Hyperbolic space can represent tree-structured data more efficiently than Euclidean space due to its exponential expansion property. Since AFX chains can be represented as trees, with AFXs as nodes and edges encoding effect order, hyperbolic space is well-suited for modeling the exponentially growing and non-commutative nature of ordered AFX combinations, where changes in effect order can result in different final sounds. Experiments using guitar sounds demonstrate that, with an appropriate curvature, the proposed method outperforms its Euclidean counterpart. Further analysis based on AFX type and chain length highlights the effectiveness of the proposed method in capturing AFX order.
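The geometric ingredient is easy to demonstrate: distances in the Poincaré ball grow rapidly toward the boundary, which is what lets tree-structured AFX chains embed with low distortion. The snippet below computes geodesic distances in the unit ball; the trained encoder and curvature tuning are omitted.

```python
import numpy as np

def poincare_dist(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance in the unit Poincare ball."""
    num = 2.0 * np.dot(u - v, u - v)
    den = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return float(np.arccosh(1.0 + num / den))

root = np.zeros(2)                      # e.g., the empty chain
child = np.array([0.6, 0.0])            # e.g., "distortion"
grandchild = np.array([0.95, 0.0])      # e.g., "distortion -> chorus"

print(round(poincare_dist(root, child), 2))        # ~1.39
print(round(poincare_dist(root, grandchild), 2))   # ~3.66: space expands at the rim
```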
[334] SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods
Wen Huang, Yanmei Gu, Zhiming Wang, Huijia Zhu, Yanmin Qian
Main category: cs.SD
TL;DR: SpeechFake is a large-scale dataset for speech deepfake detection, featuring 3M samples, 3K hours of audio, and 40 synthesis tools. It supports 46 languages and includes diverse generation techniques. Baseline models show strong performance, and the dataset aids in advancing detection methods.
Details
Motivation: The rise of deepfake audio misuse highlights the need for robust detection systems, but existing datasets lack scale and diversity, limiting model generalization.
Method: SpeechFake is introduced, a dataset with 3M deepfake samples from 40 tools, covering text-to-speech, voice conversion, and neural vocoder methods. It includes 46 languages and provides detailed statistics and baseline detection models.
Result: Baseline models trained on SpeechFake perform well on its test sets and unseen data. Experiments explore the impact of generation methods, languages, and speaker variation on detection.
Conclusion: SpeechFake is a valuable resource for improving speech deepfake detection and developing robust models against evolving synthesis techniques.
Abstract: As speech generation technology advances, the risk of misuse through deepfake audio has become a pressing concern, which underscores the critical need for robust detection systems. However, many existing speech deepfake datasets are limited in scale and diversity, making it challenging to train models that can generalize well to unseen deepfakes. To address these gaps, we introduce SpeechFake, a large-scale dataset designed specifically for speech deepfake detection. SpeechFake includes over 3 million deepfake samples, totaling more than 3,000 hours of audio, generated using 40 different speech synthesis tools. The dataset encompasses a wide range of generation techniques, including text-to-speech, voice conversion, and neural vocoder, incorporating the latest cutting-edge methods. It also provides multilingual support, spanning 46 languages. In this paper, we offer a detailed overview of the dataset’s creation, composition, and statistics. We also present baseline results by training detection models on SpeechFake, demonstrating strong performance on both its own test sets and various unseen test sets. Additionally, we conduct experiments to rigorously explore how generation methods, language diversity, and speaker variation affect detection performance. We believe SpeechFake will be a valuable resource for advancing speech deepfake detection and developing more robust models for evolving generation techniques.
[335] Whilter: A Whisper-based Data Filter for “In-the-Wild” Speech Corpora Using Utterance-level Multi-Task Classification
William Ravenscroft, George Close, Kit Bower-Morris, Jamie Stacey, Dmitry Sityaev, Kris Y. Hong
Main category: cs.SD
TL;DR: The Whilter model is a multitask solution for identifying undesirable features in large-scale speech datasets, outperforming state-of-the-art methods in accuracy and efficiency.
Details
Motivation: Large-scale speech datasets often contain undesirable features (e.g., multiple speakers, non-target languages, music) that can hinder model learning.
Method: Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification tasks simultaneously.
Result: Whilter achieves F1 scores >85% and equal error rates of 6.5%-7.8% for three subtasks, outperforming BEATs and reducing processing time.
Conclusion: Whilter is an effective multitask solution for filtering undesirable samples in speech datasets, with superior performance and efficiency.
Abstract: Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.
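A minimal sketch of the multi-task head idea: one shared encoder embedding feeds five parallel classification heads, so a single forward pass yields all five labels. The task names, dimensions, and random-projection "encoder" below are all assumptions standing in for the real Whisper encoder and attention-based classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512                                   # assumed encoder embedding size
TASKS = {"multi_speaker": 2, "language": 3, "music": 2, "noise": 2, "tts": 2}

def whisper_encode(audio: np.ndarray) -> np.ndarray:
    """Stub for the Whisper encoder + attention pooling: one D-dim vector."""
    return rng.normal(size=D)

heads = {name: rng.normal(size=(D, k)) * 0.01 for name, k in TASKS.items()}

def whilter_like(audio: np.ndarray) -> dict[str, int]:
    h = whisper_encode(audio)             # shared representation, computed once
    return {name: int(np.argmax(h @ W)) for name, W in heads.items()}

print(whilter_like(np.zeros(16000)))      # five utterance-level labels at once
```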
[336] Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment
Ohad Cohen, Gershon Hazan, Sharon Gannot
Main category: cs.SD
TL;DR: A Multi-modal Emotion Recognition (MER) system combining audio and video modalities improves accuracy in challenging acoustic conditions.
Details
Motivation: To enhance emotion recognition accuracy in difficult acoustic environments by integrating audio and video data.
Method: Combines a modified Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio and an R(2+1)D CNN for video analysis, tested on a reverberated RAVDESS dataset with synthetic and real-world RIRs.
Result: Multimodal (audiovisual) approach outperforms uni-modal methods, especially in challenging conditions, and using multiple microphones further improves performance.
Conclusion: Integrating audio and video modalities with multiple microphones enhances emotion recognition accuracy in challenging acoustic scenarios.
Abstract: This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions. Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis. We evaluate our proposed method on a reverberated version of the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset using synthetic and real-world Room Impulse Responses (RIRs). Our results demonstrate that integrating audio and video modalities yields superior performance compared to uni-modal approaches, especially in challenging acoustic conditions. Moreover, we show that the multimodal (audiovisual) approach that utilizes multiple microphones outperforms its single-microphone counterpart.
[337] Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Jun Du, Kewei Li, Ruoyu Wang, Jiefeng Ma, Lei Sun, Jianqing Gao
Main category: cs.SD
TL;DR: SaFa introduces a modality-agnostic method for seamless long-spectrum and panorama generation using latent swap joint diffusion, addressing spectrum aliasing and improving cross-view consistency.
Details
Motivation: The paper addresses the spectrum aliasing problem in audio generation caused by existing joint diffusion methods, which suppress high-frequency components excessively.
Method: Proposes Self-Loop Latent Swap for frame-level bidirectional swaps and Reference-Guided Latent Swap for global consistency, refining swap timing and intervals.
Result: SaFa outperforms existing methods in audio generation and adapts well to panorama generation, achieving faster speeds and better generalizability.
Conclusion: SaFa is an efficient and effective solution for seamless long-spectrum and panorama generation, with superior performance and adaptability.
Abstract: This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherent long spectra and panoramas through latent swap joint diffusion across multiple views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoids spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer-length adaptation. It also adapts well to panorama generation, achieving comparable performance with 2-20x faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/
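A hedged sketch of the Self-Loop Latent Swap mechanics: at each denoising step, adjacent subview latents exchange their overlapping columns bidirectionally instead of being averaged. The one-line `denoise_step` is a toy contraction standing in for a real diffusion update, and the (C, H, W) latent layout is an assumption.

```python
import numpy as np

def denoise_step(z: np.ndarray, t: int) -> np.ndarray:
    """Toy stand-in for one diffusion update (same noise for both views)."""
    return 0.98 * z + 0.02 * np.random.default_rng(t).normal(size=z.shape)

def self_loop_swap(z_left, z_right, overlap: int):
    """Bidirectionally swap the shared columns (no averaging)."""
    left_edge = z_left[..., -overlap:].copy()
    z_left[..., -overlap:] = z_right[..., :overlap]
    z_right[..., :overlap] = left_edge
    return z_left, z_right

C, H, W, overlap = 4, 32, 64, 16
zl = np.random.default_rng(0).normal(size=(C, H, W))
zr = np.random.default_rng(1).normal(size=(C, H, W))
for t in range(50, 0, -1):
    zl, zr = denoise_step(zl, t), denoise_step(zr, t)
    zl, zr = self_loop_swap(zl, zr, overlap)     # swap after every step
gap = np.abs(zl[..., -overlap:] - zr[..., :overlap]).mean()
print(f"overlap disagreement after coupled denoising: {gap:.3f}")
```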
[338] Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
Main category: cs.SD
TL;DR: AF3 is a state-of-the-art open audio-language model excelling in speech, sound, and music reasoning. It introduces novel features like unified encoding, flexible reasoning, multi-audio chat, and long audio understanding, trained with curated datasets and a five-stage curriculum.
Details
Motivation: To advance reasoning and understanding across speech, sound, and music by creating a versatile, open-source model that outperforms existing solutions.
Method: AF3 uses AF-Whisper for joint representation learning, flexible reasoning, multi-audio chat, and long audio understanding. It employs novel datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum-based training strategy.
Result: AF3 achieves SOTA results on 20+ benchmarks, surpassing open-weight and closed-source models despite using only open-source data.
Conclusion: AF3 demonstrates superior performance in audio-language tasks, proving the effectiveness of its novel training strategies and multimodal capabilities.
Abstract: We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
cs.LG
[339] Task-Focused Consolidation with Spaced Recall: Making Neural Networks learn like college students
Prital Bamnodkar
Main category: cs.LG
TL;DR: TFC-SR, a novel continual learning method inspired by human strategies like Active Recall, improves performance by stabilizing past knowledge with an Active Recall Probe, outperforming baselines on benchmarks like Split CIFAR-100.
Details
Motivation: Addressing Catastrophic Forgetting in Deep Neural Networks by mimicking human learning strategies (Active Recall, Deliberate Practice, Spaced Repetition).
Method: Introduces TFC-SR with an Active Recall Probe for periodic, task-aware evaluation to stabilize past knowledge, tested on Split MNIST and Split CIFAR-100.
Result: TFC-SR achieves 13.17% accuracy on Split CIFAR-100 vs. 7.40% for standard replay, showing better performance in memory-constrained settings.
Conclusion: TFC-SR is robust and efficient, emphasizing the value of active memory retrieval in continual learning.
Abstract: Deep Neural Networks often suffer from a critical limitation known as Catastrophic Forgetting, where performance on past tasks degrades after learning new ones. This paper introduces a novel continual learning approach inspired by human learning strategies like Active Recall, Deliberate Practice and Spaced Repetition, named Task Focused Consolidation with Spaced Recall (TFC-SR). TFC-SR enhances the standard experience replay with a mechanism we termed the Active Recall Probe. It is a periodic, task-aware evaluation of the model’s memory that stabilizes the representations of past knowledge. We test TFC-SR on the Split MNIST and Split CIFAR-100 benchmarks against leading regularization-based and replay-based baselines. Our results show that TFC-SR performs significantly better than these methods. For instance, on the Split CIFAR-100, it achieves a final accuracy of 13.17% compared to standard replay’s 7.40%. We demonstrate that this advantage comes from the stabilizing effect of the probe itself, and not from the difference in replay volume. Additionally, we analyze the trade-off between memory size and performance and show that while TFC-SR performs better in memory-constrained environments, higher replay volume is still more effective when available memory is abundant. We conclude that TFC-SR is a robust and efficient approach, highlighting the importance of integrating active memory retrieval mechanisms into continual learning systems.
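A sketch of how an Active Recall Probe could sit on top of standard experience replay; the model, optimizer, and probe losses are stubs, and the final step (rehearsing the most-forgotten task) is one plausible use of the probe signal, assumed here rather than taken from the paper.

```python
import random

def train_step(model: dict, batch: list) -> None:
    model["steps"] += 1                      # stub: one gradient update

def probe_loss(model: dict, task_samples: list) -> float:
    return random.random()                   # stub: held-out loss on a past task

def tfc_sr(model, new_task_data, replay_buffer, probe_sets, probe_every=100):
    for step, batch in enumerate(new_task_data):
        k = min(8, len(replay_buffer))
        train_step(model, batch + random.sample(replay_buffer, k))  # replay
        if step % probe_every == 0:          # Active Recall Probe (task-aware)
            losses = {t: probe_loss(model, s) for t, s in probe_sets.items()}
            worst = max(losses, key=losses.get)
            replay_buffer.extend(probe_sets[worst])  # assumed: rehearse it more

model = {"steps": 0}
tfc_sr(model, [["x"]] * 300, [["old"]], {0: [["t0"]], 1: [["t1"]]})
print(model["steps"])   # 300
```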
[340] Pre-, In-, and Post-Processing Class Imbalance Mitigation Techniques for Failure Detection in Optical Networks
Yousuf Moiz Ali, Jaroslaw E. Prilepsky, Nicola Sambo, João Pedro, Mohammad M. Hosseini, Antonio Napoli, Sergei K. Turitsyn, Pedro Freire
Main category: cs.LG
TL;DR: Comparison of imbalance mitigation techniques in optical network failure detection, with Threshold Adjustment offering the highest F1 gain and RUS providing the fastest inference.
Details
Motivation: To address class imbalance in optical network failure detection by evaluating different mitigation techniques.
Method: Comparison of pre-, in-, and post-processing techniques, including Threshold Adjustment and Random Under-sampling (RUS).
Result: Threshold Adjustment achieved the highest F1 gain (15.3%), while RUS offered the fastest inference.
Conclusion: There is a trade-off between performance (F1 gain) and complexity (inference speed) in imbalance mitigation techniques.
Abstract: We compare pre-, in-, and post-processing techniques for class imbalance mitigation in optical network failure detection. Threshold Adjustment achieves the highest F1 gain (15.3%), while Random Under-sampling (RUS) offers the fastest inference, highlighting a key performance-complexity trade-off.
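Threshold Adjustment, the best-performing technique here, is a post-processing step that is simple to sketch: sweep the decision threshold on validation scores and keep the value that maximizes F1, instead of classifying at 0.5. The scores below are synthetic.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)        # ~5% failure (positive) class
scores = 0.25 * y + 0.3 * rng.random(2000)       # overlapping detector scores

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y, (scores >= t).astype(int), zero_division=0)
       for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
f1_default = f1_score(y, (scores >= 0.5).astype(int), zero_division=0)
print(f"best threshold {best:.2f}: F1 {max(f1s):.3f} (vs {f1_default:.3f} at 0.50)")
```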
[341] Quantum Geometry of Data
Alexander G. Abanov, Luca Candelori, Harold C. Steinacker, Martin T. Wells, Jerome R. Busemeyer, Cameron J. Hogan, Vahagn Kirakosyan, Nicola Marzari, Sunil Pinnamaneni, Dario Villani, Mengjia Xu, Kharen Musaelian
Main category: cs.LG
TL;DR: QCML encodes data as quantum geometry using Hermitian matrices and Hilbert space states, capturing global properties and avoiding dimensionality issues.
Details
Motivation: To leverage quantum geometry for richer data representation and understanding cognitive phenomena.
Method: Represent data features as learned Hermitian matrices and map data points to Hilbert space states, extracting geometric and topological properties.
Result: QCML captures intrinsic dimension, quantum metric, and Berry curvature, demonstrating effectiveness on synthetic and real-world examples.
Conclusion: QCML’s quantum geometric representation offers a novel framework for understanding cognitive phenomena and data properties.
Abstract: We demonstrate how Quantum Cognition Machine Learning (QCML) encodes data as quantum geometry. In QCML, features of the data are represented by learned Hermitian matrices, and data points are mapped to states in Hilbert space. The quantum geometry description endows the dataset with rich geometric and topological structure - including intrinsic dimension, quantum metric, and Berry curvature - derived directly from the data. QCML captures global properties of data, while avoiding the curse of dimensionality inherent in local methods. We illustrate this on a number of synthetic and real-world examples. Quantum geometric representation of QCML could advance our understanding of cognitive phenomena within the framework of quantum cognition.
[342] A Study on Variants of Conventional, Fuzzy, and Nullspace-Based Independence Criteria for Improving Supervised and Unsupervised Learning
Mojtaba Moattari
Main category: cs.LG
TL;DR: The paper proposes 3 independence criteria for unsupervised and supervised dimensionality reduction, outperforming baseline methods like tSNE, PCA, and VAE, and advancing interpretable ML.
Details
Motivation: Experts often struggle to ensure proposed nonlinearities in kernels maximize data variability and diversity. The study aims to address this by designing unsupervised learners using independence criteria.
Method: Reviewed independence criteria, proposed 3 new ones, and designed unsupervised/supervised dimensionality reduction methods. Evaluated contrast, accuracy, and interpretability in linear and neural nonlinear settings.
Result: The methods outperformed baselines (tSNE, PCA, regularized LDA, VAE) and introduced a new approach for interpretable ML.
Conclusion: The proposed methods advance interpretable ML and outperform existing techniques, offering a promising direction for future research.
Abstract: Unsupervised and supervised learning methods conventionally use kernels to capture nonlinearities inherent in data structure. However, experts have to ensure their proposed nonlinearity maximizes variability and captures the inherent diversity of data. We reviewed all independence criteria to design unsupervised learners. Then we proposed 3 independence criteria and used them to design unsupervised and supervised dimensionality reduction methods. We evaluated contrast, accuracy and interpretability of these methods in both linear and neural nonlinear settings. The results show that the methods have outperformed the baseline (tSNE, PCA, regularized LDA, VAE with (un)supervised learner and layer sharing) and opened a new line of interpretable machine learning (ML) for researchers.
[343] Advancing Wildfire Risk Prediction via Morphology-Aware Curriculum Contrastive Learning
Fabrizio Lo Scudo, Alessio De Rango, Luca Furnari, Alfonso Senatore, Donato D’Ambrosio, Giuseppe Mendicino, Gianluigi Greco
Main category: cs.LG
TL;DR: The paper proposes a contrastive learning framework to improve wildfire prediction by addressing data imbalance and high-dimensional spatio-temporal challenges, using smaller patch sizes without performance loss.
Details
Motivation: Wildfires, worsened by climate change, require advanced risk management. Current data imbalances and computational costs hinder effective deep learning solutions.
Method: Introduces morphology-based curriculum contrastive learning for better latent representations of dynamic features, validated through experimental analysis.
Result: The proposed method mitigates regional diversity issues and reduces computational costs while maintaining performance.
Conclusion: Contrastive learning enhances wildfire prediction models, addressing data imbalance and spatio-temporal complexity.
Abstract: Wildfires significantly impact natural ecosystems and human health, leading to biodiversity loss, increased hydrogeological risks, and elevated emissions of toxic substances. Climate change exacerbates these effects, particularly in regions with rising temperatures and prolonged dry periods, such as the Mediterranean. This requires the development of advanced risk management strategies that utilize state-of-the-art technologies. However, in this context, the data show a bias toward an imbalanced setting, where the incidence of wildfire events is significantly lower than typical situations. This imbalance, coupled with the inherent complexity of high-dimensional spatio-temporal data, poses significant challenges for training deep learning architectures. Moreover, since precise wildfire predictions depend mainly on weather data, finding a way to reduce computational costs to enable more frequent updates using the latest weather forecasts would be beneficial. This paper investigates how adopting a contrastive framework can address these challenges through enhanced latent representations for the patch’s dynamic features. We thus introduce a new morphology-based curriculum contrastive learning that mitigates issues associated with diverse regional characteristics and enables the use of smaller patch sizes without compromising performance. An experimental analysis is performed to validate the effectiveness of the proposed modeling strategies.
[344] Deep Unfolding for MIMO Signal Detection
Hangli Ge, Noboru Koshizuka
Main category: cs.LG
TL;DR: A deep unfolding neural network-based MIMO detector using Wirtinger calculus, called DPST, offers efficient, interpretable, and low-complexity signal detection in the complex domain.
Details
Motivation: Prior methods rely on real-valued approximations, which misalign with the complex nature of signal processing tasks.
Method: Dynamic Partially Shrinkage Thresholding (DPST) operates natively in the complex domain with few trainable parameters.
Result: Superior detection performance with fewer iterations and lower computational complexity.
Conclusion: DPST is a practical solution for next-generation massive MIMO systems.
Abstract: In this paper, we propose a deep unfolding neural network-based MIMO detector that incorporates complex-valued computations using Wirtinger calculus. The method, referred to as Dynamic Partially Shrinkage Thresholding (DPST), enables efficient, interpretable, and low-complexity MIMO signal detection. Unlike prior approaches that rely on real-valued approximations, our method operates natively in the complex domain, aligning with the fundamental nature of signal processing tasks. The proposed algorithm requires only a small number of trainable parameters, allowing for simplified training. Numerical results demonstrate that the proposed method achieves superior detection performance with fewer iterations and lower computational complexity, making it a practical solution for next-generation massive MIMO systems.
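The complex-domain ingredient can be illustrated with a magnitude soft-threshold that preserves phase, applied inside an unfolded ISTA-style iteration; the learned, dynamic, and partial scheduling that gives DPST its name is not reproduced in this toy.

```python
import numpy as np

def complex_soft_threshold(z: np.ndarray, tau: float) -> np.ndarray:
    """Shrink |z| by tau while keeping the phase of each entry."""
    mag = np.abs(z)
    return np.maximum(mag - tau, 0.0) / np.maximum(mag, 1e-12) * z

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))          # channel
x = (rng.integers(0, 2, 4) * 2 - 1) + 1j * (rng.integers(0, 2, 4) * 2 - 1)
y = H @ x + 0.05 * (rng.normal(size=8) + 1j * rng.normal(size=8))   # rx signal

# Unfolded ISTA-style layers: gradient step on 0.5*||y - Hz||^2, then shrink.
z = np.zeros(4, dtype=complex)
step = 1.0 / np.linalg.norm(H, 2) ** 2
for _ in range(30):
    z = complex_soft_threshold(z + step * (H.conj().T @ (y - H @ z)), tau=0.01)
print(np.round(z, 2))   # close to the transmitted QPSK symbols x
```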
[345] LLAMAPIE: Proactive In-Ear Conversation Assistants
Tuochao Chen, Nicholas Batchelder, Alisa Liu, Noah Smith, Shyamnath Gollakota
Main category: cs.LG
TL;DR: LlamaPIE is a real-time proactive assistant for enhancing human conversations via hearable devices, using a two-model pipeline for discreet guidance without explicit user invocation.
Details
Motivation: To improve human conversations by providing unobtrusive, context-aware assistance without disrupting the flow of dialogue.
Method: A two-model pipeline: a small model decides when to respond, and a larger model generates concise responses. Evaluated on real-world datasets and implemented on Apple Silicon M2 hardware.
Result: User studies show strong preference for LlamaPIE over no assistance and reactive models, proving its effectiveness.
Conclusion: LlamaPIE successfully enhances live conversations through proactive, discreet assistance.
Abstract: We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive model, highlighting the potential of LlamaPIE to enhance live conversations.
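The two-model pipeline reduces to a cheap gate that runs on every utterance and a heavier responder that is called only when the gate fires. Both models below are hypothetical stubs; the real system uses trained language models and on-device inference.

```python
def small_gate(dialogue: list[str]) -> bool:
    """Tiny 'when to speak' model: fire only when the user seems stuck (stub)."""
    last = dialogue[-1].lower()
    return any(cue in last for cue in ("what was", "i forget", "remind me"))

def large_responder(dialogue: list[str], user_memory: dict) -> str:
    """Larger model: craft a short, discreet hint (stub)."""
    return user_memory.get("colleague_name", "...")

dialogue = [
    "Hey! Great to see you again.",
    "You too! Sorry, what was your colleague's name again?",
]
memory = {"colleague_name": "Priya"}

if small_gate(dialogue):                     # cheap check on every utterance
    print(f"(in-ear) {large_responder(dialogue, memory)}")   # rare heavy call
```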
[346] Deep Reinforcement Learning for Real-Time Green Energy Integration in Data Centers
Abderaouf Bahi, Amel Ourici
Main category: cs.LG
TL;DR: A DRL-optimized energy management system for e-commerce data centers reduces energy costs by 38%, improves efficiency by 82%, and cuts carbon emissions by 45%, outperforming traditional RL and heuristic methods.
Details
Motivation: To enhance energy efficiency, cost-effectiveness, and environmental sustainability in e-commerce data centers by dynamically managing renewable energy, storage, and grid power.
Method: Uses Deep Reinforcement Learning (DRL) algorithms to adapt to real-time energy fluctuations, integrating renewable sources, storage, and grid power.
Result: Achieves 38% lower energy costs, 82% higher efficiency, 45% fewer emissions, and a 1.5% SLA violation rate, outperforming RL (28%, 3.0%) and heuristic methods (22%, 4.8%).
Conclusion: The DRL-optimized system is a robust solution for energy management, demonstrating DRL’s potential in sustainability and optimization.
Abstract: This paper explores the implementation of a Deep Reinforcement Learning (DRL)-optimized energy management system for e-commerce data centers, aimed at enhancing energy efficiency, cost-effectiveness, and environmental sustainability. The proposed system leverages DRL algorithms to dynamically manage the integration of renewable energy sources, energy storage, and grid power, adapting to fluctuating energy availability in real time. The study demonstrates that the DRL-optimized system achieves a 38% reduction in energy costs, significantly outperforming traditional Reinforcement Learning (RL) methods (28%) and heuristic approaches (22%). Additionally, it maintains a low SLA violation rate of 1.5%, compared to 3.0% for RL and 4.8% for heuristic methods. The DRL-optimized approach also results in an 82% improvement in energy efficiency, surpassing other methods, and a 45% reduction in carbon emissions, making it the most environmentally friendly solution. The system’s cumulative reward of 950 reflects its superior performance in balancing multiple objectives. Through rigorous testing and ablation studies, the paper validates the effectiveness of the DRL model’s architecture and parameters, offering a robust solution for energy management in data centers. The findings highlight the potential of DRL in advancing energy optimization strategies and addressing sustainability challenges.
[347] SPADE-S: A Sparsity-Robust Foundational Forecaster
Malcolm Wolff, Matthew Li, Ravi Kiran Selvam, Hanjing Zhu, Kin G. Olivares, Ruijun Ma, Abhinav Katoch, Shankar Ramasubramanian, Mengfei Cao, Roberto Bandarra, Rahul Gopalsamy, Stefania La Vattiata, Sitan Yang, Michael M. Mahoney
Main category: cs.LG
TL;DR: SPADE-S is a forecasting architecture addressing biases in time series with low magnitude or sparsity, improving accuracy by up to 15% in demand forecasting.
Details
Motivation: Existing models struggle with heterogeneous time series, especially those with low magnitude or sparsity, due to biased loss functions and encoding limitations.
Method: SPADE-S reduces biases in magnitude and sparsity through robust architecture design.
Result: SPADE-S outperforms state-of-the-art models, achieving up to 15% better accuracy, with significant gains in P90 and P50 forecasts across large datasets.
Conclusion: SPADE-S effectively addresses systematic biases in time series forecasting, enhancing accuracy for diverse and sparse datasets.
Abstract: Despite significant advancements in time series forecasting, accurate modeling of time series with strong heterogeneity in magnitude and/or sparsity patterns remains challenging for state-of-the-art deep learning architectures. We identify several factors that lead existing models to systematically underperform on low-magnitude and sparse time series, including loss functions with implicit biases toward high-magnitude series, training-time sampling methods, and limitations of time series encoding methods. SPADE-S is a robust forecasting architecture that significantly reduces magnitude- and sparsity-based systematic biases and improves overall prediction accuracy. Empirical results demonstrate that SPADE-S outperforms existing state-of-the-art approaches across a diverse set of use cases in demand forecasting. In particular, we show that, depending on the quantile forecast and magnitude of the series, SPADE-S can improve forecast accuracy by up to 15%. This results in P90 overall forecast accuracy gains of 2.21%, 6.58%, and 4.28%, and P50 forecast accuracy gains of 0.92%, 0.77%, and 1.95%, respectively, for each of three distinct datasets, ranging from 3 million to 700 million series, from a large online retailer.
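The magnitude bias the abstract refers to is easy to see numerically: with an unweighted quantile (pinball) loss, one high-volume series can dominate the aggregate training signal of many sparse ones. The sketch below illustrates the effect on synthetic data; it is not SPADE-S itself.

```python
import numpy as np

def pinball(y, yhat, q=0.5):
    """Quantile (pinball) loss; q = 0.5 recovers half the absolute error."""
    e = y - yhat
    return np.mean(np.maximum(q * e, (q - 1) * e))

rng = np.random.default_rng(0)
big = rng.normal(1000.0, 50.0, 100)      # one high-volume series
sparse = rng.poisson(0.2, (50, 100))     # fifty sparse, intermittent series

loss_big = pinball(big, np.full_like(big, big.mean()))
loss_sparse = np.mean([pinball(s, np.full(100, s.mean())) for s in sparse])
# The high-magnitude series contributes a loss orders of magnitude larger,
# so an unweighted aggregate effectively ignores the sparse series.
print(f"big-series loss: {loss_big:.2f}, mean sparse-series loss: {loss_sparse:.4f}")
```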
[348] Handling Out-of-Distribution Data: A Survey
Lakpa Tamang, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
Main category: cs.LG
TL;DR: The paper addresses distribution shifts in ML, focusing on covariate and concept shifts, reviews existing methods, and suggests future research directions.
Details
Motivation: To formalize distribution shifts, highlight limitations of conventional methods, and advocate for models robust to all shift types.
Method: Review and analyze existing techniques for detecting, measuring, and mitigating distribution shifts, with a focus on OOD data.
Result: Identifies gaps in current methods and emphasizes the need for comprehensive solutions.
Conclusion: Calls for future research to develop models capable of handling all types of distribution shifts effectively.
Abstract: In the field of Machine Learning (ML) and data-driven applications, one of the significant challenges is the change in data distribution between the training and deployment stages, commonly known as distribution shift. This paper outlines different mechanisms for handling two main types of distribution shifts: (i) Covariate shift: where the values of features or covariates change between train and test data, and (ii) Concept/Semantic shift: where the model experiences a shift in the concept learned during training due to the emergence of novel classes in the test phase. We summarize our contributions in three parts. First, we formalize distribution shifts, review how conventional methods fail to handle them adequately, and argue for a model that can simultaneously perform better under all types of distribution shifts. Second, we discuss why handling distribution shifts is important and provide an extensive review of the methods and techniques that have been developed to detect, measure, and mitigate the effects of these shifts. Third, we discuss the current state of distribution shift handling mechanisms and propose future research directions in this area. Overall, we provide a retrospective synopsis of the literature on distribution shift, focusing on OOD data that has been overlooked in existing surveys.
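As a concrete companion to the survey's covariate-shift discussion, a widely used practical check is the domain-classifier (classifier two-sample) test: train a classifier to distinguish training from test features, and read an AUC well above 0.5 as evidence of shift. This generic sketch is not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 5))
X_test = rng.normal(0.5, 1.0, size=(500, 5))   # shifted mean => covariate shift

X = np.vstack([X_train, X_test])
domain = np.r_[np.zeros(500), np.ones(500)]    # 0 = train, 1 = test

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean()
print(f"domain-classifier AUC = {auc:.2f} (about 0.5 would mean no detectable shift)")
```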
[349] OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection
Nicolas Pinon, Carole Lartizien
Main category: cs.LG
TL;DR: A novel method for unsupervised anomaly detection (UAD) tightly couples representation learning with a one-class SVM (OCSVM) to address limitations of existing approaches, demonstrating robustness in tasks like MNIST-C and brain MRI lesion detection.
Details
Motivation: Existing UAD methods either reconstruct anomalies too well or suffer from suboptimal feature spaces. Recent attempts to couple feature learning and anomaly detection rely on surrogate objectives or approximations, limiting expressiveness and robustness.
Method: Proposes a custom loss formulation aligning latent features with the OCSVM decision boundary, tightly coupling representation learning with an analytically solvable OCSVM.
Result: Outperforms existing methods in detecting small, non-hyperintense brain lesions and shows robustness to domain shifts (e.g., MNIST-C corruptions, MRI scanner/age variations).
Conclusion: The method offers improved performance and robustness for UAD, particularly in medical imaging applications, with potential for broader use.
Abstract: Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that tightly couples representation learning with an analytically solvable one-class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a new benchmark based on MNIST-C, and a challenging brain MRI subtle lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions, while we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and scanner/age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning
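For orientation, here is the decoupled baseline the paper argues against: learn a representation first, then fit a one-class SVM on it. The paper's contribution, a loss that back-propagates the OCSVM decision boundary into the encoder, is not reproduced in this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 20))           # anomaly-free training data
test = np.vstack([rng.normal(0.0, 1.0, size=(50, 20)),
                  rng.normal(4.0, 1.0, size=(5, 20))])  # 5 anomalies at the end

encoder = PCA(n_components=5).fit(normal)               # stand-in feature learner
ocsvm = OneClassSVM(kernel="rbf", nu=0.05).fit(encoder.transform(normal))

scores = ocsvm.decision_function(encoder.transform(test))  # negative => anomalous
print("indices flagged as anomalous:", np.where(scores < 0)[0])
```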
[350] AGORA: Incentivizing Group Emergence Capability in LLMs via Group Distillation
Ren Zhuang, Ben Wang, Shuifa Sun
Main category: cs.LG
TL;DR: AGORA introduces structured interaction as a new scaling axis, surpassing traditional parameter scaling, and achieves up to a 4.45-percentage-point gain on math benchmarks through collaborative reasoning.
Details
Motivation: Current training datasets are static, limiting progress in complex reasoning. The paper aims to explore interaction as a scalable driver of intelligence.
Method: AGORA, a self-evolving framework, uses a collaborative ensemble of models to enable structured interaction and group emergent abilities.
Result: AGORA outperforms state-of-the-art monolithic systems by up to 4.45 percentage points on challenging mathematical benchmarks.
Conclusion: Engineering collaborative ecosystems is a vital frontier for capability emergence, validating interaction as a scalable driver of intelligence.
Abstract: Progress in complex reasoning is constrained by the static nature of the current training datasets. We propose structured interaction as a new scaling axis, moving beyond the prevailing paradigm of increasing model parameters. Our self-evolving framework, AGORA, enables a collaborative ensemble to achieve reasoning performance exceeding state-of-the-art monolithic systems by up to 4.45 percentage points on challenging mathematical benchmarks. This gain stems from group emergent ability: the synthesis of collective capabilities unattainable by isolated models, validating interaction as a scalable driver of intelligence. Our results position the engineering of collaborative ecosystems as a vital frontier for capability emergence.
[351] LLM-Adapted Interpretation Framework for Machine Learning Models
Yuqi Jin, Zihan Hu, Weiteng Zhang, Weihao Xie, Jianwei Shuai, Xian Shen, Zhen Feng
Main category: cs.LG
TL;DR: LAI-ML framework improves XGBoost’s interpretability for sarcopenia risk assessment, achieving higher accuracy and generating transparent clinical narratives.
Details
Motivation: To address the lack of interpretability in high-performance ML models like XGBoost, hindering clinical adoption.
Method: Proposes LAI-ML, a knowledge distillation architecture using HAGA and CACS techniques to transform XGBoost feature attributions into probabilistic formats, then employs an LLM with reinforcement learning for narrative generation.
Result: Achieved 83% prediction accuracy (13% higher than baseline XGBoost) and corrected predictions in 21.7% of discordant cases.
Conclusion: LAI-ML successfully translates opaque predictions into interpretable clinical insights, solving the ‘black-box’ problem in medical AI.
Abstract: Background & Aims: High-performance machine learning models like XGBoost are often “black boxes,” limiting their clinical adoption due to a lack of interpretability. This study aims to bridge the gap between predictive accuracy and narrative transparency for sarcopenia risk assessment. Methods: We propose the LLM-Adapted Interpretation Framework (LAI-ML), a novel knowledge distillation architecture. LAI-ML transforms feature attributions from a trained XGBoost model into a probabilistic format using specialized techniques (HAGA and CACS). A Large Language Model (LLM), guided by a reinforcement learning loop and case-based retrieval, then generates data-faithful diagnostic narratives. Results: The LAI-ML framework achieved 83% prediction accuracy, 13% higher than the baseline XGBoost model. Notably, the LLM not only replicated the teacher model’s logic but also corrected its predictions in 21.7% of discordant cases, demonstrating enhanced reasoning. Conclusion: LAI-ML effectively translates opaque model predictions into trustworthy and interpretable clinical insights, offering a deployable solution to the “black-box” problem in medical AI.
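The pipeline's first step, per-sample feature attributions from a trained XGBoost model, can be reproduced with the real shap library; the sketch below stops there and only hints at a probability-like reformatting, since the paper's HAGA/CACS transforms and the LLM narrative stage are not specified here.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one attribution per sample and feature

# A rough analogue of converting attributions to a probability-like format:
w = np.abs(shap_values[0])               # first patient's attribution magnitudes
print("normalized feature weights:", np.round(w / w.sum(), 3))
```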
[352] MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton
Main category: cs.LG
TL;DR: MaPPO is a new framework for aligning LLMs with human preferences by integrating prior reward knowledge into optimization, outperforming existing methods like DPO without extra hyperparameters.
Details
Motivation: Existing methods like DPO treat preference learning as MLE, lacking prior reward integration, which MaPPO addresses to improve alignment.
Method: MaPPO extends MLE to MAP by incorporating prior reward estimates, supporting offline and online settings, and works as a plugin for DPO variants.
Result: Empirical evaluations show MaPPO improves alignment performance on benchmarks like MT-Bench and AlpacaEval 2.0 without sacrificing efficiency.
Conclusion: MaPPO generalizes and enhances existing methods, offering consistent improvements in LLM alignment with human preferences.
Abstract: As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameter, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvement on DPO variants, including widely used SimPO, IPO, and CPO. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
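For reference, the MLE objective that MaPPO generalizes is the standard DPO loss; the generic MLE-to-MAP relation below indicates where a prior enters, but the paper's specific prior-reward term is not reproduced here.

```latex
% Standard DPO objective (maximum-likelihood view):
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)\right]
% Generic MLE -> MAP relation; MaPPO's concrete prior-reward term is in the paper:
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_\theta \; \log p(\mathcal{D} \mid \theta) + \log p(\theta)
```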
[353] EvoSLD: Automated Neural Scaling Law Discovery With Large Language Models
Haowei Lin, Xiangyu Wang, Jianzhu Ma, Yitao Liang
Main category: cs.LG
TL;DR: EvoSLD automates scaling law discovery using evolutionary algorithms and LLMs, outperforming manual methods and baselines in accuracy and efficiency.
Details
Motivation: Manual discovery of scaling laws is time-consuming and requires expertise; EvoSLD aims to automate this process.
Method: Uses evolutionary algorithms guided by LLMs to co-evolve symbolic expressions and optimization routines, handling diverse experimental settings.
Result: Rediscovers human-derived laws in some cases and surpasses them in others, reducing error significantly.
Conclusion: EvoSLD is accurate, interpretable, and efficient, potentially accelerating AI research.
Abstract: Scaling laws are fundamental mathematical relationships that predict how neural network performance evolves with changes in variables such as model size, dataset size, and computational resources. Traditionally, discovering these laws requires extensive human expertise and manual experimentation. We introduce EvoSLD, an automated framework for Scaling Law Discovery (SLD) that leverages evolutionary algorithms guided by Large Language Models (LLMs) to co-evolve symbolic expressions and their optimization routines. Formulated to handle scaling variables, control variables, and response metrics across diverse experimental settings, EvoSLD searches for parsimonious, universal functional forms that minimize fitting errors on grouped data subsets. Evaluated on five real-world scenarios from recent literature, EvoSLD rediscovers exact human-derived laws in two cases and surpasses them in others, achieving up to orders-of-magnitude reductions in normalized mean squared error on held-out test sets. Compared to baselines like symbolic regression and ablated variants, EvoSLD demonstrates superior accuracy, interpretability, and efficiency, highlighting its potential to accelerate AI research. Code is available at https://github.com/linhaowei1/SLD.
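For a sense of what a "scaling law" is concretely, the classical workflow EvoSLD automates fits a fixed parametric form to measurements; the power-law-plus-floor form below is a standard textbook choice, not an EvoSLD discovery.

```python
import numpy as np
from scipy.optimize import curve_fit

def law(n, a, b, c):
    """loss ≈ a * N^(-b) + c: power-law decay plus an irreducible floor."""
    return a * n ** (-b) + c

n = np.logspace(6, 10, 20)                                    # model sizes
rng = np.random.default_rng(0)
loss = law(n, 5e2, 0.3, 1.7) * rng.normal(1.0, 0.01, n.size)  # synthetic data

(a, b, c), _ = curve_fit(law, n, loss, p0=(3e2, 0.3, 1.0), maxfev=10_000)
print(f"fitted law: loss ≈ {a:.0f} * N^-{b:.2f} + {c:.2f}")
```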
[354] Embeddings to Diagnosis: Latent Fragility under Agentic Perturbations in Clinical LLMs
Raj Krishnan Vijayaraj
Main category: cs.LG
TL;DR: The paper introduces LAPD, a geometry-aware framework to evaluate clinical LLMs’ robustness under adversarial edits, revealing latent fragility despite high static benchmark performance.
Details
Motivation: Clinical LLMs often fail under small input shifts, but standard NLP metrics miss these issues. The goal is to detect and address latent representation shifts affecting diagnosis stability.
Method: Proposes LAPD with Latent Diagnosis Flip Rate (LDFR) to measure representational instability. Uses structured adversarial edits (masking, negation, synonym replacement, numeric variation) on clinical notes.
Result: Finds latent fragility in clinical LLMs under minimal changes. Validates LDFR on real clinical notes, showing generalizability.
Conclusion: Highlights the gap between surface robustness and semantic stability, emphasizing the need for geometry-aware auditing in clinical AI.
Abstract: LLMs for clinical decision support often fail under small but clinically meaningful input shifts such as masking a symptom or negating a finding, despite high performance on static benchmarks. These reasoning failures frequently go undetected by standard NLP metrics, which are insensitive to latent representation shifts that drive diagnosis instability. We propose a geometry-aware evaluation framework, LAPD (Latent Agentic Perturbation Diagnostics), which systematically probes the latent robustness of clinical LLMs under structured adversarial edits. Within this framework, we introduce Latent Diagnosis Flip Rate (LDFR), a model-agnostic diagnostic signal that captures representational instability when embeddings cross decision boundaries in PCA-reduced latent space. Clinical notes are generated using a structured prompting pipeline grounded in diagnostic reasoning, then perturbed along four axes: masking, negation, synonym replacement, and numeric variation to simulate common ambiguities and omissions. We compute LDFR across both foundation and clinical LLMs, finding that latent fragility emerges even under minimal surface-level changes. Finally, we validate our findings on 90 real clinical notes from the DiReCT benchmark (MIMIC-IV), confirming the generalizability of LDFR beyond synthetic settings. Our results reveal a persistent gap between surface robustness and semantic stability, underscoring the importance of geometry-aware auditing in safety-critical clinical AI.
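A flip rate of this kind is straightforward to compute once embeddings and a decision rule are fixed; the sketch below mimics the recipe (PCA-reduced embeddings, a classifier boundary, a perturbation, and the fraction of flipped predictions) with toy stand-ins. LDFR's exact definition is in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))             # stand-in note embeddings
diag = (emb[:, 0] > 0).astype(int)           # stand-in diagnosis labels

pca = PCA(n_components=8).fit(emb)
clf = LogisticRegression().fit(pca.transform(emb), diag)

perturbed = emb + rng.normal(scale=0.8, size=emb.shape)  # e.g., a negation edit
orig_pred = clf.predict(pca.transform(emb))
pert_pred = clf.predict(pca.transform(perturbed))

flip_rate = np.mean(orig_pred != pert_pred)  # flips across the decision boundary
print(f"flip rate = {flip_rate:.2%}")
```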
[355] Operator-Based Machine Intelligence: A Hilbert Space Framework for Spectral Learning and Symbolic Reasoning
Andrew Kiruluta, Andreas Lemos, Priscilla Burity
Main category: cs.LG
TL;DR: The paper explores machine learning in infinite-dimensional Hilbert spaces, using tools like RKHS and spectral theory, and compares it to traditional neural networks.
Details
Motivation: To provide an alternative to finite-dimensional machine learning models by leveraging infinite-dimensional Hilbert spaces for more expressive and interpretable learning.
Method: Uses Reproducing Kernel Hilbert Spaces (RKHS), spectral operator learning, and wavelet-domain representations, along with scattering transforms and Koopman operators.
Result: A rigorous mathematical framework for learning in Hilbert spaces, with insights into advantages and limitations over conventional neural networks.
Conclusion: Proposes scalable and interpretable machine learning directions based on Hilbertian signal processing.
Abstract: Traditional machine learning models, particularly neural networks, are rooted in finite-dimensional parameter spaces and nonlinear function approximations. This report explores an alternative formulation where learning tasks are expressed as sampling and computation in infinite dimensional Hilbert spaces, leveraging tools from functional analysis, signal processing, and spectral theory. We review foundational concepts such as Reproducing Kernel Hilbert Spaces (RKHS), spectral operator learning, and wavelet-domain representations. We present a rigorous mathematical formulation of learning in Hilbert spaces, highlight recent models based on scattering transforms and Koopman operators, and discuss advantages and limitations relative to conventional neural architectures. The report concludes by outlining directions for scalable and interpretable machine learning grounded in Hilbertian signal processing.
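The canonical instance of learning in an RKHS, and a useful anchor for the report's framework, is kernel ridge regression: by the representer theorem, the learned function is a finite kernel expansion even though the hypothesis space is infinite-dimensional. A minimal sketch:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# f(x) = sum_i alpha_i k(x_i, x) lives in the RKHS of the RBF kernel
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)
print("in-sample R^2:", round(model.score(X, y), 3))
```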
[356] Beyond Neural Networks: Symbolic Reasoning over Wavelet Logic Graph Signals
Andrew Kiruluta, Andreas Lemos, Priscilla Burity
Main category: cs.LG
TL;DR: A non-neural learning framework using Graph Laplacian Wavelet Transforms (GLWT) for graph-based tasks, offering interpretability and efficiency.
Details
Motivation: To provide a transparent and resource-efficient alternative to neural networks for graph learning.
Method: Uses GLWT for signal decomposition, nonlinear shrinkage, and symbolic logic over wavelet coefficients, combined with a domain-specific language (DSL).
Result: Competes with lightweight GNNs in tasks like denoising and token classification while being more interpretable and efficient.
Conclusion: Proposes a principled, interpretable, and efficient non-neural approach for graph learning.
Abstract: We present a fully non-neural learning framework based on Graph Laplacian Wavelet Transforms (GLWT). Unlike traditional architectures that rely on convolutional, recurrent, or attention-based neural networks, our model operates purely in the graph spectral domain using structured multiscale filtering, nonlinear shrinkage, and symbolic logic over wavelet coefficients. Signals defined on graph nodes are decomposed via GLWT, modulated with interpretable nonlinearities, and recombined for downstream tasks such as denoising and token classification. The system supports compositional reasoning through a symbolic domain-specific language (DSL) over graph wavelet activations. Experiments on synthetic graph denoising and linguistic token graphs demonstrate competitive performance against lightweight GNNs with far greater transparency and efficiency. This work proposes a principled, interpretable, and resource-efficient alternative to deep neural architectures for learning on graphs.
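The spectral-domain mechanics are easy to demonstrate without any wavelet machinery: project a graph signal onto the Laplacian eigenbasis, shrink high-frequency coefficients, and reconstruct. The crude low-pass filter below stands in for GLWT's multiscale filtering and is not the paper's transform.

```python
import numpy as np

n = 30
A = np.zeros((n, n))
for i in range(n - 1):                 # a simple path graph
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A         # combinatorial graph Laplacian

evals, U = np.linalg.eigh(L)           # columns of U = graph Fourier basis
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 3, n)) + 0.3 * rng.normal(size=n)  # smooth + noise

coeffs = U.T @ x                       # graph Fourier transform
coeffs[evals > 1.0] = 0.0              # shrink high-frequency components
x_denoised = U @ coeffs                # inverse transform
print("energy removed by filtering:", round(float(np.sum((x - x_denoised) ** 2)), 3))
```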
[357] Exploring Adaptive Structure Learning for Heterophilic Graphs
Garv Kaushik
Main category: cs.LG
TL;DR: The paper proposes structure learning to rewire edges in shallow GCNs to capture long-range dependencies in heterophilic graphs, addressing oversmoothing and performance degradation.
Details
Motivation: To improve GCN performance on heterophilic graphs by enabling long-range dependency capture, which is hindered by localized feature aggregation.
Method: Parameterizing the adjacency matrix to learn connections between non-local nodes and extending the hop span of shallow GCNs.
Result: The method captures long-range dependencies but lacks generalizability across heterophilic graphs and performs inconsistently in node classification.
Conclusion: While effective for specific cases, the method’s inconsistency and lack of generalizability highlight the need for further refinement.
Abstract: Graph Convolutional Networks (GCNs) gained traction for graph representation learning, with recent attention on improving performance on heterophilic graphs for various real-world applications. The localized feature aggregation in a typical message-passing paradigm hinders the capturing of long-range dependencies between non-local nodes of the same class. The inherent connectivity structure in heterophilic graphs often conflicts with information sharing between distant nodes of the same class. We propose structure learning to rewire edges in shallow GCNs themselves to avoid performance degradation in downstream discriminative tasks due to oversmoothing. Parameterizing the adjacency matrix to learn connections between non-local nodes and extend the hop span of shallow GCNs facilitates the capturing of long-range dependencies. However, our method is not generalizable across heterophilic graphs and performs inconsistently on the node classification task, contingent on the graph structure.
[358] EdgeAgentX-DT: Integrating Digital Twins and Generative AI for Resilient Edge Intelligence in Tactical Networks
Abir Ray
Main category: cs.LG
TL;DR: EdgeAgentX-DT enhances edge intelligence in military networks using digital twins and generative AI for robust training and validation.
Details
Motivation: To improve edge intelligence in contested military environments by integrating digital twins and generative AI for realistic training.
Method: Uses network digital twins and generative AI (diffusion models, transformers) for scenario training in a multi-layer architecture.
Result: Faster learning, higher throughput, reduced latency, and improved resilience in simulations.
Conclusion: EdgeAgentX-DT demonstrates the effectiveness of digital-twin-enabled generative training for edge AI in contested environments.
Abstract: We introduce EdgeAgentX-DT, an advanced extension of the EdgeAgentX framework that integrates digital twin simulations and generative AI-driven scenario training to significantly enhance edge intelligence in military networks. EdgeAgentX-DT utilizes network digital twins, virtual replicas synchronized with real-world edge devices, to provide a secure, realistic environment for training and validation. Leveraging generative AI methods, such as diffusion models and transformers, the system creates diverse and adversarial scenarios for robust simulation-based agent training. Our multi-layer architecture includes: (1) on-device edge intelligence; (2) digital twin synchronization; and (3) generative scenario training. Experimental simulations demonstrate notable improvements over EdgeAgentX, including faster learning convergence, higher network throughput, reduced latency, and improved resilience against jamming and node failures. A case study involving a complex tactical scenario with simultaneous jamming attacks, agent failures, and increased network loads illustrates how EdgeAgentX-DT sustains operational performance, whereas baseline methods fail. These results highlight the potential of digital-twin-enabled generative training to strengthen edge AI deployments in contested environments.
[359] AdaptHetero: Machine Learning Interpretation-Driven Subgroup Adaptation for EHR-Based Clinical Prediction
Ling Liao, Eva Aagaard
Main category: cs.LG
TL;DR: AdaptHetero is an MLI-driven framework that uses interpretability insights to tailor model training and evaluation for EHR subpopulations, improving predictive performance.
Details
Motivation: The complexity and heterogeneity of EHR data limit the effectiveness of machine learning interpretation in guiding subgroup-specific modeling.
Method: The framework integrates SHAP-based interpretation and unsupervised clustering to identify clinically meaningful subgroup-specific characteristics.
Result: AdaptHetero consistently identifies heterogeneous model behaviors in predicting ICU mortality, in-hospital death, and hidden hypoxemia across three EHR datasets.
Conclusion: The framework enhances predictive performance by transforming interpretability insights into actionable guidance for subpopulation-specific modeling.
Abstract: Machine learning interpretation has primarily been leveraged to build clinician trust and uncover actionable insights in EHRs. However, the intrinsic complexity and heterogeneity of EHR data limit its effectiveness in guiding subgroup-specific modeling. We propose AdaptHetero, a novel MLI-driven framework that transforms interpretability insights into actionable guidance for tailoring model training and evaluation across subpopulations within individual hospital systems. Evaluated on three large-scale EHR datasets - GOSSIS-1-eICU, WiDS, and MIMIC-IV - AdaptHetero consistently identifies heterogeneous model behaviors in predicting ICU mortality, in-hospital death, and hidden hypoxemia. By integrating SHAP-based interpretation and unsupervised clustering, the framework enhances the identification of clinically meaningful subgroup-specific characteristics, leading to improved predictive performance.
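The core mechanic, clustering patients by their attribution vectors so that subgroups with different model behavior surface, fits in a few lines; the attribution vectors below are synthetic stand-ins rather than SHAP values from a real EHR model.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Pretend these are per-patient SHAP vectors over three features: one
# subgroup is driven by feature 0, the other by feature 1.
shap_vecs = np.vstack([rng.normal([2.0, 0.0, 0.0], 0.3, size=(100, 3)),
                       rng.normal([0.0, 2.0, 0.0], 0.3, size=(100, 3))])

groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(shap_vecs)
for g in (0, 1):
    print(f"subgroup {g}: mean attribution =", shap_vecs[groups == g].mean(axis=0).round(2))
```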
[360] Uncovering Gradient Inversion Risks in Practical Language Model Training
Xinguo Feng, Zhongkui Ma, Zihan Wang, Eu Joe Chegne, Mengyao Ma, Alsharif Abuadbba, Guangdong Bai
Main category: cs.LG
TL;DR: Grab, a gradient inversion attack for language models in FL, recovers up to 92.9% of private training data, outperforming prior methods by up to 48.5%.
Details
Motivation: Privacy threats of gradient inversion attacks in FL for language models are underestimated due to discrete token challenges.
Method: Grab uses hybrid optimization: alternating dropout mask optimization and discrete token sequencing.
Result: Achieves up to 92.9% recovery rate, surpassing prior methods by 28.9%-48.5%.
Conclusion: Grab advances understanding of privacy risks in FL for language models.
Abstract: The gradient inversion attack has been demonstrated as a significant privacy threat to federated learning (FL), particularly in continuous domains such as vision models. In contrast, it is often considered less effective or highly dependent on impractical training settings when applied to language models, due to the challenges posed by the discrete nature of tokens in text data. As a result, its potential privacy threats remain largely underestimated, despite FL being an emerging training method for language models. In this work, we propose a domain-specific gradient inversion attack named Grab (gradient inversion with hybrid optimization). Grab features two alternating optimization processes to address the challenges caused by practical training settings, including a simultaneous optimization on dropout masks between layers for improved token recovery and a discrete optimization for effective token sequencing. Grab can recover a significant portion (up to 92.9% recovery rate) of the private training data, outperforming the attack strategy of utilizing discrete optimization with an auxiliary model by notable improvements of up to 28.9% recovery rate in benchmark settings and 48.5% recovery rate in practical settings. Grab provides a valuable step forward in understanding this privacy threat in the emerging FL training mode of language models.
[361] Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications
Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q. S. Quek
Main category: cs.LG
TL;DR: The paper introduces a single compositional LLM for diverse IMAs, addressing adaptability and efficiency challenges with ContextLoRA and ContextGear, validated on benchmarks and a real-world testbed.
Details
Motivation: To overcome the inefficiency of using multiple LLMs for IMAs by proposing a unified approach that adapts to diverse tasks and operates efficiently in resource-constrained environments.
Method: Proposes ContextLoRA for learning structured context among IMAs via task dependency graphs and parameter partitioning, followed by a fine-tuning procedure. Introduces ContextGear for optimizing training costs.
Result: Superior performance on benchmarks and successful real-world wireless testbed implementation.
Conclusion: The proposed paradigm is effective and practical for diverse IMAs, with code to be released for community use.
Abstract: Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users’ personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.
[362] Learning from Limited and Imperfect Data
Harsh Rangwani
Main category: cs.LG
TL;DR: The paper addresses the challenge of training deep models on real-world, imbalanced data by developing robust algorithms for diverse scenarios like long-tail learning, inductive regularization, semi-supervised learning, and domain adaptation.
Details
Motivation: Real-world data is often imbalanced and imperfect, unlike curated datasets, leading to suboptimal performance of existing algorithms. The goal is to reduce reliance on labor-intensive data curation.
Method: Four-part approach: 1) Learning generative models for long-tail data, 2) Inductive regularization for tail class generalization, 3) Optimizing metrics for semi-supervised learning, and 4) Efficient domain adaptation with minimal labels.
Result: Improved performance in diverse scenarios, including better image generation for minority classes, effective generalization, and adaptation to new domains with limited data.
Conclusion: The proposed algorithms enable robust learning from imperfect, real-world data, reducing the need for extensive curation and expanding the applicability of deep models.
Abstract: The distribution of data in the world (e.g., the internet) significantly differs from well-curated datasets and is often over-populated with samples from common categories. The algorithms designed for well-curated datasets perform suboptimally when used for learning from imperfect datasets with long-tailed imbalances and distribution shifts. To expand the use of deep models, it is essential to overcome the labor-intensive curation process by developing robust algorithms that can learn from diverse, real-world data distributions. Toward this goal, we develop practical algorithms for Deep Neural Networks which can learn from limited and imperfect data present in the real world. This thesis is divided into four segments, each covering a scenario of learning from limited or imperfect data. The first part of the thesis focuses on Learning Generative Models from Long-Tail Data, where we mitigate the mode-collapse and enable diverse aesthetic image generations for tail (minority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as effectively as the head classes without requiring explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics for learning from long-tailed data with limited annotation (semi-supervised), followed by the fourth part, which focuses on the Efficient Domain Adaptation of the model to various domains with very few to zero labeled samples.
[363] Bubbleformer: Forecasting Boiling with Transformers
Sheikh Md Shakeel Hassan, Xianwei Zou, Akash Dhruv, Vishwanath Ganesan, Aparna Chandramowlishwaran
Main category: cs.LG
TL;DR: Bubbleformer, a transformer-based model, autonomously forecasts boiling dynamics (nucleation, interface evolution, heat transfer) without relying on simulation data during inference, outperforming existing methods.
Details
Motivation: Existing neural PDE surrogates fail to learn nucleation from past states and struggle with flow boiling velocity fields, limiting autonomous forecasting of boiling dynamics.
Method: Bubbleformer uses factorized axial attention, frequency-aware scaling, and thermophysical parameter conditioning to generalize across fluids, geometries, and conditions. It also introduces physics-based metrics for evaluation.
Result: Bubbleformer achieves benchmark results in predicting and forecasting two-phase boiling flows, validated by the high-fidelity BubbleML 2.0 dataset.
Conclusion: Bubbleformer advances modeling of chaotic boiling processes, offering stable, long-range forecasting without simulation data dependency, with potential applications in energy and thermal systems.
Abstract: Modeling boiling (an inherently chaotic, multiphase process central to energy and thermal systems) remains a significant challenge for neural PDE surrogates. Existing models require future input (e.g., bubble positions) during inference because they fail to learn nucleation from past states, limiting their ability to autonomously forecast boiling dynamics. They also fail to model flow boiling velocity fields, where sharp interface-momentum coupling demands long-range and directional inductive biases. We introduce Bubbleformer, a transformer-based spatiotemporal model that forecasts stable and long-range boiling dynamics including nucleation, interface evolution, and heat transfer without dependence on simulation data during inference. Bubbleformer integrates factorized axial attention, frequency-aware scaling, and conditions on thermophysical parameters to generalize across fluids, geometries, and operating conditions. To evaluate physical fidelity in chaotic systems, we propose interpretable physics-based metrics that evaluate heat-flux consistency, interface geometry, and mass conservation. We also release BubbleML 2.0, a high-fidelity dataset that spans diverse working fluids (cryogens, refrigerants, dielectrics), boiling configurations (pool and flow boiling), flow regimes (bubbly, slug, annular), and boundary conditions. Bubbleformer sets new benchmark results in both prediction and forecasting of two-phase boiling flows.
[364] Adaptive Multimodal Protein Plug-and-Play with Diffusion-Based Priors
Amartya Banerjee, Xingyu Xu, Caroline Moosmüller, Harlin Lee
Main category: cs.LG
TL;DR: Adam-PnP is a Plug-and-Play framework for guiding protein diffusion models with gradients from multiple noisy data sources, featuring adaptive noise estimation and dynamic weighting to reduce manual tuning.
Details
Motivation: Integrating noisy experimental data from multiple sources into deep generative models for protein structure recovery is challenging due to the need for precise noise knowledge and manual tuning.
Method: Adam-PnP uses an adaptive noise estimation scheme and dynamic modality weighting within the diffusion process to guide a pre-trained protein diffusion model.
Result: Experiments show Adam-PnP significantly improves accuracy in complex reconstruction tasks.
Conclusion: Adam-PnP effectively addresses the challenge of integrating heterogeneous data sources into diffusion models, reducing reliance on manual tuning.
Abstract: In an inverse problem, the goal is to recover an unknown parameter (e.g., an image) that has typically undergone some lossy or noisy transformation during measurement. Recently, deep generative models, particularly diffusion models, have emerged as powerful priors for protein structure generation. However, integrating noisy experimental data from multiple sources to guide these models remains a significant challenge. Existing methods often require precise knowledge of experimental noise levels and manually tuned weights for each data modality. In this work, we introduce Adam-PnP, a Plug-and-Play framework that guides a pre-trained protein diffusion model using gradients from multiple, heterogeneous experimental sources. Our framework features an adaptive noise estimation scheme and a dynamic modality weighting mechanism integrated into the diffusion process, which reduce the need for manual hyperparameter tuning. Experiments on complex reconstruction tasks demonstrate significantly improved accuracy using Adam-PnP.
[365] Deep Polynomial Chaos Expansion
Johannes Exenberger, Sascha Ranftl, Robert Peharz
Main category: cs.LG
TL;DR: DeepPCE combines PCE with probabilistic circuits to scale to high-dimensional problems, matching MLP performance while retaining PCE’s exact inference capabilities.
Details
Motivation: PCE struggles with high-dimensional problems due to the exponential growth of basis functions.
Method: DeepPCE integrates PCE with probabilistic circuits for scalability.
Result: DeepPCE achieves MLP-level predictive performance and exact statistical inference.
Conclusion: DeepPCE effectively addresses PCE’s scalability issues for high-dimensional problems.
Abstract: Polynomial chaos expansion (PCE) is a classical and widely used surrogate modeling technique in physical simulation and uncertainty quantification. By taking a linear combination of a set of basis polynomials - orthonormal with respect to the distribution of uncertain input parameters - PCE enables tractable inference of key statistical quantities, such as (conditional) means, variances, covariances, and Sobol sensitivity indices, which are essential for understanding the modeled system and identifying influential parameters and their interactions. As the number of basis functions grows exponentially with the number of parameters, PCE does not scale well to high-dimensional problems. We address this challenge by combining PCE with ideas from probabilistic circuits, resulting in the deep polynomial chaos expansion (DeepPCE) - a deep generalization of PCE that scales effectively to high-dimensional input spaces. DeepPCE achieves predictive performance comparable to that of multi-layer perceptrons (MLPs), while retaining PCE’s ability to compute exact statistical inferences via simple forward passes.
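In one dimension, the whole PCE recipe fits in a few lines: expand a function of a standard-normal input in orthonormal (probabilists') Hermite polynomials, and the mean and variance read off the coefficients exactly. This classical sketch shows the machinery DeepPCE scales up; it is not DeepPCE itself.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander
from math import factorial

rng = np.random.default_rng(0)
xi = rng.normal(size=2000)                  # uncertain input, xi ~ N(0, 1)
y = np.exp(0.5 * xi)                        # black-box model output

deg = 6
Phi = hermevander(xi, deg)                  # He_0 .. He_6, orthogonal under N(0,1)
Phi = Phi / np.sqrt([factorial(k) for k in range(deg + 1)])  # orthonormalize
c, *_ = np.linalg.lstsq(Phi, y, rcond=None) # least-squares PCE coefficients

print("PCE mean    :", round(c[0], 4), "(exact:", round(np.exp(1 / 8), 4), ")")
print("PCE variance:", round(float(np.sum(c[1:] ** 2)), 4))
```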
[366] Large Language Model-Enhanced Reinforcement Learning for Diverse and Novel Recommendations
Jiin Woo, Alireza Bagheri Garakani, Tianchen Zhou, Zhishen Huang, Yan Gao
Main category: cs.LG
TL;DR: LAAC (LLM-guided Adversarial Actor Critic) improves recommendation diversity and novelty by leveraging LLMs for suggestions and refining them with a lightweight policy, outperforming baselines in accuracy and robustness.
Details
Motivation: Diversity and novelty in recommendations are often sacrificed for click relevance, and existing RL methods rely on random exploration. LAAC addresses this by integrating LLM knowledge without costly fine-tuning.
Method: LAAC uses LLMs as reference policies to suggest novel items, trains a lightweight policy via bilevel optimization, and applies regularization to mitigate unreliable LLM suggestions.
Result: LAAC outperforms baselines in diversity, novelty, and accuracy, remaining robust on imbalanced data.
Conclusion: LAAC effectively integrates LLM guidance to enhance recommendation systems without expensive fine-tuning, balancing novelty and accuracy.
Abstract: In recommendation systems, diversity and novelty are essential for capturing varied user preferences and encouraging exploration, yet many systems prioritize click relevance. While reinforcement learning (RL) has been explored to improve diversity, it often depends on random exploration that may not align with user interests. We propose LAAC (LLM-guided Adversarial Actor Critic), a novel method that leverages large language models (LLMs) as reference policies to suggest novel items, while training a lightweight policy to refine these suggestions using system-specific data. The method formulates training as a bilevel optimization between actor and critic networks, enabling the critic to selectively favor promising novel actions and the actor to improve its policy beyond LLM recommendations. To mitigate overestimation of unreliable LLM suggestions, we apply regularization that anchors critic values for unexplored items close to well-estimated dataset actions. Experiments on real-world datasets show that LAAC outperforms existing baselines in diversity, novelty, and accuracy, while remaining robust on imbalanced data, effectively integrating LLM knowledge without expensive fine-tuning.
[367] Blending data and physics for reduced-order modeling of systems with spatiotemporal chaotic dynamics
Alex Guo, Michael D. Graham
Main category: cs.LG
TL;DR: A hybrid reduced-order model (ROM) combining data and full-order physics improves chaotic dynamics predictions, outperforming data-only methods in various scenarios.
Details
Motivation: Leverage known physics (full-order models) alongside data to enhance predictive capability in reduced-order modeling of chaotic systems.
Method: Develop a hybrid ROM using an autoencoder to find invariant manifold coordinates, project the FOM's vector field onto it, and correct it with data or use it as a Bayesian prior. Neural ODEs are employed.
Result: The hybrid approach significantly improves time-series predictions for Kuramoto-Sivashinsky and complex Ginzburg-Landau equations, even with scarce data or incorrect FOM parameters.
Conclusion: Integrating physics and data in ROMs enhances predictive accuracy, demonstrating robustness across diverse conditions.
Abstract: While data-driven techniques are powerful tools for reduced-order modeling of systems with chaotic dynamics, great potential remains for leveraging known physics (i.e. a full-order model (FOM)) to improve predictive capability. We develop a hybrid reduced order model (ROM), informed by both data and FOM, for evolving spatiotemporal chaotic dynamics on an invariant manifold whose coordinates are found using an autoencoder. This approach projects the vector field of the FOM onto the invariant manifold; then, this physics-derived vector field is either corrected using dynamic data, or used as a Bayesian prior that is updated with data. In both cases, the neural ordinary differential equation approach is used. We consider simulated data from the Kuramoto-Sivashinsky and complex Ginzburg-Landau equations. Relative to the data-only approach, for scenarios of abundant data, scarce data, and even an incorrect FOM (i.e. erroneous parameter values), the hybrid approach yields substantially improved time-series predictions.
[368] DEM-NeRF: A Neuro-Symbolic Method for Scientific Discovery through Physics-Informed Simulation
Wenkai Tan, Alvaro Velasquez, Houbing Song
Main category: cs.LG
TL;DR: A neuro-symbolic framework combines neural networks and symbolic physics to model elastic objects from sparse images, integrating NeRF for reconstruction and PINN for physics constraints.
Details
Motivation: Address the gap between purely empirical methods (risking deviation from physics) and traditional solvers (requiring full geometry and high cost).
Method: Uses neural radiance field (NeRF) for object reconstruction and physics-informed neural networks (PINN) with elasticity PDEs, incorporating energy constraints for boundary conditions.
Result: Learns spatiotemporal representations of deforming objects, balancing image data and physical laws for accurate, explainable simulations.
Conclusion: The framework successfully merges data-driven learning with physics, enabling high-fidelity simulations without explicit geometric knowledge.
Abstract: Neural networks have emerged as a powerful tool for modeling physical systems, offering the ability to learn complex representations from limited data while integrating foundational scientific knowledge. In particular, neuro-symbolic approaches that combine data-driven learning, the neuro, with symbolic equations and rules, the symbolic, address the tension between methods that are purely empirical, which risk straying from established physical principles, and traditional numerical solvers that demand complete geometric knowledge and can be prohibitively expensive for high-fidelity simulations. In this work, we present a novel neuro-symbolic framework for reconstructing and simulating elastic objects directly from sparse multi-view image sequences, without requiring explicit geometric information. Specifically, we integrate a neural radiance field (NeRF) for object reconstruction with physics-informed neural networks (PINN) that incorporate the governing partial differential equations of elasticity. In doing so, our method learns a spatiotemporal representation of deforming objects that leverages both image supervision and symbolic physical constraints. To handle complex boundary and initial conditions, which are traditionally confronted using finite element methods, boundary element methods, or sensor-based measurements, we employ an energy-constrained Physics-Informed Neural Network architecture. This design enhances both simulation accuracy and the explainability of results.
[369] A Contrastive Diffusion-based Network (CDNet) for Time Series Classification
Yaoyu Zhang, Chi-Guhn Lee
Main category: cs.LG
TL;DR: CDNet, a Contrastive Diffusion-based Network, improves deep learning classifiers for time series classification by generating informative samples via a learned diffusion process, enhancing performance under challenging conditions.
Details
Motivation: Deep learning models for time series classification struggle with class similarity, multimodal distributions, and noise. CDNet aims to address these limitations.
Method: CDNet uses a learned diffusion process to generate positive and negative samples, with convolutional approximations of reverse diffusion steps and an uncertainty-weighted composite loss for robust training.
Result: CDNet significantly improves state-of-the-art classifiers on the UCR Archive and simulated datasets, especially in noisy, similar, and multimodal conditions.
Conclusion: CDNet effectively enhances classifier performance under challenging data conditions, demonstrating its potential for robust time series classification.
Abstract: Deep learning models are widely used for time series classification (TSC) due to their scalability and efficiency. However, their performance degrades under challenging data conditions such as class similarity, multimodal distributions, and noise. To address these limitations, we propose CDNet, a Contrastive Diffusion-based Network that enhances existing classifiers by generating informative positive and negative samples via a learned diffusion process. Unlike traditional diffusion models that denoise individual samples, CDNet learns transitions between samples, both within and across classes, through convolutional approximations of reverse diffusion steps. We introduce a theoretically grounded CNN-based mechanism to enable both denoising and mode coverage, and incorporate an uncertainty-weighted composite loss for robust training. Extensive experiments on the UCR Archive and simulated datasets demonstrate that CDNet significantly improves state-of-the-art (SOTA) deep learning classifiers, particularly under noisy, similar, and multimodal conditions.
[370] Efficient Neural Combinatorial Optimization Solver for the Min-max Heterogeneous Capacitated Vehicle Routing Problem
Xuan Wu, Di Wang, Chunguo Wu, Kaifang Qi, Chunyan Miao, Yubin Xiao, Jian Zhang, You Zhou
Main category: cs.LG
TL;DR: ECHO, a Neural Combinatorial Optimization solver, addresses the min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP) by capturing local topology, reducing myopic decisions, and leveraging symmetry, outperforming existing solvers.
Details
Motivation: Existing solvers for MMHCVRP overlook key properties like local topology and symmetry, leading to suboptimal performance.
Method: ECHO uses a dual-modality node encoder, Parameter-Free Cross-Attention, and tailored data augmentation to address these limitations.
Result: ECHO outperforms state-of-the-art solvers in scalability and generalization, validated by extensive experiments.
Conclusion: The proposed methods in ECHO effectively improve MMHCVRP solving, with ablation studies confirming their impact.
Abstract: Numerous Neural Combinatorial Optimization (NCO) solvers have been proposed to address Vehicle Routing Problems (VRPs). However, most of these solvers focus exclusively on single-vehicle VRP variants, overlooking the more realistic min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP), which involves multiple vehicles. Existing MMHCVRP solvers typically select a vehicle and its next node to visit at each decoding step, but often make myopic decoding decisions and overlook key properties of MMHCVRP, including local topological relationships, vehicle permutation invariance, and node symmetry, resulting in suboptimal performance. To better address these limitations, we propose ECHO, an efficient NCO solver. First, ECHO exploits the proposed dual-modality node encoder to capture local topological relationships among nodes. Subsequently, to mitigate myopic decisions, ECHO employs the proposed Parameter-Free Cross-Attention mechanism to prioritize the vehicle selected in the preceding decoding step. Finally, leveraging vehicle permutation invariance and node symmetry, we introduce a tailored data augment strategy for MMHCVRP to stabilize the Reinforcement Learning training process. To assess the performance of ECHO, we conduct extensive experiments. The experimental results demonstrate that ECHO outperforms state-of-the-art NCO solvers across varying numbers of vehicles and nodes, and exhibits well-performing generalization across both scales and distribution patterns. Finally, ablation studies validate the effectiveness of all proposed methods.
[371] Systolic Array-based Accelerator for State-Space Models
Shiva Raja, Cansu Demirkiran, Aakash Sarkar, Milos Popovic, Ajay Joshi
Main category: cs.LG
TL;DR: The paper introduces EpochCore, a hardware accelerator for State-Space Models (SSMs), achieving significant performance and energy efficiency improvements over traditional methods.
Details
Motivation: Existing models (RNNs, CNNs, Transformers) struggle with long sequences due to memory limitations. SSMs offer better efficiency but are computationally intensive.
Method: EpochCore uses systolic arrays and a specialized processing element (LIMA-PE) with a novel dataflow (ProDF) to optimize SSM execution.
Result: EpochCore achieves 250x performance gains, 45x energy efficiency, and ~2,000x latency improvement over GPUs.
Conclusion: EpochCore is a promising solution for accelerating SSMs, balancing performance, efficiency, and area cost.
Abstract: Sequence modeling is crucial for AI to understand temporal data and detect complex time-dependent patterns. While recurrent neural networks (RNNs), convolutional neural networks (CNNs), and Transformers have advanced in capturing long-range dependencies, they struggle with achieving high accuracy with very long sequences due to limited memory retention (fixed context window). State-Space Models (SSMs) leverage exponentially decaying memory, enabling a lengthy context window, and so they process very long data sequences more efficiently than recurrent and Transformer-based models. Unlike traditional neural models like CNNs and RNNs, SSM-based models require solving differential equations through continuous integration, making training and inference both compute- and memory-intensive on conventional CPUs and GPUs. In this paper we introduce a specialized hardware accelerator, EpochCore, for accelerating SSMs. EpochCore is based on systolic arrays (SAs) and is designed to enhance the energy efficiency and throughput of inference of SSM-based models for long-range sequence tasks. Within the SA, we propose a versatile processing element (PE) called LIMA-PE to perform traditional and specialized MAC operations to support traditional DNNs and SSMs. To complement the EpochCore microarchitecture, we propose a novel dataflow, ProDF, which enables highly efficient execution of SSM-based models. By leveraging the LIMA-PE microarchitecture and ProDF, EpochCore achieves on average 250x gains in performance and 45x improvement in energy efficiency, at the expense of a 2x increase in area cost over traditional SA-based accelerators, and around 2,000x improvement in latency per inference on LRA datasets compared to GPU kernel operations.
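The kernel such an accelerator has to execute fast is the discretized state-space recurrence itself, scanned sequentially over a long input; the schematic below shows that computation (with the exponentially decaying memory the abstract mentions), not EpochCore's systolic dataflow.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 16, 10_000
A = np.diag(np.exp(-rng.uniform(0.01, 0.1, d_state)))  # decaying state memory
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))

u = rng.normal(size=seq_len)                # input sequence
x = np.zeros((d_state, 1))
y = np.empty(seq_len)
for k in range(seq_len):                    # x_{k+1} = A x_k + B u_k ; y_k = C x_k
    x = A @ x + B * u[k]
    y[k] = (C @ x).item()
print("last outputs:", y[-3:].round(3))
```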
[372] Enabling Pareto-Stationarity Exploration in Multi-Objective Reinforcement Learning: A Multi-Objective Weighted-Chebyshev Actor-Critic Approach
Fnu Hairi, Jiao Yang, Tianchen Zhou, Haibo Yang, Chaosheng Dong, Fan Yang, Michinari Momma, Yan Gao, Jia Liu
Main category: cs.LG
TL;DR: MOCHA, a multi-objective weighted-Chebyshev actor-critic algorithm, enables systematic exploration of Pareto-stationary solutions in MORL with finite-time sample complexity guarantees and outperforms baseline MORL approaches.
Details
Motivation: Systematically exploring Pareto-stationary solutions under multiple non-convex reward objectives with a finite-time sample complexity guarantee is an important but under-explored problem in MORL.
Method: MOCHA integrates weighted-Chebyshev scalarization with the actor-critic framework to enable systematic Pareto-stationarity exploration.
Result: The sample complexity per exploration can be made $\tilde{\mathcal{O}}(\epsilon^{-2})$, with a dependency on $p_{\min}$; on a large KuaiRand offline dataset, MOCHA significantly outperforms baseline MORL approaches.
Conclusion: MOCHA fills a gap in MORL by providing theoretically grounded, systematic Pareto-stationarity exploration.
Abstract: In many multi-objective reinforcement learning (MORL) applications, being able to systematically explore the Pareto-stationary solutions under multiple non-convex reward objectives with a theoretical finite-time sample complexity guarantee is an important and yet under-explored problem. This motivates us to take the first step and fill this important gap in MORL. Specifically, in this paper, we propose a Multi-Objective weighted-CHebyshev Actor-critic (MOCHA) algorithm for MORL, which judiciously integrates the weighted-Chebyshev (WC) scalarization and actor-critic framework to enable systematic Pareto-stationarity exploration with a finite-time sample complexity guarantee. The sample complexity result of the MOCHA algorithm reveals an interesting dependency on $p_{\min}$ in finding an $\epsilon$-Pareto-stationary solution, where $p_{\min}$ denotes the minimum entry of a given weight vector $\mathbf{p}$ in the WC scalarization. By carefully choosing learning rates, the sample complexity for each exploration can be $\tilde{\mathcal{O}}(\epsilon^{-2})$. Furthermore, simulation studies on a large KuaiRand offline dataset show that the MOCHA algorithm significantly outperforms other baseline MORL approaches.
[373] Data Leakage and Redundancy in the LIT-PCBA Benchmark
Amber Huang, Ian Scott Knight, Slava Naprienko
Main category: cs.LG
TL;DR: The LIT-PCBA benchmark is compromised by data leakage, duplication, and structural redundancy, making it unsuitable for fair model evaluation. A trivial memorization-based baseline outperforms state-of-the-art models, highlighting the dataset’s flaws.
Details
Motivation: To audit and expose the severe flaws in the LIT-PCBA benchmark, which undermine its validity for virtual screening and model evaluation.
Method: Identified data leakage, duplication, and structural redundancy in LIT-PCBA. Implemented a memorization-based baseline to demonstrate the dataset’s flaws.
Result: Found 2,491 duplicated inactives, leaked query ligands, and high structural redundancy (e.g., 80% near duplicates in some targets). The trivial baseline outperformed advanced models.
Conclusion: LIT-PCBA is unfit for its intended purpose due to critical flaws, and previous results using it are questionable. The audit aims to improve future dataset rigor.
Abstract: LIT-PCBA is a widely used benchmark for virtual screening, but our audit reveals it is fundamentally compromised. The dataset suffers from egregious data leakage, rampant duplication, and pervasive analog redundancy – flaws that invalidate its use for fair model evaluation. Notably, we identify 2,491 inactives duplicated across training and validation sets, and thousands more repeated within individual data splits (2,945 in training, 789 in validation). Critically, three ligands in the query set – meant to represent unseen test cases – are leaked: two appear in the training set, one in validation. Structural redundancy compounds these issues: for some targets, over 80% of query ligands are near duplicates, with Tanimoto similarity >= 0.9. In ALDH1 alone, we find 323 highly similar active pairs between training and validation sets, invalidating claims of chemical diversity. These and other flaws collectively cause models trained on LIT-PCBA to memorize rather than generalize. To demonstrate the consequences of these data integrity failures, we implement a trivial memorization-based baseline – using no learning, no physics, and no modeling – that outperforms state-of-the-art models, including deep neural networks like CHEESE, on LIT-PCBA simply by exploiting these artifacts. Our findings render the benchmark unfit for its intended purpose and call into question previous results based on its use. We share this audit to raise awareness and provide tooling to help the community develop more rigorous and reliable datasets going forward. All scripts necessary to reproduce our audit and the baseline implementation are available at: https://github.com/sievestack/LIT-PCBA-audit
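The headline result is easy to appreciate in code. Below is a minimal sketch of what a memorization-style baseline of this kind might look like, assuming canonicalized SMILES strings as molecule identifiers; the function names are illustrative, and this is not the authors' released implementation (their audit scripts live in the linked repository):

```python
# Hypothetical sketch of a memorization-style baseline that exploits
# train/validation duplication: no learning, no physics, no modeling.
# Assumes molecules are given as canonicalized SMILES strings.

def build_memory(train_smiles, train_labels):
    """Map every training SMILES to its activity label."""
    return dict(zip(train_smiles, train_labels))

def score(memory, query_smiles, default=0.0):
    """Return a 'prediction' by looking each molecule up verbatim.
    Duplicated inactives and leaked query ligands are answered
    perfectly; everything else falls back to a constant."""
    return [memory.get(s, default) for s in query_smiles]

# Toy usage: one leaked ligand, one unseen molecule.
memory = build_memory(["CCO", "c1ccccc1"], [1.0, 0.0])
print(score(memory, ["CCO", "CCN"]))  # -> [1.0, 0.0]
```

Because thousands of molecules are duplicated across splits, even this exact-match lookup answers many validation queries perfectly; the paper's analysis additionally exploits near-duplicates via Tanimoto similarity.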
[374] Torque-based Graph Surgery: Enhancing Graph Neural Networks with Hierarchical Rewiring
Sujia Huang, Lele Fu, Zhen Cui, Tong Zhang, Na Song, Bo Huang
Main category: cs.LG
TL;DR: The paper proposes a torque-driven hierarchical rewiring strategy for GNNs to improve representation learning in heterophilous and noisy graphs by dynamically adjusting message passing.
Details
Motivation: Native graph interactions may hinder effective message passing, prompting the need for rewiring methods to enhance learning and robustness.
Method: Introduces an interference-aware torque metric to quantify edge perturbations, guiding hierarchical rewiring by pruning high-torque edges and adding low-torque links.
Result: Outperforms state-of-the-art methods on heterophilous, homophilous, and noisy graphs.
Conclusion: The torque-driven rewiring strategy effectively improves GNN performance by optimizing message passing and reducing noise.
Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning from graph-structured data, leveraging message passing to diffuse information and update node representations. However, most efforts have suggested that native interactions encoded in the graph may not be well suited to this process, motivating the development of graph rewiring methods. In this work, we propose a torque-driven hierarchical rewiring strategy, inspired by the notion of torque in classical mechanics, dynamically modulating message passing to improve representation learning in heterophilous graphs and enhance robustness against noisy graphs. Specifically, we define an interference-aware torque metric that integrates structural distance and energy scores to quantify the perturbation induced by edges, thereby encouraging each node to aggregate information from its nearest low-energy neighbors. We use the metric to hierarchically reconfigure the receptive field of each layer by judiciously pruning high-torque edges and adding low-torque links, suppressing propagation noise and boosting pertinent signals. Extensive evaluations on benchmark datasets show that our approach surpasses state-of-the-art methods on both heterophilous and homophilous graphs, and maintains high accuracy on noisy graphs.
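The abstract does not give the exact torque formula, so the sketch below assumes the simplest reading of the mechanical analogy, torque as the product of a structural distance (lever arm) and an energy score (force), and shows how such a metric could drive pruning; all names and the quantile threshold are illustrative:

```python
import numpy as np

# Hedged sketch of torque-guided rewiring. The product below mirrors the
# mechanical analogy (torque = lever arm x force) and is an assumption,
# not the paper's exact metric.

def torque(dist, energy):
    return dist * energy

def rewire(edges, dist, energy, prune_quantile=0.9):
    """Prune the highest-torque edges; keep the rest."""
    t = np.array([torque(dist[e], energy[e]) for e in edges])
    keep = t <= np.quantile(t, prune_quantile)
    return [e for e, k in zip(edges, keep) if k]

edges = [(0, 1), (0, 2), (1, 2)]
dist = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 1.0}
energy = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.1}
print(rewire(edges, dist, energy))  # drops the high-torque edge (0, 2)
```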
[375] MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse
Kaiwen Chen, Xin Tan, Minchen Yu, Hong Xu
Main category: cs.LG
TL;DR: MemShare reduces memory overhead in Large Reasoning Models by reusing similar KV cache blocks, improving throughput by 84.79% without sacrificing accuracy.
Details
Motivation: LRMs generate redundant intermediate reasoning steps, leading to high memory usage. MemShare aims to optimize KV cache reuse.
Method: Uses collaborative filtering to identify reusable KV cache blocks and enables zero-copy cache reuse.
Result: Achieves up to 84.79% throughput improvement while maintaining accuracy.
Conclusion: MemShare effectively reduces memory overhead and enhances performance in LRMs.
Abstract: Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse, significantly reducing memory overhead and improving throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79% improvement in throughput while achieving better accuracy than existing KV cache management methods.
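A rough sketch of the core reuse idea, with plain cosine similarity standing in for the paper's collaborative filtering step and `BlockCache` as a hypothetical container; returning the cached array object itself is what makes the reuse zero-copy:

```python
import numpy as np

# Minimal sketch of similarity-based KV cache block reuse, assuming
# fixed-size KV blocks flattened to vectors. Cosine similarity is a
# stand-in for the paper's collaborative filtering algorithm.

class BlockCache:
    def __init__(self, threshold=0.98):
        self.blocks = []          # stored KV blocks (references, not copies)
        self.threshold = threshold

    def get_or_insert(self, block):
        v = block.ravel()
        for cached in self.blocks:
            c = cached.ravel()
            cos = v @ c / (np.linalg.norm(v) * np.linalg.norm(c) + 1e-9)
            if cos >= self.threshold:
                return cached     # zero-copy: hand back the existing block
        self.blocks.append(block)
        return block

cache = BlockCache()
a = np.ones((4, 8))
b = np.ones((4, 8)) * 1.001      # nearly identical reasoning step
print(cache.get_or_insert(a) is cache.get_or_insert(b))  # True: b reuses a
```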
[376] PVD-ONet: A Multi-scale Neural Operator Method for Singularly Perturbed Boundary Layer Problems
Tiantian Sun, Jian Zu
Main category: cs.LG
TL;DR: Proposes PVD-Net and PVD-ONet frameworks to solve singularly perturbed PDEs without data, outperforming existing methods.
Details
Motivation: Address the failure of Physics-informed neural networks in singularly perturbed problems.
Method: Two versions of PVD-Net (stability-focused and high-accuracy) and PVD-ONet for operator learning, using Prandtl’s and Van Dyke’s matching principles.
Result: Numerical experiments show superior performance over baselines in multi-scale problems.
Conclusion: PVD-Net and PVD-ONet offer effective solutions for singularly perturbed PDEs, enhancing stability and accuracy.
Abstract: Physics-informed neural networks and Physics-informed DeepONet excel in solving partial differential equations; however, they often fail to converge for singularly perturbed problems. To address this, we propose two novel frameworks, Prandtl-Van Dyke neural network (PVD-Net) and its operator learning extension Prandtl-Van Dyke Deep Operator Network (PVD-ONet), which rely solely on governing equations without data. To address varying task-specific requirements, both PVD-Net and PVD-ONet are developed in two distinct versions, tailored respectively for stability-focused and high-accuracy modeling. The leading-order PVD-Net adopts a two-network architecture combined with Prandtl’s matching condition, targeting stability-prioritized scenarios. The high-order PVD-Net employs a five-network design with Van Dyke’s matching principle to capture fine-scale boundary layer structures, making it ideal for high-accuracy scenarios. PVD-ONet generalizes PVD-Net to the operator learning setting by assembling multiple DeepONet modules, directly mapping initial conditions to solution operators and enabling instant predictions for an entire family of boundary layer problems without retraining. Numerical experiments on various models show that our proposed methods consistently outperform existing baselines under various error metrics, thereby offering a powerful new approach for multi-scale problems.
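For context, the Prandtl matching idea the leading-order PVD-Net builds on can be seen in the textbook example below, where an outer solution, a boundary-layer (inner) solution, and their common limit combine into a uniform composite approximation; the specific ODE is a standard illustration, not taken from the paper:

```python
import numpy as np

# Worked leading-order example of Prandtl matching for the classic
# singularly perturbed problem
#   eps*y'' + y' + y = 0,  y(0) = 0, y(1) = 1.
# Outer solution e^{1-x}, inner (boundary-layer) solution e*(1 - e^{-x/eps}),
# and their common limit e combine into a uniform composite approximation.

eps = 0.01
x = np.linspace(0.0, 1.0, 11)
outer = np.exp(1.0 - x)
inner = np.e * (1.0 - np.exp(-x / eps))
composite = outer + inner - np.e        # = e^{1-x} - e^{1 - x/eps}
print(np.round(composite, 4))           # steep layer near x=0, smooth outside
```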
[377] Retrieve-Augmented Generation for Speeding up Diffusion Policy without Additional Training
Sodtavilan Odonchimed, Tatsuya Matsushima, Simon Holk, Yusuke Iwasawa, Yutaka Matsuo
Main category: cs.LG
TL;DR: RAGDP is a novel framework that speeds up pre-trained Diffusion Policies (DPs) without extra training by using a knowledge base of expert demonstrations. It improves accuracy and speed trade-offs, outperforming distillation methods like CP.
Details
Motivation: DPs are slow due to multiple noise removal steps, and distillation methods like CP require lengthy training. RAGDP aims to address these issues by leveraging a knowledge base for faster inference.
Method: RAGDP encodes observation-action pairs into a vector database. During inference, it retrieves the most similar expert action and combines it with intermediate noise removal to reduce steps.
Result: RAGDP improves speed and accuracy trade-offs, achieving a 7% accuracy increase over CP even at 20x acceleration.
Conclusion: RAGDP offers a training-free solution to accelerate DPs while maintaining or improving accuracy, outperforming existing methods like CP.
Abstract: Diffusion Policies (DPs) have attracted attention for their ability to achieve significant accuracy improvements in various imitation learning tasks. However, DPs depend on Diffusion Models, which require multiple noise removal steps to generate a single action, resulting in long generation times. To solve this problem, knowledge distillation-based methods such as Consistency Policy (CP) have been proposed. However, these methods require a significant amount of training time, especially for difficult tasks. In this study, we propose RAGDP (Retrieve-Augmented Generation for Diffusion Policies) as a novel framework that eliminates the need for additional training using a knowledge base to expedite the inference of pre-trained DPs. Concretely, RAGDP encodes observation-action pairs through the DP encoder to construct a vector database of expert demonstrations. During inference, the current observation is embedded, and the most similar expert action is extracted. This extracted action is combined with an intermediate noise removal step to reduce the number of denoising steps required compared to the original diffusion process. We show that by using RAGDP with the base model and existing acceleration methods, we improve the accuracy and speed trade-off with no additional training. Even when the models are accelerated 20 times, RAGDP maintains an advantage in accuracy, with a 7% increase over distillation models such as CP.
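The inference loop reduces to retrieval plus a truncated denoising schedule. The sketch below assumes a generic `denoise(x, k)` callable and cosine similarity over embeddings; both are illustrative stand-ins for the DP encoder and vector database described above:

```python
import numpy as np

# Hedged sketch of RAGDP-style inference: embed the current observation,
# fetch the most similar expert action, then hand it to the diffusion
# policy as an intermediate starting point so fewer denoising steps run.

def nearest_expert_action(obs_emb, db_embs, db_actions):
    sims = db_embs @ obs_emb / (
        np.linalg.norm(db_embs, axis=1) * np.linalg.norm(obs_emb) + 1e-9)
    return db_actions[int(np.argmax(sims))]

def ragdp_infer(obs_emb, db_embs, db_actions, denoise, k_mid=10, noise=0.1):
    """Skip early diffusion steps by seeding from a retrieved action."""
    a0 = nearest_expert_action(obs_emb, db_embs, db_actions)
    x = a0 + noise * np.random.randn(*a0.shape)  # re-noise to step k_mid
    for k in range(k_mid, 0, -1):                # only the remaining steps
        x = denoise(x, k)
    return x

# Toy denoiser that just shrinks residual noise each step.
toy = lambda x, k: 0.9 * x
db_e = np.eye(3); db_a = np.array([[1.0], [2.0], [3.0]])
print(ragdp_infer(np.array([0.0, 1.0, 0.0]), db_e, db_a, toy))
```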
[378] Capacity-Constrained Continual Learning
Zheng Wen, Doina Precup, Benjamin Van Roy, Satinder Singh
Main category: cs.LG
TL;DR: The paper explores optimal resource allocation for agents with limited capacity, focusing on the capacity-constrained LQG sequential prediction problem. It provides solutions and steady-state capacity allocation for decomposable sub-problems.
Details
Motivation: To address the lack of attention on how agents with finite resources should allocate their capacity for optimal performance.
Method: Study of the capacity-constrained linear-quadratic-Gaussian (LQG) sequential prediction problem under technical conditions.
Result: Derived a solution for the problem and demonstrated optimal capacity allocation for decomposable sub-problems in steady state.
Conclusion: This work is a foundational step in systematically studying learning under capacity constraints.
Abstract: Any agents we can possibly build are subject to capacity constraints, as memory and compute resources are inherently finite. However, comparatively little attention has been dedicated to understanding how agents with limited capacity should allocate their resources for optimal performance. The goal of this paper is to shed some light on this question by studying a simple yet relevant continual learning problem: the capacity-constrained linear-quadratic-Gaussian (LQG) sequential prediction problem. We derive a solution to this problem under appropriate technical conditions. Moreover, for problems that can be decomposed into a set of sub-problems, we also demonstrate how to optimally allocate capacity across these sub-problems in the steady state. We view the results of this paper as a first step in the systematic theoretical study of learning under capacity constraints.
[379] Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning
Wenxuan Bao, Ruxi Deng, Ruizhong Qiu, Tianxin Wei, Hanghang Tong, Jingrui He
Main category: cs.LG
TL;DR: Latte is a novel framework for test-time adaptation in decentralized settings, using local and external memories to enhance model performance while minimizing communication costs.
Details
Motivation: Existing test-time adaptation methods struggle with limited data in decentralized settings and lack personalization for individual clients.
Method: Latte uses local and external memories to store historical test data and class prototypes, leveraging similarity and uncertainty for adaptation.
Result: Latte outperforms existing methods in decentralized settings with negligible added costs.
Conclusion: Latte effectively addresses distribution shifts in decentralized environments, balancing personalization and robustness.
Abstract: Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their training-free nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client’s unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server’s coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs. Our code is available at https://github.com/baowenxuan/Latte.
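A compressed, assumption-laden sketch of how a single client might combine its local memory with retrieved external prototypes; the exponential similarity weighting and the entropy-based uncertainty readout are plausible readings of the abstract, not the paper's exact equations:

```python
import numpy as np

# Illustrative sketch of Latte-style memory-based prediction on one client:
# local memory entries and external class prototypes both vote, weighted
# by embedding similarity; entropy serves as the uncertainty signal.

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(emb, local_mem, local_lab, ext_protos, n_cls, tau=0.07):
    logits = np.zeros(n_cls)
    for m, y in zip(local_mem, local_lab):        # local memory votes
        logits[y] += np.exp((emb @ m) / tau)
    for c, p in enumerate(ext_protos):            # external prototype votes
        logits[c] += np.exp((emb @ p) / tau)
    probs = softmax(np.log(logits + 1e-9))
    entropy = -(probs * np.log(probs + 1e-9)).sum()  # uncertainty readout
    return probs, entropy

emb = np.array([1.0, 0.0])
probs, H = predict(emb, [np.array([0.9, 0.1])], [0],
                   [np.array([1.0, 0.0]), np.array([0.0, 1.0])], 2)
print(probs.argmax(), round(float(H), 3))   # class 0, with its uncertainty
```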
[380] Evaluation and Benchmarking of LLM Agents: A Survey
Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip
Main category: cs.LG
TL;DR: The paper surveys LLM agent evaluation, proposing a taxonomy for objectives and processes, addressing enterprise challenges, and suggesting future research directions.
Details
Motivation: The complexity and underdevelopment of evaluating LLM-based agents necessitate a structured approach to guide researchers and practitioners.
Method: Introduces a two-dimensional taxonomy for evaluation (objectives and process) and discusses enterprise-specific challenges.
Result: Provides a framework for systematic assessment of LLM agents, highlighting gaps like reliability and compliance.
Conclusion: Aims to clarify the fragmented evaluation landscape and enable real-world deployment of LLM agents.
Abstract: The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives – what to evaluate, such as agent behavior, capabilities, reliability, and safety – and (2) evaluation process – how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling. In addition to taxonomy, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and scalable evaluation. This work aims to bring clarity to the fragmented landscape of agent evaluation and provide a framework for systematic assessment, enabling researchers and practitioners to evaluate LLM agents for real-world deployment.
[381] Hierarchical Stochastic Differential Equation Models for Latent Manifold Learning in Neural Time Series
Pedram Rajaei, Maryam Ostadsharif Memar, Navid Ziaei, Behzad Nazari, Ali Yousefi
Main category: cs.LG
TL;DR: A novel hierarchical SDE model is proposed to efficiently and interpretably uncover low-dimensional manifolds in high-dimensional neural time series, outperforming existing methods.
Details
Motivation: To address limitations of current latent dynamical variable models in balancing computational efficiency and interpretability for uncovering low-dimensional manifolds in neural time series.
Method: Uses hierarchical Brownian bridge SDEs to model latent space, with points sampled from a multivariate marked point process, and maps these to observed data for continuous, differentiable latent processes.
Result: The model accurately recovers manifold structure, scales linearly with data length, and performs well on synthetic and neural recording data.
Conclusion: The proposed SDE model effectively balances efficiency and interpretability, offering a robust solution for manifold reconstruction in neural time series.
Abstract: The manifold hypothesis suggests that high-dimensional neural time series lie on a low-dimensional manifold shaped by simpler underlying dynamics. To uncover this structure, latent dynamical variable models such as state-space models, recurrent neural networks, neural ordinary differential equations, and Gaussian Process Latent Variable Models are widely used. We propose a novel hierarchical stochastic differential equation (SDE) model that balances computational efficiency and interpretability, addressing key limitations of existing methods. Our model assumes the trajectory of a manifold can be reconstructed from a sparse set of samples from the manifold trajectory. The latent space is modeled using Brownian bridge SDEs, with points - specified in both time and value - sampled from a multivariate marked point process. These Brownian bridges define the drift of a second set of SDEs, which are then mapped to the observed data. This yields a continuous, differentiable latent process capable of modeling arbitrarily complex time series as the number of manifold points increases. We derive training and inference procedures and show that the computational cost of inference scales linearly with the length of the observation data. We then validate our model on both synthetic data and neural recordings to demonstrate that it accurately recovers the underlying manifold structure and scales effectively with data dimensionality.
[382] Categorical Distributions are Effective Neural Network Outputs for Event Prediction
Kevin Doran, Tom Baden
Main category: cs.LG
TL;DR: A simple neural network output (categorical probability distribution) is effective for next spike prediction, challenging the underuse of this approach in temporal point process models. Existing datasets often lack revealing information, and model performance may rely on regularization. Extended datasets show the simplicity of categorical distribution is competitive.
Details
Motivation: Investigate why simple categorical probability distributions are underused in neural temporal point process models, despite their effectiveness for next spike prediction.
Method: Use a simple neural network output (categorical distribution) for next spike prediction. Extend and create datasets to explore beyond information-limited regimes.
Result: Existing datasets often don’t reveal much about event-generating processes, and model performance may depend on regularization. The categorical distribution approach is competitive across diverse datasets.
Conclusion: Simple categorical probability distributions are effective and competitive for next spike prediction, suggesting reevaluation of their underuse in temporal point process models.
Abstract: We demonstrate the effectiveness of using a simple neural network output, a categorical probability distribution, for the task of next spike prediction. This case study motivates an investigation into why this simple output structure is not commonly used with neural temporal point process models. We find evidence that many existing datasets for evaluating temporal point process models do not reveal much information about the underlying event generating processes, and many existing models perform well due to regularization effects of model size and constraints on output structure. We extend existing datasets and create new ones in order to explore outside of this information limited regime and find that outputting a simple categorical distribution is competitive across a wide range of datasets.
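The proposed output head is deliberately simple, as the toy sketch below shows: discretize the time to the next spike into bins and train a categorical distribution over them with cross-entropy. The bin range, feature dimension, and linear readout are all illustrative:

```python
import numpy as np

# Minimal sketch of a categorical output head for next spike prediction:
# bin the inter-spike interval and fit a softmax over bins.

rng = np.random.default_rng(0)
B = 20
edges = np.linspace(0.0, 100.0, B + 1)          # assumed ms range

def to_bin(dt):
    return np.clip(np.digitize(dt, edges) - 1, 0, B - 1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy "network": logits from a linear readout of a history feature vector.
W = rng.normal(size=(8, B)) * 0.1
x = rng.normal(size=(32, 8))                    # batch of history features
dt = rng.uniform(0, 100, size=32)               # true inter-spike intervals

probs = softmax(x @ W)
y = to_bin(dt)
ce = -np.log(probs[np.arange(32), y] + 1e-9).mean()
print(f"cross-entropy: {ce:.3f}")
```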
[383] Hyperbolic Genome Embeddings
Raiyan R. Khan, Philippe Chlenski, Itsik Pe’er
Main category: cs.LG
TL;DR: Hyperbolic CNNs outperform Euclidean models in genomic sequence modeling, achieving better results with fewer parameters and no pretraining.
Details
Motivation: Aligning machine learning inductive biases with biological evolutionary structure for better DNA sequence representations.
Method: Novel application of hyperbolic CNNs to exploit evolutionary structure without explicit phylogenetic mapping.
Result: Outperforms Euclidean models on 37/42 benchmarks and surpasses state-of-the-art on 7 GUE datasets. Introduces Transposable Elements Benchmark.
Conclusion: Hyperbolic framework shows robust potential for genome representation learning.
Abstract: Current approaches to genomic sequence modeling often struggle to align the inductive biases of machine learning models with the evolutionarily-informed structure of biological systems. To this end, we formulate a novel application of hyperbolic CNNs that exploits this structure, enabling more expressive DNA sequence representations. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences pertaining to core functional and regulatory behavior. Across 37 out of 42 genome interpretation benchmark datasets, our hyperbolic models outperform their Euclidean equivalents. Notably, our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets, consistently outperforming many DNA language models while using orders of magnitude fewer parameters and avoiding pretraining. Our results include a novel set of benchmark datasets, the Transposable Elements Benchmark, which explores a major but understudied component of the genome with deep evolutionary significance. We further motivate our work by exploring how our hyperbolic models recognize genomic signal under various data-generating conditions and by constructing an empirical method for interpreting the hyperbolicity of dataset embeddings. Throughout these assessments, we find persistent evidence highlighting the potential of our hyperbolic framework as a robust paradigm for genome representation learning. Our code and benchmark datasets are available at https://github.com/rrkhan/HGE.
[384] DGP: A Dual-Granularity Prompting Framework for Fraud Detection with Graph-Enhanced LLMs
Yuan Li, Jun Hu, Bryan Hooi, Bingsheng He, Cheng Chen
Main category: cs.LG
TL;DR: Dual Granularity Prompting (DGP) improves fraud detection by summarizing neighbor information in graphs, balancing fine-grained target details with coarse-grained neighbor prompts.
Details
Motivation: Address the challenge of information overload in text-only prompting for heterogeneous fraud-detection graphs, where dense textual data degrades performance.
Method: DGP preserves fine-grained textual details for the target node while summarizing neighbor information into concise prompts using tailored summarization strategies for different data modalities.
Result: DGP improves fraud detection performance by up to 6.8% (AUPRC) over state-of-the-art methods while operating within a manageable token budget.
Conclusion: DGP demonstrates the potential of Graph-Enhanced LLMs for fraud detection by effectively mitigating information overload.
Abstract: Real-world fraud detection applications benefit from graph learning techniques that jointly exploit node features, often rich in textual data, and graph structural information. Recently, Graph-Enhanced LLMs have emerged as a promising graph learning approach that converts graph information into prompts, exploiting LLMs’ ability to reason over both textual and structural information. Among them, text-only prompting, which converts graph information to prompts consisting solely of text tokens, offers a solution that relies only on LLM tuning without requiring additional graph-specific encoders. However, text-only prompting struggles on heterogeneous fraud-detection graphs: multi-hop relations expand exponentially with each additional hop, leading to rapidly growing neighborhoods associated with dense textual information. These neighborhoods may overwhelm the model with long, irrelevant content in the prompt and suppress key signals from the target node, thereby degrading performance. To address this challenge, we propose Dual Granularity Prompting (DGP), which mitigates information overload by preserving fine-grained textual details for the target node while summarizing neighbor information into coarse-grained text prompts. DGP introduces tailored summarization strategies for different data modalities, bi-level semantic abstraction for textual fields and statistical aggregation for numerical features, enabling effective compression of verbose neighbor content into concise, informative prompts. Experiments across public and industrial datasets demonstrate that DGP operates within a manageable token budget while improving fraud detection performance by up to 6.8% (AUPRC) over state-of-the-art methods, showing the potential of Graph-Enhanced LLMs for fraud detection.
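The prompt-assembly idea can be sketched in a few lines. The summarizers below are trivial placeholders for the paper's bi-level semantic abstraction and statistical aggregation, and the field names are invented for illustration:

```python
# Illustrative sketch of dual-granularity prompt assembly: keep the target
# node's full text, compress neighbors to short text and numeric summaries.

from statistics import mean

def summarize_text(texts, max_words=12):
    joined = " ".join(texts)
    return " ".join(joined.split()[:max_words]) + " ..."

def summarize_numeric(values):
    return f"n={len(values)}, mean={mean(values):.2f}, max={max(values):.2f}"

def build_prompt(target_text, neighbor_texts, neighbor_amounts):
    return (
        "### Target account (full detail)\n"
        f"{target_text}\n\n"
        "### Neighborhood (coarse summary)\n"
        f"reviews: {summarize_text(neighbor_texts)}\n"
        f"transactions: {summarize_numeric(neighbor_amounts)}\n\n"
        "Is the target account fraudulent? Answer yes or no."
    )

print(build_prompt(
    "Opened 2 days ago; bio copied from a known scam template.",
    ["great seller!!!", "fast shipping", "great seller!!!"],
    [9.99, 9.99, 499.00],
))
```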
[385] Probabilistic Consistency in Machine Learning and Its Connection to Uncertainty Quantification
Paul Patrone, Anthony Kearsley
Main category: cs.LG
TL;DR: The paper explores the black-box nature of ML models, focusing on uncertainty quantification (UQ) and prevalence in classification. It derives a level-set theory showing self-consistent ML models equate to class-conditional probability distributions, with applications in multiclass classification and UQ.
Details
Motivation: To address the difficulty in quantifying confidence in ML predictions and understanding how models abstract training data, by leveraging diagnostics and prevalence.
Method: Analyzes binary Bayes optimal classifiers, reinterprets boundary sets as density ratio level-sets, and parameterizes classifiers by prevalence to deduce density ratios and derive multiclass results.
Result: Shows that self-consistent ML models satisfy normalization and self-consistency conditions, equivalent to the law of total probability, and are necessary for valid probabilistic interpretations.
Conclusion: The analysis provides a framework for UQ in ML, demonstrating how understanding prevalence and classifier properties can improve uncertainty estimation.
Abstract: Machine learning (ML) is often viewed as a powerful data analysis tool that is easy to learn because of its black-box nature. Yet this very nature also makes it difficult to quantify confidence in predictions extracted from ML models, and more fundamentally, to understand how such models are mathematical abstractions of training data. The goal of this paper is to unravel these issues and their connections to uncertainty quantification (UQ) by pursuing a line of reasoning motivated by diagnostics. In such settings, prevalence - i.e., the fraction of elements in a class - is often of inherent interest. Here we analyze the many interpretations of prevalence to derive a level-set theory of classification, which shows that certain types of self-consistent ML models are equivalent to class-conditional probability distributions. We begin by studying the properties of binary Bayes optimal classifiers, recognizing that their boundary sets can be reinterpreted as level-sets of pairwise density ratios. By parameterizing Bayes classifiers in terms of the prevalence, we then show that they satisfy important monotonicity and class-switching properties that can be used to deduce the density ratios without direct access to the boundary sets. Moreover, this information is sufficient for tasks such as constructing the multiclass Bayes-optimal classifier and estimating inherent uncertainty in the class assignments. In the multiclass case, we use these results to deduce normalization and self-consistency conditions, the latter being equivalent to the law of total probability for classifiers. We also show that these are necessary conditions for arbitrary ML models to have valid probabilistic interpretations. Throughout we demonstrate how this analysis informs the broader task of UQ for ML via an uncertainty propagation framework.
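In the notation one would naturally assume from the abstract (prevalences $q_k$, class-conditional densities $p_k(x)$), the normalization and self-consistency conditions and the prevalence-parameterized Bayes rule plausibly read as follows; this is a reconstruction, not the paper's exact statement:

```latex
% Assumed notation: q_k is the prevalence of class k, p_k(x) the
% class-conditional density, and p(x) the data density. Self-consistency
% is then the law of total probability, and binary Bayes boundary sets
% are level sets of pairwise density ratios.
\[
  p(x) \;=\; \sum_{k} q_k\, p_k(x), \qquad \sum_k q_k = 1,
\]
\[
  \text{decide class } i \text{ over } j
  \iff r_{ij}(x) := \frac{p_i(x)}{p_j(x)} \;\ge\; \frac{q_j}{q_i}.
\]
```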
[386] PREIG: Physics-informed and Reinforcement-driven Interpretable GRU for Commodity Demand Forecasting
Hongwei Ma, Junbin Gao, Minh-Ngoc Tran
Main category: cs.LG
TL;DR: PREIG is a deep learning framework for commodity demand forecasting, combining GRU and PINN with economic constraints, outperforming traditional models.
Details
Motivation: Addressing the challenge of volatile commodity demand forecasting with nonlinear dependencies and the need for economically consistent predictions.
Method: Integrates GRU with PINN principles, enforcing economic constraints via a customized loss function, and uses hybrid optimization (NAdam, L-BFGS, POP).
Result: PREIG outperforms traditional econometric models (ARIMA, GARCH) and deep learning baselines (BPNN, RNN) in RMSE and MAPE.
Conclusion: PREIG offers a robust, interpretable, and scalable solution for high-dimensional nonlinear time series forecasting in economics.
Abstract: Accurately forecasting commodity demand remains a critical challenge due to volatile market dynamics, nonlinear dependencies, and the need for economically consistent predictions. This paper introduces PREIG, a novel deep learning framework tailored for commodity demand forecasting. The model uniquely integrates a Gated Recurrent Unit (GRU) architecture with physics-informed neural network (PINN) principles by embedding a domain-specific economic constraint: the negative elasticity between price and demand. This constraint is enforced through a customized loss function that penalizes violations of the physical rule, ensuring that model predictions remain interpretable and aligned with economic theory. To further enhance predictive performance and stability, PREIG incorporates a hybrid optimization strategy that couples NAdam and L-BFGS with Population-Based Training (POP). Experiments across multiple commodities datasets demonstrate that PREIG significantly outperforms traditional econometric models (ARIMA, GARCH) and deep learning baselines (BPNN, RNN) in both RMSE and MAPE. Compared with GRU, PREIG maintains good explainability while still performing well in prediction. By bridging domain knowledge, optimization theory and deep learning, PREIG provides a robust, interpretable, and scalable solution for high-dimensional nonlinear time series forecasting in economics.
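The economic constraint is straightforward to encode as a loss penalty. The sketch below uses a small MLP stand-in for the paper's GRU and penalizes positive price elasticity via autograd; the weighting `lam` and the assumption that price is the first input column are illustrative:

```python
import torch

# Hedged sketch of a PREIG-style physics-informed loss: penalize any
# violation of negative price elasticity (demand must not rise with price).
# The model below is a toy MLP, not the paper's GRU.

def preig_loss(model, x, y, price_idx=0, lam=1.0):
    x = x.clone().requires_grad_(True)
    pred = model(x)
    mse = torch.mean((pred - y) ** 2)
    grad = torch.autograd.grad(pred.sum(), x, create_graph=True)[0]
    d_demand_d_price = grad[:, price_idx]
    # relu keeps only positive (rule-violating) elasticity
    penalty = torch.mean(torch.relu(d_demand_d_price))
    return mse + lam * penalty

model = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                            torch.nn.Linear(16, 1))
x = torch.randn(64, 3)           # column 0 assumed to be price
y = torch.randn(64, 1)
print(preig_loss(model, x, y).item())
```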
[387] Data-Driven Extended Corresponding State Approach for Residual Property Prediction of Hydrofluoroolefins
Gang Wang, Peng Hu
Main category: cs.LG
TL;DR: A neural network extended corresponding state model is proposed to predict hydrofluoroolefin refrigerant properties, combining theoretical and data-driven methods for improved accuracy.
Details
Motivation: The lack of reliable thermodynamic data for hydrofluoroolefins hinders their discovery and application as next-generation refrigerants.
Method: Integrates graph neural networks and specialized model architecture to predict residual thermodynamic properties, trained on accurate data and validated via leave-one-out cross-validation.
Result: Achieves significantly improved accuracy for density and energy properties, with deviations of 1.49%-2.42% for density and 1.34%-3.37% for entropy and enthalpy.
Conclusion: The model effectively embeds physics into machine learning, accelerating the discovery of superior hydrofluoroolefin refrigerants.
Abstract: Hydrofluoroolefins are considered the most promising next-generation refrigerants due to their extremely low global warming potential values, which can effectively mitigate the global warming effect. However, the lack of reliable thermodynamic data hinders the discovery and application of newer and superior hydrofluoroolefin refrigerants. In this work, integrating the strengths of theoretical and data-driven methods, we propose a neural network extended corresponding state model to predict the residual thermodynamic properties of hydrofluoroolefin refrigerants. The innovation is that the fluids are characterized through their microscopic molecular structures by the inclusion of a graph neural network module and a specialized model architecture designed to enhance generalization ability. The proposed model is trained using highly accurate data for available known fluids, and evaluated via the leave-one-out cross-validation method. Compared to conventional extended corresponding state models or cubic equations of state, the proposed model shows significantly improved accuracy for density and energy properties in liquid and supercritical regions, with average absolute deviations of 1.49% (liquid) and 2.42% (supercritical) for density, 3.37% and 2.50% for residual entropy, and 1.85% and 1.34% for residual enthalpy. These results demonstrate the effectiveness of embedding physics knowledge into the machine learning model. The proposed neural network extended corresponding state model is expected to significantly accelerate the discovery of novel hydrofluoroolefin refrigerants.
[388] Zero-Shot Machine Unlearning with Proxy Adversarial Data Generation
Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou
Main category: cs.LG
TL;DR: ZS-PAG is a novel framework for zero-shot machine unlearning, addressing over-unlearning by generating adversarial samples, pinpointing a subspace for unlearning, and using influence-based pseudo-labeling.
Details
Motivation: Existing unlearning methods rely on remaining data, making them impractical for zero-shot scenarios where only unlearning samples are available.
Method: ZS-PAG approximates remaining data with adversarial samples, identifies a subspace for unlearning, and employs influence-based pseudo-labeling.
Result: The method improves model performance post-unlearning and outperforms baselines in experiments.
Conclusion: ZS-PAG effectively addresses zero-shot unlearning with theoretical guarantees and superior performance.
Abstract: Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model’s performance on the remaining data significantly drops due to the change in the model’s parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, thereby preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model’s performance after unlearning. The proposed method comes with a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.
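Step (1), generating proxies for the inaccessible remaining data, might look roughly like the single FGSM-style step below; the paper's actual generation procedure may differ, and the model and epsilon here are toys:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of proxy-sample generation for zero-shot unlearning:
# perturb the forget samples adversarially so predictions move away from
# the forgotten class. One FGSM-style step stands in for the paper's
# generation procedure.

def proxy_samples(model, x_forget, y_forget, eps=0.05):
    x = x_forget.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_forget)
    loss.backward()
    # ascend the loss: push samples off the forgotten class
    return (x + eps * x.grad.sign()).detach()

model = torch.nn.Linear(10, 3)
x_f = torch.randn(16, 10)
y_f = torch.zeros(16, dtype=torch.long)      # class to forget
print(proxy_samples(model, x_f, y_f).shape)  # torch.Size([16, 10])
```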
[389] evoxels: A differentiable physics framework for voxel-based microstructure simulations
Simon Daubner, Alexander E. Cohen, Benjamin Dörich, Samuel J. Cooper
Main category: cs.LG
TL;DR: The paper introduces evoxels, a differentiable physics framework for integrating microscopy, simulations, and machine learning to optimize material design.
Details
Motivation: Bridging experimental and computational domains is crucial for inverse material design, where desired performance guides microstructure and manufacturing optimization.
Method: The evoxels framework uses a Pythonic, voxel-based approach to combine segmented 3D microscopy data, physical simulations, inverse modeling, and machine learning.
Result: This integration accelerates discovery and enhances understanding of process-structure-property relationships.
Conclusion: The evoxels framework effectively unifies experimental and computational methods for advanced material design.
Abstract: Materials science inherently spans disciplines: experimentalists use advanced microscopy to uncover micro- and nanoscale structure, while theorists and computational scientists develop models that link processing, structure, and properties. Bridging these domains is essential for inverse material design where you start from desired performance and work backwards to optimal microstructures and manufacturing routes. Integrating high-resolution imaging with predictive simulations and data-driven optimization accelerates discovery and deepens understanding of process-structure-property relationships. The differentiable physics framework evoxels is based on a fully Pythonic, unified voxel-based approach that integrates segmented 3D microscopy data, physical simulations, inverse modeling, and machine learning.
[390] TempRe: Template generation for single and direct multi-step retrosynthesis
Nguyen Xuan-Vu, Daniel Armstrong, Zlatko Joncev, Philippe Schwaller
Main category: cs.LG
TL;DR: TempRe is a generative framework for retrosynthesis planning, combining the scalability of sequence generation with the plausibility of template-based methods, outperforming existing approaches.
Details
Motivation: Address the limitations of traditional template-based methods (poor scalability, limited generalization) and template-free approaches (invalid reactions) in retrosynthesis planning.
Method: TempRe reformulates template-based retrosynthesis as sequence generation, enabling scalable and chemically plausible synthesis planning.
Result: TempRe outperforms template classification and SMILES-based methods in single-step and multi-step tasks, achieving strong accuracy on the PaRoutes benchmark.
Conclusion: Template generative modeling, exemplified by TempRe, is a promising paradigm for efficient and flexible computer-aided synthesis planning.
Abstract: Retrosynthesis planning remains a central challenge in molecular discovery due to the vast and complex chemical reaction space. While traditional template-based methods offer tractability, they suffer from poor scalability and limited generalization, and template-free generative approaches risk generating invalid reactions. In this work, we propose TempRe, a generative framework that reformulates template-based approaches as sequence generation, enabling scalable, flexible, and chemically plausible retrosynthesis. We evaluated TempRe across single-step and multi-step retrosynthesis tasks, demonstrating its superiority over both template classification and SMILES-based generation methods. On the PaRoutes multi-step benchmark, TempRe achieves strong top-k route accuracy. Furthermore, we extend TempRe to direct multi-step synthesis route generation, providing a lightweight and efficient alternative to conventional single-step and search-based approaches. These results highlight the potential of template generative modeling as a powerful paradigm in computer-aided synthesis planning.
[391] Unlocking Interpretability for RF Sensing: A Complex-Valued White-Box Transformer
Xie Zhang, Yina Wang, Chenshu Wu
Main category: cs.LG
TL;DR: RF-CRATE is the first mathematically interpretable deep network for RF sensing, extending white-box transformers to the complex domain with improved performance and interpretability.
Details
Motivation: Existing DWS models lack interpretability, limiting generalizability and raising security concerns in RF applications.
Method: Extends white-box transformers to the complex domain using CR-Calculus, introduces Subspace Regularization for feature diversity, and evaluates on multiple RF datasets.
Result: RF-CRATE matches black-box models’ performance, improves classification by 5.08%, and reduces regression error by 10.34%.
Conclusion: RF-CRATE offers interpretability and superior performance, advancing RF sensing with open-source availability.
Abstract: The empirical success of deep learning has spurred its application to the radio-frequency (RF) domain, leading to significant advances in Deep Wireless Sensing (DWS). However, most existing DWS models function as black boxes with limited interpretability, which hampers their generalizability and raises concerns in security-sensitive physical applications. In this work, inspired by the remarkable advances of white-box transformers, we present RF-CRATE, the first mathematically interpretable deep network architecture for RF sensing, grounded in the principles of complex sparse rate reduction. To accommodate the unique RF signals, we conduct non-trivial theoretical derivations that extend the original real-valued white-box transformer to the complex domain. By leveraging the CR-Calculus framework, we successfully construct a fully complex-valued white-box transformer with theoretically derived self-attention and residual multi-layer perceptron modules. Furthermore, to improve the model’s ability to extract discriminative features from limited wireless data, we introduce Subspace Regularization, a novel regularization strategy that enhances feature diversity, resulting in an average performance improvement of 19.98% across multiple sensing tasks. We extensively evaluate RF-CRATE against seven baselines with multiple public and self-collected datasets involving different RF signals. The results show that RF-CRATE achieves performance on par with thoroughly engineered black-box models, while offering full mathematical interpretability. More importantly, by extending CRATE to the complex domain, RF-CRATE yields substantial improvements, achieving an average classification gain of 5.08% and reducing regression error by 10.34% across diverse sensing tasks compared to CRATE. RF-CRATE is fully open-sourced at: https://github.com/rfcrate/RF_CRATE.
[392] Bayesian Neural Network Surrogates for Bayesian Optimization of Carbon Capture and Storage Operations
Sofianos Panagiotis Fotias, Vassilis Gaganis
Main category: cs.LG
TL;DR: The paper explores Bayesian Optimization (BO) for optimizing Carbon Capture and Storage (CCS) projects, comparing novel stochastic models to Gaussian Processes (GPs) to improve decision-making in complex scenarios.
Details
Motivation: To enhance the economic and sustainable deployment of CCS technologies by addressing limitations of traditional BO methods like GPs in complex environments.
Method: Uses derivative-free Bayesian Optimization with various stochastic models, including novel ones, to optimize CCS project variables, focusing on Net Present Value (NPV) as a key objective.
Result: Demonstrates the potential of alternative stochastic models in BO to outperform GPs in scenarios with many decision variables or scaled objectives, improving CCS project viability.
Conclusion: The study pioneers the application of advanced BO techniques in reservoir engineering, showcasing its promise for sustainable energy solutions.
Abstract: Carbon Capture and Storage (CCS) stands as a pivotal technology for fostering a sustainable future. The process, which involves injecting supercritical CO$_2$ into underground formations, a method already widely used for Enhanced Oil Recovery, serves a dual purpose: it not only curbs CO$_2$ emissions and addresses climate change but also extends the operational lifespan and sustainability of oil fields and platforms, easing the shift toward greener practices. This paper delivers a thorough comparative evaluation of strategies for optimizing decision variables in CCS project development, employing a derivative-free technique known as Bayesian Optimization. In addition to Gaussian Processes, which usually serve as the gold standard in BO, various novel stochastic models were examined and compared within a BO framework. This research investigates the effectiveness of utilizing more exotic stochastic models than GPs for BO in environments where GPs have been shown to underperform, such as in cases with a large number of decision variables or multiple objective functions that are not similarly scaled. By incorporating Net Present Value (NPV) as a key objective function, the proposed framework demonstrates its potential to improve economic viability while ensuring the sustainable deployment of CCS technologies. Ultimately, this study represents the first application in the reservoir engineering industry of the growing body of BO research, specifically in the search for more appropriate stochastic models, highlighting its potential as a preferred method for enhancing sustainability in the energy sector.
[393] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang
Main category: cs.LG
TL;DR: R-Stitch is a token-level, confidence-based hybrid decoding framework that accelerates Chain-of-Thought (CoT) reasoning by dynamically switching between small and large language models, reducing inference latency by up to 85% with minimal accuracy loss.
Details
Motivation: CoT reasoning improves problem-solving in large language models but introduces computational overhead. Existing acceleration methods like speculative decoding have limitations in speedup and fail to leverage small models' potential for concise reasoning.
Method: R-Stitch uses a small language model (SLM) by default and switches to a large language model (LLM) only when the SLM’s confidence is low, avoiding full-sequence rollback and selectively invoking the LLM for uncertain steps.
Result: Experiments show R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop on math reasoning benchmarks.
Conclusion: R-Stitch is a practical, model-agnostic, and training-free solution for efficient CoT reasoning, balancing speed and accuracy.
Abstract: Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
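The decoding rule itself is compact. In the sketch below, `slm` and `llm` are assumed to be callables returning a (token, confidence) pair for the current prefix, which abstracts away tokenizers and batching:

```python
# Minimal sketch of confidence-gated hybrid decoding: the small model
# proposes each token and defers to the large model only when its own
# confidence drops below a threshold. No rollback is ever performed.

def r_stitch_decode(slm, llm, prefix, max_new=256, tau=0.7, eos="<eos>"):
    out = list(prefix)
    for _ in range(max_new):
        tok, conf = slm(out)
        if conf < tau:            # uncertain step: delegate to the LLM
            tok, _ = llm(out)
        out.append(tok)
        if tok == eos:
            break
    return out

# Toy models: the SLM is confident except on the token after "therefore".
def slm(prefix):
    if prefix and prefix[-1] == "therefore":
        return "?", 0.2
    return "step", 0.9

def llm(prefix):
    return "<eos>", 1.0

print(r_stitch_decode(slm, llm, ["q:", "therefore"]))
```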
[394] Analysis of Fourier Neural Operators via Effective Field Theory
Taeyoung Kim
Main category: cs.LG
TL;DR: The paper analyzes Fourier Neural Operators (FNOs) using effective-field-theory to explain their stability, generalization, and frequency behavior, revealing how nonlinear activations and architecture choices impact performance.
Details
Motivation: FNOs are widely used for solving high-dimensional PDEs, but their underlying mechanisms, such as stability and frequency behavior, lack theoretical understanding.
Method: The study employs effective-field-theory in infinite-dimensional function space, deriving recursion relations for layer kernels and vertices, and examines analytic activations, scale-invariant cases, and residual connections.
Result: Nonlinear activations couple frequency inputs to high-frequency modes, and criticality conditions for weight initialization are derived. Experiments confirm frequency transfer and validate predictions.
Conclusion: The work explains how nonlinearity aids feature learning in FNOs, provides hyper-parameter selection criteria, and highlights the benefits of scale-invariant activations and residual connections.
Abstract: Fourier Neural Operators (FNOs) have emerged as leading surrogates for high-dimensional partial-differential equations, yet their stability, generalization and frequency behavior lack a principled explanation. We present the first systematic effective-field-theory analysis of FNOs in an infinite-dimensional function space, deriving closed recursion relations for the layer kernel and four-point vertex and then examining three practically important settings: analytic activations, scale-invariant cases, and architectures with residual connections. The theory shows that nonlinear activations inevitably couple frequency inputs to high-frequency modes that are otherwise discarded by spectral truncation, and experiments confirm this frequency transfer. For wide networks we obtain explicit criticality conditions on the weight-initialization ensemble that keep small input perturbations at a uniform scale across depth, and empirical tests validate these predictions. Taken together, our results quantify how nonlinearity enables neural operators to capture non-trivial features, supply criteria for hyper-parameter selection via criticality analysis, and explain why scale-invariant activations and residual connections enhance feature learning in FNOs.
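The frequency-transfer claim is easy to verify numerically. The toy below feeds a single low-frequency mode through a pointwise nonlinearity and inspects the resulting spectrum; it illustrates the mechanism, not the paper's derivations:

```python
import numpy as np

# Small numerical demo of frequency transfer: a pure low-frequency input,
# passed through a pointwise nonlinearity, acquires energy in higher
# Fourier modes that spectral truncation in an FNO layer would discard.

n = 256
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
u = np.sin(3 * x)                        # single mode k=3
for name, f in [("linear", lambda v: v), ("tanh", np.tanh)]:
    spec = np.abs(np.fft.rfft(f(u))) / n
    active = np.nonzero(spec > 1e-6)[0]
    print(f"{name:6s} active modes: {active[:6]} ...")
# linear keeps only k=3; tanh excites odd harmonics k=9, 15, ...
```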
[395] Discovering Interpretable Ordinary Differential Equations from Noisy Data
Rahul Golder, M. M. Faruque Hasan
Main category: cs.LG
TL;DR: Proposes an unsupervised method for discovering interpretable ODE models from noisy data using spline transformations and gradient matrices.
Details
Motivation: Existing methods lack physical interpretability and accuracy in modeling system dynamics.
Method: Uses an approximate general solution and spline transformation to estimate ODE coefficients via a gradient matrix.
Result: Achieves high accuracy and sparsity in ODE discovery without regularization, even with noisy data.
Conclusion: The method is robust and suitable for real-world experimental data-driven learning of physical phenomena.
Abstract: The data-driven discovery of interpretable models approximating the underlying dynamics of a physical system has gained traction over the past decade. Current approaches employ pre-specified functional forms or basis functions and often result in models that lack physical meaning and interpretability, let alone represent the true physics of the system. We propose an unsupervised parameter estimation methodology that first finds an approximate general solution, followed by a spline transformation to linearly estimate the coefficients of the governing ordinary differential equation (ODE). The approximate general solution is postulated using the same functional form as the analytical solution of a general homogeneous, linear, constant-coefficient ODE. An added advantage is its ability to produce a high-fidelity, smooth functional form even in the presence of noisy data. The spline approximation extracts gradient information from the functional form; these gradients are linearly independent and form the basis of the gradient matrix. This gradient matrix is used in a linear system to find the coefficients of the ODE. From the case studies, we observed that our modeling approach discovers ODEs with high accuracy and also promotes sparsity in the solution without using any regularization techniques. The methodology is also robust to noisy data and thus allows the integration of data-driven techniques into real experimental settings for data-driven learning of physical phenomena.
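A stripped-down version of the two-stage recipe, with a smoothing spline standing in for the postulated general solution and least squares recovering the coefficients of a known second-order ODE; the test problem and smoothing parameters are illustrative:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Sketch of the two-stage idea: smooth noisy data, then linearly estimate
# the coefficients of a constant-coefficient ODE  y'' + a y' + b y = 0
# from the smoothed derivatives via least squares.

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 400)
y_true = np.exp(-0.5 * t) * np.cos(2 * t)       # solves y'' + y' + 4.25 y = 0
y_noisy = y_true + 0.01 * rng.normal(size=t.size)

s = UnivariateSpline(t, y_noisy, k=5, s=0.05)   # smooth functional form
y0, y1, y2 = s(t), s.derivative(1)(t), s.derivative(2)(t)

# y'' = -a y' - b y  ->  solve [y', y] @ [a, b] = -y''
A = np.column_stack([y1, y0])
coef, *_ = np.linalg.lstsq(A, -y2, rcond=None)
print(f"estimated a={coef[0]:.2f}, b={coef[1]:.2f}  (true a=1, b=4.25)")
```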
[396] Cardiovascular Disease Prediction using Machine Learning: A Comparative Analysis
Risshab Srinivas Ramesh, Roshani T S Udupa, Monisha J, Kushi K K S
Main category: cs.LG
TL;DR: The study analyzes a CVD dataset to identify key risk factors like age, hypertension, and cholesterol, using statistical tests and logistic regression. CatBoost outperforms other models in accuracy and probabilistic prediction, but data issues suggest better preprocessing is needed.
Details
Motivation: Cardiovascular diseases (CVDs) cause 31% of global deaths, prompting an investigation into numerical and categorical risk factors to improve understanding and prediction.
Method: Statistical analyses (t-tests, Chi-square, ANOVA) and logistic regression were applied to a dataset of 68,119 records to identify CVD risk factors. Model performance was compared, with CatBoost evaluated for accuracy and probabilistic prediction.
Result: Key risk factors include age, hypertension, and cholesterol. CatBoost achieved the highest accuracy (0.734) and best probabilistic prediction (Brier score = 0.1824). Data issues like outliers were noted.
Conclusion: Age, blood pressure, and cholesterol are primary CVD risk factors. CatBoost is the best-performing model, but data preprocessing improvements are needed for better reliability.
Abstract: Cardiovascular diseases (CVDs) are a main cause of mortality globally, accounting for 31% of all deaths. This study involves a cardiovascular disease (CVD) dataset comprising 68,119 records to explore the influence of numerical (age, height, weight, blood pressure, BMI) and categorical (gender, cholesterol, glucose, smoking, alcohol, activity) factors on CVD occurrence. We performed statistical analyses, including t-tests, Chi-square tests, and ANOVA, to identify strong associations between CVD and elderly age, hypertension, higher weight, and abnormal cholesterol levels, while physical activity emerged as a protective factor. A logistic regression model highlights age, blood pressure, and cholesterol as primary risk factors, with unexpected negative associations for smoking and alcohol, suggesting potential data issues. Model performance comparisons reveal CatBoost as the top performer, with an accuracy of 0.734 and an ECE of 0.0064, and it excels in probabilistic prediction (Brier score = 0.1824). Data challenges, including outliers and skewed distributions, indicate a need for improved preprocessing to enhance predictive reliability.
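As a hedged illustration of the evaluation protocol, the sketch below fits a classifier and reports both accuracy and the Brier score on synthetic stand-in data; the study's actual dataset and CatBoost model are not reproduced here.

```python
# A sketch of comparing accuracy with the Brier score for probabilistic
# quality. Synthetic data stands in for the CVD records used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("accuracy:", accuracy_score(y_te, (proba > 0.5).astype(int)))
print("Brier score:", brier_score_loss(y_te, proba))  # lower is better
```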
[397] Multi-state Protein Design with DynamicMPNN
Alex Abrudan, Sebastian Pujalte Ojeda, Chaitanya K. Joshi, Matthew Greenig, Felipe Engelberger, Alena Khmelinskaia, Jens Meiler, Michele Vendruscolo, Tuomas P. J. Knowles
Main category: cs.LG
TL;DR: DynamicMPNN, a new inverse folding model, outperforms ProteinMPNN by 13% in multi-state protein design by jointly learning across conformational ensembles.
Details
Motivation: Existing multi-state design methods rely on aggregated single-state predictions, leading to poor experimental success rates.
Method: DynamicMPNN is trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold.
Result: DynamicMPNN achieves up to 13% better structure-normalized RMSD than ProteinMPNN on a multi-state benchmark.
Conclusion: DynamicMPNN advances multi-state protein design by directly addressing conformational diversity.
Abstract: Structural biology has long been dominated by the one sequence, one structure, one function paradigm, yet many critical biological processes - from enzyme catalysis to membrane transport - depend on proteins that adopt multiple conformational states. Existing multi-state design approaches rely on post-hoc aggregation of single-state predictions, achieving poor experimental success rates compared to single-state design. We introduce DynamicMPNN, an inverse folding model explicitly trained to generate sequences compatible with multiple conformations through joint learning across conformational ensembles. Trained on 46,033 conformational pairs covering 75% of CATH superfamilies and evaluated using AlphaFold initial guess, DynamicMPNN outperforms ProteinMPNN by up to 13% on structure-normalized RMSD across our challenging multi-state protein benchmark.
[398] SLA-Centric Automated Algorithm Selection Framework for Cloud Environments
Siana Rizwan, Tasnim Ahmed, Salimur Choudhury
Main category: cs.LG
TL;DR: An SLA-aware automated algorithm-selection framework for combinatorial optimization in cloud environments, using ML to predict performance and rank algorithms. Applied to the 0-1 knapsack problem with empirical validation.
Details
Motivation: SLA violations in cloud computing impact efficiency and profitability, necessitating automated solutions for optimal algorithm selection under constraints.
Method: Proposes an ML-based framework to predict and rank algorithm-hardware pairs, validated on the 0-1 knapsack problem with a curated dataset.
Result: Evaluated on classification and regression tasks, with insights from ablation studies on hyperparameters, learning approaches, and interpretability.
Conclusion: The framework effectively addresses SLA-aware optimization, demonstrating practical utility and interpretability in cloud resource management.
Abstract: Cloud computing offers on-demand resource access, regulated by Service-Level Agreements (SLAs) between consumers and Cloud Service Providers (CSPs). SLA violations can impact efficiency and CSP profitability. In this work, we propose an SLA-aware automated algorithm-selection framework for combinatorial optimization problems in resource-constrained cloud environments. The framework uses an ensemble of machine learning models to predict performance and rank algorithm-hardware pairs based on SLA constraints. We also apply our framework to the 0-1 knapsack problem. We curate a dataset comprising instance-specific features along with memory usage, runtime, and optimality gap for 6 algorithms. As an empirical benchmark, we evaluate the framework on both classification and regression tasks. Our ablation study explores the impact of hyperparameters and learning approaches, the effectiveness of large language models in regression, and SHAP-based interpretability.
[399] Improving Generative Ad Text on Facebook using Reinforcement Learning
Daniel R. Jiang, Alex Nikulkov, Yu-Chia Chen, Yang Bai, Zheqing Zhu
Main category: cs.LG
TL;DR: The paper explores the economic impact of RL post-training for LLMs, introducing RLPF and demonstrating its success in improving ad performance on Facebook.
Details
Motivation: To quantify the economic impact of RL post-training for LLMs and bridge the gap between general language models and real-world applications.
Method: Developed AdLlama, an RL-trained LLM using RLPF (reinforcement learning with performance feedback), and tested it in a large-scale A/B experiment on Facebook.
Result: AdLlama improved click-through rates by 6.7% and increased advertiser satisfaction, demonstrating significant ROI improvement.
Conclusion: RLPF is a promising, generalizable method for metric-driven post-training, showcasing tangible benefits of RL in real-world generative AI applications.
Abstract: Generative artificial intelligence (AI), in particular large language models (LLMs), is poised to drive transformative economic change. LLMs are pre-trained on vast text data to learn general language patterns, but a subsequent post-training phase is critical to align them for specific real-world tasks. Reinforcement learning (RL) is the leading post-training technique, yet its economic impact remains largely underexplored and unquantified. We examine this question through the lens of the first deployment of an RL-trained LLM for generative advertising on Facebook. Integrated into Meta’s Text Generation feature, our model, “AdLlama,” powers an AI tool that helps advertisers create new variations of human-written ad text. To train this model, we introduce reinforcement learning with performance feedback (RLPF), a post-training method that uses historical ad performance data as a reward signal. In a large-scale 10-week A/B test on Facebook spanning nearly 35,000 advertisers and 640,000 ad variations, we find that AdLlama improves click-through rates by 6.7% (p=0.0296) compared to a supervised imitation model trained on curated ads. This represents a substantial improvement in advertiser return on investment on Facebook. We also find that advertisers who used AdLlama generated more ad variations, indicating higher satisfaction with the model’s outputs. To our knowledge, this is the largest study to date on the use of generative AI in an ecologically valid setting, offering an important data point quantifying the tangible impact of RL post-training. Furthermore, the results show that RLPF is a promising and generalizable approach for metric-driven post-training that bridges the gap between highly capable language models and tangible outcomes.
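The specifics of AdLlama's training are not public, but the general idea of using a historical performance metric as a reward can be illustrated with a toy REINFORCE-style update; the model, data, and rewards below are all hypothetical stand-ins, not the RLPF method itself.

```python
# A toy reward-weighted likelihood update: sampled ad variations are scored
# by a historical-performance reward (e.g., CTR-derived) and the policy is
# nudged toward high-reward samples. Everything here is a stand-in.
import torch
import torch.nn.functional as F

vocab, d = 100, 32
lm = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
opt = torch.optim.Adam(lm.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab, (8, 12))     # 8 sampled ad variations
reward = torch.rand(8)                        # hypothetical performance reward

logits = lm(tokens[:, :-1])                   # predict next tokens
logp = F.log_softmax(logits, dim=-1)
token_logp = logp.gather(-1, tokens[:, 1:, None]).squeeze(-1).sum(dim=1)
loss = -((reward - reward.mean()) * token_logp).mean()  # REINFORCE w/ baseline
opt.zero_grad(); loss.backward(); opt.step()
```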
[400] Teach Me to Trick: Exploring Adversarial Transferability via Knowledge Distillation
Siddhartha Pradhan, Shikshya Shiwakoti, Neha Bathuri
Main category: cs.LG
TL;DR: Knowledge distillation from multiple heterogeneous teachers improves adversarial example generation, matching ensemble-based baselines while being six times faster.
Details
Motivation: To explore if knowledge distillation (KD) from multiple teachers can enhance transferable adversarial example generation efficiently.
Method: Train a lightweight student model using curriculum-based switching and joint optimization KD strategies with ResNet50 and DenseNet-161 as teachers. Generate adversarial examples using FG, FGS, and PGD attacks, evaluated against GoogLeNet.
Result: Student models achieve attack success rates comparable to ensemble baselines, with six times faster generation. Lower temperature and hard-label supervision boost transferability.
Conclusion: KD is not just for model compression but also improves black-box adversarial attack efficiency and effectiveness.
Abstract: We investigate whether knowledge distillation (KD) from multiple heterogeneous teacher models can enhance the generation of transferable adversarial examples. A lightweight student model is trained using two KD strategies: curriculum-based switching and joint optimization, with ResNet50 and DenseNet-161 as teachers. The trained student is then used to generate adversarial examples using FG, FGS, and PGD attacks, which are evaluated against a black-box target model (GoogLeNet). Our results show that student models distilled from multiple teachers achieve attack success rates comparable to ensemble-based baselines, while reducing adversarial example generation time by up to a factor of six. An ablation study further reveals that lower temperature settings and the inclusion of hard-label supervision significantly enhance transferability. These findings suggest that KD can serve not only as a model compression technique but also as a powerful tool for improving the efficiency and effectiveness of black-box adversarial attacks.
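A minimal sketch of the joint-optimization distillation variant followed by an FGSM-style attack on the resulting student; the tiny linear "models" below stand in for ResNet50/DenseNet-161 and the real student architecture.

```python
# Sketch: the student matches the averaged soft targets of two teachers
# (with hard-label supervision, which the paper finds aids transferability),
# then serves as the surrogate for a gradient-sign attack.
import torch
import torch.nn.functional as F

student = torch.nn.Linear(32, 10)
teachers = [torch.nn.Linear(32, 10), torch.nn.Linear(32, 10)]
opt = torch.optim.Adam(student.parameters())
T = 2.0  # lower temperatures were found to aid transferability

x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
with torch.no_grad():
    soft = sum(F.softmax(t(x) / T, dim=-1) for t in teachers) / len(teachers)
loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1), soft,
                reduction="batchmean") * T * T \
       + F.cross_entropy(student(x), y)      # hard-label supervision
opt.zero_grad(); loss.backward(); opt.step()

# FGSM against the distilled student (then transferred to a black-box target).
x.requires_grad_(True)
adv_loss = F.cross_entropy(student(x), y)
x_adv = x + 0.03 * torch.autograd.grad(adv_loss, x)[0].sign()
```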
[401] Classification of Honey Botanical and Geographical Sources using Mineral Profiles and Machine Learning
Mokhtar Al-Awadhi, Ratnadeep Deshmukh
Main category: cs.LG
TL;DR: A machine learning approach using mineral element profiles to classify honey’s floral and geographical sources, with Random Forests achieving high accuracy.
Details
Motivation: To identify honey's botanical and geographical origins using mineral element profiles for authenticity and traceability.
Method: Two-step process: preprocessing (missing-value treatment, normalization) and classification (supervised models, tested on a public dataset).
Result: Mineral elements effectively classify honey origins; Random Forests achieved 99.30% (botanical) and 98.01% (geographical) accuracy.
Conclusion: Mineral element profiles are reliable for honey origin classification, with Random Forests as the top-performing model.
Abstract: This paper proposes a machine learning-based approach for identifying honey floral and geographical sources using mineral element profiles. The proposed method comprises two steps: preprocessing and classification. The preprocessing phase involves missing-value treatment and data normalization. In the classification phase, we employ various supervised classification models for discriminating between six botanical sources and 13 geographical origins of honey. We test the classifiers’ performance on a publicly available honey mineral element dataset. The dataset contains mineral element profiles of honeys from various floral and geographical origins. Results show that mineral element content in honey provides discriminative information useful for classifying honey botanical and geographical sources. Results also show that the Random Forests (RF) classifier obtains the best performance on this dataset, achieving a cross-validation accuracy of 99.30% for classifying honey botanical origins and 98.01% for classifying honey geographical origins.
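The two-step pipeline maps naturally onto a scikit-learn pipeline; the sketch below uses synthetic mineral-profile data as a stand-in for the public dataset.

```python
# A sketch of the two-step pipeline (missing-value treatment + normalization,
# then classification), evaluated with cross-validation as in the paper.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))            # 12 mineral element concentrations
X[rng.random(X.shape) < 0.05] = np.nan    # scattered missing values
y = rng.integers(0, 6, 600)               # six botanical sources

pipe = make_pipeline(SimpleImputer(strategy="median"),
                     StandardScaler(),
                     RandomForestClassifier(random_state=0))
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```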
[402] Structure-Informed Deep Reinforcement Learning for Inventory Management
Alvaro Maggiar, Sohrab Andaz, Akhil Bagaria, Carson Eisenach, Dean Foster, Omer Gottesman, Dominique Perrault-Joncas
Main category: cs.LG
TL;DR: The paper explores using Deep Reinforcement Learning (DRL) for inventory management, showing it outperforms benchmarks and heuristics while requiring minimal tuning. It introduces a Structure-Informed Policy Network for better interpretability and robustness.
Details
Motivation: To bridge the gap between data-driven learning and analytical insights in inventory management, focusing on practical implementation without unrealistic assumptions.
Method: Applies a DRL algorithm (DirectBackprop) to various inventory scenarios, using historical data. Introduces a Structure-Informed Policy Network to incorporate analytical insights.
Result: DRL performs competitively or better than benchmarks, captures optimal policy structures, and improves interpretability and robustness.
Conclusion: DRL effectively combines data-driven and analytical approaches in inventory management, offering practical and interpretable solutions.
Abstract: This paper investigates the application of Deep Reinforcement Learning (DRL) to classical inventory management problems, with a focus on practical implementation considerations. We apply a DRL algorithm based on DirectBackprop to several fundamental inventory management scenarios including multi-period systems with lost sales (with and without lead times), perishable inventory management, dual sourcing, and joint inventory procurement and removal. The DRL approach learns policies across products using only historical information that would be available in practice, avoiding unrealistic assumptions about demand distributions or access to distribution parameters. We demonstrate that our generic DRL implementation performs competitively against or outperforms established benchmarks and heuristics across these diverse settings, while requiring minimal parameter tuning. Through examination of the learned policies, we show that the DRL approach naturally captures many known structural properties of optimal policies derived from traditional operations research methods. To further improve policy performance and interpretability, we propose a Structure-Informed Policy Network technique that explicitly incorporates analytically-derived characteristics of optimal policies into the learning process. This approach can help interpretability and add robustness to the policy in out-of-sample performance, as we demonstrate in an example with realistic demand data. Finally, we provide an illustrative application of DRL in a non-stationary setting. Our work bridges the gap between data-driven learning and analytical insights in inventory management while maintaining practical applicability.
[403] Weight-Parameterization in Continuous Time Deep Neural Networks for Surrogate Modeling
Haley Rosso, Lars Ruthotto, Khachik Sargsyan
Main category: cs.LG
TL;DR: The paper explores weight parameterization strategies for neural ODEs and ResNets using polynomial bases, finding Legendre polynomials improve stability and efficiency.
Details
Motivation: To address the challenge of learning expressive yet stable time-varying weights in continuous-time deep learning models under computational constraints.
Method: Evaluates monomial and Legendre polynomial bases in neural ODE and ResNet architectures under two training paradigms: discretize-then-optimize and optimize-then-discretize.
Result: Legendre parameterizations yield more stable training, lower computational cost, and comparable or better accuracy than monomial or unconstrained models.
Conclusion: Orthogonal polynomial bases, like Legendre, offer a favorable balance between expressivity and training efficiency in time-dependent weight parameterization.
Abstract: Continuous-time deep learning models, such as neural ordinary differential equations (ODEs), offer a promising framework for surrogate modeling of complex physical systems. A central challenge in training these models lies in learning expressive yet stable time-varying weights, particularly under computational constraints. This work investigates weight parameterization strategies that constrain the temporal evolution of weights to a low-dimensional subspace spanned by polynomial basis functions. We evaluate both monomial and Legendre polynomial bases within neural ODE and residual network (ResNet) architectures under discretize-then-optimize and optimize-then-discretize training paradigms. Experimental results across three high-dimensional benchmark problems show that Legendre parameterizations yield more stable training dynamics, reduce computational cost, and achieve accuracy comparable to or better than both monomial parameterizations and unconstrained weight models. These findings elucidate the role of basis choice in time-dependent weight parameterization and demonstrate that using orthogonal polynomial bases offers a favorable tradeoff between model expressivity and training efficiency.
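A small sketch of the parameterization itself, assuming the weights W(t) are expanded in a Legendre basis so that only the K coefficient tensors are trainable; the layer shape and basis size are illustrative.

```python
# Sketch of low-dimensional time-dependent weights: W(t) is a linear
# combination of Legendre polynomials, so only the coefficients are learned.
import numpy as np
from numpy.polynomial import legendre

K, d_out, d_in = 4, 8, 8                    # basis size and layer shape (assumed)
coeffs = np.random.randn(K, d_out, d_in) * 0.1   # the trainable parameters

def weight_at(t):
    """Evaluate W(t) for t in [-1, 1] from the Legendre expansion."""
    basis = legendre.legval(t, np.eye(K))   # [P_0(t), ..., P_{K-1}(t)]
    return np.tensordot(basis, coeffs, axes=1)

# One forward Euler step of the neural ODE dh/dt = tanh(W(t) h).
h, t, dt = np.random.randn(d_in), -1.0, 0.05
h = h + dt * np.tanh(weight_at(t) @ h)
```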
[404] Foundation Models for Demand Forecasting via Dual-Strategy Ensembling
Wei Yang, Defu Cao, Yan Liu
Main category: cs.LG
TL;DR: A unified ensemble framework improves sales forecasting by combining Hierarchical and Architectural Ensembles, outperforming baselines in accuracy and generalization.
Details
Motivation: Demand forecasting is challenging due to hierarchical complexity and evolving external factors. Foundation models lack robustness under distributional changes.
Method: Proposes Hierarchical Ensemble (HE) for localized patterns and Architectural Ensemble (AE) for diverse model integration.
Result: Outperforms baselines on M5 benchmark and external datasets, improving accuracy and generalization.
Conclusion: The framework effectively boosts forecasting performance in complex supply chain environments.
Abstract: Accurate demand forecasting is critical for supply chain optimization, yet remains difficult in practice due to hierarchical complexity, domain shifts, and evolving external factors. While recent foundation models offer strong potential for time series forecasting, they often suffer from architectural rigidity and limited robustness under distributional change. In this paper, we propose a unified ensemble framework that enhances the performance of foundation models for sales forecasting in real-world supply chains. Our method combines two complementary strategies: (1) Hierarchical Ensemble (HE), which partitions training and inference by semantic levels (e.g., store, category, department) to capture localized patterns; and (2) Architectural Ensemble (AE), which integrates predictions from diverse model backbones to mitigate bias and improve stability. We conduct extensive experiments on the M5 benchmark and three external sales datasets, covering both in-domain and zero-shot forecasting. Results show that our approach consistently outperforms strong baselines, improves accuracy across hierarchical levels, and provides a simple yet effective mechanism for boosting generalization in complex forecasting environments.
[405] Online hierarchical partitioning of the output space in extreme multi-label data stream
Lara Neves, Afonso Lourenço, Alberto Cano, Goreti Marreiros
Main category: cs.LG
TL;DR: iHOMER is an online multi-label learning framework that dynamically clusters labels and adapts to concept drift, outperforming state-of-the-art methods.
Details
Motivation: Address challenges in multi-label data streams like evolving distributions, high-dimensional label spaces, and concept drift.
Method: Uses incremental divisive-agglomerative clustering and a global tree-based learner with drift detection.
Result: Outperforms 5 global baselines by 23% and 12 local baselines by 32%.
Conclusion: iHOMER is robust for online multi-label classification.
Abstract: Mining data streams with multi-label outputs poses significant challenges due to evolving distributions, high-dimensional label spaces, sparse label occurrences, and complex label dependencies. Moreover, concept drift affects not only input distributions but also label correlations and imbalance ratios over time, complicating model adaptation. To address these challenges, structured learners are categorized into local and global methods. Local methods break down the task into simpler components, while global methods adapt the algorithm to the full output space, potentially yielding better predictions by exploiting label correlations. This work introduces iHOMER (Incremental Hierarchy Of Multi-label Classifiers), an online multi-label learning framework that incrementally partitions the label space into disjoint, correlated clusters without relying on predefined hierarchies. iHOMER leverages online divisive-agglomerative clustering based on Jaccard similarity and a global tree-based learner driven by a multivariate Bernoulli process to guide instance partitioning. To address non-stationarity, it integrates drift detection mechanisms at both global and local levels, enabling dynamic restructuring of label partitions and subtrees. Experiments across 23 real-world datasets show iHOMER outperforms 5 state-of-the-art global baselines, such as MLHAT, MLHT of Pruned Sets and iSOUPT, by 23%, and 12 local baselines, such as binary relevance transformations of kNN, EFDT, ARF, and ADWIN bagging/boosting ensembles, by 32%, establishing its robustness for online multi-label classification.
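An offline sketch of the label-partitioning ingredient: clustering labels by the Jaccard similarity of their occurrence columns. iHOMER does this incrementally and with drift-aware restructuring, which is not shown here.

```python
# Sketch: cluster labels by Jaccard similarity of their occurrence patterns.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
Y = (rng.random((500, 8)) < 0.2).astype(int)   # 500 instances, 8 labels
Y[:, 1] = Y[:, 0]                              # make labels 0 and 1 co-occur

# pdist's 'jaccard' metric gives 1 - Jaccard similarity between label columns.
dist = pdist(Y.T, metric="jaccard")
clusters = fcluster(linkage(dist, method="average"), t=0.8, criterion="distance")
print("label clusters:", clusters)             # labels 0 and 1 share a cluster
```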
[406] Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper
Main category: cs.LG
TL;DR: Targeted latent adversarial training (LAT) improves LLM robustness against harmful behaviors like jailbreaking, backdoors, and undesirable knowledge retention, outperforming existing methods with less compute.
Details
Motivation: LLMs often exhibit undesirable behaviors despite fine-tuning, as adversarial fine-tuning suppresses but doesn't remove such capabilities. Targeted LAT aims to address specific failure modes more effectively.
Method: Targeted LAT involves adversaries minimizing loss on specific competing tasks, enhancing robustness. It is applied to jailbreak resistance, backdoor removal, and knowledge unlearning.
Result: Targeted LAT outperforms R2D2 with less compute, removes backdoors without trigger knowledge, and robustly unlearns undesirable tasks.
Conclusion: Targeted LAT is an effective tool for defending against harmful LLM behaviors, offering broad applicability and efficiency.
Abstract: Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of ‘jailbreaking’ techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
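A toy sketch of the targeted-LAT loop on a two-layer model: the inner adversary perturbs latent activations to minimize loss on a competing task, and the outer step trains the model to behave well under that perturbation. All shapes, step sizes, and "tasks" are illustrative assumptions.

```python
# Toy targeted latent adversarial training: inner loop helps a competing
# (undesirable) task via latent perturbations; outer loop trains against it.
import torch
import torch.nn.functional as F

f1 = torch.nn.Linear(16, 32)                    # layers below/above the latent
f2 = torch.nn.Linear(32, 2)
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()))

x = torch.randn(64, 16)
y_good = torch.zeros(64, dtype=torch.long)      # desired behavior
y_bad = torch.ones(64, dtype=torch.long)        # competing task the adversary targets

h = torch.relu(f1(x))
delta = torch.zeros_like(h, requires_grad=True)
for _ in range(5):                              # inner adversary: minimize bad-task loss
    adv_loss = F.cross_entropy(f2(h.detach() + delta), y_bad)
    g, = torch.autograd.grad(adv_loss, delta)
    delta = (delta - 0.1 * g).detach().requires_grad_(True)

loss = F.cross_entropy(f2(h + delta.detach()), y_good)  # outer: stay aligned
opt.zero_grad(); loss.backward(); opt.step()
```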
[407] SQuat: Subspace-orthogonal KV Cache Quantization
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Main category: cs.LG
TL;DR: SQuat introduces a subspace-orthogonal KV cache quantization method to reduce memory usage and improve throughput in LLMs without fine-tuning or additional datasets.
Details
Motivation: Existing KV cache quantization methods accumulate errors over time, leading to undesired outputs. SQuat aims to minimize these errors while maintaining efficiency.
Method: SQuat constructs a subspace from query tensors to capture critical information and enforces orthogonality during key tensor quantization to reduce error impact.
Result: SQuat reduces peak memory by 2.17-2.82x, improves throughput by 2.45-3.60x, and outperforms existing KV cache quantization benchmarks.
Conclusion: SQuat is a theoretically grounded, efficient solution for KV cache quantization, offering significant memory and performance benefits without requiring additional resources.
Abstract: The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism’s outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17x to 2.82x, improves throughput by 2.45x to 3.60x, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
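A numerical sketch of the core orthogonality constraint, assuming an orthonormal query-subspace basis Q and a simple uniform quantizer; the paper folds this correction into the quantization procedure itself rather than applying it as a post-hoc fix.

```python
# Sketch: after quantizing a key, adjust the dequantized value so its error
# is orthogonal to the query subspace. Quantizer and subspace size are
# illustrative choices, not SQuat's actual scheme.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))   # orthonormal query-subspace basis

def quantize(v, bits=2):
    scale = np.abs(v).max() / (2 ** (bits - 1) - 0.5)
    return np.round(v / scale) * scale

k = rng.normal(size=d)
k_hat = quantize(k)
err = k_hat - k
k_hat_adj = k_hat - Q @ (Q.T @ err)            # remove error inside the subspace

print("|Q^T err| before:", np.linalg.norm(Q.T @ (k_hat - k)))
print("|Q^T err| after: ", np.linalg.norm(Q.T @ (k_hat_adj - k)))
```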
[408] Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling
Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu
Main category: cs.LG
TL;DR: SWIFT is a lightweight technique using LLM hidden states for efficient performance enhancement, outperforming baselines with minimal parameters and training samples.
Details
Motivation: The computational cost of text-based reward models in best-of-N sampling for LLMs is prohibitive, prompting the need for a more efficient method.
Method: SWIFT leverages token-level hidden states of LLMs, using only linear layers, to provide intrinsic feedback, reducing computational demands.
Result: SWIFT outperforms baselines with less than 0.005% of their parameters, requires few training samples, and shows robust scalability and compatibility.
Conclusion: SWIFT offers a practical, efficient solution for enhancing LLM performance, with potential for integration with traditional reward models.
Abstract: Enhancing a Large Language Model (LLM)’s performance with best-of-N sampling is effective and has attracted significant attention. However, it is computationally prohibitive due to massive, data-hungry text-based reward models. By changing the data source from text to hidden states, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel, lightweight technique that leverages the rich information embedded in LLM hidden states to address these issues; it operates at the token level and consists of only linear layers. Extensive experiments show that SWIFT outperforms baselines with less than 0.005% of their parameters, requiring only a few samples for training and demonstrating significant efficiency improvement. SWIFT’s robust scalability, applicability to some closed-source models via logits, and ability to be combined with traditional reward models to yield further performance gains underscore its practical value.
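A sketch of the selection step under assumed shapes: a single linear head scores token-level hidden states and candidates are ranked by their aggregate score. The pooling choice here is an assumption, not the paper's exact design.

```python
# Sketch of best-of-N with an intrinsic reward: a linear layer scores each
# candidate from its hidden states. Shapes are hypothetical.
import torch

d_hidden, N = 768, 8
reward_head = torch.nn.Linear(d_hidden, 1)       # SWIFT uses only linear layers

hidden = torch.randn(N, 32, d_hidden)            # N candidates, 32 tokens each
token_scores = reward_head(hidden).squeeze(-1)   # token-level scores
candidate_scores = token_scores.mean(dim=1)      # aggregate per candidate (assumed)
best = candidate_scores.argmax().item()
print("selected candidate:", best)
```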
[409] Hierarchical mixtures of Gaussians for combined dimensionality reduction and clustering
Sacha Sokoloski, Philipp Berens
Main category: cs.LG
TL;DR: HMoGs unify dimensionality reduction and clustering into a single probabilistic model, offering efficiency and interpretability for high-dimensional data.
Details
Motivation: To bridge classical statistical modeling with modern data scale, preserving rigor and interpretability often missing in other methods.
Method: Hierarchical mixtures of Gaussians with closed-form likelihood, exact inference, and maximum-likelihood optimization.
Result: Efficient modeling of high-dimensional data, improved performance in synthetic and MNIST experiments, and enhanced interpretability.
Conclusion: HMoGs provide a practical, rigorous approach to high-dimensional clustering, outperforming embedding-based and variational methods.
Abstract: We introduce hierarchical mixtures of Gaussians (HMoGs), which unify dimensionality reduction and clustering into a single probabilistic model. HMoGs provide closed-form expressions for the model likelihood, exact inference over latent states and cluster membership, and exact algorithms for maximum-likelihood optimization. The novel exponential family parameterization of HMoGs greatly reduces their computational complexity relative to similar model-based methods, allowing them to efficiently model hundreds of latent dimensions, and thereby capture additional structure in high-dimensional data. We demonstrate HMoGs on synthetic experiments and MNIST, and show how joint optimization of dimensionality reduction and clustering facilitates increased model performance. We also explore how sparsity-constrained dimensionality reduction can further improve clustering performance while encouraging interpretability. By bridging classical statistical modelling with the scale of modern data and compute, HMoGs offer a practical approach to high-dimensional clustering that preserves statistical rigour, interpretability, and uncertainty quantification that is often missing from embedding-based, variational, and self-supervised methods.
[410] Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees
Jihao Xin, Marco Canini, Peter Richtárik, Samuel Horváth
Main category: cs.LG
TL;DR: Global-QSGD is an Allreduce-compatible gradient quantization method that reduces communication overhead in distributed deep learning while maintaining accuracy, backed by theoretical guarantees and practical performance improvements.
Details
Motivation: The high communication overhead in distributed training, especially with growing models and datasets, motivates the need for efficient gradient compression methods like quantization. Existing methods often lack compatibility with Allreduce or theoretical guarantees.
Method: Global-QSGD introduces global norm scaling for gradient quantization, ensuring compatibility with Allreduce. It includes rigorous theoretical analysis and a performance model for evaluation.
Result: Global-QSGD reduces communication overhead and accelerates distributed training by up to 3.51% over baseline quantization methods in various hardware environments.
Conclusion: Global-QSGD is a practical and efficient solution for large-scale deep learning, offering theoretical guarantees and significant performance improvements.
Abstract: Distributed training enables large-scale deep learning, but suffers from high communication overhead, especially as models and datasets grow. Gradient compression, particularly quantization, is a promising approach to mitigate this bottleneck. However, existing quantization schemes are often incompatible with Allreduce, the dominant communication primitive in distributed deep learning, and many prior solutions rely on heuristics without theoretical guarantees. We introduce Global-QSGD, an Allreduce-compatible gradient quantization method that leverages global norm scaling to reduce communication overhead while preserving accuracy. Global-QSGD is backed by rigorous theoretical analysis, extending standard unbiased compressor frameworks to establish formal convergence guarantees. Additionally, we develop a performance model to evaluate its impact across different hardware configurations. Extensive experiments on NVLink, PCIe, and large-scale cloud environments show that Global-QSGD accelerates distributed training by up to 3.51% over baseline quantization methods, making it a practical and efficient solution for large-scale deep learning workloads.
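A sketch of the key property, under a simplified max-norm scale and unbiased stochastic rounding: because every worker quantizes against the same global scale, the quantized gradients can be summed directly, which is what makes a scheme Allreduce-compatible.

```python
# Sketch of global-norm-scaled stochastic quantization. The max-norm scale
# and level count are illustrative simplifications of Global-QSGD.
import numpy as np

def quantize_global(grad, global_norm, levels=16):
    p = np.abs(grad) / global_norm * levels             # position in [0, levels]
    low = np.floor(p)
    q = low + (np.random.random(grad.shape) < p - low)  # unbiased rounding
    return np.sign(grad) * q * global_norm / levels

grads = [np.random.randn(1000) for _ in range(4)]       # one gradient per worker
gnorm = max(np.abs(g).max() for g in grads)             # shared global scale
avg = sum(quantize_global(g, gnorm) for g in grads) / len(grads)
true_avg = np.mean(grads, axis=0)
print("relative error:", np.linalg.norm(avg - true_avg) / np.linalg.norm(true_avg))
```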
[411] Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey of Notions, Methods, and Challenges
Usman Gohar, Zeyu Tang, Jialu Wang, Kun Zhang, Peter L. Spirtes, Yang Liu, Lu Cheng
Main category: cs.LG
TL;DR: Survey on long-term fairness in ML, addressing challenges beyond static measures due to feedback loops and model-environment interactions.
Details
Motivation: Concerns about fairness in ML systems, especially in high-stakes domains, and limitations of static fairness measures in achieving long-term fairness.
Method: Review of existing literature on long-term fairness, presenting a taxonomy and analyzing challenges.
Result: Identified key challenges and gaps in achieving long-term fairness, with insights into feedback loops and model-environment interactions.
Conclusion: Highlights the need for further research to address long-term fairness, considering dynamic and interactive aspects of ML systems.
Abstract: The widespread integration of Machine Learning systems in daily life, particularly in high-stakes domains, has raised concerns about the fairness implications. While prior works have investigated static fairness measures, recent studies reveal that automated decision-making has long-term implications and that off-the-shelf fairness approaches may not serve the purpose of achieving long-term fairness. Additionally, the existence of feedback loops and the interaction between models and the environment introduces additional complexities that may deviate from the initial fairness goals. In this survey, we review existing literature on long-term fairness from different perspectives and present a taxonomy for long-term fairness studies. We highlight key challenges and consider future research directions, analyzing both current issues and potential further explorations.
[412] MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data
Yaobin Ling, Xiaoqian Jiang, Yejin Kim
Main category: cs.LG
TL;DR: A novel framework using LLMs and GAN architecture to generate high-quality synthetic tabular data, especially effective for small datasets while preserving privacy.
Details
Motivation: Addressing data scarcity and privacy concerns in healthcare and other domains by enabling synthetic data generation without requiring large training datasets.
Method: Proposes a framework combining LLMs with GAN architecture, using contextual data generation and LLM optimization to enhance synthetic data quality for small sample sizes.
Result: Outperforms state-of-the-art models in generating high-quality synthetic data for downstream tasks, as validated on public and private datasets.
Conclusion: The framework effectively tackles data scarcity and privacy issues, offering a viable solution for synthetic data generation in resource-limited settings.
Abstract: In the era of big data, access to abundant data is crucial for driving research forward. However, such data is often inaccessible due to privacy concerns or high costs, particularly in the healthcare domain. Generating synthetic (tabular) data can address this, but existing models typically require substantial amounts of data to train effectively, contradicting our objective to solve data scarcity. To address this challenge, we propose a novel framework to generate synthetic tabular data, powered by large language models (LLMs), that emulates the architecture of a Generative Adversarial Network (GAN). By incorporating the data generation process as contextual information and utilizing the LLM as the optimizer, our approach significantly enhances the quality of synthetic data generation in common scenarios with small sample sizes. Our experimental results on public and private datasets demonstrate that our model outperforms several state-of-the-art models in generating higher-quality synthetic data for downstream tasks while preserving the privacy of the real data.
[413] Persistent Backdoor Attacks in Continual Learning
Zhen Guo, Abhinav Kumar, Reza Tourani
Main category: cs.LG
TL;DR: The paper introduces two persistent backdoor attacks for neural networks in continual learning, demonstrating their effectiveness and evasion of defenses.
Details
Motivation: Backdoor attacks in continual learning are understudied, particularly their persistence and practicality as models update over time.
Method: Two attacks are proposed: Blind Task Backdoor (alters loss computation) and Latent Task Backdoor (influences one task’s training). Evaluated with various triggers.
Result: Both attacks achieve high success rates across continual learning algorithms and evade defenses like SentiNet and I-BAU.
Conclusion: The attacks highlight vulnerabilities in continual learning systems, emphasizing the need for robust defenses against persistent backdoors.
Abstract: Backdoor attacks pose a significant threat to neural networks, enabling adversaries to manipulate model outputs on specific inputs, often with devastating consequences, especially in critical applications. While backdoor attacks have been studied in various contexts, little attention has been given to their practicality and persistence in continual learning, particularly in understanding how the continual updates to model parameters, as new data distributions are learned and integrated, impact the effectiveness of these attacks over time. To address this gap, we introduce two persistent backdoor attacks-Blind Task Backdoor and Latent Task Backdoor-each leveraging minimal adversarial influence. Our blind task backdoor subtly alters the loss computation without direct control over the training process, while the latent task backdoor influences only a single task’s training, with all other tasks trained benignly. We evaluate these attacks under various configurations, demonstrating their efficacy with static, dynamic, physical, and semantic triggers. Our results show that both attacks consistently achieve high success rates across different continual learning algorithms, while effectively evading state-of-the-art defenses, such as SentiNet and I-BAU.
[414] Recovering Manifold Structure Using Ollivier-Ricci Curvature
Tristan Luca Saidi, Abigail Hickok, Andrew J. Blumberg
Main category: cs.LG
TL;DR: ORC-ManL is a new algorithm for pruning spurious edges in nearest-neighbor graphs using Ollivier-Ricci curvature and metric distortion, improving downstream tasks like manifold learning and clustering.
Details
Motivation: The algorithm addresses the challenge of noisy samples from low-dimensional manifolds, where spurious edges shortcut through ambient space, disrupting accurate geometric analysis.
Method: ORC-ManL prunes edges based on Ollivier-Ricci curvature and metric distortion, targeting edges that deviate from the data manifold.
Result: The method outperforms other pruning techniques and enhances performance in manifold learning, persistent homology, dimension estimation, and single-cell RNA sequencing analysis.
Conclusion: ORC-ManL effectively improves geometric data analysis tasks and supports theoretical findings with empirical evidence.
Abstract: We introduce ORC-ManL, a new algorithm to prune spurious edges from nearest neighbor graphs using a criterion based on Ollivier-Ricci curvature and estimated metric distortion. Our motivation comes from manifold learning: we show that when the data generating the nearest-neighbor graph consists of noisy samples from a low-dimensional manifold, edges that shortcut through the ambient space have more negative Ollivier-Ricci curvature than edges that lie along the data manifold. We demonstrate that our method outperforms alternative pruning methods and that it significantly improves performance on many downstream geometric data analysis tasks that use nearest neighbor graphs as input. Specifically, we evaluate on manifold learning, persistent homology, dimension estimation, and others. We also show that ORC-ManL can be used to improve clustering and manifold learning of single-cell RNA sequencing data. Finally, we provide empirical convergence experiments that support our theoretical findings.
[415] Local Attention Mechanism: Boosting the Transformer Architecture for Long-Sequence Time Series Forecasting
Ignacio Aguilera-Martos, Andrés Herrera-Poyatos, Julián Luengo, Francisco Herrera
Main category: cs.LG
TL;DR: The paper introduces the Local Attention Mechanism (LAM) for efficient time series forecasting, reducing complexity to O(n log n) and outperforming traditional attention mechanisms. It also proposes new datasets for better evaluation.
Details
Motivation: Transformers are dominant in NLP and time series analysis, but traditional attention mechanisms have a high computational cost (O(n^2)). LAM addresses this by leveraging time series continuity.
Method: LAM exploits time series continuity to reduce the number of attention scores computed. An algorithm implements LAM in O(n log n) time and memory. New datasets are introduced for long-horizon forecasting evaluation.
Result: LAM-enhanced transformers outperform state-of-the-art models, including vanilla attention, in performance and efficiency.
Conclusion: LAM is effective for time series forecasting, offering computational efficiency and superior performance, with future challenges identified in long-sequence forecasting.
Abstract: Transformers have become the leading choice in natural language processing over other deep learning architectures. This trend has also permeated the field of time series analysis, especially for long-horizon forecasting, showcasing promising results both in performance and running time. In this paper, we introduce the Local Attention Mechanism (LAM), an efficient attention mechanism tailored for time series analysis. This mechanism exploits the continuity properties of time series to reduce the number of attention scores computed. We present an algorithm for implementing LAM in tensor algebra that runs in O(n log n) time and memory, significantly improving upon the O(n^2) time and memory complexity of traditional attention mechanisms. We also note the lack of proper datasets to evaluate long-horizon forecast models. Thus, we propose a novel set of datasets to improve the evaluation of models addressing long-horizon forecasting challenges. Our experimental analysis demonstrates that the vanilla transformer architecture augmented with LAM surpasses state-of-the-art models, including those using the vanilla attention mechanism. These results confirm the effectiveness of our approach and highlight a range of future challenges in long-sequence time series forecasting.
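LAM's O(n log n) tensor-algebra formulation is beyond a short snippet, but the underlying locality idea can be shown with a plain windowed attention that computes O(n·w) scores instead of O(n^2):

```python
# Sketch of windowed local attention: each query attends only to a
# neighborhood of keys. This is the simpler O(n*w) variant, not LAM itself.
import torch
import torch.nn.functional as F

def local_attention(q, k, v, w=16):
    n, d = q.shape
    out = torch.empty_like(v)
    for i in range(n):                      # loop kept for clarity; vectorizable
        lo, hi = max(0, i - w), min(n, i + w + 1)
        scores = q[i] @ k[lo:hi].T / d ** 0.5
        out[i] = F.softmax(scores, dim=-1) @ v[lo:hi]
    return out

q = k = v = torch.randn(128, 32)
print(local_attention(q, k, v).shape)       # torch.Size([128, 32])
```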
[416] Can sparse autoencoders make sense of gene expression latent variable models?
Viktoria Schuster
Main category: cs.LG
TL;DR: SAEs are explored for decomposing biological data, showing efficacy in extracting interpretable features and uncovering subtle signals, with an automated tool (scFeatureLens) introduced for large-scale analysis.
Details
Motivation: To leverage SAEs for interpretable feature extraction in high-dimensional biological data, addressing limitations and enabling large-scale hypothesis generation.
Method: SAEs are applied to simulated and pretrained single-cell data, analyzing efficacy, hyperparameters, and limitations. scFeatureLens automates feature-biological concept linking.
Result: SAEs effectively disentangle and interpret latent features, uncovering subtle biological signals and enabling large-scale analysis.
Conclusion: SAEs and scFeatureLens offer a powerful approach for interpretability and hypothesis generation in complex biological data.
Abstract: Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. By projecting dense embeddings into a much higher-dimensional and sparse space, learned features become disentangled and easier to interpret. This work explores the potential of SAEs for decomposing embeddings in complex and high-dimensional biological data. Using simulated data, it outlines the efficacy, hyperparameter landscape, and limitations of SAEs when it comes to extracting ground truth generative variables from latent space. The application to embeddings from pretrained single-cell models shows that SAEs can find and steer key biological processes and even uncover subtle biological signals that might otherwise be missed. This work further introduces scFeatureLens, an automated interpretability approach for linking SAE features and biological concepts from gene sets to enable large-scale analysis and hypothesis generation in single-cell gene expression models.
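A minimal sparse autoencoder sketch of the kind described: dense embeddings are projected into a wider latent space with an L1 penalty so that individual features disentangle. Sizes and the sparsity weight are illustrative.

```python
# Minimal sparse autoencoder: wide, L1-penalized latent over dense embeddings.
import torch

d_emb, d_sparse = 64, 512
enc = torch.nn.Linear(d_emb, d_sparse)
dec = torch.nn.Linear(d_sparse, d_emb)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

emb = torch.randn(256, d_emb)               # stand-in for cell embeddings
for _ in range(100):
    z = torch.relu(enc(emb))                # sparse latent features
    loss = ((dec(z) - emb) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("active features per input:", (z > 0).float().sum(dim=1).mean().item())
```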
[417] Generalists vs. Specialists: Evaluating LLMs on Highly-Constrained Biophysical Sequence Optimization Tasks
Angelica Chen, Samuel D. Stanton, Frances Ding, Robert G. Alberstein, Andrew M. Watkins, Richard Bonneau, Vladimir Gligorijević, Kyunghyun Cho, Nathan C. Frey
Main category: cs.LG
TL;DR: The paper compares LLMs and specialized solvers like LaMBO-2 for biomolecule optimization, introduces Ehrlich functions as a synthetic benchmark, and proposes LLOME, a bilevel optimization method that improves LLM performance.
Details
Motivation: To address the high computational costs and constraint-satisfaction challenges of LLMs in biomolecule optimization, and the domain expertise required for specialized solvers.
Method: Introduces Ehrlich functions as a synthetic test suite and proposes LLOME, a bilevel optimization routine with a preference learning loss.
Result: LLOME improves LLM performance on Ehrlich functions, sometimes matching or surpassing LaMBO-2, but LLMs still struggle with likelihood-reward miscalibration and lack of explicit rewards.
Conclusion: LLMs can offer benefits in biomolecule optimization, but specialized solvers remain competitive with lower overhead.
Abstract: Although large language models (LLMs) have shown promise in biomolecule optimization problems, they incur heavy computational costs and struggle to satisfy precise constraints. On the other hand, specialized solvers like LaMBO-2 offer efficiency and fine-grained control but require more domain expertise. Comparing these approaches is challenging due to expensive laboratory validation and inadequate synthetic benchmarks. We address this by introducing Ehrlich functions, a synthetic test suite that captures the geometric structure of biophysical sequence optimization problems. With prompting alone, off-the-shelf LLMs struggle to optimize Ehrlich functions. In response, we propose LLOME (Language Model Optimization with Margin Expectation), a bilevel optimization routine for online black-box optimization. When combined with a novel preference learning loss, we find LLOME can not only learn to solve some Ehrlich functions, but can even perform as well as or better than LaMBO-2 on moderately difficult Ehrlich variants. However, LLMs also exhibit some likelihood-reward miscalibration and struggle without explicit rewards. Our results indicate LLMs can occasionally provide significant benefits, but specialized solvers are still competitive and incur less overhead.
[418] HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation
Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal
Main category: cs.LG
TL;DR: HI-PMK is a novel method for handling incomplete and heterogeneous data without imputation, using a probability mass-based dissimilarity measure and missingness-aware uncertainty strategy.
Details
Motivation: Addressing the challenges of incomplete and heterogeneous data in machine learning, where existing methods like imputation introduce bias or privacy risks.
Method: HI-PMK uses a probability mass-based dissimilarity measure for heterogeneous features and a MaxU strategy to handle missingness mechanisms.
Result: Outperforms traditional imputation-based methods and kernel approaches across 15 benchmark datasets.
Conclusion: HI-PMK is a scalable, privacy-preserving solution for incomplete and heterogeneous data, suitable for downstream tasks.
Abstract: Handling incomplete and heterogeneous data remains a central challenge in real-world machine learning, where missing values may follow complex mechanisms (MCAR, MAR, MNAR) and features can be of mixed types (numerical and categorical). Existing methods often rely on imputation, which may introduce bias or privacy risks, or fail to jointly address data heterogeneity and structured missingness. We propose the Heterogeneous Incomplete Probability Mass Kernel (HI-PMK), a novel data-dependent representation learning approach that eliminates the need for imputation. HI-PMK introduces two key innovations: (1) a probability mass-based dissimilarity measure that adapts to local data distributions across heterogeneous features (numerical, ordinal, nominal), and (2) a missingness-aware uncertainty strategy (MaxU) that conservatively handles all three missingness mechanisms by assigning maximal plausible dissimilarity to unobserved entries. Our approach is privacy-preserving, scalable, and readily applicable to downstream tasks such as classification and clustering. Extensive experiments on over 15 benchmark datasets demonstrate that HI-PMK consistently outperforms traditional imputation-based pipelines and kernel methods across a wide range of missing data settings. Code is available at: https://github.com/echoid/Incomplete-Heter-Kernel
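A sketch of the probability-mass idea for a single numerical feature, with the MaxU convention for missing entries; the paper's per-type estimators for ordinal and nominal features are analogous but not shown.

```python
# Sketch: dissimilarity between two values = fraction of data falling between
# them; a missing entry receives the maximal plausible dissimilarity (MaxU).
import numpy as np

def pm_dissim(a, b, feature_values):
    if np.isnan(a) or np.isnan(b):
        return 1.0                                    # MaxU: maximal dissimilarity
    lo, hi = min(a, b), max(a, b)
    return np.mean((feature_values >= lo) & (feature_values <= hi))

col = np.random.randn(1000)                           # observed feature column
print(pm_dissim(0.0, 0.1, col))    # small mass between close values
print(pm_dissim(-2.0, 2.0, col))   # large mass between distant values
print(pm_dissim(0.0, np.nan, col))
```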
[419] Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training
Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li
Main category: cs.LG
TL;DR: Proposes Gradient Wavelet Transform (GWT) to reduce memory usage in LLM training without performance loss.
Details
Motivation: Address memory challenges in LLM training due to large parameters and memory-intensive optimizers like Adam.
Method: Introduces GWT, applying wavelet transforms to gradients to reduce optimizer state memory.
Result: GWT achieves state-of-the-art performance in memory usage and training efficiency.
Conclusion: GWT is a novel, effective solution for memory-efficient LLM training.
Abstract: Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.
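A simplified one-level Haar sketch of the idea: optimizer moments are kept in a smaller wavelet band and the update is mapped back to parameter space. GWT's actual transform and band handling are richer than this single-band truncation.

```python
# Sketch: keep Adam-style moment statistics in the coarse Haar band (half
# the size), then map the update back to the full parameter space.
import numpy as np

def haar_down(g):               # coarse coefficients, half the length
    return (g[0::2] + g[1::2]) / np.sqrt(2)

def haar_up(c):                 # approximate inverse onto the full length
    g = np.empty(c.size * 2)
    g[0::2] = g[1::2] = c / np.sqrt(2)
    return g

grad = np.random.randn(1024)
m, v = np.zeros(512), np.zeros(512)         # optimizer state at half size
c = haar_down(grad)
m = 0.9 * m + 0.1 * c
v = 0.999 * v + 0.001 * c ** 2
update = haar_up(m / (np.sqrt(v) + 1e-8))   # back to parameter space
print(update.shape)                         # (1024,)
```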
[420] A Survey on Memory-Efficient Transformer-Based Model Training in AI for Science
Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Shanshan Li, Dongsheng Li
Main category: cs.LG
TL;DR: The paper surveys memory-efficient pre-training techniques for large-scale transformers in scientific research, addressing high costs and inefficiencies of traditional methods.
Details
Motivation: High costs and inefficiencies in traditional scientific research methods, coupled with the memory demands of large language models (LLMs), motivate the need for memory-efficient solutions.
Method: The survey reviews and categorizes memory-efficient pre-training techniques for transformers, including algorithm-level, system-level, and hardware-software co-optimization, using AlphaFold 2 as an example.
Result: Tailored memory optimization methods can reduce storage needs while maintaining prediction accuracy, as demonstrated by AlphaFold 2.
Conclusion: The paper aims to bridge model efficiency and scientific application needs, offering insights for scalable and cost-effective LLM training in AI for science.
Abstract: Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews transformer-based LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous expansion of model size has led to significant memory demands, hindering further development and application of LLMs for science. This survey systematically reviews and categorizes memory-efficient pre-training techniques for large-scale transformers, including algorithm-level, system-level, and hardware-software co-optimization. Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy. By bridging model efficiency and scientific application needs, we hope to provide insights for scalable and cost-effective LLM training in AI for science.
[421] PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN
Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Silin Liao, Zhibo Jin, Flora D. Salim, Huaming Chen
Main category: cs.LG
TL;DR: PAR-AdvGAN improves adversarial example generation by combining auto-regressive iteration with progressive generation, outperforming AdvGAN and other methods in speed and attack capability.
Details
Motivation: Deep neural networks are vulnerable to adversarial examples, and existing GAN-based methods like AdvGAN have limitations in fully exploiting their potential due to single-iteration perturbation generation.
Method: PAR-AdvGAN integrates an auto-regressive iteration mechanism within a progressive generation network to create more effective adversarial examples.
Result: PAR-AdvGAN outperforms state-of-the-art black-box attacks and AdvGAN, achieving speeds up to 335.5 frames per second on Inception-v3.
Conclusion: PAR-AdvGAN enhances adversarial attack capability and speed, offering a superior alternative to existing methods.
Abstract: Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage the generator and discriminator models to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN. Moreover, PAR-AdvGAN significantly accelerates adversarial example generation, achieving speeds of up to 335.5 frames per second on the Inception-v3 model, outperforming gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR
[422] Multi-branch of Attention Yields Accurate Results for Tabular Data
Xuechen Li, Yupeng Li, Jian Liu, Xiaolin Jin, Xin Hu
Main category: cs.LG
TL;DR: MAYA is a transformer-based framework with a Multi-Branch of Attention (MBA) encoder and collaborative learning, designed to handle feature heterogeneity in tabular data, outperforming other transformer methods in classification and regression tasks.
Details
Motivation: Existing transformer-based methods lack mechanisms to address feature heterogeneity in tabular data, limiting their effectiveness.
Method: MAYA uses an encoder-decoder framework: the encoder employs MBA for parallel attention branches and feature fusion, while the decoder integrates tabular data with label features via cross-attention. Collaborative learning with dynamic consistency weight is also applied.
Result: MAYA achieves superior performance in tabular classification and regression compared to other transformer-based methods.
Conclusion: MAYA effectively addresses feature heterogeneity in tabular data, demonstrating robust performance and outperforming state-of-the-art transformer methods.
Abstract: Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Multi-Branch of Attention (MBA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.
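A hedged PyTorch sketch of a multi-branch attention layer in the spirit of MBA, where parallel attention branches run over tabular feature tokens and their outputs are averaged; dimensions, branch count, and the averaging choice are assumptions.

```python
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(branches)
        )

    def forward(self, tokens):  # tokens: (batch, n_features, dim)
        # Run parallel self-attention branches and average their outputs,
        # fusing heterogeneous feature views while limiting parameter growth.
        outs = [attn(tokens, tokens, tokens)[0] for attn in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)

x = torch.randn(32, 10, 64)        # 32 rows, 10 tabular feature tokens
fused = MultiBranchAttention()(x)  # (32, 10, 64)
```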
[423] A calibration test for evaluating set-based epistemic uncertainty representations
Mira Jürgens, Thomas Mortier, Eyke Hüllermeier, Viktor Bengs, Willem Waegeman
Main category: cs.LG
TL;DR: The paper introduces a novel statistical test to ensure calibration of convex combinations of credal sets in machine learning, allowing instance-dependent combinations for better accuracy.
Details
Motivation: Accurate representation of epistemic uncertainty is crucial in machine learning, and credal sets are a common approach. Ensuring these sets contain the true data-generating distribution requires strong calibration.
Method: Proposes a statistical test for calibration of convex combinations of credal sets, using instance-dependent combinations and proper scoring rules. Introduces a nonparametric testing procedure based on kernel-based calibration error estimators.
Result: Demonstrates improved calibration by capturing instance-level variability in synthetic and real-world experiments.
Conclusion: The framework enhances calibration of epistemic uncertainty representations by allowing flexible, instance-dependent combinations of predictors.
Abstract: The accurate representation of epistemic uncertainty is a challenging yet essential task in machine learning. A widely used representation corresponds to convex sets of probabilistic predictors, also known as credal sets. One popular way of constructing these credal sets is via ensembling or specialized supervised learning methods, where the epistemic uncertainty can be quantified through measures such as the set size or the disagreement among members. In principle, these sets should contain the true data-generating distribution. As a necessary condition for this validity, we adopt the strongest notion of calibration as a proxy. Concretely, we propose a novel statistical test to determine whether there is a convex combination of the set’s predictions that is calibrated in distribution. In contrast to previous methods, our framework allows the convex combination to be instance dependent, recognizing that different ensemble members may be better calibrated in different regions of the input space. Moreover, we learn this combination via proper scoring rules, which inherently optimize for calibration. Building on differentiable, kernel-based estimators of calibration errors, we introduce a nonparametric testing procedure and demonstrate the benefits of capturing instance-level variability in synthetic and real-world experiments.
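The following sketch illustrates the learning step in spirit: an instance-dependent convex combination of ensemble predictions is fit by minimizing log loss, a proper scoring rule. The gating network, sizes, and synthetic data are assumptions, not the paper's test itself.

```python
import torch
import torch.nn as nn

n, m, k = 256, 5, 3                                   # instances, members, classes
probs = torch.softmax(torch.randn(n, m, k), dim=-1)   # ensemble members' predictions
x = torch.randn(n, 8)                                 # instance features
y = torch.randint(0, k, (n,))

gate = nn.Linear(8, m)                                # instance-dependent mixing weights
opt = torch.optim.Adam(gate.parameters(), lr=1e-2)

for _ in range(200):
    w = torch.softmax(gate(x), dim=-1)                # convex weights per instance
    mixed = torch.einsum("nm,nmk->nk", w, probs)      # combined predictor
    loss = nn.functional.nll_loss(torch.log(mixed + 1e-12), y)  # log loss
    opt.zero_grad(); loss.backward(); opt.step()
```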
[424] Conceptualizing Uncertainty: A Concept-based Approach to Explaining Uncertainty
Isaac Roberts, Alexander Schulz, Sarah Schroeder, Fabian Hinder, Barbara Hammer
Main category: cs.LG
TL;DR: Proposes using concept activation vectors to explain uncertainty in high-dimensional data classification, offering both local and global explanations.
Details
Motivation: Existing uncertainty quantification methods lack global explanations, limiting interpretability and trust in model predictions.
Method: Uses concept activation vectors to generate local and global explanations of uncertainty in high-dimensional settings.
Result: Demonstrates the utility of explanations for refining and improving the model.
Conclusion: The approach enhances interpretability and trust by providing comprehensive uncertainty explanations.
Abstract: Uncertainty in machine learning refers to the degree of confidence or lack thereof in a model’s predictions. While uncertainty quantification methods exist, explanations of uncertainty, especially in high-dimensional settings, remain an open challenge. Existing work focuses on feature attribution approaches which are restricted to local explanations. Understanding uncertainty, its origins, and characteristics on a global scale is crucial for enhancing interpretability and trust in a model’s predictions. In this work, we propose to explain the uncertainty in high-dimensional data classification settings by means of concept activation vectors which give rise to local and global explanations of uncertainty. We demonstrate the utility of the generated explanations by leveraging them to refine and improve our model.
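A small illustrative sketch of the concept-activation-vector ingredient: a linear direction separating concept activations from random ones, against which an uncertainty gradient can be scored. All data and names here are stand-ins, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

concept_acts = np.random.randn(100, 64)   # hidden activations for concept inputs
random_acts = np.random.randn(100, 64)    # activations for random inputs

# A CAV is the normal of a linear separator between the two activation sets.
clf = LogisticRegression(max_iter=1000).fit(
    np.vstack([concept_acts, random_acts]),
    np.array([1] * 100 + [0] * 100),
)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # unit concept direction

# d(uncertainty)/d(activations) for one instance, e.g. obtained via autograd.
grad_uncertainty = np.random.randn(64)
concept_score = float(grad_uncertainty @ cav)  # signed sensitivity of the
                                               # uncertainty to this concept
```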
[425] Compton Form Factor Extraction using Quantum Deep Neural Networks
Brandon B. Le, Dustin Keller
Main category: cs.LG
TL;DR: QDNNs outperform CDNNs in extracting CFFs from DVCS data, offering better accuracy and precision with less complexity. A metric for quantum advantage is introduced.
Details
Motivation: To improve the extraction of Compton Form Factors (CFFs) from DVCS experiments by leveraging quantum computing advantages.
Method: Uses QDNNs and CDNNs for pseudodata extraction, comparing their performance. A fitting procedure minimizes model dependence.
Result: QDNNs show superior predictive accuracy and precision over CDNNs in certain cases. A metric quantifies quantum advantage.
Conclusion: QDNNs hold promise for advancing studies in parton distributions and hadronic physics.
Abstract: We present an extraction of Compton Form Factors (CFFs) from Deeply Virtual Compton Scattering (DVCS) experiments conducted at Thomas Jefferson National Accelerator Facility, utilizing Quantum Deep Neural Networks (QDNNs). The analysis employs the standard Belitsky, Kirchner, and Müller formalism at twist-two, complemented by a fitting procedure designed to minimize model dependence in a manner analogous to conventional local fits. A pseudodata extraction test of the CFFs is performed using both Classical Deep Neural Networks (CDNNs) and QDNNs, with a detailed comparative analysis. Results indicate that QDNNs can outperform CDNNs in particular cases, offering enhanced predictive accuracy and precision even with limited model complexity. Motivated by this, we develop a metric to quantify the extent of the quantum advantage based on characteristics of DVCS experimental data. These findings underscore the promising role of QDNNs in advancing future investigations into multidimensional parton distributions and hadronic physics.
[426] An $\tilde{O}$ptimal Differentially Private Learner for Concept Classes with VC Dimension 1
Chao Yan
Main category: cs.LG
TL;DR: First nearly optimal differentially private PAC learner for VC dimension 1 and Littlestone dimension $d$, achieving sample complexity $\tilde{O}(\log^* d)$, nearly matching the lower bound.
Details
Motivation: To improve upon prior upper bounds for differentially private PAC learning, which were suboptimal for general VC classes.
Method: Develops a new algorithm tailored for concept classes with VC dimension 1 and Littlestone dimension $d$.
Result: Achieves sample complexity $\tilde{O}(\log^* d)$, nearly optimal and significantly better than previous bounds.
Conclusion: The work provides a nearly optimal solution for differentially private PAC learning in specific concept classes, closing the gap with the lower bound.
Abstract: We present the first nearly optimal differentially private PAC learner for any concept class with VC dimension 1 and Littlestone dimension $d$. Our algorithm achieves the sample complexity of $\tilde{O}_{\varepsilon,\delta,\alpha,\beta}(\log^* d)$, nearly matching the lower bound of $\Omega(\log^* d)$ proved by Alon et al. [STOC19]. Prior to our work, the best known upper bound is $\tilde{O}(VC \cdot d^5)$ for general VC classes, as shown by Ghazi et al. [STOC21].
[427] Context-Aware Probabilistic Modeling with LLM for Multimodal Time Series Forecasting
Yueyang Yao, Jiajun Li, Xingyuan Dai, MengMeng Zhang, Xiaoyan Gong, Fei-Yue Wang, Yisheng Lv
Main category: cs.LG
TL;DR: CAPTime is a novel method for time series forecasting that integrates text and probabilistic LLM decoding, outperforming existing approaches in accuracy and generalization.
Details
Motivation: Existing methods fail to effectively combine exogenous texts with LLMs' probabilistic nature, limiting contextual awareness and distribution modeling.
Method: CAPTime uses a pretrained time series encoder and learnable interactions to align temporal patterns with textual contexts, enabling joint multimodal representations. It combines a mixture of distribution experts with frozen LLMs for probabilistic forecasting.
Result: CAPTime shows superior accuracy and generalization in diverse forecasting tasks, especially in multimodal scenarios, and is robust in data-scarce settings.
Conclusion: CAPTime addresses key limitations of current methods by integrating text-informed abstraction and probabilistic LLM decoding, offering improved forecasting performance.
Abstract: Time series forecasting is important for applications spanning energy markets, climate analysis, and traffic management. However, existing methods struggle to effectively integrate exogenous texts and align them with the probabilistic nature of large language models (LLMs). Current approaches either employ shallow text-time series fusion via basic prompts or rely on deterministic numerical decoding that conflicts with LLMs’ token-generation paradigm, which limits contextual awareness and distribution modeling. To address these limitations, we propose CAPTime, a context-aware probabilistic multimodal time series forecasting method that leverages text-informed abstraction and autoregressive LLM decoding. Our method first encodes temporal patterns using a pretrained time series encoder, then aligns them with textual contexts via learnable interactions to produce joint multimodal representations. By combining a mixture of distribution experts with frozen LLMs, we enable context-aware probabilistic forecasting while preserving LLMs’ inherent distribution modeling capabilities. Experiments on diverse time series forecasting tasks demonstrate the superior accuracy and generalization of CAPTime, particularly in multimodal scenarios. Additional analysis highlights its robustness in data-scarce scenarios through hybrid probabilistic decoding.
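A hedged sketch of a mixture-of-distribution-experts head of the kind described above: frozen backbone features parameterize mixture weights and per-expert Gaussians, trained by negative log-likelihood. The Gaussian choice and all sizes are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MixtureHead(nn.Module):
    def __init__(self, d_in: int = 256, n_experts: int = 4):
        super().__init__()
        self.w = nn.Linear(d_in, n_experts)          # mixture logits
        self.mu = nn.Linear(d_in, n_experts)         # expert means
        self.log_sigma = nn.Linear(d_in, n_experts)  # expert log-scales

    def nll(self, h, y):
        # h: frozen backbone features (batch, d_in); y: targets (batch,)
        log_w = torch.log_softmax(self.w(h), dim=-1)
        dist = torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())
        log_p = dist.log_prob(y.unsqueeze(-1))       # per-expert likelihood
        # Mixture log-likelihood via logsumexp over experts.
        return -torch.logsumexp(log_w + log_p, dim=-1).mean()

head = MixtureHead()
loss = head.nll(torch.randn(16, 256), torch.randn(16))
```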
[428] Learning Pareto-Optimal Rewards from Noisy Preferences: A Framework for Multi-Objective Inverse Reinforcement Learning
Kalyan Cherukuri, Aarav Lala
Main category: cs.LG
TL;DR: The paper introduces a theoretical framework for Multi-Objective Inverse Reinforcement Learning (MO-IRL) to align generative agents with complex human values by modeling preferences as latent vector-valued rewards. It formalizes recovering Pareto-optimal rewards from noisy preferences, provides sample complexity bounds, and proposes a convergent algorithm for policy optimization.
Details
Motivation: Alignment of generative agents with human values is challenging due to the oversimplification of human intent as scalar rewards. Existing methods ignore the multi-faceted nature of human feedback.
Method: The authors propose MO-IRL, modeling human preferences as latent vector-valued rewards. They formalize the recovery of Pareto-optimal rewards from noisy preferences, establish identification conditions, and derive sample complexity bounds. A provably convergent algorithm for policy optimization is introduced.
Result: The framework provides tight sample complexity bounds for recovering Pareto-optimal rewards and introduces a regret formulation for suboptimality. The proposed algorithm ensures convergence in policy optimization.
Conclusion: The work bridges practical alignment techniques with theoretical guarantees, offering a principled approach for learning aligned behaviors in complex, value-pluralistic environments.
Abstract: As generative agents become increasingly capable, alignment of their behavior with complex human values remains a fundamental challenge. Existing approaches often simplify human intent through reduction to a scalar reward, overlooking the multi-faceted nature of human feedback. In this work, we introduce a theoretical framework for preference-based Multi-Objective Inverse Reinforcement Learning (MO-IRL), where human preferences are modeled as latent vector-valued reward functions. We formalize the problem of recovering a Pareto-optimal reward representation from noisy preference queries and establish conditions for identifying the underlying multi-objective structure. We derive tight sample complexity bounds for recovering $\epsilon$-approximations of the Pareto front and introduce a regret formulation to quantify suboptimality in this multi-objective setting. Furthermore, we propose a provably convergent algorithm for policy optimization using preference-inferred reward cones. Our results bridge the gap between practical alignment techniques and theoretical guarantees, providing a principled foundation for learning aligned behaviors in high-dimensional and value-pluralistic environments.
[429] Position: Adopt Constraints Over Penalties in Deep Learning
Juan Ramirez, Meraj Hashemizadeh, Simon Lacoste-Julien
Main category: cs.LG
TL;DR: The paper critiques fixed-weight penalization in AI systems for enforcing constraints, advocating for Lagrangian methods as a more effective and accountable alternative.
Details
Motivation: Current methods for enforcing constraints in AI systems via fixed-weight penalization are flawed, as they may not ensure constraint satisfaction or optimal performance, and require costly tuning.
Method: The paper proposes using tailored constrained optimization methods, like the Lagrangian approach, which jointly optimizes Lagrange multipliers and model parameters.
Result: Lagrangian methods ensure constraint satisfaction, eliminate the need for extensive penalty tuning, and integrate well with modern deep learning pipelines.
Conclusion: The paper concludes that Lagrangian approaches are superior for solving constrained problems in AI, offering accountability and efficiency.
Abstract: Recent efforts to develop trustworthy AI systems with accountability guarantees have led to widespread use of machine learning formulations incorporating external requirements, or constraints. These requirements are often enforced via penalization: adding fixed-weight terms to the task loss. We argue this approach is fundamentally ill-suited since there may be no penalty coefficient that simultaneously ensures constraint satisfaction and optimal constrained performance, i.e., that truly solves the constrained problem. Moreover, tuning these coefficients requires costly trial-and-error, incurring significant time and computational overhead. We, therefore, advocate for broader adoption of tailored constrained optimization methods, such as the Lagrangian approach, which jointly optimizes the penalization “coefficients” (the Lagrange multipliers) and the model parameters. Such methods (i) truly solve the constrained problem and do so accountably, by clearly defining feasibility and verifying when it is achieved, (ii) eliminate the need for extensive penalty tuning, and (iii) integrate seamlessly with modern deep learning pipelines.
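A minimal sketch of the advocated Lagrangian recipe on a toy problem: descend in the parameters, ascend in the multiplier, and clamp the multiplier to stay nonnegative. Step sizes and the toy objective are illustrative assumptions.

```python
import torch

theta = torch.randn(2, requires_grad=True)
lmbda = torch.zeros((), requires_grad=True)   # Lagrange multiplier, kept >= 0

opt_theta = torch.optim.SGD([theta], lr=0.05)
opt_lmbda = torch.optim.SGD([lmbda], lr=0.05, maximize=True)  # gradient ascent

def task_loss(t):
    return (t ** 2).sum()        # f(theta): toy objective

def constraint(t):
    return 1.0 - t.sum()         # g(theta) <= 0 encodes sum(theta) >= 1

for _ in range(500):
    L = task_loss(theta) + lmbda * constraint(theta)  # the Lagrangian
    opt_theta.zero_grad()
    opt_lmbda.zero_grad()
    L.backward()
    opt_theta.step()             # descend in the model parameters
    opt_lmbda.step()             # ascend in the multiplier
    with torch.no_grad():
        lmbda.clamp_(min=0.0)    # dual feasibility

# theta converges near (0.5, 0.5): the multiplier, not a hand-tuned penalty
# weight, settles at the value that makes the constraint hold.
```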
[430] Adversarial bandit optimization for approximately linear functions
Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto
Main category: cs.LG
TL;DR: The paper analyzes a bandit optimization problem for nonconvex, nonsmooth functions with linear and perturbed components, providing regret bounds and a lower bound.
Details
Motivation: To address the challenge of optimizing nonconvex and nonsmooth functions in bandit settings, especially with adversarial perturbations.
Method: Analyzes the problem by considering linear functions with arbitrary perturbations, deriving expected and high-probability regret bounds.
Result: Presents improved high-probability regret bounds for bandit linear optimization and a lower bound on expected regret.
Conclusion: The study advances understanding of bandit optimization under nonconvexity and adversarial perturbations, with practical implications.
Abstract: We consider a bandit optimization problem for nonconvex and non-smooth functions, where in each trial the loss function is the sum of a linear function and a small but arbitrary perturbation chosen after observing the player’s choice. We give both expected and high probability regret bounds for the problem. Our result also implies an improved high-probability regret bound for the bandit linear optimization, a special case with no perturbation. We also give a lower bound on the expected regret.
[431] Unsupervised risk factor identification across cancer types and data modalities via explainable artificial intelligence
Maximilian Ferle, Jonas Ader, Thomas Wiemers, Nora Grieb, Adrian Lindenmeyer, Hans-Jonas Meyer, Thomas Neumuth, Markus Kreuz, Kristin Reiche, Maximilian Merz
Main category: cs.LG
TL;DR: A novel unsupervised machine learning method optimizes survival heterogeneity across patient clusters using a differentiable logrank statistic, validated in simulations and real-world cancer datasets.
Details
Motivation: Current risk stratification methods often fail to translate survival analysis into actionable clinical criteria, necessitating a more direct and interpretable approach.
Method: The method uses a differentiable adaptation of the multivariate logrank statistic to train neural networks on diverse data modalities, identifying prognostically distinct patient groups.
Result: Applied to multiple myeloma and non-small cell lung cancer datasets, the method identified subgroups with significantly different survival outcomes and clinically meaningful features.
Conclusion: This pan-cancer, model-agnostic approach advances clinical risk stratification by providing interpretable results for treatment personalization and decision-making.
Abstract: Risk stratification is a key tool in clinical decision-making, yet current approaches often fail to translate sophisticated survival analysis into actionable clinical criteria. We present a novel method for unsupervised machine learning that directly optimizes for survival heterogeneity across patient clusters through a differentiable adaptation of the multivariate logrank statistic. Unlike most existing methods that rely on proxy metrics, our approach represents novel methodology for training any neural network architecture on any data modality to identify prognostically distinct patient groups. We thoroughly evaluate the method in simulation experiments and demonstrate its utility in practice by applying it to two distinct cancer types: analyzing laboratory parameters from multiple myeloma patients and computed tomography images from non-small cell lung cancer patients, identifying prognostically distinct patient subgroups with significantly different survival outcomes in both cases. Post-hoc explainability analyses uncover clinically meaningful features determining the group assignments which align well with established risk factors and thus lend strong weight to the method’s utility. This pan-cancer, model-agnostic approach represents a valuable advancement in clinical risk stratification, enabling the discovery of novel prognostic signatures across diverse data types while providing interpretable results that promise to complement treatment personalization and clinical decision-making in oncology and beyond.
[432] HiPreNets: High-Precision Neural Networks through Progressive Training
Ethan Mulle, Wei Kang, Qi Gong
Main category: cs.LG
TL;DR: A progressive framework (HiPreNets) is proposed to train high-precision neural networks by refining prediction residuals, improving accuracy and addressing challenges like non-convex optimization and hyperparameter tuning.
Details
Motivation: Training highly accurate deep neural networks is difficult due to non-convex optimization, hyperparameter tuning, and the neglect of $L^{\infty}$ error in favor of MSE.
Method: A staged training technique refines prediction residuals using additional networks, guided by residual structure for loss function choice, parameter count, and adaptive data sampling.
Result: The framework is validated on benchmark problems, demonstrating improved accuracy.
Conclusion: HiPreNets effectively addresses training challenges and enhances neural network precision.
Abstract: Deep neural networks are powerful tools for solving nonlinear problems in science and engineering, but training highly accurate models becomes challenging as problem complexity increases. Non-convex optimization and numerous hyperparameters to tune make performance improvement difficult, and traditional approaches often prioritize minimizing mean squared error (MSE) while overlooking $L^{\infty}$ error, which is the critical focus in many applications. To address these challenges, we present a progressive framework for training and tuning high-precision neural networks (HiPreNets). Our approach refines a previously explored staged training technique for neural networks that improves an existing fully connected neural network by sequentially learning its prediction residuals using additional networks, leading to improved overall accuracy. We discuss how to take advantage of the structure of the residuals to guide the choice of loss function, number of parameters to use, and ways to introduce adaptive data sampling techniques. We validate our framework’s effectiveness through several benchmark problems.
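A compact sketch of the staged-residual idea, assuming two small fully connected networks and a toy regression target: fit a base network, then fit a second network to its residuals and sum the two. Architectures and the target are illustrative.

```python
import torch
import torch.nn as nn

def fit(net, x, y, steps: int = 2000, lr: float = 1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = (net(x) - y).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

x = torch.linspace(-1, 1, 512).unsqueeze(1)
y = torch.sin(6 * x)                              # toy target function

base = fit(nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1)), x, y)
with torch.no_grad():
    residual = y - base(x)                        # what the base net missed

refiner = fit(nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1)),
              x, residual)                        # second stage fits residuals
y_hat = base(x).detach() + refiner(x).detach()    # staged, higher-precision fit
```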
[433] Automated Generation of Diverse Courses of Actions for Multi-Agent Operations using Binary Optimization and Graph Learning
Prithvi Poddar, Ehsan Tarkesh Esfahani, Karthik Dantu, Souma Chowdhury
Main category: cs.LG
TL;DR: A framework for generating diverse COAs in multi-agent missions using a graph abstraction and genetic algorithm, with task sequencing optimized via a graph neural network.
Details
Motivation: Automated planning for diverse COAs is needed in dynamic environments with varying agent capabilities, ensuring adaptability and performance.
Method: Uses a graph abstraction for task space, genetic algorithm for diverse COA generation, and graph neural network for task sequencing.
Result: Outperforms random walk baseline, achieves near-optimal task sequencing, and plans 20 COAs for 5 agents/100 tasks in ~50 minutes.
Conclusion: The framework effectively generates diverse COAs, balancing diversity and compatibility, with practical computational efficiency.
Abstract: Operations in disaster response, search & rescue, and military missions that involve multiple agents demand automated processes to support the planning of the courses of action (COA). Moreover, traverse-affecting changes in the environment (rain, snow, blockades, etc.) may impact the expected performance of a COA, making it desirable to have a pool of COAs that are diverse in task distributions across agents. Further, variations in agent capabilities, which could be human crews and/or autonomous systems, present practical opportunities and computational challenges to the planning process. This paper presents a new theoretical formulation and computational framework to generate such diverse pools of COAs for operations with soft variations in agent-task compatibility. Key to the problem formulation is a graph abstraction of the task space and the pool of COAs itself to quantify its diversity. Formulating the COAs as a centralized multi-robot task allocation problem, a genetic algorithm is used for (order-ignoring) allocations of tasks to each agent that jointly maximize diversity within the COA pool and overall compatibility of the agent-task mappings. A graph neural network is trained using a policy gradient approach to then perform single agent task sequencing in each COA, which maximizes completion rates adaptive to task features. Our tests of the COA generation process in a simulated environment demonstrate significant performance gain over a random walk baseline, small optimality gap in task sequencing, and execution time of about 50 minutes to plan up to 20 COAs for 5 agent/100 task operations.
[434] TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline C Lisaius, Markus Immitzer, David A. Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
Main category: cs.LG
TL;DR: TESSERA is a global, open-source remote sensing foundation model using self-supervised learning to generate 10m-scale embeddings from satellite data, outperforming task-specific models in diverse applications.
Details
Motivation: Satellite time series data is voluminous and noisy, making it challenging to use for applications like climate modeling and land use. TESSERA aims to simplify this process.
Method: TESSERA uses two Transformer-based encoders to combine optical and radar data from Sentinel-2 and Sentinel-1, fusing them with a multilayer perceptron to create annual global embeddings.
Result: TESSERA matches or outperforms state-of-the-art models in five downstream tasks, demonstrating high performance and efficiency.
Conclusion: TESSERA’s ease of use, openness, and efficiency make it transformative for ecological and agricultural applications.
Abstract: Satellite remote sensing from repeated observations and multiple sensors enables a wide range of downstream applications, including climate modeling, carbon accounting, and strategies for conservation and sustainable land use. However, satellite time series are voluminous, often corrupted by sensor noise, clouds, and atmospheric conditions, and unevenly spaced in time, making them challenging to use. We present TESSERA, an open, global, land-oriented remote sensing foundation model that uses self-supervised learning to generate ‘ready-to-use’ embeddings at 10m scale from pixel-level satellite time series data. TESSERA uses two parallel Transformer-based encoders to combine optical data from ten Sentinel-2 spectral bands at 10-60m spatial resolution and two Sentinel-1 synthetic aperture radar backscatter coefficients at 10m resolution to create embeddings that are subsequently fused with a multilayer perceptron to create annual global embedding maps. We compare our work with state-of-the-art task-specific models and other foundation models in five diverse downstream tasks and find that TESSERA closely matches or outperforms these baselines. We believe that TESSERA’s ease of use, openness, computation-, label-, and data-efficiency, and high performance will prove transformative in a wide range of vegetation-oriented ecological and agricultural applications.
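An illustrative sketch (not the released model) of the fusion pattern described above: two modality-specific Transformer encoders whose pooled outputs an MLP fuses into one embedding per pixel. All dimensions and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn

def make_encoder(d_model: int = 64) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

class DualEncoderFusion(nn.Module):
    def __init__(self, d_opt=10, d_sar=2, d_model=64, d_emb=128):
        super().__init__()
        self.proj_opt = nn.Linear(d_opt, d_model)   # 10 Sentinel-2 bands
        self.proj_sar = nn.Linear(d_sar, d_model)   # 2 Sentinel-1 coefficients
        self.enc_opt = make_encoder(d_model)
        self.enc_sar = make_encoder(d_model)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d_model, d_emb), nn.ReLU(), nn.Linear(d_emb, d_emb)
        )

    def forward(self, s2, s1):  # each: (batch, timesteps, bands)
        h_opt = self.enc_opt(self.proj_opt(s2)).mean(dim=1)  # pool over time
        h_sar = self.enc_sar(self.proj_sar(s1)).mean(dim=1)
        return self.fuse(torch.cat([h_opt, h_sar], dim=-1))  # fused embedding

emb = DualEncoderFusion()(torch.randn(4, 24, 10), torch.randn(4, 24, 2))
```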
[435] Generating Heterogeneous Multi-dimensional Data : A Comparative Study
Michael Corbeau, Emmanuelle Claeys, Mathieu Serrurier, Pascale Zaraté
Main category: cs.LG
TL;DR: The paper compares data generation methods for optimizing firefighter resource allocation, evaluating their effectiveness using domain-specific and standard metrics.
Details
Motivation: To improve firefighter response optimization by generating high-quality synthetic data for scenario simulations.
Method: Comparison of Random Sampling, Tabular Variational Autoencoders, GANs, Conditional Tabular GANs, and Diffusion Probabilistic Models, evaluated with domain-specific and standard metrics.
Result: The study highlights the challenges of generating synthetic data for highly unbalanced, non-Gaussian distributions and the need for tailored evaluation metrics.
Conclusion: Domain-specific metrics are crucial for assessing synthetic data quality in firefighting scenarios, as traditional metrics may not suffice.
Abstract: Allocation of personnel and material resources is highly sensitive in the case of firefighter interventions. This allocation relies on simulations to experiment with various scenarios. The main objective of this allocation is the global optimization of the firefighters’ response. Data generation is then mandatory to study various scenarios. In this study, we propose to compare different data generation methods. Methods such as Random Sampling, Tabular Variational Autoencoders, standard Generative Adversarial Networks, Conditional Tabular Generative Adversarial Networks and Diffusion Probabilistic Models are examined to ascertain their efficacy in capturing the intricacies of firefighter interventions. Traditional evaluation metrics often fall short in capturing the nuanced requirements of synthetic datasets for real-world scenarios. To address this gap, an evaluation of synthetic data quality is conducted using a combination of domain-specific metrics tailored to the firefighting domain and standard measures such as the Wasserstein distance. Domain-specific metrics include response time distribution, spatial-temporal distribution of interventions, and accident representation. These metrics are designed to assess data variability, the preservation of fine and complex correlations and anomalies such as events with a very low occurrence, the conformity with the initial statistical distribution, and the operational relevance of the synthetic data. The distribution has the particularity of being highly unbalanced, with none of the variables following a Gaussian distribution, adding complexity to the data generation process.
[436] “So, Tell Me About Your Policy…”: Distillation of interpretable policies from Deep Reinforcement Learning agents
Giovanni Dispoto, Paolo Bonetti, Marcello Restelli
Main category: cs.LG
TL;DR: A novel algorithm is proposed to extract interpretable policies (e.g., linear policies) from complex expert behaviors in Deep Reinforcement Learning (DRL), addressing the interpretability challenge while retaining performance.
Details
Motivation: The lack of interpretability in DRL policies hinders their deployment in mission-critical and real-world applications, where simpler, interpretable algorithms are often preferred despite lower performance.
Method: The algorithm leverages the advantage function to extract interpretable policies from expert behavior, enabling training with previously collected experience.
Result: Empirical evaluation on classic control environments and a financial trading scenario shows the algorithm successfully extracts meaningful information from complex expert policies.
Conclusion: The proposed method bridges the gap between interpretability and performance in DRL, offering a practical solution for real-world applications.
Abstract: Recent advances in Reinforcement Learning (RL) largely benefit from the inclusion of Deep Neural Networks, boosting the number of novel approaches proposed in the field of Deep Reinforcement Learning (DRL). These techniques demonstrate the ability to tackle complex games such as Atari, Go, and other real-world applications, including financial trading. Nevertheless, a significant challenge emerges from the lack of interpretability, particularly when attempting to comprehend the underlying patterns learned, the relative importance of the state features, and how they are integrated to generate the policy’s output. For this reason, in mission-critical and real-world settings, it is often preferred to deploy a simpler and more interpretable algorithm, although at the cost of performance. In this paper, we propose a novel algorithm, supported by theoretical guarantees, that can extract an interpretable policy (e.g., a linear policy) without disregarding the peculiarities of expert behavior. This result is obtained by considering the advantage function, which includes information about why an action is superior to the others. In contrast to previous works, our approach enables the training of an interpretable policy using previously collected experience. The proposed algorithm is empirically evaluated on classic control environments and on a financial trading scenario, demonstrating its ability to extract meaningful information from complex expert policies.
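As a hedged illustration of distilling an interpretable policy with advantage information, the sketch below uses advantage-weighted logistic regression on previously collected experience; this is a simple advantage-weighted-regression flavor, not the authors' exact algorithm, and all data here are stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

states = np.random.randn(1000, 4)        # previously collected experience
actions = np.random.randint(0, 2, 1000)  # expert's discrete actions
advantages = np.random.randn(1000)       # A(s, a) from a critic (assumed given)

# Weight each expert decision by exp(A / temperature): actions the expert
# had strong reason to take dominate the fit.
weights = np.exp(advantages / 1.0)
student = LogisticRegression(max_iter=1000).fit(
    states, actions, sample_weight=weights
)
print(student.coef_)  # the distilled policy is a directly inspectable linear map
```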
[437] TolerantECG: A Foundation Model for Imperfect Electrocardiogram
Huynh Dang Nguyen, Trong-Thang Pham, Ngan Le, Van Nguyen
Main category: cs.LG
TL;DR: TolerantECG is a robust foundation model for ECG signals, handling noise and incomplete leads, outperforming benchmarks on PTB-XL and MIT-BIH datasets.
Details
Motivation: ECG effectiveness is limited by noise or missing leads, leading to diagnostic errors. TolerantECG aims to address these issues.
Method: Combines contrastive and self-supervised learning to learn ECG representations, including corrupted or incomplete signals and text reports.
Result: TolerantECG ranks top or second-best on PTB-XL and achieves highest performance on MIT-BIH Arrhythmia Database.
Conclusion: TolerantECG is a robust solution for noisy or incomplete ECG data, enhancing diagnostic accuracy.
Abstract: The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
[438] FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning
Md Rafid Haque, Abu Raihan Mostofa Kamal, Md. Azam Hossain
Main category: cs.LG
TL;DR: FedStrategist is a meta-learning framework for dynamic defense selection in Federated Learning, outperforming static methods against adaptive attacks and heterogeneous data.
Details
Motivation: Existing static defenses in FL are ineffective against adaptive adversaries and heterogeneous environments, necessitating a dynamic solution.
Method: FedStrategist uses a contextual bandit agent to dynamically choose the best aggregation rule from a set of defenses based on real-time metrics.
Result: The adaptive agent learns superior policies, handles diverse scenarios, and balances performance-security trade-offs via a risk tolerance parameter.
Conclusion: FedStrategist offers a practical, analyzable approach to resilient decentralized AI, prioritizing model integrity and adaptability.
Abstract: Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. While numerous static defenses exist, their effectiveness is highly context-dependent, often failing against adaptive adversaries or in heterogeneous data environments. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem. We design a lightweight contextual bandit agent that dynamically selects the optimal aggregation rule from an arsenal of defenses based on real-time diagnostic metrics. Through comprehensive experiments, we demonstrate that no single static rule is universally optimal. We show that our adaptive agent successfully learns superior policies across diverse scenarios, including a “Krum-favorable” environment and against a sophisticated “stealth” adversary designed to neutralize specific diagnostic signals. Critically, we analyze the paradoxical scenario where a non-robust baseline achieves high but compromised accuracy, and demonstrate that our agent learns a conservative policy to prioritize model integrity. Furthermore, we prove the agent’s policy is controllable via a single “risk tolerance” parameter, allowing practitioners to explicitly manage the trade-off between performance and security. Our work provides a new, practical, and analyzable approach to creating resilient and intelligent decentralized AI systems.
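A minimal sketch of the control loop, assuming three stock aggregators, a hand-picked diagnostic context, and an epsilon-greedy linear bandit; the paper's actual arms, diagnostics, and reward design will differ.

```python
import numpy as np

def agg_mean(updates):   return updates.mean(axis=0)
def agg_median(updates): return np.median(updates, axis=0)
def agg_trimmed(updates, k=1):  # drop k largest/smallest per coordinate
    s = np.sort(updates, axis=0)
    return s[k:-k].mean(axis=0)

arms = [agg_mean, agg_median, agg_trimmed]
weights = np.zeros((len(arms), 3))          # linear reward model per arm

def choose(context, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(len(arms))
    return int(np.argmax(weights @ context))

for rnd in range(100):
    updates = np.random.randn(10, 5)        # simulated client updates
    context = np.array([updates.std(), np.abs(updates).max(), 1.0])  # diagnostics
    a = choose(context)
    aggregated = arms[a](updates)
    reward = -np.linalg.norm(aggregated)    # stand-in for a validation signal
    # SGD step on the chosen arm's linear reward model.
    weights[a] += 0.01 * (reward - weights[a] @ context) * context
```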
[439] Prediction accuracy versus rescheduling flexibility in elective surgery management
Pieter Smet, Martina Doneda, Ettore Lanzarone, Giuliana Carello
Main category: cs.LG
TL;DR: The paper explores how accurate length-of-stay (LOS) predictions impact rescheduling flexibility in elective surgery admissions, aiming to optimize bed utilization and prevent overflows.
Details
Motivation: Downstream resource availability, especially inpatient beds, is critical for elective surgery planning. Accurate LOS predictions can reduce rescheduling needs, but training such ML models is costly.
Method: The study uses simulated ML to evaluate data-driven approaches, analyzing the relationship between LOS prediction accuracy and rescheduling flexibility under various corrective policies.
Result: The research identifies effective patient rescheduling strategies to mitigate LOS prediction errors, balancing bed availability and resource utilization.
Conclusion: Better LOS predictions enhance rescheduling flexibility, but the study highlights the trade-off between prediction accuracy and the cost of training ML models.
Abstract: The availability of downstream resources is critical in planning the admission of elective surgery patients. The most crucial one is inpatient beds. To ensure bed availability, hospitals may use machine learning (ML) models to predict patients’ length-of-stay (LOS) in the admission planning stage. However, the real value of the LOS for each patient may differ from the predicted one, potentially making the schedule infeasible. To address such infeasibilities, it is possible to implement rescheduling strategies that take advantage of operational flexibility. For example, planners may postpone admission dates, relocate patients to different wards, or even transfer patients who are already admitted among wards. A straightforward assumption is that better LOS predictions can help reduce the impact of rescheduling. However, the training process of ML models that can make such accurate predictions can be very costly. Building on previous work that proposed simulated ML for evaluating data-driven approaches, this paper explores the relationship between LOS prediction accuracy and rescheduling flexibility across various corrective policies. Specifically, we examine the most effective patient rescheduling strategies under LOS prediction errors to prevent bed overflows while optimizing resource utilization.
[440] Diffusion Beats Autoregressive in Data-Constrained Settings
Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, Deepak Pathak
Main category: cs.LG
TL;DR: Diffusion models outperform autoregressive (AR) models in data-scarce settings due to better data utilization and implicit augmentation.
Details
Motivation: To explore the advantages of diffusion-based language models over AR models, especially in data-constrained scenarios.
Method: Systematic study of masked diffusion models in data-constrained settings, comparing their performance with AR models.
Result: Diffusion models achieve lower validation loss and superior downstream performance when compute is abundant but data is scarce.
Conclusion: Diffusion models are a compelling alternative to AR models when data is the bottleneck, offering better scaling and performance.
Abstract: Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored. In this paper, we systematically study masked diffusion models in data-constrained settings, where training involves repeated passes over limited data, and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance. We interpret this advantage as implicit data augmentation: masked diffusion exposes the model to a diverse distribution of token orderings and prediction tasks, unlike AR’s fixed left-to-right factorization. We find new scaling laws for diffusion models and derive a closed-form expression for the critical compute threshold at which diffusion begins to outperform AR. These results suggest that when data, not compute, is the bottleneck, diffusion models offer a compelling alternative to the standard AR paradigm. Our code is available at: https://diffusion-scaling.github.io.
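To make the contrast concrete, here is a hedged sketch of the two training objectives for a generic token model: a fixed left-to-right AR loss versus a masked-diffusion loss that samples a masking ratio and scores only the masked positions. The model interface and mask token are assumptions.

```python
import torch

def ar_loss(model, tokens):
    # Predict token t+1 from tokens <= t: one fixed factorization per example.
    logits = model(tokens[:, :-1])                 # (batch, T-1, vocab)
    return torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )

def masked_diffusion_loss(model, tokens, mask_id: int):
    # Sample a masking ratio, mask that fraction of positions, and score only
    # the masked tokens: each epoch sees a different corruption of the data.
    ratio = torch.rand(())
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted)                      # (batch, T, vocab)
    return torch.nn.functional.cross_entropy(logits[mask], tokens[mask])
```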
[441] Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
Andrii Balashov
Main category: cs.LG
TL;DR: RL fine-tuning modifies only a small subnetwork (5-30% of weights) in LLMs, achieving full performance without updating most parameters. This sparsity arises naturally and is transferable across seeds, datasets, and algorithms.
Details
Motivation: To challenge the assumption that RL fine-tuning requires updating most model parameters and explore the phenomenon of RL-induced parameter update sparsity.
Method: Analyze RL fine-tuning across various algorithms (PPO, DPO, SimPO, PRIME) and model families (OpenAI, Meta, open-source LLMs) to identify sparsity patterns.
Result: RL fine-tuning updates only a small, consistent subnetwork, recovering full model performance. Sparsity is transferable and not significantly affected by KL penalties or gradient clipping.
Conclusion: RL adapts models by focusing on a small subnetwork, enabling efficient fine-tuning and reframing sparsity through the lottery ticket hypothesis.
Abstract: Reinforcement learning (RL) is a key post-pretraining step for aligning large language models (LLMs) with complex tasks and human preferences. While it is often assumed that RL fine-tuning requires updating most of a model’s parameters, we challenge this assumption with a surprising finding: RL fine-tuning consistently modifies only a small subnetwork (typically 5-30% of weights), leaving most parameters unchanged. We call this phenomenon RL-induced parameter update sparsity. It arises naturally, without any sparsity constraints or parameter-efficient tuning, and appears across multiple RL algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI, Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show substantial overlap across different seeds, datasets, and algorithms, far exceeding chance, suggesting a partially transferable structure in the pretrained model. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model. Our analysis suggests this sparsity emerges because RL operates near the model’s original distribution, requiring only targeted changes. KL penalties, gradient clipping, and on-policy dynamics have limited effect on the sparsity pattern. These findings shed new light on how RL adapts models: not by shifting all weights, but by focusing training on a small, consistently updated subnetwork. This insight enables more efficient RL methods and reframes sparsity through the lens of the lottery ticket hypothesis.
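Measuring this kind of update sparsity is straightforward; below is a minimal sketch, assuming two snapshots of the same model and a small tolerance for "unchanged" (the threshold is an assumption, not the paper's).

```python
import torch

def update_sparsity(model_before, model_after, tol: float = 1e-6):
    """Fraction of parameters that moved more than `tol` during fine-tuning."""
    changed, total = 0, 0
    for p0, p1 in zip(model_before.parameters(), model_after.parameters()):
        diff = (p1.detach() - p0.detach()).abs()
        changed += (diff > tol).sum().item()
        total += diff.numel()
    return changed / total  # e.g. ~0.05-0.30 per the paper's finding
```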
[442] Machine Learning Risk Intelligence for Green Hydrogen Investment: Insights for Duqm R3 Auction
Obumneme Nwafor, Mohammed Abdul Majeed Al Hooti
Main category: cs.LG
TL;DR: Oman’s green hydrogen projects face risks due to environmental fluctuations. This paper proposes an AI tool using meteorological data to predict maintenance needs, aiding auction decisions.
Details
Motivation: The lack of historical data for large-scale hydrogen projects in deserts creates a knowledge gap for risk assessment, necessitating a reliable proxy like environmental conditions.
Method: An AI decision support system uses meteorological data to create a Maintenance Pressure Index (MPI) for predicting infrastructure risks and maintenance demands.
Result: The MPI tool enables temporal benchmarking and risk assessment, improving regulatory foresight and auction evaluations.
Conclusion: The proposed AI system addresses data gaps by leveraging environmental data, enhancing decision-making for green hydrogen infrastructure in Oman.
Abstract: As green hydrogen emerges as a major component of global decarbonisation, Oman has positioned itself strategically through national auctions and international partnerships. Following two successful green hydrogen project rounds, the country launched its third auction (R3) in the Duqm region. While this area exhibits relative geospatial homogeneity, it is still vulnerable to environmental fluctuations that pose inherent risks to productivity. Despite growing global investment in green hydrogen, operational data remains scarce, with major projects like Saudi Arabia’s NEOM facility not expected to commence production until 2026, and Oman’s ACME Duqm project scheduled for 2028. This absence of historical maintenance and performance data from large-scale hydrogen facilities in desert environments creates a major knowledge gap for accurate risk assessment for infrastructure planning and auction decisions. Given this data void, environmental conditions emerge as an accessible and reliable proxy for predicting infrastructure maintenance pressures, because harsh desert conditions such as dust storms, extreme temperatures, and humidity fluctuations are well-documented drivers of equipment degradation in renewable energy systems. To address this challenge, this paper proposes an Artificial Intelligence decision support system that leverages publicly available meteorological data to develop a predictive Maintenance Pressure Index (MPI), which predicts risk levels and future maintenance demands on hydrogen infrastructure. This tool strengthens regulatory foresight and operational decision-making by enabling temporal benchmarking to assess and validate performance claims over time. It can be used to incorporate temporal risk intelligence into auction evaluation criteria despite the absence of historical operational benchmarks.
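A hedged sketch of how such an index could be assembled from public weather data: normalize a few stressor columns and blend them with weights. The features, weights, and smoothing window are illustrative assumptions, not the paper's specification of the MPI.

```python
import pandas as pd

def maintenance_pressure_index(df: pd.DataFrame) -> pd.Series:
    """df columns (assumed): temp_max, humidity, dust_events, one row per day."""
    z = (df - df.min()) / (df.max() - df.min())        # min-max normalize
    weights = {"temp_max": 0.4, "humidity": 0.25, "dust_events": 0.35}
    mpi = sum(w * z[col] for col, w in weights.items())  # weighted stressor blend
    return mpi.rolling(window=30, min_periods=1).mean()  # smooth to a monthly signal
```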
[443] A Scalable and High Availability Solution for Recommending Resolutions to Problem Tickets
Harish Saragadam, Chetana K Nayak, Joy Bose
Main category: cs.LG
TL;DR: The paper proposes an ML-driven solution using clustering, supervised learning, and NLP to resolve telecom problem tickets, addressing challenges like data drift and missing data.
Details
Motivation: To improve resolution of telecom problem tickets by leveraging historical data despite challenges like data drift and incomplete records.
Method: Combines clustering, supervised learning (LDA, Siamese networks, One-shot learning), and NLP (Index embedding) with a real-time dashboard and Kubernetes deployment.
Result: High prediction accuracy demonstrated on both open-source (Bitext) and proprietary telecom datasets.
Conclusion: The proposed solution effectively tackles ticket resolution challenges, offering a robust, scalable approach for service industries.
Abstract: Resolution of incidents or problem tickets is a common theme in service industries in any sector, including billing and charging systems in the telecom domain. Machine learning can help identify patterns in historical ticket data and suggest resolutions for new problem tickets. However, this process may be complicated by a variety of phenomena such as data drift and issues such as missing data, lack of data pertaining to resolutions of past incidents, and too many similar-sounding resolutions due to free text. This paper proposes a robust ML-driven solution employing clustering, supervised learning, and advanced NLP models to tackle these challenges effectively. Building on previous work, we demonstrate clustering-based resolution identification and supervised classification with LDA, Siamese networks, One-shot learning, and Index embedding. Additionally, we present a real-time dashboard and a highly available Kubernetes-based production deployment. Our experiments with both the open-source Bitext customer-support dataset and proprietary telecom datasets demonstrate high prediction accuracy.
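A minimal sketch of the clustering step on free-text resolutions, using TF-IDF and k-means as stand-ins; the data, vectorizer, and cluster count are illustrative, not the paper's pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

resolutions = [
    "restarted charging gateway service",
    "restarted the gateway service for charging",
    "corrected tariff configuration for roaming",
    "fixed roaming tariff config entry",
]
X = TfidfVectorizer().fit_transform(resolutions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Similar-sounding free-text resolutions land in the same cluster, so a new
# ticket can be mapped to its cluster's canonical fix.
```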
[444] Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
Aditya Sharma, Linh Nguyen, Ananya Gupta, Chengyu Wang, Chiamaka Adebayo, Jakub Kowalski
Main category: cs.LG
TL;DR: CWMI enhances LLMs with causal physics understanding via a Causal Physics Module and Causal Intervention Loss, improving zero-shot physical reasoning.
Details
Motivation: LLMs lack intuitive understanding of physical dynamics, limiting real-world causal reasoning.
Method: Introduces CWMI with a Causal Physics Module and Causal Intervention Loss to learn cause-and-effect from multimodal data.
Result: Outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks like PIQA and PhysiCa-Bench.
Conclusion: Inducing a causal world model is key for more reliable and generalizable AI systems.
Abstract: Large Language Models (LLMs), despite their advanced linguistic capabilities, fundamentally lack an intuitive understanding of physical dynamics, which limits their effectiveness in real-world scenarios that require causal reasoning. In this paper, we introduce Causal World Model Induction (CWMI), a novel framework designed to embed an explicit model of causal physics within an LLM. Our approach incorporates a dedicated Causal Physics Module (CPM) and a new training objective called Causal Intervention Loss, encouraging the model to learn cause-and-effect relationships from multimodal data. By training the model to predict the outcomes of hypothetical interventions instead of merely capturing statistical correlations, CWMI develops a robust internal representation of physical laws. Experimental results show that CWMI significantly outperforms state-of-the-art LLMs on zero-shot physical reasoning tasks, including the PIQA benchmark and our newly proposed PhysiCa-Bench dataset. These findings demonstrate that inducing a causal world model is a critical step toward more reliable and generalizable AI systems.
cs.MA
[445] Replicating the behaviour of electric vehicle drivers using an agent-based reinforcement learning model
Zixin Feng, Qunshan Zhao, Alison Heppenstall
Main category: cs.MA
TL;DR: A multi-stage reinforcement learning framework is proposed to simulate EV charging demand across large areas, capturing adaptive behaviors and identifying ‘charging deserts.’
Details
Motivation: Existing EV charging simulations lack adaptability for private drivers and large-scale geographical analysis.
Method: A multi-stage reinforcement learning framework is developed and validated against real-world data.
Result: The model identifies critical ‘charging deserts’ and aligns with policy shifts for rapid charging hubs.
Conclusion: The framework effectively models private EV driver behavior and highlights areas needing charging infrastructure.
Abstract: Despite the rapid expansion of electric vehicle (EV) charging networks, questions remain about their efficiency in meeting the growing needs of EV drivers. Previous simulation-based approaches, which rely on static behavioural rules, have struggled to capture the adaptive behaviours of human drivers. Although reinforcement learning has been introduced in EV simulation studies, its application has primarily focused on optimising fleet operations rather than modelling private drivers who make independent charging decisions. Additionally, long-distance travel remains a primary concern for EV drivers. However, existing simulation studies rarely explore charging behaviour over large geographical scales. To address these gaps, we propose a multi-stage reinforcement learning framework that simulates EV charging demand across large geographical areas. We validate the model against real-world data, and identify the training stage that most closely reflects actual driver behaviour, which captures both the adaptive behaviours and bounded rationality of private drivers. Based on the simulation results, we also identify critical ‘charging deserts’ where EV drivers consistently have low state of charge. Our findings also highlight recent policy shifts toward expanding rapid charging hubs along motorway corridors and city boundaries to meet the demand from long-distance trips.
[446] Agent-Based Exploration of Recommendation Systems in Misinformation Propagation
Lise Jakobsen, Anna Johanne Holden, Önder Gürcan, Özlem Özgöbek
Main category: cs.MA
TL;DR: Agent-based modeling reveals popularity-driven algorithms amplify misinformation, while collaborative and content-based filtering limit it.
Details
Motivation: To understand how recommendation algorithms affect misinformation spread on social networks.
Method: Simulated a synthetic environment with heterogeneous agents (users, bots, influencers) and tested four recommendation strategies.
Result: Popularity-based algorithms worsen misinformation; collaborative and content-based filtering reduce it.
Conclusion: Algorithm design critically impacts misinformation spread; agent-based modeling offers realistic insights.
Abstract: This study uses agent-based modeling to examine the impact of various recommendation algorithms on the propagation of misinformation on online social networks. We simulate a synthetic environment consisting of heterogeneous agents, including regular users, bots, and influencers, interacting through a social network with recommendation systems. We evaluate four recommendation strategies: popularity-based, collaborative filtering, and content-based filtering, along with a random baseline. Our results show that popularity-driven algorithms significantly amplify misinformation, while item-based collaborative filtering and content-based approaches are more effective in limiting exposure to fake content. Item-based collaborative filtering was found to perform better than previously reported in related literature. These findings highlight the role of algorithm design in shaping online information exposure and show that agent-based modeling can be used to gain realistic insight into how misinformation spreads.
[447] Towards Cognitive Synergy in LLM-Based Multi-Agent Systems: Integrating Theory of Mind and Critical Evaluation
Adam Kostka, Jarosław A. Chudziak
Main category: cs.MA
TL;DR: The paper explores cognitive mechanisms like adaptive theory of mind (ToM) and structured critique to enhance multi-agent systems’ collaborative reasoning, showing improved coherence and adaptability in complex tasks.
Details
Motivation: Current AI systems lack human-like collaborative reasoning abilities, such as recursive reasoning and mental state inference, limiting their collective intelligence.
Method: Investigates adaptive ToM and structured critique through empirical case studies on complex decision-making.
Result: Integration of these mechanisms leads to more coherent, adaptive, and rigorous agent interactions, surpassing individual capabilities.
Conclusion: The framework advances MAS by emulating human-like collaboration, emphasizing dynamic ToM and critical evaluation for real-world challenges.
Abstract: Recently, the field of Multi-Agent Systems (MAS) has gained popularity as researchers are trying to develop artificial intelligence capable of efficient collective reasoning. Agents based on Large Language Models (LLMs) perform well in isolated tasks, yet struggle with higher-order cognition required for adaptive collaboration. Human teams achieve synergy not only through knowledge sharing, but also through recursive reasoning, structured critique, and the ability to infer others’ mental states. Current artificial systems lack these essential mechanisms, limiting their ability to engage in sophisticated collective reasoning. This work explores cognitive processes that enable effective collaboration, focusing on adaptive theory of mind (ToM) and systematic critical evaluation. We investigate three key questions. First, how does the ability to model others’ perspectives enhance coordination and reduce redundant reasoning? Second, to what extent does structured critique improve reasoning quality by identifying logical gaps and mitigating biases? Third, can the interplay of these mechanisms lead to emergent cognitive synergy, where the collective intelligence of the system exceeds the sum of its parts? Through an empirical case study on complex decision making, we show that the integration of these cognitive mechanisms leads to more coherent, adaptive, and rigorous agent interactions. This article contributes to the field of cognitive science and AI research by presenting a structured framework that emulates human-like collaborative reasoning in MAS. It highlights the significance of dynamic ToM and critical evaluation in advancing multi-agent systems’ ability to tackle complex, real-world challenges.
[448] Validating Generative Agent-Based Models of Social Norm Enforcement: From Replication to Novel Predictions
Logan Cross, Nick Haber, Daniel L. K. Yamins
Main category: cs.MA
TL;DR: A two-stage validation approach for LLM-based generative agent models is proposed, using social dilemmas to validate cognitive architectures and simulate novel conditions, revealing insights into human social behavior.
Details
Motivation: To address the challenge of validating LLM-based generative agent models (GABM) for simulating human social behavior.
Method: A systematic two-stage validation using social dilemma paradigms, comparing cognitive architectures (persona-based differences and theory of mind) and testing novel conditions.
Result: Validated architectures replicated third-party punishment and public goods game behaviors, with novel predictions showing anonymous punishment reduces TPP and open discussions boost cooperation.
Conclusion: The framework validates generative agent models and demonstrates their potential to generate novel insights into human social behavior.
Abstract: As large language models (LLMs) advance, there is growing interest in using them to simulate human social behavior through generative agent-based modeling (GABM). However, validating these models remains a key challenge. We present a systematic two-stage validation approach using social dilemma paradigms from psychological literature, first identifying the cognitive components necessary for LLM agents to reproduce known human behaviors in mixed-motive settings from two landmark papers, then using the validated architecture to simulate novel conditions. Our model comparison of different cognitive architectures shows that both persona-based individual differences and theory of mind capabilities are essential for replicating third-party punishment (TPP) as a costly signal of trustworthiness. For the second study on public goods games, this architecture is able to replicate an increase in cooperation from the spread of reputational information through gossip. However, an additional strategic component is necessary to replicate the additional boost in cooperation rates in the condition that allows both ostracism and gossip. We then test novel predictions for each paper with our validated generative agents. We find that TPP rates significantly drop in settings where punishment is anonymous, yet a substantial amount of TPP persists, suggesting that both reputational and intrinsic moral motivations play a role in this behavior. For the second paper, we introduce a novel intervention and see that open discussion periods before rounds of the public goods game further increase contributions, allowing groups to develop social norms for cooperation. This work provides a framework for validating generative agent models while demonstrating their potential to generate novel and testable insights into human social behavior.
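The public goods game at the center of the second study follows the standard payoff structure: contributions are pooled, multiplied, and redistributed equally, so defectors free-ride on contributors. A minimal sketch of one round; the multiplier of 1.6 is a conventional choice from the experimental literature, not taken from the paper.

```python
# One round of a public goods game: contributions are pooled, multiplied,
# and split equally among all players.
def public_goods_round(contributions, endowment=10.0, multiplier=1.6):
    pool = sum(contributions) * multiplier
    share = pool / len(contributions)
    return [endowment - c + share for c in contributions]

# Three cooperators and one free-rider: the defector earns the most.
print(public_goods_round([10, 10, 10, 0]))   # [12.0, 12.0, 12.0, 22.0]
```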
cs.MM
[449] Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
Zeyu Deng, Yanhui Lu, Jiashu Liao, Shuang Wu, Chongfeng Wei
Main category: cs.MM
TL;DR: Sync-TVA is a graph-attention framework for multimodal emotion recognition, improving cross-modal interaction and balancing contributions via dynamic enhancement and structured fusion.
Details
Motivation: Existing MER methods lack effective cross-modal interaction and balanced modality contributions, limiting performance.
Method: Proposes Sync-TVA with modality-specific dynamic enhancement, heterogeneous cross-modal graphs, and cross-attention fusion for text, audio, and visual features.
Result: Outperforms state-of-the-art models on MELD and IEMOCAP datasets in accuracy and F1 score, especially in class-imbalanced scenarios.
Conclusion: Sync-TVA effectively addresses cross-modal interaction and imbalance, enhancing MER performance.
Abstract: Multimodal emotion recognition (MER) is crucial for enabling emotionally intelligent systems that perceive and respond to human emotions. However, existing methods suffer from limited cross-modal interaction and imbalanced contributions across modalities. To address these issues, we propose Sync-TVA, an end-to-end graph-attention framework featuring modality-specific dynamic enhancement and structured cross-modal fusion. Our design incorporates a dynamic enhancement module for each modality and constructs heterogeneous cross-modal graphs to model semantic relations across text, audio, and visual features. A cross-attention fusion mechanism further aligns multimodal cues for robust emotion inference. Experiments on MELD and IEMOCAP demonstrate consistent improvements over state-of-the-art models in both accuracy and weighted F1 score, especially under class-imbalanced conditions.
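The "cross-attention fusion mechanism" mentioned above can be sketched as one modality's tokens attending over the others'. The module below is a hedged illustration with assumed dimensions and text as the query stream; it is not Sync-TVA's actual architecture.

```python
# Minimal cross-modal fusion: text tokens attend to audio and visual tokens,
# and the three streams are residually combined. Sizes are illustrative.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn_ta = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_tv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, audio, visual):
        t_a, _ = self.attn_ta(text, audio, audio)     # text queries audio
        t_v, _ = self.attn_tv(text, visual, visual)   # text queries visual
        return self.norm(text + t_a + t_v)

fusion = CrossModalFusion()
text, audio, visual = (torch.randn(2, 10, 256) for _ in range(3))
print(fusion(text, audio, visual).shape)   # torch.Size([2, 10, 256])
```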
[450] PC-JND: Subjective Study and Dataset on Just Noticeable Difference for Point Clouds in 6DoF Virtual Reality
Chunling Fan, Yun Zhang, Dietmar Saupe, Raouf Hamzaoui, Weisi Lin
Main category: cs.MM
TL;DR: The paper explores Just Noticeable Difference (JND) in point clouds for VR, finding texture JND is smaller than geometry JND and introducing a new dataset (PC-JND) for future research.
Details
Motivation: To understand JND characteristics in point clouds for VR, as this area was unexplored despite its relevance for immersive media.
Method: Studied point cloud-wise JND (PCJND) in a 6DoF VR environment using a head-mounted display, analyzing texture and geometry JND.
Result: Texture PCJND is smaller than geometry PCJND; colorfulness correlates with texture JND but not geometry JND or point count.
Conclusion: The study provides insights into VR point cloud JND and introduces PC-JND dataset to aid perceptual optimization in immersive media.
Abstract: The Just Noticeable Difference (JND) accounts for the minimum distortion at which humans can perceive a difference between a pristine stimulus and its distorted version. The JND concept has been widely applied in visual signal processing tasks, including coding, transmission, rendering, and quality assessment, to optimize human-centric media experiences. A point cloud is a mainstream volumetric data representation consisting of both geometry information and attributes (e.g. color). Point clouds are used for advanced immersive 3D media such as Virtual Reality (VR). However, the JND characteristics of viewing point clouds in VR have not been explored before. In this paper, we study the point cloud-wise JND (PCJND) characteristics in a Six Degrees of Freedom (6DoF) VR environment using a head-mounted display. Our findings reveal that the texture PCJND of human eyes is smaller than the geometry PCJND for most point clouds. Furthermore, we identify a correlation between colorfulness and texture PCJND. However, there is no significant correlation between colorfulness and the geometry PCJND, nor between the number of points and either the texture or geometry PCJND. To support future research in JND prediction and perception-driven signal processing, we introduce PC-JND, a novel point cloud-based JND dataset. This dataset will be made publicly available to facilitate advancements in perceptual optimization for immersive media.
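Operationally, a stimulus-wise JND like the PCJND is the distortion level at which a fixed fraction of observers (commonly 50%) reports a visible difference. A minimal sketch of reading that threshold off subjective data; the response curve here is synthetic, whereas the paper runs a full 6DoF VR study.

```python
# Estimate a JND threshold as the distortion level at which half of the
# observers notice a difference, via linear interpolation.
import numpy as np

distortion_levels = np.array([1, 2, 3, 4, 5, 6], dtype=float)
prop_noticed = np.array([0.02, 0.10, 0.35, 0.60, 0.85, 0.97])  # synthetic data

jnd = np.interp(0.5, prop_noticed, distortion_levels)  # 50%-detection point
print(f"estimated JND at distortion level {jnd:.2f}")
```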
[451] Efficient Sub-pixel Motion Compensation in Learned Video Codecs
Théo Ladune, Thomas Leguay, Pierrick Philippe, Gordon Clare, Félix Henry
Main category: cs.MM
TL;DR: The paper improves learned video codec motion compensation by adopting advanced techniques from conventional codecs, achieving better compression and lower complexity.
Details
Motivation: Learned codecs use simple bilinear filtering for sub-pixel motion compensation, while conventional codecs (HEVC/VVC) employ refined methods. This paper aims to bridge the gap.
Method: The proposed method integrates advanced interpolation filters, block-based motion information, and finite motion accuracy into learned codecs.
Result: Experiments show a 10% rate decrease and reduced decoding complexity (from 391 to 214 MAC per pixel) in the Cool-chic codec.
Conclusion: Adopting conventional codec techniques enhances learned codec performance, with open-source contributions available.
Abstract: Motion compensation is a key component of video codecs. Conventional codecs (HEVC and VVC) have carefully refined this coding step, with an important focus on sub-pixel motion compensation. On the other hand, learned codecs achieve sub-pixel motion compensation through simple bilinear filtering. This paper proposes to improve learned codec motion compensation by drawing inspiration from conventional codecs. It is shown that the use of more advanced interpolation filters, block-based motion information and finite motion accuracy leads to better compression performance and lower decoding complexity. Experimental results are provided on the Cool-chic video codec, where we demonstrate a rate decrease of more than 10% and a lowering of motion-related decoding complexity from 391 MAC per pixel to 214 MAC per pixel. All contributions are made open-source at https://github.com/Orange-OpenSource/Cool-Chic
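The gap the paper closes is concrete: bilinear sub-pixel interpolation is a 2-tap filter, while conventional codecs use longer taps. The sketch below contrasts the two at a half-pixel position, using the HEVC luma half-sample coefficients; treating them as representative of the paper's "advanced interpolation filters" is our assumption.

```python
# Half-pixel interpolation: bilinear (2-tap) versus an 8-tap filter of the
# kind HEVC uses for luma half-sample positions.
import numpy as np

signal = np.array([10, 12, 20, 45, 80, 95, 99, 100, 100, 99], dtype=float)

def half_pel_bilinear(x, i):
    """Value midway between samples i and i+1 with a 2-tap average."""
    return 0.5 * (x[i] + x[i + 1])

def half_pel_8tap(x, i):
    """8-tap half-sample interpolation centred between i and i+1
    (HEVC half-pel luma taps, normalised)."""
    taps = np.array([-1, 4, -11, 40, 40, -11, 4, -1], dtype=float) / 64.0
    window = x[i - 3:i + 5]                      # 8 samples around the half-pel
    return float(taps @ window)

i = 4
print("bilinear:", half_pel_bilinear(signal, i))
print("8-tap:   ", half_pel_8tap(signal, i))
```

The longer filter costs more multiply-accumulate (MAC) operations per sample, which is why the abstract reports decoding complexity in MAC per pixel.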
[452] Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling
Xiaojie Li, Ronghui Li, Shukai Fang, Shuzhao Xie, Xiaoyang Guo, Jiaqing Zhou, Junkun Peng, Zhi Wang
Main category: cs.MM
TL;DR: SoulDance dataset and SoulNet framework address challenges in generating music-aligned holistic dance by capturing high-precision data and modeling motion dependencies.
Details
Motivation: The scarcity of holistic 3D dance datasets and difficulty in cross-modal alignment between music and dance motivate the need for a solution.
Method: SoulNet uses Hierarchical Residual Vector Quantization, a Music-Aligned Generative Model, and a Music-Motion Retrieval Module to generate coordinated dance sequences.
Result: SoulNet outperforms existing methods in producing high-quality, music-aligned 3D dance sequences.
Conclusion: SoulDance and SoulNet provide a robust solution for generating expressive, music-coordinated holistic dance.
Abstract: Well-coordinated, music-aligned holistic dance enhances emotional expressiveness and audience engagement. However, generating such dances remains challenging due to the scarcity of holistic 3D dance datasets, the difficulty of achieving cross-modal alignment between music and dance, and the complexity of modeling interdependent motion across the body, hands, and face. To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. SoulNet consists of three principal components: (1) Hierarchical Residual Vector Quantization, which models complex, fine-grained motion dependencies across the body, hands, and face; (2) Music-Aligned Generative Model, which composes these hierarchical motion units into expressive and coordinated holistic dance; (3) Music-Motion Retrieval Module, a pre-trained cross-modal model that functions as a music-dance alignment prior, ensuring temporal synchronization and semantic coherence between generated dance and input music throughout the generation process. Extensive experiments demonstrate that SoulNet significantly surpasses existing approaches in generating high-quality, music-coordinated, and well-aligned holistic 3D dance sequences.
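Residual vector quantization, the generic mechanism underlying the paper's Hierarchical Residual VQ, encodes a vector in stages, with each codebook quantizing the previous stage's residual. A minimal sketch; the hierarchy over body, hands, and face is omitted.

```python
# Generic residual vector quantization: each stage quantizes what the
# previous stages failed to capture.
import numpy as np

rng = np.random.default_rng(2)
codebooks = [rng.standard_normal((16, 8)) for _ in range(3)]  # 3 stages, 16 codes, dim 8

def rvq_encode(x, codebooks):
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]            # pass the leftover downstream
    return codes, residual

x = rng.standard_normal(8)
codes, residual = rvq_encode(x, codebooks)
recon = sum(cb[i] for cb, i in zip(codebooks, codes))
print("codes:", codes, "| reconstruction error:", np.linalg.norm(x - recon))
```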
[453] EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment
Lancheng Gao, Ziheng Jia, Yunhao Zeng, Wei Sun, Yiming Zhang, Wei Zhou, Guangtao Zhai, Xiongkuo Min
Main category: cs.MM
TL;DR: EEmo-Bench is a new benchmark for evaluating multi-modal large language models (MLLMs) on image-evoked emotions, using diverse tasks and emotional attributes.
Details
Motivation: Current evaluations of MLLMs' emotion understanding are coarse-grained; EEmo-Bench aims to provide a systematic and comprehensive assessment.
Method: The benchmark uses Valence-Arousal-Dominance (VAD) attributes, 1,960 annotated images, and four tasks (Perception, Ranking, Description, Assessment) with 6,773 QA pairs.
Result: Some MLLMs perform well overall, but analytical capabilities in certain dimensions are lacking.
Conclusion: EEmo-Bench advances research on MLLMs’ emotion perception and understanding, crucial for applications like human-machine interaction.
Abstract: The rapid development of multi-modal large language models (MLLMs) has led to the emergence of numerous benchmark studies, particularly those evaluating their perception and understanding capabilities. Among these, understanding image-evoked emotions aims to enhance MLLMs’ empathy, with significant applications such as human-machine interaction and advertising recommendations. However, current evaluations of this MLLM capability remain coarse-grained, and a systematic and comprehensive assessment is still lacking. To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. Our core contributions include: 1) Regarding the diversity of the evoked emotions, we adopt an emotion ranking strategy and employ the Valence-Arousal-Dominance (VAD) as emotional attributes for emotional assessment. In line with this methodology, 1,960 images are collected and manually annotated. 2) We design four tasks to evaluate MLLMs’ ability to capture the evoked emotions by single images and their associated attributes: Perception, Ranking, Description, and Assessment. Additionally, image-pairwise analysis is introduced to investigate the model’s proficiency in performing joint and comparative analysis. In total, we collect 6,773 question-answer pairs and perform a thorough assessment on 19 commonly-used MLLMs. The results indicate that while some proprietary and large-scale open-source MLLMs achieve promising overall performance, the analytical capabilities in certain evaluation dimensions remain suboptimal. Our EEmo-Bench paves the way for further research aimed at enhancing the comprehensive perceiving and understanding capabilities of MLLMs concerning image-evoked emotions, which is crucial for machine-centric emotion perception and understanding.
eess.AS
[454] Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
Teng Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
Main category: eess.AS
TL;DR: RAVEN is a real-time AVSE system that enhances on-screen speaker audio while suppressing noise and interfering speakers, leveraging AVSR and ASD embeddings for optimal performance.
Details
Motivation: Speech enhancement in audio-only settings is challenging, especially with interfering speakers, prompting the need for an effective AVSE solution.
Method: RAVEN uses visual embeddings from AVSR and ASD models, concatenating them for multi-speaker environments and using AVSR alone for noise-only scenarios. It operates in real-time on a CPU.
Result: Concatenated AVSR and ASD embeddings improve performance in low-SNR, multi-speaker settings, while AVSR alone excels in noise-only conditions.
Conclusion: RAVEN is the first open-source real-time AVSE system, demonstrating effectiveness across various noise and speaker conditions.
Abstract: Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
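The best-performing configuration is architecturally simple: the two visual embedding streams are concatenated along the feature dimension before enhancement. A sketch with assumed tensor shapes; the real AVSR and ASD encoders are separate pre-trained models.

```python
# Fusing two visual embedding streams by concatenation; the feature sizes
# below are illustrative assumptions.
import torch

avsr_emb = torch.randn(1, 75, 512)   # (batch, frames, dim) from an AVSR encoder
asd_emb = torch.randn(1, 75, 128)    # matching frames from an ASD encoder
fused = torch.cat([avsr_emb, asd_emb], dim=-1)
print(fused.shape)                   # torch.Size([1, 75, 640])
```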
eess.IV
[455] Comparative Analysis of Vision Transformers and Convolutional Neural Networks for Medical Image Classification
Kunal Kawadkar
Main category: eess.IV
TL;DR: A comparative study of CNNs and ViTs in medical imaging tasks shows task-specific performance advantages, guiding architecture selection for clinical AI.
Details
Motivation: To explore the effectiveness of Vision Transformers (ViTs) compared to traditional CNNs in medical imaging, addressing a gap in current research.
Method: Evaluated four models (ResNet-50, EfficientNet-B0, ViT-Base, DeiT-Small) on three medical tasks using 8,469 images.
Result: ResNet-50 excelled in chest X-ray (98.37%), DeiT-Small in brain tumor (92.16%), and EfficientNet-B0 in skin cancer (81.84%).
Conclusion: Task-specific architecture selection is crucial for medical AI, with no single model universally superior.
Abstract: The emergence of Vision Transformers (ViTs) has revolutionized computer vision, yet their effectiveness compared to traditional Convolutional Neural Networks (CNNs) in medical imaging remains under-explored. This study presents a comprehensive comparative analysis of CNN and ViT architectures across three critical medical imaging tasks: chest X-ray pneumonia detection, brain tumor classification, and skin cancer melanoma detection. We evaluated four state-of-the-art models - ResNet-50, EfficientNet-B0, ViT-Base, and DeiT-Small - across datasets totaling 8,469 medical images. Our results demonstrate task-specific model advantages: ResNet-50 achieved 98.37% accuracy on chest X-ray classification, DeiT-Small excelled at brain tumor detection with 92.16% accuracy, and EfficientNet-B0 led skin cancer classification at 81.84% accuracy. These findings provide crucial insights for practitioners selecting architectures for medical AI applications, highlighting the importance of task-specific architecture selection in clinical decision support systems.
[456] Querying GI Endoscopy Images: A VQA Approach
Gaurav Parajuli
Main category: eess.IV
TL;DR: The paper explores adapting the Florence2 model for medical VQA tasks, specifically for GI endoscopy images, to improve diagnostic accuracy.
Details
Motivation: Current multimodal LLMs perform poorly in specialized domains like medical imaging, despite their potential for aiding clinicians in diagnosing GI diseases.
Method: The study adapts the Florence2 model for medical VQA tasks and evaluates its performance using ROUGE, BLEU, and METEOR metrics.
Result: The paper evaluates the model’s performance but does not specify quantitative results in the abstract.
Conclusion: Adapting general-domain models like Florence2 for specialized medical VQA tasks shows promise for improving diagnostic AI systems.
Abstract: VQA (Visual Question Answering) combines Natural Language Processing (NLP) with image understanding to answer questions about a given image. It has enormous potential for the development of medical diagnostic AI systems. Such a system can help clinicians diagnose gastro-intestinal (GI) diseases accurately and efficiently. Although many of the multimodal LLMs available today have excellent VQA capabilities in the general domain, they perform very poorly for VQA tasks in specialized domains such as medical imaging. This study is a submission for ImageCLEFmed-MEDVQA-GI 2025 subtask 1 that explores the adaptation of the Florence2 model to answer medical visual questions on GI endoscopy images. We also evaluate the model performance using standard metrics like ROUGE, BLEU and METEOR.
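Of the three metrics named, ROUGE-L is easy to make concrete: it is an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. A from-scratch sketch below; BLEU and METEOR are normally computed with library implementations (e.g. nltk or Hugging Face evaluate) rather than by hand.

```python
# ROUGE-L as an F-measure over the longest common subsequence of tokens.
def lcs_len(a, b):
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ta == tb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)         # F1 of LCS precision/recall

print(rouge_l("small polyp in the colon", "a small polyp seen in the colon"))
```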
[457] ST-DAI: Single-shot 2.5D Spatial Transcriptomics with Intra-Sample Domain Adaptive Imputation for Cost-efficient 3D Reconstruction
Jiahe Qian, Yaoyu Fang, Xinkun Wang, Lee A. Cooper, Bo Zhou
Main category: eess.IV
TL;DR: ST-DAI is a cost-efficient 3D spatial transcriptomics framework using 2.5D sampling and intra-sample domain-adaptive imputation to reduce experimental costs while maintaining performance.
Details
Motivation: High costs and domain discrepancies in fully sampling 3D spatial transcriptomics (ST) datasets motivate the need for a cost-efficient and generalizable solution.
Method: ST-DAI combines 2.5D sampling (fully sampling a central section and sparsely sampling adjacent sections) with intra-sample domain-adaptive imputation, including alignment, pseudo-supervision, and Fast Multi-Domain Refinement (FMDR).
Result: ST-DAI achieves gene expression prediction performance comparable to fully sampled methods while significantly reducing measurement burden.
Conclusion: ST-DAI provides a practical and efficient solution for 3D ST, balancing cost and accuracy.
Abstract: For 3D spatial transcriptomics (ST), the high per-section acquisition cost of fully sampling every tissue section remains a significant challenge. Although recent approaches predict gene expression from histology images, these methods require large external datasets, which makes them costly, and they suffer from substantial domain discrepancies that lead to poor generalization on new samples. In this work, we introduce ST-DAI, a single-shot framework for 3D ST that couples a cost-efficient 2.5D sampling scheme with an intra-sample domain-adaptive imputation framework. First, in the cost-efficient 2.5D sampling stage, one reference section (central section) is fully sampled while the other sections (adjacent sections) are sparsely sampled, thereby capturing volumetric context at significantly reduced experimental cost. Second, we propose a single-shot 3D imputation learning method that allows us to generate fully sampled 3D ST from this cost-efficient 2.5D ST scheme, using only sample-specific training. We observe position misalignment and domain discrepancy between sections. To address those issues, we adopt a pipeline that first aligns the central section to the adjacent section, thereafter generates dense pseudo-supervision on the central section, and then performs Fast Multi-Domain Refinement (FMDR), which adapts the network to the domain of the adjacent section while fine-tuning only a few parameters through the use of Parameter-Efficient Domain-Alignment Layers (PDLs). During this refinement, a Confidence Score Generator (CSG) reweights the pseudo-labels according to their estimated reliability, thereby directing imputation toward trustworthy regions. Our experimental results demonstrate that ST-DAI achieves gene expression prediction performance comparable to fully sampled approaches while substantially reducing the measurement burden.
[458] Control Copy-Paste: Controllable Diffusion-Based Augmentation Method for Remote Sensing Few-Shot Object Detection
Yanxing Liu, Jiancheng Pan, Bingchen Zhang
Main category: eess.IV
TL;DR: The paper proposes Control Copy-Paste, a method using a conditional diffusion model to enhance few-shot object detection (FSOD) in remote sensing images by diversifying contexts and mitigating overfitting.
Details
Motivation: Limited training data in FSOD leads to overfitting due to lack of diversity in objects and contexts. Current methods focus on object diversity but neglect context, which is crucial for detection.
Method: Control Copy-Paste injects few-shot objects into diverse contexts using a conditional diffusion model and employs orientation alignment to reduce distortion.
Result: Experiments on the DIOR dataset show a 10.76% average improvement in detection performance.
Conclusion: The method effectively addresses overfitting by diversifying contexts and aligning orientations, enhancing FSOD performance.
Abstract: Few-shot object detection (FSOD) for optical remote sensing images aims to detect rare objects with only a few annotated bounding boxes. The limited training data makes it difficult to represent the data distribution of realistic remote sensing scenes, which results in the notorious overfitting problem. Current researchers have begun to enhance the diversity of few-shot novel instances by leveraging diffusion models to solve the overfitting problem. However, naively increasing the diversity of objects is insufficient, as surrounding contexts also play a crucial role in object detection, and in cases where the object diversity is sufficient, the detector tends to overfit to monotonous contexts. Accordingly, we propose Control Copy-Paste, a controllable diffusion-based method to enhance the performance of FSOD by leveraging diverse contextual information. Specifically, we seamlessly inject a few-shot novel objects into images with diverse contexts by a conditional diffusion model. We also develop an orientation alignment strategy to mitigate the integration distortion caused by varying aspect ratios of instances. Experiments on the public DIOR dataset demonstrate that our method can improve detection performance by an average of 10.76%.
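The orientation-alignment idea can be illustrated outside the diffusion pipeline: before injecting an object crop into a new context, resize it with its aspect ratio preserved so the instance is not distorted. A plain numpy sketch of that step, under the assumption that a nearest-neighbour resize suffices for illustration; the paper performs the injection with a conditional diffusion model.

```python
# Paste an object crop into a canvas at a target height while preserving
# its aspect ratio (nearest-neighbour resize, for illustration only).
import numpy as np

def paste_preserving_aspect(canvas, obj, top, left, target_h):
    scale = target_h / obj.shape[0]
    target_w = max(1, int(round(obj.shape[1] * scale)))   # keep aspect ratio
    rows = (np.arange(target_h) / scale).astype(int)
    cols = (np.arange(target_w) / scale).astype(int)
    canvas[top:top + target_h, left:left + target_w] = obj[rows][:, cols]
    return canvas

canvas = np.zeros((64, 64))
obj = np.ones((10, 20))              # a wide object crop
print(paste_preserving_aspect(canvas, obj, 5, 5, target_h=8).sum())  # 128.0
```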
[459] VidFuncta: Towards Generalizable Neural Representations for Ultrasound Videos
Julia Wolleb, Florentin Bieder, Paul Friedrich, Hemant D. Tagare, Xenophon Papademetris
Main category: eess.IV
TL;DR: VidFuncta, a novel framework using implicit neural representations (INRs), encodes ultrasound videos into compact, time-resolved representations, outperforming 2D/3D baselines and enabling efficient downstream tasks.
Details
Motivation: Standard deep learning struggles with ultrasound video analysis due to non-standardized acquisition and operator bias. VidFuncta addresses this by leveraging INRs for better temporal and dataset-level representation.
Method: Extends Functa (INR framework) to temporal domain, encoding videos into static video-specific vectors and time-dependent modulation vectors. Validated on cardiac, lung, and breast ultrasound datasets.
Result: Outperforms 2D/3D baselines in reconstruction and enables efficient downstream tasks (e.g., ejection fraction prediction, B-line detection, lesion classification).
Conclusion: VidFuncta is a generalizable, efficient framework for ultrasound video analysis, with potential for broader clinical applications.
Abstract: Ultrasound is widely used in clinical care, yet standard deep learning methods often struggle with full video analysis due to non-standardized acquisition and operator bias. We offer a new perspective on ultrasound video analysis through implicit neural representations (INRs). We build on Functa, an INR framework in which each image is represented by a modulation vector that conditions a shared neural network. However, its extension to the temporal domain of medical videos remains unexplored. To address this gap, we propose VidFuncta, a novel framework that leverages Functa to encode variable-length ultrasound videos into compact, time-resolved representations. VidFuncta disentangles each video into a static video-specific vector and a sequence of time-dependent modulation vectors, capturing both temporal dynamics and dataset-level redundancies. Our method outperforms 2D and 3D baselines on video reconstruction and enables downstream tasks to directly operate on the learned 1D modulation vectors. We validate VidFuncta on three public ultrasound video datasets – cardiac, lung, and breast – and evaluate its downstream performance on ejection fraction prediction, B-line detection, and breast lesion classification. These results highlight the potential of VidFuncta as a generalizable and efficient representation framework for ultrasound videos. Our code is publicly available under https://github.com/JuliaWolleb/VidFuncta_public.
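A Functa-style model conditions one shared network on a per-sample modulation vector; VidFuncta's extension adds a static per-video vector plus per-frame modulations. The sketch below shows the conditioning mechanism with assumed sizes and additive shift modulations; it is an illustration, not the released architecture.

```python
# Minimal Functa-style modulated INR: one shared sine-activated MLP,
# conditioned by a static video vector plus a per-frame modulation vector.
import torch
import torch.nn as nn

class ModulatedINR(nn.Module):
    def __init__(self, mod_dim=64, hidden=128):
        super().__init__()
        self.inp = nn.Linear(2, hidden)             # (x, y) pixel coordinates
        self.hid = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)             # grayscale intensity
        self.to_shift = nn.Linear(mod_dim, hidden)  # modulation -> shift

    def forward(self, coords, static_vec, frame_mod):
        shift = self.to_shift(static_vec + frame_mod)   # condition the network
        h = torch.sin(self.inp(coords) + shift)
        h = torch.sin(self.hid(h) + shift)
        return self.out(h)

net = ModulatedINR()
coords = torch.rand(1024, 2)                 # sampled pixel locations
static_vec, frame_mod = torch.randn(64), torch.randn(64)
print(net(coords, static_vec, frame_mod).shape)   # torch.Size([1024, 1])
```

Downstream tasks then operate on the compact modulation vectors instead of the raw video, which is where the efficiency claim comes from.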
[460] Cyst-X: AI-Powered Pancreatic Cancer Risk Prediction from Multicenter MRI in Centralized and Federated Learning
Hongyi Pan, Gorkem Durak, Elif Keles, Deniz Seyithanoglu, Zheyuan Zhang, Alpay Medetalibeyoglu, Halil Ertugrul Aktas, Andrea Mia Bejar, Ziliang Hong, Yavuz Taktak, Gulbiz Dagoglu Kartal, Mehmet Sukru Erturk, Timurhan Cebeci, Maria Jaramillo Gonzalez, Yury Velichko, Lili Zhao, Emil Agarunov, Federica Proietto Salanitri, Concetto Spampinato, Pallavi Tiwari, Ziyue Xu, Sachin Jambawalikar, Ivo G. Schoots, Marco J. Bruno, Chenchang Huang, Candice Bolan, Tamas Gonda, Frank H. Miller, Rajesh N. Keswani, Michael B. Wallace, Ulas Bagci
Main category: eess.IV
TL;DR: Cyst-X, an AI framework, predicts IPMN malignancy using MRI data, outperforming current guidelines and radiologists, and supports federated learning for privacy-preserving collaboration.
Details
Motivation: Pancreatic cancer is a growing threat, and current methods for assessing IPMNs, its precursors, are inadequate, leading to unnecessary surgeries or missed malignancies.
Method: Cyst-X uses multicenter MRI data (T1- and T2-weighted scans from 764 patients) to predict IPMN malignancy, leveraging AI and federated learning for collaborative training without data sharing.
Result: Cyst-X achieves an AUC of 0.82, surpassing Kyoto guidelines (AUC=0.75) and expert radiologists, with AI-derived features aligning with clinical markers.
Conclusion: Cyst-X improves IPMN risk stratification, offers biologically meaningful insights, and promotes privacy-preserving AI development with its released dataset.
Abstract: Pancreatic cancer is projected to become the second-deadliest malignancy in Western countries by 2030, highlighting the urgent need for better early detection. Intraductal papillary mucinous neoplasms (IPMNs), key precursors to pancreatic cancer, are challenging to assess with current guidelines, often leading to unnecessary surgeries or missed malignancies. We present Cyst-X, an AI framework that predicts IPMN malignancy using multicenter MRI data, leveraging MRI’s superior soft tissue contrast over CT. Trained on 723 T1- and 738 T2-weighted scans from 764 patients across seven institutions, our models (AUC=0.82) significantly outperform both Kyoto guidelines (AUC=0.75) and expert radiologists. The AI-derived imaging features align with known clinical markers and offer biologically meaningful insights. We also demonstrate strong performance in a federated learning setting, enabling collaborative training without sharing patient data. To promote privacy-preserving AI development and improve IPMN risk stratification, the Cyst-X dataset is released as the first large-scale, multi-center pancreatic cysts MRI dataset.
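The federated setting described above is typically realized with federated averaging (FedAvg): each institution trains locally and only model weights, never patient data, are aggregated. A generic sketch with illustrative client sizes, not Cyst-X's specific protocol.

```python
# Federated averaging: weight each client's parameters by its local
# dataset size and average, without exchanging any raw data.
import torch

def fedavg(client_states, client_sizes):
    """Average client state_dicts, weighted by local dataset size."""
    total = float(sum(client_sizes))
    avg = {}
    for key in client_states[0]:
        avg[key] = sum(sd[key] * (n / total)
                       for sd, n in zip(client_states, client_sizes))
    return avg

# Toy example: three "institutions" share a tiny linear model.
clients = [torch.nn.Linear(4, 1) for _ in range(3)]
states = [c.state_dict() for c in clients]
global_state = fedavg(states, client_sizes=[120, 90, 150])  # illustrative sizes
print({k: v.shape for k, v in global_state.items()})
```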
[461] Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images
Yutao Hu, Ying Zheng, Shumei Miao, Xiaolei Zhang, Jiahao Xia, Yaolei Qi, Yiyang Zhang, Yuting He, Qian Chen, Jing Ye, Hongyan Qiao, Xiuhua Hu, Lei Xu, Jiayin Zhang, Hui Liu, Minwen Zheng, Yining Wang, Daimin Zhang, Ji Zhang, Wenqi Shao, Yun Liu, Longjiang Zhang, Guanyu Yang
Main category: eess.IV
TL;DR: Cardiac-CLIP is a multi-modal foundation model for 3D cardiac CT images, using a two-stage pre-training strategy (3D MAE and contrastive learning) to achieve state-of-the-art performance in cardiovascular diagnostics.
Details
Motivation: Despite the success of foundation models in medicine, their application to complex cardiovascular diagnostics is underexplored. This paper aims to bridge this gap.
Method: A two-stage pre-training strategy: 1) 3D masked autoencoder for self-supervised learning, 2) contrastive learning to align visual and textual representations. Uses large-scale clinical and public data.
Result: Cardiac-CLIP achieves state-of-the-art performance in tasks like abnormality classification, information retrieval, and clinical analysis, including challenging scenarios like acute coronary syndrome prediction.
Conclusion: Cardiac-CLIP demonstrates strong potential for complex cardiovascular diagnostics, with robust performance across diverse tasks and datasets.
Abstract: Foundation models have demonstrated remarkable potential in the medical domain. However, their application to complex cardiovascular diagnostics remains underexplored. In this paper, we present Cardiac-CLIP, a multi-modal foundation model designed for 3D cardiac CT images. Cardiac-CLIP is developed through a two-stage pre-training strategy. The first stage employs a 3D masked autoencoder (MAE) to perform self-supervised representation learning from large-scale unlabeled volumetric data, enabling the visual encoder to capture rich anatomical and contextual features. In the second stage, contrastive learning is introduced to align visual and textual representations, facilitating cross-modal understanding. To support the pre-training, we collect 16,641 real clinical CT scans, supplemented by 114k publicly available scans. Meanwhile, we standardize free-text radiology reports into unified templates and construct the pathology vectors according to diagnostic attributes, based on which the soft-label matrix is generated to supervise the contrastive learning process. In addition, to comprehensively evaluate the effectiveness of Cardiac-CLIP, we collect 6,722 real clinical scans from 12 independent institutions, along with the open-source data to construct the evaluation dataset. Specifically, Cardiac-CLIP is comprehensively evaluated across multiple tasks, including cardiovascular abnormality classification, information retrieval and clinical analysis. Experimental results demonstrate that Cardiac-CLIP achieves state-of-the-art performance across various downstream tasks in both internal and external data. Particularly, Cardiac-CLIP exhibits great effectiveness in supporting complex clinical tasks such as the prospective prediction of acute coronary syndrome, which is notoriously difficult in real-world scenarios.
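The second pre-training stage is contrastive vision-language alignment. The symmetric InfoNCE objective below is the standard CLIP-style loss, offered as a sketch of that stage; Cardiac-CLIP itself supervises with a soft-label matrix, which is simplified here to hard matched pairs on the diagonal.

```python
# Standard CLIP-style symmetric contrastive loss over a batch of
# matched image/text embedding pairs.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(len(img))          # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
print(clip_loss(img_emb, txt_emb).item())
```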
[462] ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports
Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar
Main category: eess.IV
TL;DR: ReXGroundingCT is the first dataset linking free-text radiology findings to 3D CT scan segmentations, manually annotated, addressing gaps in medical AI for grounded report generation.
Details
Motivation: To bridge the gap between descriptive clinical language and precise 3D anatomical locations in medical imaging, essential for AI applications like radiology report generation.
Method: A three-stage pipeline using GPT-4 to extract findings from CT-RATE reports, followed by manual segmentation by experts and quality control by radiologists.
Result: 8,028 findings across 16,301 entities annotated, with 79% focal and 21% non-focal abnormalities, providing a benchmark for medical segmentation models.
Conclusion: ReXGroundingCT sets a new standard for grounded medical AI in chest CT, enabling advanced research in free-text segmentation and report generation.
Abstract: We present ReXGroundingCT, the first publicly available, manually annotated dataset to link free-text radiology findings with pixel-level segmentations in 3D chest CT scans. While prior datasets have relied on structured labels or predefined categories, ReXGroundingCT captures the full expressiveness of clinical language represented in free text and grounds it to spatially localized 3D segmentation annotations in volumetric imaging. This addresses a critical gap in medical AI: the ability to connect complex, descriptive text, such as “3 mm nodule in the left lower lobe”, to its precise anatomical location in three-dimensional space, a capability essential for grounded radiology report generation systems. The dataset comprises 3,142 non-contrast chest CT scans paired with standardized radiology reports from the CT-RATE dataset. Using a systematic three-stage pipeline, GPT-4 was used to extract positive lung and pleural findings, which were then manually segmented by expert annotators. A total of 8,028 findings across 16,301 entities were annotated, with quality control performed by board-certified radiologists. Approximately 79% of findings are focal abnormalities, while 21% are non-focal. The training set includes up to three representative segmentations per finding, while the validation and test sets contain exhaustive labels for each finding entity. ReXGroundingCT establishes a new benchmark for developing and evaluating sentence-level grounding and free-text medical segmentation models in chest CT. The dataset can be accessed at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.
[463] G$^{2}$SF-MIAD: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection
Chengyu Tao, Xuanming Cao, Juan Du
Main category: eess.IV
TL;DR: The paper proposes a Geometry-Guided Score Fusion (G²SF) framework for multimodal anomaly detection in industrial quality inspection, improving discriminative power by integrating 3D point clouds and 2D RGB images through anisotropic local distance metrics.
Details
Motivation: Existing methods struggle with integrating unimodal results and enhancing discriminative power in anomaly detection, limiting their effectiveness in industrial quality inspection.
Method: The G²SF framework uses a Local Scale Prediction Network (LSPN) to predict direction-aware scaling factors, dynamically evolving from Euclidean metrics to anisotropic local distance metrics for score fusion. Specialized loss functions and score aggregation strategies are also developed.
Result: The method achieves state-of-the-art performance on MVTec-3D AD and Eyecandies datasets, with ablation studies validating each component’s contribution.
Conclusion: The G²SF framework effectively addresses limitations in multimodal anomaly detection, offering improved performance and generalization for industrial applications.
Abstract: Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of crossmodal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Dynamically evolving from Euclidean metrics, we propose a novel Geometry-Guided Score Fusion (G$^{2}$SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD and Eyecandies datasets demonstrate the state-of-the-art detection performance of our method, and detailed ablation analysis validates each component’s contribution. Our code is available at https://github.com/ctaoaa/G2SF.
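The move from isotropic to anisotropic scoring is easy to show in miniature: a Euclidean score treats every feature direction equally, while a direction-aware metric rescales each axis. In the sketch below the per-axis scales are fixed by hand; in G$^{2}$SF they would come from the Local Scale Prediction Network.

```python
# Isotropic Euclidean anomaly score versus a diagonal, direction-aware
# (Mahalanobis-style) score with hand-picked scales.
import numpy as np

def isotropic_score(x, mu):
    return float(np.linalg.norm(x - mu))

def anisotropic_score(x, mu, scales):
    # Larger scale on an axis means deviations along it matter less.
    return float(np.linalg.norm((x - mu) / scales))

x, mu = np.array([1.0, 0.2]), np.zeros(2)
scales = np.array([2.0, 0.1])    # tolerate axis 0, penalize axis 1
print("isotropic:  ", isotropic_score(x, mu))     # treats both axes equally
print("anisotropic:", anisotropic_score(x, mu, scales))
```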
[464] SurgiSR4K: A High-Resolution Endoscopic Video Dataset for Robotic-Assisted Minimally Invasive Procedures
Fengyi Jiang, Xiaorui Zhang, Lingbo Jin, Ruixing Liang, Yuxin Chen, Adi Chola Venkatesh, Jason Culman, Tiantian Wu, Lirong Shao, Wenqing Sun, Cong Gao, Hallie McNamara, Jingpei Lu, Omid Mohareri
Main category: eess.IV
TL;DR: SurgiSR4K is the first publicly available 4K surgical imaging dataset for robotic-assisted MIS, addressing the lack of high-resolution data for computer vision tasks in surgery.
Details
Motivation: The need for high-resolution imaging in minimally invasive surgery (MIS) to improve visual clarity and computer-assisted guidance, coupled with the absence of native 4K datasets for robotic-assisted MIS, motivated the creation of SurgiSR4K.
Method: The dataset, SurgiSR4K, includes diverse surgical scenarios (e.g., specular reflections, tool occlusions) captured at native 4K resolution, reflecting real-world challenges in laparoscopic and robotic surgeries.
Result: SurgiSR4K enables various computer vision tasks like super resolution, smoke removal, and instrument detection, providing a foundation for high-resolution surgical imaging research.
Conclusion: SurgiSR4K advances research in high-resolution surgical imaging and supports the development of intelligent technologies to improve performance and safety in robotic surgeries.
Abstract: High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.
[465] Efficacy of Image Similarity as a Metric for Augmenting Small Dataset Retinal Image Segmentation
Thomas Wallace, Ik Siong Heng, Senad Subasic, Chris Messenger
Main category: eess.IV
TL;DR: The study evaluates how synthetic images, measured by FID, improve DME segmentation with limited training data, finding lower FID (more similar datasets) enhances performance, with synthetic data outperforming standard augmentation.
Details
Motivation: To assess the effectiveness of synthetic images in augmenting medical imaging datasets for improving machine learning model performance, specifically in DME segmentation.
Method: Used PGGAN to generate synthetic images and measured their impact on U-Net segmentation performance via FID, comparing synthetic and standard augmentation.
Result: Lower FID (more similar datasets) significantly improves segmentation, with synthetic data showing better performance than standard augmentation.
Conclusion: Synthetic images with lower FID enhance segmentation performance, but effectiveness depends on dataset similarity, with synthetic augmentation proving superior.
Abstract: Synthetic images are an option for augmenting limited medical imaging datasets to improve the performance of various machine learning models. A common metric for evaluating synthetic image quality is the Fréchet Inception Distance (FID), which measures the similarity of two image datasets. In this study we evaluate the relationship between this metric and the improvement that synthetic images, generated by a Progressively Growing Generative Adversarial Network (PGGAN), provide when augmenting Diabetes-related Macular Edema (DME) intraretinal fluid segmentation performed by a U-Net model with limited amounts of training data. We find that the behaviour of augmenting with standard and synthetic images agrees with previously conducted experiments. Additionally, we show that dissimilar (high FID) datasets do not improve segmentation significantly. As FID between the training and augmenting datasets decreases, the augmentation datasets are shown to contribute to significant and robust improvements in image segmentation. Finally, we find that there is significant evidence to suggest that synthetic and standard augmentations follow separate log-normal trends between FID and improvements in model performance, with synthetic data proving more effective than standard augmentation techniques. Our findings show that more similar datasets (lower FID) will be more effective at improving U-Net performance; however, the results also suggest that this improvement may only occur when images are sufficiently dissimilar.
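For reference, the FID between two feature sets summarized as Gaussians (mu1, Sigma1) and (mu2, Sigma2) is ||mu1 - mu2||^2 + Tr(Sigma1 + Sigma2 - 2(Sigma1 Sigma2)^(1/2)). A direct numpy/scipy sketch of that formula; in practice the statistics are computed from Inception network features of each image set.

```python
# Frechet Inception Distance between two Gaussian-summarized feature sets.
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # discard tiny imaginary numerical noise
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(3)
a = rng.standard_normal((500, 16))            # stand-ins for Inception features
b = rng.standard_normal((500, 16)) + 0.5      # a shifted second dataset
print(fid(a.mean(0), np.cov(a, rowvar=False),
          b.mean(0), np.cov(b, rowvar=False)))
```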