Daily arXiv Papers - 2025-07-16

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Teach Me Sign: Stepwise Prompting LLM for Sign Language Production

Zhaoyi An, Rei Kawakami

Main category: cs.CL

TL;DR: TEAM-Sign fine-tunes an LLM to bridge text and sign language, using stepwise prompting to align their distributions and rules, showing effectiveness on How2Sign and Phoenix14T datasets.

Motivation: Sign language generation is complex and under-explored in LLMs. TEAM-Sign aims to leverage LLMs' reasoning and knowledge for this task by treating sign language as a natural language.

Method: Fine-tune an LLM to learn text-sign language correspondence, using stepwise prompting to extract sign language knowledge and align distributions and grammatical rules.

Result: TEAM-Sign effectively aligns sign and spoken language distributions and rules, as shown on How2Sign and Phoenix14T datasets.

Conclusion: TEAM-Sign successfully leverages LLMs for sign language generation, addressing complexity and unique rules through stepwise prompting and fine-tuning.

Abstract: Large language models, with their strong reasoning ability and rich knowledge, have revolutionized many AI tasks, but their impact on sign language generation remains limited due to its complexity and unique rules. In this paper, we propose TEAch Me Sign (TEAM-Sign), treating sign language as another natural language. By fine-tuning an LLM, we enable it to learn the correspondence between text and sign language, and facilitate generation. Considering the differences between sign and spoken language, we employ a stepwise prompting strategy to extract the inherent sign language knowledge within the LLM, thereby supporting the learning and generation process. Experimental results on How2Sign and Phoenix14T datasets demonstrate that our approach effectively leverages both the sign language knowledge and reasoning capabilities of LLM to align the differing distributions and grammatical rules of sign and spoken language.

[2] Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions

Logé Cécile, Ghori Rehan

Main category: cs.CL

TL;DR: The paper introduces an AI-powered system, Truth Sleuth and Trend Bender, to combat misinformation on YouTube by fact-checking videos and engaging users in comments.

Motivation: Misinformation spreads rapidly on platforms like YouTube, necessitating innovative solutions to fact-check and counter misleading narratives.

Method: The system uses Retrieval-Augmented Generation (RAG) for fact-checking (Truth Sleuth) and generates persuasive comments (Trend Bender) to engage users.

Result: Experiments show high accuracy in fact-checking and potential to influence user perspectives.

Conclusion: AI-driven interventions can effectively combat misinformation and foster informed online discussions.

Abstract: Misinformation poses a significant threat in today’s digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenges misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach, drawing on sources such as Wikipedia, Google Search, and Google FactCheck, to accurately assess their veracity, and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system’s capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.
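
To make the Truth Sleuth step concrete, here is a minimal Python sketch of a RAG-style verdict pass: retrieve evidence for a claim, then prompt an LLM for a labeled verdict. The function names, prompt wording, and stubs are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a RAG fact-checking step; names and prompt are assumptions.

def build_verdict_prompt(claim: str, snippets: list[str]) -> str:
    """Assemble a prompt that grounds the claim in retrieved evidence."""
    evidence = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "You are a fact-checking assistant.\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence}\n"
        "Label the claim SUPPORTED, REFUTED, or NOT ENOUGH INFO, "
        "cite evidence numbers, then write a short report."
    )

def fact_check(claim: str, retrieve, generate) -> str:
    """retrieve: claim -> evidence snippets (e.g., Wikipedia/FactCheck search);
    generate: prompt -> text (any LLM completion function)."""
    return generate(build_verdict_prompt(claim, retrieve(claim)))

# Stub retrieval/generation so the sketch runs end to end.
print(fact_check("The Earth orbits the Sun.",
                 retrieve=lambda c: ["Wikipedia: The Earth orbits the Sun."],
                 generate=lambda p: "SUPPORTED [1] ..."))
```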

[3] An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation

Vimaleswar A, Prabhu Nandan Sahu, Nilesh Kumar Sahu, Haroon R Lone

Main category: cs.CL

TL;DR: EmoSApp is an offline, smartphone-based conversational app for mental health support, using fine-tuned LLMs for on-device inference.

Motivation: Address challenges like limited accessibility, connectivity, and data privacy in digital mental health platforms.

Method: Fine-tuned LLaMA-3.2-1B-Instruct on a custom mental-health QA dataset, deployed offline using Torchtune and Executorch.

Result: Qualitative and quantitative evaluations show coherent, empathetic responses and efficacy in low-resource settings.

Conclusion: EmoSApp is a blueprint for portable, secure, and tailored AI-driven mental health solutions.

Abstract: Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated “Knowledge dataset” of 14,582 mental-health QA pairs, along with multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp is able to respond coherently and empathetically, maintain interactive dialogue, and provide relevant suggestions for users’ mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.

[4] Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis

Anders Ledberg, Anna Thalén

Main category: cs.CL

TL;DR: A modular toolchain using open-weight LLMs processes sensitive, unstructured text for embedding-based analysis, ensuring privacy and standardization.

Motivation: To enable large-scale research on sensitive, heterogeneous text data (e.g., legal, medical) while addressing privacy and structural challenges.

Method: Uses LLM prompting for standardization, summarization, translation, and anonymization via redaction, NER, and rule-based methods. Validated on Swedish court decisions.

Result: Effective anonymization and semantic retention demonstrated. Predictive models trained on embeddings show scalability for semi-automated analysis.

Conclusion: The toolchain facilitates privacy-conscious, large-scale text analysis, expanding research opportunities in sensitive domains.

Abstract: Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation shows the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain’s capacity for semi-automated content analysis at scale. By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.
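
For flavor, here is a minimal sketch of the rule-based layer of such an anonymization step; the regex patterns and placeholder labels are assumptions for illustration, and the paper's toolchain layers LLM-based redaction and NER on top of rules like these.

```python
import re

# Illustrative redaction rules (assumed patterns, not the paper's actual rule set).
RULES = {
    "PERSONAL_ID": re.compile(r"\b\d{6}[-+]?\d{4}\b"),  # personnummer-like IDs
    "PHONE": re.compile(r"\b0[\d\s-]{7,12}\d"),         # simplified domestic numbers
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace each rule match with a typed placeholder."""
    for label, pattern in RULES.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Kontakta Anna, 850101-1234, anna@example.se, 070-123 45 67."))
```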

[5] A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations

Isar Nejadgholi, Mona Omidyeganeh, Marc-Antoine Drouin, Jonathan Boisvert

Main category: cs.CL

TL;DR: The paper proposes an updated taxonomy for Explainable AI (XAI) focused on Natural Language Explanations (NLEs) to improve AI governance and transparency.

Motivation: With the rise of large language models, there is a need for structured approaches to verify AI behavior, making NLEs crucial for transparency and governance.

Method: The authors draw on XAI literature to create a taxonomy for NLEs across three dimensions: Context, Generation and Presentation, and Evaluation.

Result: The taxonomy provides a framework for stakeholders to characterize, design, and enhance NLEs for transparent AI systems.

Conclusion: The updated XAI taxonomy aids researchers, auditors, and policymakers in improving AI governance through better NLEs.

Abstract: Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.

[6] AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters

Kaushik Dwivedi, Padmanabh Patanjali Mishra

Main category: cs.CL

TL;DR: AutoRAG-LoRA reduces hallucinations in LLMs using LoRA adapters, KL-regularized training, and a feedback loop for factual alignment.

Motivation: LLMs often produce factual inaccuracies (hallucinations), which undermines trust in real-world applications.

Method: Combines automated prompt rewriting, hybrid retrieval, LoRA-based adapters, and a hallucination detection module with feedback correction.

Result: Significantly reduces factual drift while maintaining model efficiency and modularity.

Conclusion: AutoRAG-LoRA effectively mitigates hallucinations in LLMs without compromising performance.

Abstract: Large Language Models (LLMs) have demonstrated remarkable fluency across a range of natural language tasks, yet remain vulnerable to hallucinations: factual inaccuracies that undermine trust in real-world deployment. We present AutoRAG-LoRA, a modular framework for Retrieval-Augmented Generation (RAG) that tackles hallucination in large language models through lightweight LoRA-based adapters and KL-regularized training. Our pipeline integrates automated prompt rewriting, hybrid retrieval, and low-rank adapter tuning to ground responses in retrieved evidence. A hallucination detection module, using both classifier-based and self-evaluation techniques, assigns confidence scores to generated outputs, triggering an optional feedback correction loop. This loop enforces factual alignment via contrastive KL loss and adapter fine-tuning. We demonstrate that AutoRAG-LoRA significantly reduces factual drift while preserving the efficiency and modularity of the model.
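
A minimal PyTorch sketch of the KL-regularized objective described above, under the assumption that a cross-entropy term on evidence-grounded targets is combined with a KL penalty tying the adapted model to the base model; the weighting and exact formulation are assumptions, not the paper's.

```python
import torch
import torch.nn.functional as F

def autorag_lora_loss(adapted_logits, base_logits, labels, kl_weight=0.1):
    """Cross-entropy on grounded targets plus a KL penalty that keeps the
    LoRA-adapted distribution close to the base model's (assumed weighting)."""
    ce = F.cross_entropy(adapted_logits.view(-1, adapted_logits.size(-1)),
                         labels.view(-1))
    kl = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                  F.log_softmax(base_logits, dim=-1),
                  reduction="batchmean", log_target=True)
    return ce + kl_weight * kl

# Toy shapes: batch of 2 sequences, 5 tokens, vocabulary of 11.
adapted = torch.randn(2, 5, 11, requires_grad=True)
base = torch.randn(2, 5, 11)
labels = torch.randint(0, 11, (2, 5))
loss = autorag_lora_loss(adapted, base, labels)
loss.backward()
print(float(loss))
```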

[7] Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing

Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier

Main category: cs.CL

TL;DR: The paper discusses the need for LLMs to communicate uncertainty to users to enhance trustworthiness, proposing anthropomimetic uncertainty as a solution by emulating human communication.

Motivation: To improve trust in LLMs by addressing their overconfident outputs, which can mislead users, and to explore better ways to signal uncertainty.

Method: The method involves analyzing human uncertainty communication, surveying existing research, and identifying biases in verbalized uncertainty.

Result: The results highlight overlooked biases in machine uncertainty communication and the need for linguistic authenticity.

Conclusion: The conclusion advocates for anthropomimetic uncertainty in NLP, suggesting future research directions to align machine communication with human-like uncertainty expression.

Abstract: Human users increasingly rely on natural language interactions with large language models (LLMs) in order to receive help on a large variety of tasks and problems. However, the trustworthiness and perceived legitimacy of LLMs is undermined by the fact that their output is frequently stated in very confident terms, even when its accuracy is questionable. Therefore, there is a need to signal the confidence of the language model to a user in order to reap the benefits of human-machine collaboration and mitigate potential harms. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces. Nevertheless, most recent research in natural language processing (NLP) overlooks the nuances surrounding human uncertainty communication and the data biases that influence machine uncertainty communication. We argue for anthropomimetic uncertainty, meaning that intuitive and trustworthy uncertainty communication requires a degree of linguistic authenticity and personalization to the user, which could be achieved by emulating human communication. We present a thorough overview of the research in human uncertainty communication, survey ongoing research, and perform additional analyses to demonstrate so-far overlooked biases in verbalized uncertainty. We conclude by pointing out unique factors in human-machine communication of uncertainty and deconstruct anthropomimetic uncertainty into future research directions for NLP.

[8] PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification

Yogachandran Rahulamathavan, Misbah Farooq, Varuna De Silva

Main category: cs.CL

TL;DR: PLEX is a perturbation-free method for explaining LLM predictions, using contextual embeddings and a Siamese network, achieving high agreement with LIME/SHAP while being much faster.

Motivation: LLMs lack interpretability, and existing XAI methods like LIME/SHAP are computationally expensive due to perturbations.

Method: PLEX leverages LLM embeddings and a Siamese network to align with feature importance scores, eliminating the need for perturbations.

Result: PLEX shows >92% agreement with LIME/SHAP, accurately identifies influential words, and reduces explanation time and computational overhead by two and four orders of magnitude, respectively.

Conclusion: PLEX provides efficient, accurate explanations for LLM-based text classification, addressing computational and interpretability challenges.

Abstract: Large Language Models (LLMs) excel in text classification, but their complexity hinders interpretability, making it difficult to understand the reasoning behind their predictions. Explainable AI (XAI) methods like LIME and SHAP offer local explanations by identifying influential words, but they rely on computationally expensive perturbations. These methods typically generate thousands of perturbed sentences and perform inferences on each, incurring a substantial computational burden, especially with LLMs. To address this, we propose Perturbation-free Local Explanation (PLEX), a novel method that leverages the contextual embeddings extracted from the LLM and a “Siamese network”-style neural network trained to align with feature importance scores. This one-off training eliminates the need for subsequent perturbations, enabling efficient explanations for any new sentence. We demonstrate PLEX’s effectiveness on four different classification tasks (sentiment, fake news, fake COVID-19 news and depression), showing more than 92% agreement with LIME and SHAP. Our evaluation using a “stress test” reveals that PLEX accurately identifies influential words, leading to a similar decline in classification accuracy as observed with LIME and SHAP when these words are removed. Notably, in some cases, PLEX demonstrates superior performance in capturing the impact of key features. PLEX dramatically accelerates explanation, reducing time and computational overhead by two and four orders of magnitude, respectively. This work offers a promising solution for explainable LLM-based text classification.
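
The core idea lends itself to a compact sketch: train a small head over frozen LLM token embeddings, once, to regress LIME/SHAP-style importance scores, so later explanations are a single forward pass. Dimensions, loss, and architecture here are assumptions; the paper's “Siamese network”-style design may differ.

```python
import torch
import torch.nn as nn

class PlexHead(nn.Module):
    """Assumed stand-in for PLEX's explainer: token embeddings -> importance."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1))

    def forward(self, token_embeddings):                   # (batch, seq, dim)
        return self.scorer(token_embeddings).squeeze(-1)   # (batch, seq)

head = PlexHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

# One-off training: embeddings come from the LLM, targets from LIME/SHAP.
emb = torch.randn(4, 12, 768)    # stand-in for contextual embeddings
target = torch.randn(4, 12)      # stand-in for feature-importance scores
for _ in range(3):
    loss = nn.functional.mse_loss(head(emb), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Explaining a new sentence needs no perturbations, just one forward pass.
print(head(torch.randn(1, 12, 768)).shape)
```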

[9] Emergence of Hierarchical Emotion Organization in Large Language Models

Bo Zhao, Maya Okawa, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

Main category: cs.CL

TL;DR: LLMs form hierarchical emotion trees aligning with human psychology, with larger models showing more complexity. Biases in emotion recognition exist, especially for underrepresented groups. Human studies suggest LLMs internalize social perceptions.

Motivation: Understanding how LLMs model emotional states is crucial for ethical deployment, inspired by psychological emotion wheels.

Method: Analyzed probabilistic dependencies between emotional states in LLM outputs and conducted human studies.

Result: LLMs naturally form hierarchical emotion trees, with larger models developing more complex hierarchies. Biases in emotion recognition were found, particularly for underrepresented groups.

Conclusion: LLMs exhibit emergent emotional reasoning and internalize social perceptions, suggesting cognitively-grounded theories could improve model evaluations.

Abstract: As large language models (LLMs) increasingly power conversational agents, understanding how they model users’ emotional states is critical for ethical deployment. Inspired by emotion wheels – a psychological framework that argues emotions organize hierarchically – we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.
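
One way to picture the analysis, sketched below: estimate pairwise conditional probabilities between emotions from model outputs, then attach each emotion to a parent that subsumes it. The subsumption rule (P(parent|child) high, the reverse low) is a standard hierarchy-induction heuristic and an assumption about the paper's exact procedure.

```python
import numpy as np

emotions = ["joy", "love", "optimism", "sadness", "grief"]
rng = np.random.default_rng(0)
cond = rng.random((5, 5))    # cond[i, j] ~ P(emotion i | emotion j), toy values

def parent_of(j, cond, thresh=0.6):
    """Pick the candidate parent i that maximizes P(i | j) while subsuming j."""
    candidates = [i for i in range(len(cond)) if i != j
                  and cond[i, j] >= thresh and cond[i, j] > cond[j, i]]
    return max(candidates, key=lambda i: cond[i, j], default=None)

for j, name in enumerate(emotions):
    p = parent_of(j, cond)
    print(f"{name} -> {emotions[p] if p is not None else 'ROOT'}")
```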

[10] Language Models for Adult Service Website Text Analysis

Nickolas Freeman, Thanh Nguyen, Gregory Bott, Jason Parton, Collin Francel

Main category: cs.CL

TL;DR: The paper explores language modeling for analyzing Adult Service Website (ASW) ad text to combat sex trafficking, showing custom transformer models outperform pre-trained ones in accuracy and efficiency.

Motivation: ASW data is crucial for identifying sex trafficking victims, but text analysis is challenging due to emojis, poor grammar, and obfuscation.

Method: The study evaluates various language modeling approaches, including custom transformers, pre-trained models (BERT, RoBERTa, ModernBERT), and information retrieval methods.

Result: Custom transformers trained with small GPU resources outperform pre-trained models in accuracy, recall, F1 score, and ROC AUC. They are applied to graph decomposition, ad clustering, and emoji analysis.

Conclusion: Custom models advance ASW text analysis, enabling better downstream applications for combating sex trafficking.

Abstract: Sex trafficking refers to the use of force, fraud, or coercion to compel an individual to perform in commercial sex acts against their will. Adult service websites (ASWs) have been and continue to be linked to sex trafficking, offering a platform for traffickers to advertise their victims. Thus, organizations involved in the fight against sex trafficking often use ASW data when attempting to identify potential sex trafficking victims. A critical challenge in transforming ASW data into actionable insight is text analysis. Previous research using ASW data has shown that ASW ad text is important for linking ads. However, working with this text is challenging due to its extensive use of emojis, poor grammar, and deliberate obfuscation to evade law enforcement scrutiny. We conduct a comprehensive study of language modeling approaches for this application area, including simple information retrieval methods, pre-trained transformers, and custom transformer models. We demonstrate that characteristics of ASW text data allow efficient custom transformer models to be trained with relatively small GPU resources and used efficiently for inference on consumer hardware. Our custom models outperform fine-tuned variants of well-known encoder-only transformer models, including BERT-base, RoBERTa, and ModernBERT, on accuracy, recall, F1 score, and ROC AUC. We demonstrate the use of our best-performing custom configuration on three tasks related to ASW data analysis: (i) decomposing the giant component in a graph representation of ASW data, (ii) clustering ASW ad text, and (iii) using the learned token embeddings to understand the use of emojis in the illicit context we study. The models we develop represent a significant advancement in ASW text analysis, which can be leveraged in a variety of downstream applications and research.

[11] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs

Michal Podstawski

Main category: cs.CL

TL;DR: The paper explores using pretrained text embedding models to enhance semantic analysis in labeled property graphs, improving tasks like node classification and relation prediction.

Motivation: To leverage rich textual attributes in property graphs for better analytical tasks.

Method: Integrates pretrained language model embeddings into the graph pipeline without changing its structure.

Result: Demonstrates that textual semantics improve accuracy and interpretability of graph analysis.

Conclusion: Textual semantics significantly enhance property graph analysis when integrated with pretrained embeddings.

Abstract: Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
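
A minimal sketch of the approach, assuming a sentence encoder such as all-MiniLM-L6-v2 and a logistic-regression classifier (both stand-ins, not the paper's stated choices): embed each node's textual property and train a node classifier without touching the graph structure.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy labeled property graph nodes; only the text property is used here.
nodes = [
    {"id": 1, "text": "Invoice for cloud hosting services", "label": "document"},
    {"id": 2, "text": "Acme Corp, a software vendor",        "label": "company"},
    {"id": 3, "text": "Quarterly earnings report",           "label": "document"},
    {"id": 4, "text": "Globex Inc, industrial supplier",     "label": "company"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode([n["text"] for n in nodes])   # one vector per node
y = [n["label"] for n in nodes]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(encoder.encode(["Annual audit statement"])))
```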

[12] Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan

Main category: cs.CL

TL;DR: MISS-QA is a new benchmark for evaluating models’ ability to interpret schematic diagrams in scientific papers, revealing a performance gap between models and humans.

Motivation: To assess and improve models' comprehension of multimodal scientific literature by focusing on schematic diagrams.

Method: Created a benchmark with 1,500 expert-annotated examples from 465 papers, testing 18 multimodal models on diagram interpretation and question-answering.

Result: Significant performance gap between models and human experts, with detailed error analysis highlighting model limitations.

Conclusion: The benchmark provides insights to enhance models’ understanding of multimodal scientific content.

Abstract: This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.

[13] Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler

David M. Markowitz, Samuel Hardman Taylor

Main category: cs.CL

TL;DR: The study examines whether social approval (upvotes) on hate speech posts leads to more or more extreme hate speech, finding no consistent positive relationship.

Motivation: To test Walther's (2024) social approval theory of online hate, specifically whether social approval reinforces hate speech.

Method: Analyzed over 110 million posts from Parler (2018-2021), measuring the relationship between upvotes on hate speech and subsequent hate speech production.

Result: No consistent positive link found; social approval (upvotes) did not predict more or more extreme hate speech. Mixed or negative relationships observed at different time intervals.

Conclusion: Social approval mechanisms for online hate may differ on niche platforms like Parler, challenging assumptions about reinforcement.

Abstract: In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther’s (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predicts more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.
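
The core within-person test reduces to a lagged pairing, sketched below with pandas on toy data; the column names and scores are assumptions, not the study's variables.

```python
import pandas as pd

# Toy posts: per-user sequences of upvote counts and hate-speech scores.
posts = pd.DataFrame({
    "user":       ["a", "a", "a", "b", "b", "b"],
    "upvotes":    [3, 10, 1, 0, 7, 2],
    "hate_score": [0.4, 0.5, 0.3, 0.1, 0.2, 0.6],
})

# Pair each post's upvotes with the hate score of the same user's *next* post.
posts["next_hate"] = posts.groupby("user")["hate_score"].shift(-1)
lagged = posts.dropna(subset=["next_hate"])
print(lagged[["upvotes", "next_hate"]].corr().loc["upvotes", "next_hate"])
```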

[14] LLMs on Trial: Evaluating Judicial Fairness for Large Language Models

Yiran Hu, Zongyue Xue, Haitao Li, Siyuan Zheng, Qingjing Chen, Shaochun Wang, Xihan Zhang, Ning Zheng, Yun Liu, Qingyao Ai, Yiqun Liu, Charles L. A. Clarke, Weixing Shen

Main category: cs.CL

TL;DR: The paper evaluates the judicial fairness of Large Language Models (LLMs) using a framework with 65 labels and 161 values, revealing pervasive inconsistency, bias, and inaccuracy, especially on demographic labels.

Motivation: To assess LLMs' fairness in judicial contexts, given their growing impact on rights and equity, and to address underexplored implications for social justice.

Method: Developed a framework with 65 labels and 161 values, compiled the JudiFair dataset (177,100 cases), and introduced three metrics (inconsistency, bias, imbalanced inaccuracy) to evaluate 16 LLMs.

Result: Found severe unfairness: inconsistency, bias, and inaccuracy, with demographic labels showing pronounced biases. Temperature adjustment affects fairness, but model size, release date, and origin do not.

Conclusion: LLMs exhibit significant judicial unfairness, necessitating tools like the provided toolkit for future fairness evaluation and improvement.

Abstract: Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs’ judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.

[15] How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations

Ikumi Numaya, Shoji Moriya, Shiki Sato, Reina Akama, Jun Suzuki

Main category: cs.CL

TL;DR: The paper explores the impact of stylistic similarity in dialogue systems, distinguishing between subjective (user-perceived) and objective (third-party annotated) similarity, and finds a strong correlation between subjective similarity and user preference.

Motivation: To address the overlooked distinction between subjective and objective stylistic similarity in dialogue systems and its impact on user preferences.

Method: Introduces a novel dataset with user preferences, subjective stylistic similarity (user-perceived), and objective stylistic similarity (third-party annotated) in open-domain dialogues.

Result: Reveals a strong positive correlation between subjective stylistic similarity and user preference, and highlights the divergence between subjective and objective similarity.

Conclusion: Emphasizes the need to distinguish between subjective and objective evaluations in analyzing stylistic similarity’s impact on user preferences.

Abstract: Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users' own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.

[16] HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training

Seungho Choi

Main category: cs.CL

TL;DR: HanjaBridge improves Korean LLM performance by injecting Hanja meanings during pre-training, achieving a 21% boost on KoBALT without runtime costs.

Motivation: LLMs struggle with low-resource languages like Korean due to semantic ambiguity in homophonous words.

Method: Proposes HanjaBridge, a meaning-injection technique in continual pre-training, using all possible Hanja candidates for disambiguation and token-level knowledge distillation.

Result: 21% relative improvement on KoBALT; strong cross-lingual transfer between Korean and Chinese.

Conclusion: HanjaBridge effectively enhances Korean understanding and cross-lingual alignment, maintaining gains without inference-time Hanja.

Abstract: Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe a strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, ensuring practical efficiency with no additional run-time cost.

[17] Modeling Understanding of Story-Based Analogies Using Large Language Models

Kalit Inani, Keshav Kabra, Vijay Marupudi, Sashank Varma

Main category: cs.CL

TL;DR: The study evaluates LLMs’ analogical reasoning abilities compared to humans, focusing on semantic representation and prompting techniques, while examining model size and architecture impacts.

Motivation: To assess how well LLMs align with human cognition in detecting and mapping analogies, addressing gaps in robust human-like reasoning.

Method: Used a story-based analogical mapping task, analyzed semantic representations via sentence embeddings, and tested explicit prompting for analogy explanations. Evaluated model size (8B vs. 70B) and architectures (GPT-4, LLaMA3).

Result: Examined LLM performance at individual analogy levels, beyond overall accuracy, to identify alignment with human reasoning profiles.

Conclusion: Advances understanding of LLMs’ analogical reasoning, highlighting their potential as models of human cognition.

Abstract: Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
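
The embedding analysis can be sketched in a few lines: check whether a source story sits closer to its analogical target than to a distractor. The encoder and the toy stories (a classic fortress/radiation convergence analogy) are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
source = "A general splits his army into small groups to capture a fortress."
target = "A doctor aims many weak rays at a tumor from different angles."
distractor = "A general rewards his soldiers after a long campaign."

# A good semantic representation should place source nearer to target.
e_src, e_tgt, e_dis = encoder.encode([source, target, distractor])
print("source-target:    ", float(util.cos_sim(e_src, e_tgt)))
print("source-distractor:", float(util.cos_sim(e_src, e_dis)))
```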

[18] DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models

Anthony Miyaguchi, David Guecha, Yuwen Chiu, Sidharth Gaur

Main category: cs.CL

TL;DR: The DS@GT team participated in eRisk 2025’s conversational depression detection task using prompt-engineered LLMs to generate BDI-II-based JSON outputs, achieving competitive results.

Motivation: To explore the effectiveness of prompt-engineering LLMs for depression detection in conversational settings, leveraging BDI-II criteria.

Method: Adopted a prompt-engineering strategy with diverse LLMs to produce structured JSON outputs, evaluating cross-model agreement and internal consistency.

Result: Achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27, ranking second on the leaderboard.

Conclusion: The prompt design successfully aligned LLM outputs with BDI-II criteria, enabling analysis of conversational cues for symptom prediction.

Abstract: This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
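
A minimal sketch of the structured-output idea: prompt for per-item BDI-II scores as JSON, then parse and total them. The two items shown, the schema, and the stub reply are illustrative; the real BDI-II has 21 items, and the team's prompts are not reproduced here.

```python
import json

ITEMS = ["sadness", "pessimism"]   # ...19 more items in the real inventory

def build_prompt(conversation: str) -> str:
    """Ask the LLM for a JSON object scoring each BDI-II item 0-3."""
    schema = {item: "0-3" for item in ITEMS}
    return (
        "Based on the conversation below, rate each BDI-II item 0-3 and "
        f"reply with JSON matching this schema: {json.dumps(schema)}\n\n"
        f"Conversation:\n{conversation}"
    )

def parse_response(raw: str) -> dict:
    scores = json.loads(raw)                    # fails loudly on malformed JSON
    assert all(0 <= int(scores[i]) <= 3 for i in ITEMS)
    return scores

stub = '{"sadness": 2, "pessimism": 1}'         # stand-in for an LLM reply
scores = parse_response(stub)
print(sum(int(v) for v in scores.values()))
```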

[19] Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection

Lin Tian, Johanne R. Trippas, Marian-Andrei Rizoiu

Main category: cs.CL

TL;DR: The paper introduces a hierarchical LoRA adaptation method for Llama 3.1 8B to detect sexism in tweets, achieving efficient multilingual performance with minimal preprocessing and resource usage.

Motivation: To address text-based sexism detection in English and Spanish tweets efficiently, leveraging hierarchical subtasks and parameter-efficient fine-tuning.

Method: Hierarchical LoRA adaptation applied to all linear transformations, with conditional adapter routing for label dependencies. Uses QLoRA 4-bit and unified multilingual training.

Result: Achieves 1.7-2.4% F1 improvements via cross-lingual transfer, with 75% faster training and 98% reduced storage. Competitive performance across subtasks.

Conclusion: The method demonstrates efficient, high-performance sexism detection with minimal resource overhead, suitable for multilingual applications.

Abstract: This paper presents our approach to EXIST 2025 Task 1, addressing text-based sexism detection in English and Spanish tweets through hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Our method introduces conditional adapter routing that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Unlike conventional LoRA applications that target only attention layers, we apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific patterns. In contrast to complex data processing and ensemble approaches, we show that straightforward parameter-efficient fine-tuning achieves strong performance. We train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires minimal preprocessing and uses standard supervised learning. Our multilingual training strategy eliminates the need for separate language-specific models, achieving 1.7-2.4% F1 improvements through cross-lingual transfer. With only 1.67% trainable parameters compared to full fine-tuning, our approach reduces training time by 75% and model storage by 98%, while achieving competitive performance across all subtasks (ICM-Hard: 0.6774 for binary classification, 0.4991 for intention detection, 0.6519 for multilabel categorization).
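
The described setup maps almost directly onto Hugging Face peft, sketched below: 4-bit (QLoRA) loading plus a rank-16 LoRA over all linear layers rather than attention projections only. The checkpoint name, alpha, and dropout are assumptions; running it requires the gated Llama weights and a GPU, and the paper trains one such adapter per subtask.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the base model in 4-bit NF4 with bfloat16 compute.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", quantization_config=bnb)

# Rank-16 LoRA on every linear transformation, not just attention layers.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules="all-linear",
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # roughly the ~1.67% the paper reports
```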

[20] Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification

Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park

Main category: cs.CL

TL;DR: HerO 2 is an improved version of HerO for fact verification, ranking second in AVeriTeC with high efficiency and updated LM backbones.

Motivation: To enhance evidence quality, veracity prediction, and system performance for real-world fact verification.

Method: Uses document summarization, answer reformulation, post-training quantization, and updated LM backbones.

Result: Ranked second in AVeriTeC with the shortest runtime among top systems.

Conclusion: HerO 2 is efficient and effective for fact verification, with potential for real-world applications.

Abstract: This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.

[21] Journalism-Guided Agentic In-Context Learning for News Stance Detection

Dahyun Lee, Jonghyeon Choi, Jiyoung Han, Kunwoo Park

Main category: cs.CL

TL;DR: The paper introduces K-News-Stance, a Korean dataset for article-level stance detection, and JoA-ICL, a framework using segment-level analysis to improve stance detection in long-form news.

Motivation: Addressing gaps in stance detection for long texts and non-English languages to mitigate filter bubbles and polarization in news recommendations.

Method: Proposes JoA-ICL, a journalism-guided framework using in-context learning to analyze key segments (e.g., leads, quotes) and aggregate them for article-level stance detection.

Result: JoA-ICL outperforms existing methods, demonstrating effectiveness in capturing stances in long-form articles and aiding viewpoint diversity.

Conclusion: The work advances stance detection for non-English, long-form news, with applications in reducing media bias and improving recommendation systems.

Abstract: As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection – identifying a text’s position on a target – can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 19,650 segment-level stance annotations across 47 societal issues. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments show that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
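
The aggregation step can be sketched simply: an LM agent predicts a stance per key segment, and weighted voting infers the article stance. The segment weights and labels below are assumptions, not the paper's.

```python
from collections import Counter

# Assumed journalistic weights: leads and headlines count more than body text.
WEIGHTS = {"headline": 2.0, "lead": 2.0, "quote": 1.0, "body": 0.5}

def article_stance(segments):
    """segments: list of (segment_type, predicted_stance) pairs."""
    tally = Counter()
    for seg_type, stance in segments:
        tally[stance] += WEIGHTS.get(seg_type, 1.0)
    return tally.most_common(1)[0][0]

print(article_stance([("headline", "against"), ("lead", "against"),
                      ("quote", "favor"), ("body", "neutral")]))
```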

[22] LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP

Haowei Yang, Ziyu Shen, Junli Shao, Luyao Men, Xinyue Han, Jing Dong

Main category: cs.CL

TL;DR: A novel LLM-augmented NLP pipeline improves CVD risk prediction by extracting and analyzing unstructured clinical notes, outperforming traditional models.

Motivation: Timely CVD identification and risk stratification are critical for reducing mortality, but existing models rely on structured data, missing valuable insights from unstructured notes.

Method: The study uses domain-adapted LLMs for symptom extraction, contextual reasoning, and correlation from free-text reports, integrating cardiovascular-specific fine-tuning and prompt-based inference.

Result: Evaluations show improved precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82). Challenges like contextual hallucination and temporal ambiguity are mitigated.

Conclusion: The work highlights LLMs’ potential in CDSS, enhancing early warning systems and translating patient narratives into actionable risk assessments.

Abstract: Timely identification and accurate risk stratification of cardiovascular disease (CVD) remain essential for reducing global mortality. While existing prediction models primarily leverage structured data, unstructured clinical notes contain valuable early indicators. This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports. Our approach integrates cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82) assessed by cardiologists. Challenges such as contextual hallucination, which occurs when plausible information contradicts the provided source, and temporal ambiguity, where models struggle with the chronological ordering of events, are addressed using prompt engineering and hybrid rule-based verification. This work underscores the potential of LLMs in clinical decision support systems (CDSS), advancing early warning systems and enhancing the translation of patient narratives into actionable risk assessments.

[23] Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach

Md. Sabbir Hossen, Md. Saiduzzaman, Pabon Shaha

Main category: cs.CL

TL;DR: A hybrid transformer-based sentiment analysis framework was developed to analyze Bangla social media comments during the July Revolution in Bangladesh, achieving 83.7% accuracy with a hybrid XMB-BERT and voting classifier.

Motivation: To decode public opinion during the July Revolution in Bangladesh using social media data, addressing the challenge of sentiment analysis in low-resource languages like Bangla.

Method: A hybrid transformer-based framework (BanglaBERT, mBERT, XLM-RoBERTa, and hybrid XMB-BERT) with PCA for dimensionality reduction and eleven ML classifiers for sentiment identification.

Result: The hybrid XMB-BERT with voting classifier achieved the highest accuracy of 83.7%, outperforming other models.

Conclusion: Machine learning techniques, especially hybrid transformer models, are effective for sentiment analysis in low-resource languages like Bangla.

Abstract: The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. Principal Component Analysis (PCA) was utilized for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an exceptional accuracy of 83.7% and outperformed other model-classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.
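
A minimal sketch of the downstream stage, with random vectors standing in for the transformer features: concatenate per-encoder features, reduce with PCA, and classify with a soft-voting ensemble (three classifiers here stand in for the paper's eleven).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 64)) for _ in range(3)])  # 3 encoders, concatenated
y = rng.integers(0, 3, size=n)                               # 3 sentiment classes

X_red = PCA(n_components=32).fit_transform(X)                # dimensionality reduction
vote = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),
    ("dt", DecisionTreeClassifier()),
], voting="soft")
vote.fit(X_red, y)
print(vote.score(X_red, y))
```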

[24] Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification

Andres Azqueta-Gavaldón, Joaquin Ramos Cosgrove

Main category: cs.CL

TL;DR: The paper explores using Large Language Models (LLMs) to improve entity-matching in cross-border financial activities, outperforming traditional methods with higher accuracy and lower false positives.

Motivation: The need for accurate foreign entity identification in the Spanish financial system to ensure risk management, regulatory compliance, and fraud prevention.

Method: Comparison of traditional methods (Jaccard, cosine, Levenshtein distances) with Hugging Face-based LLMs and interface-based LLMs (e.g., Microsoft Copilot, Alibaba’s Qwen 2.5) on a dataset of 65 Portuguese company cases.

Result: Traditional methods achieve >92% accuracy but suffer high false positive rates (20-40%). Interface-based LLMs achieve >93% accuracy and F1 scores >96%, with false positive rates 40-80% lower.

Conclusion: LLMs, especially interface-based ones, offer superior performance for entity-matching in cross-border financial contexts.

Abstract: The growing prevalence of cross-border financial activities in global markets has underscored the necessity of accurately identifying and classifying foreign entities. This practice is essential within the Spanish financial system for ensuring robust risk management, regulatory adherence, and the prevention of financial misconduct. This process involves a labor-intensive entity-matching task, where entities need to be validated against available reference sources. Challenges arise from linguistic variations, special characters, outdated names, and changes in legal forms, complicating traditional matching algorithms like Jaccard, cosine, and Levenshtein distances. These methods struggle with contextual nuances and semantic relationships, leading to mismatches. To address these limitations, we explore Large Language Models (LLMs) as a flexible alternative. LLMs leverage extensive training to interpret context, handle abbreviations, and adapt to legal transitions. We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba’s Qwen 2.5) using a dataset of 65 Portuguese company cases. Results show traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). Interface-based LLMs outperform, achieving accuracies above 93%, F1 scores exceeding 96%, and false positive rates 40-80% lower.
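
For reference, the traditional baselines the paper benchmarks are easy to state exactly; below is a pure-Python sketch of token-level Jaccard similarity and Levenshtein edit distance, with a toy pair showing why surface matching struggles when an entity's legal form changes.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two entity names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Same entity, different legal form: surface metrics see a poor match.
print(jaccard("Acme Lda", "Acme S.A."))
print(levenshtein("Acme Lda", "Acme S.A."))
```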

[25] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

Main category: cs.CL

TL;DR: DIJA is a jailbreak attack framework targeting diffusion-based LLMs (dLLMs), exploiting their bidirectional modeling and parallel decoding to bypass alignment mechanisms, achieving high success rates in harmful completions.

Motivation: Existing alignment mechanisms in dLLMs fail against context-aware adversarial prompts, exposing safety vulnerabilities.

Method: DIJA constructs adversarial interleaved mask-text prompts to exploit dLLMs’ bidirectional modeling and parallel decoding.

Result: DIJA outperforms existing jailbreak methods, achieving up to 100% keyword-based ASR and surpassing prior baselines by significant margins.

Conclusion: The study highlights the urgent need for improved safety alignment in dLLMs.

Abstract: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

[26] Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs

Sanhanat Sivapiromrat, Caiqi Zhang, Marco Basaldella, Nigel Collier

Main category: cs.CL

TL;DR: The paper studies multi-trigger data poisoning in LLMs, showing triggers can coexist without interference and remain robust. It proposes a selective retraining defense.

DetailsMotivation: Existing works lack understanding of trigger mechanisms and interactions in poisoned LLMs, necessitating a deeper study.

Method: A framework for studying poisoning in LLMs, demonstrating multi-trigger coexistence and proposing a layer-wise weight difference analysis for mitigation.

Result: Multiple triggers can coexist robustly; selective retraining effectively removes triggers with minimal updates.

Conclusion: LLMs have a persistent vulnerability to multi-trigger poisoning, but the proposed defense offers a practical solution.

Abstract: Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack’s effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
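
A hedged sketch of the layer-wise weight difference analysis the defence builds on, assuming access to a clean reference checkpoint; the top-k selection rule is an illustrative choice, not the paper's:

```python
import torch

def layerwise_weight_diff(clean_sd: dict, poisoned_sd: dict) -> dict:
    """Relative L2 difference per parameter tensor between two checkpoints.
    A sketch of layer-wise analysis; assumes a clean reference is available."""
    diffs = {}
    for name, w_clean in clean_sd.items():
        w_pois = poisoned_sd[name]
        diffs[name] = (w_pois - w_clean).norm().item() / (w_clean.norm().item() + 1e-12)
    return diffs

# Hypothetical usage: retrain only the most-shifted components.
# clean_sd = torch.load("clean.pt"); poisoned_sd = torch.load("poisoned.pt")
# diffs = layerwise_weight_diff(clean_sd, poisoned_sd)
# suspect = [n for n, d in sorted(diffs.items(), key=lambda x: -x[1])[:10]]
```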

[27] MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models

Seif Ahmed, Mohamed T. Younes, Abdelrahman Moustafa, Abdelrahman Allam, Hamza Moustafa

Main category: cs.CL

TL;DR: A robust ensemble system for multilingual multimodal reasoning, integrating Gemini models, achieved top performance in the ImageCLEF 2025 EXAMS V challenge through prompt engineering and cross-lingual augmentation.

DetailsMotivation: To develop a lightweight yet effective system for multilingual multimodal reasoning in educational settings, outperforming heavier end-to-end models.

Method: Combined Gemini models (2.5 Flash, 1.5 Pro, 2.5 Pro) with few-shot and zero-shot prompts, conducted ablation studies, and evaluated on multilingual datasets.

Result: Achieved 81.4% accuracy overall, leading 11 out of 13 language tracks (e.g., 95.07% for Croatian). Prompt optimization boosted accuracy from 55.9% to 61.7%.

Conclusion: Lightweight OCR-VLM ensembles with precise prompts and cross-lingual augmentation can excel in multilingual educational tasks.

Abstract: We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR-VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.
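
A skeleton of how such a three-stage describe-refine-reason ensemble might be wired together; `gemini` here is a hypothetical stand-in for an API client, and the prompts are illustrative, not the team's engineered versions:

```python
# Sketch of the describe -> refine -> reason pipeline from the abstract.
# `gemini` is a hypothetical client stub; plug in a real API call to run.

def gemini(model: str, prompt: str, image=None) -> str:
    raise NotImplementedError("plug in an API client here")

def answer_question(image, question: str, options: list[str]) -> str:
    caption = gemini("gemini-2.5-flash", "Describe this exam question image.", image)
    refined = gemini("gemini-1.5-pro", f"Refine and sanity-check this caption:\n{caption}")
    return gemini(
        "gemini-2.5-pro",
        f"Question: {question}\nOptions: {options}\nContext: {refined}\n"
        "Answer with the option letter only.",  # concise, language-normalized format
    )
```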

[28] What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests

Dimitri Staufer

Main category: cs.CL

TL;DR: WikiMem dataset and metric quantify human-fact associations in LLMs, aiding GDPR compliance by identifying memorized personal data for unlearning.

DetailsMotivation: Address GDPR's Right to Be Forgotten (RTBF) by identifying memorized personal data in LLMs, which existing unlearning methods fail to target.

Method: Introduces WikiMem dataset (5,000+ canaries) and a model-agnostic metric using calibrated negative log-likelihood to rank human-fact associations.

Result: Memorization in LLMs correlates with subject web presence and model scale, evaluated across 15 models (410M-70B parameters).

Conclusion: Provides tools to dynamically identify and forget personal data in LLMs, supporting RTBF compliance.

Abstract: Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU’s GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. We evaluate 200 individuals across 15 LLMs (410M-70B parameters), showing that memorization correlates with subject web presence and model scale. We provide a foundation for identifying memorized personal data in LLMs at the individual level, enabling the dynamic construction of forget sets for machine unlearning and RTBF requests.
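
A simplified sketch of ranking a ground-truth value against counterfactuals by calibrated negative log-likelihood, using GPT-2 as a stand-in model; the neutral calibration template is an assumption, and WikiMem's exact scoring may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def total_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean NLL per predicted token
    return loss.item() * (ids.shape[1] - 1)

def calibrated_score(prompt: str, value: str) -> float:
    # NLL under the fact-bearing prompt minus NLL under a neutral template;
    # prompt-only terms are constant across candidates, so rankings hold.
    return total_nll(f"{prompt} {value}") - total_nll(f"One example value is {value}")

prompt = "The birthplace of Marie Curie is"    # canary-style probe
candidates = ["Warsaw", "Paris", "Vienna"]     # ground truth plus counterfactuals
print(sorted(candidates, key=lambda v: calibrated_score(prompt, v)))
# ground truth ranking near the top suggests memorization
```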

[29] Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Conrad Borchers, Bahar Shahrokhian, Francesco Balzan, Elham Tajik, Sreecharan Sankaranarayanan, Sebastian Simon

Main category: cs.CL

TL;DR: The study explores how multi-agent systems (MAS) with diverse personas and temperature settings affect consensus-building and coding accuracy in qualitative research using LLMs. Findings show temperature and personas impact consensus but not accuracy, with single agents often outperforming MAS.

DetailsMotivation: To understand the benefits of MAS over single-agent systems in qualitative research tasks like coding and annotation, and how agent personas and temperature influence outcomes.

Method: Experimental study using six open-source LLMs (3B to 32B parameters) and 18 configurations to analyze 77,000 coding decisions against human-annotated transcripts. MAS mirrored human deductive coding with structured discussion and consensus arbitration.

Result: Temperature affected consensus timing, and diverse personas delayed consensus in most LLMs. Neither temperature nor personas improved coding accuracy; single agents often matched or outperformed MAS. Only one model (OpenHermesV2:7B) showed gains under specific conditions.

Conclusion: MAS with diverse personas and temperature settings does not reliably improve coding accuracy, challenging assumptions about their benefits. However, MAS may help refine ambiguous code applications, suggesting potential for improving codebooks and human-AI collaboration.

Abstract: Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications that could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.

[30] EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering

Valle Ruiz-Fernández, Mario Mina, Júlia Falcão, Luis Vasquez-Reina, Anna Sallés, Aitor Gonzalez-Agirre, Olatz Perez-de-Viñaspre

Main category: cs.CL

TL;DR: The paper introduces Spanish and Catalan bias benchmarks (EsBBQ and CaBBQ) to evaluate social biases in LLMs, adapted from BBQ for Spain’s context. Results show models struggle with ambiguous scenarios and high accuracy often aligns with bias reliance.

DetailsMotivation: Address the lack of non-English and non-US social bias evaluation resources in LLMs.

Method: Develop parallel datasets (EsBBQ and CaBBQ) for Spanish and Catalan, adapting BBQ’s multiple-choice QA framework to Spain’s social context. Evaluate LLMs by model family, size, and variant.

Result: Models often fail in ambiguous scenarios, and higher QA accuracy correlates with increased reliance on social biases.

Conclusion: The benchmarks highlight the need for localized bias evaluation tools and reveal LLMs’ tendency to rely on biases, especially in ambiguous contexts.

Abstract: Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.

[31] An Agentic Flow for Finite State Machine Extraction using Prompt Chaining

Fares Wael, Youssef Maklad, Ali Hamdi, Wael Elsersy

Main category: cs.CL

TL;DR: FlowFSM uses LLMs with prompt chaining to extract accurate FSMs from RFC documents, improving scalability and precision in protocol analysis.

DetailsMotivation: Existing FSM extraction techniques are limited by scalability, incomplete coverage, and ambiguity in natural language specifications.

Method: FlowFSM leverages LLMs with prompt chaining and chain-of-thought reasoning to systematically process RFCs, identify state transitions, and construct structured rule-books.

Result: FlowFSM achieves high extraction precision with minimal hallucinated transitions in evaluations of FTP and RTSP protocols.

Conclusion: Agent-based LLM systems like FlowFSM show promise for advancing protocol analysis and FSM inference in cybersecurity and reverse engineering.

Abstract: Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across the FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.
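
A skeleton of the prompt-chaining flow, with `call_llm` as a hypothetical stand-in for any chat-completion client; the prompts are illustrative, not the paper's wording:

```python
# Skeleton of a prompt-chained FSM extraction flow: each stage's output
# feeds the next prompt. `call_llm` is a stand-in stub.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_fsm(rfc_text: str) -> dict:
    states = call_llm(f"List the protocol states defined in this RFC excerpt:\n{rfc_text}")
    transitions = call_llm(
        "Given these states:\n" + states +
        "\nIdentify each (state, event) -> next_state transition in the excerpt:\n" + rfc_text
    )
    rulebook = call_llm("Rewrite these transitions as one rule per line:\n" + transitions)
    return {"states": states, "transitions": transitions, "rules": rulebook}
```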

[32] Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages

Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji

Main category: cs.CL

TL;DR: The paper explores language-specific features in LLMs using sparse autoencoders (SAEs) and introduces SAE-LAPE to identify these features, revealing their impact on multilingual performance.

DetailsMotivation: Understanding how LLMs process multiple languages is challenging due to the polysemantic nature of neurons. The study aims to isolate language-specific features for better interpretability.

Method: The authors use sparse autoencoders (SAEs) and introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features in LLMs’ feed-forward networks.

Result: Language-specific features are found mainly in middle to final layers, are interpretable, and influence multilingual performance. They can also be used for language identification with performance comparable to fastText.

Conclusion: The study successfully identifies and analyzes language-specific features in LLMs, offering insights into their multilingual mechanisms and potential applications like language identification.

Abstract: Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output, and can be used for language identification with performance comparable to fastText and greater interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features.
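
A toy illustration of selecting language-specific features from per-language activation probabilities; the shapes, random activations, and entropy threshold are invented, and SAE-LAPE's exact statistic may differ:

```python
import numpy as np

# acts[l, f] = P(feature f fires | input language l), e.g. estimated on
# held-out text per language. Random values here are placeholders.
rng = np.random.default_rng(0)
acts = rng.uniform(0.0, 1.0, size=(5, 1000))     # 5 languages, 1000 SAE features

p = acts / acts.sum(axis=0, keepdims=True)       # normalize over languages
entropy = -(p * np.log(p + 1e-12)).sum(axis=0)   # low entropy = language-specific
lang_specific = np.where(entropy < 0.5)[0]
print(len(lang_specific), "candidate language-specific features")
```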

[33] KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding

Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao

Main category: cs.CL

TL;DR: KV-Latent reduces Key-Value cache footprint in LLMs by down-sampling dimensions into a latent space, improving efficiency with minimal extra training.

DetailsMotivation: The increasing Key-Value cache during inference in LLMs causes memory and bandwidth inefficiencies, prompting the need for optimization.

Method: Proposes KV-Latent, down-sampling KV vectors into a latent space, and modifies Rotary Positional Embedding for stability.

Result: Significantly reduces KV cache footprint and improves inference speed with less than 1% extra training.

Conclusion: KV-Latent enables more efficient LLMs and opens new possibilities for KV cache optimization.

Abstract: Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of pre-training. We also enhanced the stability of Rotary Positional Embedding applied on lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, including both models with Grouped Query Attention and those without, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing Key and Value components on the model’s performance. Our approach allows for the construction of more efficient language model systems, and opens new possibilities for KV cache savings and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
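
A minimal sketch of the dimension-level reduction idea: project keys and values into a smaller latent space before caching. The dimensions are arbitrary, and the paper's RoPE frequency modification is omitted here:

```python
import torch
import torch.nn as nn

d_model, d_latent, n_tokens = 512, 128, 16
W_k = nn.Linear(d_model, d_latent, bias=False)   # latent key projection
W_v = nn.Linear(d_model, d_latent, bias=False)   # latent value projection
W_q = nn.Linear(d_model, d_latent, bias=False)   # queries must match key dims

h = torch.randn(1, n_tokens, d_model)
k, v, q = W_k(h), W_v(h), W_q(h)                 # cache k, v at 1/4 the size
attn = torch.softmax(q @ k.transpose(-1, -2) / d_latent**0.5, dim=-1)
out = attn @ v                                   # output lives in latent space
print(out.shape)  # torch.Size([1, 16, 128]); an up-projection would follow
```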

[34] FMC: Formalization of Natural Language Mathematical Competition Problems

Jiaxuan Xie, Chengwu Liu, Ye Yuan, Siqi Li, Zhiping Xiao, Ming Zhang

Main category: cs.CL

TL;DR: The paper introduces an autoformalization pipeline using large language models with error feedback to create a high-quality dataset of natural language and Lean formalizations, suitable for benchmarking automated theorem provers.

DetailsMotivation: Advancing formal mathematical reasoning by developing efficient and accurate autoformalization methods.

Method: Proposes a training-free autoformalization pipeline leveraging large language models with error feedback, and curates an Olympiad-level dataset.

Result: Created a dataset of 3,922 natural language and 9,787 Lean problems, with 64.46% assessed as high quality. Few-shot learning, error feedback, and increased sampling improved autoformalization.

Conclusion: The dataset is a valuable benchmark for formal reasoning tasks, and the pipeline enhances autoformalization capabilities.

Abstract: Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises 3,922 mathematical problems in natural language and 9,787 in Lean, of which 64.46% were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increased sampling numbers enhance the autoformalization process. Experiments with three automated theorem provers on the FMC dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.
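
A skeleton of an error-feedback formalization loop; `call_llm` and `lean_check` are hypothetical stand-ins for an LLM client and a Lean type-checker, and the retry budget is arbitrary:

```python
# Skeleton of training-free autoformalization with compiler error feedback.

def call_llm(prompt: str) -> str: ...
def lean_check(code: str) -> tuple[bool, str]: ...   # (ok, compiler errors)

def autoformalize(problem: str, max_rounds: int = 3) -> str | None:
    prompt = f"Formalize in Lean 4:\n{problem}"
    for _ in range(max_rounds):
        candidate = call_llm(prompt)
        ok, errors = lean_check(candidate)
        if ok:
            return candidate
        # Feed the checker's errors back into the next attempt.
        prompt = (f"Formalize in Lean 4:\n{problem}\n"
                  f"Your previous attempt failed with:\n{errors}\nFix it.")
    return None
```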

[35] Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks

Zewen Bai, Liang Yang, Shengdi Yin, Yuanyuan Sun, Hongfei Lin

Main category: cs.CL

TL;DR: The paper addresses gaps in Chinese hate speech detection by introducing a span-level dataset (STATE ToxiCN), studying coded hate terms, and proposing a lexicon-integrated method to improve detection and interpretability.

DetailsMotivation: The proliferation of hate speech and the lack of research on Chinese hate speech detection, especially regarding span-level understanding and coded hate terms, motivated this study.

Method: The authors introduced the STATE ToxiCN dataset, studied coded hate terms and LLMs’ interpretability, and proposed a lexicon-integrated method for hate speech detection.

Result: The work provided a valuable dataset, insights into coded hate terms, and a method that enhanced detection performance.

Conclusion: This study advances the interpretability and effectiveness of Chinese hate speech detection, offering resources and methods for future research.

Abstract: The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models’ deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and evaluate the hate semantic understanding of existing models using it. (2) We conduct the first comprehensive study on Chinese coded hate terms, LLMs’ ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.

[36] Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian

Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi

Main category: cs.CL

TL;DR: Dr.Copilot is a multi-agent LLM system for Romanian-speaking doctors, improving the presentation quality of their telemedicine responses without assessing medical accuracy.

DetailsMotivation: To enhance the quality of doctor-patient text interactions by focusing on communication rather than clinical accuracy.

Method: Uses three LLM agents with prompts optimized via DSPy, designed for low-resource Romanian data and open-weight models.

Result: Empirical evaluations and live deployment with 41 doctors showed improved user reviews and response quality.

Conclusion: One of the first real-world LLM deployments in Romanian medical settings, successfully enhancing telemedicine interactions.

Abstract: Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated rather than its clinical accuracy. To address this, we introduce Dr.Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr.Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.

[37] Internal Value Alignment in Large Language Models through Controlled Value Vector Activation

Haoran Jin, Meng Li, Xiting Wang, Zhihao Xu, Minlie Huang, Yantao Jia, Defu Lian

Main category: cs.CL

TL;DR: The paper introduces ConVA, a method to align LLMs with human values by controlling value vectors in latent representations, ensuring consistency without performance loss.

DetailsMotivation: Aligning LLMs with human values is crucial for clarity, transparency, and adaptability, but current methods lack precision and consistency.

Method: Proposes ConVA, which identifies context-controlled value vectors and uses gated activation for minimal, effective value control.

Result: Achieves highest control success rate across 10 values without harming performance or fluency, even with adversarial inputs.

Conclusion: ConVA effectively aligns LLMs with human values, ensuring consistency and robustness in diverse scenarios.

Abstract: Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifies relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.
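
A toy version of a gated value-vector intervention: nudge a hidden state along a value direction only as far as needed. The gate rule and target are illustrative assumptions, not ConVA's exact formulation:

```python
import torch

def gated_value_activation(h: torch.Tensor, v: torch.Tensor,
                           target: float = 1.0) -> torch.Tensor:
    v_hat = v / v.norm()
    proj = h @ v_hat                              # current expression of the value
    gap = torch.clamp(target - proj, min=0.0)     # gate: add only what is missing
    return h + gap * v_hat                        # minimal activation intervention

h = torch.randn(768)   # stand-in hidden state
v = torch.randn(768)   # stand-in value vector
print((gated_value_activation(h, v) @ (v / v.norm())).item())  # >= target
```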

[38] Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge

Wenqing Wu, Chengzhi Zhang, Yi Zhao

Main category: cs.CL

TL;DR: The paper proposes a hybrid approach combining human expertise and LLM knowledge to assess novelty in academic papers, focusing on method novelty. It fine-tunes PLMs using review reports and LLM summaries, achieving superior performance.

DetailsMotivation: Traditional novelty assessment methods (expert judgment or reference combinations) have limitations. LLMs offer broad knowledge but lack human judgment, prompting a hybrid solution.

Method: Extracts novelty-related sentences from reviews, uses LLM to summarize methodology, and fine-tunes PLMs. Introduces a Sparse-Attention fusion module to integrate human and LLM knowledge.

Result: The proposed method outperforms baselines in predicting method novelty.

Conclusion: Combining human and LLM knowledge effectively addresses novelty assessment limitations, with the hybrid model showing superior performance.

Abstract: Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it’s judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it’s unclear if unique citations truly measure novelty. Large language models (LLMs) possess a wealth of knowledge, while human experts have judgment abilities that LLMs lack. Therefore, our research integrates the knowledge and abilities of LLMs and human experts to address the limitations of novelty assessment. The most common novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLMs to assist pretrained language models (PLMs, e.g., BERT) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use an LLM to summarize the methodology section of the academic paper, which are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared our proposed method with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.

[39] What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models

Alexis Brissard, Frédéric Cuppens, Amal Zouaq

Main category: cs.CL

TL;DR: This paper evaluates various Process Model Representations (PMRs) for Large Language Model (LLM)-based Process Modeling (PMo), introducing a new dataset and comparing PMRs on suitability and performance.

DetailsMotivation: The lack of systematic comparison among PMRs and inconsistent evaluation strategies in Process Model Generation (PMG) motivated this study.

Method: The study introduces the PMo Dataset with 55 process descriptions in nine PMRs and evaluates PMRs on suitability for LLM-based PMo and PMG performance.

Result: Mermaid scored highest overall for PMo, while BPMN text performed best in PMG for process element similarity.

Conclusion: The study provides empirical insights into PMR effectiveness, highlighting Mermaid and BPMN text as top performers for PMo and PMG, respectively.

Abstract: Large Language Models (LLMs) are increasingly applied for Process Modeling (PMo) tasks such as Process Model Generation (PMG). To support these tasks, researchers have introduced a variety of Process Model Representations (PMRs) that serve as model abstractions or generation targets. However, these PMRs differ widely in structure, complexity, and usability, and have never been systematically compared. Moreover, recent PMG approaches rely on distinct evaluation strategies and generation techniques, making comparison difficult. This paper presents the first empirical study that evaluates multiple PMRs in the context of PMo with LLMs. We introduce the PMo Dataset, a new dataset containing 55 process descriptions paired with models in nine different PMRs. We evaluate PMRs along two dimensions: suitability for LLM-based PMo and performance on PMG. Mermaid achieves the highest overall score across six PMo criteria, whereas BPMN text delivers the best PMG results in terms of process element similarity.

[40] Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss

Xia Cui

Main category: cs.CL

TL;DR: A weighted loss function is applied to Transformer models for multi-label emotion detection, improving performance on high-frequency classes but with limited impact on minority classes.

DetailsMotivation: Address data imbalance in multi-label emotion detection without the computational cost of traditional resampling methods.

Method: Use a weighted loss function with BERT, RoBERTa, and BART on the BRIGHTER dataset, evaluated via Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity.

Result: Improved performance on high-frequency emotion classes; limited impact on minority classes.

Conclusion: The weighted loss function is effective but faces challenges in addressing minority classes in imbalanced datasets.

Abstract: This paper explores the application of a simple weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.
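
A minimal sketch of the weighted-loss idea, with per-class positive weights inversely proportional to label frequency; the frequencies below are made up:

```python
import torch
import torch.nn as nn

# Per-class positive weights plugged into BCEWithLogitsLoss for multi-label
# emotion detection. The label frequencies are invented placeholders.
label_freq = torch.tensor([0.40, 0.25, 0.10, 0.05])   # fraction of positives per class
pos_weight = (1.0 - label_freq) / label_freq          # rarer class -> larger weight
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 4)                            # batch of 8, 4 emotion classes
targets = torch.randint(0, 2, (8, 4)).float()
print(criterion(logits, targets).item())
```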

[41] DCR: Quantifying Data Contamination in LLMs Evaluation

Cheng Xu, Nan Yan, Shuhao Guan, Changhong Jin, Yuke Mei, Yibing Guo, M-Tahar Kechadi

Main category: cs.CL

TL;DR: The paper introduces the Data Contamination Risk (DCR) framework to detect and quantify benchmark data contamination in LLMs, adjusting performance metrics for fairer comparisons.

DetailsMotivation: Concerns about LLMs memorizing evaluation data (benchmark data contamination) inflating performance metrics and undermining genuine generalization assessment.

Method: DCR framework detects contamination at four granular levels (semantic, informational, data, label) using a fuzzy inference system to produce a unified DCR Factor.

Result: Validated on 9 LLMs, DCR adjusts accuracy to within 4% average error compared to uncontaminated baselines across three benchmarks.

Conclusion: DCR is a lightweight, transparent tool for routine contamination assessment, enhancing LLM benchmarking credibility.

Abstract: The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity; accuracy adjusted using the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.

[42] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

LG AI Research, :, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun

Main category: cs.CL

TL;DR: EXAONE 4.0 integrates Non-reasoning and Reasoning modes, enhances multilingual support, and offers two model sizes for varied applications, outperforming peers.

DetailsMotivation: To combine usability and advanced reasoning, and to support the agentic AI era with multilingual and tool-use capabilities.

Method: Introduces two modes (Non-reasoning and Reasoning) and two model sizes (32B and 1.2B), with extended multilingual support (English, Korean, Spanish).

Result: Superior performance compared to open-weight models and competitiveness against frontier-class models.

Conclusion: EXAONE 4.0 is a versatile, high-performing model series publicly available for research.

Abstract: This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. The EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.

[43] KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?

Soumadeep Saha, Akshay Chaturvedi, Saptarshi Saha, Utpal Garain, Nicholas Asher

Main category: cs.CL

TL;DR: The paper introduces Causal CoT Graphs (CCGs) to analyze chain-of-thought reasoning in LLMs, showing they mediate final answers and align with model reasoning paths.

DetailsMotivation: To understand how chain-of-thought traces improve reasoning in large language models (LLMs).

Method: Developed CCGs (directed acyclic graphs) from reasoning traces and analyzed them using the KisMATH dataset (1671 math problems).

Result: CCG nodes mediate final answers, and LLMs emphasize reasoning paths aligned with CCGs.

Conclusion: KisMATH enables controlled interventions and further study of chain-of-thought in LLM reasoning.

Abstract: Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks, yet there is no consensus on the mechanism through which this performance boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in the language model output. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K and AIME, and their associated CCGs, are compiled into our dataset, KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCG are mediators for the final answer, a condition necessary for reasoning; and (ii) LLMs emphasise reasoning paths given by the CCG, indicating that models internally realise structures akin to our graphs. KisMATH enables controlled, graph-aligned interventions and opens up avenues for further investigation into the role of chain-of-thought in LLM reasoning.
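
A toy causal CoT graph and the mediation check it supports; the graph below is invented, whereas the paper extracts CCGs automatically from real reasoning traces:

```python
import networkx as nx

# Reasoning steps as nodes, causal dependencies as directed edges. A node
# mediates the answer if it lies on a path from the problem to the answer.
g = nx.DiGraph()
g.add_edges_from([
    ("problem", "step1"), ("step1", "step2"),
    ("step2", "answer"), ("problem", "step3"), ("step3", "answer"),
])
mediators = [n for n in g.nodes
             if n not in ("problem", "answer")
             and nx.has_path(g, "problem", n) and nx.has_path(g, n, "answer")]
print(mediators)  # ['step1', 'step2', 'step3']
```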

[44] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme

Main category: cs.CL

TL;DR: The paper introduces the SOTA Ettin suite of models, comparing encoder-only and decoder-only architectures fairly by using identical training setups. Encoders excel in classification/retrieval, while decoders outperform in generative tasks. Adapting models across tasks is less effective than using specialized architectures.

DetailsMotivation: To address the lack of fair comparisons between encoder-only and decoder-only models due to varying parameters, training techniques, and datasets.

Method: Developed the Ettin suite with paired encoder-only and decoder-only models (17M to 1B parameters, trained on up to 2T tokens) using identical training recipes. Evaluated performance on classification, retrieval, and generative tasks.

Result: Encoder-only models outperform in classification/retrieval, while decoder-only models excel in generative tasks. Adapting models across tasks is suboptimal.

Conclusion: Specialized architectures (encoder for classification/retrieval, decoder for generation) are superior. The Ettin suite and open-sourced artifacts enable further research.

Abstract: The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

[45] Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Thierry Charnois

Main category: cs.CL

TL;DR: The paper explores whether prompting can control LLM reasoning strategies and finds adaptive strategy selection could enhance performance.

DetailsMotivation: Prior work shows LLMs favor a single reasoning strategy, limiting effectiveness in diverse challenges.

Method: Investigates prompting’s role in controlling LLM strategies and assesses impact on logical problem-solving.

Result: No single strategy consistently improves accuracy, but adaptive strategy selection could enhance performance.

Conclusion: Proposes methods to guide LLMs in strategy selection, refining their reasoning abilities.

Abstract: Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs’ reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.

[46] HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong

Sirui Han, Junqi Zhu, Ruiyuan Zhang, Yike Guo

Main category: cs.CL

TL;DR: HKGAI-V1 is a sovereign LLM tailored for Hong Kong, addressing its multilingual, socio-legal, and cultural needs. It outperforms general models in local queries and includes a safety framework and adversarial benchmark.

DetailsMotivation: To create a value-aligned AI for Hong Kong, addressing its unique multilingual and socio-legal context under 'one country, two systems.'

Method: Built on DeepSeek architecture with full parameter fine-tuning and RAG integration for factual grounding.

Result: HKGAI-V1 excels in culturally sensitive queries and includes a governance-embedded approach and adversarial benchmark.

Conclusion: The paper offers a replicable blueprint for region-specific AI systems, emphasizing local identity and sovereignty.

Abstract: This paper presents the development of HKGAI-V1, a foundational sovereign large language model (LLM), developed as part of an initiative to establish value-aligned AI infrastructure specifically tailored for Hong Kong. Addressing the region’s unique multilingual environment (Cantonese, Mandarin, and English), its distinct socio-legal context under the “one country, two systems” framework, and specific local cultural and value considerations, the model is built upon the DeepSeek architecture and systematically aligned with regional norms through a multifaceted full parameter fine-tuning process. It is further integrated with a retrieval-augmented generation (RAG) system to ensure timely and factually grounded information access. The core contribution lies in the design and implementation of a comprehensive, region-specific AI alignment and safety framework, demonstrated through two key achievements: 1) The successful development of HKGAI-V1 itself - which outperforms general-purpose models in handling Hong Kong-specific culturally sensitive queries, and embodies a “governance-embedded” approach to digital sovereignty - empowers Hong Kong to exercise control over AI applications in critical sectors including public services, legal systems, and education. 2) The development of the proprietary Adversarial HK Value Benchmark, a rigorous tool for evaluating model alignment with local ethical and legal standards under challenging conditions. By documenting these achievements, the paper provides not only a technological artifact but also a replicable blueprint for developing advanced, regionally focused AI systems deeply rooted in their local identities.

[47] Real-World Summarization: When Evaluation Reaches Its Limits

Patrícia Schmidtová, Ondřej Dušek, Saad Mahamood

Main category: cs.CL

TL;DR: Simpler metrics like word overlap correlate well with human judgments for evaluating faithfulness in LLM-generated hotel highlights, while LLMs prove unreliable for evaluation.

DetailsMotivation: To assess faithfulness of LLM-generated hotel highlights to input data and compare evaluation methods.

Method: Human evaluation campaigns with categorical error assessment and span-level annotation, comparing traditional metrics, trainable methods, and LLM-as-a-judge approaches.

Result: Word overlap metrics correlate well with human judgments (Spearman 0.63), outperforming complex methods on out-of-domain data. LLMs are unreliable evaluators due to annotation inconsistencies.

Conclusion: Simpler metrics are effective for faithfulness evaluation, while LLMs are unreliable for this task. Crowdsourced evaluations and non-checkable information pose risks.

Abstract: We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows incorrect and non-checkable information pose the greatest risks. We also highlight challenges in crowdsourced evaluations.
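
A toy illustration of correlating a word-overlap score with human ratings; both the scores and the ratings below are fabricated stand-ins for the paper's data:

```python
from scipy.stats import spearmanr

def word_overlap(summary: str, source: str) -> float:
    """Fraction of summary words that also appear in the source."""
    s, src = set(summary.lower().split()), set(source.lower().split())
    return len(s & src) / len(s) if s else 0.0

metric_scores = [0.9, 0.7, 0.4, 0.8, 0.2]   # word overlap per highlight (invented)
human_ratings = [5, 4, 2, 4, 1]             # faithfulness judgments (invented)
rho, p = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```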

[48] Fine-grained Stateful Knowledge Exploration: Effective and Efficient Graph Retrieval with Large Language Models

Dehao Tao, Congqi Wang, Feng Huang, Junhao Chen, Yongfeng Huang, Minghu Jiang

Main category: cs.CL

TL;DR: FiSKE introduces a fine-grained, stateful knowledge exploration method to improve LLM knowledge updates by decomposing questions and dynamically mapping clues to knowledge graphs, outperforming existing methods.

DetailsMotivation: Current methods for updating LLM knowledge with external bases like knowledge graphs suffer from granularity mismatches, leading to inefficiency and inaccuracy.

Method: FiSKE decomposes questions into fine-grained clues, uses adaptive mapping for ambiguity resolution, and employs a clue-driven termination mechanism.

Result: FiSKE outperforms existing methods in knowledge retrieval accuracy and reduces LLM invocations.

Conclusion: FiSKE effectively balances precision and efficiency in knowledge exploration for LLMs.

Abstract: Large Language Models (LLMs) have shown impressive capabilities, yet updating their knowledge remains a significant challenge, often leading to outdated or inaccurate responses. A proposed solution is the integration of external knowledge bases, such as knowledge graphs, with LLMs. Most existing methods use a paradigm that treats the whole question as the objective, with relevant knowledge being incrementally retrieved from the knowledge graph. However, this paradigm often leads to a granularity mismatch between the target question and the retrieved entities and relations. As a result, the information in the question cannot precisely correspond to the retrieved knowledge. This may cause redundant exploration or omission of vital knowledge, thereby leading to increased computational consumption and reduced retrieval accuracy. To address the limitations of coarse-grained knowledge exploration, we propose FiSKE, a novel paradigm for Fine-grained Stateful Knowledge Exploration. FiSKE first decomposes questions into fine-grained clues, then employs an adaptive mapping strategy during the knowledge exploration process to resolve ambiguity in clue-to-graph mappings. This strategy dynamically infers contextual correspondences while maintaining a stateful record of the mappings. A clue-driven termination mechanism ensures rigorous augmentation, leveraging fully mapped paths for LLMs while reverting to chain-of-thought reasoning when necessary. Our approach balances precision and efficiency. Experiments on multiple datasets revealed that our paradigm surpasses current advanced methods in knowledge retrieval while significantly reducing the average number of LLM invocations.
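
A toy sketch of clue-driven, stateful exploration over a knowledge graph; the triples, clues, and exact-match mapper are invented simplifications of FiSKE's adaptive mapping:

```python
# Stateful clue-to-graph mapping: each resolved clue advances the
# exploration; an unresolved clue triggers fallback to chain-of-thought.

kg = [("Marie Curie", "born_in", "Warsaw"),
      ("Warsaw", "capital_of", "Poland")]

clues = ["born_in", "capital_of"]   # fine-grained clues from the question
state: dict[str, tuple] = {}        # stateful record of resolved mappings
entity = "Marie Curie"
for clue in clues:
    match = next(((h, r, t) for h, r, t in kg if h == entity and r == clue), None)
    if match is None:
        break                       # unresolved: revert to chain-of-thought
    state[clue] = match
    entity = match[2]               # continue exploring from the new entity
print(state)  # both hops resolved; the fully mapped path can augment the LLM
```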

[49] GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh

Main category: cs.CL

TL;DR: GenARM introduces an autoregressive reward model for test-time alignment of LLMs, outperforming prior methods and matching training-time performance while enabling efficient weak-to-strong guidance and multi-objective alignment.

DetailsMotivation: Traditional alignment methods for LLMs are costly and inflexible, requiring repeated training for diverse preferences. Test-time methods using trajectory-level RMs are inefficient for autoregressive generation.

Method: GenARM uses an autoregressive reward model (ARM) to predict next-token rewards, enabling efficient alignment of frozen LLMs without retraining.

Result: GenARM outperforms prior test-time methods, matches training-time performance, and supports weak-to-strong guidance and multi-objective alignment.

Conclusion: GenARM provides a cost-effective, flexible solution for aligning LLMs with human preferences, addressing limitations of existing methods.

Abstract: Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model–a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
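
A single decoding step of reward-guided generation in this spirit: combine the frozen LLM's next-token distribution with next-token rewards from an autoregressive RM. Both tensors are random stand-ins, and alpha is an assumed KL-regularization weight:

```python
import torch

vocab = 50_000
llm_logits = torch.randn(vocab)    # stand-in for log pi_base(. | prefix)
arm_rewards = torch.randn(vocab)   # stand-in for r(. | prefix) from the ARM
alpha = 1.0                        # assumed KL-regularization strength

# Guided distribution: base log-probs shifted by per-token rewards.
guided = torch.log_softmax(llm_logits, dim=-1) + alpha * arm_rewards
next_token = torch.distributions.Categorical(logits=guided).sample()
print(next_token.item())
```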

[50] Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?

Aryan Sajith, Krishna Chaitanya Rao Kathala

Main category: cs.CL

TL;DR: The study shows that data quality is more critical than quantity for small language models (SLMs), with controlled duplication improving accuracy but excessive duplication harming performance.

DetailsMotivation: To understand the impact of data quality vs. quantity on SLMs and address the financial, computational, and environmental challenges of large-scale training.

Method: Used the TinyStories dataset, varying size (25%, 50%) and duplication (25%, 50%, 75%, 100%), and evaluated performance via validation loss, accuracy, and perplexity.

Result: Data quality matters more; minimal duplication boosts accuracy (+0.87% at 25%), but excessive duplication degrades performance (-40% at 100%).

Conclusion: Prioritizing data quality over quantity can democratize AI by making advanced models more accessible and sustainable, especially in resource-limited settings.

Abstract: This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analysis of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) was performed. Model performance was evaluated based on the validation loss, accuracy, and perplexity metrics. Results indicate training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication), but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond just model performance; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
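
A sketch of constructing duplication-controlled training sets like the 25-100% conditions; `stories` is a stand-in for TinyStories examples:

```python
import random

def with_duplication(stories: list[str], dup_rate: float, seed: int = 0) -> list[str]:
    """Append dup_rate * len(stories) duplicates sampled with replacement."""
    rng = random.Random(seed)
    n_dup = int(len(stories) * dup_rate)
    corpus = stories + rng.choices(stories, k=n_dup)
    rng.shuffle(corpus)
    return corpus

stories = [f"story {i}" for i in range(1000)]   # placeholder examples
print(len(with_duplication(stories, 0.25)))     # 1250 examples, 25% duplicated
```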

[51] AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning

Jiayu Li, Xuan Zhu, Fang Liu, Yanjun Qi

Main category: cs.CL

TL;DR: AIDE is a novel data synthesis framework that expands few seed data points for fine-tuning LLMs, ensuring diversity and relevance, outperforming existing methods.

DetailsMotivation: Challenges in obtaining diverse, high-quality training data for fine-tuning LLMs. Existing methods lack balance between task relevance and diversity.

Method: AIDE uses a multi-hop process guided by topic and key attributes from seeds, with residual connections to prevent irrelevant data.

Result: AIDE fine-tunes models such as Mistral-7B and Llama-3 variants from just 10 seeds; the resulting models surpass those fine-tuned on human-curated data and outperform Evol-Instruct by over 30%.

Conclusion: AIDE effectively addresses data scarcity and quality issues for LLM fine-tuning, offering a scalable solution.

Abstract: Fine-tuning large language models (LLMs) for specific tasks requires diverse, high-quality training data. However, obtaining sufficient relevant data remains a significant challenge. Existing data synthesis methods either depend on extensive seed datasets or struggle to balance task relevance and data diversity. To address these challenges, we propose Attribute-guided multI-hop Data Expansion (AIDE), a novel data synthesis framework that uses a multi-hop process to expand very few seed data points while ensuring data diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seeds to guide the synthesis steps. The process repeats for K hops, using the generated data as seeds. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism. Our empirical results show that AIDE enables fine-tuning of Mistral-7B, Llama-3.1-8B and Llama-3.2-3B from 10 seeds, surpassing the models fine-tuned on human curated data. Furthermore, AIDE outperforms state-of-the-art data synthesis methods, such as Evol-Instruct, by over 30% in task-specific fine-tuning. Code is available at https://github.com/Code4Graph/AIDE.
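
A schematic of the multi-hop loop with its residual connection, where `extract_attrs` and `generate` are hypothetical stand-ins for the LLM-backed steps described in the abstract:

```python
def aide_expand(seeds, extract_attrs, generate, hops=3):
    """Multi-hop data expansion sketch (all names are placeholders).

    extract_attrs(text) -> (topic, [key attributes])
    generate(topic, [attributes]) -> [new synthetic examples]
    """
    data = list(seeds)
    # Pair every frontier example with the attributes of its origin seed.
    frontier = [(s, extract_attrs(s)[1]) for s in seeds]
    for _ in range(hops):
        next_frontier = []
        for example, seed_attrs in frontier:
            topic, attrs = extract_attrs(example)
            # Residual connection: re-inject the seed attributes at every
            # hop so deeper hops do not drift into irrelevant data.
            for new in generate(topic, attrs + seed_attrs):
                next_frontier.append((new, seed_attrs))
                data.append(new)
        frontier = next_frontier
    return data

# Toy stand-ins to exercise the loop:
extract = lambda t: ("topic", [t[-1]])
gen = lambda topic, attrs: [f"{topic}:{'+'.join(attrs)}"]
print(len(aide_expand(["seed-a", "seed-b"], extract, gen, hops=2)))
```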

[52] Understanding the Dark Side of LLMs’ Intrinsic Self-Correction

Qingjie Zhang, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang, Ke Xu, Hewu Li, Yan Liu, Han Qiu

Main category: cs.CL

TL;DR: The paper investigates LLMs’ intrinsic self-correction, revealing its failures and biases, and proposes mitigation strategies.

DetailsMotivation: To understand why LLMs' intrinsic self-correction fails without oracle labels and to analyze its biases.

Method: Analyzed one simple and three complex tasks using SOTA LLMs (ChatGPT and Llama families) with three interpretation methods.

Result: Found intrinsic self-correction causes answer wavering, prompt bias, and cognitive bias. Proposed question repeating and supervised fine-tuning as solutions.

Conclusion: Intrinsic self-correction has limitations; simple strategies can mitigate its issues.

Abstract: Intrinsic self-correction was proposed to improve LLMs’ responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We identify that intrinsic self-correction can (1) cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.
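
The first mitigation is purely a prompting pattern; a minimal sketch, assuming a generic text-to-text `llm` callable:

```python
def self_correct_with_repeat(llm, question):
    """Intrinsic self-correction with question repeating: restate the
    original question inside the feedback prompt so the model does not
    drift toward the prompt's implicit 'you were wrong' bias (sketch)."""
    answer = llm(question)
    feedback = (
        f"The original question was: {question}\n"   # question repeating
        f"Your previous answer: {answer}\n"
        "Review your answer and correct it only if it is wrong."
    )
    return llm(feedback)

print(self_correct_with_repeat(lambda prompt: "42", "What is 6 x 7?"))
```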

[53] Plancraft: an evaluation dataset for planning with LLM agents

Gautier Dagan, Frank Keller, Alex Lascarides

Main category: cs.CL

TL;DR: Plancraft is a multi-modal dataset for evaluating LLM agents, featuring text-only and multi-modal interfaces based on Minecraft. It tests tool use, RAG, and decision-making, including unsolvable tasks. LLMs and VLMs struggle with its planning challenges.

DetailsMotivation: To create a benchmark for evaluating LLM agents' capabilities in multi-modal environments, tool use, and decision-making, including handling unsolvable tasks.

Method: Plancraft uses Minecraft’s crafting GUI, includes the Minecraft Wiki for RAG, and features a handcrafted planner and Oracle Retriever for ablation studies. It also includes unsolvable tasks to test decision-making.

Result: LLMs and VLMs perform poorly on Plancraft’s planning challenges compared to a handcrafted planner.

Conclusion: The study highlights limitations in current LLMs and VLMs for planning tasks and suggests improvements for future models.

Abstract: We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as a handcrafted planner and Oracle Retriever, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and compare their performance and efficiency to a handcrafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and offer suggestions on how to improve their capabilities.

[54] Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction

Alexei Figueroa, Justus Westerhoff, Golzar Atefi, Dennis Fast, Benjamin Winter, Felix Alexander Gers, Alexander Löser, Wolfgang Nejdl

Main category: cs.CL

TL;DR: Comply, a biologically inspired neural network, improves upon FlyVec by incorporating positional information with complex weights, achieving competitive performance with state-of-the-art models without extra parameters.

DetailsMotivation: To enhance the performance of FlyVec, a biologically inspired model for word embeddings, by incorporating positional information and improving computational efficiency.

Method: Introduces Comply, a single-layer neural network using complex weights to learn sequence representations, maintaining sparsity and interpretability.

Result: Comply outperforms FlyVec and matches larger state-of-the-art models without additional parameters, providing sparse and interpretable sentence representations.

Conclusion: Comply demonstrates that biologically inspired models can achieve high performance and efficiency in learning word embeddings, with added interpretability.

Abstract: Biologically inspired neural networks offer alternative avenues to model data distributions. FlyVec is a recent example that draws inspiration from the fruit fly’s olfactory circuit to tackle the task of learning word embeddings. Surprisingly, this model performs competitively even against deep learning approaches specifically designed to encode text, and it does so with the highest degree of computational efficiency. We pose the question of whether this performance can be improved further. For this, we introduce Comply. By incorporating positional information through complex weights, we enable a single-layer neural network to learn sequence representations. Our experiments show that Comply not only supersedes FlyVec but also performs on par with significantly larger state-of-the-art models. We achieve this without additional parameters. Comply yields sparse contextual representations of sentences that can be interpreted explicitly from the neuron weights.
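
The core idea, position entering a single-layer representation through the phase of complex weights, can be sketched in a few lines; the random magnitudes stand in for learned weights and the phase schedule is illustrative:

```python
import numpy as np

def complex_sequence_repr(token_ids, vocab_size, dim, rng=None):
    """Sum token vectors rotated by a position-dependent complex phase,
    so word order is encoded without any extra parameters (a sketch of
    the Comply idea, not the trained model)."""
    rng = rng or np.random.default_rng(0)
    emb = rng.standard_normal((vocab_size, dim))  # stand-in for weights
    n = len(token_ids)
    rep = np.zeros(dim, dtype=complex)
    for pos, tok in enumerate(token_ids):
        phase = np.exp(1j * np.pi * pos / max(n - 1, 1))
        rep += emb[tok] * phase  # position enters only through the phase
    return rep

print(np.round(complex_sequence_repr([3, 1, 4], vocab_size=10, dim=4), 3))
```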

[55] A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens

Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel

Main category: cs.CL

TL;DR: Proposes a special red flag token for LLMs to mark harmful content without compromising model capabilities, using LoRA for safety tuning.

DetailsMotivation: Current safety training methods for LLMs cause drastic distribution shifts, reducing model utility. The goal is to maintain capabilities while improving safety.

Method: Introduces a red flag token to mark harmful content during generation, trained with minimal distribution shift. Uses LoRA for safety tuning to defend against fine-tuning API attacks.

Result: Enables explicit learning of harmfulness, maintains model utility, and provides robustness comparable to adversarial training without runtime attacks.

Conclusion: The red flag token and LoRA-based safety tuning offer a balanced approach to LLM safety without sacrificing performance.

Abstract: Most safety training methods for large language models (LLMs) are based on fine-tuning that forces models to shift from an unsafe answer to refusal when faced with harmful requests. Unfortunately, these drastic distribution shifts generally compromise model capabilities. To avoid that, we propose to expand the model’s vocabulary with a special token we call the red flag token, and propose to train the model to insert this token into its response at any time when harmful content is generated or about to be generated. Our approach offers several advantages: it enables the model to explicitly learn the concept of harmfulness while marginally affecting the generated distribution, thus maintaining the model’s utility. It also evaluates each generated answer and provides robustness as good as adversarial training without the need to run attacks during training. Moreover, by encapsulating our safety tuning in a LoRA module, we provide additional defenses against fine-tuning API attacks.
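
At inference time the mechanism reduces to watching the probability of one special token; a minimal monitoring sketch, where the token id and threshold are hypothetical:

```python
def generate_with_red_flag(step_probs, red_flag_id, threshold=0.5):
    """Stop generation as soon as the special red-flag token becomes
    likely, signalling harmful content (illustrative values only).

    `step_probs` yields one {token_id: probability} dict per step.
    """
    tokens = []
    for probs in step_probs:
        if probs.get(red_flag_id, 0.0) >= threshold:
            return tokens, True   # flagged: harmful content (imminent)
        tokens.append(max(probs, key=probs.get))
    return tokens, False

steps = [{0: 0.9, 99: 0.0}, {1: 0.4, 99: 0.6}]  # 99 = red-flag id (toy)
print(generate_with_red_flag(steps, red_flag_id=99))  # ([0], True)
```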

[56] Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg

Main category: cs.CL

TL;DR: The paper explores geometric similarities in token embeddings of large language models, introduces methods to analyze local geometry, and presents EMB2EMB for transforming steering vectors between models.

DetailsMotivation: To understand and leverage common geometric structures in token embeddings across large language models.

Method: Analyzes global and local similarities using relative orientations, Locally Linear Embeddings, and intrinsic dimension measures. Introduces EMB2EMB for transforming steering vectors.

Result: Token embeddings share geometric similarities, lie on lower-dimensional manifolds, and semantically coherent clusters correlate with intrinsic dimension. EMB2EMB enables cross-model transformations.

Conclusion: Geometric similarities in embeddings are exploitable, and EMB2EMB provides a practical tool for transferring steering vectors between models.

Abstract: Researchers have recently suggested that models share common representations. In our work, we find numerous geometric similarities across the token embeddings of large language models. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each embedding. Both characterizations allow us to find local similarities across token embeddings. Additionally, our intrinsic dimension demonstrates that embeddings lie on a lower dimensional manifold, and that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Based on our findings, we introduce EMB2EMB, a simple application to linearly transform steering vectors from one language model to another, despite the two models having different dimensions.
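
Because the paper describes EMB2EMB as a linear transform between embedding spaces, a least-squares fit over aligned tokens captures the essential computation; the synthetic data below is only to show the shapes:

```python
import numpy as np

def fit_emb2emb(E_src, E_tgt):
    """Fit W so that E_src @ W ~= E_tgt over aligned (shared) tokens.
    A least-squares sketch of the EMB2EMB idea; the paper's exact
    fitting procedure may differ."""
    W, *_ = np.linalg.lstsq(E_src, E_tgt, rcond=None)
    return W

rng = np.random.default_rng(0)
E_src = rng.standard_normal((1000, 64))         # model A token embeddings
E_tgt = E_src @ rng.standard_normal((64, 128))  # model B (toy, exact map)
W = fit_emb2emb(E_src, E_tgt)

steer_a = rng.standard_normal(64)  # steering vector found in model A
steer_b = steer_a @ W              # transported into model B's space
print(steer_b.shape)               # (128,) despite mismatched dimensions
```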

[57] Style over Substance: Distilled Language Models Reason Via Stylistic Replication

Philip Lippmann, Jie Yang

Main category: cs.CL

TL;DR: The paper explores how distilled reasoning models rely on surface-level stylistic patterns in reasoning traces, showing comparable performance even with altered synthetic traces.

DetailsMotivation: To understand the extent to which distilled models internalize stylistic patterns in reasoning traces and how these patterns influence their reasoning capabilities.

Method: Systematic analysis of reasoning traces, creation of two datasets (emergent and synthetic), and evaluation of models trained on synthetic traces.

Result: Models trained on synthetic traces perform comparably, and performance improves even with altered (incorrect) traces, indicating reliance on stylistic patterns.

Conclusion: Stylistic patterns in reasoning traces can efficiently enhance reasoning capabilities across model families.

Abstract: Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets – a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns – to precisely examine their influence on distilled models’ reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.

[58] Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Pedro Ferreira, Wilker Aziz, Ivan Titov

Main category: cs.CL

TL;DR: Preference optimization in LLMs can reduce explanation faithfulness due to reward model conflicts. A solution using causal attribution improves explanation accuracy.

DetailsMotivation: To address the issue of reduced faithfulness in chain-of-thought explanations caused by preference optimization in LLMs.

Method: Enrich the reward model’s input with causal attribution to detect discrepancies between explanations and decision processes.

Result: The proposed approach reduces misleading explanations in controlled settings.

Conclusion: Causal attribution enhances the faithfulness of LLM explanations by aligning them with internal reasoning.

Abstract: Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization - a key step in the alignment phase - can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model’s internal decision process and the generated explanation. Consequently, the LLM may engage in “reward hacking” by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM’s input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model’s decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.
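
The abstract does not fix a particular attribution method, so the sketch below uses leave-one-out token importance as a simple stand-in for the causal attribution that gets appended to the reward model's input; every name here is illustrative:

```python
def loo_attribution(predict, tokens):
    """Leave-one-out importance: how much the answer's score drops when
    each input token is removed (a stand-in for causal attribution).

    predict(tokens) -> scalar score of the model's chosen answer.
    """
    base = predict(tokens)
    return [(tok, base - predict(tokens[:i] + tokens[i + 1:]))
            for i, tok in enumerate(tokens)]

def rm_input(question, explanation, attribution):
    # The RM sees the explanation *and* which inputs actually drove the
    # prediction, so fabricated rationales become detectable.
    attr = ", ".join(f"{t}:{a:+.2f}" for t, a in attribution)
    return f"Q: {question}\nExplanation: {explanation}\nAttribution: {attr}"

tokens = ["the", "movie", "was", "great"]
predict = lambda toks: 1.0 if "great" in toks else 0.2  # toy scorer
attribution = loo_attribution(predict, tokens)
# An explanation citing "the plot" would contradict this attribution.
print(rm_input("Is the review positive?", "I focused on the plot.", attribution))
```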

[59] SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Xuanjing Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei

Main category: cs.CL

TL;DR: SocioVerse, an LLM-agent-driven world model, addresses alignment challenges in social simulation by leveraging four alignment components and a 10M-user pool, validated in politics, news, and economics.

DetailsMotivation: Social simulation, enhanced by LLMs, struggles with alignment issues in environments, users, interactions, and behaviors.

Method: Introduces SocioVerse with four alignment components and a 10M-user pool, tested in large-scale simulations across politics, news, and economics.

Result: SocioVerse successfully reflects population dynamics, ensuring diversity, credibility, and representativeness with minimal manual adjustments.

Conclusion: SocioVerse offers a robust solution for social simulation, addressing alignment challenges and demonstrating scalability and reliability.

Abstract: Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.

[60] Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions

Minwoo Kang, Suhong Moon, Seung Hyeong Lee, Ayush Raj, Joseph Suh, David M. Chan

Main category: cs.CL

TL;DR: LLMs simulate human behavior for surveys, but it’s unclear if their responses reflect in-group or out-group perspectives. A new method creates detailed virtual personas with backstories, improving response accuracy and enabling broader human studies.

DetailsMotivation: To determine if LLMs provide deep (in-group) or shallow (out-group) responses in surveys and enhance their applicability in political science studies.

Method: Proposes a novel methodology for creating virtual personas with detailed, consistent synthetic backstories to test LLM responses.

Result: Virtual personas with backstories improve human response replication by up to 87% and match effect sizes of in-group/out-group biases.

Conclusion: The method extends LLM use beyond estimating socially understood responses, enabling broader human studies.

Abstract: Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses to various surveys and polls. However, the questions in these surveys usually reflect socially understood attitudes: the patterns of attitudes of old/young, liberal/conservative, as understood by both members and non-members of those groups. It is not clear whether the LLM binding is deep, meaning the LLM answers as a member of a particular in-group would, or shallow, meaning the LLM responds as an out-group member believes an in-group member would. To explore this difference, we use questions that expose known in-group/out-group biases. This level of fidelity is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies of in-group/out-group biases. Altogether, our work extends the applicability of LLMs beyond estimating socially understood responses, enabling their use in a broader range of human studies.

[61] Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media

Muhammad Ahmad, Fida Ullah, Muhammad Usman, Umyh Habiba, Ildar Batyrshin, Grigori Sidorov

Main category: cs.CL

TL;DR: An AI-driven NLP framework using social media data achieves high accuracy in detecting drug use and overdose symptoms, outperforming baseline models.

DetailsMotivation: Addressing the global health issue of drug overdose by leveraging real-time social media insights, overcoming limitations of traditional research methods.

Method: Hybrid annotation strategy with LLMs and human annotators, employing traditional ML, neural networks, and transformer-based models.

Result: Achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baselines by up to 8%.

Conclusion: Demonstrates AI’s potential for enhancing public health surveillance and personalized interventions.

Abstract: Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.
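
As a concrete, much-simplified baseline for the multi-label half of the task, here is a TF-IDF plus one-vs-rest logistic regression sketch; the paper's transformer-based models and annotation pipeline are far richer, and the posts and labels below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

posts = ["took too many oxys, can't stay awake",
         "mixing xanax and alcohol again",
         "clean for 30 days now"]
labels = [["opioid", "drowsiness"], ["benzo", "alcohol"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per drug/symptom label
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LogisticRegression(max_iter=1000)))
clf.fit(posts, Y)
print(mlb.inverse_transform(clf.predict(["fell asleep after oxys"])))
```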

[62] Block Circulant Adapter for Large Language Models

Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang

Main category: cs.CL

TL;DR: A block circulant matrix-based fine-tuning method for LLMs reduces storage and computation costs significantly while maintaining performance.

DetailsMotivation: Fine-tuning large language models is challenging due to their size, prompting the need for cost-effective methods.

Method: Uses block circulant matrices and one-dimensional Fourier transforms to optimize storage and computation.

Result: Achieves 14x fewer parameters than VeRA, 16x smaller than LoRA, and 32x fewer FLOPs than FourierFT with comparable performance.

Conclusion: The method offers an efficient frequency-domain approach for fine-tuning large models on downstream tasks.

Abstract: Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses $14\times$ less number of parameters than VeRA, $16\times$ smaller than LoRA and $32\times$ less FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising way in frequency domain to fine-tune large models on downstream tasks.
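
The efficiency claim rests on a classical identity: a circulant matrix acts on a vector as a circular convolution, so the product costs O(n log n) via the FFT and only the first column needs to be stored. A self-contained check:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column `c` by `x` using
    FFTs: O(n log n) time and O(n) storage instead of O(n^2)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
rng = np.random.default_rng(0)
c, x = rng.standard_normal(n), rng.standard_normal(n)

# Dense check: column j of the circulant matrix is c rolled by j.
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)
assert np.allclose(C @ x, circulant_matvec(c, x))
print("FFT circulant product matches the dense product")
```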

[63] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He

Main category: cs.CL

TL;DR: The paper explores merging Vision-Language Models (VLMs) and Large Language Models (LLMs) to combine perception and reasoning, revealing insights into their internal mechanisms.

DetailsMotivation: To understand how perception and reasoning can be combined in VLMs and LLMs, and to explore model merging as a tool for multimodal integration.

Method: Proposes merging models across modalities (VLMs and LLMs) to transfer reasoning abilities without training, and analyzes the internal mechanisms post-merging.

Result: Merging successfully transfers reasoning from LLMs to VLMs; perception is encoded in early layers, while reasoning involves middle-to-late layers. Post-merging, all layers contribute to reasoning.

Conclusion: Model merging is a promising tool for multimodal integration and understanding the interplay between perception and reasoning in VLMs.

Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
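
A common baseline for the merging operation itself is linear interpolation of shared parameters; the sketch below shows that baseline on toy weights, while the paper's cross-modal recipe is more involved:

```python
import numpy as np

def merge_state_dicts(sd_a, sd_b, alpha=0.5):
    """Interpolate parameters shared by two models; parameters present
    in only one model (or with mismatched shapes) are kept from sd_a.
    A merging baseline, not the paper's exact method."""
    merged = dict(sd_a)
    for name, w_b in sd_b.items():
        if name in merged and merged[name].shape == w_b.shape:
            merged[name] = (1 - alpha) * merged[name] + alpha * w_b
    return merged

sd_vlm = {"layers.0.w": np.ones((2, 2))}    # perception-tuned weights
sd_llm = {"layers.0.w": np.zeros((2, 2))}   # reasoning-tuned weights
print(merge_state_dicts(sd_vlm, sd_llm, alpha=0.5)["layers.0.w"])
```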

[64] Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models

Jugal Gajjar, Kaustik Ranaware

Main category: cs.CL

TL;DR: The paper presents a multimodal sentiment analysis model using BERT-based encoders and early fusion, achieving high accuracy and F1-score on the CMU-MOSEI dataset.

DetailsMotivation: To leverage transformer architectures for integrating text, audio, and visual modalities in sentiment analysis, demonstrating the effectiveness of early fusion.

Method: BERT-based encoders for each modality, concatenated embeddings, Adam optimization (lr=1e-4), dropout (0.3), and early stopping.

Result: 97.87% 7-class accuracy, 0.9682 F1-score, and low MAE (0.1060) on the test set.

Conclusion: Early fusion with transformers effectively captures cross-modal interactions for sentiment analysis, with potential for future work on fusion strategies and interpretability.

Abstract: This project performs multimodal sentiment analysis on the CMU-MOSEI dataset, using transformer-based models with early fusion to integrate text, audio, and visual modalities. We employ BERT-based encoders for each modality, extracting embeddings that are concatenated before classification. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, demonstrating the effectiveness of early fusion in capturing cross-modal interactions. The training utilized Adam optimization (lr=1e-4), dropout (0.3), and early stopping to ensure generalization and robustness. Results highlight the superiority of transformer architectures in modeling multimodal sentiment, with a low MAE (0.1060) indicating precise sentiment intensity prediction. Future work may compare fusion strategies or enhance interpretability. This approach utilizes multimodal learning by effectively combining linguistic, acoustic, and visual cues for sentiment analysis.
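
Early fusion here simply means concatenating the per-modality embeddings before the classification head; a shape-level sketch with random vectors standing in for the BERT-based encoder outputs:

```python
import numpy as np

def early_fusion(text_emb, audio_emb, visual_emb):
    """Concatenate modality embeddings into a single vector prior to
    classification (the defining step of early fusion)."""
    return np.concatenate([text_emb, audio_emb, visual_emb])

rng = np.random.default_rng(0)
fused = early_fusion(rng.standard_normal(768),   # text encoder output
                     rng.standard_normal(768),   # audio encoder output
                     rng.standard_normal(768))   # visual encoder output
W = rng.standard_normal((7, fused.size)) * 0.01  # 7 sentiment classes
print("predicted class:", int(np.argmax(W @ fused)))
```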

[65] FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy

Main category: cs.CL

TL;DR: FalseReject is a resource to reduce over-refusal of benign queries in LLMs by providing structured responses and adversarial prompts, improving utility without compromising safety.

DetailsMotivation: Address the issue of LLMs over-refusing benign queries, which reduces their utility in sensitive scenarios.

Method: Introduce FalseReject, a dataset with 16k toxic queries and structured responses, and a graph-informed adversarial multi-agent framework for diverse prompts.

Result: Supervised finetuning with FalseReject reduces unnecessary refusals while maintaining safety and language capabilities.

Conclusion: FalseReject effectively mitigates over-refusal in LLMs without sacrificing safety or performance.

Abstract: Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

[66] Is Compression Really Linear with Code Intelligence?

Shijie Xuyang, Xianzhen Luo, Tianhao Cheng, Zheng Chu, Houyi Li, Ziqi Wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che

Main category: cs.CL

TL;DR: The paper explores the relationship between data compression and Code LLMs, introducing Format Annealing for fair evaluation and revealing a logarithmic, not linear, relationship between code intelligence and compression.

DetailsMotivation: To understand the nuanced relationship between data compression and Code LLMs, addressing gaps in prior work that assumed linearity and lacked fair evaluation methods.

Method: Evaluated diverse Code LLMs on multi-language, multi-task benchmarks using Format Annealing for fair assessment and measured compression via bits-per-character (BPC) on a novel GitHub-derived validation set.

Result: Empirical results show a logarithmic relationship between code intelligence and BPC, refining prior linear assumptions.

Conclusion: The work provides deeper insights into compression’s role in code intelligence and offers a robust evaluation framework for future research.

Abstract: Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs’ code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve’s tail under specific, limited conditions. Our work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework in the code domain.
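
Two pieces of the evaluation are easy to make concrete: bits-per-character from a model's total negative log-likelihood, and the logarithmic fit between benchmark score and BPC. The numbers below are invented purely to exercise the fit:

```python
import numpy as np

def bits_per_character(total_nll_nats, n_chars):
    """Compression efficacy: corpus NLL (in nats) converted to bits,
    normalized per character."""
    return total_nll_nats / np.log(2) / n_chars

print(round(bits_per_character(2.0e6, 1.0e6), 3))  # 2 nats/char ~ 2.885 BPC

# Toy fit of the reported logarithmic relation: score ~ a*log(BPC) + b.
bpc = np.array([0.55, 0.70, 0.90, 1.20, 1.60])
score = np.array([62.0, 55.0, 47.0, 38.0, 29.0])  # made-up benchmark scores
a, b = np.polyfit(np.log(bpc), score, deg=1)
print(f"score ~= {a:.1f} * log(BPC) + {b:.1f}")
```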

[67] Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs

Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara

Main category: cs.CL

TL;DR: The paper introduces two benchmarks, KnowRecall and VisRecall, to evaluate cross-lingual consistency in multimodal large language models (MLLMs), revealing their struggles with multilingual and cultural knowledge.

DetailsMotivation: Addressing the challenge of achieving consistent performance in MLLMs across languages and cultural contexts.

Method: Development of two benchmarks: KnowRecall (visual question answering for factual knowledge) and VisRecall (visual memory consistency).

Result: State-of-the-art MLLMs, including proprietary ones, fail to achieve cross-lingual consistency.

Conclusion: Highlights the need for more robust approaches to develop truly multilingual and culturally aware MLLMs.

Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

[68] Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion

Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: The paper introduces refined compression metrics for language models by analyzing geometric distortion, addressing the issue of anisotropic word representations caused by high compression.

DetailsMotivation: The study aims to improve the interpretation of language model intelligence by addressing the problem of anisotropic word representations in highly compressed models.

Method: The authors propose three refined compression metrics incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline.

Result: The refined metrics show strong alignment with language model capabilities, achieving Spearman correlation coefficients above 0.9.

Conclusion: Incorporating geometric distortion into compression metrics enhances the informatics interpretation of language models.

Abstract: Recently, the concept of "compression as intelligence" has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially "Compression Hacking" in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM’s comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that accounting for compression hacking, by incorporating the geometric distortion of representations, substantially enhances the informatics interpretation of LMs.
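
The geometric failure mode is easy to quantify with a standard anisotropy proxy, the mean cosine similarity between random embedding pairs; this is a simplification, not the paper's refined metrics:

```python
import numpy as np

def anisotropy(emb, n_pairs=10_000, seed=0):
    """Mean cosine similarity over random embedding pairs: near 0 means
    isotropic (directionally uniform), near 1 means highly anisotropic."""
    rng = np.random.default_rng(seed)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    i = rng.integers(0, len(emb), n_pairs)
    j = rng.integers(0, len(emb), n_pairs)
    return float(np.mean(np.sum(unit[i] * unit[j], axis=1)))

rng = np.random.default_rng(1)
iso = rng.standard_normal((5000, 64))  # isotropic cloud
aniso = iso + 5.0                      # shared offset collapses directions
print(round(anisotropy(iso), 3), round(anisotropy(aniso), 3))
```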

[69] Gaussian mixture models as a proxy for interacting language models

Edward L. Wang, Tianyu Wang, Hayden Helm, Avanti Athreya, Vince Lyzinski, Carey E. Priebe

Main category: cs.CL

TL;DR: The paper proposes interacting Gaussian mixture models (GMMs) as a simpler alternative to large language models (LLMs) for studying human behavior in social sciences, highlighting their ability to mimic LLM dynamics.

DetailsMotivation: LLMs are powerful but computationally expensive, motivating the search for simpler alternatives like GMMs for studying human behavior in large-scale experiments.

Method: The authors compare a simplified GMM model to experimental simulations of LLMs, focusing on dynamics and feedback interactions.

Result: Interacting GMMs capture key features of LLM dynamics, with similarities and differences identified between the two models.

Conclusion: GMMs offer benefits over LLMs, with potential modifications and future research directions discussed.

Abstract: Large language models (LLMs) are a powerful tool with the ability to match human capabilities and behavior in many settings. Retrieval-augmented generation (RAG) further allows LLMs to generate diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study human behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as an alternative to similar frameworks using LLMs. We compare a simplified model of GMMs to select experimental simulations of LLMs whose updating and response depend on feedback from other LLMs. We find that interacting GMMs capture important features of the dynamics in interacting LLMs, and we investigate key similarities and differences between interacting LLMs and GMMs. We conclude by discussing the benefits of Gaussian mixture models, potential modifications, and future research directions.
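
The dynamics in question can be caricatured in a few lines: two mixture models whose component means drift toward samples drawn from the other. This toy version (all constants invented) only illustrates the feedback loop the paper studies:

```python
import numpy as np

def interact(means_a, means_b, steps=50, lr=0.1, seed=0):
    """Two toy Gaussian mixtures update their component means toward
    samples from each other, mimicking LLMs responding to one another."""
    rng = np.random.default_rng(seed)
    a, b = means_a.copy(), means_b.copy()
    for _ in range(steps):
        sample_b = b[rng.integers(len(b))] + rng.standard_normal(b.shape[1])
        sample_a = a[rng.integers(len(a))] + rng.standard_normal(a.shape[1])
        a += lr * (sample_b - a)   # model A moves toward B's response
        b += lr * (sample_a - b)   # model B moves toward A's response
    return a, b

a, b = interact(np.array([[-5.0, 0.0], [-4.0, 1.0]]),
                np.array([[5.0, 0.0], [4.0, -1.0]]))
print(np.round(a, 2), np.round(b, 2))  # means drift toward consensus
```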

[70] Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Main category: cs.CL

TL;DR: Critique-GRPO integrates natural language and numerical feedback to enhance RL-finetuned LLMs, outperforming traditional methods in reasoning tasks.

DetailsMotivation: Address performance plateaus, limited self-reflection, and persistent failures in RL with numerical feedback.

Method: Propose Critique-GRPO, an online RL framework combining natural language critiques and numerical feedback, with a shaping function for policy optimization.

Result: Improves pass@1 scores by ~4.4% and 3.8% on Qwen models, with notable gains in self-improvement and weak-to-strong generalization.

Conclusion: Critique-GRPO effectively leverages feedback for better performance and generalization in LLMs.

Abstract: Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.
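
The shaping function is described only at the level of "amplify correct, especially unfamiliar, refinements and penalize incorrect ones"; one plausible reading, with entirely illustrative constants:

```python
def shape_advantage(adv, correct, familiarity, beta=0.5, gamma=1.0):
    """Shaping-function sketch: boost the advantage of correct
    refinements, more when they are unfamiliar (familiarity in [0, 1]),
    and penalize incorrect ones. Not the paper's actual constants."""
    if correct:
        return adv * (1.0 + beta * (1.0 - familiarity))
    return adv - gamma

print(shape_advantage(1.0, correct=True, familiarity=0.2))   # 1.4
print(shape_advantage(1.0, correct=False, familiarity=0.2))  # 0.0
```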

[71] A quantum semantic framework for natural language processing

Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar

Main category: cs.CL

TL;DR: The paper explores semantic degeneracy in language, showing it limits LLMs due to combinatorial ambiguity. It uses Kolmogorov complexity and quantum-like logic, with experiments violating classical bounds, suggesting Bayesian approaches for meaning.

DetailsMotivation: To understand how semantic degeneracy in language fundamentally limits LLMs and NLP systems, and to explore non-classical interpretations of meaning.

Method: Uses Kolmogorov complexity to analyze ambiguity, conducts a semantic Bell inequality test with LLMs, and compares results to classical boundaries.

Result: Experiments showed non-classical contextuality (CHSH values up to 2.8), violating classical limits, aligning with human cognition findings.

Conclusion: Classical approaches are inadequate; Bayesian repeated sampling better captures linguistic meaning in context.

Abstract: Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. In this work, we argue this property imposes fundamental limitations on Large Language Models (LLMs) and other modern NLP systems, precisely because they operate within natural language itself. Using Kolmogorov complexity, we demonstrate that as an expression’s complexity grows, the amount of contextual information required to reliably resolve its ambiguity explodes combinatorially. The computational intractability of recovering a single intended meaning for complex or ambiguous text therefore suggests that the classical view that linguistic forms possess intrinsic meaning in and of themselves is conceptually inadequate. We argue instead that meaning is dynamically actualized through an observer-dependent interpretive act, a process whose non-deterministic nature is most appropriately described by a non-classical, quantum-like logic. To test this hypothesis, we conducted a semantic Bell inequality test using diverse LLM agents. Our experiments yielded average CHSH expectation values from 1.2 to 2.8, with several runs producing values (e.g., 2.3-2.4) in significant violation of the classical boundary ($|S|\leq2$), demonstrating that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
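
For readers unfamiliar with the statistic being violated, the CHSH value is a simple combination of four correlations between paired ±1 outcomes; the numbers below are illustrative, not the paper's data:

```python
import numpy as np

def correlation(x, y):
    """E for one setting pair, from paired +/-1 outcomes (e.g., two LLM
    'agents' giving binary interpretations under fixed contexts)."""
    return float(np.mean(np.asarray(x) * np.asarray(y)))

def chsh_S(E_ab, E_ab2, E_a2b, E_a2b2):
    """CHSH statistic S = E(a,b) - E(a,b') + E(a',b) + E(a',b').
    Classical (non-contextual) models satisfy |S| <= 2; the quantum
    bound is 2*sqrt(2) ~ 2.83."""
    return E_ab - E_ab2 + E_a2b + E_a2b2

print(correlation([1, -1, 1, 1], [1, -1, 1, -1]))  # 0.5
print(chsh_S(0.7, -0.7, 0.7, 0.7))                 # 2.8 > 2: non-classical
```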

[72] ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge

Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze

Main category: cs.CL

TL;DR: ImpliRet is a benchmark for evaluating retrieval systems where relevance depends on implicit document-side reasoning, not just query complexity. Current retrievers perform poorly, with the best nDCG@10 at 14.91%, and even long-context models like GPT-o4-mini score only 55.54%.

DetailsMotivation: To assess retrieval systems beyond shallow signals like keyword overlap by focusing on implicit reasoning in documents.

Method: Introduces ImpliRet, a benchmark with simple queries but relevance tied to implicit facts in documents (temporal, arithmetic, world knowledge). Evaluates sparse/dense retrievers and long-context models.

Result: Best nDCG@10 is 14.91%; GPT-o4-mini scores 55.54% with short context, showing document-side reasoning is challenging.

Conclusion: Document-side reasoning remains a significant challenge for retrieval systems, even for advanced models.

Abstract: Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: The queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our codes are available at: github.com/ZeinabTaghavi/IMPLIRET
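
Since nDCG@10 is the headline metric, here is the standard computation for a single query, showing how one relevant document ranked mid-list yields a low score:

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query: DCG of the system ranking divided by the
    DCG of the ideal (descending-relevance) ranking."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# The single relevant document ranked 5th out of 10 retrieved:
print(round(ndcg_at_k([0, 0, 0, 0, 1, 0, 0, 0, 0, 0]), 3))  # 0.387
```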

[73] Evaluating Multimodal Large Language Models on Educational Textbook Question Answering

Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal

Main category: cs.CL

TL;DR: MLLMs like LLaVA-1.5 and LLaMA 3.2-Vision are tested on textbook QA. Retrieval-augmented context helps LLaVA but harms LLaMA 3.2-Vision, termed ‘catastrophic context interference.’ Fine-tuning reveals architectural differences, with LLaMA improving and LLaVA declining.

DetailsMotivation: Evaluate MLLMs' reasoning on complex educational materials, focusing on textbook QA, to understand their limitations and potential in education.

Method: Use a multimodal RAG pipeline to provide context (lesson paragraphs and diagrams) and test zero-shot and fine-tuned performance on the CK12-QA dataset.

Result: Retrieved context boosts LLaVA’s text-based QA but drastically reduces LLaMA 3.2-Vision’s diagram-based accuracy (74.07% to 25.93%). Fine-tuning improves LLaMA (71.16%) but worsens LLaVA.

Conclusion: MLLMs struggle with modality prioritization and context integration, highlighting challenges for AI in education and suggesting future research directions.

Abstract: Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off: while retrieved context improves LLaVA’s performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. We term this statistically significant phenomenon “catastrophic context interference.” Furthermore, fine-tuning highlights architectural differences: LLaMA 3.2-Vision’s performance improves to 71.16% on the test set, demonstrating its capacity to learn multimodal integration, whereas LLaVA’s performance declines, indicating challenges with generalization. Our results underscore the challenges MLLMs face in modality prioritization and context integration, providing a benchmark and pointing to key directions for developing more robust AI-driven educational tools.

[74] Jan-nano Technical Report

Alan Dao, Dinh Bach Vu

Main category: cs.CL

TL;DR: Jan-nano, a 4B parameter model, achieves high efficiency by specializing in instant information retrieval, outperforming larger models with 83.2% on SimpleQA benchmark.

DetailsMotivation: Address the tradeoff between model capability and computational resources by focusing on specialization rather than scale.

Method: Fine-tuned from Qwen3-4B using multi-stage Reinforcement Learning with Verifiable Rewards (RLVR), eliminating next token prediction training.

Result: Achieves 83.2% on SimpleQA benchmark with 128K context length, running on consumer hardware.

Conclusion: Intelligence is about strategic specialization, not just scale, as demonstrated by Jan-nano’s efficiency.

Abstract: Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage Reinforcement Learning with Verifiable Rewards (RLVR) system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn’t about scale, it’s about strategy.

[75] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren

Main category: cs.CL

TL;DR: ContextCache introduces a context-aware semantic caching system for multi-turn dialogues, improving precision and recall while reducing latency.

DetailsMotivation: Existing semantic caching systems lack awareness of multi-turn dialogue contexts, leading to incorrect cache hits.

Method: Uses a two-stage retrieval architecture with vector-based retrieval and self-attention mechanisms for contextual matching.

Result: Improves precision and recall, with cached responses showing ~10x lower latency than direct LLM invocation.

Conclusion: ContextCache enables significant computational cost reductions for LLM conversational applications.

Abstract: Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
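
The two-stage lookup can be sketched compactly; here mean-pooled cosine similarity stands in for the paper's self-attention contextual matcher, and the thresholds are illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def cache_lookup(query_vec, history_vec, cache, q_thresh=0.9, c_thresh=0.8):
    """Stage 1: vector retrieval on the current query. Stage 2: check the
    dialogue context so similar queries from different conversations do
    not produce false cache hits (simplified sketch)."""
    best, best_sim = None, -1.0
    for entry in cache:
        sim = cosine(query_vec, entry["query_vec"])
        if sim >= q_thresh and sim > best_sim:
            best, best_sim = entry, sim
    if best and cosine(history_vec, best["history_vec"]) >= c_thresh:
        return best["response"]
    return None  # cache miss: fall through to the LLM

rng = np.random.default_rng(0)
q, h = rng.standard_normal(8), rng.standard_normal(8)
cache = [{"query_vec": q, "history_vec": h, "response": "cached answer"}]
print(cache_lookup(q, h, cache))   # same context  -> hit
print(cache_lookup(q, -h, cache))  # other context -> None
```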

[76] Stylometry recognizes human and LLM-generated texts in short samples

Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab

Main category: cs.CL

TL;DR: The paper uses stylometry to differentiate between LLM-generated and human-written texts, achieving high accuracy with tree-based models and stylometric features.

DetailsMotivation: Addressing issues of model attribution, intellectual property, and ethical AI use by identifying emergent writing patterns of LLMs.

Method: Created a benchmark dataset with human and LLM-generated texts, applied tree-based models (decision trees, LightGBM) using stylometric features (lexical, grammatical, syntactic, punctuation).

Result: Achieved up to .87 Matthews correlation coefficient in the multiclass scenario and .79 to 1.0 accuracy in binary classification, with Wikipedia vs. GPT-4 reaching .98 accuracy.

Conclusion: Stylometry effectively distinguishes machine-generated from human-written texts for well-defined text types, even with sophisticated LLMs.

Abstract: The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1.0 in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of the increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
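
A toy version of the pipeline, with a handful of hand-rolled features standing in for StyloMetrix and the n-gram pipeline, and a plain decision tree as in the paper's model family; the two training texts and labels are invented:

```python
from sklearn.tree import DecisionTreeClassifier

def stylo_features(text):
    """Minimal stylometric features (mean word length, comma rate,
    type-token ratio, capitalization rate); stand-ins only."""
    words = text.split()
    n = max(len(words), 1)
    return [sum(len(w) for w in words) / n,
            text.count(",") / n,
            len(set(words)) / n,
            sum(w[0].isupper() for w in words) / n]

texts = ["The cat sat, quietly observing the garden.",
         "Overall, the system demonstrates robust performance, ensuring "
         "reliability, scalability, and efficiency."]
labels = ["human", "llm"]  # toy labels for illustration only
clf = DecisionTreeClassifier(random_state=0)
clf.fit([stylo_features(t) for t in texts], labels)
print(clf.predict([stylo_features("The dog barked, watching the yard.")]))
```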

[77] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman

Main category: cs.CL

TL;DR: GDC Cohort Copilot is an open-source tool that uses natural language to create and refine cancer genomics cohorts in the GDC, outperforming GPT-4o with a locally-served LLM.

DetailsMotivation: Users struggle to navigate complex cohort descriptors in the GDC, so a natural language-based solution is introduced to simplify the process.

Method: Developed a copilot tool using large language models (LLMs) to convert natural language descriptions into GDC cohort filters, with an interactive UI for refinement.

Result: The locally-served GDC Cohort LLM outperforms GPT-4o in generating accurate cohorts.

Conclusion: GDC Cohort Copilot successfully bridges the gap between natural language and complex cohort creation, with open-source availability for broader use.

Abstract: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
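
For context, GDC cohorts are defined by nested JSON filters over case and file fields. A hand-written example of the kind of filter the copilot would have to emit for a prompt like "female TCGA-LUAD cases with RNA-Seq data" (the field names follow the public GDC search API; that the copilot's output format matches this exactly is an assumption):

```python
import json

# Hypothetical target output for the prompt:
#   "female TCGA-LUAD cases with RNA-Seq data"
cohort_filter = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id",
                                 "value": ["TCGA-LUAD"]}},
        {"op": "in", "content": {"field": "cases.demographic.gender",
                                 "value": ["female"]}},
        {"op": "in", "content": {"field": "files.experimental_strategy",
                                 "value": ["RNA-Seq"]}},
    ],
}
print(json.dumps(cohort_filter, indent=2))
```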

[78] RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism

Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu

Main category: cs.CL

TL;DR: RAG-R1 is a novel training framework for LLMs that improves adaptive use of internal and external knowledge, reduces inference time, and outperforms baselines by up to 13.2%.

Motivation: Addressing LLMs' tendency to generate hallucinated or outdated responses, and improving training stability and inference efficiency in RAG methods.

Method: Proposes RAG-R1, a framework enabling adaptive knowledge use and expanding retrieval/generation to multi-query parallelism.

Result: Outperforms baselines by up to 13.2% on QA benchmarks and reduces inference time by 11.1%.

Conclusion: RAG-R1 effectively enhances LLM performance and efficiency, addressing key challenges in RAG methods.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
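
At the systems level, the multi-query idea is simple: instead of one retrieval call per reasoning step, the model emits several queries that are retrieved concurrently and merged. A minimal sketch, with `search` as a placeholder retriever (in RAG-R1 the model itself generates the queries):

```python
from concurrent.futures import ThreadPoolExecutor

def search(query: str) -> list:
    """Placeholder retriever returning a list of document strings."""
    return [f"doc for: {query}"]

def multi_query_retrieve(queries: list) -> list:
    # Issue all queries concurrently rather than one search per step
    # (the single-query mode the paper moves away from).
    with ThreadPoolExecutor(max_workers=max(len(queries), 1)) as pool:
        results = list(pool.map(search, queries))
    # Deduplicate while preserving order before adding to the context.
    seen, merged = set(), []
    for docs in results:
        for d in docs:
            if d not in seen:
                seen.add(d)
                merged.append(d)
    return merged

print(multi_query_retrieve(["who founded X?", "when was X founded?"]))
```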

[79] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching

Thomas Savage

Main category: cs.CL

TL;DR: SCF, a reinforcement learning framework, improves multi-turn dialogue performance in LLMs by using branched conversation architectures, outperforming linear methods in diagnostic accuracy.

Motivation: Existing fine-tuning methods like DPO and GRPO are effective for single-turn tasks but fall short in multi-turn applications like medical interviews, where early turns influence outcomes.

Method: SCF introduces a branched conversation architecture, generating multiple continuations per turn to train LLMs on how early responses affect downstream interactions.

Result: SCF outperforms linear architectures in diagnostic accuracy in simulated doctor-patient conversations, likely due to richer, interdependent training signals.

Conclusion: Branched training architectures like SCF are crucial for fine-tuning LLMs in complex multi-turn conversational tasks.

Abstract: Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine-tuning LLMs in complex multi-turn conversational tasks.
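
The branching mechanic is easy to picture as tree enumeration: sample several continuations at each turn, roll each out, then score complete trajectories so that early turns receive credit. A minimal sketch, with `generate` as a placeholder for sampling from the policy LLM:

```python
def generate(history: list, n: int) -> list:
    """Placeholder for sampling n candidate replies from the policy LLM."""
    return [f"reply {i} to: {history[-1]}" for i in range(n)]

def build_forest(history: list, depth: int, branch: int) -> list:
    """Enumerate all conversations in a branch**depth tree rooted at history."""
    if depth == 0:
        return [history]
    conversations = []
    for reply in generate(history, branch):
        conversations.extend(build_forest(history + [reply], depth - 1, branch))
    return conversations

# 2 turns deep, 3 continuations per turn -> 9 full trajectories, each of
# which can be scored (e.g., by final diagnostic accuracy) so the reward
# propagates back to the early responses that produced it.
trajectories = build_forest(["Patient: I have chest pain."], depth=2, branch=3)
print(len(trajectories))  # 9
```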

[80] DRAGON: Dynamic RAG Benchmark On News

Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova

Main category: cs.CL

TL;DR: DRAGON is a dynamic benchmark for evaluating RAG systems in Russian, addressing the lack of resources for non-English languages by using a regularly updated news corpus and automated question generation.

Motivation: Existing RAG benchmarks are scarce and static for non-English languages like Russian, failing to reflect real-world dynamics.

Method: DRAGON uses a regularly updated Russian news corpus, automated question generation via a Knowledge Graph, and supports evaluation of retriever and generator components.

Result: The benchmark includes an evaluation framework, scripts, and a public leaderboard to foster community engagement.

Conclusion: DRAGON fills a gap in RAG evaluation for Russian and offers reusable tools for multilingual settings.

Abstract: Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically with the use of a Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.

[81] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmed, Yang Liu

Main category: cs.CL

TL;DR: ETT (Extend at Test-Time) extends context length of short-context Transformer LLMs with linear computation and constant memory, improving accuracy by up to 30%.

Motivation: Quadratic computation/memory overhead in Transformer LMs limits long-sequence processing. ETT addresses this by enabling efficient context extension at test-time.

Method: ETT fine-tunes model parameters on overlapping subsequences of input context, focusing on specific Transformer modules (e.g., second FFN layer).

Result: Extends context length up to 32x (1k to 32k tokens) with up to a 30% accuracy improvement on LongBench for GPT-Large and Phi-2.

Conclusion: Fine-tuning specific modules (e.g., FFN layers) is more effective than full fine-tuning, enhancing model accuracy for long sequences.

Abstract: Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model’s parameters on the input context, chunked into small overlapping subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in an LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
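
The test-time recipe is to freeze the model except for a chosen module and take gradient steps on overlapping windows of the input. A minimal sketch using HuggingFace GPT-2 modules, where targeting `mlp.c_proj` (the second FFN linear in HF's GPT-2 blocks) stands in for the paper's "second layer of the FFNs"; the checkpoint and hyperparameters are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2-large")
tok = GPT2TokenizerFast.from_pretrained("gpt2-large")

# Freeze everything except the second linear layer of each FFN block.
for name, p in model.named_parameters():
    p.requires_grad = "mlp.c_proj" in name

opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

def extend_at_test_time(long_text: str, chunk=1024, stride=512, steps=1):
    """Store the long context in the FFN weights via language-modeling
    steps over overlapping subsequences: constant memory, linear compute."""
    ids = tok(long_text, return_tensors="pt").input_ids[0]
    for start in range(0, max(len(ids) - chunk, 1), stride):
        window = ids[start:start + chunk].unsqueeze(0)
        for _ in range(steps):
            loss = model(window, labels=window).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
```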

[82] A Mathematical Theory of Discursive Networks

Juan B. Gutiérrez

Main category: cs.CL

TL;DR: The paper explores LLMs as a discursive network where humans and models interact equally, identifying hazards like error generation and proposing peer review (FOO algorithm) to stabilize truth.

Motivation: To understand and improve the reliability of LLMs by treating them as part of a discursive network where errors are managed through mutual accountability.

Method: Develops a mathematical model of discursive networks, introduces the FOO algorithm for peer review, and analyzes hazards like drift and fabrication.

Result: A network with drift and self-repair stabilizes at modest error rates, but peer review shifts it to truth-dominance.

Conclusion: Reliability in LLM interactions depends on networked accountability, not perfecting individual models.

Abstract: Large-language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.
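
The abstract describes the FOO algorithm only at a high level, but the loop it names (agents critique one another, a harmonizer merges the verdicts) can be sketched directly, with `critique` and `harmonize` as placeholders for LLM calls:

```python
def critique(agent: str, claim: str) -> str:
    """Placeholder: agent lists the flaws it sees in the claim."""
    return f"[{agent}] possible flaw in: {claim}"

def harmonize(claim: str, critiques: list) -> str:
    """Placeholder: merge the critiques into a revised claim."""
    return claim + " [revised against %d critiques]" % len(critiques)

def foo_loop(claim: str, agents: list, rounds: int) -> str:
    # Each round, every agent critiques the current claim ("flaws of
    # others"), then a harmonizer folds the verdicts into a revision.
    for _ in range(rounds):
        critiques = [critique(a, claim) for a in agents]
        claim = harmonize(claim, critiques)
    return claim

print(foo_loop("The moon is made of cheese.",
               ["agent_a", "agent_b", "agent_c"], rounds=2))
```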

[83] Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams

Matthew Anderson Hendricks, Alice Cicirello

Main category: cs.CL

TL;DR: The paper proposes a strategy for automating the generation of dynamical system computational models using SysML diagrams, NLP, and LLMs, demonstrating improved performance over LLM-only approaches.

Motivation: To speed up the design and deployment of engineering dynamical systems by leveraging domain knowledge and automating model generation.

Method: A five-step strategy using SysML diagrams, NLP, and LLMs for tasks like extracting dependencies, attributes, and operations, followed by code and computational model generation.

Result: Improved performance in generating computational models, illustrated through case studies and an end-to-end example of a simple pendulum.

Conclusion: The approach is versatile, not limited to specific domains, and outperforms LLM-only methods in generating accurate dynamical system models.

Abstract: This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy for exploiting domain and expert knowledge for the automated generation of dynamical system computational models, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. This strategy is implemented in five steps and, crucially, it uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve intermediate outputs of the automated SysML diagram generation, such as: the list of key nouns; the list of extracted relationships; the list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems are then obtained from the SysML diagrams via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. The applicability of the proposed approach is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.

[84] Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Varin Sikka, Vishal Sikka

Main category: cs.CL

TL;DR: LLMs face limitations in handling tasks beyond a certain complexity, including hallucinations and verification issues.

Motivation: To investigate the computational complexity limits of LLMs and their impact on task performance and accuracy.

Method: Analyze LLMs and LLM-based agents from a computational complexity perspective.

Result: LLMs cannot perform or verify tasks beyond a specific complexity threshold.

Conclusion: LLMs have inherent computational limits affecting their reliability in complex tasks.

Abstract: In this paper we explore hallucinations and related capability limitations in LLMs and LLM-based agents from the perspective of computational complexity. We show that beyond a certain complexity, LLMs are incapable of carrying out computational and agentic tasks or verifying their accuracy.

[85] On the Effect of Instruction Tuning Loss on Generalization

Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: The paper introduces Weighted Instruction Tuning (WIT), a method to optimize loss functions in instruction tuning by differentially weighting prompt and response tokens, outperforming conventional approaches.

Motivation: To address the overlooked issue of loss function optimization in instruction tuning and improve model performance and robustness.

Method: Proposes WIT, which systematically weights prompt and response tokens differently in the loss function, tested across various models, datasets, and benchmarks.

Result: WIT improves performance and robustness, with optimal weights being low-to-moderate for prompts and moderate-to-high for responses.

Conclusion: The study advocates rethinking instruction tuning loss functions and provides insights for developing more robust models, with open-sourced code.

Abstract: Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings, which also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
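
The change WIT makes to the objective is local and easy to state: per-token cross-entropy is reweighted by whether a token belongs to the prompt or the response. A minimal PyTorch sketch, where the 0.2/1.0 defaults are illustrative rather than the paper's tuned values (conventional instruction tuning is the special case w_prompt=0, w_response=1):

```python
import torch
import torch.nn.functional as F

def wit_loss(logits, labels, prompt_mask, w_prompt=0.2, w_response=1.0):
    """Weighted instruction-tuning loss (sketch).

    logits:      (B, T, V) model outputs
    labels:      (B, T) next-token targets, -100 on padding
    prompt_mask: (B, T) bool, True where the token is part of the prompt
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                                        # (B, T)
    weights = prompt_mask.float() * w_prompt + (~prompt_mask).float() * w_response
    weights = weights * (labels != -100).float()             # zero out padding
    return (per_token * weights).sum() / weights.sum().clamp(min=1e-8)
```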

[86] KAT-V1: Kwai-AutoThink Technical Report

Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu

Main category: cs.CL

TL;DR: Kwaipilot-AutoThink (KAT) is a 40B LLM addressing overthinking in reasoning tasks via dynamic mode-switching, efficient training, and reinforcement learning, outperforming SOTA models while reducing token usage by ~30%.

Motivation: To solve the overthinking problem in reasoning-intensive tasks by dynamically adjusting reasoning modes based on task complexity.

Method: Uses a dual-regime dataset, MTP-enhanced knowledge distillation, cold-start initialization, and Step-SRPO reinforcement learning for mode selection and accuracy.

Result: Matches or outperforms SOTA models (e.g., DeepSeek-R1-0528, Qwen3-235B-A22B) with ~30% fewer tokens. Successfully deployed in Kwaipilot.

Conclusion: KAT is efficient, scalable, and improves real-world workflows. A 200B MoE model is already in training, with early results showing further potential.

Abstract: We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.

[87] DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures

Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro

Main category: cs.CL

TL;DR: DocPolarBERT is a layout-aware BERT model for document understanding, using relative polar coordinates instead of absolute 2D positional embeddings, achieving state-of-the-art results with less pre-training data.

Motivation: To improve document understanding by eliminating the need for absolute 2D positional embeddings and reducing reliance on large pre-training datasets.

Method: Extends self-attention to use relative polar coordinates for text block positions, trained on a smaller dataset than IIT-CDIP.

Result: Achieves state-of-the-art performance despite using significantly less pre-training data.

Conclusion: A well-designed attention mechanism can compensate for reduced pre-training data, offering an efficient solution for document understanding.

Abstract: We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
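
The geometric core is computing, for every pair of text blocks, a distance and an angle between their centers rather than absolute (x, y) embeddings. A minimal sketch of that computation (how DocPolarBERT buckets and embeds these values inside self-attention is not reproduced here):

```python
import torch

def relative_polar(boxes: torch.Tensor):
    """boxes: (N, 4) text-block coordinates as (x0, y0, x1, y1).

    Returns pairwise (r, theta): radial distance and angle from block i
    to block j, i.e., the relative polar geometry the model attends over.
    """
    centers = torch.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2,
         (boxes[:, 1] + boxes[:, 3]) / 2], dim=-1)          # (N, 2)
    delta = centers[None, :, :] - centers[:, None, :]       # (N, N, 2)
    r = delta.norm(dim=-1)                                  # (N, N) distances
    theta = torch.atan2(delta[..., 1], delta[..., 0])       # (N, N) angles
    return r, theta
```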

[88] SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems

Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn

Main category: cs.CL

TL;DR: SEALGuard is a multilingual guardrail for LLM systems, improving safety alignment across diverse languages by outperforming existing methods like LlamaGuard in detecting unsafe and jailbreak prompts.

Motivation: Existing guardrails like LlamaGuard struggle with multilingual unsafe inputs, leaving LLM systems vulnerable, especially in low-resource languages.

Method: Adapts a multilingual language model into a guardrail using LoRA and evaluates it on SEALSBench, a dataset of 260,000 prompts in ten languages.

Result: SEALGuard improves Defense Success Rate by 48% over LlamaGuard and achieves top performance in DSR, precision, and F1-score.

Conclusion: SEALGuard effectively addresses the multilingual safety alignment gap, enhancing LLM system safety.

Abstract: Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) by 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. SEALGuard advances the safety alignment of LLM systems by introducing an effective multilingual guardrail.
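
The adaptation step maps onto the standard PEFT workflow. A minimal sketch; the base checkpoint and LoRA target modules are assumptions, since the abstract names only "a general-purpose multilingual language model" and LoRA:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Base model is an assumption; the paper does not name it in the abstract.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical choice; paper's is unstated
    task_type="CAUSAL_LM",
)
guardrail = get_peft_model(base, lora)
guardrail.print_trainable_parameters()
# Training then proceeds as ordinary supervised fine-tuning on
# (prompt, safe/unsafe/jailbreak label) pairs, e.g., from SEALSBench.
```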

[89] REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once

Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu

Main category: cs.CL

TL;DR: The paper introduces REST, a stress-testing framework for evaluating Large Reasoning Models (LRMs) under simultaneous multi-problem conditions, revealing performance gaps not captured by single-question benchmarks.

Motivation: Existing benchmarks for LRMs are limited by isolated problem-solving paradigms, failing to assess real-world capabilities like multi-context reasoning and dynamic cognitive load management.

Method: REST evaluates LRMs by exposing them to multiple problems simultaneously, testing contextual priority allocation, cross-problem interference resistance, and cognitive load management.

Result: State-of-the-art models like DeepSeek-R1 show significant performance degradation under REST, highlighting its discriminative power and revealing the ‘overthinking trap’ as a key issue.

Conclusion: REST offers a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands and reduces reliance on human annotation.

Abstract: Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) vulnerability to data contamination and insufficient difficulty (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly creation of new questions with substantial human effort, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the “overthinking trap” is a critical factor contributing to the performance degradation; (2) models trained with the “long2short” technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.
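
Operationally, the stress test amounts to packing several benchmark questions into a single prompt and requiring a labeled answer for each. A minimal sketch (the exact instruction template REST uses is an assumption here):

```python
def rest_prompt(problems: list) -> str:
    """Concatenate several problems into one stress-test query (sketch).

    The key idea is only that the model must answer all problems in a
    single pass, so per-problem accuracy can be compared against the
    same model's single-question performance.
    """
    numbered = "\n\n".join(
        f"Problem {i + 1}: {p}" for i, p in enumerate(problems)
    )
    return ("Solve all of the following problems. Give a clearly labeled "
            "final answer for each one.\n\n" + numbered)

print(rest_prompt(["What is 17 * 24?", "Factor x^2 - 5x + 6."]))
```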

cs.CV

[90] CWNet: Causal Wavelet Network for Low-Light Image Enhancement

Tongshun Zhang, Pingping Liu, Yubing Lu, Mengen Cai, Zijian Zhang, Zhe Zhang, Qiuzhan Zhou

Main category: cs.CV

TL;DR: CWNet introduces a causal reasoning approach for low-light image enhancement, using wavelet transforms and causal principles to outperform existing methods.

Motivation: Traditional LLIE methods lack instance-level semantic understanding and feature-specific enhancements. CWNet addresses this by incorporating causal reasoning.

Method: CWNet combines causal reasoning (global metric learning and local CLIP semantic loss) with a wavelet transform-based backbone for frequency optimization.

Result: CWNet outperforms state-of-the-art methods across multiple datasets, demonstrating robust performance.

Conclusion: CWNet’s causal and wavelet-based approach effectively enhances low-light images, setting a new benchmark for LLIE.

Abstract: Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that effectively optimizes the recovery of frequency information, ensuring precise enhancement tailored to the specific attributes of wavelet transforms. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes. Code is available at https://github.com/bywlzts/CWNet-Causal-Wavelet-Network.

[91] Integrating Biological Knowledge for Robust Microscopy Image Profiling on De Novo Cell Lines

Jiayuan Chen, Thai-Hoang Pham, Yuanlong Wang, Ping Zhang

Main category: cs.CV

TL;DR: A novel framework integrates biological knowledge to improve microscopy image profiling for de novo cell lines by disentangling perturbation-specific and cell line-specific features.

Motivation: Addressing challenges in robust perturbation screening for de novo cell lines due to morphological and biological heterogeneity.

Method: Uses a knowledge graph from protein interaction data and transcriptomic features to guide pretraining, disentangling representations.

Result: Improves generalization to de novo cell lines, validated on RxRx datasets.

Conclusion: Effective for phenotype-based drug discovery, enhancing microscopy image profiling.

Abstract: High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for de novo cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to de novo cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for de novo cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.

[92] Auditing Facial Emotion Recognition Datasets for Posed Expressions and Racial Bias

Rina Khan, Catherine Stinson

Main category: cs.CV

TL;DR: The paper audits FER datasets, revealing biases in performance for spontaneous vs. posed expressions and racial/skin tone disparities, leading to ethical concerns in real-world applications.

Motivation: Address performance and ethical challenges in FER algorithms, particularly for spontaneous expressions and racial/skin tone biases.

Method: Audit two FER datasets by sampling images to classify them as spontaneous or posed, and evaluate model performance across races and skin tones.

Result: Found mislabeled posed images in ‘in-the-wild’ datasets and racial/skin tone biases in FER models, skewing predictions negatively for non-white/dark-skinned individuals.

Conclusion: FER datasets and models exhibit biases, risking harm in real-world use; improvements in data collection and model training are needed.

Abstract: Facial expression recognition (FER) algorithms classify facial expressions into emotions such as happy, sad, or angry. An evaluative challenge facing FER algorithms is the fall in performance when detecting spontaneous expressions compared to posed expressions. An ethical (and evaluative) challenge facing FER algorithms is that they tend to perform poorly for people of some races and skin colors. These challenges are linked to the data collection practices employed in the creation of FER datasets. In this study, we audit two state-of-the-art FER datasets. We take random samples from each dataset and examine whether images are spontaneous or posed. In doing so, we propose a methodology for identifying spontaneous or posed images. We discover a significant number of images that were posed in the datasets purporting to consist of in-the-wild images. Since the performance of FER models varies between spontaneous and posed images, the performance of models trained on these datasets will not represent the true performance if such models were to be deployed in in-the-wild applications. We also observe the skin color of individuals in the samples, and test three models trained on each of the datasets to predict facial expressions of people from various races and skin tones. We find that the FER models audited were more likely to predict people labeled as not white or determined to have dark skin as showing a negative emotion such as anger or sadness even when they were smiling. This bias makes such models prone to perpetuate harm in real-life applications.

[93] FPC-Net: Revisiting SuperPoint with Descriptor-Free Keypoint Detection via Feature Pyramids and Consistency-Based Implicit Matching

Ionuţ Grigore, Călin-Adrian Popa, Claudiu Leoveanu-Condrei

Main category: cs.CV

TL;DR: A method for interest point matching without descriptors, reducing memory usage while slightly lowering accuracy.

Motivation: Traditional methods rely on descriptors for matching, which require computation, storage, and transmission. This work aims to eliminate descriptors while maintaining functionality.

Method: Interest points are inherently associated during detection, bypassing the need for descriptors.

Result: Matching accuracy is marginally lower than conventional methods, but memory usage is drastically reduced.

Conclusion: The proposed method offers a viable alternative to descriptor-based matching, especially for memory-constrained systems.

Abstract: The extraction and matching of interest points are fundamental to many geometric computer vision tasks. Traditionally, matching is performed by assigning descriptors to interest points and identifying correspondences based on descriptor similarity. This work introduces a technique where interest points are inherently associated during detection, eliminating the need for computing, storing, transmitting, or matching descriptors. Although the matching accuracy is marginally lower than that of conventional approaches, our method completely eliminates the need for descriptors, leading to a drastic reduction in memory usage for localization systems. We assess its effectiveness by comparing it against both classical handcrafted methods and modern learned approaches.

[94] A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers

Jeffrey Joan Sam, Janhavi Sathe, Nikhil Chigali, Naman Gupta, Radhey Ruparel, Yicheng Jiang, Janmajay Singh, James W. Berck, Arko Barman

Main category: cs.CV

TL;DR: A new dataset of 64k annotated spacecraft images was created to address the scarcity of training data for autonomous inspection systems. The dataset includes real and synthetic backgrounds, with added noise and distortions. YOLOv8 and YOLOv11 models were fine-tuned, achieving high performance under real-time constraints.

Motivation: The need for reliable, cost-effective autonomous inspection systems for spacecraft due to risks and costs of human or robotic repairs in space.

Method: Creation of a dataset with real spacecraft models and synthetic backgrounds, addition of noise/distortions, and fine-tuning YOLOv8/YOLOv11 models for segmentation.

Result: Models achieved a Dice score of 0.92, Hausdorff distance of 0.69, and inference time of ~0.5 seconds under real-time constraints.

Conclusion: The dataset and models provide a robust benchmark for real-time spacecraft image segmentation, addressing the lack of annotated data and enabling autonomous inspection.

Abstract: Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA’s TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Finally, we finetuned YOLOv8 and YOLOv11 segmentation models to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA’s inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
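
The benchmark models use off-the-shelf YOLO segmentation checkpoints, which fine-tune through the stock ultralytics API. A minimal sketch, where "spacecraft.yaml" is a hypothetical dataset config pointing at the released images and masks:

```python
from ultralytics import YOLO

# Fine-tune a pretrained segmentation checkpoint, as the paper does for
# YOLOv8/YOLOv11; epochs and image size here are illustrative.
model = YOLO("yolov8n-seg.pt")
model.train(data="spacecraft.yaml", epochs=100, imgsz=640)

metrics = model.val()       # ultralytics reports mask mAP out of the box;
print(metrics.seg.map)      # the paper's Dice and Hausdorff numbers come
                            # from additional, custom evaluation.
```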

[95] Warehouse Spatial Question Answering with LLM Agent

Hsiang-Wei Huang, Jen-Hao Cheng, Kuang-Ming Chen, Cheng-Yen Yang, Bahaa Alattar, Yi-Ru Lin, Pyongkun Kim, Sangwon Kim, Kwangju Kim, Chung-I Huang, Jenq-Neng Hwang

Main category: cs.CV

TL;DR: A data-efficient LLM agent system enhances spatial reasoning for complex indoor warehouse tasks, outperforming previous methods.

Motivation: Existing MLLMs struggle with spatial understanding, prompting the need for a more efficient solution.

Method: Proposes an LLM agent system with spatial reasoning tools and API interactions for complex spatial questions.

Result: Achieves high accuracy and efficiency in tasks like object retrieval, counting, and distance estimation on the AI City Challenge dataset.

Conclusion: The system demonstrates superior performance in spatial reasoning tasks, offering a practical solution for warehouse scenarios.

Abstract: Spatial understanding has been a challenging task for existing Multi-modal Large Language Models (MLLMs). Previous methods leverage large-scale MLLM finetuning to enhance MLLM’s spatial understanding ability. In this paper, we present a data-efficient approach. We propose an LLM agent system with strong and advanced spatial reasoning ability, which can be used to solve the challenging spatial question answering task in complex indoor warehouse scenarios. Our system integrates multiple tools that allow the LLM agent to conduct spatial reasoning and API tools interaction to answer the given complicated spatial question. Extensive evaluations on the 2025 AI City Challenge Physical AI Spatial Intelligence Warehouse dataset demonstrate that our system achieves high accuracy and efficiency in tasks such as object retrieval, counting, and distance estimation. The code is available at: https://github.com/hsiangwei0903/SpatialAgent

[96] ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel

Main category: cs.CV

TL;DR: ThinkingViT is a nested Vision Transformer that dynamically adjusts computation based on input complexity, improving efficiency and accuracy.

Motivation: Fixed computational budgets in Vision Transformers lead to inefficiencies, as all inputs receive the same compute regardless of complexity.

Method: ThinkingViT uses progressive thinking stages and Token Recycling to dynamically activate attention heads and terminate early if predictions are certain.

Result: ThinkingViT outperforms nested baselines by up to 2.9 p.p. in accuracy at equal GMACs on ImageNet-1K.

Conclusion: ThinkingViT offers a scalable, efficient, and accurate solution for Vision Transformers, also serving as a plugin upgrade for vanilla ViT.

Abstract: Vision Transformers deliver state-of-the-art performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent nested Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT initiates inference by activating a small subset of the most important attention heads and terminates early if predictions reach sufficient certainty. Otherwise, it activates additional attention heads and re-evaluates the input. At the core of ThinkingViT is our Token Recycling mechanism, which conditions each subsequent inference stage on the embeddings from the previous stage, enabling progressive improvement. Due to its backbone-preserving design, ThinkingViT also serves as a plugin upgrade for vanilla ViT. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. The source code is available at https://github.com/ds-kiel/ThinkingViT.
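
The control flow is a confidence-gated cascade. A minimal sketch, where each element of `stages` is a placeholder callable running the ViT with progressively more attention heads, and Token Recycling is reduced to passing the previous stage's embedding forward:

```python
import torch

@torch.no_grad()
def thinking_inference(stages, image, threshold=0.9):
    """Early-exit loop (sketch).

    stages: ordered callables, each returning (logits, embedding) for a
    progressively larger subset of attention heads. How Token Recycling
    conditions a stage on the previous embedding is simplified to an
    extra keyword argument here; the threshold value is illustrative.
    """
    embedding = None
    for stage in stages:
        logits, embedding = stage(image, prev=embedding)  # Token Recycling
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:     # certain enough: stop "thinking"
            return pred
    return pred                          # last stage's answer as fallback
```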

[97] LLM-Guided Agentic Object Detection for Open-World Understanding

Furkan Mumcu, Michael J. Jones, Anoop Cherian, Yasin Yilmaz

Main category: cs.CV

TL;DR: LAOD framework uses LLM for label-free, zero-shot object detection, improving adaptability and autonomy in open-world scenarios.

Motivation: Traditional object detection lacks flexibility for novel objects; OWOD and OVOD have limitations like missing semantic labels or dependency on user prompts.

Method: LAOD leverages LLM to generate scene-specific object names, paired with an open-vocabulary detector for localization. New metrics (CAAP, SNAP) evaluate localization and naming.

Result: Experiments on LVIS, COCO, and COCO-OOD show strong performance in detecting and naming novel objects.

Conclusion: LAOD enhances autonomy and adaptability for open-world understanding, outperforming existing methods.

Abstract: Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.

[98] MFGDiffusion: Mask-Guided Smoke Synthesis for Enhanced Forest Fire Detection

Guanghao Wu, Chen Xu, Hai Song, Chong Wang, Qixing Zhang

Main category: cs.CV

TL;DR: A framework for generating realistic forest fire smoke images using deep learning, addressing data scarcity and improving detection models.

Motivation: The scarcity of forest fire smoke image data hinders detection. Current inpainting models fail to produce high-quality, context-consistent smoke images.

Method: Uses pre-trained segmentation and multimodal models for masks and captions, introduces a mask-guided network, and proposes a mask random difference loss for consistency. A multimodal LLM filters synthetic images.

Result: Generated smoke images are realistic and diverse, enhancing forest fire smoke detection model performance.

Conclusion: The proposed framework effectively addresses data scarcity and improves smoke detection, with code publicly available.

Abstract: Smoke is the first visible indicator of a wildfire. With the advancement of deep learning, image-based smoke detection has become a crucial method for detecting and preventing forest fires. However, the scarcity of smoke image data from forest fires is one of the significant factors hindering the detection of forest fire smoke. Image generation models offer a promising solution for synthesizing realistic smoke images. However, current inpainting models exhibit limitations in generating high-quality smoke representations, particularly manifesting as inconsistencies between synthesized smoke and background contexts. To solve these problems, we proposed a comprehensive framework for generating forest fire smoke images. Firstly, we employed the pre-trained segmentation model and the multimodal model to obtain smoke masks and image captions. Then, to address the insufficient utilization of masks and masked images by inpainting models, we introduced a network architecture guided by mask and masked image features. We also proposed a new loss function, the mask random difference loss, which enhances the consistency of the generated effects around the mask by randomly expanding and eroding the mask edges. Finally, to generate a smoke image dataset using random masks for subsequent detection tasks, we incorporated smoke characteristics and used a multimodal large language model as a filtering tool to select diverse and reasonable smoke images, thereby improving the quality of the synthetic dataset. Experiments showed that our generated smoke images are realistic and diverse, and effectively enhance the performance of forest fire smoke detection models. Code is available at https://github.com/wghr123/MFGDiffusion.

[99] Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization

Casey Wall, Longwei Wang, Rodrigue Rizk, KC Santosh

Main category: cs.CV

TL;DR: Winsor-CAM improves Grad-CAM by aggregating info across all CNN layers, using Winsorization to reduce noise and offering human-tunable thresholds for better interpretability.

Motivation: Enhancing CNN interpretability for high-stakes applications by addressing Grad-CAM's limitations in handling layer-specific semantic cues and noise.

Method: Proposes Winsor-CAM, applying Winsorization to attenuate outliers and aggregating attributions across all convolutional layers with a tunable threshold.

Result: Outperforms Grad-CAM and uniform layer-averaging in localization metrics (e.g., IoU, center-of-mass alignment) on PASCAL VOC 2012 with standard architectures.

Conclusion: Winsor-CAM advances trustworthy AI by providing interpretable, multi-layer insights with human control, improving model transparency.

Abstract: Interpreting the decision-making process of Convolutional Neural Networks (CNNs) is critical for deploying models in high-stakes domains. Gradient-weighted Class Activation Mapping (Grad-CAM) is a widely used method for visual explanations, yet it typically focuses on the final convolutional layer or naïvely averages across layers, strategies that can obscure important semantic cues or amplify irrelevant noise. We propose Winsor-CAM, a novel, human-tunable extension of Grad-CAM that generates robust and coherent saliency maps by aggregating information across all convolutional layers. To mitigate the influence of noisy or extreme attribution values, Winsor-CAM applies Winsorization, a percentile-based outlier attenuation technique. A user-controllable threshold allows for semantic-level tuning, enabling flexible exploration of model behavior across representational hierarchies. Evaluations on standard architectures (ResNet50, DenseNet121, VGG16, InceptionV3) using the PASCAL VOC 2012 dataset demonstrate that Winsor-CAM produces more interpretable heatmaps and achieves superior performance in localization metrics, including intersection-over-union and center-of-mass alignment, when compared to Grad-CAM and uniform layer-averaging baselines. Winsor-CAM advances the goal of trustworthy AI by offering interpretable, multi-layer insights with human-in-the-loop control.
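
The aggregation step can be sketched directly: winsorize each layer's map at a percentile pair, normalize, and average. The percentile threshold plays the role of Winsor-CAM's human-tunable control; the paper's exact aggregation details are not reproduced here:

```python
import numpy as np

def winsor_cam(layer_cams, lower=5.0, upper=95.0):
    """Aggregate per-layer CAMs with Winsorization (sketch).

    layer_cams: list of (H, W) arrays, one Grad-CAM-style map per conv
    layer, already resized to a common resolution.
    """
    maps = []
    for cam in layer_cams:
        lo, hi = np.percentile(cam, [lower, upper])
        clipped = np.clip(cam, lo, hi)            # attenuate extreme values
        rng = clipped.max() - clipped.min()
        maps.append((clipped - clipped.min()) / (rng + 1e-8))
    return np.mean(maps, axis=0)                  # multi-layer saliency map
```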

[100] Sparse Fine-Tuning of Transformers for Generative Tasks

Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Main category: cs.CV

TL;DR: A sparse coding-inspired fine-tuning framework for transformers improves interpretability and task adaptation by representing updates as sparse combinations of feature dictionary atoms.

Motivation: Existing fine-tuning methods lack interpretability in how models adapt to new tasks due to dense parameter updates.

Method: Introduces a framework where fine-tuned features are sparse combinations of feature dictionary atoms, with coefficients indicating atom importance.

Result: Enhances image editing performance and outperforms baselines in text-to-image concept customization.

Conclusion: The method provides interpretable and efficient adaptation of pre-trained models to downstream tasks.

Abstract: Large pre-trained transformers have revolutionized artificial intelligence across various domains, and fine-tuning remains the dominant approach for adapting these models to downstream tasks due to the cost of training from scratch. However, in existing fine-tuning methods, the updated representations are formed as a dense combination of modified parameters, making it challenging to interpret their contributions and understand how the model adapts to new tasks. In this work, we introduce a fine-tuning framework inspired by sparse coding, where fine-tuned features are represented as a sparse combination of basic elements, i.e., feature dictionary atoms. The feature dictionary atoms function as fundamental building blocks of the representation, and tuning atoms allows for seamless adaptation to downstream tasks. Sparse coefficients then serve as indicators of atom importance, identifying the contribution of each atom to the updated representation. Leveraging the atom selection capability of sparse coefficients, we first demonstrate that our method enhances image editing performance by improving text alignment through the removal of unimportant feature dictionary atoms. Additionally, we validate the effectiveness of our approach in the text-to-image concept customization task, where our method efficiently constructs the target concept using a sparse combination of feature dictionary atoms, outperforming various baseline fine-tuning methods.

[101] A Lightweight and Robust Framework for Real-Time Colorectal Polyp Detection Using LOF-Based Preprocessing and YOLO-v11n

Saadat Behzadi, Danial Sharifrazi, Bita Mesbahzadeh, Javad Hassannataj Joloudarid, Roohallah Alizadehsani

Main category: cs.CV

TL;DR: A lightweight framework combining LOF for noise filtering and YOLO-v11n for polyp detection achieves high accuracy and efficiency in real-time colonoscopy support.

Motivation: Timely and accurate polyp detection is vital for colorectal cancer prevention, requiring efficient and robust AI solutions.

Method: Uses LOF for outlier removal and YOLO-v11n for detection, validated on five datasets with 5-fold cross-validation and augmentation.

Result: Achieves precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%.

Conclusion: The method is effective for real-time clinical use, highlighting the importance of data preprocessing and model efficiency.

Abstract: Objectives: Timely and accurate detection of colorectal polyps plays a crucial role in diagnosing and preventing colorectal cancer, a major cause of mortality worldwide. This study introduces a new, lightweight, and efficient framework for polyp detection that combines the Local Outlier Factor (LOF) algorithm for filtering noisy data with the YOLO-v11n deep learning model. Study design: An experimental study leveraging deep learning and outlier removal techniques across multiple public datasets. Methods: The proposed approach was tested on five diverse and publicly available datasets: CVC-ColonDB, CVC-ClinicDB, Kvasir-SEG, ETIS, and EndoScene. Since these datasets originally lacked bounding box annotations, we converted their segmentation masks into suitable detection labels. To enhance the robustness and generalizability of our model, we apply 5-fold cross-validation and remove anomalous samples using the LOF method configured with 30 neighbors and a contamination ratio of 5%. Cleaned data are then fed into YOLO-v11n, a fast and resource-efficient object detection architecture optimized for real-time applications. We train the model using a combination of modern augmentation strategies to improve detection accuracy under diverse conditions. Results: Our approach significantly improves polyp localization performance, achieving a precision of 95.83%, recall of 91.85%, F1-score of 93.48%, mAP@0.5 of 96.48%, and mAP@0.5:0.95 of 77.75%. Compared to previous YOLO-based methods, our model demonstrates enhanced accuracy and efficiency. Conclusions: These results suggest that the proposed method is well-suited for real-time colonoscopy support in clinical settings. Overall, the study underscores how crucial data preprocessing and model efficiency are when designing effective AI systems for medical imaging.
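
The preprocessing step maps directly onto scikit-learn, and the abstract pins down its configuration (30 neighbors, 5% contamination). A minimal sketch, assuming each training image has already been reduced to a feature vector (the abstract does not specify the embedding used):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def filter_outliers(features: np.ndarray) -> np.ndarray:
    """Drop anomalous training samples before YOLO-v11n training.

    features: (N, D) matrix of per-image feature vectors. The LOF
    settings mirror the paper: 30 neighbors, 5% contamination.
    Returns a boolean mask of samples to keep.
    """
    lof = LocalOutlierFactor(n_neighbors=30, contamination=0.05)
    return lof.fit_predict(features) == 1   # -1 marks outliers

X = np.random.rand(200, 16)                 # placeholder features
print(filter_outliers(X).sum(), "of 200 images kept")
```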

[102] Trexplorer Super: Topologically Correct Centerline Tree Tracking of Tubular Objects in CT Volumes

Roman Naeem, David Hagerman, Jennifer Alvén, Lennart Svensson, Fredrik Kahl

Main category: cs.CV

TL;DR: Trexplorer Super improves upon Trexplorer for 3D centerline tracking in medical images, addressing duplicate branches and premature termination. It outperforms SOTA models on new synthetic and real datasets.

DetailsMotivation: Accurate tracking of tubular tree structures (e.g., blood vessels) is vital for medical tasks, but existing models like Trexplorer have limitations.

Method: Enhanced Trexplorer (Trexplorer Super) with novel advancements, evaluated on three new datasets (one synthetic, two real).

Result: Trexplorer Super outperforms SOTA models on all datasets, but synthetic performance doesn’t guarantee real-data success.

Conclusion: Trexplorer Super advances centerline tracking, with datasets and code publicly available for further research.

Abstract: Tubular tree structures, such as blood vessels and airways, are essential in human anatomy and accurately tracking them while preserving their topology is crucial for various downstream tasks. Trexplorer is a recurrent model designed for centerline tracking in 3D medical images, but it tends to predict duplicate branches and terminate tracking prematurely. To address these issues, we present Trexplorer Super, an enhanced version that notably improves performance through novel advancements. However, evaluating centerline tracking models is challenging due to the lack of public datasets. To enable thorough evaluation, we develop three centerline datasets, one synthetic and two real, each with increasing difficulty. Using these datasets, we conduct a comprehensive evaluation of existing state-of-the-art (SOTA) models and compare them with our approach. Trexplorer Super outperforms previous SOTA models on every dataset. Our results also highlight that strong performance on synthetic data does not necessarily translate to real datasets. The code and datasets are available at https://github.com/RomStriker/Trexplorer-Super.

[103] Modernizing CNN-based Weather Forecast Model towards Higher Computational Efficiency

Minjong Cheon, Eunhan Goo, Su-Hyeon Shin, Muhammad Ahmed, Hyungjun Kim

Main category: cs.CV

TL;DR: A lightweight CNN-based model, KAI-a, achieves competitive accuracy in weather forecasting with reduced computational demands compared to Transformer-based models.

DetailsMotivation: Address the high training complexity and resource demands of Transformer-based weather forecast models by proposing a more efficient CNN-based alternative.

Method: Introduces KAI-a, a modernized CNN-based model with a scale-invariant architecture and InceptionNeXt-based blocks, trained on the ERA5 dataset with 67 atmospheric variables.

Result: KAI-a matches state-of-the-art performance in medium-range forecasting, with only 7 million parameters and 12-hour training on a single GPU.

Conclusion: KAI-a offers a lightweight, efficient, and accurate solution for weather forecasting, demonstrating robust performance in extreme events.

Abstract: Recently, AI-based weather forecast models have achieved impressive advances. These models have reached accuracy levels comparable to traditional NWP systems, marking a significant milestone in data-driven weather prediction. However, they mostly leverage Transformer-based architectures, which often leads to high training complexity and resource demands due to the massive parameter sizes. In this study, we introduce a modernized CNN-based model for global weather forecasting that delivers competitive accuracy while significantly reducing computational requirements. To present a systematic modernization roadmap, we highlight key architectural enhancements across multiple design scales from an earlier CNN-based approach. KAI-a incorporates a scale-invariant architecture and InceptionNeXt-based blocks within a geophysically-aware design, tailored to the structure of Earth system data. Trained on the ERA5 daily dataset with 67 atmospheric variables, the model contains about 7 million parameters and completes training in just 12 hours on a single NVIDIA L40s GPU. Our evaluation shows that KAI-a matches the performance of state-of-the-art models in medium-range weather forecasting, while offering a significantly lightweight design. Furthermore, case studies on the 2018 European heatwave and the East Asian summer monsoon demonstrate KAI-a’s robust skill in capturing extreme events, reinforcing its practical utility.
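
The InceptionNeXt-based block mentioned above can be sketched as a channel-split depthwise mixer; the split sizes and kernel shapes below follow the original InceptionNeXt recipe and are illustrative assumptions, not KAI-a's actual configuration.

```python
import torch
import torch.nn as nn

class InceptionNeXtMixer(nn.Module):
    """Sketch of an InceptionNeXt-style token mixer: split channels into four
    groups handled by a small square kernel, horizontal and vertical band
    kernels, and an identity branch."""
    def __init__(self, dim, band_k=11):
        super().__init__()
        g = dim // 4
        self.split = (g, g, g, dim - 3 * g)
        self.dw_square = nn.Conv2d(g, g, 3, padding=1, groups=g)
        self.dw_band_h = nn.Conv2d(g, g, (1, band_k), padding=(0, band_k // 2), groups=g)
        self.dw_band_v = nn.Conv2d(g, g, (band_k, 1), padding=(band_k // 2, 0), groups=g)

    def forward(self, x):                  # x: (B, C, H, W) gridded atmospheric fields
        a, b, c, d = torch.split(x, self.split, dim=1)
        return torch.cat([self.dw_square(a), self.dw_band_h(b),
                          self.dw_band_v(c), d], dim=1)
```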

[104] Commuting Distance Regularization for Timescale-Dependent Label Inconsistency in EEG Emotion Recognition

Xiaocong Zeng, Craig Michoski, Yan Pang, Dongyang Kuang

Main category: cs.CV

TL;DR: The paper introduces two regularization strategies, LVL and LGCL, to address Timescale Dependent Label Inconsistency (TsDLI) in EEG-based emotion recognition, improving model generalization and explainability.

DetailsMotivation: The work tackles the overlooked issue of TsDLI in EEG emotion recognition, aiming to enhance model performance and interpretability.

Method: Proposes LVL and LGCL, incorporating bounded variation functions and commute-time distances in a graph theoretic framework, alongside new evaluation metrics.

Result: Experiments on DREAMER and DEAP datasets show LVL and LGCL outperform baselines, with LVL achieving the best aggregate rank.

Conclusion: The proposed methods offer a principled trade-off between interpretability and predictive power, effectively addressing TsDLI.

Abstract: In this work, we address the often-overlooked issue of Timescale Dependent Label Inconsistency (TsDLI) in training neural network models for EEG-based human emotion recognition. To mitigate TsDLI and enhance model generalization and explainability, we propose two novel regularization strategies: Local Variation Loss (LVL) and Local-Global Consistency Loss (LGCL). Both methods incorporate classical mathematical principles, specifically functions of bounded variation and commute-time distances, within a graph theoretic framework. Complementing our regularizers, we introduce a suite of new evaluation metrics that better capture the alignment between temporally local predictions and their associated global emotion labels. We validate our approach through comprehensive experiments on two widely used EEG emotion datasets, DREAMER and DEAP, across a range of neural architectures including LSTM and transformer-based models. Performance is assessed using five distinct metrics encompassing both quantitative accuracy and qualitative consistency. Results consistently show that our proposed methods outperform state-of-the-art baselines, delivering superior aggregate performance and offering a principled trade-off between interpretability and predictive power under label inconsistency. Notably, LVL achieves the best aggregate rank across all benchmarked backbones and metrics, while LGCL frequently ranks second, highlighting the effectiveness of our framework.
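
As a concrete reference point, a bounded-variation-style penalty in the spirit of LVL can be written in a few lines; the paper's actual graph-theoretic formulation (and LGCL's commute-time distances) goes beyond this sketch.

```python
import torch

def local_variation_loss(local_logits):
    """Minimal sketch: penalize the total variation of temporally local
    predictions so window-level outputs change smoothly under a single
    global emotion label.
    local_logits: (batch, time, n_classes) predictions for consecutive windows."""
    probs = torch.softmax(local_logits, dim=-1)
    return (probs[:, 1:] - probs[:, :-1]).abs().sum(dim=-1).mean()
```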

[105] GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

Shaowen Tong, Zimin Xia, Alexandre Alahi, Xuming He, Yujiao Shi

Main category: cs.CV

TL;DR: GeoDistill is a weakly supervised framework for cross-view localization using teacher-student learning with FoV-based masking, improving accuracy and reducing uncertainty.

DetailsMotivation: Existing methods rely on costly ground-truth pose annotations; GeoDistill aims to reduce this dependency.

Method: Uses teacher-student learning with FoV-based masking to align student predictions with teacher outputs, focusing on key features.

Result: Significantly improves localization performance and introduces a novel orientation estimation network.

Conclusion: GeoDistill offers a scalable, efficient solution for cross-view localization challenges.

Abstract: Cross-view localization, the task of estimating a camera’s 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a Geometry guided weakly supervised self distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student’s predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges. Code and model can be found at https://github.com/tongshw/GeoDistill.
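
The teacher-student objective can be sketched compactly; using a KL divergence between location distributions is our assumption, as the abstract does not name the alignment loss.

```python
import torch
import torch.nn.functional as F

def geodistill_loss(teacher_loc_logits, student_loc_logits):
    """Sketch: the teacher localizes the full panorama, the student a
    limited-FoV crop produced by masking; the student's location
    distribution is pulled toward the (detached) teacher's.
    Both inputs: (batch, H*W) logits over candidate map locations."""
    with torch.no_grad():
        teacher_p = F.softmax(teacher_loc_logits, dim=-1)
    student_logp = F.log_softmax(student_loc_logits, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```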

[106] Graph Aggregation Prototype Learning for Semantic Change Detection in Remote Sensing

Zhengyi Xu, Haoran Wu, Wen Jiang, Jie Geng

Main category: cs.CV

TL;DR: The paper proposes GAPL-SCD, a framework for Semantic Change Detection (SCD) in remote sensing, addressing multi-task optimization challenges with adaptive weight allocation and gradient rotation. It introduces graph aggregation prototype learning for better domain alignment and feature representation, achieving state-of-the-art results.

DetailsMotivation: SCD provides detailed semantic insights into changes in multi-temporal remote sensing data, but multi-task optimization challenges like negative transfer hinder performance.

Method: GAPL-SCD combines semantic segmentation, change detection, and graph aggregation prototype learning. It uses adaptive weight allocation, gradient rotation, and self-query multi-level feature interaction for improved performance.

Result: The method outperforms others on SECOND and Landsat-SCD datasets, showing higher accuracy and robustness.

Conclusion: GAPL-SCD effectively addresses multi-task challenges in SCD, offering a scalable solution with superior performance.

Abstract: Semantic change detection (SCD) extends the binary change detection task to provide not only the change locations but also the detailed “from-to” categories in multi-temporal remote sensing data. Such detailed semantic insights into changes offer considerable advantages for a wide array of applications. However, since SCD involves the simultaneous optimization of multiple tasks, the model is prone to negative transfer due to task-specific learning difficulties and conflicting gradient flows. To address this issue, we propose Graph Aggregation Prototype Learning for Semantic Change Detection in remote sensing (GAPL-SCD). In this framework, a multi-task joint optimization method is designed to optimize the primary tasks of semantic segmentation and change detection, along with the auxiliary task of graph aggregation prototype learning. Adaptive weight allocation and gradient rotation methods are used to alleviate the conflict between training tasks and improve multi-task learning capabilities. Specifically, the graph aggregation prototype learning module constructs an interaction graph using high-level features. Prototypes serve as class proxies, enabling category-level domain alignment across time points and reducing interference from irrelevant changes. Additionally, the proposed self-query multi-level feature interaction and bi-temporal feature fusion modules further enhance multi-scale feature representation, improving performance in complex scenes. Experimental results on the SECOND and Landsat-SCD datasets demonstrate that our method achieves state-of-the-art performance, with significant improvements in accuracy and robustness for the SCD task.
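
The abstract does not spell out the gradient rotation; a common mechanism in this family of multi-task fixes is a PCGrad-style projection, sketched here purely for illustration.

```python
import torch

def deconflict(grad_a, grad_b):
    """Illustrative PCGrad-style step (the paper's 'gradient rotation' may
    differ): if two flattened task gradients conflict (negative dot product),
    remove from grad_a its component along grad_b."""
    dot = torch.dot(grad_a, grad_b)
    if dot < 0:
        grad_a = grad_a - dot / (grad_b.norm() ** 2 + 1e-12) * grad_b
    return grad_a
```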

[107] Robust ID-Specific Face Restoration via Alignment Learning

Yushun Fang, Lu Liu, Xiang Gao, Qiang Hu, Ning Cao, Jianghe Cui, Gang Chen, Xiaoyun Zhang

Main category: cs.CV

TL;DR: RIDFR is a novel ID-specific face restoration framework using diffusion models to address identity uncertainty in degraded inputs. It combines content and identity injection modules with alignment learning for high-fidelity results.

DetailsMotivation: Current face restoration methods struggle with identity uncertainty due to obscured inputs and stochastic processes.

Method: RIDFR uses a pre-trained diffusion model with two parallel conditioning modules (content and identity injection) and alignment learning to align restoration results.

Result: RIDFR outperforms state-of-the-art methods, producing high-quality, ID-specific results with strong robustness.

Conclusion: RIDFR effectively addresses identity uncertainty in face restoration, achieving superior fidelity and robustness.

Abstract: The latest developments in Face Restoration have yielded significant advancements in visual quality through the utilization of diverse diffusion priors. Nevertheless, the uncertainty of face identity introduced by identity-obscure inputs and stochastic generative processes remains unresolved. To address this challenge, we present Robust ID-Specific Face Restoration (RIDFR), a novel ID-specific face restoration framework based on diffusion models. Specifically, RIDFR leverages a pre-trained diffusion model in conjunction with two parallel conditioning modules. The Content Injection Module inputs the severely degraded image, while the Identity Injection Module integrates the specific identity from a given image. Subsequently, RIDFR incorporates Alignment Learning, which aligns the restoration results from multiple references with the same identity in order to suppress the interference of ID-irrelevant face semantics (e.g. pose, expression, make-up, hair style). Experiments demonstrate that our framework outperforms the state-of-the-art methods, reconstructing high-quality ID-specific results with high identity fidelity and demonstrating strong robustness.

[108] Women Sport Actions Dataset for Visual Classification Using Small Scale Training Data

Palash Ray, Mahuya Sasmal, Asish Bera

Main category: cs.CV

TL;DR: The paper introduces the WomenSports dataset for women’s sports action classification, addressing the lack of diverse datasets. It proposes a CNN with channel attention for feature extraction, achieving 89.15% accuracy on the new dataset.

DetailsMotivation: Existing datasets lack diversity in women's sports actions, limiting research. The study aims to fill this gap with a new dataset and improved classification method.

Method: A CNN with channel attention is used for deep feature extraction. The method is tested on multiple datasets, including the new WomenSports dataset.

Result: The proposed method achieves 89.15% top-1 accuracy on the WomenSports dataset using ResNet-50.

Conclusion: The WomenSports dataset and the proposed CNN with channel attention effectively address the lack of diverse datasets and improve classification accuracy.

Abstract: Sports action classification representing complex body postures and player-object interactions is an emerging area in image-based sports analysis. Some works have contributed to automated sports action recognition using machine learning techniques over the past decades. However, sufficient image datasets representing women sports actions with enough intra- and inter-class variations are not available to the researchers. To overcome this limitation, this work presents a new dataset named WomenSports for women sports classification using small-scale training data. This dataset includes a variety of sports activities, covering wide variations in movements, environments, and interactions among players. In addition, this study proposes a convolutional neural network (CNN) for deep feature extraction. A channel attention scheme upon local contextual regions is applied to refine and enhance feature representation. The experiments are carried out on three different sports datasets and one dance dataset for generalizing the proposed algorithm, and the performances on these datasets are noteworthy. The deep learning method achieves 89.15% top-1 classification accuracy using ResNet-50 on the proposed WomenSports dataset, which is publicly available for research at Mendeley Data.
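
The channel attention scheme is not detailed in the abstract; a squeeze-and-excitation-style module is one plausible reading, sketched below with an assumed reduction ratio.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch: reweight backbone channels from globally pooled context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (B, C, H, W) CNN features
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: global average pooling
        return x * w[:, :, None, None]      # excite: per-channel reweighting
```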

[109] Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement

Xianmin Chen, Longfei Han, Peiliang Huang, Xiaoxu Feng, Dingwen Zhang, Junwei Han

Main category: cs.CV

TL;DR: RAWMamba, a Mamba-based method for low-light RAW image enhancement, addresses cross-domain challenges by integrating demosaicing and denoising with a Retinex Decomposition Module, achieving state-of-the-art results.

DetailsMotivation: Existing methods struggle with denoising and color distortions in low-light RAW-to-sRGB mapping, especially due to overlooked demosaicing characteristics in the ISP pipeline.

Method: Proposes RAWMamba, a novel method for RAW images with different CFAs, and a Retinex Decomposition Module (RDM) to decouple illumination and reflectance for better denoising and exposure correction.

Result: RAWMamba outperforms existing methods on public datasets SID and MCR, demonstrating superior cross-domain mapping performance.

Conclusion: The integration of demosaicing and denoising with RDM in RAWMamba effectively enhances low-light RAW images, setting a new benchmark for the task.

Abstract: Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, suffer from limited denoising performance. In contrast, existing two-stage approaches typically overlook the characteristic of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we propose a novel Mamba-based method customized for low-light RAW images, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we introduce a Retinex Decomposition Module (RDM) grounded in the Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction, reducing the effect of manual linear illumination enhancement. By bridging demosaicing and denoising, better enhancement for low-light RAW images is achieved. Experimental evaluations conducted on public datasets SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping. The code is available at https://github.com/Cynicarlos/RetinexRawMamba.

[110] Conceptualizing Multi-scale Wavelet Attention and Ray-based Encoding for Human-Object Interaction Detection

Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See

Main category: cs.CV

TL;DR: The paper introduces a wavelet attention-like backbone and a ray-based encoder for efficient and reliable human-object interaction (HOI) detection, addressing limitations in existing methods.

DetailsMotivation: Existing HOI detectors are inefficient and resource-intensive, lacking reliable predictions for complex visual scenes.

Method: Proposes a wavelet backbone for feature aggregation and a ray-based encoder for multi-scale attention, optimizing computational efficiency.

Result: Demonstrates improved performance on benchmark datasets like ImageNet and HICO-DET.

Conclusion: The proposed architecture effectively enhances HOI detection accuracy and efficiency, with publicly available code.

Abstract: Human-object interaction (HOI) detection is essential for accurately localizing and characterizing interactions between humans and objects, providing a comprehensive understanding of complex visual scenes across various domains. However, existing HOI detectors often struggle to deliver reliable predictions efficiently, relying on resource-intensive training methods and inefficient architectures. To address these challenges, we conceptualize a wavelet attention-like backbone and a novel ray-based encoder architecture tailored for HOI detection. Our wavelet backbone addresses the limitations of expressing middle-order interactions by aggregating discriminative features from the low- and high-order interactions extracted from diverse convolutional filters. Concurrently, the ray-based encoder facilitates multi-scale attention by optimizing the focus of the decoder on relevant regions of interest and mitigating computational overhead. As a result of harnessing the attenuated intensity of learnable ray origins, our decoder aligns query embeddings with emphasized regions of interest for accurate predictions. Experimental results on benchmark datasets, including ImageNet and HICO-DET, showcase the potential of our proposed architecture. The code is publicly available at [https://github.com/henry-pay/RayEncoder].

[111] A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

M. M. A. Valiuddin, R. J. G. van Sloun, C. G. A. Viviers, P. H. N. de With, F. van der Sommen

Main category: cs.CV

TL;DR: This review consolidates uncertainty modeling in semantic segmentation, addressing gaps like distinguishing epistemic and aleatoric uncertainty, and proposes future directions for robust models.

DetailsMotivation: Current segmentation models lack critical uncertainty information, leading to fragmented research and unreliable decision-making.

Method: The review examines foundational uncertainty concepts and their roles in segmentation tasks, and evaluates trends such as generative models and sampling-free approaches.

Result: Identified challenges include strong spatial assumptions, lack of benchmarks, and pitfalls in uncertainty quantification. Trends include generative model adoption and interest in distribution-free methods.

Conclusion: Proposes future directions for uncertainty-aware segmentation, aiming for reliable, efficient, and interpretable models in real-world applications.

Abstract: Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. The resulting reliance on point estimates has fueled interest in probabilistic segmentation, but the literature remains fragmented. In response, this review consolidates and contextualizes foundational concepts in uncertainty modeling, including the non-trivial task of distinguishing between epistemic and aleatoric uncertainty and examining their roles across four key downstream segmentation tasks, highlighting Active Learning as particularly promising. By unifying theory, terminology, and applications, we provide a coherent foundation for researchers and identify critical challenges, such as strong assumptions in spatial aggregation, lack of standardized benchmarks, and pitfalls in current uncertainty quantification methods. We identify trends such as the adoption of contemporary generative models, driven by advances in the broader field of generative modeling, with segmentation-specific innovation primarily in the conditioning mechanisms. Moreover, we observe growing interest in distribution- and sampling-free approaches to uncertainty estimation. We further propose directions for advancing uncertainty-aware segmentation in deep learning, including pragmatic strategies for disentangling different sources of uncertainty, novel uncertainty modeling approaches and improved Transformer-based backbones. In this way, we aim to support the development of more reliable, efficient, and interpretable segmentation models that effectively incorporate uncertainty into real-world applications.
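
For readers new to the epistemic/aleatoric distinction the review centers on, the standard entropy-based decomposition from stochastic forward passes (e.g., MC dropout) fits in a few lines.

```python
import torch

def uncertainty_decomposition(mc_probs, eps=1e-12):
    """Predictive entropy = aleatoric (expected entropy) + epistemic
    (mutual information), estimated from T stochastic passes.
    mc_probs: (T, batch, n_classes) softmax outputs; for segmentation,
    apply per pixel."""
    mean_p = mc_probs.mean(dim=0)
    total = -(mean_p * (mean_p + eps).log()).sum(dim=-1)
    aleatoric = -(mc_probs * (mc_probs + eps).log()).sum(dim=-1).mean(dim=0)
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```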

[112] Mind the Gap: Bridging Occlusion in Gait Recognition via Residual Gap Correction

Ayush Gupta, Siyuan Huang, Rama Chellappa

Main category: cs.CV

TL;DR: RG-Gait addresses occlusion challenges in gait recognition by modeling it as a residual learning task, improving occluded sequence performance without losing holistic accuracy.

DetailsMotivation: Current gait recognition methods often ignore occlusions or require impractical paired data, and fail to retain performance on holistic inputs.

Method: Proposes RG-Gait, a residual correction method that adaptively integrates learned residuals to improve occluded gait recognition while preserving holistic accuracy.

Result: Evaluated on Gait3D, GREW, and BRIAR datasets, showing effectiveness in handling occlusions without compromising holistic recognition.

Conclusion: Residual learning is a viable technique for occluded gait recognition with holistic retention.

Abstract: Gait is becoming popular as a method of person re-identification because of its ability to identify people at a distance. However, most current works in gait recognition do not address the practical problem of occlusions. Among those which do, some require paired tuples of occluded and holistic sequences, which are impractical to collect in the real world. Further, these approaches work on occlusions but fail to retain performance on holistic inputs. To address these challenges, we propose RG-Gait, a method for residual correction for occluded gait recognition with holistic retention. We model the problem as a residual learning task, conceptualizing the occluded gait signature as a residual deviation from the holistic gait representation. Our proposed network adaptively integrates the learned residual, significantly improving performance on occluded gait sequences without compromising the holistic recognition accuracy. We evaluate our approach on the challenging Gait3D, GREW and BRIAR datasets and show that learning the residual can be an effective technique to tackle occluded gait recognition with holistic retention.
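
A minimal sketch of the residual formulation, assuming a gated additive correction on top of a holistic gait embedding; the module structure and gating are our assumptions, not RG-Gait's exact design.

```python
import torch
import torch.nn as nn

class ResidualGaitCorrection(nn.Module):
    """Sketch: occluded signature = holistic representation + learned
    residual, gated so holistic inputs pass through largely unchanged."""
    def __init__(self, dim):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # ~0 for holistic inputs

    def forward(self, feat):                # feat: (batch, dim) gait embedding
        return feat + self.gate(feat) * self.residual(feat)
```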

[113] Zero-Shot Hyperspectral Pansharpening Using Hysteresis-Based Tuning for Spectral Quality Control

Giuseppe Guarino, Matteo Ciotola, Gemine Vivone, Giovanni Poggi, Giuseppe Scarpa

Main category: cs.CV

TL;DR: A novel hyperspectral pansharpening method using a lightweight neural network with adaptive weights ensures uniform spectral quality across all bands, outperforming state-of-the-art methods.

DetailsMotivation: Existing methods for hyperspectral pansharpening, borrowed from multispectral techniques, fail to address unique challenges like noise, spectral mismatch, and inconsistent quality across bands.

Method: A lightweight neural network with adaptive weights is used, dynamically adjusting spatial loss during fine-tuning to ensure spectral fidelity and account for nonlinear dependencies.

Result: The method achieves excellent sharpening quality consistently across all bands, as validated by benchmarking.

Conclusion: The proposed unsupervised, flexible, and low-complexity method outperforms state-of-the-art techniques, with shared code and results available online.

Abstract: Hyperspectral pansharpening has received much attention in recent years due to technological and methodological advances that open the door to new application scenarios. However, research on this topic is only now gaining momentum. The most popular methods are still borrowed from the more mature field of multispectral pansharpening and often overlook the unique challenges posed by hyperspectral data fusion, such as i) the very large number of bands, ii) the overwhelming noise in selected spectral ranges, iii) the significant spectral mismatch between panchromatic and hyperspectral components, iv) a typically high resolution ratio. Imprecise data modeling especially affects spectral fidelity. Even state-of-the-art methods perform well in certain spectral ranges and much worse in others, failing to ensure consistent quality across all bands, with the risk of generating unreliable results. Here, we propose a hyperspectral pansharpening method that explicitly addresses this problem and ensures uniform spectral quality. To this end, a single lightweight neural network is used, with weights that adapt on the fly to each band. During fine-tuning, the spatial loss is turned on and off to ensure a fast convergence of the spectral loss to the desired level, according to a hysteresis-like dynamic. Furthermore, the spatial loss itself is appropriately redefined to account for nonlinear dependencies between panchromatic and spectral bands. Overall, the proposed method is fully unsupervised, with no prior training on external data, flexible, and low-complexity. Experiments on a recently published benchmarking toolbox show that it ensures excellent sharpening quality, competitive with the state-of-the-art, consistently across all bands. The software code and the full set of results are shared online on https://github.com/giu-guarino/rho-PNN.
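
The hysteresis-like scheduling of the spatial loss can be pictured as a two-threshold switch; the threshold values below are placeholders, not those of the paper.

```python
def spatial_loss_active(spectral_loss, active, low=0.02, high=0.05):
    """Sketch of a hysteresis switch: disable the spatial term while
    spectral fidelity is poor, and re-enable it only once the spectral
    loss has dropped well below target, avoiding oscillation around a
    single threshold."""
    if active and spectral_loss > high:      # spectral quality degraded: pause spatial term
        return False
    if not active and spectral_loss < low:   # spectral loss converged: resume spatial term
        return True
    return active
```

During fine-tuning, the total objective would then be the spectral loss plus the spatial loss only while the switch is on.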

[114] SpaRTAN: Spatial Reinforcement Token-based Aggregation Network for Visual Recognition

Quan Bi Pay, Vishnu Monn Baskaran, Junn Yong Loo, KokSheik Wong, Simon See

Main category: cs.CV

TL;DR: SpaRTAN is a lightweight CNN architecture that improves spatial and channel-wise information processing, achieving competitive performance with fewer parameters and FLOPs.

DetailsMotivation: CNNs and transformers exhibit simplicity bias and redundancy in MLP-like blocks, limiting their efficiency and performance.

Method: SpaRTAN uses kernels with varying receptive fields and a wave-based channel aggregation module to enhance feature capture and reduce redundancies.

Result: Achieves 77.7% accuracy on ImageNet-1k with 3.8M parameters and 50.0% AP on COCO, surpassing benchmarks.

Conclusion: SpaRTAN offers an efficient, high-performance alternative to traditional CNNs and transformers.

Abstract: The resurgence of convolutional neural networks (CNNs) in visual recognition tasks, exemplified by ConvNeXt, has demonstrated their capability to rival transformer-based architectures through advanced training methodologies and ViT-inspired design principles. However, both CNNs and transformers exhibit a simplicity bias, favoring straightforward features over complex structural representations. Furthermore, modern CNNs often integrate MLP-like blocks akin to those in transformers, but these blocks suffer from significant information redundancies, necessitating high expansion ratios to sustain competitive performance. To address these limitations, we propose SpaRTAN, a lightweight architectural design that enhances spatial and channel-wise information processing. SpaRTAN employs kernels with varying receptive fields, controlled by kernel size and dilation factor, to capture discriminative multi-order spatial features effectively. A wave-based channel aggregation module further modulates and reinforces pixel interactions, mitigating channel-wise redundancies. Combining the two modules, the proposed network can efficiently gather and dynamically contextualize discriminative features. Experimental results in ImageNet and COCO demonstrate that SpaRTAN achieves remarkable parameter efficiency while maintaining competitive performance. In particular, on the ImageNet-1k benchmark, SpaRTAN achieves 77.7% accuracy with only 3.8M parameters and approximately 1.0 GFLOPs, demonstrating its ability to deliver strong performance through an efficient design. On the COCO benchmark, it achieves 50.0% AP, surpassing the previous benchmark by 1.2% with only 21.5M parameters. The code is publicly available at [https://github.com/henry-pay/SpaRTAN].

[115] Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection

Yuhu Bai, Jiangning Zhang, Yunkang Cao, Guangyuan Lu, Qingdong He, Xiangtai Li, Guanzhong Tian

Main category: cs.CV

TL;DR: FiSeCLIP enhances zero-shot anomaly detection (ZSAD) by combining feature matching and cross-modal alignment with CLIP, using batch-based testing and text filtering for better performance.

DetailsMotivation: To improve zero-shot anomaly detection by leveraging CLIP's capabilities while addressing challenges like noisy features and fine-grained detection.

Method: FiSeCLIP uses batch-based testing, mutual reference points within batches, and text filtering to reduce ambiguity. It also restores CLIP’s local semantic correlation for fine-grained tasks.

Result: FiSeCLIP outperforms SOTA AdaCLIP by +4.6%/+5.7% in segmentation metrics (AU-ROC/F1-max) on MVTec-AD.

Conclusion: FiSeCLIP sets a stronger baseline for ZSAD, demonstrating superior performance in anomaly classification and segmentation.

Abstract: With the advent of vision-language models (e.g., CLIP) in zero- and few-shot settings, CLIP has been widely applied to zero-shot anomaly detection (ZSAD) in recent research, where the rare classes are essential and expected in many applications. This study introduces FiSeCLIP for ZSAD with training-free CLIP, combining feature matching with cross-modal alignment. Testing with the entire dataset is impractical, while batch-based testing better aligns with real industrial needs, and images within a batch can serve as mutual reference points. Accordingly, FiSeCLIP utilizes other images in the same batch as reference information for the current image. However, since the lack of labels for these references can introduce ambiguity, we apply text information to filter out noisy features. In addition, we further explore CLIP’s inherent potential to restore its local semantic correlation, adapting it for fine-grained anomaly detection tasks to enable a more accurate filtering process. Our approach exhibits superior performance for both anomaly classification and segmentation on anomaly detection benchmarks, building a stronger baseline for the direction: e.g., on MVTec-AD, FiSeCLIP outperforms the SOTA AdaCLIP by +4.6%/+5.7% in the segmentation metrics AU-ROC/F1-max.

[116] Semantically Informed Salient Regions Guided Radiology Report Generation

Zeyi Hou, Zeqiang Wei, Ruixin Yan, Ning Lang, Xiuzhuang Zhou

Main category: cs.CV

TL;DR: SISRNet improves radiology report generation by focusing on medically critical regions, addressing data bias and enhancing accuracy.

DetailsMotivation: Existing methods produce fluent but inaccurate reports due to data bias in radiology images.

Method: SISRNet identifies salient regions using fine-grained cross-modal semantics and focuses on them during image modeling and report generation.

Result: SISRNet outperforms peers on IU-Xray and MIMIC-CXR datasets.

Conclusion: SISRNet effectively mitigates data bias and generates clinically accurate radiology reports.

Abstract: Recent advances in automated radiology report generation from chest X-rays using deep learning algorithms have the potential to significantly reduce the arduous workload of radiologists. However, due to the inherent massive data bias in radiology images, where abnormalities are typically subtle and sparsely distributed, existing methods often produce fluent yet medically inaccurate reports, limiting their applicability in clinical practice. To address this issue effectively, we propose a Semantically Informed Salient Regions-guided (SISRNet) report generation method. Specifically, our approach explicitly identifies salient regions with medically critical characteristics using fine-grained cross-modal semantics. Then, SISRNet systematically focuses on these high-information regions during both image modeling and report generation, effectively capturing subtle abnormal findings, mitigating the negative impact of data bias, and ultimately generating clinically accurate reports. Compared to its peers, SISRNet demonstrates superior performance on widely used IU-Xray and MIMIC-CXR datasets.

[117] Human-Guided Shade Artifact Suppression in CBCT-to-MDCT Translation via Schrödinger Bridge with Conditional Diffusion

Sung Ho Kang, Hyun-Cheol Park

Main category: cs.CV

TL;DR: A novel CBCT-to-MDCT translation framework using Schrodinger Bridge, GAN priors, and human-guided diffusion, ensuring anatomical fidelity and clinical preference alignment.

DetailsMotivation: To improve CBCT-to-MDCT translation by integrating human feedback and enforcing boundary consistency for better clinical outcomes.

Method: Combines GAN-derived priors with human-guided conditional diffusion, using classifier-free guidance and tournament-based preference selection.

Result: Outperforms prior methods in RMSE, SSIM, LPIPS, and Dice metrics, with only 10 sampling steps.

Conclusion: The framework is effective and efficient for real-time, preference-aligned medical image translation.

Abstract: We present a novel framework for CBCT-to-MDCT translation, grounded in the Schrodinger Bridge (SB) formulation, which integrates GAN-derived priors with human-guided conditional diffusion. Unlike conventional GANs or diffusion models, our approach explicitly enforces boundary consistency between CBCT inputs and pseudo targets, ensuring both anatomical fidelity and perceptual controllability. Binary human feedback is incorporated via classifier-free guidance (CFG), effectively steering the generative process toward clinically preferred outcomes. Through iterative refinement and tournament-based preference selection, the model internalizes human preferences without relying on a reward model. Subtraction image visualizations reveal that the proposed method selectively attenuates shade artifacts in key anatomical regions while preserving fine structural detail. Quantitative evaluations further demonstrate superior performance across RMSE, SSIM, LPIPS, and Dice metrics on clinical datasets – outperforming prior GAN- and fine-tuning-based feedback methods – while requiring only 10 sampling steps. These findings underscore the effectiveness and efficiency of our framework for real-time, preference-aligned medical image translation.
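
The classifier-free guidance step used to steer sampling toward preferred outcomes is standard and can be stated directly; the guidance scale shown is illustrative.

```python
def cfg_noise_estimate(eps_uncond, eps_cond, guidance_scale=2.0):
    """Standard classifier-free guidance combination:
    eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```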

[118] Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation

Sunghyun Park, Jungsoo Lee, Shubhankar Borse, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli

Main category: cs.CV

TL;DR: The paper introduces personalized open-vocabulary semantic segmentation (OVSS) to recognize user-specific concepts (e.g., ‘my mug cup’) using text prompt tuning and negative mask proposals, improving accuracy without degrading original OVSS performance.

DetailsMotivation: Current OVSS fails to segment regions based on personal text descriptions, limiting user-specific applications.

Method: Proposes text prompt tuning with negative mask proposals and visual embedding injection to enhance personal concept recognition.

Result: Demonstrates superior performance on new benchmarks (FSS$^\text{per}$, CUB$^\text{per}$, ADE$^\text{per}$).

Conclusion: The method effectively addresses personalized OVSS challenges while preserving original OVSS capabilities.

Abstract: While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., 'my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing 'my mug cup' among multiple 'mug cups'. To overcome this challenge, we introduce a novel task termed personalized open-vocabulary semantic segmentation and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs a 'negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.

[119] Efficient Dual-domain Image Dehazing with Haze Prior Perception

Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang

Main category: cs.CV

TL;DR: DGFDNet introduces a dual-domain framework for image dehazing, combining spatial and frequency domains with haze-aware modulation and multi-level feature fusion for superior performance and real-time efficiency.

DetailsMotivation: Transformer-based models are computationally expensive for dehazing, and existing methods struggle with complex haze conditions due to weak coupling between spatial and frequency domains.

Method: DGFDNet uses a dual-domain framework with HAFM for haze-aware frequency modulation and MGAM for multi-scale feature fusion, along with PCGB for iterative prior refinement.

Result: The method achieves state-of-the-art performance on four benchmark datasets, with robustness and real-time efficiency.

Conclusion: DGFDNet effectively addresses computational and performance limitations in dehazing by integrating spatial and frequency domains with adaptive modulation and iterative refinement.

Abstract: Transformer-based models exhibit strong global modeling capabilities in single-image dehazing, but their high computational cost limits real-time applicability. Existing methods predominantly rely on spatial-domain features to capture long-range dependencies, which are computationally expensive and often inadequate under complex haze conditions. While some approaches introduce frequency-domain cues, the weak coupling between spatial and frequency branches limits the overall performance. To overcome these limitations, we propose the Dark Channel Guided Frequency-aware Dehazing Network (DGFDNet), a novel dual-domain framework that performs physically guided degradation alignment across spatial and frequency domains. At its core, the DGFDBlock comprises two key modules: 1) the Haze-Aware Frequency Modulator (HAFM), which generates a pixel-level haze confidence map from dark channel priors to adaptively enhance haze-relevant frequency components, thereby achieving global degradation-aware spectral modulation; 2) the Multi-level Gating Aggregation Module (MGAM), which fuses multi-scale features through diverse convolutional kernels and hybrid gating mechanisms to recover fine structural details. Additionally, a Prior Correction Guidance Branch (PCGB) incorporates a closed-loop feedback mechanism, enabling iterative refinement of the prior by intermediate dehazed features and significantly improving haze localization accuracy, especially in challenging outdoor scenes. Extensive experiments on four benchmark haze datasets demonstrate that DGFDNet achieves state-of-the-art performance with superior robustness and real-time efficiency. Code is available at: https://github.com/Dilizlr/DGFDNet.
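
The dark channel prior on which HAFM builds its haze confidence map is the classical one; a minimal computation follows, with an assumed patch size.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(image, patch=15):
    """Per pixel: minimum over color channels, then over a local patch.
    Haze-free regions have near-zero dark channels, so high values flag
    haze-relevant areas.
    image: (H, W, 3) float array in [0, 1]."""
    min_rgb = image.min(axis=2)                 # min over channels
    return minimum_filter(min_rgb, size=patch)  # min over local patch
```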

[120] A Multi-View High-Resolution Foot-Ankle Complex Point Cloud Dataset During Gait for Occlusion-Robust 3D Completion

Jie-Wen Li, Zi-Han Ye, Qingyuan Zhou, Jiayi Song, Ying He, Ben Fei, Wen-Ming Chen

Main category: cs.CV

TL;DR: FootGait3D is a multi-view dataset of high-resolution ankle-foot point clouds for gait analysis, addressing occlusion challenges and enabling 3D shape completion research.

DetailsMotivation: Accurate surface geometry data of the foot-ankle complex during gait is hard to collect due to occlusions and viewing limitations, necessitating a specialized dataset.

Method: FootGait3D includes 8,403 point cloud frames from 46 subjects, captured using a five-camera depth sensing system, with complete and partial views for evaluation.

Result: The dataset supports benchmarking of 3D point cloud completion methods and advances biomechanics, gait analysis, prosthetic design, and robotics.

Conclusion: FootGait3D provides a valuable resource for detailed foot modeling and research, available for public use.

Abstract: The kinematics analysis of foot-ankle complex during gait is essential for advancing biomechanical research and clinical assessment. Collecting accurate surface geometry data from the foot and ankle during dynamic gait conditions is inherently challenging due to swing foot occlusions and viewing limitations. Thus, this paper introduces FootGait3D, a novel multi-view dataset of high-resolution ankle-foot surface point clouds captured during natural gait. Different from existing gait datasets that typically target whole-body or lower-limb motion, FootGait3D focuses specifically on the detailed modeling of the ankle-foot region, offering a finer granularity of motion data. To this end, FootGait3D comprises 8,403 point cloud frames collected from 46 subjects using a custom five-camera depth sensing system. Each frame includes a complete 5-view reconstruction of the foot and ankle (serving as ground truth) along with partial point clouds obtained from only four, three, or two views. This structured variation enables rigorous evaluation of 3D point cloud completion methods under varying occlusion levels and viewpoints. Our dataset is designed for shape completion tasks, facilitating the benchmarking of state-of-the-art single-modal (e.g., PointTr, SnowflakeNet, Anchorformer) and multi-modal (e.g., SVDFormer, PointSea, CSDN) completion networks on the challenge of recovering the full foot geometry from occluded inputs. FootGait3D has significant potential to advance research in biomechanics and multi-segment foot modeling, offering a valuable testbed for clinical gait analysis, prosthetic design, and robotics applications requiring detailed 3D models of the foot during motion. The dataset is now available at https://huggingface.co/datasets/ljw285/FootGait3D.

[121] Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

Nicolas Drapier, Aladine Chetouani, Aurélien Chateigner

Main category: cs.CV

TL;DR: GLOD is a transformer-based architecture for object detection in satellite imagery, outperforming SOTA by 11.46% on xView.

DetailsMotivation: To address challenges in high-resolution satellite imagery detection by leveraging transformers and novel feature integration.

Method: Uses Swin Transformer for feature extraction, UpConvMixer for upsampling, and Fusion Blocks with CBAM attention for multi-scale integration.

Result: Achieves 32.95% on xView, surpassing SOTA by 11.46%.

Conclusion: GLOD is effective for satellite imagery, combining spatial priors and computational efficiency.

Abstract: We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.

[122] Alleviating Textual Reliance in Medical Language-guided Segmentation via Prototype-driven Semantic Approximation

Shuchang Ye, Usman Naseem, Mingyuan Meng, Jinman Kim

Main category: cs.CV

TL;DR: ProLearn introduces a prototype-driven framework to reduce reliance on paired image-text data for medical segmentation, improving performance when text is scarce.

DetailsMotivation: Current language-guided segmentation relies on paired image-text data, limiting its use in datasets without paired reports and clinical scenarios where segmentation precedes reporting.

Method: ProLearn uses a Prototype-driven Semantic Approximation (PSA) module to approximate semantic guidance from text, enabling segmentation without paired reports.

Result: ProLearn outperforms state-of-the-art methods on QaTa-COV19, MosMedData+, and Kvasir-SEG datasets when text is limited.

Conclusion: ProLearn effectively reduces textual reliance, enhancing the applicability of language-guided segmentation in real-world clinical settings.

Abstract: Medical language-guided segmentation, integrating textual clinical reports as auxiliary guidance to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as “textual reliance”, presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, in ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19, MosMedData+ and Kvasir-SEG demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
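
A sketch of one query-and-respond step consistent with the PSA description: an image feature queries the prototype space, and the similarity-weighted response stands in for the missing text guidance. The temperature and softmax weighting are our assumptions.

```python
import torch
import torch.nn.functional as F

def approximate_guidance(image_feat, prototypes, temperature=0.07):
    """image_feat: (batch, dim); prototypes: (n_protos, dim), distilled
    from reports at initialization. Returns (batch, dim) approximated
    semantic guidance for images that arrive without reports."""
    sims = F.cosine_similarity(image_feat[:, None], prototypes[None], dim=-1)
    weights = torch.softmax(sims / temperature, dim=-1)   # query
    return weights @ prototypes                           # respond
```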

[123] Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling

Hayeon Kim, Ji Ha Jang, Se Young Chun

Main category: cs.CV

TL;DR: RoMaP is a novel framework for precise local 3D Gaussian editing, addressing challenges in multi-view part segmentation and SDS loss ambiguity with 3D-GALP and regularized SDS loss.

DetailsMotivation: Current methods struggle with precise local 3D edits in Gaussian Splatting due to inconsistent part segmentations and ambiguous SDS loss.

Method: Introduces 3D-GALP for robust 3D mask generation and a regularized SDS loss with L1 anchor loss and additional regularizers like Gaussian prior removal.

Result: Achieves state-of-the-art local 3D editing on reconstructed and generated Gaussian scenes, improving accuracy and flexibility.

Conclusion: RoMaP enables robust and flexible part-level 3D Gaussian editing, advancing the field of 3D content creation.

Abstract: Recent advances in 3D neural representations and instance-level editing models have enabled the efficient creation of high-quality 3D content. However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and inherently ambiguous nature of Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. First, we introduce a robust 3D mask generation module with our 3D-Geometry Aware Label Prediction (3D-GALP), which uses spherical harmonics (SH) coefficients to model view-dependent label variations and soft-label property, yielding accurate and consistent part segmentations across viewpoints. Second, we propose a regularized SDS loss that combines the standard SDS loss with additional regularizers. In particular, an L1 anchor loss is introduced via our Scheduled Latent Mixing and Part (SLaMP) editing method, which generates high-quality part-edited 2D images and confines modifications only to the target region while preserving contextual coherence. Additional regularizers, such as Gaussian prior removal, further improve flexibility by allowing changes beyond the existing context, and robust 3D masking prevents unintended edits. Experimental results demonstrate that our RoMaP achieves state-of-the-art local 3D editing on both reconstructed and generated Gaussian scenes and objects qualitatively and quantitatively, making it possible for more robust and flexible part-level 3D Gaussian editing.

[124] Joint angle model based learning to refine kinematic human pose estimation

Chang Peng, Yifei Zhou, Huifeng Xi, Shiqing Huang, Chuangye Chen, Jianming Yang, Bao Yang, Zhenyu Jiang

Main category: cs.CV

TL;DR: The paper proposes a joint angle-based refinement (JAR) method to improve marker-free human pose estimation (HPE) by addressing keypoint recognition errors and trajectory fluctuations.

DetailsMotivation: Current HPE methods suffer from errors and fluctuations due to inaccurate manually annotated datasets, limiting deep learning model performance.

Method: The method includes joint angle-based modeling, high-order Fourier series approximation for reliable ground truth, and a bidirectional recurrent network for refining HRNet outputs.

Result: JAR outperforms state-of-the-art HPE refinement networks, especially in challenging cases like figure skating and breaking.

Conclusion: Joint angle-based modeling and refinement significantly enhance HPE accuracy and robustness.

Abstract: Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposes a novel method to overcome the difficulty through joint angle-based modeling. The key techniques include: (i) a joint angle-based model of human pose, which is robust for describing kinematic human poses; (ii) approximating the temporal variation of joint angles through high-order Fourier series to get reliable “ground truth”; (iii) a bidirectional recurrent network designed as a post-processing module to refine the estimation of the well-established HRNet. Trained with the high-quality dataset constructed using our method, the network demonstrates outstanding performance in correcting wrongly recognized joints and smoothing their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases such as figure skating and breaking.
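
The Fourier-series smoothing of joint-angle trajectories reduces to a least-squares fit; the series order and base period below are assumptions for illustration.

```python
import numpy as np

def fit_fourier_series(t, theta, order=8):
    """Fit theta(t) with a truncated Fourier series via least squares,
    yielding a smooth 'ground truth' trajectory for training the
    refinement network.
    t: (N,) time stamps; theta: (N,) joint angles."""
    w = 2 * np.pi / (t[-1] - t[0])             # fundamental frequency over the clip
    basis = [np.ones_like(t)]
    for k in range(1, order + 1):
        basis += [np.cos(k * w * t), np.sin(k * w * t)]
    A = np.stack(basis, axis=1)                # (N, 2*order+1) design matrix
    coeffs, *_ = np.linalg.lstsq(A, theta, rcond=None)
    return A @ coeffs                          # smoothed joint angles
```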

[125] GKNet: Graph-based Keypoints Network for Monocular Pose Estimation of Non-cooperative Spacecraft

Weizhao Ma, Dong Zhou, Yuhui Hu, Zipeng He

Main category: cs.CV

TL;DR: A graph-based keypoints network (GKNet) is proposed for accurate monocular pose estimation of non-cooperative spacecraft, addressing challenges like symmetry and occlusion. A dataset (SKD) is introduced for validation, and GKNet outperforms existing methods.

DetailsMotivation: Accurate pose estimation is crucial for on-orbit service tasks, but current keypoint detectors struggle with structural symmetry and occlusion in non-cooperative spacecraft.

Method: GKNet uses geometric constraints of keypoints graphs for pose estimation. A dataset (SKD) with 90,000 simulated images and annotations is created for validation.

Result: GKNet achieves high accuracy and outperforms state-of-the-art spacecraft keypoint detectors.

Conclusion: GKNet and SKD dataset provide an effective solution for monocular pose estimation of non-cooperative spacecraft, with potential applications in on-orbit services.

Abstract: Monocular pose estimation of non-cooperative spacecraft is significant for on-orbit service (OOS) tasks, such as satellite maintenance, space debris removal, and station assembly. Considering the high demands on pose estimation accuracy, mainstream monocular pose estimation methods typically consist of keypoint detectors and PnP solver. However, current keypoint detectors remain vulnerable to structural symmetry and partial occlusion of non-cooperative spacecraft. To this end, we propose a graph-based keypoints network for the monocular pose estimation of non-cooperative spacecraft, GKNet, which leverages the geometric constraint of keypoints graph. In order to better validate keypoint detectors, we present a moderate-scale dataset for the spacecraft keypoint detection, named SKD, which consists of 3 spacecraft targets, 90,000 simulated images, and corresponding high-precise keypoint annotations. Extensive experiments and an ablation study have demonstrated the high accuracy and effectiveness of our GKNet, compared to the state-of-the-art spacecraft keypoint detectors. The code for GKNet and the SKD dataset is available at https://github.com/Dongzhou-1996/GKNet.
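
Downstream of the keypoint detector, the PnP stage is the standard one; a minimal example with OpenCV's solver, where the intrinsics and point values are placeholders.

```python
import cv2
import numpy as np

object_points = np.random.rand(11, 3).astype(np.float32)  # known 3D keypoints on the target model
image_points = np.random.rand(11, 2).astype(np.float32)   # 2D detections from the keypoint network
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], dtype=np.float32)                # placeholder camera intrinsics

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
# rvec (Rodrigues rotation) and tvec give the 6-DoF pose of the spacecraft
```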

[126] Automatic Road Subsurface Distress Recognition from Ground Penetrating Radar Images using Deep Learning-based Cross-verification

Chang Peng, Bao Yang, Meiqi Li, Ge Zhang, Hui Sun, Zhenyu Jiang

Main category: cs.CV

TL;DR: A novel cross-verification strategy using YOLO models improves RSD recognition from GPR images, achieving high recall and reducing inspection labor by 90%.

DetailsMotivation: Manual RSD recognition from GPR images is labor-intensive and expertise-dependent, while current deep learning methods suffer from dataset scarcity and limited network capability.

Method: Constructed a 3D GPR dataset with 2134 samples and proposed a cross-verification strategy based on YOLO models’ varying sensitivity to RSD types.

Result: Achieved over 98.6% recall in field tests and integrated the approach into an online system, reducing inspection labor by 90%.

Conclusion: The proposed method significantly enhances automatic RSD recognition efficiency and accuracy, offering a practical solution for road inspection.

Abstract: Ground penetrating radar (GPR) has become a rapid and non-destructive solution for road subsurface distress (RSD) detection. However, RSD recognition from GPR images is labor-intensive and relies heavily on inspectors' expertise. Deep learning offers the possibility of automatic RSD recognition, but its current performance is limited by two factors: the scarcity of high-quality datasets for network training and the insufficient capability of networks to distinguish RSD. In this study, a rigorously validated 3D GPR dataset containing 2134 samples of diverse types was constructed through field scanning. Based on the finding that a YOLO model trained on one of the three GPR scan views exhibits varying sensitivity to specific types of RSD, we propose a novel cross-verification strategy with outstanding accuracy in RSD recognition, achieving recall over 98.6% in field tests. The approach, integrated into an online RSD detection system, can reduce inspection labor by around 90%.
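
One plausible reading of the cross-verification strategy is a voting scheme over the detections of the three scan-specific models. The sketch below is an assumption about how such a merge could work; the IoU threshold, vote count, and box format are all illustrative, not the authors' exact rule.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def cross_verify(dets_per_scan, min_votes=2, iou_thr=0.5):
    """Keep a detection if boxes from >= min_votes scan-specific models
    agree on its location (a plausible reading of the strategy)."""
    kept = []
    for i, dets in enumerate(dets_per_scan):
        for box in dets:
            votes = 1 + sum(
                any(iou(box, other) >= iou_thr for other in dets_per_scan[j])
                for j in range(len(dets_per_scan)) if j != i)
            if votes >= min_votes and not any(iou(box, k) >= iou_thr for k in kept):
                kept.append(box)
    return kept

# Toy example: detections from models trained on the three GPR scan views
scan_a = [[10, 10, 50, 60]]
scan_b = [[12, 11, 52, 58], [200, 200, 240, 260]]
scan_c = [[11, 12, 49, 61]]
print(cross_verify([scan_a, scan_b, scan_c]))  # the isolated box is dropped
```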

[127] Atmos-Bench: 3D Atmospheric Structures for Climate Insight

Tianchi Xu

Main category: cs.CV

TL;DR: The paper introduces Atmos-Bench, the first 3D atmospheric benchmark, and FourCastX, a novel network for atmospheric structure recovery, outperforming existing methods without auxiliary inputs.

DetailsMotivation: Existing methods for atmospheric structure recovery rely on approximations and lack standardized benchmarks, leading to uncertainties and incomplete capture of radiative effects.

Method: FourCastX uses a frequency-enhanced spatio-temporal mixture-of-experts network, embedding physical constraints and generating high-quality 3D references from simulated data.

Result: The method achieves consistent improvements on Atmos-Bench, outperforming state-of-the-art models across 355 nm and 532 nm bands.

Conclusion: Atmos-Bench sets a new standard for 3D atmospheric recovery, enabling deeper climate insights.

Abstract: Atmospheric structure, represented by backscatter coefficients (BC) recovered from satellite LiDAR attenuated backscatter (ATB), provides a volumetric view of clouds, aerosols, and molecules, playing a critical role in human activities, climate understanding, and extreme weather forecasting. Existing methods often rely on auxiliary inputs and simplified physics-based approximations, which may introduce additional uncertainties and insufficiently capture realistic radiative transfer and atmospheric scattering-absorption effects; the field also lacks a standardized 3D benchmark for fair evaluation. To bridge these gaps, we present Atmos-Bench, the first 3D atmospheric benchmark, along with FourCastX, a novel Frequency-enhanced Spatio-Temporal Mixture-of-Experts Network that (a) generates 921,600 image slices from 3D scattering volumes simulated at 532 nm and 355 nm by coupling WRF with an enhanced COSP simulator over 384 land-ocean time steps, yielding high-quality voxel-wise references; (b) embeds ATB-BC physical constraints into the model architecture, promoting energy consistency during restoration; (c) achieves consistent improvements on the Atmos-Bench dataset across both 355 nm and 532 nm bands, outperforming state-of-the-art baseline models without relying on auxiliary inputs. Atmos-Bench establishes a new standard for satellite-based 3D atmospheric structure recovery and paves the way for deeper climate insight.

[128] A Survey on Interpretability in Visual Recognition

Qiyang Wan, Chengzhi Gao, Ruiping Wang, Xilin Chen

Main category: cs.CV

TL;DR: A systematic review of interpretability in visual recognition models, proposing a human-centered taxonomy and exploring evaluation metrics and new opportunities.

DetailsMotivation: To understand and improve the interpretability of visual recognition models for critical applications like autonomous driving and medical diagnostics.

Method: Proposes a taxonomy categorizing interpretable methods by Intent, Object, Presentation, and Methodology, and reviews evaluation metrics and emerging technologies.

Result: A coherent taxonomy for XAI methods and insights into evaluation requirements and future research directions.

Conclusion: Organizes existing research and inspires further investigation into interpretability for visual recognition models.

Abstract: In recent years, visual recognition methods have advanced significantly, finding applications across diverse fields. While researchers seek to understand the mechanisms behind the success of these models, there is also a growing impetus to deploy them in critical areas like autonomous driving and medical diagnostics to better diagnose failures, which promotes the development of interpretability research. This paper systematically reviews existing research on the interpretability of visual recognition models and proposes a taxonomy of methods from a human-centered perspective. The proposed taxonomy categorizes interpretable recognition methods based on Intent, Object, Presentation, and Methodology, thereby establishing a systematic and coherent set of grouping criteria for these XAI methods. Additionally, we summarize the requirements for evaluation metrics and explore new opportunities enabled by recent technologies, such as large multimodal models. We aim to organize existing research in this domain and inspire future investigations into the interpretability of visual recognition models.

[129] KptLLM++: Towards Generic Keypoint Comprehension with Large Language Model

Jie Yang, Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Zhen Li, Ruimao Zhang

Main category: cs.CV

TL;DR: KptLLM++ is a multimodal large language model designed for fine-grained keypoint comprehension, achieving state-of-the-art performance through a novel identify-then-detect paradigm and large-scale training.

DetailsMotivation: Existing MLLMs struggle with fine-grained semantic tasks like keypoint analysis, which is crucial for applications like object retrieval and behavior recognition.

Method: KptLLM++ uses an identify-then-detect approach with structured chain-of-thought reasoning, trained on 500K diverse samples.

Result: The model achieves remarkable accuracy and generalization, outperforming benchmarks in keypoint detection.

Conclusion: KptLLM++ offers a unified solution for fine-grained image understanding, enhancing human-AI collaboration.

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has revolutionized image understanding by bridging textual and visual modalities. However, these models often struggle with capturing fine-grained semantic information, such as the precise identification and analysis of object keypoints. Keypoints, as structure-aware, pixel-level, and compact representations of objects, particularly articulated ones, play a crucial role in applications such as fine-grained image analysis, object retrieval, and behavior recognition. In this paper, we propose KptLLM++, a novel multimodal large language model specifically designed for generic keypoint comprehension, integrating diverse input modalities guided by user-defined instructions. By unifying keypoint detection across varied contexts, KptLLM++ establishes itself as an advanced interface, fostering more effective human-AI collaboration. The model is built upon a novel identify-then-detect paradigm, which first interprets keypoint semantics and subsequently localizes their precise positions through a structured chain-of-thought reasoning mechanism. To push the boundaries of performance, we have scaled up the training dataset to over 500K samples, encompassing diverse objects, keypoint categories, image styles, and scenarios with complex occlusions. This extensive scaling enables KptLLM++ to unlock its potential, achieving remarkable accuracy and generalization. Comprehensive experiments on multiple keypoint detection benchmarks demonstrate its state-of-the-art performance, underscoring its potential as a unified solution for fine-grained image understanding and its transformative implications for human-AI interaction.

[130] Jellyfish Species Identification: A CNN Based Artificial Neural Network Approach

Md. Sabbir Hossen, Md. Saiduzzaman, Pabon Shaha, Mostofa Kamal Nasir

Main category: cs.CV

TL;DR: A deep learning framework for jellyfish species detection achieves 98% accuracy using MobileNetV3 and hybrid classifiers, aiding marine biodiversity monitoring.

DetailsMotivation: Accurate jellyfish species identification is vital for ecological monitoring and management, addressing challenges posed by their rapid proliferation.

Method: The study combines advanced feature extraction (MobileNetV3, ResNet50, etc.) with traditional and neural network classifiers, using softmax for direct species classification.

Result: The hybrid MobileNetV3 and Artificial Neural Network model achieves 98% accuracy, outperforming other combinations.

Conclusion: Deep learning and hybrid frameworks effectively address biodiversity challenges, enhancing species detection in marine environments.

Abstract: Jellyfish, a diverse group of gelatinous marine organisms, play a crucial role in maintaining marine ecosystems but pose significant challenges for biodiversity and conservation due to their rapid proliferation and ecological impact. Accurate identification of jellyfish species is essential for ecological monitoring and management. In this study, we propose a deep learning framework for jellyfish species detection and classification using an underwater image dataset. The framework integrates advanced feature extraction techniques, including MobileNetV3, ResNet50, EfficientNetV2-B0, and VGG16, combined with seven traditional machine learning classifiers and three feedforward neural network classifiers for precise species identification. Additionally, we applied a softmax output layer to the convolutional neural network models to classify jellyfish species directly. The combination of the Artificial Neural Network with MobileNetV3 is our best-performing model, achieving an exceptional accuracy of 98% and significantly outperforming other feature extractor-classifier combinations. This study demonstrates the efficacy of deep learning and hybrid frameworks in addressing biodiversity challenges and advancing species detection in marine environments.
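
The best-performing hybrid, a frozen MobileNetV3 feature extractor feeding a small feedforward classifier, is straightforward to assemble in PyTorch. The head sizes and class count below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Frozen MobileNetV3 backbone as a feature extractor, with a small
# feedforward (ANN) head on top -- the hybrid pairing reported as best.
backbone = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
backbone.classifier = nn.Identity()           # expose 960-d pooled features
for p in backbone.parameters():
    p.requires_grad = False

num_species = 6                               # placeholder class count
head = nn.Sequential(
    nn.Linear(960, 256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, num_species),              # softmax applied via the loss
)

x = torch.randn(8, 3, 224, 224)               # a batch of underwater crops
with torch.no_grad():
    feats = backbone(x)
logits = head(feats)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, num_species, (8,)))
```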

[131] Try Harder: Hard Sample Generation and Learning for Clothes-Changing Person Re-ID

Hankun Liu, Yujian Zhao, Guanglin Niu

Main category: cs.CV

TL;DR: The paper introduces HSGL, a multimodal framework for defining, generating, and optimizing hard samples in clothing-changing person Re-ID, improving robustness and performance.

DetailsMotivation: Hard samples in CC-ReID lack explicit definitions and hinder model robustness. The paper aims to address this by leveraging multimodal cues.

Method: HSGL includes DGHSG for generating hard samples using multimodal cues and HSAL for hardness-aware optimization.

Result: HSGL achieves state-of-the-art performance on PRCC and LTCC datasets, accelerating convergence.

Conclusion: Multimodal-guided hard sample generation and learning enhances CC-ReID robustness and discriminative capability.

Abstract: Hard samples pose a significant challenge in person re-identification (ReID) tasks, particularly in clothing-changing person Re-ID (CC-ReID). Their inherent ambiguity or similarity, coupled with the lack of explicit definitions, makes them a fundamental bottleneck. These issues not only limit the design of targeted learning strategies but also diminish the model’s robustness under clothing or viewpoint changes. In this paper, we propose a novel multimodal-guided Hard Sample Generation and Learning (HSGL) framework, which is the first effort to unify textual and visual modalities to explicitly define, generate, and optimize hard samples within a unified paradigm. HSGL comprises two core components: (1) Dual-Granularity Hard Sample Generation (DGHSG), which leverages multimodal cues to synthesize semantically consistent samples, including both coarse- and fine-grained hard positives and negatives for effectively increasing the hardness and diversity of the training data. (2) Hard Sample Adaptive Learning (HSAL), which introduces a hardness-aware optimization strategy that adjusts feature distances based on textual semantic labels, encouraging the separation of hard positives and drawing hard negatives closer in the embedding space to enhance the model’s discriminative capability and robustness to hard samples. Extensive experiments on multiple CC-ReID benchmarks demonstrate the effectiveness of our approach and highlight the potential of multimodal-guided hard sample generation and learning for robust CC-ReID. Notably, HSAL significantly accelerates the convergence of the targeted learning procedure and achieves state-of-the-art performance on both PRCC and LTCC datasets. The code is available at https://github.com/undooo/TryHarder-ACMMM25.
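
As a rough illustration of hardness-aware optimization, the sketch below scales a triplet margin by a per-sample hardness score. This is a generic hardness-weighted loss in the spirit of HSAL with hypothetical inputs; the paper's objective, which adjusts distances based on textual semantic labels, is not reproduced here.

```python
import torch
import torch.nn.functional as F

def hardness_aware_triplet(anchor, positive, negative, hardness, margin=0.3):
    """Triplet margin loss whose per-sample margin grows with a hardness
    score in [0, 1], so harder triplets receive a stronger constraint.
    A generic hardness-weighted loss, not the paper's exact HSAL."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    adaptive_margin = margin * (1.0 + hardness)   # up to 2x the base margin
    return F.relu(d_ap - d_an + adaptive_margin).mean()

# Toy usage: 128-d embeddings; hardness could come from multimodal cues
B, D = 16, 128
a, p, n = (torch.randn(B, D, requires_grad=True) for _ in range(3))
hardness = torch.rand(B)                          # 0 = easy, 1 = very hard
loss = hardness_aware_triplet(a, p, n, hardness)
loss.backward()
```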

[132] MMOne: Representing Multiple Modalities in One Scene

Zhifeng Gu, Bing Wang

Main category: cs.CV

TL;DR: The paper introduces MMOne, a framework for multimodal scene representation, addressing modality conflicts like property and granularity disparities through a modality modeling module and decomposition mechanism.

DetailsMotivation: Humans use multimodal cues to understand the world, but modality conflicts (property and granularity disparities) hinder effective scene representation.

Method: Proposes MMOne with a modality modeling module (using modality indicators) and a decomposition mechanism to separate multimodal Gaussians into single-modal ones, disentangling shared and modality-specific components.

Result: MMOne improves representation for each modality and scales to additional modalities, as shown in experiments.

Conclusion: MMOne offers a compact, efficient solution for multimodal scene representation, addressing modality conflicts effectively.

Abstract: Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.

[133] RMAU-NET: A Residual-Multihead-Attention U-Net Architecture for Landslide Segmentation and Detection from Remote Sensing Images

Lam Pham, Cam Le, Hieu Tang, Khang Truong, Truong Nguyen, Jasmin Lampert, Alexander Schindler, Martin Boyer, Son Phan

Main category: cs.CV

TL;DR: A deep-learning model for landslide detection and segmentation using remote sensing images, achieving high F1 and mIoU scores on benchmark datasets.

DetailsMotivation: Frequent landslide disasters due to extreme weather and human activities, coupled with challenges in large-area observation, drive the need for automated solutions.

Method: Proposed an end-to-end deep-learning model for landslide detection and segmentation, leveraging remote sensing images.

Result: Achieved F1 scores of 98.23 and 93.83 for detection, and mIoU scores of 63.74 and 76.88 for segmentation on benchmark datasets.

Conclusion: The model shows potential for real-life landslide observation systems.

Abstract: In recent years, landslide disasters have been reported frequently due to extreme weather events such as droughts, floods, and storms, or as a consequence of human activities such as deforestation and the excessive exploitation of natural resources. However, automatic landslide observation is challenging due to the extremely large observation areas and rugged topography such as mountains and highlands. This motivates us to propose an end-to-end deep-learning-based model that explores remote sensing images to observe landslide events automatically. By using remote sensing images as input, we can draw on freely available data and observe large, rugged terrains over time. To exploit the remote sensing images, we propose a novel neural network architecture for the two tasks of landslide detection and landslide segmentation. We evaluated our proposed model on three benchmark datasets: LandSlide4Sense, Bijie, and Nepal. Through extensive experiments, we achieve F1 scores of 98.23 and 93.83 for landslide detection on the LandSlide4Sense and Bijie datasets, and mIoU scores of 63.74 and 76.88 for segmentation on the LandSlide4Sense and Nepal datasets. These results demonstrate the potential of integrating our proposed model into real-life landslide observation systems.
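
The architecture's name suggests residual convolutions combined with multi-head attention inside a U-Net. The block below is a speculative sketch of those two ingredients in PyTorch, not the authors' actual wiring.

```python
import torch
import torch.nn as nn

class ResidualMHABlock(nn.Module):
    """A residual conv block followed by multi-head self-attention over the
    spatial grid -- the two ingredients the architecture's name suggests;
    the authors' actual U-Net wiring is not reproduced here."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):
        x = torch.relu(x + self.conv(x))          # residual convolution
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) tokens
        attn_out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + attn_out)           # residual attention
        return seq.transpose(1, 2).reshape(b, c, h, w)

feats = ResidualMHABlock(64)(torch.randn(2, 64, 32, 32))
```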

[134] Assessing Color Vision Test in Large Vision-language Models

Hongfei Ye, Bin Chen, Wenxi Liu, Yu Zhang, Zhao Li, Dandan Ni, Hongyang Chen

Main category: cs.CV

TL;DR: The paper explores color vision abilities in large vision-language models, introduces a testing task and dataset, analyzes errors, and suggests fine-tuning strategies for improvement.

DetailsMotivation: The color vision capabilities of large vision-language models are understudied, prompting the need for a systematic evaluation and enhancement.

Method: A color vision testing task is defined, and a diverse dataset is constructed. Error analysis is performed, and fine-tuning strategies are proposed.

Result: The study identifies common errors in models and suggests methods to improve their color vision performance.

Conclusion: Fine-tuning strategies can enhance the color vision abilities of large vision-language models, addressing gaps in their current capabilities.

Abstract: With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (sample data available at https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.

[135] Clustering-Guided Multi-Layer Contrastive Representation Learning for Citrus Disease Classification

Jun Chen, Yonghua Yu, Weifu Li, Yaohui Chen, Hong Chen

Main category: cs.CV

TL;DR: The paper proposes a self-supervised learning method (CMCRL) for citrus disease detection, leveraging unannotated data and outperforming existing methods by 4.5%-30.1% in accuracy.

DetailsMotivation: Citrus diseases cause significant yield losses, and current AI methods require extensive annotated data. The paper aims to reduce reliance on labeled data while improving detection accuracy.

Method: Introduces CMCRL, a clustering-guided self-supervised learning algorithm with contrasting cluster centroids and multi-layer contrastive training (MCT).

Result: Achieves state-of-the-art performance on the CDD dataset, narrowing the gap with fully supervised methods and excelling in metrics like F1 score, precision, and recall.

Conclusion: CMCRL offers a robust, efficient solution for citrus disease detection, reducing dependency on labeled data and addressing class imbalance.

Abstract: Citrus, as one of the most economically important fruit crops globally, suffers severe yield losses due to various diseases. Accurate disease detection and classification serve as critical prerequisites for implementing targeted control measures. Recent advancements in artificial intelligence, particularly deep learning-based computer vision algorithms, have substantially decreased the time and labor requirements of detection and classification while maintaining accuracy. Nevertheless, these methods predominantly rely on massive, high-quality annotated training examples to attain promising performance. By introducing two key designs, contrasting with cluster centroids and a multi-layer contrastive training (MCT) paradigm, this paper proposes a novel clustering-guided self-supervised multi-layer contrastive representation learning (CMCRL) algorithm. The proposed method demonstrates several advantages over existing counterparts: (1) optimization with massive unannotated samples; (2) effective adaptation to the symptom similarity across distinct citrus diseases; (3) hierarchical feature representation learning. The proposed method achieves state-of-the-art performance on the public citrus image set CDD, outperforming existing methods by 4.5%-30.1% in accuracy. Remarkably, our method narrows the performance gap with fully supervised counterparts (in which all samples are labeled). Beyond classification accuracy, our method performs strongly on other evaluation metrics (F1 score, precision, and recall), highlighting its robustness against the class imbalance challenge.
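
The "contrasting with cluster centroids" idea can be rendered as an InfoNCE-style loss in which each embedding's assigned centroid is the positive and the remaining centroids are negatives. The sketch below assumes k-means-style pseudo-labels and a temperature of 0.1; both are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def centroid_contrastive_loss(z, centroids, assignments, tau=0.1):
    """InfoNCE-style loss contrasting each embedding with cluster centroids:
    its assigned centroid is the positive, all other centroids negatives.
    A generic rendering of 'contrasting with cluster centroids'."""
    z = F.normalize(z, dim=1)
    c = F.normalize(centroids, dim=1)
    logits = z @ c.t() / tau          # (N, K) similarity to each centroid
    return F.cross_entropy(logits, assignments)

# Toy usage: pseudo-labels from a naive nearest-centroid assignment step
N, D, K = 64, 128, 10
z = torch.randn(N, D, requires_grad=True)
with torch.no_grad():
    centroids = torch.randn(K, D)     # stand-in for k-means centroids
    assignments = torch.cdist(z, centroids).argmin(dim=1)
loss = centroid_contrastive_loss(z, centroids, assignments)
loss.backward()
```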

[136] How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study

Che Liu, Jiazhen Pan, Weixiang Shen, Wenjia Bai, Daniel Rueckert, Rossella Arcucci

Main category: cs.CV

TL;DR: The paper evaluates general-purpose and medical-specific Vision-Language Models (VLMs) on medical benchmarks, finding that large general-purpose models often outperform medical-specific ones, though reasoning lags behind understanding, and no model is clinically reliable yet.

DetailsMotivation: To assess the competence of VLMs in medical tasks, given their increasing use in healthcare despite limited exploration of their medical task performance.

Method: Evaluated VLMs (3B to 72B parameters) on eight medical benchmarks, separating performance into understanding and reasoning components.

Result: General-purpose models match or surpass medical-specific ones; reasoning is weaker than understanding; performance varies by benchmark.

Conclusion: No model is clinically reliable yet, highlighting the need for better multimodal alignment and evaluation protocols.

Abstract: Vision-Language Models (VLMs) trained on web-scale corpora excel at natural image tasks and are increasingly repurposed for healthcare; however, their competence in medical tasks remains underexplored. We present a comprehensive evaluation of open-source general-purpose and medically specialised VLMs, ranging from 3B to 72B parameters, across eight benchmarks: MedXpert, OmniMedVQA, PMC-VQA, PathVQA, MMMU, SLAKE, and VQA-RAD. To observe model performance across different aspects, we first separate it into understanding and reasoning components. Three salient findings emerge. First, large general-purpose models already match or surpass medical-specific counterparts on several benchmarks, demonstrating strong zero-shot transfer from natural to medical images. Second, reasoning performance is consistently lower than understanding, highlighting a critical barrier to safe decision support. Third, performance varies widely across benchmarks, reflecting differences in task design, annotation quality, and knowledge demands. No model yet reaches the reliability threshold for clinical deployment, underscoring the need for stronger multimodal alignment and more rigorous, fine-grained evaluation protocols.

[137] A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition

Xinkui Zhao, Jinsong Shu, Yangyang Wu, Guanjie Cheng, Zihe Liu, Naibo Wang, Shuiguang Deng, Zhongle Xie, Jianwei Yin

Main category: cs.CV

TL;DR: MCULoRA is a novel framework for efficient training of incomplete multimodal learning models, addressing gradient conflicts in existing methods by decoupling shared and distinct modality information and dynamically adjusting training ratios.

DetailsMotivation: Existing MER methods struggle with incomplete multimodality due to conflicting training gradients from different modality combinations, degrading performance.

Method: MCULoRA uses two modules: MCLA (decouples shared/distinct modality information) and DPFT (dynamically adjusts training ratios based on modality separability).

Result: MCULoRA outperforms previous methods in downstream task accuracy across multiple benchmark datasets.

Conclusion: MCULoRA effectively addresses gradient conflicts and improves performance in incomplete multimodal learning scenarios.

Abstract: Multimodal Emotion Recognition (MER) often encounters incomplete multimodality in practical applications due to sensor failures or privacy protection requirements. While existing methods attempt to address various incomplete multimodal scenarios by balancing the training of each modality combination through additional gradients, these approaches face a critical limitation: training gradients from different modality combinations conflict with each other, ultimately degrading the performance of the final prediction model. In this paper, we propose a unimodal decoupled dynamic low-rank adaptation method based on modality combinations, named MCULoRA, which is a novel framework for the parameter-efficient training of incomplete multimodal learning models. MCULoRA consists of two key modules, modality combination aware low-rank adaptation (MCLA) and dynamic parameter fine-tuning (DPFT). The MCLA module effectively decouples the shared information from the distinct characteristics of individual modality combinations. The DPFT module adjusts the training ratio of modality combinations based on the separability of each modality’s representation space, optimizing the learning efficiency across different modality combinations. Our extensive experimental evaluation in multiple benchmark datasets demonstrates that MCULoRA substantially outperforms previous incomplete multimodal learning approaches in downstream task accuracy.
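
A speculative sketch of the decoupling idea: a frozen base layer plus one shared low-rank adapter and one adapter per modality combination, routed at forward time. The class below is an assumption about how such a layer could look, not the authors' MCLA module.

```python
import torch
import torch.nn as nn

class CombinationAwareLoRA(nn.Module):
    """A frozen base linear layer plus one shared low-rank adapter and one
    adapter per modality combination -- a sketch of the decoupling idea in
    MCULoRA, not the authors' implementation."""
    def __init__(self, d_in, d_out, combos, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.shared_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.shared_B = nn.Parameter(torch.zeros(d_out, rank))
        self.combo_A = nn.ParameterDict(
            {c: nn.Parameter(torch.randn(rank, d_in) * 0.01) for c in combos})
        self.combo_B = nn.ParameterDict(
            {c: nn.Parameter(torch.zeros(d_out, rank)) for c in combos})

    def forward(self, x, combo):
        shared = x @ self.shared_A.t() @ self.shared_B.t()
        specific = x @ self.combo_A[combo].t() @ self.combo_B[combo].t()
        return self.base(x) + shared + specific

# Usage: route by which modalities are present in the current batch
layer = CombinationAwareLoRA(256, 256, combos=["av", "a", "v"])
out = layer(torch.randn(4, 256), combo="a")   # audio-only batch
```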

[138] NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models

X. Feng, H. Yu, M. Wu, S. Hu, J. Chen, C. Zhu, J. Wu, X. Chu, K. Huang

Main category: cs.CV

TL;DR: The paper introduces NarrLV, the first benchmark for evaluating narrative expression in long video generation models, using Temporal Narrative Atoms (TNAs) and a novel MLLM-based metric.

DetailsMotivation: Current benchmarks lack focus on narrative richness in long videos, limiting the evaluation of advanced video generation models.

Method: Proposes TNAs to measure narrative richness, an automatic prompt generation pipeline, and an MLLM-based evaluation metric.

Result: NarrLV aligns with human judgments and reveals capability boundaries of current models in narrative expression.

Conclusion: NarrLV provides a comprehensive benchmark for assessing narrative capabilities in long video generation, filling a critical gap in the field.

Abstract: With the rapid development of foundation video generation technologies, long video generation models have exhibited promising research potential thanks to expanded content creation space. Recent studies reveal that the goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. However, due to the lack of evaluation benchmarks specifically designed for long video generation models, the current assessment of these models primarily relies on benchmarks with simple narrative prompts (e.g., VBench). To the best of our knowledge, our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models. Inspired by film narrative theory, (i) we first introduce the basic narrative unit maintaining continuous visual presentation in videos as Temporal Narrative Atom (TNA), and use its count to quantitatively measure narrative richness. Guided by three key film narrative elements influencing TNA changes, we construct an automatic prompt generation pipeline capable of producing evaluation prompts with a flexibly expandable number of TNAs. (ii) Then, based on the three progressive levels of narrative content expression, we design an effective evaluation metric using the MLLM-based question generation and answering framework. (iii) Finally, we conduct extensive evaluations on existing long video generation models and the foundation generation models. Experimental results demonstrate that our metric aligns closely with human judgments. The derived evaluation outcomes reveal the detailed capability boundaries of current video generation models in narrative content expression.

[139] Fairness-Aware Grouping for Continuous Sensitive Variables: Application for Debiasing Face Analysis with respect to Skin Tone

Veronika Shilova, Emmanuel Malherbe, Giovanni Palma, Laurent Risser, Jean-Michel Loubes

Main category: cs.CV

TL;DR: Proposes a fairness-based grouping method for continuous sensitive attributes to better identify and address discrimination in datasets and models.

DetailsMotivation: Existing fairness assessments often overlook discrimination in continuous sensitive attributes (e.g., skin color) due to predefined group divisions.

Method: Groups data based on observed discrimination levels, maximizing inter-group variance to isolate critical subgroups. Validated on synthetic and real datasets (CelebA, FFHQ).

Result: Uncovers nuanced discrimination patterns, stable across datasets. Improves fairness with minimal accuracy loss when used for debiasing.

Conclusion: The method effectively identifies and mitigates discrimination in continuous sensitive attributes, enabling practical industrial deployment.

Abstract: Within a legal framework, fairness in datasets and models is typically assessed by dividing observations into predefined groups and then computing fairness measures (e.g., Disparate Impact or Equality of Odds with respect to gender). However, when sensitive attributes such as skin color are continuous, dividing into default groups may overlook or obscure the discrimination experienced by certain minority subpopulations. To address this limitation, we propose a fairness-based grouping approach for continuous (possibly multidimensional) sensitive attributes. By grouping data according to observed levels of discrimination, our method identifies the partition that maximizes a novel criterion based on inter-group variance in discrimination, thereby isolating the most critical subgroups. We validate the proposed approach using multiple synthetic datasets and demonstrate its robustness under changing population distributions - revealing how discrimination is manifested within the space of sensitive attributes. Furthermore, we examine a specialized setting of monotonic fairness for the case of skin color. Our empirical results on both CelebA and FFHQ, leveraging the skin tone as predicted by an industrial proprietary algorithm, show that the proposed segmentation uncovers more nuanced patterns of discrimination than previously reported, and that these findings remain stable across datasets for a given model. Finally, we leverage our grouping model for debiasing purpose, aiming at predicting fair scores with group-by-group post-processing. The results demonstrate that our approach improves fairness while having minimal impact on accuracy, thus confirming our partition method and opening the door for industrial deployment.
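
A simplified rendering of the grouping idea: bin the continuous attribute, estimate discrimination per bin, then cluster bins by that level, which for one-dimensional values approximately maximizes inter-group variance. The binning scheme, per-sample discrimination proxy, and use of k-means below are all assumptions, not the paper's exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def fairness_grouping(sensitive, disadvantage, n_bins=20, n_groups=3):
    """Group a continuous sensitive attribute (e.g., skin tone) by observed
    discrimination rather than by predefined thresholds: bin the attribute,
    estimate per-bin discrimination, then cluster bins on that level."""
    edges = np.quantile(sensitive, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.digitize(sensitive, edges[1:-1]), 0, n_bins - 1)
    disc = np.array([disadvantage[bin_idx == b].mean() for b in range(n_bins)])
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(disc[:, None])
    return labels[bin_idx], disc  # group id per sample, per-bin level

# Toy example: discrimination rises nonlinearly with the attribute
rng = np.random.default_rng(0)
tone = rng.uniform(0, 1, 5000)
gap = 0.2 * (tone > 0.7) + 0.05 * tone + rng.normal(0, 0.02, 5000)
groups, levels = fairness_grouping(tone, gap)
```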

[140] ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He

Main category: cs.CV

TL;DR: ViewSRD improves 3D visual grounding by decomposing complex queries into simpler statements and integrating multi-view textual-scene interactions.

DetailsMotivation: Existing methods struggle with complex multi-anchor queries and spatial inconsistencies due to perspective variations.

Method: ViewSRD uses Simple Relation Decoupling (SRD) to simplify queries, Multi-view Textual-Scene Interaction (Multi-TSI) with CCVTs for cross-modal consistency, and a reasoning module for unified predictions.

Result: ViewSRD outperforms state-of-the-art methods, especially in complex spatial queries.

Conclusion: The framework effectively addresses challenges in 3D visual grounding, enhancing accuracy in complex scenarios.

Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.

[141] YOLOatr : Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery

Aon Safdar, Usman Akram, Waseem Anwar, Basit Malik, Mian Ibad Ali

Main category: cs.CV

TL;DR: The paper addresses challenges in Automatic Target Detection and Recognition (ATD/ATR) from Thermal Infrared (TI) imagery, proposing YOLOatr, a modified YOLOv5s model, achieving 99.6% accuracy.

DetailsMotivation: The defense and surveillance domain faces unique challenges in ATR due to limited datasets, hardware constraints, and environmental factors, causing current deep learning models to underperform.

Method: A modified YOLOv5s model (YOLOatr) with optimized detection heads, feature fusion, and custom augmentation is proposed.

Result: YOLOatr achieves state-of-the-art ATR performance of up to 99.6% on the DSIAC MWIR dataset.

Conclusion: The proposed YOLOatr model effectively addresses ATR challenges in TI imagery, outperforming existing methods.

Abstract: Automatic Target Detection (ATD) and Recognition (ATR) from Thermal Infrared (TI) imagery in the defense and surveillance domain is a challenging computer vision (CV) task compared to the commercial autonomous vehicle perception domain. Limited datasets and peculiar domain- and TI-modality-specific challenges, i.e., limited hardware, scale invariance issues due to greater distances, deliberate occlusion by tactical vehicles, lower sensor resolution with a resultant lack of structural information in targets, effects of weather, temperature, and time-of-day variations, and varying target-to-clutter ratios, all result in increased intra-class variability and higher inter-class similarity, making accurate real-time ATR a challenging CV task. As a result, contemporary state-of-the-art (SOTA) deep learning architectures underperform in the ATR domain. We propose a modified anchor-based single-stage detector, called YOLOatr, based on a modified YOLOv5s, with optimal modifications to the detection heads, feature fusion in the neck, and a custom augmentation profile. We evaluate the performance of our proposed model on the comprehensive DSIAC MWIR dataset for real-time ATR over both correlated and decorrelated testing protocols. The results demonstrate that our proposed model achieves state-of-the-art ATR performance of up to 99.6%.

[142] Tomato Multi-Angle Multi-Pose Dataset for Fine-Grained Phenotyping

Yujie Zhang, Sabine Struckmeyer, Andreas Kolb, Sven Reichardt

Main category: cs.CV

TL;DR: TomatoMAP is a dataset for Solanum lycopersicum, using IoT-based imaging and standardized protocols, validated by deep learning models achieving expert-level accuracy.

DetailsMotivation: To address observer bias and inconsistencies in traditional plant phenotyping methods.

Method: Developed TomatoMAP with 64,464 RGB images and annotations, validated using a cascading deep learning framework (MobileNetv3, YOLOv11, MaskRCNN).

Result: Models trained on TomatoMAP achieved accuracy and speed comparable to human experts, confirmed by Cohen’s Kappa and inter-rater agreement.

Conclusion: TomatoMAP enables reliable, automated fine-grained plant phenotyping, overcoming traditional limitations.

Abstract: Observer bias and inconsistencies in traditional plant phenotyping methods limit the accuracy and reproducibility of fine-grained plant analysis. To overcome these challenges, we developed TomatoMAP, a comprehensive dataset for Solanum lycopersicum using an Internet of Things (IoT) based imaging system with standardized data acquisition protocols. Our dataset contains 64,464 RGB images that capture 12 different plant poses from four camera elevation angles. Each image includes manually annotated bounding boxes for seven regions of interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits, axillary shoot, shoot, and whole plant area, along with 50 fine-grained growth stage classifications based on the BBCH scale. Additionally, we provide a 3,616-image high-resolution subset with pixel-wise semantic and instance segmentation annotations for fine-grained phenotyping. We validated our dataset using a cascading deep learning framework combining MobileNetv3 for classification, YOLOv11 for object detection, and MaskRCNN for segmentation. Through an AI vs. human analysis involving five domain experts, we demonstrate that models trained on our dataset achieve accuracy and speed comparable to the experts. Cohen's Kappa and inter-rater agreement heatmaps confirm the reliability of automated fine-grained phenotyping using our approach.

[143] Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

An-Lun Liu, Yu-Wei Chao, Yi-Ting Chen

Main category: cs.CV

TL;DR: The paper introduces task-oriented human grasp synthesis, using task-aware contact maps to improve grasp quality and task performance.

DetailsMotivation: Traditional grasp synthesis lacks task and context awareness, limiting its practical application.

Method: A two-stage pipeline: (1) constructs task-aware contact maps considering scene and task, (2) synthesizes grasps using these maps.

Result: Experiments show significant improvements in grasp quality and task performance over existing methods.

Conclusion: Task and scene awareness are critical for accurate grasp synthesis, validated by a new dataset and metric.

Abstract: In this paper, we study task-oriented human grasp synthesis, a new grasp synthesis task that demands both task and context awareness. At the core of our method is the task-aware contact maps. Unlike traditional contact maps that only reason about the manipulated object and its relation with the hand, our enhanced maps take into account scene and task information. This comprehensive map is critical for hand-object interaction, enabling accurate grasping poses that align with the task. We propose a two-stage pipeline that first constructs a task-aware contact map informed by the scene and task. In the subsequent stage, we use this contact map to synthesize task-oriented human grasps. We introduce a new dataset and a metric for the proposed task to evaluate our approach. Our experiments validate the importance of modeling both scene and task, demonstrating significant improvements over existing methods in both grasp quality and task performance. See our project page for more details: https://hcis-lab.github.io/TOHGS/

[144] Detección y Cuantificación de Erosión Fluvial con Visión Artificial (Detection and Quantification of Fluvial Erosion with Computer Vision)

Paúl Maji, Marlon Túquerres, Stalin Valencia, Marcela Valenzuela, Christian Mejia-Escobar

Main category: cs.CV

TL;DR: The paper proposes an AI-based method using YOLOv11 and LiDAR images to automatically detect and quantify fluvial erosion, achieving 70% accuracy, and introduces the EROSCAN web tool for practical use.

DetailsMotivation: Traditional methods for detecting and monitoring fluvial erosion are manual and require expertise. The study aims to automate this process using AI to improve efficiency and accuracy.

Method: The study fine-tunes YOLOv11, a computer vision model, trained with photographs and LiDAR images. Data was segmented and labeled using Roboflow.

Result: The method achieves 70% accuracy in detecting erosion patterns and reliably estimates eroded areas in pixels and square meters.

Conclusion: The developed EROSCAN system automates erosion detection, aiding risk management and territorial planning through an interactive web application.

Abstract: Fluvial erosion is a natural process that can generate significant impacts on soil stability and strategic infrastructures. The detection and monitoring of this phenomenon is traditionally addressed by photogrammetric methods and analysis in geographic information systems. These tasks require specific knowledge and intensive manual processing. This study proposes an artificial intelligence-based approach for automatic identification of eroded zones and estimation of their area. The state-of-the-art computer vision model YOLOv11, adjusted by fine-tuning and trained with photographs and LiDAR images, is used. This combined dataset was segmented and labeled using the Roboflow platform. Experimental results indicate efficient detection of erosion patterns with an accuracy of 70%, precise identification of eroded areas and reliable calculation of their extent in pixels and square meters. As a final product, the EROSCAN system has been developed, an interactive web application that allows users to upload images and obtain automatic segmentations of fluvial erosion, together with the estimated area. This tool optimizes the detection and quantification of the phenomenon, facilitating decision making in risk management and territorial planning.
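
Converting a predicted erosion mask into square meters only requires the ground sampling distance (GSD). A minimal sketch, assuming a binary mask from the segmentation model and a known GSD; the values below are illustrative.

```python
import numpy as np

def eroded_area(mask, gsd_m_per_px):
    """Convert a binary erosion mask from a segmentation model (e.g.,
    YOLOv11-seg) into eroded area, given the ground sampling distance.
    gsd_m_per_px is assumed known from flight altitude / LiDAR metadata."""
    n_px = int(mask.sum())
    return n_px, n_px * gsd_m_per_px ** 2

# Toy example: a 120x80-pixel eroded patch at 5 cm/pixel resolution
mask = np.zeros((512, 512), dtype=bool)
mask[100:220, 300:380] = True
pixels, m2 = eroded_area(mask, gsd_m_per_px=0.05)
print(f"{pixels} px ~ {m2:.1f} m^2")   # 9600 px ~ 24.0 m^2
```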

[145] A Mixed-Primitive-based Gaussian Splatting Method for Surface Reconstruction

Haoxuan Qu, Yujun Cai, Hossein Rahmani, Ajay Kumar, Junsong Yuan, Jun Liu

Main category: cs.CV

TL;DR: The paper proposes a novel Gaussian Splatting (GS) framework using multiple primitive types for improved surface reconstruction, addressing limitations of single-primitive methods.

DetailsMotivation: Existing GS-based methods use only one type of splatting primitive, which is insufficient for high-quality representation of diverse 3D object surfaces.

Method: The framework introduces a compositional splatting strategy, mixed-primitive initialization, and vertex pruning to leverage multiple primitives in GS.

Result: Extensive experiments demonstrate the framework’s efficacy and accurate surface reconstruction performance.

Conclusion: The proposed method enhances GS by incorporating multiple primitives, improving surface representation quality.

Abstract: Recently, Gaussian Splatting (GS) has received a lot of attention in surface reconstruction. However, while 3D objects in the real world can have complex and diverse shapes, existing GS-based methods use only a single type of splatting primitive (Gaussian ellipse or Gaussian ellipsoid) to represent object surfaces during reconstruction. In this paper, we highlight that this can be insufficient for representing object surfaces in high quality. Thus, we propose a novel framework that, for the first time, enables Gaussian Splatting to incorporate multiple types of (geometrical) primitives during its surface reconstruction process. Specifically, we first propose a compositional splatting strategy, enabling the splatting and rendering of different types of primitives in the Gaussian Splatting pipeline. In addition, we equip our framework with a mixed-primitive-based initialization strategy and a vertex pruning mechanism so that its surface representation learning proceeds effectively while leveraging different types of primitives. Extensive experiments show the efficacy of our framework and its accurate surface reconstruction performance.

[146] MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

Jianfei Jiang, Qiankun Liu, Haochen Yu, Hongyuan Liu, Liyong Wang, Jiansheng Chen, Huimin Ma

Main category: cs.CV

TL;DR: MonoMVSNet integrates monocular depth and feature priors into MVS to improve performance in challenging regions like textureless areas.

DetailsMotivation: Existing MVS methods struggle with feature matching in difficult regions, while monocular depth estimation excels there.

Method: Uses monocular features and depth, cross-view position encoding, dynamic depth candidate updates, and a relative consistency loss.

Result: Achieves state-of-the-art performance on DTU and Tanks-and-Temples datasets.

Conclusion: MonoMVSNet effectively bridges the gap between monocular and multi-view depth estimation.

Abstract: Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position encoding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks. The source code is available at https://github.com/JianfeiJ/MonoMVSNet.
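
Monocular depth is typically affine-invariant, so aligning it to multi-view depths usually means solving for a scale and shift. The sketch below shows that standard least-squares alignment; the paper's dynamic candidate update builds on an alignment of this kind but is not reproduced here.

```python
import numpy as np

def align_mono_depth(mono, mvs, valid):
    """Least-squares scale/shift aligning an (affine-invariant) monocular
    depth map to sparse MVS depths at reliable pixels."""
    m, g = mono[valid], mvs[valid]
    A = np.stack([m, np.ones_like(m)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s * mono + t

# Toy example: recover scale 2.5 and shift 0.3 from noisy correspondences
rng = np.random.default_rng(1)
mono = rng.uniform(0.1, 1.0, size=(64, 64))
mvs = 2.5 * mono + 0.3 + rng.normal(0, 0.01, mono.shape)
valid = rng.uniform(size=mono.shape) < 0.2        # sparse reliable pixels
aligned = align_mono_depth(mono, mvs, valid)
```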

[147] UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks

Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Shawn Shen

Main category: cs.CV

TL;DR: The paper introduces UGC-VideoCap, a benchmark and model framework for omnimodal video captioning, addressing the lack of audio-visual integration in existing methods.

DetailsMotivation: Existing video captioning benchmarks and models are visual-centric, neglecting audio's role in scene dynamics and narrative context. This gap hinders multimodal video understanding.

Method: UGC-VideoCap includes 1000 TikTok videos with balanced audio-visual annotations and 4000 QA pairs. The proposed UGC-VideoCaptioner (3B) uses a two-stage training strategy (supervised fine-tuning and GRPO) for efficient adaptation.

Result: The benchmark and model provide a high-quality, data-efficient solution for omnimodal captioning in real-world UGC settings.

Conclusion: UGC-VideoCap and its model framework advance omnimodal video captioning by integrating audio and visual modalities effectively.

Abstract: Real-world user-generated videos, especially on platforms like TikTok, often feature rich and intertwined audio visual content. However, existing video captioning benchmarks and models remain predominantly visual centric, overlooking the crucial role of audio in conveying scene dynamics, speaker intent, and narrative context. This lack of omni datasets and lightweight, capable models hampers progress in fine grained, multimodal video understanding. To address these challenges, we introduce UGC-VideoCap, a new benchmark and model framework specifically designed for detailed omnimodal captioning of short form user-generated videos. Unlike prior datasets, UGC-VideoCap emphasizes balanced integration of audio and visual modalities, featuring 1000 TikTok videos annotated through a structured three stage human-in-the-loop pipeline covering audio only, visual only, and joint audio visual semantics. The benchmark also includes 4000 carefully crafted QA pairs probing both unimodal and cross modal understanding. Alongside the dataset, we propose UGC-VideoCaptioner(3B), a 3B parameter captioning model distilled from Gemini 2.5 Flash. Using a novel two-stage training strategy supervised fine tuning followed by Group Relative Policy Optimization (GRPO), our approach enables efficient adaptation from limited data while maintaining competitive performance. Together, our benchmark and model offer a high-quality foundation and a data-efficient solution for advancing omnimodal video captioning in unconstrained real-world UGC settings.

[148] Attributes Shape the Embedding Space of Face Recognition Models

Pierrick Leroy, Antonio Mastropietro, Marco Nurisso, Francesco Vaccarino

Main category: cs.CV

TL;DR: The paper explores the geometric structure in face recognition embedding spaces, influenced by facial and image attributes, and introduces a physics-inspired metric to analyze model invariance.

DetailsMotivation: To understand how facial and image attributes influence the geometric structure of embedding spaces in face recognition models.

Method: Proposes a geometric approach and a physics-inspired alignment metric, tested on controlled models and fine-tuned FR models with synthetic data.

Result: Models show varying invariance to different attributes, revealing strengths, weaknesses, and interpretability.

Conclusion: The approach provides deeper insights into FR models’ behavior, enhancing interpretability and understanding of attribute dependencies.

Abstract: Face Recognition (FR) tasks have made significant progress with the advent of Deep Neural Networks, particularly through margin-based triplet losses that embed facial images into high-dimensional feature spaces. During training, these contrastive losses focus exclusively on identity information as labels. However, we observe a multiscale geometric structure emerging in the embedding space, influenced by interpretable facial (e.g., hair color) and image attributes (e.g., contrast). We propose a geometric approach to describe the dependence or invariance of FR models to these attributes and introduce a physics-inspired alignment metric. We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability. Code available at https://github.com/mantonios107/attrs-fr-embs

[149] Implementing Adaptations for Vision AutoRegressive Model

Kaif Shaikh, Antoni Kowalczuk, Franziska Boenisch, Adam Dziedzic

Main category: cs.CV

TL;DR: VAR outperforms DMs in non-DP adaptations but struggles with DP adaptations, highlighting a need for further research in private adaptations for VAR.

DetailsMotivation: To explore and benchmark adaptation strategies for VAR, especially in DP settings, where existing solutions are lacking compared to DMs.

Method: Implemented and benchmarked various adaptation strategies for VAR, comparing them to state-of-the-art DM adaptation techniques.

Result: VAR performs better than DMs for non-DP adaptations but underperforms in DP settings.

Conclusion: Further research is needed to improve DP adaptations for VAR, as current methods lag behind DMs.

Abstract: The Vision AutoRegressive model (VAR) was recently introduced as an alternative to Diffusion Models (DMs) in the image generation domain. In this work we focus on its adaptations, which aim to fine-tune pre-trained models to perform specific downstream tasks, like medical data generation. While many such techniques exist for DMs, adaptations for VAR remain underexplored. Similarly, differentially private (DP) adaptations, which aim to preserve the privacy of the adaptation data, have been extensively studied for DMs, while VAR lacks such solutions. In our work, we implement and benchmark many strategies for VAR and compare them to state-of-the-art DM adaptation strategies. We observe that VAR outperforms DMs for non-DP adaptations; however, performance under DP suffers, which necessitates further research in private adaptations for VAR. Code is available at https://github.com/sprintml/finetuning_var_dp.

[150] COLI: A Hierarchical Efficient Compressor for Large Images

Haoran Wang, Hanyu Pei, Yang Lyu, Kai Zhang, Li Li, Feng-Lei Fan

Main category: cs.CV

TL;DR: COLI introduces a novel framework using Neural Representations for Videos (NeRV) to improve INR-based compression for large images, addressing slow speed and suboptimal ratios with accelerated training and Hyper-Compression.

DetailsMotivation: High-resolution imagery demands efficient compression, but conventional methods lose details, and data-driven approaches lack generalizability. INRs offer promise but face speed and ratio challenges.

Method: COLI uses NeRV, accelerates training via pretraining-finetuning, mixed-precision, and parallelizable loss. Hyper-Compression post-training enhances ratios.

Result: COLI achieves better PSNR and SSIM at lower bpp, with 4x faster NeRV training on medical imaging datasets.

Conclusion: COLI effectively addresses INR limitations, offering efficient, high-quality compression for large images.

Abstract: The escalating adoption of high-resolution, large-field-of-view imagery amplifies the need for efficient compression methodologies. Conventional techniques frequently fail to preserve critical image details, while data-driven approaches exhibit limited generalizability. Implicit Neural Representations (INRs) present a promising alternative by learning continuous mappings from spatial coordinates to pixel intensities for individual images, thereby storing network weights rather than raw pixels and avoiding the generalization problem. However, INR-based compression of large images faces challenges including slow compression speed and suboptimal compression ratios. To address these limitations, we introduce COLI (Compressor for Large Images), a novel framework leveraging Neural Representations for Videos (NeRV). First, recognizing that INR-based compression constitutes a training process, we accelerate its convergence through a pretraining-finetuning paradigm, mixed-precision training, and reformulation of the sequential loss into a parallelizable objective. Second, capitalizing on INRs’ transformation of image storage constraints into weight storage, we implement Hyper-Compression, a novel post-training technique to substantially enhance compression ratios while maintaining minimal output distortion. Evaluations across two medical imaging datasets demonstrate that COLI consistently achieves competitive or superior PSNR and SSIM metrics at significantly reduced bits per pixel (bpp), while accelerating NeRV training by up to 4 times.
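
The INR premise, storing network weights instead of pixels, fits in a few lines. The sketch below fits a plain coordinate MLP to a toy image; COLI itself builds on NeRV and adds the accelerations and Hyper-Compression described above, none of which appear here.

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Minimal implicit neural representation: (x, y) -> intensity, so the
    'compressed' image is just this network's weights."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, xy):
        return self.net(xy)

H = W = 32
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
target = torch.rand(H * W, 1)                 # stand-in for image pixels

model = TinyINR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):                       # "compression" == training
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(coords), target)
    loss.backward()
    opt.step()
```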

[151] HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing

Pan Du, Mingqi Xu, Xiaozhi Zhu, Jian-xun Wang

Main category: cs.CV

TL;DR: HUG-VAS is a hierarchical NURBS generative model for vascular geometry synthesis, combining NURBS parameterization with diffusion-based modeling to create realistic aortic geometries with multi-branch structures.

DetailsMotivation: Traditional SSM methods are limited by linear assumptions, making them less expressive for complex vascular topologies. HUG-VAS aims to overcome this by integrating advanced generative techniques.

Method: HUG-VAS uses a hierarchical architecture with a denoising diffusion model for centerlines and a guided diffusion model for radial profiles, trained on 21 patient-specific samples.

Result: The model generates anatomically accurate aortas with biomarker distributions matching the original dataset, supporting zero-shot conditional generation from image-derived priors.

Conclusion: HUG-VAS bridges image-derived priors with generative shape modeling, enabling applications like segmentation, reconstruction, and device optimization.

Abstract: Accurate characterization of vascular geometry is essential for cardiovascular diagnosis and treatment planning. Traditional statistical shape modeling (SSM) methods rely on linear assumptions, limiting their expressivity and scalability to complex topologies such as multi-branch vascular structures. We introduce HUG-VAS, a Hierarchical NURBS Generative model for Vascular geometry Synthesis, which integrates NURBS surface parameterization with diffusion-based generative modeling to synthesize realistic, fine-grained aortic geometries. Trained with 21 patient-specific samples, HUG-VAS generates anatomically faithful aortas with supra-aortic branches, yielding biomarker distributions that closely match those of the original dataset. HUG-VAS adopts a hierarchical architecture comprising a denoising diffusion model that generates centerlines and a guided diffusion model that synthesizes radial profiles conditioned on those centerlines, thereby capturing two layers of anatomical variability. Critically, the framework supports zero-shot conditional generation from image-derived priors, enabling practical applications such as interactive semi-automatic segmentation, robust reconstruction under degraded imaging conditions, and implantable device optimization. To our knowledge, HUG-VAS is the first SSM framework to bridge image-derived priors with generative shape modeling via a unified integration of NURBS parameterization and hierarchical diffusion processes.

[152] 3C-FBI: A Combinatorial method using Convolutions for Circle Fitting in Blurry Images

Esteban Román Catafau, Torbjörn E. M. Nordling

Main category: cs.CV

TL;DR: 3C-FBI is a robust circle detection algorithm combining combinatorial sampling and convolution-based density estimation, excelling in accuracy and speed across medical, synthetic, and degraded imaging scenarios.

DetailsMotivation: Addressing the challenge of robust circle detection in degraded imaging conditions, particularly for applications like medical imaging and industrial inspection.

Method: Combines combinatorial edge pixel sampling with convolution-based density estimation in parameter space.

Result: Achieves state-of-the-art accuracy (Jaccard index 0.896) and real-time performance (40.3 fps), outperforming classical methods. Maintains high accuracy even with outliers and low resolutions.

Conclusion: 3C-FBI is ideal for medical, robotics, and industrial applications due to its accuracy, speed, and robustness.

Abstract: This paper addresses the fundamental computer vision challenge of robust circle detection and fitting in degraded imaging conditions. We present Combinatorial Convolution-based Circle Fitting for Blurry Images (3C-FBI), an algorithm that bridges the gap between circle detection and precise parametric fitting by combining (1) efficient combinatorial edge pixel (edgel) sampling and (2) convolution-based density estimation in parameter space. We evaluate 3C-FBI across three experimental frameworks: (1) real-world medical data from Parkinson’s disease assessments (144 frames from 36 videos), (2) controlled synthetic data following established circle-fitting benchmarks, and (3) systematic analysis across varying spatial resolutions and outlier contamination levels. Results show that 3C-FBI achieves state-of-the-art accuracy (Jaccard index 0.896) while maintaining real-time performance (40.3 fps), significantly outperforming classical methods like RCD (6.8 fps) on a standard CPU (i7-10875H). It maintains near-perfect accuracy (Jaccard almost 1.0) at high resolutions (480x480) and reliable performance (Jaccard higher than 0.95) down to 160x160 with up to 20% outliers. In extensive synthetic testing, 3C-FBI achieves a mean Jaccard Index of 0.989 across contamination levels, comparable to modern methods like Qi et al. (2024, 0.991), and surpassing RHT (0.964). This combination of accuracy, speed, and robustness makes 3C-FBI ideal for medical imaging, robotics, and industrial inspection under challenging conditions.
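
A hedged sketch of the two ingredients named in the abstract, combinatorial edgel sampling plus convolution-based density estimation: fit circles through random triplets of edge pixels, vote in a discretized (cx, cy, r) space, Gaussian-smooth the votes, and take the mode. Grid sizes, sample counts, and the square-image assumption are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def circle_from_3pts(p1, p2, p3):
    """Circumcircle (cx, cy, r) through three points, or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = p1, p2, p3
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-9:
        return None
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return ux, uy, np.hypot(ax - ux, ay - uy)

def fit_circle(edgels, img_size, r_max, n_samples=2000, bins=64, seed=0):
    rng = np.random.default_rng(seed)
    acc = np.zeros((bins, bins, bins))          # votes over (cx, cy, r)
    for _ in range(n_samples):
        idx = rng.choice(len(edgels), 3, replace=False)
        fit = circle_from_3pts(*(edgels[i] for i in idx))
        if fit is None:
            continue
        x, y, r = fit
        if 0 <= x < img_size and 0 <= y < img_size and 0 < r < r_max:
            acc[int(x / img_size * bins),
                int(y / img_size * bins),
                int(r / r_max * bins)] += 1
    acc = gaussian_filter(acc, sigma=1.0)       # the "convolution" step
    i, j, k = np.unravel_index(acc.argmax(), acc.shape)
    return ((i + 0.5) / bins * img_size,
            (j + 0.5) / bins * img_size,
            (k + 0.5) / bins * r_max)
```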

[153] COLIBRI Fuzzy Model: Color Linguistic-Based Representation and Interpretation

Pakizar Shamoi, Nuray Toganas, Muragul Muratbekova, Elnara Kadyrgali, Adilet Yerkin, Ayan Igali, Malika Ziyada, Ayana Adilova, Aron Karatayev, Yerdauit Torekhan

Main category: cs.CV

TL;DR: The paper introduces COLIBRI, a fuzzy color model aligning computational color representation with human perception, validated by large-scale human experiments.

DetailsMotivation: Bridging the gap between computational color models and human visual perception for better applications in design, AI, and HCI.

Method: A three-phase approach: identifying color stimuli, conducting a large-scale human survey (n=2496), and using fuzzy logic to model perceptual uncertainty.

Result: COLIBRI outperforms traditional models (RGB, HSV, LAB) in aligning with human perception.

Conclusion: The model is significant for fields requiring perceptually accurate color representation, like design, AI, and marketing.

Abstract: Colors are omnipresent in today’s world and play a vital role in how humans perceive and interact with their surroundings. However, it is challenging for computers to imitate human color perception. This paper introduces the Human Perception-Based Fuzzy Color Model, COLIBRI (Color Linguistic-Based Representation and Interpretation), designed to bridge the gap between computational color representations and human visual perception. The proposed model uses fuzzy sets and logic to create a framework for color categorization. Using a three-phase experimental approach, the study first identifies distinguishable color stimuli for hue, saturation, and intensity through preliminary experiments, followed by a large-scale human categorization survey involving more than 1000 human subjects. The resulting data are used to extract fuzzy partitions and generate membership functions that reflect real-world perceptual uncertainty. The model incorporates a mechanism for adaptation that allows refinement based on feedback and contextual changes. Comparative evaluations demonstrate the model’s alignment with human perception compared to traditional color models, such as RGB, HSV, and LAB. To the best of our knowledge, no previous research has documented the construction of a model for color attribute specification based on a sample of this size or a comparable sample of the human population (n = 2496). Our findings are significant for fields such as design, artificial intelligence, marketing, and human-computer interaction, where perceptually relevant color representation is critical.
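
As a toy illustration of fuzzy linguistic color terms: overlapping triangular membership functions over hue. The breakpoints below are invented; COLIBRI derives its partitions and membership functions from the survey data.

```python
def tri(x, a, b, c):
    """Triangular membership rising on [a, b] and falling on [b, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

HUE_TERMS = {  # hypothetical (a, b, c) hue breakpoints in degrees
    "red": (330, 360, 390), "yellow": (30, 60, 90),
    "green": (90, 130, 170), "blue": (200, 240, 280),
}

def hue_memberships(hue_deg):
    h = hue_deg % 360
    # evaluate at h and h + 360 so terms straddling 0 degrees (red) wrap
    return {t: max(tri(h, *abc), tri(h + 360, *abc))
            for t, abc in HUE_TERMS.items()}

# hue_memberships(350)["red"] ~ 0.67; hue_memberships(75)["yellow"] == 0.5
```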

[154] CATVis: Context-Aware Thought Visualization

Tariq Mehmood, Hamza Ahmad, Muhammad Haroon Shakeel, Murtaza Taj

Main category: cs.CV

TL;DR: A 5-stage EEG-to-image framework outperforms SOTA methods in accuracy and image quality.

DetailsMotivation: Decoding visual representations from noisy EEG signals is challenging.

Method: Uses EEG encoder, cross-modal alignment, caption refinement, weighted interpolation, and Stable Diffusion.

Result: Outperforms SOTA by 13.43% in Classification Accuracy and 15.21% in Generation Accuracy.

Conclusion: The framework enables high-quality, context-aware EEG-to-image generation.

Abstract: EEG-based brain-computer interfaces (BCIs) have shown promise in various applications, such as motor imagery and cognitive state monitoring. However, decoding visual representations from EEG signals remains a significant challenge due to their complex and noisy nature. We thus propose a novel 5-stage framework for decoding visual representations from EEG signals: (1) an EEG encoder for concept classification, (2) cross-modal alignment of EEG and text embeddings in CLIP feature space, (3) caption refinement via re-ranking, (4) weighted interpolation of concept and caption embeddings for richer semantics, and (5) image generation using a pre-trained Stable Diffusion model. We enable context-aware EEG-to-image generation through cross-modal alignment and re-ranking. Experimental results demonstrate that our method generates high-quality images aligned with visual stimuli, outperforming SOTA approaches by 13.43% in Classification Accuracy, 15.21% in Generation Accuracy, and reducing Fréchet Inception Distance by 36.61%, indicating superior semantic alignment and image quality.
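
A small sketch of stages (3) and (4) as described above, caption re-ranking and weighted interpolation in CLIP space; the weight value and the cosine re-ranking rule are assumptions.

```python
import torch
import torch.nn.functional as F

def rerank(eeg_emb, caption_embs):
    """Return the index of the caption embedding closest to the EEG embedding."""
    sims = F.normalize(caption_embs, dim=-1) @ F.normalize(eeg_emb, dim=-1)
    return sims.argmax().item()

def interpolate(concept_emb, caption_emb, w=0.7):
    """Weighted blend of concept- and caption-level text embeddings."""
    mixed = w * concept_emb + (1.0 - w) * caption_emb
    return F.normalize(mixed, dim=-1)  # conditioning vector for the diffusion model
```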

[155] CharaConsist: Fine-Grained Consistent Character Generation

Mengyu Wang, Henghui Ding, Jianing Peng, Yao Zhao, Yunpeng Chen, Yunchao Wei

Main category: cs.CV

TL;DR: CharaConsist improves text-to-image generation by ensuring fine-grained consistency in foreground and background details, even with large motion variations, using point-tracking attention and adaptive token merge.

DetailsMotivation: Existing methods fail to maintain consistent background details and struggle with identity and clothing inconsistencies during large motion variations.

Method: Proposes CharaConsist, which uses point-tracking attention, adaptive token merge, and decoupled control of foreground and background.

Result: Enables consistent character generation in continuous shots within a fixed scene or in discrete shots across different scenes; as the first consistent-generation method tailored to DiT models, it produces high-quality outputs.

Conclusion: CharaConsist broadens applicability in real-world scenarios by maintaining fine-grained consistency and leveraging advanced base models.

Abstract: In text-to-image generation, producing a series of consistent contents that preserve the same identity is highly valuable for real-world applications. Although a few works have explored training-free methods to enhance the consistency of generated subjects, we observe that they suffer from the following problems. First, they fail to maintain consistent background details, which limits their applicability. Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT model. Its ability to maintain fine-grained consistency, combined with the larger capacity of latest base model, enables it to produce high-quality visual outputs, broadening its applicability to a wider range of real-world scenarios. The source code has been released at https://github.com/Murray-Wang/CharaConsist

[156] Streaming 4D Visual Geometry Transformer

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: The paper proposes a streaming 4D visual geometry transformer for real-time 4D reconstruction, inspired by autoregressive language models. It uses causal attention and implicit memory for efficiency and maintains spatial consistency.

DetailsMotivation: To enable interactive and real-time 4D spatial-temporal geometry reconstruction from videos, addressing the challenge of processing sequences efficiently.

Method: A causal transformer architecture with temporal causal attention and implicit memory (cached keys/values) is used for online processing. Knowledge is distilled from a dense bidirectional transformer (VGGT) for training, and efficient attention operators (e.g., FlashAttention) are leveraged for inference.

Result: The model achieves faster inference in online scenarios while maintaining competitive performance on 4D geometry benchmarks.

Conclusion: The proposed method advances scalable and interactive 4D vision systems, with code publicly available.

Abstract: Perceiving and reconstructing 4D spatial-temporal geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and real-time applications, we propose a streaming 4D visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 4D reconstruction. This design can handle real-time 4D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operator (e.g., FlashAttention) from the field of large language models. Extensive experiments on various 4D geometry perception benchmarks demonstrate that our model increases the inference speed in online scenarios while maintaining competitive performance, paving the way for scalable and interactive 4D vision systems. Code is available at: https://github.com/wzzheng/StreamVGGT.
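
The streaming mechanism can be illustrated with a minimal causal-attention cell that caches past keys and values as implicit memory; the single-head design and dimensions below are simplifications, not the model's actual architecture.

```python
import torch
import torch.nn.functional as F

class StreamingCausalAttention(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.cache_k, self.cache_v = [], []   # implicit memory of past frames

    def forward(self, frame_tokens):          # (n_tokens, dim) for one new frame
        q, k, v = self.qkv(frame_tokens).chunk(3, dim=-1)
        self.cache_k.append(k)                # a real system would bound the cache
        self.cache_v.append(v)
        K = torch.cat(self.cache_k)           # all past + current tokens
        V = torch.cat(self.cache_v)
        attn = F.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V                       # current frame attends causally
```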

[157] Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation

Zhen Xu, Hongyu Zhou, Sida Peng, Haotong Lin, Haoyu Guo, Jiahao Shao, Peishan Yang, Qinglin Yang, Sheng Miao, Xingyi He, Yifan Wang, Yue Wang, Ruizhen Hu, Yiyi Liao, Xiaowei Zhou, Hujun Bao

Main category: cs.CV

TL;DR: A survey on depth estimation in 3D vision, focusing on the shift from traditional hardware-based methods to vision-based approaches and the potential of depth foundation models for robust generalization.

DetailsMotivation: Overcome limitations of traditional hardware sensors (e.g., LiDAR) and address challenges in vision-based methods, such as generalization and stability, by leveraging large-scale datasets and foundation models.

Method: Survey of deep learning architectures and paradigms for depth estimation (monocular, stereo, multi-view, monocular video) and analysis of large-scale datasets.

Result: Identifies key architectures and training strategies for developing robust depth foundation models with strong zero-shot generalization.

Conclusion: Depth foundation models show promise for advancing depth estimation, with future research needed to optimize their potential and applications.

Abstract: Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either the low-capacity model architectures or the reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of “depth foundation models”: deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.

[158] Pavlok-Nudge: A Feedback Mechanism for Atomic Behaviour Modification with Snoring Usecase

Md Rakibul Hasan, Shreya Ghosh, Pradyumna Agrawal, Zhixi Cai, Abhinav Dhall, Tom Gedeon

Main category: cs.CV

TL;DR: A framework automates behavior modification using Pavlok, detecting snoring via a lightweight CNN and nudging users for posture changes.

DetailsMotivation: Manual operation limits Pavlok's effectiveness; automation can enhance behavior modification.

Method: Uses a lightweight 1D CNN to detect snoring, then triggers Pavlok for nudges.

Result: Achieves 99% test accuracy with a highly efficient model (99.8% fewer parameters than SOTA).

Conclusion: The solution can effectively modify atomic habits for long-term health benefits.

Abstract: This paper proposes an atomic behaviour intervention strategy using the Pavlok wearable device. Pavlok utilises beeps, vibration and shocks as a mode of aversion technique to help individuals with behaviour modification. While the device can be useful in certain periodic daily life situations, like alarms and exercise notifications, it relies on manual operations that limit its usage. To automate behaviour modification, we propose a framework that first detects targeted behaviours through a lightweight deep learning model and subsequently nudges the user. Our proposed solution is implemented and verified in the context of snoring: it captures audio from the environment and predicts whether the audio content is a snore or not using a lightweight 1D convolutional neural network. Based on the prediction, we use Pavlok to nudge users for preventive measures, such as a change in sleeping posture. We believe that this simple solution can help people change their atomic habits, which may lead to long-term health benefits. Our proposed lightweight model (99.8% fewer parameters than SOTA; 790,273 → 1,337) achieves SOTA test accuracy of 0.99 on a public benchmark. The code and model are publicly available at https://github.com/hasan-rakibul/pavlok-nudge-snore.
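
For illustration, a snore classifier in this spirit can indeed be tiny; the layer sizes below are assumptions, not the paper's exact 1,337-parameter architecture.

```python
import torch.nn as nn

# ~1.3k parameters: two strided 1D convolutions, global pooling, one linear head
snore_net = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=9, stride=4), nn.ReLU(),
    nn.Conv1d(8, 16, kernel_size=9, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1), nn.Sigmoid(),   # P(snore)
)
# Usage: prob = snore_net(window)  # window: (batch, 1, n_samples), e.g. 1 s of audio
```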

[159] Augmenting End-to-End Steering Angle Prediction with CAN Bus Data

Amit Singh

Main category: cs.CV

TL;DR: Improving autonomous vehicle steering prediction by fusing CAN bus data with video data, reducing prediction error by 20-80% without costly LiDAR or radar sensors.

DetailsMotivation: To enhance the accuracy of end-to-end steering prediction in autonomous vehicles without the high cost of LiDAR or radar sensors.

Method: Sensor fusion of CAN bus data (vehicle state information) with video data to improve computer vision model accuracy.

Result: RMSE reduced from 0.02492 (without CAN bus data) to 0.01970 (with CAN bus data), a 20% error reduction, with some models achieving 80% reduction.

Conclusion: Fusing CAN bus data with video data significantly improves steering prediction accuracy cost-effectively.

Abstract: In recent years, end-to-end steering prediction for autonomous vehicles has become a major area of research. The primary method for achieving end-to-end steering is to run computer vision models on a live feed of video data. To further increase accuracy, many companies have added data from light detection and ranging (LiDAR) and/or radar sensors through sensor fusion. However, the addition of lasers and sensors comes at a high financial cost. In this paper, I address both of these issues by increasing the accuracy of the computer vision models without the added cost of LiDAR or radar sensors. I achieved this by fusing CAN bus data, a vehicle communication protocol, with video data. CAN bus data is a rich source of information about the vehicle’s state, including its speed, steering angle, and acceleration. By fusing this data with video data, the accuracy of the computer vision model’s predictions can be improved. When I trained the model without CAN bus data, I obtained an RMSE of 0.02492, while the model trained with the CAN bus data achieved an RMSE of 0.01970. This finding indicates that fusing CAN bus data with video data can reduce the computer vision model’s prediction error by 20%, with some models decreasing the error by 80%.
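
A hedged sketch of the fusion described here: concatenate CNN features from the current frame with the CAN bus signals before the steering regression head. The architecture and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FusionSteeringNet(nn.Module):
    def __init__(self, img_feat_dim=128, can_dim=3):   # speed, steering, accel
        super().__init__()
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(2), nn.Flatten(),
            nn.Linear(32 * 4, img_feat_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(img_feat_dim + can_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))                          # predicted steering angle

    def forward(self, frame, can):                     # frame: (B,3,H,W), can: (B,3)
        return self.head(torch.cat([self.vision(frame), can], dim=1))
```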

[160] Roadside Monocular 3D Detection Prompted by 2D Detection

Yechi Ma, Yanan Li, Wei Hua, Shu Kong

Main category: cs.CV

TL;DR: Pro3D is a novel 3D detector using 2D detections as prompts to improve 3D object detection, achieving state-of-the-art results.

DetailsMotivation: Roadside monocular 3D detection is crucial for traffic control and cooperative perception, but traditional 3D detectors are harder to train and less precise in 2D localization.

Method: Pro3D leverages 2D detections as prompts, exploring three fusion methods (concatenation, attentive fusion, and encoding 2D box properties) to lift 2D detections into 3D.

Result: The third method (encoding 2D box properties) outperforms others, significantly enhancing existing methods on benchmarks.

Conclusion: Pro3D demonstrates the effectiveness of 2D prompts for 3D detection, offering adaptability and superior performance.

Abstract: Roadside monocular 3D detection requires detecting objects of predefined classes in an RGB frame and predicting their 3D attributes, such as bird’s-eye-view (BEV) locations. It has broad applications in traffic control, vehicle-vehicle communication, and vehicle-infrastructure cooperative perception. To address this task, we introduce Promptable 3D Detector (Pro3D), a novel detector design that leverages 2D detections as prompts. We build our Pro3D upon two key insights. First, compared to a typical 3D detector, a 2D detector is "easier" to train due to fewer loss terms and performs significantly better at localizing objects w.r.t. 2D metrics. Second, once 2D detections precisely locate objects in the image, a 3D detector can focus on lifting these detections into 3D BEV, especially when fixed camera pose or scene geometry provide an informative prior. To encode and incorporate 2D detections, we explore three methods: (a) concatenating features from both 2D and 3D detectors, (b) attentively fusing 2D and 3D detector features, and (c) encoding properties of predicted 2D bounding boxes {x, y, width, height, label} and attentively fusing them with the 3D detector feature. Interestingly, the third method significantly outperforms the others, underscoring the effectiveness of 2D detections as prompts that offer precise object targets and allow the 3D detector to focus on lifting them into 3D. Pro3D is adaptable for use with a wide range of 2D and 3D detectors with minimal modifications. Comprehensive experiments demonstrate that our Pro3D significantly enhances existing methods, achieving state-of-the-art results on two contemporary benchmarks.
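
A sketch of variant (c), the best-performing one: embed each predicted 2D box's {x, y, width, height, label} as a prompt token and fuse the prompts with 3D detector features via cross-attention. Dimensions and the residual connection are assumptions.

```python
import torch
import torch.nn as nn

class BoxPromptFusion(nn.Module):
    def __init__(self, dim=256, num_classes=10):
        super().__init__()
        self.geom = nn.Linear(4, dim)                  # x, y, width, height
        self.cls = nn.Embedding(num_classes, dim)      # label
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feat3d, boxes, labels):
        # feat3d: (B, N, dim) 3D detector tokens; boxes: (B, M, 4); labels: (B, M)
        prompts = self.geom(boxes) + self.cls(labels)  # (B, M, dim) prompt tokens
        fused, _ = self.attn(query=feat3d, key=prompts, value=prompts)
        return feat3d + fused                          # residual fusion
```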

[161] VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning

Alexandros Xenos, Niki Maria Foteinopoulou, Ioanna Ntinou, Ioannis Patras, Georgios Tzimiropoulos

Main category: cs.CV

TL;DR: The paper proposes a two-stage method using Vision-and-Large-Language Models (VLLMs) to improve in-context emotion classification by generating natural language descriptions and fusing text-visual features.

DetailsMotivation: Existing methods for emotion recognition in context rely on limited contextual information or complex pipelines, prompting the need for a simpler, more effective approach.

Method: A two-stage approach: (1) VLLMs generate emotion descriptions in context, (2) a transformer-based model fuses text and visual features for classification.

Result: Outperforms individual modalities and achieves state-of-the-art performance on BoLD, EMOTIC, and CAER-S datasets.

Conclusion: The method simplifies training and improves performance by leveraging VLLMs and multimodal fusion, setting new benchmarks.

Abstract: Recognising emotions in context involves identifying an individual’s apparent emotions while considering contextual cues from the surrounding scene. Previous approaches to this task have typically designed explicit scene-encoding architectures or incorporated external scene-related information, such as captions. However, these methods often utilise limited contextual information or rely on intricate training pipelines to decouple noise from relevant information. In this work, we leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification in a more straightforward manner. Our proposed method follows a simple yet effective two-stage approach. First, we prompt VLLMs to generate natural language descriptions of the subject’s apparent emotion in relation to the visual context. Second, the descriptions, along with the visual input, are used to train a transformer-based architecture that fuses text and visual features before the final classification task. This method not only simplifies the training process but also significantly improves performance. Experimental results demonstrate that the textual descriptions effectively guide the model to constrain the noisy visual input, allowing our fused architecture to outperform individual modalities. Our approach achieves state-of-the-art performance across three datasets, BoLD, EMOTIC, and CAER-S, without bells and whistles. The code will be made publicly available on github: https://github.com/NickyFot/EmoCommonSense.git
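
The second stage can be sketched as a small transformer that fuses text and visual feature tokens before classification; layer counts and dimensions are assumptions (26 classes matches EMOTIC's label set).

```python
import torch
import torch.nn as nn

class EmotionFusion(nn.Module):
    def __init__(self, dim=512, n_classes=26):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, text_feats, vis_feats):   # (B, Lt, dim), (B, Lv, dim)
        B = text_feats.shape[0]
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            text_feats, vis_feats], dim=1)
        return self.head(self.encoder(tokens)[:, 0])  # classify from [CLS]
```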

[162] A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection

Jiangning Zhang, Haoyang He, Zhenye Gan, Qingdong He, Yuxuan Cai, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

Main category: cs.CV

TL;DR: The paper introduces ADer, a comprehensive benchmark for visual anomaly detection, addressing the lack of standardized evaluation in the field. It includes multiple datasets, implements 15 methods, and offers a GPU-assisted evaluation tool for faster metrics computation.

DetailsMotivation: The absence of standardized benchmarks in visual anomaly detection leads to biased evaluations and erroneous conclusions. This work aims to provide a unified framework for fair and comprehensive method comparison.

Method: ADer is proposed as a modular, extensible benchmark with multiple datasets, 15 state-of-the-art methods, and 9 metrics. The GPU-assisted ADEval package accelerates evaluation by over 1000x.

Result: Extensive experiments objectively compare methods, revealing their strengths and weaknesses. The benchmark and tools significantly reduce evaluation time.

Conclusion: ADer serves as a valuable resource for researchers, promoting robust and generalizable anomaly detection systems. The work highlights challenges and future directions for multi-class anomaly detection.

Abstract: Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across different datasets under the practical multi-class setting. The absence of standardized experimental setups can lead to potential biases in training epochs, resolution, and metric results, resulting in erroneous conclusions. This paper addresses this issue by proposing a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework that is highly extensible for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. Additionally, we have proposed the GPU-assisted ADEval package to address the slow evaluation problem of metrics like time-consuming mAU-PRO on large-scale data, significantly reducing evaluation time by more than 1000-fold. Through extensive experimental results, we objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection. We hope that ADer will become a valuable resource for researchers and practitioners in the field, promoting the development of more robust and generalizable anomaly detection systems. Full codes are open-sourced at https://github.com/zhangzjn/ader.

[163] PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models

Jinhua Zhang, Hualian Sheng, Sijia Cai, Bing Deng, Qiao Liang, Wen Li, Ying Fu, Jieping Ye, Shuhang Gu

Main category: cs.CV

TL;DR: PerLDiff introduces a novel method for street view image generation using 3D geometric priors, improving precision and controllability over existing methods.

DetailsMotivation: Addressing the challenge of annotating 3D data for autonomous driving by enhancing controllable generation precision.

Method: Integrates perspective 3D geometric information into the generation process, leveraging 3D geometric priors for object-level control.

Result: Superior controllability and precision on NuScenes and KITTI datasets compared to existing methods.

Conclusion: PerLDiff offers a robust and effective solution for precise street view image generation in autonomous driving contexts.

Abstract: Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the innovative integration of controlling information and introduce PerLDiff (Perspective-Layout Diffusion Models), a novel method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results justify that our PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.

[164] CycleSAM: Few-Shot Surgical Scene Segmentation with Cycle- and Scene-Consistent Feature Matching

Aditya Murali, Farahdiba Zarin, Adrien Meyer, Pietro Mascagni, Didier Mutter, Nicolas Padoy

Main category: cs.CV

TL;DR: CycleSAM improves surgical image segmentation by enhancing visual prompt learning with self-supervised features and consistency constraints, outperforming existing methods by 2-4x.

DetailsMotivation: Surgical image segmentation faces challenges due to scarce annotated data and domain gaps, which existing SAM-based methods struggle to address robustly.

Method: CycleSAM uses surgery-specific self-supervised feature extractors, adapts features efficiently, and applies consistency constraints to generate robust point prompts.

Result: CycleSAM outperforms few-shot SAM approaches by 2-4x and surpasses traditional methods like linear probing and pseudo-labeling.

Conclusion: CycleSAM effectively bridges the domain gap in surgical image segmentation, offering a label-efficient and robust solution.

Abstract: Surgical image segmentation is highly challenging, primarily due to scarcity of annotated data. Generalist prompted segmentation models like the Segment-Anything Model (SAM) can help tackle this task, but because they require image-specific visual prompts for effective performance, their use is limited to improving data annotation efficiency. Recent approaches extend SAM to automatic segmentation by using a few labeled reference images to predict point prompts; however, they rely on feature matching pipelines that lack robustness to out-of-domain data like surgical images. To tackle this problem, we introduce CycleSAM, an improved visual prompt learning approach that employs a data-efficient training phase and enforces a series of soft constraints to produce high-quality feature similarity maps. CycleSAM label-efficiently addresses domain gap by leveraging surgery-specific self-supervised feature extractors, then adapts the resulting features through a short parameter-efficient training stage, enabling it to produce informative similarity maps. CycleSAM further filters the similarity maps with a series of consistency constraints before robustly sampling diverse point prompts for each object instance. In our experiments on four diverse surgical datasets, we find that CycleSAM outperforms existing few-shot SAM approaches by a factor of 2-4x in both 1-shot and 5-shot settings, while also achieving strong performance gains over traditional linear probing, parameter-efficient adaptation, and pseudo-labeling methods.
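
A hedged sketch of the core matching step: build a similarity map between a reference object's pooled feature and the target image's features, then sample the strongest locations as point prompts for SAM. Prototype pooling and top-k sampling are simplifying assumptions (CycleSAM adds cycle- and scene-consistency filtering on top).

```python
import torch
import torch.nn.functional as F

def point_prompts_from_reference(ref_feats, ref_mask, tgt_feats, k=5):
    """ref_feats/tgt_feats: (C, H, W) feature maps; ref_mask: (H, W) boolean."""
    obj = ref_feats[:, ref_mask].mean(dim=1)             # (C,) object prototype
    sim = F.cosine_similarity(tgt_feats,                 # (H, W) similarity map
                              obj[:, None, None], dim=0)
    flat = sim.flatten().topk(k).indices                 # strongest k locations
    ys, xs = flat // sim.shape[1], flat % sim.shape[1]
    return torch.stack([xs, ys], dim=1)                  # (k, 2) point prompts
```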

[165] ED^4: Explicit Data-level Debiasing for Deepfake Detection

Jikang Cheng, Ying Zhang, Qin Zou, Zhiyuan Yan, Chao Liang, Zhongyuan Wang, Chen Li

Main category: cs.CV

TL;DR: ED^4 addresses spatial bias in deepfake detection by using ClockMix for diverse data generation and AdvSCM to prevent spatial bias, improving detector generalizability.

DetailsMotivation: Deepfake detectors often fail due to intrinsic biases like spatial bias, where detectors focus on the image center for forgery clues, limiting generalization.

Method: ED^4 combines ClockMix for diverse data augmentation and AdvSCM to enforce spatial-inconsistent feature learning, debiasing detectors.

Result: ED^4 significantly improves deepfake detection performance and generalizability, outperforming existing methods.

Conclusion: ED^4 is a model-agnostic, plug-and-play solution that effectively mitigates biases in deepfake detection.

Abstract: Learning intrinsic bias from limited data has been considered the main reason for the failure of deepfake detection with generalizability. Apart from the previously discovered content and specific-forgery biases, we reveal a novel spatial bias, where detectors inertly anticipate structural forgery clues at the image center, which can also lead to the poor generalization of existing methods. We present ED^4, a simple and effective strategy, to address the aforementioned biases explicitly at the data level in a unified framework rather than implicit disentanglement via network design. In particular, we develop ClockMix to produce facial structure preserved mixtures with arbitrary samples, which allows the detector to learn from an exponentially extended data distribution with much more diverse identities, backgrounds, local manipulation traces, and the co-occurrence of multiple forgery artifacts. We further propose the Adversarial Spatial Consistency Module (AdvSCM) to prevent extracting features with spatial bias, which adversarially generates spatial-inconsistent images and constrains their extracted feature to be consistent. As a model-agnostic debiasing strategy, ED^4 is plug-and-play: it can be integrated with various deepfake detectors to obtain significant benefits. We conduct extensive experiments to demonstrate its effectiveness and superiority over existing deepfake detection approaches.

[166] Advancing Depth Anything Model for Unsupervised Monocular Depth Estimation in Endoscopy

Bojian Li, Bo Liu, Xinning Yao, Jinghua Yue, Fugen Zhou

Main category: cs.CV

TL;DR: A novel fine-tuning strategy for the Depth Anything Model is introduced, combined with an intrinsic-based unsupervised monocular depth estimation framework, achieving state-of-the-art performance in endoscopic depth estimation.

DetailsMotivation: Current depth estimation networks lack global information capture, and foundation models trained on natural images perform poorly on endoscopic images.

Method: Proposes a fine-tuning strategy with low-rank adaptation and a residual block using depthwise separable convolution to enhance local feature capture.

Result: Achieves state-of-the-art performance on SCARED and Hamlyn datasets with minimal trainable parameters.

Conclusion: The method improves spatial awareness in endoscopic surgeries, enhancing precision and safety.

Abstract: Depth estimation is a cornerstone of 3D reconstruction and plays a vital role in minimally invasive endoscopic surgeries. However, most current depth estimation networks rely on traditional convolutional neural networks, which are limited in their ability to capture global information. Foundation models offer a promising approach to enhance depth estimation, but those models currently available are primarily trained on natural images, leading to suboptimal performance when applied to endoscopic images. In this work, we introduce a novel fine-tuning strategy for the Depth Anything Model and integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our approach includes a low-rank adaptation technique based on random vectors, which improves the model’s adaptability to different scales. Additionally, we propose a residual block built on depthwise separable convolution to compensate for the transformer’s limited ability to capture local features. Our experimental results on the SCARED dataset and Hamlyn dataset show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters. Applying this method in minimally invasive endoscopic surgery can enhance surgeons’ spatial awareness, thereby improving the precision and safety of the procedures.
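
The low-rank adaptation step can be sketched generically as follows (the paper's variant is based on random vectors; this shows only the standard frozen-base-plus-low-rank-update pattern):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze the foundation model
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank                 # B zero-init: starts as identity

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```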

[167] EEG Emotion Copilot: Optimizing Lightweight LLMs for Emotional EEG Interpretation with Assisted Medical Record Generation

Hongyu Chen, Weiming Zeng, Chengcheng Chen, Luhui Cai, Fei Wang, Yuhu Shi, Lei Wang, Wei Zhang, Yueyang Li, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Main category: cs.CV

TL;DR: The paper introduces EEG Emotion Copilot, a lightweight LLM-based system for real-time EEG emotion recognition and personalized medical suggestions, outperforming larger models in accuracy and efficiency.

DetailsMotivation: Addressing challenges in end-to-end EEG emotion recognition, such as real-time processing and individual adaptation, to enhance affective computing in healthcare.

Method: Utilizes a 0.5B parameter LLM with novel prompt data structures, model pruning, fine-tuning, and deployment strategies for efficiency.

Result: Achieves superior accuracy in emotion recognition and medical record automation compared to larger models (1.5B-7B parameters).

Conclusion: The EEG Emotion Copilot advances affective computing in medicine, offering a practical solution for mental health monitoring.

Abstract: In the fields of affective computing (AC) and brain-machine interface (BMI), the analysis of physiological and behavioral signals to discern individual emotional states has emerged as a critical research frontier. While deep learning-based approaches have made notable strides in EEG emotion recognition, particularly in feature extraction and pattern recognition, significant challenges persist in achieving end-to-end emotion computation, including real-time processing, individual adaptation, and seamless user interaction. This paper presents the EEG Emotion Copilot, a system optimizing a lightweight large language model (LLM) with 0.5B parameters operating in a local setting, which first recognizes emotional states directly from EEG signals, subsequently generates personalized diagnostic and treatment suggestions, and finally supports the automation of assisted electronic medical records. Specifically, we demonstrate the critical techniques involved: a novel prompt data structure, model pruning and fine-tuning, and deployment strategies aimed at improving real-time performance and computational efficiency. Extensive experiments show that our optimized lightweight LLM-based copilot achieves a more intuitive interface for participant interaction and superior accuracy in emotion recognition and assisted electronic medical record generation, in comparison to models of similar or larger parameter scales such as 1.5B, 1.8B, 3B and 7B. In summary, through these efforts, the proposed copilot is expected to advance the application of AC in the medical domain, offering an innovative solution for mental health monitoring. The codes will be released at https://github.com/NZWANG/EEG_Emotion_Copilot.
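
Of the efficiency levers mentioned, model pruning is easy to illustrate with PyTorch's built-in utilities; the 30% ratio below is an assumption, not the paper's setting.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_linears(model: nn.Module, amount=0.3):
    """Zero out the smallest-magnitude weights in every linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the sparsity permanent
    return model
```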

[168] SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang

Main category: cs.CV

TL;DR: SVTRv2 improves CTC-based STR methods by addressing text irregularity and linguistic context, outperforming encoder-decoder methods in accuracy and speed.

DetailsMotivation: CTC-based STR methods like SVTR are fast but struggle with text irregularity and lack of linguistic context, leading to lower accuracy compared to encoder-decoder methods.

Method: SVTRv2 introduces multi-size resizing to avoid distortion, a feature rearrangement module for CTC alignment, and a semantic guidance module for linguistic context.

Result: SVTRv2 outperforms most encoder-decoder methods in accuracy and speed across various benchmarks.

Conclusion: SVTRv2 effectively combines the simplicity of CTC with improved handling of text irregularities and linguistic context, achieving superior performance.

Abstract: Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) due to struggling with text irregularity and missing linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. First, a multi-size resizing strategy is proposed to resize text instances to appropriate predefined sizes, effectively avoiding severe text distortion. Meanwhile, we introduce a feature rearrangement module to ensure that visual features accommodate the requirement of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module. It integrates linguistic context into the visual features, allowing the CTC model to leverage language information for accuracy improvement. This module can be omitted at the inference stage and would not increase the time cost. We extensively evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared to popular STR models across multiple scenarios, including different types of text irregularity, languages, long text, and whether employing pretraining. SVTRv2 surpasses most EDTRs across the scenarios in terms of accuracy and inference speed. Code: https://github.com/Topdu/OpenOCR.
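
The multi-size resizing strategy can be sketched as snapping each text instance to the predefined shape whose aspect ratio is closest, instead of forcing one fixed size; the candidate sizes below are invented for illustration.

```python
SIZES = [(32, 64), (32, 128), (32, 256), (64, 64)]  # hypothetical (H, W) candidates

def pick_size(h, w):
    """Choose the predefined size with the closest aspect ratio."""
    ratio = w / h
    return min(SIZES, key=lambda s: abs(s[1] / s[0] - ratio))

# e.g. a 20x160 crop (ratio 8.0) maps to (32, 256); a 50x55 crop to (64, 64)
```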

[169] MVCTrack: Boosting 3D Point Cloud Tracking via Multimodal-Guided Virtual Cues

Zhaofeng Hu, Sifan Zhou, Zhihang Yuan, Dawei Yang, Shibo Zhao, Ci-Jyun Liang

Main category: cs.CV

TL;DR: A Multimodal-guided Virtual Cues Projection (MVCP) scheme is proposed to enhance 3D single object tracking by generating virtual cues for sparse point clouds, improving performance in autonomous driving and robotics.

DetailsMotivation: Existing methods struggle with sparse and incomplete point clouds in 3D tracking, limiting their effectiveness in real-world applications like autonomous driving.

Method: The MVCP scheme integrates RGB sensors with LiDAR to generate dense 3D virtual cues from 2D detections, enriching sparse point clouds. An enhanced tracker, MVCTrack, is introduced based on these cues.

Result: The method achieves competitive performance on the NuScenes dataset, demonstrating significant improvements in tracking accuracy.

Conclusion: The MVCP scheme and MVCTrack effectively address sparsity issues in point clouds, offering a robust solution for 3D single object tracking.

Abstract: 3D single object tracking is essential in autonomous driving and robotics. Existing methods often struggle with sparse and incomplete point cloud scenarios. To address these limitations, we propose a Multimodal-guided Virtual Cues Projection (MVCP) scheme that generates virtual cues to enrich sparse point clouds. Additionally, we introduce an enhanced tracker MVCTrack based on the generated virtual cues. Specifically, the MVCP scheme seamlessly integrates RGB sensors into LiDAR-based systems, leveraging a set of 2D detections to create dense 3D virtual cues that significantly improve the sparsity of point clouds. These virtual cues can naturally integrate with existing LiDAR-based 3D trackers, yielding substantial performance gains. Extensive experiments demonstrate that our method achieves competitive performance on the NuScenes dataset.
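
A hedged sketch of how 2D detections can be lifted into 3D virtual cues: sample pixels inside a detected 2D box, assign each a depth (e.g., interpolated from nearby LiDAR returns), and back-project through the camera intrinsics. The depth-assignment strategy here is an assumption.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Lift pixels to 3D virtual points in the camera frame.

    pixels: (N, 2) array of [u, v] coordinates inside a 2D detection box;
    depths: (N,) assigned depths; K: (3, 3) camera intrinsics.
    """
    uv1 = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    rays = np.linalg.inv(K) @ uv1.T                 # (3, N) camera rays
    return (rays * depths).T                        # (N, 3) virtual points
```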

[170] TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models

Pooyan Rahmanzadehgervi, Hung Huy Nguyen, Rosanne Liu, Long Mai, Anh Totti Nguyen

Main category: cs.CV

TL;DR: The paper introduces a 1-head Transformer Attention Bottleneck (TAB) layer to improve interpretability and intervention in vision-language models (VLMs) by constraining attention to [0, 1].

DetailsMotivation: Multi-head self-attention (MHSA) in Transformers obscures input-output attribution, limiting interpretability. TAB aims to address this by providing a bottleneck for clearer attention control.

Method: A TAB layer is inserted after MHSA, constraining total attention to [0, 1]. This allows for controlled information propagation and enables attention editing for debugging.

Result: VLMs with TAB perform similarly to baselines in captioning but excel in localizing changes and identifying no-change scenarios. TAB also enables effective debugging via attention editing.

Conclusion: TAB enhances interpretability and intervention in VLMs without compromising performance, offering a practical tool for debugging and analysis.

Abstract: Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to the range [0, 1]. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image-difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to debug by editing attention, which often produces expected outputs by VLMs.
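
One simple way to realize a [0, 1]-bounded attention total, sketched below, is to add a learned "sink" logit that can absorb softmax mass, so the attention retained over the patches sums to at most 1; this is an illustration of the constraint, not necessarily TAB's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.sink = nn.Parameter(torch.zeros(1))    # learned no-attention logit

    def forward(self, query_tok, patches):          # (B, 1, D), (B, N, D)
        logits = self.q(query_tok) @ self.k(patches).transpose(1, 2)  # (B, 1, N)
        logits = logits / patches.shape[-1] ** 0.5
        logits = torch.cat([logits, self.sink.expand(logits.shape[0], 1, 1)], -1)
        attn = F.softmax(logits, dim=-1)[..., :-1]  # drop sink; patch total <= 1
        return attn @ patches, attn                 # attn map can be edited by users
```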

[171] Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing

Genc Hoxha, Olivér Angyal, Begüm Demir

Main category: cs.CV

TL;DR: The paper introduces a novel cross-modal text-image time series retrieval (text-ITSR) method in remote sensing, enabling retrieval of image time series using text queries and vice versa. It uses modality-specific encoders and projection heads, along with two fusion strategies (GFF and TFF), achieving accurate results on benchmark datasets.

DetailsMotivation: Existing ITSR methods in remote sensing assume unimodal queries (image time series), which may not hold in real-world scenarios. This work addresses the gap by enabling cross-modal retrieval (text-to-image and vice versa).

Method: The method involves: 1) modality-specific encoders for bitemporal images and text, 2) projection heads to align representations in a shared space, and 3) two fusion strategies (GFF and TFF) for temporal modeling.

Result: Experiments on benchmark datasets show the method effectively retrieves semantically relevant bitemporal images or text sentences.

Conclusion: The proposed self-supervised cross-modal text-ITSR method successfully addresses the limitation of unimodal retrieval, demonstrating strong performance in cross-modal scenarios.

Abstract: The development of image time series retrieval (ITSR) methods is a growing research interest in remote sensing (RS). Given a user-defined image time series (i.e., the query time series), ITSR methods search and retrieve from large archives the image time series that have similar content to the query time series. Existing ITSR methods in RS are designed for unimodal retrieval problems, relying on an assumption that users always have access to a query image time series in the considered image modality. In operational scenarios, this assumption may not hold. To overcome this issue, as a first time in RS we introduce the task of cross-modal text-image time series retrieval (text-ITSR). In detail, we present a self-supervised cross-modal text-ITSR method that enables the retrieval of image time series using text sentences as queries, and vice versa. We focus our attention on text-ITSR in pairs of images (i.e., bitemporal images). Our text-ITSR method consists of two key components: 1) modality-specific encoders to model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads to align textual and image representations in a shared embedding space. To effectively model the temporal information in the bitemporal images, we exploit two fusion strategies: i) global feature fusion (GFF) strategy that combines global image features through simple yet effective operators; and ii) transformer-based feature fusion (TFF) strategy that leverages transformers for fine-grained temporal integration. Extensive experiments conducted on two benchmark RS archives demonstrate the effectiveness of our method in accurately retrieving semantically relevant bitemporal images (or text sentences) to a query text sentence (or bitemporal image). The code of this work is publicly available at https://git.tu-berlin.de/rsim/cross-modal-text-tsir .
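
The global feature fusion (GFF) strategy can be sketched as combining the two image embeddings of a bitemporal pair with simple operators before projecting into the shared text-image space; the specific operators (sum, difference, product) are assumptions.

```python
import torch
import torch.nn as nn

class GFFHead(nn.Module):
    def __init__(self, dim=512, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(3 * dim, out_dim)   # sum, difference, product

    def forward(self, f_t1, f_t2):                # (B, dim) per time step
        fused = torch.cat([f_t1 + f_t2, f_t2 - f_t1, f_t1 * f_t2], dim=-1)
        return nn.functional.normalize(self.proj(fused), dim=-1)  # shared space
```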

[172] Biomechanics-Guided Residual Approach to Generalizable Human Motion Generation and Estimation

Zixi Kang, Xinghan Wang, Yadong Mu

Main category: cs.CV

TL;DR: BioVAE is a biomechanics-aware framework for generating physically plausible human motions by integrating EMG signals, kinematic features, and acceleration constraints, outperforming existing methods.

DetailsMotivation: Existing methods for human motion generation often lack biomechanical realism, and RL approaches are limited by simulation dependencies. BioVAE aims to address these gaps.

Method: BioVAE combines EMG signals, kinematic features, and acceleration constraints, integrates with diffusion models, and uses biomechanical priors for generalization.

Result: BioVAE achieves state-of-the-art performance on benchmarks, ensuring physically accurate motion generation and pose estimation.

Conclusion: BioVAE bridges the gap between data-driven motion synthesis and biomechanical authenticity, setting new standards for motion generation.

Abstract: Human pose, action, and motion generation are critical for applications in digital humans, character animation, and humanoid robotics. However, many existing methods struggle to produce physically plausible movements that are consistent with biomechanical principles. Although recent autoregressive and diffusion models deliver impressive visual quality, they often neglect key biodynamic features and fail to ensure physically realistic motions. Reinforcement Learning (RL) approaches can address these shortcomings but are highly dependent on simulation environments, limiting their generalizability. To overcome these challenges, we propose BioVAE, a biomechanics-aware framework with three core innovations: (1) integration of muscle electromyography (EMG) signals and kinematic features with acceleration constraints to enable physically plausible motion without simulations; (2) seamless coupling with diffusion models for stable end-to-end training; and (3) biomechanical priors that promote strong generalization across diverse motion generation and estimation tasks. Extensive experiments demonstrate that BioVAE achieves state-of-the-art performance on multiple benchmarks, bridging the gap between data-driven motion synthesis and biomechanical authenticity while setting new standards for physically accurate motion generation and pose estimation.
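
The acceleration constraint admits a simple reading: penalize large second-order finite differences of joint trajectories. This sketch is one plausible form with an arbitrary threshold, not the paper's exact loss.

```python
import torch

def acceleration_loss(motion, a_max=50.0):
    """motion: (B, T, J, 3) joint positions over time; a_max is illustrative."""
    accel = motion[:, 2:] - 2 * motion[:, 1:-1] + motion[:, :-2]  # 2nd difference
    return torch.relu(accel.norm(dim=-1) - a_max).mean()          # penalize excess
```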

[173] Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

Zhixuan Li, Hyunse Yoon, Sanghoon Lee, Weisi Lin

Main category: cs.CV

TL;DR: The paper introduces amodal reasoning segmentation, a task combining amodal segmentation with user text interaction, and proposes AURA, a model for handling complex occlusions.

DetailsMotivation: Current amodal segmentation methods lack user interaction and struggle with complex occlusions, while multi-modal LLMs like LISA are limited to visible regions.

Method: The authors develop a dataset generation pipeline, create a new dataset for daily life scenarios, and propose AURA with global and spatial-level designs.

Result: Experiments show AURA’s effectiveness in handling complex occlusions and reasoning tasks.

Conclusion: AURA addresses limitations of existing methods, offering improved performance in amodal reasoning segmentation.

Abstract: Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region’s appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA’s effectiveness on the proposed dataset.

[174] GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: GroundingSuite introduces an automated data annotation framework, a large-scale training dataset, and a curated benchmark to address limitations in pixel grounding tasks, achieving state-of-the-art results.

DetailsMotivation: Existing datasets for pixel grounding tasks like RES suffer from limited object categories, textual diversity, and annotation quality, hindering advancements.

Method: GroundingSuite includes an automated annotation framework using VLM agents, a 9.56M-expression training dataset, and a 3,800-image evaluation benchmark.

Result: Models trained on GroundingSuite achieve 68.9 cIoU on gRefCOCO and 55.3 gIoU on RefCOCOm, with annotation 4.5x faster than GLaMM.

Conclusion: GroundingSuite effectively addresses dataset limitations, enabling significant performance improvements and efficient annotation.

Abstract: Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: specifically, a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., $4.5 \times$ faster than GLaMM.
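The two metrics quoted above follow the usual RES conventions: cIoU accumulates intersection over union across the whole evaluation set, while gIoU averages per-sample IoU. A minimal sketch of both (not the official GroundingSuite evaluation code; mask shapes are illustrative):

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: total intersection over total union across the set."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / max(union, 1)

def giou(preds, gts, eps=1e-6):
    """Mean per-sample IoU, often reported as gIoU in RES benchmarks."""
    ious = [np.logical_and(p, g).sum() / (np.logical_or(p, g).sum() + eps)
            for p, g in zip(preds, gts)]
    return float(np.mean(ious))

preds = [np.random.rand(64, 64) > 0.5 for _ in range(4)]  # predicted masks
gts = [np.random.rand(64, 64) > 0.5 for _ in range(4)]    # ground-truth masks
print(ciou(preds, gts), giou(preds, gts))
```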

[175] COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

Sanghyun Jo, Seo Jin Lee, Seungwoo Lee, Seohyung Hong, Hyungseok Seo, Kyungsu Kim

Main category: cs.CV

TL;DR: COIN is an annotation-free framework for unsupervised cell instance segmentation, using confidence scoring and self-distillation to outperform existing methods.

Motivation: Unsupervised CIS models struggle with inaccurate cell boundaries due to lack of error-free instances, prompting the need for a better annotation-free solution.

Method: COIN involves unsupervised semantic segmentation with optimal transport, instance-level confidence scoring, and recursive self-distillation to refine predictions.

Result: COIN outperforms existing UCIS methods and even semi-/weakly-supervised approaches on MoNuSeg and TNBC datasets.

Conclusion: COIN provides a robust, annotation-free solution for accurate cell instance segmentation, validated by superior performance across multiple datasets.

Abstract: Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at https://github.com/shjo-april/COIN.
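The instance-level confidence score in step (2) measures consistency between a model prediction and its refined mask; a simple IoU-based reading of that idea is sketched below (the threshold and the refinement itself are assumptions, not COIN's actual definitions):

```python
import numpy as np

def instance_confidence(pred_mask, refined_mask):
    # Consistency between a predicted instance mask and its refined
    # counterpart, read as IoU; highly confident instances then stand in
    # for ground-truth annotations during self-distillation.
    inter = np.logical_and(pred_mask, refined_mask).sum()
    union = np.logical_or(pred_mask, refined_mask).sum()
    return inter / max(union, 1)

pred = np.zeros((32, 32), dtype=bool); pred[8:20, 8:20] = True
refined = np.zeros((32, 32), dtype=bool); refined[10:22, 9:21] = True
score = instance_confidence(pred, refined)
print(score, score > 0.8)  # 0.8 is an assumed confidence threshold
```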

[176] AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization

Martin Kišš, Michal Hradiš, Martina Dvořáková, Václav Jiroušek, Filip Kersch

Main category: cs.CV

TL;DR: The AnnoPage Dataset is a collection of 7,550 historical document pages (Czech/German) annotated with 25 non-textual element categories, supporting layout analysis and object detection research.

Motivation: To provide a resource for advancing research in document layout analysis and object detection, particularly for historical documents.

Method: Pages are annotated with axis-aligned bounding boxes (AABB) for 25 non-textual categories by expert librarians. The dataset includes development and test subsets, with baseline results from YOLO and DETR detectors.

Result: The dataset is publicly available with ground-truth annotations, offering a benchmark for future research.

Conclusion: The AnnoPage Dataset fills a gap in historical document analysis, providing a valuable resource for the community.

Abstract: We introduce the AnnoPage Dataset, a novel collection of 7,550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present, focusing on the late 19th and early 20th centuries. The dataset is designed to support research in document layout analysis and object detection. Each page is annotated with axis-aligned bounding boxes (AABB) representing elements of 25 categories of non-textual elements, such as images, maps, decorative elements, or charts, following the Czech Methodology of image document processing. The annotations were created by expert librarians to ensure accuracy and consistency. The dataset also incorporates pages from multiple, mainly historical, document datasets to enhance variability and maintain continuity. The dataset is divided into development and test subsets, with the test set carefully selected to maintain the category distribution. We provide baseline results using YOLO and DETR object detectors, offering a reference point for future research. The AnnoPage Dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.12788419), along with ground-truth annotations in YOLO format.
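Since the ground truth ships in YOLO format, converting the axis-aligned bounding boxes into YOLO's normalized center-based representation is the first practical step for consumers of the dataset; a minimal sketch (the category index is hypothetical):

```python
def aabb_to_yolo(box, img_w, img_h):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) AABB to YOLO's
    normalized (cx, cy, w, h) representation."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w, (ymax - ymin) / img_h)

# One annotation line: hypothetical class id 3, box on a 1000x1400 page.
print(3, *aabb_to_yolo((120, 340, 480, 700), 1000, 1400))
```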

[177] Archival Faces: Detection of Faces in Digitized Historical Documents

Marek Vaško, Adam Herout, Michal Hradiš

Main category: cs.CV

TL;DR: A new dataset for face detection in historical newspapers is introduced to improve the poor performance (24% mAP) of existing tools. The dataset includes 2.2k images and 11k annotations, enabling better retraining of detectors.

Motivation: Existing face detectors perform poorly on historical documents, necessitating a domain-specific dataset to enhance accuracy.

Method: A manually annotated dataset (2.2k images, 11k bounding boxes, and landmarks) is created. Existing detectors are retrained and evaluated.

Result: Retraining with the new dataset improves face detection performance closer to modern standards.

Conclusion: The dataset bridges the gap in face detection for historical documents, enabling better digitization and searchability.

Abstract: When digitizing historical archives, it is necessary to search for the faces of celebrities and ordinary people, especially in newspapers, link them to the surrounding text, and make them searchable. Existing face detectors on datasets of scanned historical documents fail remarkably – current detection tools only achieve around 24% mAP at 50:90% IoU. This work compensates for this failure by introducing a new manually annotated domain-specific dataset in the style of the popular Wider Face dataset, containing 2.2k new images from digitized historical newspapers from the 19th to 20th century, with 11k new bounding-box annotations and associated facial landmarks. This dataset allows existing detectors to be retrained to bring their results closer to the standard in the field of face detection in the wild. We report several experimental results comparing different families of fine-tuned detectors against publicly available pre-trained face detectors and ablation studies of multiple detector sizes with comprehensive detection and landmark prediction performance results.

[178] Fully Unified Motion Planning for End-to-End Autonomous Driving

Lin Liu, Caiyan Jia, Ziying Song, Hongyu Pan, Bencheng Liao, Wenchao Sun, Yongchang Zhang, Lei Yang, Yandan Luo

Main category: cs.CV

TL;DR: The paper introduces FUMP, a two-stage trajectory generation framework for autonomous driving, addressing limitations of current methods by leveraging expert data from multiple vehicles.

Motivation: Current methods rely solely on ego-vehicle expert data, missing diverse driving scenarios and policies. Joint learning from multiple vehicles’ data can enhance performance but faces challenges like observational discrepancies and missing modalities.

Method: FUMP decouples trajectory planning into two stages: a shared decoder for initial trajectories and a planning-specific refinement stage. It includes a state predictor and an Equivariant Context-Sharing Adapter (ECSA) to address cross-vehicle discrepancies.

Result: The proposed framework improves learning from multiple vehicles’ expert data, enhancing performance in planning tasks, including long-tail scenarios.

Conclusion: FUMP effectively addresses challenges in joint learning from multiple vehicles, offering a unified solution for motion planning in autonomous driving.

Abstract: Current end-to-end autonomous driving methods typically learn only from expert planning data collected from a single ego vehicle, severely limiting the diversity of learnable driving policies and scenarios. However, a critical yet overlooked fact is that in any driving scenario, multiple high-quality trajectories from other vehicles coexist with a specific ego vehicle’s trajectory. Existing methods fail to fully exploit this valuable resource, missing important opportunities to improve the models’ performance (including long-tail scenarios) through learning from other experts. Intuitively, jointly learning from both ego and other vehicles’ expert data is beneficial for planning tasks. However, this joint learning faces two critical challenges. (1) Different scene observation perspectives across vehicles hinder inter-vehicle alignment of scene feature representations; (2) The absence of partial modality in other vehicles’ data (e.g., vehicle states) compared to ego-vehicle data introduces learning bias. To address these challenges, we propose FUMP (Fully Unified Motion Planning), a novel two-stage trajectory generation framework. Building upon probabilistic decomposition, we model the planning task as a specialized subtask of motion prediction. Specifically, our approach decouples trajectory planning into two stages. In Stage 1, a shared decoder jointly generates initial trajectories for both tasks. In Stage 2, the model performs planning-specific refinement conditioned on an ego-vehicle’s state. The transition between the two stages is bridged by a state predictor trained exclusively on ego-vehicle data. To address the cross-vehicle discrepancy in observational perspectives, we propose an Equivariant Context-Sharing Adapter (ECSA) before Stage 1 for improving cross-vehicle generalization of scene representations.

[179] Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation

Siyu Chen, Ting Han, Changshe Zhang, Xin Luo, Meiliu Wu, Guorong Cai, Jinhe Su

Main category: cs.CV

TL;DR: DepthForge integrates depth information with Vision Foundation Models (VFMs) to enhance geometric consistency and generalization in Domain Generalized Semantic Segmentation (DGSS).

Motivation: Visual cues are susceptible to changes, while geometry (depth) remains stable. Leveraging depth can improve VFM robustness.

Method: Proposes DepthForge, a framework combining VFM features (DINOv2/EVA02) with depth cues (Depth Anything V2), using depth-aware tokens and a depth refinement decoder.

Result: Outperforms alternatives in DGSS, showing stronger performance, steadier attention, and better generalization, especially in extreme conditions.

Conclusion: DepthForge effectively enhances VFM performance in DGSS by integrating depth, demonstrating superior robustness and generalization.

Abstract: Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse-xzrptkvyqc/DepthForge.

[180] Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao, Beier Zhu, Qianru Sun, Hanwang Zhang

Main category: cs.CV

TL;DR: UV-CoT introduces an unsupervised framework for visual Chain-of-Thought reasoning, eliminating the need for labeled bounding-box data by using preference optimization and automatic data generation.

Motivation: Existing CoT methods focus on text, limiting their use of visual cues. Visual CoT is underexplored, and current approaches rely on supervised fine-tuning with extensive labeled data, which is hard to generalize.

Method: UV-CoT uses preference comparisons between model-generated bounding boxes, avoiding annotations. An automatic pipeline generates preference data, and an evaluator MLLM ranks responses to train the target MLLM.

Result: UV-CoT outperforms state-of-the-art textual and visual CoT methods on six datasets and shows strong generalization in zero-shot testing on four unseen datasets.

Conclusion: UV-CoT enhances visual comprehension, especially in spatial reasoning, by emulating human perception without relying on labeled data, demonstrating superior performance and generalization.

Abstract: Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception–identifying key regions and reasoning based on them–UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available in https://github.com/kesenzhao/UV-CoT.
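The ranking-based supervision described above is in the spirit of direct preference optimization over paired responses; a minimal sketch of such a score-difference loss (UV-CoT's exact objective is described as a negative log-likelihood formulation, and beta here is an assumed temperature):

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_preferred, logp_dispreferred, beta=0.1):
    # Push the model to assign higher likelihood to the response grounded
    # in the preferred bounding box than to the dis-preferred one; the
    # evaluator MLLM's ranking decides which is which.
    return -F.logsigmoid(beta * (logp_preferred - logp_dispreferred)).mean()

logp_w = torch.tensor([-12.3, -9.8])   # log-likelihoods, preferred responses
logp_l = torch.tensor([-14.1, -11.0])  # log-likelihoods, dis-preferred ones
print(preference_loss(logp_w, logp_l))
```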

[181] Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space

Hong Zhang, Zhongjie Duan, Xingjun Wang, Yuze Zhao, Weiyi Lu, Zhipeng Di, Yixuan Xu, Yingda Chen, Yu Zhang

Main category: cs.CV

TL;DR: Nexus-Gen is a unified multimodal model integrating image understanding, generation, and editing, outperforming existing models by combining autoregressive and diffusion models in a shared embedding space.

Motivation: Existing unified models struggle with image synthesis quality, autoregressive errors, and editing capabilities, prompting the need for a more robust solution.

Method: Proposes Nexus-Gen, using a shared image embedding space to bridge autoregressive and diffusion models, with a prefilled autoregression strategy to reduce error accumulation.

Result: Achieves state-of-the-art performance on image understanding, generation, and editing tasks after training on 26.3 million samples.

Conclusion: Nexus-Gen advances multimodal modeling by unifying tasks and releasing models, datasets, and code for further research.

Abstract: Unified multimodal generative models aim to integrate image understanding and generation abilities, offering significant advantages in harnessing multimodal corpora, particularly interleaved text-image data. However, existing unified models exhibit limitations in image synthesis quality, autoregressive error accumulation, and image editing capability. In this work, we propose Nexus-Gen, a novel architecture that unifies image understanding, generation, and editing tasks in a shared image embedding space. This shared space serves as a bridge for the autoregressive and diffusion models, which seamlessly integrates their complementary strengths in cross-modal modeling. To mitigate the severe error accumulation during autoregressive embedding prediction, we propose a novel prefilled autoregression strategy that aligns training-inference dynamics by prefilling input sequences with learnable embeddings. After multi-stage and multi-task training on our constructed large-scale dataset with 26.3 million samples, Nexus-Gen achieves state-of-the-art performance on the evaluation benchmarks spanning image understanding, generation and editing tasks. All models, datasets, and source codes are released in https://github.com/modelscope/Nexus-Gen to facilitate further advancements across the field.

[182] Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs

Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, Minh-Son To, Johan Verjans, Phi Le Nguyen, Vu Minh Hieu Phan

Main category: cs.CV

TL;DR: The paper introduces HEAL-MedVQA, a benchmark to evaluate hallucination and localization issues in medical LMMs, and proposes the LobA framework to improve visual reasoning.

Motivation: Current medical LMMs often generate hallucinations due to poor localization reasoning, relying on linguistic patterns or irrelevant image areas.

Method: HEAL-MedVQA includes evaluation protocols and a dataset of 67K VQA pairs with doctor-annotated masks. The LobA framework trains LMMs to localize and self-prompt pathological regions.

Result: The LobA framework outperforms state-of-the-art biomedical LMMs on HEAL-MedVQA, improving robustness in medical VQA.

Conclusion: The work addresses critical limitations in medical LMMs by enhancing localization and reducing hallucinations, advancing reliable medical data interpretation.

Abstract: Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in medical data interpretation. However, these models frequently generate hallucinations contradicting source evidence, particularly due to inadequate localization reasoning. This work reveals a critical limitation in current medical LMMs: instead of analyzing relevant pathological regions, they often rely on linguistic patterns or attend to irrelevant image areas when responding to disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization MedVQA), a comprehensive benchmark designed to evaluate LMMs’ localization abilities and hallucination robustness. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, with doctor-annotated anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.

[183] Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence

Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, Martin Schramm

Main category: cs.CV

TL;DR: A novel framework, cRID, is introduced to detect and leverage textual describable PII in street-level datasets, improving privacy protection and person re-identification.

Motivation: Street-level datasets for autonomous driving and AI research contain PII, posing privacy risks. Current methods focus on biometric traits, missing other identifiable clues.

Method: Combines Large Vision-Language Models, Graph Attention Networks, and representation learning to detect and utilize interpretable PII features.

Result: Improved performance in cross-dataset person Re-ID, demonstrated from Market-1501 to CUHK03-np.

Conclusion: cRID effectively addresses privacy risks by detecting semantic PII, enhancing Re-ID, and is practical for real-world applications.

Abstract: The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textual describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework’s practical utility. Code is available at https://github.com/RAufschlaeger/cRID.

[184] FlowAlign: Trajectory-Regularized, Inversion-Free Flow-based Image Editing

Jeongsol Kim, Yeobin Hong, Jonghyun Park, Jong Chul Ye

Main category: cs.CV

TL;DR: FlowAlign is an inversion-free flow-based framework for consistent image editing using optimal control-based trajectory control, improving source consistency and editing stability.

Motivation: Existing inversion-free methods like FlowEdit suffer from unstable editing trajectories and poor source consistency due to lack of latent inversion.

Method: FlowAlign introduces terminal point regularization to balance semantic alignment with edit prompts and structural consistency with the source image, leveraging optimal control for smoother trajectories.

Result: FlowAlign outperforms existing methods in source preservation and editing controllability, supporting reverse editing by reversing the ODE trajectory.

Conclusion: FlowAlign provides a robust solution for consistent and reversible image editing without inversion, enhancing both stability and controllability.

Abstract: Recent inversion-free, flow-based image editing methods such as FlowEdit leverage a pre-trained noise-to-image flow model such as Stable Diffusion 3, enabling text-driven manipulation by solving an ordinary differential equation (ODE). While the lack of exact latent inversion is a core advantage of these methods, it often results in unstable editing trajectories and poor source consistency. To address this limitation, we propose FlowAlign, a novel inversion-free flow-based framework for consistent image editing with optimal control-based trajectory control. Specifically, FlowAlign introduces source similarity at the terminal point as a regularization term to promote smoother and more consistent trajectories during the editing process. Notably, our terminal point regularization is shown to explicitly balance semantic alignment with the edit prompt and structural consistency with the source image along the trajectory. Furthermore, FlowAlign naturally supports reverse editing by simply reversing the ODE trajectory, highlighting the reversible and consistent nature of the transformation. Extensive experiments demonstrate that FlowAlign outperforms existing methods in both source preservation and editing controllability.
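To make the trajectory-regularization idea concrete, here is a toy Euler rollout of a flow ODE with an added pull toward the source latent, standing in for the terminal-point regularization; the velocity field, step count, and weight lam are all placeholders, not FlowAlign's actual formulation:

```python
import torch

def edit_with_terminal_reg(x_src, velocity, steps=50, lam=0.5):
    """Toy Euler rollout of a flow ODE with a pull toward the source
    latent; the lam * (x_src - x) term is a simplified stand-in for a
    terminal source-similarity regularizer."""
    x, dt = x_src.clone(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * (velocity(x, t) + lam * (x_src - x))
    return x

velocity = lambda x, t: torch.tanh(x)  # dummy prompt-conditioned field
x_src = torch.randn(1, 4, 8, 8)        # dummy source latent
print(edit_with_terminal_reg(x_src, velocity).shape)
```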

[185] Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel

Main category: cs.CV

TL;DR: Prompt4Trust is an RL framework for prompt augmentation in MLLMs to improve confidence calibration, enhancing trustworthiness in healthcare applications.

Motivation: Addressing MLLMs' sensitivity to prompts and overconfidence in incorrect responses, critical for clinical reliability.

Method: Uses a lightweight LLM to generate context-aware auxiliary prompts for better confidence calibration in MLLMs.

Result: Achieves SOTA on PMC-VQA benchmark and shows zero-shot generalization to larger MLLMs.

Conclusion: Demonstrates scalable, automated prompt engineering for trustworthy MLLMs in safety-critical settings.

Abstract: Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
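Confidence calibration of the kind targeted here is conventionally quantified with the expected calibration error (ECE); a minimal sketch of that generic metric (not Prompt4Trust's RL reward):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin predictions by stated confidence and accumulate the gap between
    each bin's mean confidence and its accuracy, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

conf = np.random.rand(1000)
correct = (np.random.rand(1000) < conf).astype(float)  # well-calibrated toy
print(expected_calibration_error(conf, correct))       # close to zero
```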

[186] PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do, Sungpyo Kim, Geunhyuk Youk, Jaehyup Lee, Munchurl Kim

Main category: cs.CV

TL;DR: PAN-Crafter addresses cross-modality misalignment in PAN-sharpening by introducing modality-adaptive reconstruction and cross-modality alignment-aware attention, outperforming state-of-the-art methods in accuracy, speed, and memory efficiency.

Motivation: Cross-modality misalignment in PAN-sharpening causes spectral distortion and blurring, which conventional methods fail to address due to reliance on perfect alignment assumptions.

Method: Proposes PAN-Crafter with Modality-Adaptive Reconstruction (MARs) for joint HRMS and PAN reconstruction and Cross-Modality Alignment-Aware Attention (CM3A) for bidirectional feature alignment.

Result: Outperforms state-of-the-art methods in all metrics, with 50.11x faster inference and 0.63x memory usage, and shows strong generalization on unseen datasets.

Conclusion: PAN-Crafter effectively mitigates misalignment issues, offering robust and efficient PAN-sharpening with superior performance and generalization.

Abstract: PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment – caused by sensor placement, acquisition timing, and resolution disparity – induces a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN’s high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11$\times$ faster inference time and 0.63$\times$ the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.

[187] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Shuo Zhang

Main category: cs.CV

TL;DR: EECD-Net is a multi-stage crack detection method using SRCNN, SCU, and GAT modules, achieving 98.6% accuracy and 33% energy reduction.

Motivation: Address challenges of low-quality images and high energy consumption in road crack detection for real-time monitoring.

Method: Uses SRCNN for image enhancement, SCU for energy efficiency, and GAT for multi-scale feature fusion.

Result: 98.6% detection accuracy and 5.6 mJ energy consumption, outperforming Hybrid-Segmentor by 1.5%.

Conclusion: EECD-Net offers a scalable, low-power solution for real-time infrastructure monitoring.

Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhances image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.
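The energy savings come from the spiking pathway: an integrate-and-fire neuron accumulates input current and emits sparse binary spikes. A toy rollout of that mechanism (threshold and leak are assumptions; the paper's CIF neuron additionally supports continuous-valued training):

```python
import torch

def integrate_and_fire(currents, v_th=1.0, leak=0.9):
    """Toy spiking rollout: a membrane potential accumulates input current
    each timestep and emits a binary spike, with soft reset, once it
    crosses the threshold. v_th and leak are assumed values."""
    v = torch.zeros_like(currents[0])
    spikes = []
    for x in currents:            # currents: [T, ...] input per timestep
        v = leak * v + x
        s = (v >= v_th).float()
        v = v - s * v_th          # soft reset by subtraction
        spikes.append(s)
    return torch.stack(spikes)    # sparse binary pulse sequence

currents = torch.rand(8, 4)       # 8 timesteps, 4 neurons
print(integrate_and_fire(currents).mean())  # spike rate, an energy proxy
```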

[188] MARL-MambaContour: Unleashing Multi-Agent Deep Reinforcement Learning for Active Contour Optimization in Medical Image Segmentation

Ruicheng Zhang, Yu Sun, Zeyu Zhang, Jinai Li, Xiaofan Liu, Au Hoi Fan, Haowei Guo, Puxin Yan

Main category: cs.CV

TL;DR: MARL-MambaContour is a contour-based medical image segmentation framework using MARL, optimizing contour alignment with SAC and ERAM, and enhancing policy with BCHFM for state-of-the-art performance.

Motivation: Address limitations of pixel-based methods by ensuring topological consistency and structural awareness in medical image segmentation.

Method: Models contour points as agents adjusting positions via MARL, using SAC with ERAM for optimization and a Mamba-based policy network with BCHFM for inter-agent communication.

Result: Achieves state-of-the-art performance on five medical imaging datasets, demonstrating accuracy and robustness.

Conclusion: MARL-MambaContour is a promising framework for clinical applications due to its precision and adaptability.

Abstract: We introduce MARL-MambaContour, the first contour-based medical image segmentation framework based on Multi-Agent Reinforcement Learning (MARL). Our approach reframes segmentation as a multi-agent cooperation task focused on generating topologically consistent object-level contours, addressing the limitations of traditional pixel-based methods which could lack topological constraints and holistic structural awareness of anatomical regions. Each contour point is modeled as an autonomous agent that iteratively adjusts its position to align precisely with the target boundary, enabling adaptation to blurred edges and intricate morphologies common in medical images. This iterative adjustment process is optimized by a contour-specific Soft Actor-Critic (SAC) algorithm, further enhanced with the Entropy Regularization Adjustment Mechanism (ERAM) which dynamically balances agent exploration with contour smoothness. Furthermore, the framework incorporates a Mamba-based policy network featuring a novel Bidirectional Cross-attention Hidden-state Fusion Mechanism (BCHFM). This mechanism mitigates potential memory confusion limitations associated with long-range modeling in state space models, thereby facilitating more accurate inter-agent information exchange and informed decision-making. Extensive experiments on five diverse medical imaging datasets demonstrate the state-of-the-art performance of MARL-MambaContour, highlighting its potential as an accurate and robust clinical application.

[189] Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement

Kun Yuan, Tingxuan Chen, Shi Li, Joel L. Lavanchy, Christian Heiliger, Ege Özsoy, Yiming Huang, Long Bai, Nassir Navab, Vinkle Srivastav, Hongliang Ren, Nicolas Padoy

Main category: cs.CV

TL;DR: SPA is a lightweight framework for surgical workflow understanding, adapting foundation models to institutional settings with minimal annotation, achieving state-of-the-art performance in few-shot phase recognition.

Motivation: The complexity and diversity of surgical workflows make generalizable models challenging. Existing foundation models struggle with domain shifts in unseen environments.

Method: SPA uses few-shot spatial adaptation, diffusion modeling for temporal consistency, and dynamic test-time adaptation to align multi-modal embeddings with institution-specific scenes.

Result: SPA outperforms full-shot models with just 32-shot labeled data, achieving top performance in cross-institutional and cross-procedural phase recognition.

Conclusion: SPA offers a practical, efficient solution for hospitals to customize phase recognition models with minimal annotation, enhancing reliability under distribution shifts.

Abstract: The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data. Code is available at https://github.com/CAMMA-public/SPA

[190] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Lin Mei, Peiyi Shen, Liang Zhang

Main category: cs.CV

TL;DR: The paper introduces CBM-HNMU, a method to enhance interpretability and accuracy in deep learning models by refining concepts in the Concept Bottleneck Model and distilling corrected knowledge back into black-box models.

Motivation: Deep learning models are becoming more complex and less interpretable, with existing methods lacking effective interventions or model modifications.

Method: CBM-HNMU uses the Concept Bottleneck Model to approximate black-box reasoning, identifies and refines detrimental concepts, and distills corrected knowledge back into the model.

Result: Evaluated on multiple datasets, CBM-HNMU improves model accuracy by up to 2.64% and increases average accuracy by 1.03%.

Conclusion: CBM-HNMU successfully enhances both interpretability and accuracy in deep learning models.

Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy of 1.03%. Source code is available at: https://github.com/XiGuaBo/CBM-HNMU.

[191] FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

Quang-Huy Che, Vinh-Tiep Nguyen

Main category: cs.CV

TL;DR: FA-Seg is a fast, accurate, training-free framework for open-vocabulary semantic segmentation using diffusion models, achieving high performance with efficient inference.

Motivation: Existing methods either lose spatial precision (contrastive learning) or face computation-quality trade-offs (diffusion models). FA-Seg aims to bridge this gap.

Method: FA-Seg uses a pretrained diffusion model with (1+1)-step segmentation, a dual-prompt mechanism, Hierarchical Attention Refinement (HARD), and Test-Time Flipping (TTF).

Result: Achieves 43.8% average mIoU on PASCAL VOC, PASCAL Context, and COCO Object benchmarks with superior efficiency.

Conclusion: FA-Seg balances segmentation quality and inference efficiency, offering a strong foundation for future extensions.

Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code will be open-sourced after this paper is accepted.
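Of the three components, Test-Time Flipping is the simplest to illustrate: predictions for the image and its horizontal mirror are averaged after un-flipping. A generic sketch (FA-Seg applies the idea at the attention level, which this does not reproduce):

```python
import torch

def predict_with_ttf(model, image):
    """Average the segmentation logits of an image and its horizontal
    mirror, un-flipping the latter before fusion."""
    logits = model(image)
    logits_flip = model(torch.flip(image, dims=[-1]))
    return 0.5 * (logits + torch.flip(logits_flip, dims=[-1]))

model = lambda x: x.mean(1, keepdim=True).repeat(1, 3, 1, 1)  # dummy segmenter
image = torch.randn(1, 3, 64, 64)
print(predict_with_ttf(model, image).shape)
```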

[192] Similarity Memory Prior is All You Need for Medical Image Segmentation

Hao Tang, Zhiqing Guo, Liejun Wang, Chao Liu

Main category: cs.CV

TL;DR: The paper introduces Sim-MPNet, a network for medical image segmentation, leveraging similarity memory priors and dynamic updates to improve segmentation accuracy.

Motivation: Inspired by 'grandmother cells' in macaque V1, the study aims to enhance medical image segmentation by mimicking their recognition capabilities.

Method: Proposes Sim-MPNet with DMW-LA for dynamic memory updates and DS-GIM for feature distribution analysis using cosine similarity and Euclidean distance.

Result: Outperforms state-of-the-art methods on four public datasets.

Conclusion: Sim-MPNet effectively improves segmentation by leveraging similarity memory priors and dynamic feature updates.

Abstract: In recent years, it has been found that “grandmother cells” in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights-Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network directly extract category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available on https://github.com/vpsg-research/Sim-MPNet.
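DS-GIM is described as probing feature distributions with two complementary measures, cosine similarity and Euclidean distance; the sketch below just computes both for paired feature vectors (how Sim-MPNet fuses them is not specified here):

```python
import torch
import torch.nn.functional as F

def dual_similarity(feats_a, feats_b):
    """Compute the two measures side by side: cosine similarity (angular
    agreement) and Euclidean distance (magnitude-sensitive gap)."""
    cos = F.cosine_similarity(feats_a, feats_b, dim=-1)
    euc = torch.linalg.vector_norm(feats_a - feats_b, dim=-1)
    return cos, euc

a, b = torch.randn(5, 128), torch.randn(5, 128)
print(dual_similarity(a, b))
```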

[193] Robustifying 3D Perception via Least-Squares Graphs for Multi-Agent Object Tracking

Maria Damanaki, Ioulia Kapsali, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos

Main category: cs.CV

TL;DR: A novel multi-agent tracking framework using least-squares graphs on 3D LiDAR data mitigates adversarial noise, outperforming state-of-the-art methods by 23.3%.

Motivation: Enhancing resilience of EdgeAI systems like autonomous vehicles against adversarial threats by improving multi-agent cooperation for better context understanding and robustness.

Method: Uses least-squares graphs to refine object detection centroids via overlapped bounding boxes, fusing multi-vehicle detections and tracking in two stages.

Result: Outperforms single and multi-agent tracking methods by up to 23.3% on the V2V4Real dataset under adversarial conditions.

Conclusion: The proposed framework is a resilient solution for adversarial threats in 3D LiDAR scenes, requiring no additional defenses.

Abstract: The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single-agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi-agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least-squares graph on multi-agent adversarial bounding boxes. Specifically, we employ the least-squares graph tool to reduce the induced positional error of each detection’s centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi-vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real-world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state-of-the-art single and multi-agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.
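The least-squares graph idea, reduced to a toy 1-D example: preserve the relative geometry (differential coordinates under a graph Laplacian) implied by one set of detections while pinning a few trusted anchor centroids, then solve the stacked system in the least-squares sense. Sizes and weights below are illustrative, not the paper's settings:

```python
import numpy as np

n = 6
rel = np.arange(n, dtype=float)             # clean relative layout (agent A)
abs_noisy = rel + 0.3 * np.random.randn(n)  # noisy absolute centroids (agent B)

W = np.ones((n, n)) - np.eye(n)             # fully connected detection graph
L = np.diag(W.sum(1)) - W                   # graph Laplacian
delta = L @ rel                             # differential coordinates to keep

anchor_idx = [0, n - 1]                     # high-confidence anchor points
A = np.zeros((len(anchor_idx), n))
A[np.arange(len(anchor_idx)), anchor_idx] = 1.0

w = 10.0                                    # anchor weight (assumption)
M = np.vstack([L, w * A])
b = np.concatenate([delta, w * abs_noisy[anchor_idx]])
refined, *_ = np.linalg.lstsq(M, b, rcond=None)
print(refined)                              # refined centroid coordinates
```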

[194] What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

Main category: cs.CV

TL;DR: A survey categorizing critical traffic elements into anomalies and normal but critical entities, analyzing 35 vision-driven tasks and 73 datasets, aiming to unify standards and optimize resources.

Motivation: To improve road safety by leveraging advances in vision-based sensors and computer vision algorithms, providing a unified framework for traffic scenario analysis.

Method: Systematic categorization of traffic entities into two main groups (anomalies and normal but critical), integrating ten categories and twenty subclasses, and analyzing tasks and datasets.

Result: A taxonomy connecting related fields, comprehensive analysis of tasks and datasets, and identification of weaknesses and potential solutions.

Conclusion: The survey offers a holistic overview, guides resource selection, and highlights research gaps, contributing to the field’s advancement.

Abstract: Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups that are anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.

[195] Longitudinal Study of Facial Biometrics at the BEZ: Temporal Variance Analysis

Mathias Schulz, Alexander Spenke, Pia Funk, Florian Blümel, Markus Rohde, Ralph Breithaupt, Gerd Nolden, Norbert Jung, Robert Lange

Main category: cs.CV

TL;DR: Long-term biometric evaluations show daily score fluctuations are more significant than long-term changes, emphasizing the need for extended testing in controlled environments.

Motivation: To understand the variability of biometric characteristics over time and across diverse populations.

Method: Conducted a 2.5-year study with 400+ participants using GDPR-compliant biometric data (238,000+ datasets) and state-of-the-art face recognition algorithms.

Result: Biometric comparison scores fluctuate more between days than over the entire period.

Conclusion: Long-term, controlled testing is crucial for accurate biometric analysis and future advancements.

Abstract: This study presents findings from long-term biometric evaluations conducted at the Biometric Evaluation Center (bez). Over the course of two and a half years of ongoing research, more than 400 participants representing diverse ethnicities, genders, and age groups were regularly assessed using a variety of biometric tools and techniques at the controlled testing facilities. Our findings are based on the General Data Protection Regulation-compliant local bez database with more than 238,000 biometric data sets categorized into multiple biometric modalities such as face and finger. We used state-of-the-art face recognition algorithms to analyze long-term comparison scores. Our results show that these scores fluctuate more significantly between individual days than over the entire measurement period. These findings highlight the importance of testing biometric characteristics of the same individuals over a longer period of time in a controlled measurement environment and lay the groundwork for future advancements in biometric data analysis.
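A synthetic illustration of the reported pattern, comparing typical day-to-day movement of the mean comparison score against the net change over the whole period (numbers are made up; the study analyzes real face-recognition scores):

```python
import numpy as np

rng = np.random.default_rng(0)
days = 900                                        # roughly 2.5 years
trend = np.linspace(0.80, 0.79, days)             # slight long-term drift
daily = trend + 0.02 * rng.standard_normal(days)  # mean score per day

day_to_day = np.abs(np.diff(daily)).mean()        # typical change between days
long_term = abs(daily[-30:].mean() - daily[:30].mean())  # first vs last month
print(day_to_day, long_term)                      # day-to-day change dominates
```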

[196] Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles

Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Ding Yuan

Main category: cs.CV

TL;DR: The paper introduces Online Map Association (OMA), a benchmark for associating global SD maps with online HD maps to improve autonomous vehicle navigation. It includes a dataset and a baseline method called Map Association Transformer.

Motivation: Autonomous vehicles struggle with integrating global SD maps and online HD maps for hybrid navigation, limiting their planning capabilities.

Method: The authors propose the OMA benchmark with a dataset of 480k roads and 260k lane paths. They also introduce the Map Association Transformer, using path-aware and spatial attention mechanisms.

Result: The OMA benchmark and baseline method aim to enhance autonomous vehicle navigation by addressing the gap in hybrid map association.

Conclusion: The work provides a foundational benchmark and method for improving autonomous vehicle navigation through better map association.

Abstract: Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which creates challenges in utilizing online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, the OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.

[197] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking

Ruixiang Chen, Guolei Sun, Yawei Li, Jie Qin, Luca Benini

Main category: cs.CV

TL;DR: Enhancements to SAM2 for video object tracking improve accuracy with hierarchical motion estimation and optimized memory bank, achieving state-of-the-art results.

Motivation: Address challenges like occlusions, background clutter, and target reappearance in video object tracking.

Method: Hierarchical motion estimation (linear prediction + non-linear refinement) and optimized memory bank (long-term/short-term frames).

Result: 9.6% and 7.2% AUC improvements on LaSOT and LaSOText; larger gains on smaller models.

Conclusion: Trainless, low-overhead enhancements effectively boost long-term tracking performance.

Abstract: This paper presents enhancements to the SAM2 framework for video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our trainless, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
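The lightweight first stage of the hierarchical scheme amounts to constant-velocity extrapolation of the target's centroid; a minimal sketch (the selective non-linear refinement stage is not shown):

```python
import numpy as np

def linear_predict(centroids, horizon=1):
    """Constant-velocity extrapolation of the next centroid from the last
    two observations."""
    v = centroids[-1] - centroids[-2]
    return centroids[-1] + horizon * v

track = np.array([[10.0, 20.0], [12.0, 21.5], [14.1, 23.0]])
print(linear_predict(track))   # predicted centroid for the next frame
```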

[198] Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

Ethan Dack, Chengliang Dai

Main category: cs.CV

TL;DR: The paper revisits the ‘Name That Dataset’ task for chest X-ray datasets to explore biases, applies transformations, and analyzes results using various network architectures.

DetailsMotivation: To investigate if biases exist in popular open-source chest X-ray datasets and ensure AI methods focus on pathology, not shortcuts.

Method: Applies the ‘Name That Dataset’ task to NIH, CheXpert, MIMIC-CXR, and PadChest datasets, uses transformations, and tests with different network architectures.

Result: Identifies and explains biases in the datasets, emphasizing the need for explainable research.

Conclusion: Encourages more open-source medical datasets and transparent research to improve AI applications in medical imaging.

Abstract: Recent works have revisited the infamous task “Name That Dataset”, demonstrating that non-medical datasets contain underlying biases and that the dataset origin task can be solved with high accuracy. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. To extend our work, we apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. Given the importance of AI applications in medical imaging, it’s vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research in medical imaging and the creation of more open-source datasets in the medical domain. Our code can be found here: https://github.com/eedack01/x_ray_ds_bias.
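
For intuition, a minimal sketch of the dataset-origin probe: train an off-the-shelf CNN to predict which dataset an image came from. FakeData stands in for a real ImageFolder over the four sources; folder names, architecture, and preprocessing here are placeholders, not the paper's setup.

```python
import torch, torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# FakeData substitutes for datasets.ImageFolder("xrays/", ...) with one
# subfolder per source (nih/, chexpert/, mimic_cxr/, padchest/); the label
# is the dataset of origin, not the pathology.
origins = ["nih", "chexpert", "mimic_cxr", "padchest"]
train_set = datasets.FakeData(size=64, image_size=(3, 224, 224),
                              num_classes=len(origins),
                              transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=16, shuffle=True)

model = models.resnet18(num_classes=len(origins))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, origin_label in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), origin_label)
    loss.backward()
    opt.step()
# High held-out accuracy on real data would indicate source-specific
# artifacts (shortcuts) rather than pathology-driven features.
```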

[199] View Invariant Learning for Vision-Language Navigation in Continuous Environments

Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley

Main category: cs.CV

TL;DR: The paper introduces V2-VLNCE, a generalized scenario for Vision-Language Navigation in Continuous Environments (VLNCE) with varied viewpoints, and proposes VIL, a view-invariant post-training strategy to enhance navigation policy robustness.

DetailsMotivation: Current navigation policies are sensitive to viewpoint changes (camera height and angle), limiting their robustness. The paper aims to address this by generalizing the scenario and improving policy adaptability.

Method: VIL uses contrastive learning for sparse, view-invariant features and a teacher-student framework for the Waypoint Predictor Module. It employs end-to-end training to optimize components jointly.

Result: VIL outperforms state-of-the-art methods by 8-15% in Success Rate on R2R-CE and RxR-CE datasets. It also achieves SOTA performance on RxR-CE and maintains standard VLNCE performance.

Conclusion: VIL is a plug-and-play post-training method that enhances robustness to viewpoint changes without compromising standard performance, making it practical for real-world applications.

Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent’s observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
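
A minimal sketch of the contrastive ingredient, assuming a standard InfoNCE objective over paired embeddings of the same observation seen from two camera viewpoints; the paper's actual loss, encoder, and sparsity mechanism may differ.

```python
import torch
import torch.nn.functional as F

def view_invariance_loss(feat_a, feat_b, temperature=0.1):
    """InfoNCE-style loss: feat_a[i] and feat_b[i] embed the same observation
    rendered from two viewpoints; other batch entries act as negatives.
    Minimizing it pulls the two views of each sample together."""
    za = F.normalize(feat_a, dim=1)
    zb = F.normalize(feat_b, dim=1)
    logits = za @ zb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(za.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random "viewpoint" embeddings:
a, b = torch.randn(8, 128), torch.randn(8, 128)
print(view_invariance_loss(a, b).item())
```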

[200] Learning and Transferring Better with Depth Information in Visual Reinforcement Learning

Zichun Xu, Yuntao Li, Zhaomin Wang, Lei Zhuang, Guocai Yang, Jingdong Zhao

Main category: cs.CV

TL;DR: A vision transformer-based backbone fuses RGB and depth data for better generalization, using contrastive learning and curriculum learning for sim2real transfer.

DetailsMotivation: Depth information is robust to appearance variations and provides 3D spatial details, making it valuable for enhancing generalization in visual tasks.

Method: Separate CNN stems process RGB and depth modalities, followed by a scalable vision transformer for fusion. Contrastive learning with masked tokens improves sample efficiency, and curriculum learning aids sim2real transfer.

Result: The proposed method effectively combines RGB and depth data, leveraging transformer architecture and unsupervised learning for robust performance.

Conclusion: The fusion of RGB and depth via vision transformers, aided by contrastive and curriculum learning, enhances generalization and sim2real transfer.

Abstract: Depth information is robust to scene appearance variations and inherently carries 3D spatial details. In this paper, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive unsupervised learning scheme is designed with masked and unmasked tokens to improve sample efficiency during the reinforcement learning process. For sim2real transfer, a flexible curriculum learning schedule is developed to deploy domain randomization over training processes.
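
A toy sketch of the described layout (separate CNN stems per modality, transformer fusion); layer sizes, token pooling, and depths are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class RGBDFusionBackbone(nn.Module):
    """Separate CNN stems per modality; concatenated feature tokens are fed
    to a small transformer encoder, then pooled into one representation."""
    def __init__(self, dim=128):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, dim, 4, stride=2), nn.ReLU(),
            )
        self.rgb_stem, self.depth_stem = stem(3), stem(1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, rgb, depth):
        tok = lambda x: x.flatten(2).transpose(1, 2)   # (B, C, H, W) -> (B, HW, C)
        tokens = torch.cat([tok(self.rgb_stem(rgb)), tok(self.depth_stem(depth))], dim=1)
        return self.fusion(tokens).mean(dim=1)         # pooled visual representation

net = RGBDFusionBackbone()
out = net(torch.randn(2, 3, 84, 84), torch.randn(2, 1, 84, 84))
print(out.shape)  # torch.Size([2, 128])
```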

[201] Supercharging Floorplan Localization with Semantic Rays

Yuval Grader, Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: A semantic-aware floorplan localization framework improves accuracy by jointly estimating depth and semantic rays, outperforming state-of-the-art methods.

DetailsMotivation: Current floorplan localization techniques ignore rich semantic details like windows and doors, focusing only on depth-based structural cues.

Method: The framework constructs a structural-semantic probability volume in a coarse-to-fine manner, refining high-probability regions for precise 2D location and orientation predictions.

Result: Outperforms state-of-the-art methods on benchmarks, with significant recall improvements and flexibility to incorporate metadata like room labels.

Conclusion: The semantic-aware approach enhances localization accuracy and efficiency, leveraging detailed floorplan semantics for better performance.

Abstract: Floorplans provide a compact representation of the building’s structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we show that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency.
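
A schematic of the coarse-to-fine scheme on a toy 2D problem: score a sparse grid of candidate locations, then densely re-sample only around the top-scoring cells. Here score_fn stands in for the structural-semantic ray-matching score, and all constants are illustrative.

```python
import numpy as np

def coarse_to_fine_localize(score_fn, extent=10.0, coarse=16, fine=16, top_k=5):
    """Score a coarse grid of candidate positions, then refine only the
    highest-probability cells with denser sampling."""
    xs = np.linspace(0, extent, coarse)
    grid = np.array([[score_fn(x, y) for x in xs] for y in xs])
    best = np.dstack(np.unravel_index(np.argsort(grid, axis=None)[-top_k:], grid.shape))[0]
    cell = extent / (coarse - 1)
    candidates = []
    for iy, ix in best:  # dense sampling inside each top-scoring cell
        for dy in np.linspace(-cell / 2, cell / 2, fine):
            for dx in np.linspace(-cell / 2, cell / 2, fine):
                x, y = xs[ix] + dx, xs[iy] + dy
                candidates.append((score_fn(x, y), x, y))
    return max(candidates)[1:]

# Toy score peaked at (3.3, 7.1):
peak = lambda x, y: -((x - 3.3) ** 2 + (y - 7.1) ** 2)
print(coarse_to_fine_localize(peak))
```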

[202] ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models

Yueqian Wang, Xiaojun Meng, Yifan Wang, Huishuai Zhang, Dongyan Zhao

Main category: cs.CV

TL;DR: The paper introduces ProactiveVideoQA, a benchmark for evaluating proactive interaction in multimodal dialogue systems, and PAUC, a metric accounting for temporal dynamics of responses, showing better alignment with human preferences.

DetailsMotivation: To address the need for proactive interaction in multimodal dialogue systems, especially during video playback, where traditional turn-by-turn dialogue falls short.

Method: Introduces ProactiveVideoQA as a benchmark and PAUC, a new metric evaluating temporal dynamics of responses. Benchmarks baseline systems and conducts a user study.

Result: PAUC aligns better with human preferences than traditional metrics, providing a more accurate evaluation of proactive interaction.

Conclusion: PAUC offers a faithful assessment of user experience in proactive scenarios, advancing the field of multimodal dialogue systems.

Abstract: With the growing research focus on multimodal dialogue systems, the capability for proactive interaction is gradually gaining recognition. As an alternative to conventional turn-by-turn dialogue, users increasingly expect multimodal systems to take more initiative, for example, by autonomously determining the timing of multi-turn responses in real time during video playback. To facilitate progress in this emerging area, we introduce ProactiveVideoQA, the first comprehensive benchmark to evaluate a system’s ability to engage in proactive interaction. Since model responses are generated at varying timestamps, we further propose PAUC, the first metric that accounts for the temporal dynamics of model responses. This enables a more accurate evaluation of systems operating in proactive settings. Through extensive benchmarking of various baseline systems on ProactiveVideoQA and a user study of human preferences, we show that PAUC is in better agreement with human preferences than traditional evaluation metrics, which typically only consider the textual content of responses. These findings demonstrate that PAUC provides a more faithful assessment of user experience in proactive interaction scenarios. Project homepage: https://github.com/yellow-binary-tree/ProactiveVideoQA

[203] (Almost) Free Modality Stitching of Foundation Models

Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto

Main category: cs.CV

TL;DR: Hyma proposes a hypernetwork-based solution for efficient uni-modal model selection and connector training in multi-modal models, reducing computational costs.

DetailsMotivation: The complexity and computational demands of selecting and aligning uni-modal models for multi-modal tasks motivate the need for an efficient solution.

Method: Hyma leverages hypernetworks to jointly train connector modules for multiple uni-modal model combinations, avoiding exhaustive grid search.

Result: Hyma reduces the search cost by 10x while matching the performance of grid search in multi-modal benchmarks.

Conclusion: Hyma offers an efficient and scalable approach for model selection and alignment in multi-modal tasks.

Abstract: Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this under-studied critical problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N × M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best-performing uni-modal model pair by 10×, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
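
A minimal sketch of the hypernetwork idea, assuming a linear connector whose weights are predicted from a learned embedding of the model pair; the real framework jointly trains connectors for all N × M pairs within one run, which this toy module only gestures at.

```python
import torch
import torch.nn as nn

class ConnectorHypernet(nn.Module):
    """Given an embedding identifying a (vision model, text model) pair,
    predict the weights of a linear connector that maps vision features
    into the text embedding space."""
    def __init__(self, n_pairs, d_vis, d_txt, d_emb=64):
        super().__init__()
        self.pair_emb = nn.Embedding(n_pairs, d_emb)
        self.gen = nn.Linear(d_emb, d_vis * d_txt + d_txt)  # weights + bias
        self.d_vis, self.d_txt = d_vis, d_txt

    def forward(self, pair_id, vis_feats):
        params = self.gen(self.pair_emb(pair_id))
        W = params[: self.d_vis * self.d_txt].view(self.d_txt, self.d_vis)
        b = params[self.d_vis * self.d_txt:]
        return vis_feats @ W.t() + b   # connector applied to vision features

hyper = ConnectorHypernet(n_pairs=6, d_vis=512, d_txt=256)
out = hyper(torch.tensor(2), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 256]); one shared hypernet serves all pairs
```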

[204] Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Hongjae Lee, Myungjun Son, Dongjea Kang, Seung-Won Jung

Main category: cs.CV

TL;DR: QLIP introduces a text-prompt-guided quantization method for diffusion models, reducing computational complexity while improving image quality.

DetailsMotivation: Existing quantization methods for diffusion models ignore input conditions like text prompts, limiting efficiency and performance.

Method: QLIP uses text prompts to dynamically select bit precision for each layer and time step, integrating with existing quantization techniques.

Result: QLIP reduces computational complexity and enhances image quality across multiple datasets.

Conclusion: QLIP effectively addresses the limitations of current quantization methods by leveraging text prompts for efficient diffusion model optimization.

Abstract: Despite the success of diffusion models in image generation tasks such as text-to-image, the enormous computational complexity of diffusion models limits their use in resource-constrained environments. To address this, network quantization has emerged as a promising solution for designing efficient diffusion models. However, existing diffusion model quantization methods do not consider input conditions, such as text prompts, as an essential source of information for quantization. In this paper, we propose a novel quantization method dubbed Quantization of Language-to-Image diffusion models using text Prompts (QLIP). QLIP leverages text prompts to guide the selection of bit precision for every layer at each time step. In addition, QLIP can be seamlessly integrated into existing quantization methods to enhance quantization efficiency. Our extensive experiments demonstrate the effectiveness of QLIP in reducing computational complexity and improving the quality of the generated images across various datasets.
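
A hedged sketch of prompt-conditioned precision selection: a small head maps a text-prompt embedding to a per-layer bit-width, which drives uniform fake-quantization. The real method also conditions on the diffusion time step and plugs into existing quantizers; all shapes and the {4, 8} choice set are assumptions.

```python
import torch
import torch.nn as nn

def fake_quantize(w, bits):
    """Uniform symmetric fake-quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class BitPolicy(nn.Module):
    """Map a text-prompt embedding to one bit-width choice per layer."""
    def __init__(self, d_prompt, n_layers, choices=(4, 8)):
        super().__init__()
        self.choices = choices
        self.head = nn.Linear(d_prompt, n_layers * len(choices))

    def forward(self, prompt_emb):
        logits = self.head(prompt_emb).view(-1, len(self.choices))
        return [self.choices[i] for i in logits.argmax(dim=1).tolist()]

policy = BitPolicy(d_prompt=768, n_layers=3)
bits_per_layer = policy(torch.randn(768))
weights = [torch.randn(16, 16) for _ in range(3)]
quantized = [fake_quantize(w, b) for w, b in zip(weights, bits_per_layer)]
print(bits_per_layer)
```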

[205] Text-Visual Semantic Constrained AI-Generated Image Quality Assessment

Qiang Li, Qingsen Yan, Haojian Huang, Peng Wu, Haokui Zhang, Yanning Zhang

Main category: cs.CV

TL;DR: The paper introduces SC-AGIQA, a framework for assessing AI-generated image quality by addressing semantic misalignment and detail perception issues using text-visual constraints.

DetailsMotivation: Existing methods for evaluating AI-generated images face challenges like semantic misalignment and missing details, necessitating a more robust solution.

Method: SC-AGIQA integrates text-visual semantic constraints with two modules: TSAM for semantic alignment using MLLMs and FFDPM for fine-grained distortion perception via frequency-domain analysis.

Result: SC-AGIQA outperforms state-of-the-art methods on benchmark datasets.

Conclusion: The proposed framework effectively enhances the evaluation of AI-generated image quality by addressing key limitations of existing approaches.

Abstract: With the rapid advancements in Artificial Intelligence Generated Image (AGI) technology, the accurate assessment of their quality has become an increasingly vital requirement. Prevailing methods typically rely on cross-modal models like CLIP or BLIP to evaluate text-image alignment and visual quality. However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and details perception missing. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. Our approach integrates key capabilities from multiple models and tackles the aforementioned challenges by introducing two core modules: the Text-assisted Semantic Alignment Module (TSAM), which leverages Multimodal Large Language Models (MLLMs) to bridge the semantic gap by generating an image description and comparing it against the original prompt for a refined consistency check, and the Frequency-domain Fine-Grained Degradation Perception Module (FFDPM), which draws inspiration from Human Visual System (HVS) properties by employing frequency domain analysis combined with perceptual sensitivity weighting to better quantify subtle visual distortions and enhance the capture of fine-grained visual quality details in images. Extensive experiments conducted on multiple benchmark datasets demonstrate that SC-AGIQA outperforms existing state-of-the-art methods. The code is publicly available at https://github.com/mozhu1/SC-AGIQA.
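
A toy rendition of the frequency-domain ingredient: FFT the error map and up-weight high-frequency energy as a crude proxy for HVS sensitivity to fine-detail distortions. The radial ramp weighting is an assumption, not the paper's calibrated perceptual weighting.

```python
import numpy as np

def frequency_weighted_distortion(ref, img, low_cut=0.1):
    """Score distortion by weighting the FFT of the error map so that
    higher spatial frequencies (fine detail) contribute more."""
    err = np.fft.fftshift(np.fft.fft2(img - ref))
    h, w = err.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)   # normalized frequency
    weight = np.clip(radius, low_cut, None)           # emphasize fine detail
    return float(np.mean(weight * np.abs(err) ** 2))

ref = np.zeros((64, 64))
noisy = ref + 0.1 * np.random.randn(64, 64)
print(frequency_weighted_distortion(ref, noisy))
```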

cs.AI

[206] SAMEP: A Secure Protocol for Persistent Context Sharing Across AI Agents

Hari Masoor

Main category: cs.AI

TL;DR: SAMEP is a protocol enabling secure, persistent, and searchable memory sharing among AI agents, improving collaboration and reducing redundancy.

DetailsMotivation: Current AI agents lack persistent memory sharing, hindering collaboration and knowledge retention across sessions and agents.

Method: SAMEP uses a distributed memory repository with semantic search, cryptographic controls (AES-256-GCM), and standardized APIs (MCP, A2A).

Result: 73% fewer redundant computations, 89% better context relevance, and full regulatory compliance (e.g., HIPAA).

Conclusion: SAMEP enables secure, persistent collaboration among AI agents, enhancing efficiency and compliance.

Abstract: Current AI agent architectures suffer from ephemeral memory limitations, preventing effective collaboration and knowledge sharing across sessions and agent boundaries. We introduce SAMEP (Secure Agent Memory Exchange Protocol), a novel framework that enables persistent, secure, and semantically searchable memory sharing among AI agents. Our protocol addresses three critical challenges: (1) persistent context preservation across agent sessions, (2) secure multi-agent collaboration with fine-grained access control, and (3) efficient semantic discovery of relevant historical context. SAMEP implements a distributed memory repository with vector-based semantic search, cryptographic access controls (AES-256-GCM), and standardized APIs compatible with existing agent communication protocols (MCP, A2A). We demonstrate SAMEP’s effectiveness across diverse domains including multi-agent software development, healthcare AI with HIPAA compliance, and multi-modal processing pipelines. Experimental results show 73% reduction in redundant computations, 89% improvement in context relevance scores, and complete compliance with regulatory requirements including audit trail generation. SAMEP enables a new paradigm of persistent, collaborative AI agent ecosystems while maintaining security and privacy guarantees.
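
A minimal sketch of the two named ingredients: AES-256-GCM payload encryption (via the cryptography package) plus cosine-similarity semantic search. The bag-of-random-vectors embedding and the record layout are stand-ins for a real sentence encoder and the protocol's actual schema.

```python
import numpy as np
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aes = AESGCM(key)

def store(repo, text, embed):
    """Encrypt the memory payload; keep only its embedding in the clear."""
    nonce = np.random.bytes(12)
    repo.append((embed(text), nonce, aes.encrypt(nonce, text.encode(), None)))

def semantic_search(repo, query, embed, top_k=1):
    """Rank records by cosine similarity, decrypt only the top hits."""
    q = embed(query)
    scored = sorted(repo, key=lambda r: -float(
        q @ r[0] / (np.linalg.norm(q) * np.linalg.norm(r[0]))))
    return [aes.decrypt(nonce, blob, None).decode() for _, nonce, blob in scored[:top_k]]

# Stand-in embedding (a real deployment would use a sentence encoder):
rng = np.random.default_rng(0)
vocab = {}
def embed(text):
    return np.mean([vocab.setdefault(w, rng.normal(size=64))
                    for w in text.lower().split()], axis=0)

repo = []
store(repo, "agent A resolved the billing bug", embed)
store(repo, "agent B deployed the search index", embed)
print(semantic_search(repo, "who fixed the billing bug", embed))
```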

[207] AI Mother Tongue: Self-Emergent Communication in MARL via Endogenous Symbol Systems

Hung Ming Liu

Main category: cs.AI

TL;DR: The study challenges traditional inductive biases in MARL by introducing the AIM framework, showing endogenous symbol systems enable spontaneous semantic convergence without external biases, aligning with neuroscience and LLM research.

DetailsMotivation: To question if artificial inductive biases in MARL are over-engineering and explore if endogenous symbol systems can naturally achieve effective communication.

Method: Uses the AIM framework with VQ-VAE to demonstrate spontaneous semantic compression and Nash equilibrium-driven convergence in agents.

Result: AIM achieves efficient symbolic communication without external biases, with interpretable analysis revealing power-law distribution in symbol usage and yielding three theoretical insights.

Conclusion: The findings bridge symbolism and connectionism, suggesting future work with HQ-VAE and RL pre-training to enhance AIM’s capabilities.

Abstract: In Decentralized Multi-Agent Reinforcement Learning (MARL), the development of Emergent Communication has long been constrained by the “Joint Exploration Dilemma”, leading agents to fall into a “Communication Vacuum Equilibrium”. Traditional methods address this by introducing inductive biases to facilitate communication emergence. This study fundamentally questions whether such artificial inductive biases are, in fact, over-engineering. Through experiments with the “AI Mother Tongue” (AIM) framework, based on a Vector Quantized Variational Autoencoder (VQ-VAE), we demonstrate that when agents possess an endogenous symbol system, their neural representations naturally exhibit spontaneous semantic compression and Nash equilibrium-driven semantic convergence, achieving effective symbolic communication without external inductive biases. This aligns with recent neuroscience findings suggesting that the human brain does not directly use human language for internal thought, and resonates with research on “soft thinking” capabilities in Large Language Models (LLMs). Compared to traditional explicit communication methods, AIM demonstrates stronger generality and efficiency. The interpretable analysis toolkit developed in this study confirms that symbol usage exhibits a significant power-law distribution, leading to three major theoretical insights: the “Neural Communication Hypothesis”, the “Tool-First Principle”, and the “Semantic Interpretability Paradigm”. Future research will explore the integration of Hierarchical Quantized Variational Autoencoders (HQ-VAE) to enhance AIM’s complex expressive capabilities and investigate the potential for “Reinforcement Learning (RL) Low-Level Pre-training”. This discovery offers new avenues for bridging symbolism and connectionism.
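
A sketch of the VQ-VAE quantization step the framework builds on: each continuous latent snaps to its nearest codebook entry, and the resulting index doubles as a discrete communication symbol. Codebook size and dimensions are illustrative.

```python
import torch

def vector_quantize(z, codebook):
    """Snap each latent vector to its nearest codebook entry; the codebook
    index serves as the discrete symbol an agent can emit."""
    dists = torch.cdist(z, codebook)   # (B, K) pairwise distances
    symbols = dists.argmin(dim=1)      # discrete "utterance" per agent state
    z_q = codebook[symbols]
    # Straight-through estimator so gradients flow back to the encoder:
    z_q = z + (z_q - z).detach()
    return symbols, z_q

codebook = torch.randn(32, 16)         # 32 symbols, 16-dim embeddings
latents = torch.randn(4, 16)
symbols, quantized = vector_quantize(latents, codebook)
print(symbols.tolist())
```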

[208] Orchestrator-Agent Trust: A Modular Agentic AI Visual Classification System with Trust-Aware Orchestration and RAG-Based Reasoning

Konstantinos I. Roumeliotis, Ranjan Sapkota, Manoj Karkee, Nikolaos D. Tselikas

Main category: cs.AI

TL;DR: A modular AI framework integrates multimodal agents with a reasoning orchestrator and RAG for trust-aware zero-shot visual classification, achieving 85.63% accuracy in apple leaf disease diagnosis.

DetailsMotivation: Addressing trust challenges in zero-shot AI agents by combining multimodal understanding with calibrated orchestration.

Method: Proposes a framework with three configurations: zero-shot, fine-tuned, and trust-calibrated orchestration using CLIP-based retrieval and re-evaluation.

Result: 77.94% accuracy improvement in zero-shot setting, with GPT-4o showing better calibration than Qwen-2.5-VL.

Conclusion: The system separates perception from meta-reasoning, enabling scalable, interpretable AI for trust-critical domains, with open-source release for reproducibility.

Abstract: Modern Artificial Intelligence (AI) increasingly relies on multi-agent architectures that blend visual and language understanding. Yet, a pressing challenge remains: How can we trust these agents especially in zero-shot settings with no fine-tuning? We introduce a novel modular Agentic AI visual classification framework that integrates generalist multimodal agents with a non-visual reasoning orchestrator and a Retrieval-Augmented Generation (RAG) module. Applied to apple leaf disease diagnosis, we benchmark three configurations: (I) zero-shot with confidence-based orchestration, (II) fine-tuned agents with improved performance, and (III) trust-calibrated orchestration enhanced by CLIP-based image retrieval and re-evaluation loops. Using confidence calibration metrics (ECE, OCR, CCC), the orchestrator modulates trust across agents. Our results demonstrate a 77.94% accuracy improvement in the zero-shot setting using trust-aware orchestration and RAG, achieving 85.63% overall. GPT-4o showed better calibration, while Qwen-2.5-VL displayed overconfidence. Furthermore, image-RAG grounded predictions with visually similar cases, enabling correction of agent overconfidence via iterative re-evaluation. The proposed system separates perception (vision agents) from meta-reasoning (orchestrator), enabling scalable and interpretable multi-agent AI. This blueprint is extensible to diagnostics, biology, and other trust-critical domains. All models, prompts, results, and system components including the complete software source code are openly released to support reproducibility, transparency, and community benchmarking at Github: https://github.com/Applied-AI-Research-Lab/Orchestrator-Agent-Trust
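
For reference, the standard Expected Calibration Error computation, one of the calibration metrics the orchestrator is said to use; bin count and the toy data are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the gap between mean
    confidence and empirical accuracy, weighted by bin size."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

conf = [0.95, 0.9, 0.8, 0.6, 0.99, 0.7]
hit = [1, 1, 0, 1, 1, 0]
print(expected_calibration_error(conf, hit))
```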

[209] From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents

Tatiana Petrova, Aleksandr Puzikov, Boris Bliznukov, Radu State

Main category: cs.AI

TL;DR: The paper provides a comprehensive evolutionary overview of the Web of Agents (WoA), linking modern protocols to historical standards and introducing a taxonomy for unified analysis. It highlights a paradigm shift in intelligence locus and identifies socio-technical challenges as the next research frontier.

DetailsMotivation: To address the fragmented understanding of the Web of Agents (WoA) by connecting modern LLM-powered frameworks with historical Multi-Agent Systems (MAS) and Semantic Web developments.

Method: Introduces a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) to analyze and compare agent architectures across generations.

Result: Reveals a clear evolutionary lineage in agent architectures and identifies a paradigm shift in intelligence locus, foundational for modern Agentic AI.

Conclusion: New protocols alone are insufficient for a robust WoA ecosystem; future research must tackle socio-technical challenges like decentralized identity, economic models, security, and governance.

Abstract: The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users’ behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field’s trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and MCP are direct evolutionary responses to the well-documented limitations of earlier standards like FIPA and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the ‘locus of intelligence’: from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent’s core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.

[210] Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning

Zheng Zhang

Main category: cs.AI

TL;DR: LLMs exhibit fluency but struggle with symbolic reasoning, arithmetic, and logic due to a gap between comprehension and competence, termed the ‘split-brain syndrome.’

DetailsMotivation: To diagnose why LLMs fail at tasks requiring principled reasoning despite their surface fluency.

Method: Controlled experiments and architectural analysis to identify the gap between comprehension and competence.

Result: LLMs articulate correct principles but fail in execution due to geometric and functional dissociation of pathways.

Conclusion: Future models need metacognitive control and structural grounding to overcome these limitations.

Abstract: Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between comprehension and competence. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them, a failure rooted not in knowledge access but in computational execution. We term this phenomenon the computational split-brain syndrome, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.

[211] Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems

Dany Moshkovich, Sergey Zeltyn

Main category: cs.AI

TL;DR: AgentOps is a framework for managing uncertainty in LLM-powered agentic systems, addressing the needs of developers, testers, SREs, and business users through a six-stage automation pipeline.

DetailsMotivation: Traditional observability practices are inadequate for the unique uncertainties introduced by LLM-powered agentic systems, necessitating a specialized framework.

Method: Introduces AgentOps, a six-stage automation pipeline (observation, metric collection, issue detection, root cause analysis, recommendations, runtime automation) tailored for agentic AI systems.

Result: AgentOps provides a structured approach to manage uncertainty, enabling safe and adaptive operation of LLM-powered systems.

Conclusion: AgentOps leverages automation to tame uncertainty, ensuring effective and self-improving AI systems without eliminating their adaptive nature.

Abstract: Large Language Models (LLMs) are increasingly deployed within agentic systems-collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper introduces AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems. We identify distinct needs across four key roles-developers, testers, site reliability engineers (SREs), and business users-each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems-not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
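
A skeleton of the six-stage pipeline as named in the abstract; every stage body here is a placeholder heuristic, purely to show how observations flow into metrics, issues, root causes, recommendations, and runtime actions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTrace:
    events: list = field(default_factory=list)   # observed agent behavior
    metrics: dict = field(default_factory=dict)
    issues: list = field(default_factory=list)
    root_causes: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

def observe(trace, event):            # 1. behavior observation
    trace.events.append(event)

def collect_metrics(trace):           # 2. metric collection
    trace.metrics["n_tool_errors"] = sum(e.get("error", False) for e in trace.events)

def detect_issues(trace):             # 3. issue detection
    if trace.metrics["n_tool_errors"] > 2:
        trace.issues.append("repeated tool failures")

def analyze_root_cause(trace):        # 4. root cause analysis (placeholder)
    trace.root_causes += [f"{i}: possibly stale tool schema" for i in trace.issues]

def recommend(trace):                 # 5. optimized recommendations
    trace.recommendations += [f"refresh schema ({c})" for c in trace.root_causes]

def automate(trace):                  # 6. runtime automation
    return [f"applied: {r}" for r in trace.recommendations]

trace = AgentTrace()
for ok in (True, False, False, False):
    observe(trace, {"tool": "search", "error": not ok})
for stage in (collect_metrics, detect_issues, analyze_root_cause, recommend):
    stage(trace)
print(automate(trace))
```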

[212] Enhancing the Capabilities of Large Language Models for API calls through Knowledge Graphs

Ye Yang, Xue Xiao, Ping Yin, Taotao Xie

Main category: cs.AI

TL;DR: KG2data integrates knowledge graphs, LLMs, and tools to improve API call accuracy and domain-specific query handling in meteorology, outperforming RAG2data and chat2data.

DetailsMotivation: LLMs' tool-use capabilities in knowledge-intensive domains like meteorology are underexplored, limiting their effectiveness for complex queries.

Method: KG2data combines knowledge graphs, LLMs, ReAct agents, and tool-use technologies, evaluated via virtual API for accuracy in name recognition, hallucination, and call correctness.

Result: KG2data achieves 1.43% name-recognition failure, 0% hallucination failure, and 88.57% call correctness, outperforming RAG2data and chat2data.

Conclusion: KG2data enhances domain-specific reasoning and data integration, offering a scalable solution for knowledge-intensive tasks without costly LLM fine-tuning.

Abstract: API calls by large language models (LLMs) offer a cutting-edge approach for data analysis. However, their ability to effectively utilize tools via API calls remains underexplored in knowledge-intensive domains like meteorology. This paper introduces KG2data, a system that integrates knowledge graphs, LLMs, ReAct agents, and tool-use technologies to enable intelligent data acquisition and query handling in the meteorological field. Using a virtual API, we evaluate API call accuracy across three metrics: name recognition failure, hallucination failure, and call correctness. KG2data achieves superior performance (1.43%, 0%, 88.57%) compared to RAG2data (16%, 10%, 72.14%) and chat2data (7.14%, 8.57%, 71.43%). KG2data differs from typical LLM-based systems by addressing their limited access to domain-specific knowledge, which hampers performance on complex or terminology-rich queries. By using a knowledge graph as persistent memory, our system enhances content retrieval, complex query handling, domain-specific reasoning, semantic relationship resolution, and heterogeneous data integration. It also mitigates the high cost of fine-tuning LLMs, making the system more adaptable to evolving domain knowledge and API structures. In summary, KG2data provides a novel solution for intelligent, knowledge-based question answering and data analysis in domains with high knowledge demands.

[213] Parsing Musical Structure to Enable Meaningful Variations

Maziar Kanani, Sean O Leary, James McDermott

Main category: cs.AI

TL;DR: A rule-based method generates music by mutating grammars derived from tunes, analyzing changes via edit distance, complexity, and musicality.

DetailsMotivation: To explore how tunes evolve through systematic grammar mutations and assess the impact of each mutation type.

Method: Parse tunes into grammars using Sequitur, apply random mutations (19 types), expand grammars into new tunes, and analyze changes.

Result: Tunes gradually change with mutations; edit distance, complexity, and length metrics quantify transformations. Mutation effects vary by type.

Conclusion: Grammar-based mutation effectively generates related tunes, with measurable changes and insights into mutation impacts.

Abstract: This paper presents a novel rule-based approach for generating music by varying existing tunes. We parse each tune to find the Pathway Assembly (PA) [1], a structure representing all repetitions in the tune. The Sequitur algorithm [2] is used for this. The result is a grammar. We then carry out mutation on the grammar, rather than on a tune directly. There are 19 potential mutation types, such as adding, removing, swapping, or reversing parts of the grammar. The system applies one randomly chosen mutation in this step to automatically manipulate the grammar. Following the mutation, we expand the grammar, which returns a new tune. The output after one or more mutations is a new tune related to the original tune. Our study examines how tunes change gradually over the course of multiple mutations. Edit distances, structural complexity and length of the tunes are used to show how a tune is changed after multiple mutations. In addition, the size of the effect of each mutation type is analyzed. Finally, we review the musical aspect of the output tunes. It should be noted that the study only focused on generating new pitch sequences. The study is based on an Irish traditional tune dataset, and a list of integers has been used to represent each tune’s pitch values.
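
A toy version of the grammar-level loop: expand a Sequitur-style grammar into pitches and apply one mutation (here, a swap) to a rule rather than to the tune. Because a reused rule appears several times in the expansion, one grammar edit can change multiple places in the tune at once; the grammar and pitch values below are invented.

```python
import random

# Toy grammar in Sequitur style: rule "S" is the tune, uppercase symbols
# reference other rules, integers are pitch values.
grammar = {"S": ["A", "A", 67, "B"], "A": [60, 62, 64], "B": [65, 64, 62, 60]}

def expand(grammar, symbol="S"):
    """Rewrite rule references until only pitch values remain."""
    out = []
    for tok in grammar[symbol]:
        out.extend(expand(grammar, tok) if tok in grammar else [tok])
    return out

def mutate_swap(grammar):
    """One of the 19 mutation types described: swap two symbols in a rule."""
    g = {k: list(v) for k, v in grammar.items()}
    rule = random.choice([k for k, v in g.items() if len(v) >= 2])
    i, j = random.sample(range(len(g[rule])), 2)
    g[rule][i], g[rule][j] = g[rule][j], g[rule][i]
    return g

random.seed(3)
print("original:", expand(grammar))
print("mutated: ", expand(mutate_swap(grammar)))
```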

[214] AI and the Net-Zero Journey: Energy Demand, Emissions, and the Potential for Transition

Pandu Devarakota, Nicolas Tsesmetzis, Faruk O. Alpak, Apurva Gala, Detlef Hohl

Main category: cs.AI

TL;DR: The paper examines AI’s energy consumption in data centers and its impact on GHG emissions, predicting short-term increases but long-term potential for CO2 reduction.

DetailsMotivation: To assess whether AI will have a net positive, neutral, or negative impact on CO2 emissions by 2035, considering its growing role in various sectors.

Method: Analysis of energy consumption scenarios in data centers, including near-term (up to 2030) and long-term (2035+) projections, and evaluation of AI’s role in optimizing energy workflows.

Result: Short-term: AI’s demand strains resources, increasing electricity use and CO2 emissions. Long-term: AI’s optimization potential could reduce emissions, outweighing initial impacts.

Conclusion: AI may initially increase emissions but holds promise for significant long-term CO2 reduction and climate mitigation.

Abstract: Thanks to the availability of massive amounts of data, computing resources, and advanced algorithms, AI has entered nearly every sector. This has sparked significant investment and interest, particularly in building data centers with the necessary hardware and software to develop and operate AI models and AI-based workflows. In this technical review article, we present energy consumption scenarios of data centers and their impact on GHG emissions, considering both near-term projections (up to 2030) and a long-term outlook (2035 and beyond). We address the quintessential question of whether AI will have a net positive, neutral, or negative impact on CO2 emissions by 2035. Additionally, we discuss AI’s potential to automate and create efficient, disruptive workflows across various fields related to energy production, supply and consumption. In the near-term scenario, the growing demand for AI will likely strain computing resources and lead to an increase in electricity consumption and therefore associated CO2 emissions. This is due to the power-hungry nature of big data centers and the requirements for training and running large and complex AI models, as well as the penetration of AI assistant search and applications for public use. However, the long-term outlook could be more promising. AI has the potential to be a game-changer in CO2 reduction. Its ability to further automate and optimize processes across industries, from energy production to logistics, could significantly decrease our carbon footprint. This positive impact is anticipated to outweigh the initial emissions bump, creating value for businesses and society in areas where traditional solutions have fallen short. In essence, AI might cause some initial growing pains for the environment, but it has the potential to support climate mitigation efforts.

[215] IoT Malware Network Traffic Detection using Deep Learning and GraphSAGE Models

Nikesh Prajapati, Bimal Karki, Saroj Gopali, Akbar Siami Namin

Main category: cs.AI

TL;DR: The paper evaluates deep learning and graph-based models for detecting IoT malicious attacks, with BERT achieving the highest performance (99.94% accuracy).

DetailsMotivation: To address the challenge of detecting malicious network traffic in IoT systems, which exhibit sequential and diverse traffic patterns.

Method: Utilized GraphSAGE, BERT, TCN, Multi-Head Attention, BI-LSTM, and LSTM models to analyze temporal patterns and feature significance.

Result: BERT outperformed others with 99.94% accuracy, high precision, recall, F1-score, and AUC-ROC. Multi-Head Attention provided interpretable results but required longer processing time. GraphSAGE had the shortest training time but lower accuracy.

Conclusion: BERT is highly effective for IoT malicious attack detection due to its temporal dependency capture, while Multi-Head Attention offers interpretability, and GraphSAGE is efficient but less accurate.

Abstract: This paper intends to detect IoT malicious attacks through deep learning models and presents a comprehensive evaluation of deep learning and graph-based models for malicious network traffic detection. The models are based on GraphSAGE, Bidirectional Encoder Representations from Transformers (BERT), Temporal Convolutional Network (TCN), Multi-Head Attention, BI-LSTM with Multi-Head Attention, Bidirectional Long Short-Term Memory (BI-LSTM), and LSTM. The chosen models demonstrated strong ability to model temporal patterns and detect feature significance. The observed performance is mainly due to the fact that IoT system traffic patterns are both sequential and diverse, leaving a rich set of temporal patterns for the models to learn. Experimental results showed that BERT maintained the best performance. It achieved a 99.94% accuracy rate alongside high precision and recall, and an F1-score and AUC-ROC score of 99.99%, which demonstrates its ability to capture temporal dependencies. The Multi-Head Attention model offered promising results with good detection capabilities and interpretable outputs, but, like the BI-LSTM variants, it required significant processing time. The GraphSAGE model required the shortest training time but yielded the lowest accuracy, precision, and F1-score compared to the other models.

[216] Detecting AI Assistance in Abstract Complex Tasks

Tyler King, Nikolos Gurney, John H. Miller, Volkan Ustun

Main category: cs.AI

TL;DR: The paper proposes using neural networks to detect AI assistance in abstract tasks by preprocessing data into machine-learning-friendly formats, including image and time-series formulations, and evaluates their effectiveness across various architectures.

DetailsMotivation: As AI becomes ubiquitous in complex tasks, detecting its assistance is crucial but challenging for humans, especially with abstract data. The study aims to leverage neural networks' classification capabilities for this purpose.

Method: The study constructs four neural network-friendly image formulations and a time-series formulation to encode user behavior. It benchmarks these across three deep learning architectures and a hybrid CNN-RNN model.

Result: The results show that preprocessing data into suitable formats enables effective classification of AI assistance, with the hybrid CNN-RNN model performing best by leveraging temporal and spatial data.

Conclusion: Appropriate preprocessing and hybrid architectures are key to detecting AI aid in abstract tasks, highlighting the importance of encoding temporal and spatial information.

Abstract: Detecting assistance from artificial intelligence is increasingly important as they become ubiquitous across complex tasks such as text generation, medical diagnosis, and autonomous driving. Aid detection is challenging for humans, especially when looking at abstract task data. Artificial neural networks excel at classification thanks to their ability to quickly learn from and process large amounts of data – assuming appropriate preprocessing. We posit detecting help from AI as a classification task for such models. Much of the research in this space examines the classification of complex but concrete data classes, such as images. Many AI assistance detection scenarios, however, result in data that is not machine learning-friendly. We demonstrate that common models can effectively classify such data when it is appropriately preprocessed. To do so, we construct four distinct neural network-friendly image formulations along with an additional time-series formulation that explicitly encodes the exploration/exploitation of users, which allows for generalizability to other abstract tasks. We benchmark the quality of each image formulation across three classical deep learning architectures, along with a parallel CNN-RNN architecture that leverages the additional time series to maximize testing performance, showcasing the importance of encoding temporal and spatial quantities for detecting AI aid in abstract tasks.
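
A minimal sketch of the parallel CNN-RNN design: one branch encodes an image formulation of user behavior, the other encodes the exploration/exploitation time series, and a shared head classifies aided vs. unaided. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CNNRNNAidDetector(nn.Module):
    """Parallel CNN (image formulation) and LSTM (time series) branches
    feeding one binary classification head."""
    def __init__(self, ts_dim=4, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(16 * 16, hidden),
        )
        self.rnn = nn.LSTM(ts_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 2)   # classes: unaided vs AI-aided

    def forward(self, image, series):
        _, (h, _) = self.rnn(series)
        return self.head(torch.cat([self.cnn(image), h[-1]], dim=1))

model = CNNRNNAidDetector()
logits = model(torch.randn(8, 1, 32, 32), torch.randn(8, 50, 4))
print(logits.shape)  # torch.Size([8, 2])
```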

[217] Uncertainty-Informed Scheduling of Decision Points for Intelligent Mobile Health Interventions

Asim H. Gazi, Bhanu T. Gullapalli, Daiqi Gao, Benjamin M. Marlin, Vivek Shetty, Susan A. Murphy

Main category: cs.AI

TL;DR: SigmaScheduling dynamically schedules decision points for mHealth interventions based on behavior time uncertainty, improving timely interventions for habitual behaviors like toothbrushing.

DetailsMotivation: Current fixed-interval scheduling for mHealth decision points is ineffective for individuals with irregular routines, often missing the target behavior window.

Method: Proposes SigmaScheduling, which adjusts decision points dynamically: earlier for uncertain behavior times and closer for predictable ones.

Result: In a 10-week trial with 68 participants, SigmaScheduling ensured decision points preceded brushing events in at least 70% of cases.

Conclusion: SigmaScheduling enhances precision in mHealth, especially for time-sensitive habitual behaviors, improving intervention effectiveness.

Abstract: Timely decision making is critical to the effectiveness of mobile health (mHealth) interventions. At predefined timepoints called “decision points,” intelligent mHealth systems such as just-in-time adaptive interventions (JITAIs) estimate an individual’s biobehavioral context from sensor or survey data and determine whether and how to intervene. For interventions targeting habitual behavior (e.g., oral hygiene), effectiveness often hinges on delivering support shortly before the target behavior is likely to occur. Current practice schedules decision points at a fixed interval (e.g., one hour) before user-provided behavior times, and the fixed interval is kept the same for all individuals. However, this one-size-fits-all approach performs poorly for individuals with irregular routines, often scheduling decision points after the target behavior has already occurred, rendering interventions ineffective. In this paper, we propose SigmaScheduling, a method to dynamically schedule decision points based on uncertainty in predicted behavior times. When behavior timing is more predictable, SigmaScheduling schedules decision points closer to the predicted behavior time; when timing is less certain, SigmaScheduling schedules decision points earlier, increasing the likelihood of timely intervention. We evaluated SigmaScheduling using real-world data from 68 participants in a 10-week trial of Oralytics, a JITAI designed to improve daily toothbrushing. SigmaScheduling increased the likelihood that decision points preceded brushing events in at least 70% of cases, preserving opportunities to intervene and impact behavior. Our results indicate that SigmaScheduling can advance precision mHealth, particularly for JITAIs targeting time-sensitive, habitual behaviors such as oral hygiene or dietary habits.
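
The core scheduling rule reduces to a one-liner; the constants k and min_lead below are illustrative, not values from the paper.

```python
def schedule_decision_point(predicted_time, sigma, k=1.5, min_lead=0.25):
    """Uncertainty-informed rule: move the decision point earlier in
    proportion to the uncertainty (sigma, in hours) of the predicted
    behavior time, never closer than a minimum lead time."""
    return predicted_time - max(k * sigma, min_lead)

# Predictable brusher (sigma = 0.1 h) vs. irregular one (sigma = 1.0 h),
# both predicted to brush at 21.0 (9 pm, in hours since midnight):
print(schedule_decision_point(21.0, 0.1))  # 20.75 -> close to the event
print(schedule_decision_point(21.0, 1.0))  # 19.5  -> well before the event
```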

[218] Automated Thematic Analyses Using LLMs: Xylazine Wound Management Social Media Chatter Use Case

JaMor Hairston, Ritvik Ranjan, Sahithi Lakamana, Anthony Spadaro, Selen Bozkurt, Jeanmarie Perrone, Abeed Sarker

Main category: cs.AI

TL;DR: LLMs like GPT-4o can effectively replicate expert-driven thematic analysis of social media data using few-shot prompting, achieving high accuracy and F1-scores.

DetailsMotivation: To assess if LLMs can overcome challenges in inductive thematic analysis, a task requiring deep expertise, by replicating expert-driven analysis of social media data.

Method: Evaluated five LLMs on Reddit datasets about xylazine using binary classification tasks with zero-, single-, and few-shot prompting. Performance was measured via accuracy, precision, recall, and F1-score.

Result: GPT-4o with two-shot prompting performed best (90.9% accuracy, F1-score: 0.71), closely matching expert classifications for high-prevalence themes.

Conclusion: Few-shot LLM-based approaches can automate thematic analyses, providing a scalable supplement for qualitative research.

Abstract: Background Large language models (LLMs) face challenges in inductive thematic analysis, a task requiring deep interpretive and domain-specific expertise. We evaluated the feasibility of using LLMs to replicate expert-driven thematic analysis of social media data. Methods Using two temporally non-intersecting Reddit datasets on xylazine (n=286 and n=686, for model optimization and validation, respectively) with twelve expert-derived themes, we evaluated five LLMs against expert coding. We modeled the task as a series of binary classifications, rather than a single, multi-label classification, employing zero-, single-, and few-shot prompting strategies and measuring performance via accuracy, precision, recall, and F1-score. Results On the validation set, GPT-4o with two-shot prompting performed best (accuracy: 90.9%; F1-score: 0.71). For high-prevalence themes, model-derived thematic distributions closely mirrored expert classifications (e.g., xylazine use: 13.6% vs. 17.8%; MOUD use: 16.5% vs. 17.8%). Conclusions Our findings suggest that few-shot LLM-based approaches can automate thematic analyses, offering a scalable supplement for qualitative research. Keywords: thematic analysis, large language models, natural language processing, qualitative analysis, social media, prompt engineering, public health
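
A hypothetical two-shot prompt builder for one binary theme classifier, mirroring the series-of-binary-classifications setup; the theme wording and example posts are invented placeholders, not the study's prompts.

```python
def build_two_shot_prompt(theme, examples, post):
    """Assemble a two-shot binary classification prompt for one theme."""
    shots = "\n\n".join(
        f"Post: {text}\nLabel: {'YES' if label else 'NO'}" for text, label in examples
    )
    return (
        f"You are coding Reddit posts for the theme: {theme}.\n"
        f"Answer YES if the post expresses this theme, otherwise NO.\n\n"
        f"{shots}\n\nPost: {post}\nLabel:"
    )

examples = [
    ("the wound keeps getting worse no matter what i put on it", True),
    ("anyone know which pharmacies stock test strips?", False),
]
print(build_two_shot_prompt("wound management concerns", examples,
                            "how do I clean this safely at home?"))
```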

[219] AF-XRAY: Visual Explanation and Resolution of Ambiguity in Legal Argumentation Frameworks

Yilin Xia, Heng Zheng, Shawn Bowers, Bertram Ludäscher

Main category: cs.AI

TL;DR: AF-XRAY is an open-source toolkit for analyzing and visualizing argumentation frameworks in legal reasoning, addressing ambiguity and aiding non-experts.

DetailsMotivation: Challenges in identifying ambiguity and explaining argument acceptance in legal reasoning for non-experts.

Method: AF-XRAY introduces layered visualizations, attack edge classification, overlay visualizations, and critical attack set identification.

Result: Transforms ambiguous scenarios into grounded solutions, aiding in pinpointing ambiguity causes and exploring resolutions.

Conclusion: AF-XRAY supports teleological legal reasoning by revealing how assumptions impact conclusions, demonstrated with real-world cases.

Abstract: Argumentation frameworks (AFs) provide formal approaches for legal reasoning, but identifying sources of ambiguity and explaining argument acceptance remains challenging for non-experts. We present AF-XRAY, an open-source toolkit for exploring, analyzing, and visualizing abstract AFs in legal reasoning. AF-XRAY introduces: (i) layered visualizations based on game-theoretic argument length revealing well-founded derivation structures; (ii) classification of attack edges by semantic roles (primary, secondary, blunders); (iii) overlay visualizations of alternative 2-valued solutions on ambiguous 3-valued grounded semantics; and (iv) identification of critical attack sets whose suspension resolves undecided arguments. Through systematic generation of critical attack sets, AF-XRAY transforms ambiguous scenarios into grounded solutions, enabling users to pinpoint specific causes of ambiguity and explore alternative resolutions. We use real-world legal cases (e.g., Wild Animals as modeled by Bench-Capon) to show that our tool supports teleological legal reasoning by revealing how different assumptions lead to different justified conclusions.
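
For background, a compact computation of grounded semantics, the 3-valued baseline whose undecided arguments AF-XRAY sets out to explain; the toy AF is invented.

```python
def grounded_extension(args, attacks):
    """Grounded semantics via the characteristic-function fixpoint: accept
    arguments whose every attacker is already defeated by accepted ones.
    Whatever is neither accepted nor defeated stays undecided, which is
    exactly the ambiguity AF-XRAY visualizes and helps resolve."""
    accepted = set()
    while True:
        defeated = {b for (a, b) in attacks if a in accepted}
        new = {x for x in args
               if all(a in defeated for (a, b) in attacks if b == x)}
        if new == accepted:
            break
        accepted = new
    defeated = {b for (a, b) in attacks if a in accepted}
    return accepted, args - accepted - defeated

# A and B attack each other (ambiguous); C is unattacked and attacks D.
args = {"A", "B", "C", "D"}
attacks = {("A", "B"), ("B", "A"), ("C", "D")}
print(grounded_extension(args, attacks))  # C accepted, D defeated, A/B undecided
```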

[220] NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization

Zongtao He, Liuyi Wang, Lu Chen, Chengju Liu, Qijun Chen

Main category: cs.AI

TL;DR: NavComposer generates high-quality navigation instructions by decomposing and recomposing semantic entities, while NavInstrCritic evaluates them without expert annotations.

DetailsMotivation: Addressing the scarcity of expert-provided navigation instructions and the poor quality of synthesized ones for large-scale research.

Method: NavComposer decomposes semantic entities (actions, scenes, objects) and recomposes them into instructions. NavInstrCritic evaluates instructions on contrastive matching, semantic consistency, and linguistic diversity.

Result: The framework produces high-quality instructions and offers a scalable, annotation-free evaluation system.

Conclusion: The method enables scalable and generalizable research in language-guided navigation by decoupling instruction generation and evaluation.

Abstract: Language-guided navigation is a cornerstone of embodied AI, enabling agents to interpret language instructions and navigate complex environments. However, expert-provided instructions are limited in quantity, while synthesized annotations often lack quality, making them insufficient for large-scale research. To address this, we propose NavComposer, a novel framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities enhances both the richness and accuracy of instructions. Moreover, it operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. Complementing NavComposer, we introduce NavInstrCritic, a comprehensive annotation-free evaluation system that assesses navigation instructions on three dimensions: contrastive matching, semantic consistency, and linguistic diversity. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations. By decoupling instruction generation and evaluation from specific navigation agents, our method enables more scalable and generalizable research. Extensive experiments provide direct and practical evidence for the effectiveness of our method.
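
A toy rendition of the recompose step, filling decomposed action/scene/object entities into language templates; real NavComposer extracts entities with learned models and composes far richer instructions.

```python
import random

# Invented templates; the action/scene/object triples would come from the
# decomposition stage in the real framework.
templates = [
    "{action} past the {obj} in the {scene}.",
    "In the {scene}, {action} until you reach the {obj}.",
]

def compose_instruction(actions, scenes, objects):
    """Recompose per-step entities into a natural-language instruction."""
    steps = [
        random.choice(templates).format(action=a, scene=s, obj=o)
        for a, s, o in zip(actions, scenes, objects)
    ]
    return " ".join(steps)

print(compose_instruction(["walk", "turn left"],
                          ["hallway", "kitchen"],
                          ["sofa", "counter"]))
```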

[221] Trajectory Imputation in Multi-Agent Sports with Derivative-Accumulating Self-Ensemble

Han-Jun Choi, Hyunsung Kim, Minho Lee, Minchul Jeong, Chang-Jo Kim, Jinsung Yoon, Sang-Ki Ko

Main category: cs.AI

TL;DR: MIDAS is a framework for accurately imputing missing multi-agent trajectory data in sports, outperforming existing methods by jointly predicting positions, velocities, and accelerations using a Set Transformer and a learnable ensemble.

DetailsMotivation: Existing imputation methods are inadequate for multi-agent sports data due to dynamic movements and evolving interactions.

Method: MIDAS uses a Set Transformer to predict positions, velocities, and accelerations, recursively accumulates predictions, and combines them with a learnable weighted ensemble.

Result: MIDAS outperforms baselines in accuracy and physical plausibility on three sports datasets.

Conclusion: MIDAS is effective for practical downstream tasks requiring complete tracking data, such as distance approximation and pass success probability.

Abstract: Multi-agent trajectory data collected from domains such as team sports often suffer from missing values due to various factors. While many imputation methods have been proposed for spatiotemporal data, they are not well-suited for multi-agent sports scenarios where player movements are highly dynamic and inter-agent interactions continuously evolve. To address these challenges, we propose MIDAS (Multi-agent Imputer with Derivative-Accumulating Self-ensemble), a framework that imputes multi-agent trajectories with high accuracy and physical plausibility. It jointly predicts positions, velocities, and accelerations through a Set Transformer-based neural network and generates alternative estimates by recursively accumulating predicted velocity and acceleration values. These predictions are then combined using a learnable weighted ensemble to produce final imputed trajectories. Experiments on three sports datasets demonstrate that MIDAS significantly outperforms existing baselines in both positional accuracy and physical plausibility. Lastly, we showcase use cases of MIDAS, such as approximating total distance and pass success probability, to highlight its applicability to practical downstream tasks that require complete tracking data.
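The derivative-accumulating step is essentially dead reckoning: alternative position estimates are obtained by integrating the predicted velocities and accelerations, then blended with the directly predicted positions. A minimal sketch, with a scalar blend weight standing in for MIDAS's learned per-frame ensemble weights:

```python
import numpy as np

def accumulate(p0, vel, acc, dt=0.1):
    """Dead-reckon positions by recursively accumulating predicted
    velocities and accelerations from the last observed position p0."""
    pos, v = [p0], vel[0]
    for t in range(1, len(vel)):
        v = v + acc[t] * dt               # velocity from predicted acceleration
        pos.append(pos[-1] + v * dt)      # position from accumulated velocity
    return np.stack(pos)

T = 5
pos_direct = np.cumsum(np.full((T, 2), 0.1), axis=0)   # stand-in network output
vel_pred, acc_pred = np.ones((T, 2)), np.zeros((T, 2))
pos_derived = accumulate(pos_direct[0], vel_pred, acc_pred)

w = 0.6   # MIDAS learns these ensemble weights; a fixed scalar stands in here
pos_final = w * pos_direct + (1 - w) * pos_derived
```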

[222] Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

Yicong Wu, Ting Chen, Irit Hochberg, Zhoujian Sun, Ruth Edry, Zhengxing Huang, Mor Peleg

Main category: cs.AI

TL;DR: The study explores using an LLM-based multi-agent system (MAS) for therapy recommendations in multimorbidity patients, comparing it to single-agent approaches and real-world benchmarks. While single-agent GPs perform as well as MDTs, some models provide incomplete or conflicting advice.

DetailsMotivation: Therapy recommendation for chronic patients with multimorbidity is challenging due to treatment conflicts, and existing systems lack scalability. The study aims to leverage LLM-based MAS to simulate MDT collaboration for safer recommendations.

Method: A single agent and MAS framework were designed to simulate MDT decision-making, enabling discussion among LLM agents. The systems were evaluated on therapy planning tasks using benchmark cases, comparing MAS with single-agent approaches and real-world benchmarks.

Result: Single-agent GPs performed as well as MDTs. The best models provided correct recommendations addressing clinical goals but were incomplete. Some models introduced unnecessary medications, causing conflicts.

Conclusion: LLM-based MAS shows promise for therapy recommendations in multimorbidity, but improvements are needed to avoid incomplete or conflicting advice.

Abstract: Therapy recommendation for chronic patients with multimorbidity is challenging due to risks of treatment conflicts. Existing decision support systems face scalability limitations. Inspired by the way in which general practitioners (GPs) manage multimorbidity patients, occasionally convening multidisciplinary team (MDT) collaboration, this study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system (MAS) for safer therapy recommendations. We designed a single agent and a MAS framework simulating MDT decision-making by enabling discussion among LLM agents to resolve medical conflicts. The systems were evaluated on therapy planning tasks for multimorbidity patients using benchmark cases. We compared MAS performance with single-agent approaches and real-world benchmarks. An important contribution of our study is the definition of evaluation metrics that go beyond technical precision and recall, allowing inspection of the clinical goals met and the medication burden of the proposed advice against a gold-standard benchmark. Our results show that with current LLMs, a single-agent GP performs as well as MDTs. The best-scoring models provide correct recommendations that address all clinical goals, yet the advice is incomplete. Some models also introduce unnecessary medications, resulting in avoidable conflicts between medications and conditions or drug-drug interactions.

[223] Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization

Yuhao Wang, Keyan Ding, Kehua Feng, Zeyuan Wang, Ming Qin, Xiaotong Li, Qiang Zhang, Huajun Chen

Main category: cs.AI

TL;DR: A Knowledge-guided Preference Optimization (KPO) framework is proposed to mitigate risks of harmful protein generation by protein language models, ensuring safety without compromising functionality.

DetailsMotivation: Protein language models pose biosafety risks by potentially generating harmful sequences, necessitating a solution to balance innovation with ethical concerns.

Method: KPO integrates a Protein Safety Knowledge Graph, uses graph pruning to identify safe sequences, and applies reinforcement learning to minimize harmful outputs.

Result: KPO successfully reduces hazardous sequence generation while preserving high functionality, proving its efficacy as a safety framework.

Conclusion: KPO offers a robust solution for safe protein sequence generation, addressing biosafety challenges in biotechnology applications.

Abstract: Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and de novo design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
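The abstract does not spell out the objective, so the following is only a guess at its flavor: a DPO-style pairwise preference loss that pushes the policy toward safe sequences and away from harmful ones identified via the knowledge graph. KPO's actual reinforcement-learning objective may look quite different:

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_safe, logp_harmful, ref_safe, ref_harmful, beta=0.1):
    """DPO-style pairwise loss: raise the policy's likelihood of safe
    sequences relative to harmful ones, anchored to a frozen reference model."""
    margin = beta * ((logp_safe - ref_safe) - (logp_harmful - ref_harmful))
    return -F.logsigmoid(margin).mean()

# toy log-likelihoods for one (safe, harmful) sequence pair
loss = preference_loss(torch.tensor([-10.0]), torch.tensor([-9.0]),
                       torch.tensor([-10.5]), torch.tensor([-9.2]))
```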

[224] Modeling Habitat Shifts: Integrating Convolutional Neural Networks and Tabular Data for Species Migration Prediction

Emir Durakovic, Min-Hong Shih

Main category: cs.AI

TL;DR: Combines CNNs and tabular data to model bird presence in shifting habitats using satellite imagery and environmental features, achieving 85% accuracy.

DetailsMotivation: Addresses climate-induced habitat shifts by accurately predicting bird species presence.

Method: Uses CNNs for spatial landscape features (e.g., forestation) and tabular data for ecological/geographic features.

Result: Achieves 85% accuracy in predicting bird distribution.

Conclusion: Provides a scalable, reliable method to track bird migration amid climate change.

Abstract: Due to climate-induced changes, many habitats are experiencing range shifts away from their traditional geographic locations (Piguet, 2011). We propose a solution to accurately model whether bird species are present in a specific habitat through the combination of Convolutional Neural Networks (CNNs) (O’Shea, 2015) and tabular data. Our approach makes use of satellite imagery and environmental features (e.g., temperature, precipitation, elevation) to predict bird presence across various climates. The CNN model captures spatial characteristics of landscapes such as forestation, water bodies, and urbanization, whereas the tabular method uses ecological and geographic data. Both systems predict the distribution of birds with an average accuracy of 85%, offering a scalable and reliable method to understand bird migration.
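The described architecture is a standard late-fusion design: one branch encodes the satellite tile, the other the tabular features, and the embeddings are concatenated for prediction. A minimal PyTorch sketch, with all layer sizes assumed:

```python
import torch
import torch.nn as nn

class HabitatNet(nn.Module):
    """Late fusion: a small CNN encodes the satellite tile, an MLP encodes
    tabular environmental features; both embeddings are concatenated for a
    binary presence/absence prediction."""
    def __init__(self, n_tabular: int):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (B, 32)
        )
        self.mlp = nn.Sequential(nn.Linear(n_tabular, 32), nn.ReLU())
        self.head = nn.Linear(32 + 32, 1)

    def forward(self, image, tabular):
        z = torch.cat([self.cnn(image), self.mlp(tabular)], dim=1)
        return self.head(z)   # logit for bird presence

model = HabitatNet(n_tabular=3)   # e.g. temperature, precipitation, elevation
logit = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3))
```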

[225] Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing

Yilmazcan Ozyurt, Tunaberk Almaci, Stefan Feuerriegel, Mrinmaya Sachan

Main category: cs.AI

TL;DR: ExRec is a framework for personalized exercise recommendation using semantically-grounded knowledge tracing, addressing gaps in existing methods by incorporating question semantics and structured learning progression.

DetailsMotivation: Existing exercise recommendation methods often ignore the semantic content of questions and the structured progression of student learning.

Method: ExRec uses an end-to-end pipeline involving knowledge tracing (KT) annotation, semantic representation learning, KT model training, and reinforcement learning (RL) optimization, enhanced by a model-based value estimation (MVE) approach.

Result: ExRec shows effectiveness across four real-world math learning tasks, generalizes to unseen questions, and produces interpretable learning trajectories.

Conclusion: KT-guided RL holds promise for personalized education, as demonstrated by ExRec’s robust performance and interpretability.

Abstract: We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but they often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, our ExRec presents an end-to-end pipeline, from annotating the knowledge components (KCs) of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of the KT model in estimating cumulative knowledge improvement. We validate the effectiveness of our ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories. Together, our findings highlight the promise of KT-guided RL for effective personalization in education.
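The MVE idea can be sketched as rolling the KT model forward as a dynamics model and summing the discounted knowledge gains it predicts; the function signatures below are assumptions, not the paper's API:

```python
def mve_value(kt_predict, policy, state, horizon=3, gamma=0.9):
    """Model-based value estimate: roll the KT model forward as a dynamics
    model and sum the discounted mastery gains it predicts."""
    value, s = 0.0, state
    for t in range(horizon):
        exercise = policy(s)                # candidate recommendation
        s, gain = kt_predict(s, exercise)   # KT model predicts the next state
        value += (gamma ** t) * gain        # discounted knowledge improvement
    return value

# toy usage: mastery is one float; each exercise closes 20% of the gap to 1.0
v = mve_value(kt_predict=lambda s, e: (s + 0.2 * (1 - s), 0.2 * (1 - s)),
              policy=lambda s: "ex_42", state=0.3)
```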

[226] Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander

Li Wang, Qizhen Wu, Lei Chen

Main category: cs.AI

TL;DR: A vision-language model-based commander is proposed for autonomous multi-agent tactical decisions, combining scene understanding and strategic reasoning for high adaptability and interpretability.

DetailsMotivation: Traditional rule-based and reinforcement learning methods are inadequate for complex battlefield environments, lacking interpretability and strategic focus.

Method: Integrates a vision-language model for scene understanding and a lightweight large language model for strategic reasoning within a shared semantic space.

Result: Achieves over 80% win rate in simulations compared to baseline models.

Conclusion: The proposed method effectively bridges perception-to-decision reasoning, mimicking human commander cognition with strong performance.

Abstract: In multiple unmanned ground vehicle confrontations, autonomously evolving multi-agent tactical decisions from situational awareness remain a significant challenge. Traditional handcrafted rule-based methods become vulnerable in the complicated and transient battlefield environment, and current reinforcement learning methods mainly focus on action manipulation instead of strategic decisions due to lack of interpretability. Here, we propose a vision-language model-based commander to address the issue of intelligent perception-to-decision reasoning in autonomous confrontations. Our method integrates a vision language model for scene understanding and a lightweight large language model for strategic reasoning, achieving unified perception and decision within a shared semantic space, with strong adaptability and interpretability. Unlike rule-based search and reinforcement learning methods, the combination of the two modules establishes a full-chain process, reflecting the cognitive process of human commanders. Simulation and ablation experiments validate that the proposed approach achieves a win rate of over 80% compared with baseline models.

[227] Function-to-Style Guidance of LLMs for Code Translation

Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang

Main category: cs.AI

TL;DR: F2STrans improves LLM-based code translation by combining functional and style learning, outperforming larger models like GPT-4.

DetailsMotivation: Ensuring correctness and readability in code translation by LLMs remains a challenge, limiting real-world adoption.

Method: F2STrans uses functional learning (high-quality code pairs) and style learning (positive/negative examples) to guide LLMs.

Result: F2STrans enables smaller models (Qwen-1.5B) to outperform larger ones (Qwen-32B, GPT-4) in 20 code translation scenarios.

Conclusion: F2STrans significantly enhances code translation performance by addressing both correctness and readability.

Abstract: Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.

[228] AI Agent Architecture for Decentralized Trading of Alternative Assets

Ailiya Borjigin, Cong He, Charles CC Lee, Wei Zhou

Main category: cs.AI

TL;DR: GoldMine OS is an AI-driven decentralized system for tokenizing and trading physical gold on blockchain, achieving fast, secure, and compliant transactions.

DetailsMotivation: To bridge physical asset custody with blockchain, ensuring compliance, liquidity, and risk management for decentralized trading of alternative assets like gold.

Method: Combines on-chain smart contracts for risk control with off-chain AI agents (Compliance, Token Issuance, Market Making, Risk Control) for automation and decision-making.

Result: Achieves on-demand token issuance in 1.2s, tight liquidity (spreads <0.5%), resilience to attacks, and scalability (5000 TPS, 10000 users).

Conclusion: AI agent-based decentralized exchanges can meet performance and safety needs, democratizing access to illiquid assets while ensuring transparency and adaptability.

Abstract: Decentralized trading of real-world alternative assets (e.g., gold) requires bridging physical asset custody with blockchain systems while meeting strict requirements for compliance, liquidity, and risk management. We present GoldMine OS, a research-oriented architecture that employs multiple specialized AI agents to automate and secure the tokenization and exchange of physical gold into a blockchain-based stablecoin (“OZ”). Our approach combines on-chain smart contracts for critical risk controls with off-chain AI agents for decision making, blending the transparency and reliability of blockchains with the flexibility of AI-driven automation. We describe four cooperative agents (Compliance, Token Issuance, Market Making, and Risk Control) and a coordinating core, and evaluate the system through simulation and a controlled pilot deployment. In experiments the prototype delivers on-demand token issuance in under 1.2 s, more than 100 times faster than manual workflows. The Market Making agent maintains tight liquidity with spreads often below 0.5 percent even under volatile conditions. Fault injection tests show resilience: an oracle price-spoofing attack is detected and mitigated within 10 s, and a simulated vault misreporting halts issuance immediately with minimal user impact. The architecture scales to 5000 transactions per second with 10000 concurrent users in benchmarks. These results indicate that an AI-agent-based decentralized exchange for alternative assets can satisfy rigorous performance and safety requirements. We discuss broader implications for democratizing access to traditionally illiquid assets and explain how our governance model – multi-signature agent updates and on-chain community voting on risk parameters – provides ongoing transparency, adaptability, and formal assurance of system integrity.

[229] Defining neurosymbolic AI

Lennert De Smet, Luc De Raedt

Main category: cs.AI

TL;DR: The paper introduces a formal definition for neurosymbolic AI, unifying logical and neural representations through integral computation over logical and belief functions.

DetailsMotivation: The field lacks a generally accepted formal definition of neurosymbolic AI, despite numerous systems existing.

Method: The authors propose a formal definition where neurosymbolic inference involves computing an integral over a product of logical and belief functions.

Result: The definition abstracts key ingredients of neurosymbolic AI and aligns with representative systems.

Conclusion: The paper provides a foundational formalization for neurosymbolic AI, bridging learning and reasoning.

Abstract: Neurosymbolic AI focuses on integrating learning and reasoning, in particular, on unifying logical and neural representations. Despite the existence of an alphabet soup of neurosymbolic AI systems, the field is lacking a generally accepted formal definition of what neurosymbolic models and inference really are. We introduce a formal definition for neurosymbolic AI that abstracts its key ingredients. More specifically, we define neurosymbolic inference as the computation of an integral over a product of a logical and a belief function. We show that our neurosymbolic AI definition abstracts over key representative neurosymbolic AI systems.
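Schematically, the definition reads as follows (notation ours): with ω ranging over interpretations, l a logical function scoring satisfaction of a formula φ, and b a belief function supplied by the neural component,

```latex
% Neurosymbolic inference as an integral of a logical times a belief function
I \;=\; \int_{\Omega} l(\omega)\, b(\omega)\, \mathrm{d}\omega
% In the finite case, with l(\omega) = 1 iff \omega \models \varphi, this
% reduces to weighted model counting:
I \;=\; \sum_{\omega \,\models\, \varphi} b(\omega)
```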

[230] Collaborative Trustworthiness for Good Decision Making in Autonomous Systems

Selma Saidi, Omar Laimona, Christoph Schmickler, Dirk Ziegenbein

Main category: cs.AI

TL;DR: A collaborative approach for trustworthy decision-making in autonomous systems, using quality attributes and BDDs for efficient aggregation and reasoning.

DetailsMotivation: Ensuring safe and correct behavior of autonomous systems in dynamic environments is challenging, requiring trustworthy decision-making despite conflicting information.

Method: Proposes a collaborative approach using quality attributes (e.g., perception quality) and BDDs for aggregation and propagation rules, with reduction rules for efficiency.

Result: The method improves reliability and decision-making by leveraging trustworthy systems and formal models for automated reasoning.

Conclusion: The approach enhances trustworthiness and efficiency in autonomous systems’ decision-making through collaborative data sharing and formal reasoning.

Abstract: Autonomous systems are becoming an integral part of many application domains, like in the mobility sector. However, ensuring their safe and correct behaviour in dynamic and complex environments remains a significant challenge, where systems should autonomously make decisions e.g., about manoeuvring. We propose in this paper a general collaborative approach for increasing the level of trustworthiness in the environment of operation and improving reliability and decision making in autonomous systems. In the presence of conflicting information, aggregation becomes a major issue for trustworthy decision making based on collaborative data sharing. Unlike classical approaches in the literature that rely on consensus or majority as aggregation rule, we exploit the fact that autonomous systems have different quality attributes like perception quality. We use these attributes to determine which autonomous systems are trustworthy and borrow concepts from social epistemology to define aggregation and propagation rules, used for automated decision making. We use Binary Decision Diagrams (BDDs) as formal models for belief aggregation and propagation, and formulate reduction rules to reduce the size of the BDDs and allow efficient computation structures for collaborative automated reasoning.

[231] Fine-grained Timing Analysis of Digital Integrated Circuits in Answer Set Programming

Alessandro Bertagnon, Marcello Dalpasso, Michele Favalli, Marco Gavanelli

Main category: cs.AI

TL;DR: The paper addresses the challenge of computing the actual maximum delay in integrated circuits, using Answer Set Programming (ASP) for accurate results.

DetailsMotivation: Static Timing Analysis provides an upper bound for delay, leading to suboptimal processor speeds. The goal is to compute the actual maximum delay for better performance.

Method: The problem is modeled in Answer Set Programming (ASP) with non-trivial encodings, leveraging ASP’s efficient solvers.

Result: Experimental results demonstrate ASP’s viability for solving complex hardware design problems.

Conclusion: ASP is a promising tool for accurately determining maximum delays in integrated circuits, improving performance over traditional methods.

Abstract: In the design of integrated circuits, one critical metric is the maximum delay introduced by combinational modules within the circuit. This delay is crucial because it represents the time required to perform a computation: in an Arithmetic-Logic Unit it represents the maximum time taken by the circuit to perform an arithmetic operation. When such a circuit is part of a larger, synchronous system, like a CPU, the maximum delay directly impacts the maximum clock frequency of the entire system. Typically, hardware designers use Static Timing Analysis to compute an upper bound of the maximum delay because it can be determined in polynomial time. However, relying on this upper bound can lead to suboptimal processor speeds, thereby missing performance opportunities. In this work, we tackle the challenging task of computing the actual maximum delay, rather than an approximate value. Since the problem is computationally hard, we model it in Answer Set Programming (ASP), a logic language featuring extremely efficient solvers. We propose non-trivial encodings of the problem into ASP. Experimental results show that ASP is a viable solution to address complex problems in hardware design.
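The gap between the STA bound and the actual delay comes from false paths: long topological paths that no input vector can sensitize. The toy below (ours, not the paper's ASP encoding) contrasts the polynomial-time topological bound with an exhaustive floating-mode computation of the true delay; the exponential enumeration over input vectors is what makes the exact problem hard and motivates encoding it for ASP solvers instead:

```python
from itertools import product

# Toy netlist (ours). `k` computes b AND NOT b, i.e. constant 0, so the long
# topological path a -> c1 -> c2 -> out can never determine the output: a
# false path that makes the STA bound pessimistic.
GATES = {  # gate type -> (boolean function, controlling input value or None)
    "BUF": (lambda x: x, None),
    "NOT": (lambda x: 1 - x, None),
    "AND": (lambda x, y: x & y, 0),
}
CIRCUIT = {  # name -> (gate type, inputs, delay); listed in topological order
    "c1":  ("BUF", ["a"], 4),        # stands in for a deep logic cloud
    "c2":  ("BUF", ["c1"], 4),
    "nb":  ("NOT", ["b"], 1),
    "k":   ("AND", ["b", "nb"], 1),  # redundant logic: always 0, ready early
    "out": ("AND", ["c2", "k"], 1),
}
INPUTS, OUT = ["a", "b"], "out"

def sta_bound():
    """Polynomial-time STA: longest topological path, logic ignored."""
    arr = {i: 0 for i in INPUTS}
    for g, (_, ins, d) in CIRCUIT.items():
        arr[g] = max(arr[i] for i in ins) + d
    return arr[OUT]

def arrival(vec):
    """Floating-mode arrival time of OUT for one input vector: once a
    controlling value (e.g. a 0 at an AND input) arrives, later inputs
    can no longer change the gate's output."""
    val = dict(zip(INPUTS, vec))
    arr = {i: 0 for i in INPUTS}
    for g, (kind, ins, d) in CIRCUIT.items():
        fn, ctrl = GATES[kind]
        val[g] = fn(*(val[i] for i in ins))
        ctl = [arr[i] for i in ins if ctrl is not None and val[i] == ctrl]
        arr[g] = (min(ctl) if ctl else max(arr[i] for i in ins)) + d
    return arr[OUT]

true_delay = max(arrival(v) for v in product([0, 1], repeat=len(INPUTS)))
print(sta_bound(), true_delay)   # 9 vs 3: the bound is not tight here
```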

[232] DuetGraph: Coarse-to-Fine Knowledge Graph Reasoning with Dual-Pathway Global-Local Fusion

Jin Li, Zezhong Ding, Xike Xie

Main category: cs.AI

TL;DR: DuetGraph is a new KG reasoning method that separates global and local information processing to avoid over-smoothing, improving reasoning quality and efficiency.

DetailsMotivation: Existing KG reasoning methods suffer from score over-smoothing, which reduces effectiveness by blurring distinctions between correct and incorrect answers.

Method: DuetGraph uses a dual-pathway global-local fusion, segregating local (message passing) and global (attention) processing. It also employs coarse-to-fine optimization to partition entities and sharpen score gaps.

Result: DuetGraph achieves SOTA performance with up to 8.7% better reasoning quality and 1.8× faster training.

Conclusion: DuetGraph effectively addresses over-smoothing, enhancing KG reasoning performance and efficiency.

Abstract: Knowledge graphs (KGs) are vital for enabling knowledge reasoning across various domains. Recent KG reasoning methods that integrate both global and local information have achieved promising results. However, existing methods often suffer from score over-smoothing, which blurs the distinction between correct and incorrect answers and hinders reasoning effectiveness. To address this, we propose DuetGraph, a coarse-to-fine KG reasoning mechanism with dual-pathway global-local fusion. DuetGraph tackles over-smoothing by segregating – rather than stacking – the processing of local (via message passing) and global (via attention) information into two distinct pathways, preventing mutual interference and preserving representational discrimination. In addition, DuetGraph introduces a coarse-to-fine optimization, which partitions entities into high- and low-score subsets. This strategy narrows the candidate space and sharpens the score gap between the two subsets, which alleviates over-smoothing and enhances inference quality. Extensive experiments on various datasets demonstrate that DuetGraph achieves state-of-the-art (SOTA) performance, with up to an 8.7% improvement in reasoning quality and a 1.8× acceleration in training efficiency.
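The segregated dual-pathway idea can be sketched as two parallel branches, message passing over the graph's adjacency structure and attention over all entities, fused only at the end rather than stacked; the sizes and layer choices below are assumptions:

```python
import torch
import torch.nn as nn

class DualPathway(nn.Module):
    """Two segregated pathways, fused at the end rather than stacked:
    local message passing over the adjacency structure, and global
    attention over all entities."""
    def __init__(self, dim):
        super().__init__()
        self.local = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):
        local = torch.relu(self.local(adj @ x))   # aggregate neighbor features
        global_, _ = self.attn(x, x, x)           # attend over every entity
        return self.fuse(torch.cat([local, global_], dim=-1))

x = torch.randn(1, 6, 16)         # 6 entities, 16-dim embeddings
adj = torch.eye(6).unsqueeze(0)   # toy adjacency (self-loops only)
scores = DualPathway(16)(x, adj)
```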

[233] Opus: A Prompt Intention Framework for Complex Workflow Generation

Théo Fagnoni, Mahsun Altin, Chia En Chung, Phillip Kingston, Alan Tuning, Dana O. Mohamed, Inès Adnani

Main category: cs.AI

TL;DR: The paper introduces the Opus Prompt Intention Framework to enhance workflow generation with LLMs by adding an intermediate intention capture layer, improving output quality and scalability.

DetailsMotivation: To address the challenge of generating logical and meaningful workflows from complex user queries using LLMs.

Method: Proposes an intermediate layer (Opus Workflow Intention Framework) that extracts workflow signals, interprets them into structured intentions, and generates workflows based on these intentions.

Result: Shows consistent improvements in semantic workflow similarity metrics on a benchmark of 1,000 multi-intent query-workflow pairs.

Conclusion: The framework significantly enhances workflow generation quality, especially for mixed intention elicitation, compared to direct generation from queries.

Abstract: This paper introduces the Opus Prompt Intention Framework, designed to improve complex Workflow Generation with instruction-tuned Large Language Models (LLMs). We propose an intermediate Intention Capture layer between user queries and Workflow Generation, implementing the Opus Workflow Intention Framework, which consists of extracting Workflow Signals from user queries, interpreting them into structured Workflow Intention objects, and generating Workflows based on these Intentions. Our results show that this layer enables LLMs to produce logical and meaningful outputs that scale reliably as query complexity increases. On a synthetic benchmark of 1,000 multi-intent query-Workflow(s) pairs, applying the Opus Prompt Intention Framework to Workflow Generation yields consistent improvements in semantic Workflow similarity metrics. In this paper, we introduce the Opus Prompt Intention Framework by applying the concepts of Workflow Signal and Workflow Intention to LLM-driven Workflow Generation. We present a reproducible, customizable LLM-based Intention Capture system to extract Workflow Signals and Workflow Intentions from user queries. Finally, we provide empirical evidence that the proposed system significantly improves Workflow Generation quality compared to direct generation from user queries, particularly in cases of Mixed Intention Elicitation.

[234] Contestability in Quantitative Argumentation

Xiang Yin, Nico Potyka, Antonio Rago, Timotheus Kampik, Francesca Toni

Main category: cs.AI

TL;DR: The paper explores using Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs) to align AI decisions with human preferences, introducing a contestability problem and solving it with gradient-based explanations and an iterative algorithm.

DetailsMotivation: To ensure AI-driven decisions are contestable and align with human preferences, focusing on EW-QBAFs, which have been understudied for this purpose.

Method: Proposes gradient-based relation attribution explanations (G-RAEs) and an iterative algorithm to adjust edge weights in EW-QBAFs for achieving desired argument strengths.

Result: Experimental evaluation on synthetic EW-QBAFs simulating recommender systems and multi-layer perceptrons shows the method effectively solves the contestability problem.

Conclusion: The approach successfully enables contestability in AI decisions by leveraging EW-QBAFs and gradient-based adjustments, validated through experiments.

Abstract: Contestable AI requires that AI-driven decisions align with human preferences. While various forms of argumentation have been shown to support contestability, Edge-Weighted Quantitative Bipolar Argumentation Frameworks (EW-QBAFs) have received little attention. In this work, we show how EW-QBAFs can be deployed for this purpose. Specifically, we introduce the contestability problem for EW-QBAFs, which asks how to modify edge weights (e.g., preferences) to achieve a desired strength for a specific argument of interest (i.e., a topic argument). To address this problem, we propose gradient-based relation attribution explanations (G-RAEs), which quantify the sensitivity of the topic argument’s strength to changes in individual edge weights, thus providing interpretable guidance for weight adjustments towards contestability. Building on G-RAEs, we develop an iterative algorithm that progressively adjusts the edge weights to attain the desired strength. We evaluate our approach experimentally on synthetic EW-QBAFs that simulate the structural characteristics of personalised recommender systems and multi-layer perceptrons, and demonstrate that it can solve the problem effectively.
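G-RAEs are gradients of the topic argument's strength with respect to edge weights, which suggests an autograd sketch: unroll a strength-aggregation fixed point, then run gradient descent on the weights until the topic argument reaches the desired strength. The aggregation rule below is a generic stand-in, not the paper's QBAF semantics:

```python
import torch

base = torch.tensor([0.5, 0.6, 0.4])             # base scores of 3 arguments
sign = torch.tensor([[0., -1., 1.],              # sign[i, j]: edge j -> i,
                     [0.,  0., -1.],             # -1 attack, +1 support
                     [0.,  0.,  0.]])
w = torch.full((3, 3), 0.5, requires_grad=True)  # edge weights (preferences)

def strength(w, iters=20):
    """Unrolled influence aggregation; a generic stand-in for QBAF semantics."""
    s = base.clone()
    for _ in range(iters):
        s = torch.sigmoid(base + (sign * w) @ s)
    return s

topic, target = 0, 0.7            # desired strength of the topic argument
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(200):
    loss = (strength(w)[topic] - target) ** 2
    opt.zero_grad()
    loss.backward()               # G-RAE-style gradient of strength w.r.t. weights
    opt.step()
    with torch.no_grad():
        w.clamp_(0.0, 1.0)        # keep edge weights interpretable in [0, 1]
print(strength(w)[topic].item())  # approaches 0.7
```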

[235] CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking

Yuehao Huang, Liang Liu, Shuangming Lei, Yukai Ma, Hao Su, Jianbiao Mei, Pengxiang Zhao, Yaqing Gu, Yong Liu, Jiajun Lv

Main category: cs.AI

TL;DR: CogDDN is a VLM-based framework for demand-driven navigation that integrates fast and slow thinking systems, improving navigation accuracy by 15% over traditional methods.

DetailsMotivation: To address the limitations of traditional data-driven DDN methods, which rely on pre-collected data and struggle in unseen scenarios.

Method: CogDDN uses semantic alignment of detected objects with instructions and a dual-process decision-making module (Heuristic and Analytic Processes) enhanced by Chain of Thought reasoning.

Result: Outperforms single-view camera-only methods by 15% in navigation accuracy and adaptability on the AI2Thor simulator.

Conclusion: CogDDN demonstrates significant improvements in generalization and performance for demand-driven navigation in unstructured environments.

Abstract: Mobile robots are increasingly required to navigate and interact within unknown and unstructured environments to meet human demands. Demand-driven navigation (DDN) enables robots to identify and locate objects based on implicit human intent, even when object locations are unknown. However, traditional data-driven DDN methods rely on pre-collected data for model training and decision-making, limiting their generalization capability in unseen scenarios. In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. CogDDN identifies appropriate target objects by semantically aligning detected objects with the given instructions. Furthermore, it incorporates a dual-process decision-making module, comprising a Heuristic Process for rapid, efficient decisions and an Analytic Process that analyzes past errors, accumulates them in a knowledge base, and continuously improves performance. Chain of Thought (CoT) reasoning strengthens the decision-making process. Extensive closed-loop evaluations on the AI2Thor simulator with the ProcThor dataset show that CogDDN outperforms single-view camera-only methods by 15%, demonstrating significant improvements in navigation accuracy and adaptability. The project page is available at https://yuehaohuang.github.io/CogDDN/.

[236] Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces

Yunhao Yang, Neel P. Bhatt, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

Main category: cs.AI

TL;DR: A neurosymbolic framework combines natural-language dialogue with verifiable guarantees for logistics planning, outperforming GPT-4.1 in zero-shot performance with lower latency.

DetailsMotivation: Logistics decisions require rapid replanning and domain expertise, but existing methods like integer programming are slow, and LLMs risk misinterpretations.

Method: The framework converts user requests into structured plans, quantifies uncertainty, and uses an interactive clarification loop for low-confidence cases.

Result: A lightweight model fine-tuned on 100 examples surpasses GPT-4.1 in zero-shot performance and reduces inference latency by nearly 50%.

Conclusion: The approach offers a practical path for certifiable, real-time, and user-aligned logistics decision-making.

Abstract: Logistics operators, from battlefield coordinators rerouting airlifts ahead of a storm to warehouse managers juggling late trucks, often face life-critical decisions that demand both domain expertise and rapid and continuous replanning. While popular methods like integer programming yield logistics plans that satisfy user-defined logical constraints, they are slow and assume an idealized mathematical model of the environment that does not account for uncertainty. On the other hand, large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry by translating free-form utterances into executable plans, yet they remain prone to misinterpretations and hallucinations that jeopardize safety and cost. We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation. It converts user requests into structured planning specifications, quantifies its own uncertainty at the field and token level, and invokes an interactive clarification loop whenever confidence falls below an adaptive threshold. A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%. These preliminary results highlight a practical path toward certifiable, real-time, and user-aligned decision-making for complex logistics.
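The clarification loop can be sketched as confidence gating over the fields of the structured plan; `llm_parse` and `ask_user` are hypothetical stand-ins, and the paper uses an adaptive threshold rather than the fixed one here:

```python
import math

def field_confidence(token_logprobs):
    """Geometric-mean probability of the tokens that filled one field."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def plan_with_clarification(request, llm_parse, ask_user, threshold=0.9):
    spec = llm_parse(request)   # {"field": (value, [token logprobs]), ...}
    for field, (value, lps) in list(spec.items()):
        if field_confidence(lps) < threshold:
            # low confidence: ask a targeted question instead of guessing
            value = ask_user(f"Did you mean {value!r} for {field}?")
            spec[field] = (value, [0.0])   # user-confirmed
    return {f: v for f, (v, _) in spec.items()}

plan = plan_with_clarification(
    "Send 3 pallets to Depot B by Friday",
    llm_parse=lambda r: {"destination": ("Depot B", [-0.02]),
                         "deadline": ("Thursday", [-1.2])},
    ask_user=lambda q: "Friday",
)
print(plan)   # {'destination': 'Depot B', 'deadline': 'Friday'}
```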

[237] Modeling Code: Is Text All You Need?

Daniel Nichols, Konstantinos Parasyris, Harshitha Menon, Brian R. Bartoldson, Giorgis Georgakoudis, Tal Ben-Nun, Abhinav Bhatele

Main category: cs.AI

TL;DR: A novel approach combines code-as-text modeling with structured forms to enhance reasoning in Code LLMs.

DetailsMotivation: Transformer-based models struggle with structured, analytical properties of code like control and data flow, despite their popularity in tasks like generation and translation.

Method: The work introduces a method to integrate structured data modeling (e.g., graph neural networks) with the generative capabilities of modern LLMs.

Result: The approach aims to improve reasoning over structured code properties while retaining the scale and generative power of LLMs.

Conclusion: Combining text and structured modeling can address limitations of current Code LLMs in reasoning tasks.

Abstract: Code LLMs have become extremely popular recently for modeling source code across a variety of tasks, such as generation, translation, and summarization. However, transformer-based models are limited in their capabilities to reason through structured, analytical properties of code, such as control and data flow. Previous work has explored the modeling of these properties with structured data and graph neural networks. However, these approaches lack the generative capabilities and scale of modern LLMs. In this work, we introduce a novel approach to combine the strengths of modeling both code as text and more structured forms.

[238] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik

Main category: cs.AI

TL;DR: AI systems using human language for thought chains (CoT) can be monitored for misbehavior, though imperfectly. Further research and investment in CoT monitoring alongside existing safety methods is recommended.

DetailsMotivation: To leverage human-like language in AI systems for better oversight and safety by monitoring their chains of thought (CoT) for potential misbehavior.

Method: Proposes monitoring AI systems’ CoT in human language to detect misbehavior, acknowledging its imperfection but highlighting its potential.

Result: CoT monitoring shows promise for AI safety, though it is not foolproof.

Conclusion: Recommends further research into CoT monitorability and urges developers to consider its impact in AI development decisions.

Abstract: AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

[239] Perspective-Aware AI in Extended Reality

Daniel Platnick, Matti Gruener, Marjan Alirezaie, Kent Larson, Dava J. Newman, Hossein Rahnama

Main category: cs.AI

TL;DR: PAiR integrates Perspective-Aware AI with XR to create adaptive, immersive experiences using user identity models called Chronicles.

DetailsMotivation: Current XR systems lack deep user modeling and cognitive context, limiting adaptive experiences.

Method: PAiR uses Chronicles—multimodal identity models—in a closed-loop system to link user states with XR environments.

Result: Implemented in Unity-based OpenDome, PAiR demonstrates utility in two proof-of-concept scenarios.

Conclusion: PAiR advances human-AI interaction by embedding perspective-based identity models into immersive systems.

Abstract: AI-enhanced Extended Reality (XR) aims to deliver adaptive, immersive experiences, yet current systems fall short due to shallow user modeling and limited cognitive context. We introduce Perspective-Aware AI in Extended Reality (PAiR), a foundational framework for integrating Perspective-Aware AI (PAi) with XR to enable interpretable, context-aware experiences grounded in user identity. PAi is built on Chronicles: reasoning-ready identity models learned from multimodal digital footprints that capture users’ cognitive and experiential evolution. PAiR employs these models in a closed-loop system linking dynamic user states with immersive environments. We present PAiR’s architecture, detailing its modules and system flow, and demonstrate its utility through two proof-of-concept scenarios implemented in the Unity-based OpenDome engine. PAiR opens a new direction for human-AI interaction by embedding perspective-based identity models into immersive systems.

[240] Illuminating the Three Dogmas of Reinforcement Learning under Evolutionary Light

Mani Hamidi, Terrence W. Deacon

Main category: cs.AI

TL;DR: The paper critiques three core tenets of RL (agency, learning objectives, and reward hypothesis) using evolutionary theory, proposing a framework to rethink them and suggesting integration with origins-of-life theory for agency.

DetailsMotivation: To address conceptual gaps in RL by leveraging evolutionary insights, making it more applicable to biological learning.

Method: Revisits RL assumptions through an evolutionary lens, arguing for brain-based evolutionary dynamics and integrating thermodynamics from origins-of-life theory.

Result: A framework to rethink RL dogmas, with evolutionary analogies for learning and reward, but agency requires additional thermodynamic foundations.

Conclusion: Evolutionary theory enriches RL but agency needs origins-of-life integration for a formal account.

Abstract: Three core tenets of reinforcement learning (RL)–concerning the definition of agency, the objective of learning, and the scope of the reward hypothesis–have been highlighted as key targets for conceptual revision, with major implications for theory and application. We propose a framework, inspired by open-ended evolutionary theory, to reconsider these three “dogmas.” We revisit each assumption and address related concerns raised alongside them. To make our arguments relevant to RL as a model of biological learning, we first establish that evolutionary dynamics can plausibly operate within living brains over an individual’s lifetime, and are not confined to cross-generational processes. We begin by revisiting the second dogma, drawing on evolutionary insights to enrich the “adaptation-rather-than-search” view of learning. We then address the third dogma regarding the limits of the reward hypothesis, using analogies from evolutionary fitness to illuminate the scalar reward vs. multi-objective debate. After discussing practical implications for exploration in RL, we turn to the first–and arguably most fundamental–issue: the absence of a formal account of agency. We argue that unlike the other two problems, the evolutionary paradigm alone cannot resolve the agency question, though it gestures in a productive direction. We advocate integrating ideas from origins-of-life theory, where the thermodynamics of sustenance and replication offer promising foundations for understanding agency and resource-constrained reinforcement learning in biological systems.

[241] DrafterBench: Benchmarking Large Language Models for Tasks Automation in Civil Engineering

Yinsheng Li, Zhen Dong, Yi Shao

Main category: cs.AI

TL;DR: DrafterBench is a benchmark for evaluating LLM agents in technical drawing revision tasks in civil engineering, featuring 12 task types, 46 tools, and 1920 tasks.

DetailsMotivation: The need for systematic evaluation of LLM agents in industrial tasks, particularly civil engineering, due to their potential for automation.

Method: Creation of DrafterBench with real-world tasks, tools, and metrics to assess LLM agents’ capabilities in structured data comprehension, function execution, and reasoning.

Result: DrafterBench provides detailed accuracy and error analysis, offering insights into agent performance and areas for improvement.

Conclusion: DrafterBench is a valuable open-source tool for rigorously testing and improving LLM agents in engineering applications.

Abstract: Large Language Model (LLM) agents have shown great potential for solving real-world problems and promise to be a solution for tasks automation in industry. However, more benchmarks are needed to systematically evaluate automation agents from an industrial perspective, for example, in Civil Engineering. Therefore, we propose DrafterBench for the comprehensive evaluation of LLM agents in the context of technical drawing revision, a representative task in civil engineering. DrafterBench contains twelve types of tasks summarized from real-world drawing files, with 46 customized functions/tools and 1920 tasks in total. DrafterBench is an open-source benchmark to rigorously test AI agents’ proficiency in interpreting intricate and long-context instructions, leveraging prior knowledge, and adapting to dynamic instruction quality via implicit policy awareness. The toolkit comprehensively assesses distinct capabilities in structured data comprehension, function execution, instruction following, and critical reasoning. DrafterBench offers detailed analysis of task accuracy and error statistics, aiming to provide deeper insight into agent capabilities and identify improvement targets for integrating LLMs in engineering applications. Our benchmark is available at https://github.com/Eason-Li-AIS/DrafterBench, with the test set hosted at https://huggingface.co/datasets/Eason666/DrafterBench.

[242] How Many Instructions Can LLMs Follow at Once?

Daniel Jaroslawicz, Brendan Whiting, Parth Shah, Karime Maamari

Main category: cs.AI

TL;DR: IFScale is a benchmark for evaluating LLMs’ ability to follow high-density instructions, revealing performance degradation patterns and biases.

DetailsMotivation: Existing benchmarks lack evaluation of LLMs at high instruction densities, limiting understanding of their real-world applicability.

Method: IFScale uses 500 keyword-inclusion instructions for a business report task to measure performance degradation with increasing instruction density.

Result: Best models achieve only 68% accuracy at 500 instructions, with performance linked to model size and reasoning capability.

Conclusion: The study highlights tradeoffs in instruction-dense prompts and provides insights for real-world LLM applications.

Abstract: Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities have not yet been characterized, as existing benchmarks only evaluate models on tasks with a single or few instructions. We introduce IFScale, a simple benchmark of 500 keyword-inclusion instructions for a business report writing task to measure how instruction-following performance degrades as instruction density increases. We evaluate 20 state-of-the-art models across seven major providers and find that even the best frontier models only achieve 68% accuracy at the maximum density of 500 instructions. Our analysis reveals that model size and reasoning capability correlate with three distinct performance degradation patterns, a bias towards earlier instructions, and distinct categories of instruction-following errors. Our insights can help inform the design of instruction-dense prompts in real-world applications and highlight important performance-latency tradeoffs. We open-source the benchmark and all results for further analysis at https://distylai.github.io/IFScale.
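The benchmark's mechanics are simple to sketch: build a prompt with N keyword-inclusion rules and score a response by the fraction satisfied. The construction details below (keyword pool, rule wording) are assumptions, not the benchmark's actual templates:

```python
import random

KEYWORDS = ["revenue", "logistics", "forecast", "audit", "margin"]  # toy pool

def build_prompt(n):
    """Sample n keyword-inclusion instructions and attach them to the task."""
    kws = random.sample(KEYWORDS, n)
    rules = "\n".join(f"{i + 1}. Include the word '{k}'." for i, k in enumerate(kws))
    return f"Write a short business report.\nFollow every rule:\n{rules}", kws

def score(response, kws):
    """Fraction of keyword instructions the response actually satisfies."""
    text = response.lower()
    return sum(k in text for k in kws) / len(kws)

prompt, kws = build_prompt(3)
print(score("Revenue and margin grew; the forecast is stable.", kws))
```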

[243] A dancing bear, a colleague, or a sharpened toolbox? The cautious adoption of generative AI technologies in digital humanities research

Rongqian Ma, Meredith Dedema, Andrew Cox

Main category: cs.AI

TL;DR: The paper explores how Digital Humanities (DH) scholars adopt and evaluate generative AI (GenAI) technologies, revealing divided opinions on its benefits and risks.

DetailsMotivation: To understand the implications of GenAI on DH research, given its technological nature.

Method: Survey (76 responses) and interviews (15 DH scholars) to analyze adoption rationale, practices, and perceptions.

Result: DH scholars see GenAI as enhancing efficiency but worry about intellectual identity disruption. Adoption is contested and negotiated.

Conclusion: GenAI is reshaping DH, but its impact is debated, highlighting the need for further research.

Abstract: The advent of generative artificial intelligence (GenAI) technologies has been changing the research landscape and potentially has significant implications for Digital Humanities (DH), a field inherently intertwined with technologies. This article investigates how DH scholars adopt and critically evaluate GenAI technologies for research. Drawing on 76 responses collected from an international survey study and 15 semi-structured interviews with DH scholars, we explored the rationale for adopting GenAI tools in research, identified the specific practices of using GenAI tools, and analyzed scholars’ collective perceptions regarding the benefits, risks, and challenges. The results reveal that DH research communities hold divided opinions and differing imaginations towards the role of GenAI in DH scholarship. While scholars acknowledge the benefits of GenAI in enhancing research efficiency and enabling reskilling, many remain concerned about its potential to disrupt their intellectual identities. Situated within the history of DH and viewed through the lens of Actor-Network Theory, our findings suggest that the adoption of GenAI is gradually changing the field, though this transformation remains contested, shaped by ongoing negotiations among multiple human and non-human actors. Our study is one of the first empirical analyses on this topic and has the potential to serve as a building block for future inquiries into the impact of GenAI on DH scholarship.

[244] Possible Principles for Aligned Structure Learning Agents

Lancelot Da Costa, Tomáš Gavenčiak, David Hyland, Mandana Samiei, Cristian Dragos-Manta, Candice Pattisapu, Adeel Razi, Karl Friston

Main category: cs.AI

TL;DR: The paper proposes a roadmap for scalable aligned AI by focusing on structure learning and alignment, integrating principles from mathematics, statistics, and cognitive science.

DetailsMotivation: To develop AI that aligns with human preferences by learning accurate world models and other agents' perspectives.

Method: Emphasizes structure learning (causal representation learning) and core knowledge principles, with examples like Asimov’s Laws of Robotics.

Result: Outlines principles for scalable aligned AI and suggests refined alignment approaches.

Conclusion: The framework may guide the development of aligned AI systems through structure learning and theory of mind.

Abstract: This paper offers a roadmap for the development of scalable aligned artificial intelligence (AI) from first principle descriptions of natural intelligence. In brief, a possible path toward scalable aligned AI rests upon enabling artificial agents to learn a good model of the world that includes a good model of our preferences. For this, the main objective is creating agents that learn to represent the world and other agents’ world models; a problem that falls under structure learning (a.k.a. causal representation learning or model discovery). We expose the structure learning and alignment problems with this goal in mind, as well as principles to guide us forward, synthesizing various ideas across mathematics, statistics, and cognitive science. 1) We discuss the essential role of core knowledge, information geometry and model reduction in structure learning, and suggest core structural modules to learn a wide range of naturalistic worlds. 2) We outline a way toward aligned agents through structure learning and theory of mind. As an illustrative example, we mathematically sketch Asimov’s Laws of Robotics, which prescribe agents to act cautiously to minimize the ill-being of other agents. We supplement this example by proposing refined approaches to alignment. These observations may guide the development of artificial intelligence in helping to scale existing – or design new – aligned structure learning systems.

[245] A Unified Framework for Evaluating the Effectiveness and Enhancing the Transparency of Explainable AI Methods in Real-World Applications

Md. Ariful Islam, Md Abrar Jahin, M. F. Mridha, Nilanjan Dey

Main category: cs.AI

TL;DR: The paper proposes a unified evaluation framework for Explainable AI (XAI) to address the lack of standardized methods for assessing explanation quality, combining quantitative metrics and user feedback.

DetailsMotivation: The opacity of deep learning models ('black boxes') hinders trust and understanding, necessitating better XAI evaluation methods.

Method: The framework includes steps like data loading, explanation generation, and comprehensive testing, with a focus on user needs and application-specific benchmarks.

Result: Case studies in healthcare, finance, farming, and self-driving systems demonstrate the framework’s effectiveness in ensuring fair and trustworthy XAI evaluations.

Conclusion: The framework provides a practical approach to enhance AI transparency and trust in real-world applications.

Abstract: The fast growth of deep learning has brought great progress in AI-based applications. However, these models are often seen as “black boxes,” which makes them hard to understand, explain, or trust. Explainable Artificial Intelligence (XAI) tries to make AI decisions clearer so that people can understand how and why the model makes certain choices. Even though many studies have focused on XAI, there is still a lack of standard ways to measure how well these explanation methods work in real-world situations. This study introduces a single evaluation framework for XAI. It uses both numbers and user feedback to check if the explanations are correct, easy to understand, fair, complete, and reliable. The framework focuses on users’ needs and different application areas, which helps improve the trust and use of AI in important fields. To fix problems in current evaluation methods, we propose clear steps, including loading data, creating explanations, and fully testing them. We also suggest setting common benchmarks. We show the value of this framework through case studies in healthcare, finance, farming, and self-driving systems. These examples prove that our method can support fair and trustworthy evaluation of XAI methods. This work gives a clear and practical way to improve transparency and trust in AI systems used in the real world.

[246] From Code to Play: Benchmarking Program Search for Games Using Large Language Models

Manuel Eberhardinger, James Goodman, Alexander Dockhorn, Diego Perez-Liebana, Raluca D. Gaina, Duygu Çakmak, Setareh Maghsudi, Simon Lucas

Main category: cs.AI

TL;DR: LLMs are explored for synthesizing game code in Python and Java using evolutionary hill-climbing, with performance varying by task rather than model size.

DetailsMotivation: To assess LLMs' potential in generating usable game code across diverse tasks and languages.

Method: Evolutionary hill-climbing algorithm with LLM-controlled mutations and seeds, tested on 29 tasks in Python and Java.

Result: Performance depends on task, not model size; larger models generate more executable but not always better code. No single model excels universally.

Conclusion: Using multiple models and selecting the best results per task is more reliable than relying on one model.

Abstract: Large language models (LLMs) have shown impressive capabilities in generating program code, opening exciting opportunities for applying program synthesis to games. In this work, we explore the potential of LLMs to directly synthesize usable code for a wide range of gaming applications, focusing on two programming languages, Python and Java. We use an evolutionary hill-climbing algorithm, where the mutations and seeds of the initial programs are controlled by LLMs. For Python, the framework covers various game-related tasks, including five miniature versions of Atari games, ten levels of Baba is You, an environment inspired by Asteroids, and a maze generation task. For Java, the framework contains 12 games from the TAG tabletop games framework. Across 29 tasks, we evaluated 12 language models for Python and 8 for Java. Our findings suggest that the performance of LLMs depends more on the task than on model size. While larger models generate more executable programs, these do not always result in higher-quality solutions but are much more expensive. No model has a clear advantage, although on any specific task, one model may be better. Trying many models on a problem and using the best results across them is more reliable than using just one.
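The search loop is plain evolutionary hill climbing with the LLM as the mutation operator; a schematic version, with `llm` and `evaluate` as hypothetical callables (the paper's framework also manages seeds and task-specific harnesses):

```python
def hill_climb(llm, evaluate, task, steps=50):
    """LLM-driven hill climbing: the LLM proposes mutated programs and we
    keep any mutation that does not lower the evaluation score."""
    program = llm(f"Write a first program for this task:\n{task}")
    best = evaluate(program)          # e.g. game score; 0 if it crashes
    for _ in range(steps):
        mutant = llm(f"Improve this program for the task '{task}'. "
                     f"Return full code only:\n{program}")
        score = evaluate(mutant)
        if score >= best:             # accept non-worsening mutations
            program, best = mutant, score
    return program, best
```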

[247] LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu

Main category: cs.AI

TL;DR: The paper introduces LongDocURL, a benchmark for evaluating large vision language models (LVLMs) on long document understanding, numerical reasoning, and cross-element locating tasks; its 2,325 QA pairs spanning 33,000+ pages far exceed the scale of existing benchmarks.

DetailsMotivation: Existing benchmarks for document understanding are limited in handling long documents and analyzing layout elements, prompting the need for a more comprehensive evaluation framework.

Method: The authors define three task categories (Long Document Understanding, Numerical Reasoning, cross-element Locating), propose the LongDocURL benchmark, and develop a semi-automated pipeline to collect 2,325 QA pairs.

Result: LongDocURL covers 33,000+ pages, substantially exceeding the scale of existing benchmarks. Evaluation of 26 model configurations reveals critical performance gaps.

Conclusion: The LongDocURL benchmark addresses limitations of existing benchmarks and highlights performance gaps in LVLMs for document understanding tasks.

Abstract: Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to a small number of pages and fail to provide a comprehensive analysis of layout-element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly exceeding the scale of existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

[248] XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation

Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda

Main category: cs.AI

TL;DR: XGeM is a 6.77B-parameter multimodal generative model for medical data synthesis, addressing challenges like data scarcity and privacy. It enables any-to-any synthesis between modalities, validated on MIMIC-CXR and expert assessments.

DetailsMotivation: Challenges in AI adoption for medical imaging include data scarcity, privacy, and multimodal integration. Existing generative models lack joint synthesis of multiple modalities with clinical consistency.

Method: XGeM uses contrastive learning to create a shared latent space and a Multi-Prompt Training strategy for flexible conditioning on input modalities, enabling joint synthesis.

Result: XGeM outperforms competitors on MIMIC-CXR and passes a Visual Turing Test with radiologists. It supports anonymization, class imbalance, and data scarcity.

Conclusion: XGeM is a robust foundation model for medical data synthesis, addressing key challenges and ensuring clinical relevance.

Abstract: The adoption of Artificial Intelligence in medical imaging holds great promise, yet it remains hindered by challenges such as data scarcity, privacy concerns, and the need for robust multimodal integration. While recent advances in generative modeling have enabled high-quality synthetic data generation, existing approaches are often limited to unimodal, unidirectional synthesis and therefore lack the ability to jointly synthesize multiple modalities while preserving clinical consistency. To address this challenge, we introduce XGeM, a 6.77-billion-parameter multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy, enabling conditioning on arbitrary subsets of input modalities. This design allows the model to adapt to heterogeneous clinical inputs and generate multiple outputs jointly, preserving both semantic and structural coherence. We extensively validate XGeM: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for multi-view Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity, underscoring its utility as a foundation model for medical data synthesis. Project page is at https://cosbidev.github.io/XGeM/.
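
A toy sketch of the Multi-Prompt Training idea: at each training step, the model is conditioned on a random non-empty subset of the available modalities so it learns any-to-any synthesis. The modality names and embedding shapes below are illustrative assumptions, not the paper's actual interface.

```python
import random
import torch

def sample_prompt(modalities: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Keep a random non-empty subset of modalities as the conditioning prompt;
    the model must then generate the held-out modalities jointly."""
    names = list(modalities)
    k = random.randint(1, len(names))          # how many modalities to keep
    kept = random.sample(names, k)
    return {name: modalities[name] for name in kept}

# toy usage: three modality embeddings living in a shared latent space
batch = {"frontal_xray": torch.randn(4, 512),
         "lateral_xray": torch.randn(4, 512),
         "report": torch.randn(4, 512)}
prompt = sample_prompt(batch)   # e.g. {"report": ...} -> synthesize the missing views
```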

[249] ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi

Main category: cs.AI

TL;DR: The paper evaluates LLMs’ logical reasoning using ZebraLogic, a framework for logic grid puzzles, revealing a ‘curse of complexity’ where accuracy drops with problem difficulty. It explores enhancement strategies like Best-of-N sampling and backtracking.

DetailsMotivation: To assess the scalability and limitations of LLMs in complex non-monotonic reasoning, particularly for logic grid puzzles derived from CSPs.

Method: Introduces ZebraLogic, a framework to generate puzzles with controllable complexity, testing models like Llama, o1, and DeepSeek-R1 under varied constraints.

Result: Shows a decline in accuracy with increasing complexity, termed the ‘curse of complexity,’ despite larger models or more computation.

Conclusion: Highlights inherent LLM reasoning limits and suggests strategies like Best-of-N sampling for improvement.

Abstract: We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows – a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
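
One of the enhancement strategies the abstract mentions, Best-of-N sampling, reduces to a few lines; `generate` and `score` below are hypothetical stand-ins for a sampled LLM call and a puzzle verifier (e.g. fraction of satisfied constraints).

```python
def best_of_n(generate, score, prompt, n=8):
    """Best-of-N: sample n candidate solutions, keep the highest-scoring one.

    generate(prompt) -> str : one sampled LLM answer (temperature > 0).
    score(answer) -> float  : verifier score for the candidate solution.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```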

[250] Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin

Main category: cs.AI

TL;DR: Agentic Reasoning enhances LLM reasoning by integrating external tools like web search, code execution, and structured memory. It introduces the Mind-Map agent for knowledge graphs and a superior Web-Search agent, achieving SOTA performance.

DetailsMotivation: To address complex problems requiring deep research by improving LLM reasoning through dynamic tool integration and structured knowledge tracking.

Method: Uses external agents (web search, code execution, structured memory) and introduces the Mind-Map agent for knowledge graphs and a Web-Search agent for improved search.

Result: Achieves state-of-the-art performance on DeepSeek-R1, comparable to OpenAI Deep Research, with validated effectiveness of tools.

Conclusion: Agentic Reasoning significantly enhances LLM reasoning through innovative agent integration, validated by extensive studies.

Abstract: We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
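
A minimal sketch of what a Mind-Map-style structured memory could look like: a directed graph whose edges record logical relations, queried to inject stored context back into the reasoning chain. The class and method names are illustrative, not the paper's API.

```python
import networkx as nx

class MindMap:
    """Minimal structured memory: nodes are entities/claims from the reasoning
    trace, edges record the logical relation between them."""

    def __init__(self):
        self.graph = nx.DiGraph()

    def add_fact(self, source: str, relation: str, target: str):
        self.graph.add_edge(source, target, relation=relation)

    def context_for(self, entity: str) -> list[str]:
        """Return stored relations touching an entity, for prompt injection."""
        edges = list(self.graph.in_edges(entity, data=True)) + \
                list(self.graph.out_edges(entity, data=True))
        return [f"{u} --{d['relation']}--> {v}" for u, v, d in edges]

mm = MindMap()
mm.add_fact("transformer", "introduced_in", "Attention Is All You Need")
print(mm.context_for("transformer"))
```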

[251] The Odyssey of the Fittest: Can Agents Survive and Still Be Good?

Dylan Waldner, Risto Miikkulainen

Main category: cs.AI

TL;DR: A study introduces the Odyssey, a text-based game, to explore AI ethics by testing three agents (Bayesian with NEAT, Bayesian with SVI, and GPT-4o) in survival scenarios. Results show GPT-4o outperforms in survival and ethical consistency, challenging traditional probabilistic methods.

DetailsMotivation: Understanding AI decision-making in complex environments is critical for ethical behavior, especially as models grow more powerful.

Method: The Odyssey framework tests three agents (Bayesian-NEAT, Bayesian-SVI, GPT-4o) in survival scenarios, evaluating their ethical decisions under increasing danger.

Result: GPT-4o outperformed Bayesian models in survival and ethical consistency, with ethical behavior becoming unpredictable as danger increased.

Conclusion: The study highlights GPT-4o’s unexpected superiority, challenging assumptions about probabilistic methods and calling for deeper understanding of LLMs’ reasoning.

Abstract: As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This study introduces the Odyssey, a lightweight, adaptive text-based adventure game that provides a scalable framework for exploring AI ethics and safety. The Odyssey examines the ethical implications of implementing biological drives, specifically self-preservation, into three different agents: a Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT-4o agent. The agents select actions at each scenario to survive, adapting to increasingly challenging scenarios. Post-simulation analysis evaluates the ethical scores of the agents' decisions, uncovering the trade-offs they navigate to survive. Specifically, the analysis finds that when danger increases, agents' ethical behavior becomes unpredictable. Surprisingly, the GPT-4o agent outperformed the Bayesian models in both survival and ethical consistency, challenging assumptions about traditional probabilistic methods and raising a new challenge for understanding the mechanisms of LLMs' probabilistic reasoning.

[252] BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking

Qisheng Hu, Quanyu Long, Wenya Wang

Main category: cs.AI

TL;DR: BOOST is a bootstrapping framework for few-shot reasoning program generation, improving claim verification by integrating decomposition and information-gathering strategies iteratively without human intervention.

DetailsMotivation: Prior methods rely on limited, manually designed demonstrations, lacking diversity and requiring domain expertise. BOOST addresses this by automating and refining program generation.

Method: BOOST uses claim decomposition and information-gathering strategies to guide program generation, iteratively refining demonstrations in a data-centric way.

Result: BOOST outperforms prior few-shot baselines in zero-shot and few-shot settings for complex claim verification.

Conclusion: BOOST enhances interpretability and effectiveness in program-guided reasoning, enabling seamless transition from zero-shot to few-shot learning.

Abstract: Program-guided reasoning has shown promise in complex claim fact-checking by decomposing claims into function calls and executing reasoning programs. However, prior work primarily relies on few-shot in-context learning (ICL) with ad-hoc demonstrations, which limit program diversity and require manual design with substantial domain knowledge. Fundamentally, the underlying principles of effective reasoning program generation still remain underexplored, making it challenging to construct effective demonstrations. To address this, we propose BOOST, a bootstrapping-based framework for few-shot reasoning program generation. BOOST explicitly integrates claim decomposition and information-gathering strategies as structural guidance for program generation, iteratively refining bootstrapped demonstrations in a strategy-driven and data-centric manner without human intervention. This enables a seamless transition from zero-shot to few-shot strategic program-guided learning, enhancing interpretability and effectiveness. Experimental results show that BOOST outperforms prior few-shot baselines in both zero-shot and few-shot settings for complex claim verification.
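
A minimal sketch of the bootstrapping loop described above: programs whose execution verdict matches the gold label are kept as demonstrations for the next round. `generate_program` and `verify` are hypothetical stand-ins for the strategy-guided generator and program executor.

```python
def bootstrap_demos(generate_program, verify, claims, rounds=3):
    """Iteratively grow a demonstration pool without human intervention.

    claims: list of (claim_text, gold_label) pairs.
    """
    demos = []                                       # start zero-shot
    for _ in range(rounds):
        new_demos = []
        for claim, gold in claims:
            program = generate_program(claim, demos)  # strategy-guided generation
            if verify(program) == gold:               # keep only faithful programs
                new_demos.append((claim, program))
        demos = new_demos or demos                    # data-centric refinement
    return demos
```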

[253] Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models

Yuewen Mei, Tong Nie, Jian Sun, Ye Tian

Main category: cs.AI

TL;DR: An online, retrieval-augmented LLM framework generates safety-critical driving scenarios for AVs, improving collision avoidance and outperforming baselines.

DetailsMotivation: Existing scenario generation methods for AVs either overfit to common patterns or lack interactivity, missing rare, critical cases.

Method: Uses an LLM-based behavior analyzer to infer dangerous intents, then queries LLM agents for adversarial trajectories, augmented with a dynamic memorization-retrieval bank.

Result: Reduces mean minimum time-to-collision from 1.62 to 1.08 s and achieves a 75% collision rate, outperforming baselines.

Conclusion: The framework effectively generates safety-critical scenarios, enhancing AV testing.

Abstract: Simulation-based testing is crucial for validating autonomous vehicles (AVs), yet existing scenario generation methods either overfit to common driving patterns or operate in an offline, non-interactive manner that fails to expose rare, safety-critical corner cases. In this paper, we introduce an online, retrieval-augmented large language model (LLM) framework for generating safety-critical driving scenarios. Our method first employs an LLM-based behavior analyzer to infer the most dangerous intent of the background vehicle from the observed state, then queries additional LLM agents to synthesize feasible adversarial trajectories. To mitigate catastrophic forgetting and accelerate adaptation, we augment the framework with a dynamic memorization and retrieval bank of intent-planner pairs, automatically expanding its behavioral library when novel intents arise. Evaluations using the Waymo Open Motion Dataset demonstrate that our model reduces the mean minimum time-to-collision from 1.62 to 1.08 s and incurs a 75% collision rate, substantially outperforming baselines.
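
A sketch of the dynamic memorization-and-retrieval bank of intent-planner pairs; the embedding model, similarity measure, and planner format are all assumptions, since the paper does not pin them down in the abstract.

```python
import numpy as np

class IntentMemoryBank:
    """Stores (intent -> planner) pairs and retrieves the planner whose
    intent embedding is most similar to the query."""

    def __init__(self, embed):
        self.embed = embed          # embed(text) -> np.ndarray (e.g. sentence encoder)
        self.entries = []           # list of (intent_vec, intent, planner)

    def add(self, intent: str, planner):
        self.entries.append((self.embed(intent), intent, planner))

    def retrieve(self, intent: str):
        if not self.entries:
            return None             # novel intent: expand the behavioral library
        q = self.embed(intent)
        sims = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
                for v, _, _ in self.entries]
        return self.entries[int(np.argmax(sims))][2]
```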

[254] An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design

Darui Lu, Jordan M. Malof, Willie J. Padilla

Main category: cs.AI

TL;DR: An Agentic Framework using LLMs automates inverse design of photonic metamaterials, integrating forward modeling, external tools, and deep inverse methods for novel outputs.

DetailsMotivation: To leverage LLMs for autonomous, complex tasks like scientific research, specifically in photonic metamaterial design.

Method: Develops a framework where the Agent autonomously proposes forward models, uses APIs for simulations/optimization, and applies deep inverse methods.

Result: Demonstrates effectiveness in automation, reasoning, planning, and adaptability, yielding varied and novel designs.

Conclusion: The framework showcases the potential of LLMs in autonomous scientific research and complex problem-solving.

Abstract: Recent significant advances in integrating multiple Large Language Model (LLM) systems have enabled Agentic Frameworks capable of performing complex tasks autonomously, including novel scientific research. We develop and demonstrate such a framework specifically for the inverse design of photonic metamaterials. When queried with a desired optical spectrum, the Agent autonomously proposes and develops a forward deep learning model, accesses external tools via APIs for tasks like simulation and optimization, utilizes memory, and generates a final design via a deep inverse method. The framework’s effectiveness is demonstrated in its ability to automate, reason, plan, and adapt. Notably, the Agentic Framework possesses internal reflection and decision flexibility, permitting highly varied and potentially novel outputs.

[255] Establishing Best Practices for Building Rigorous Agentic Benchmarks

Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Main category: cs.AI

TL;DR: The paper highlights flaws in current AI agent benchmarks and introduces the Agentic Benchmark Checklist (ABC) to improve evaluation rigor, reducing performance overestimation by 33%.

DetailsMotivation: To address issues in AI agent benchmarks, such as flawed task setups or reward designs, which can misrepresent agent performance by up to 100%.

Method: Introduces the Agentic Benchmark Checklist (ABC), synthesized from benchmark-building experience, best practices, and reported issues.

Result: ABC reduces performance overestimation by 33% when applied to CVE-Bench.

Conclusion: ABC provides a practical solution to enhance the rigor and reliability of agentic benchmarks.

Abstract: Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

[256] Working with AI: Measuring the Occupational Implications of Generative AI

Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, Siddharth Suri

Main category: cs.AI

TL;DR: The paper analyzes how generative AI impacts work activities, identifying common tasks like information gathering and writing, and measures AI’s success and applicability across occupations.

DetailsMotivation: To understand AI's economic impact by examining its role in work activities and occupational applicability.

Method: Analysis of 200k anonymized conversations between users and Microsoft Bing Copilot, classifying activities and measuring task success and impact.

Result: Highest AI applicability in knowledge work and communication-heavy occupations like sales. AI excels in providing information, writing, teaching, and advising.

Conclusion: AI’s impact is significant in knowledge-based and information-driven roles, with potential implications for wage and education correlations.

Abstract: Given the rapid adoption of generative AI and its potential to impact a wide range of tasks, understanding the effects of AI on the economy is one of society’s most important questions. In this work, we take a step toward that goal by analyzing the work activities people do with AI, how successfully and broadly those activities are done, and combine that with data on what occupations do those activities. We analyze a dataset of 200k anonymized and privacy-scrubbed conversations between users and Microsoft Bing Copilot, a publicly available generative AI system. We find the most common work activities people seek AI assistance for involve gathering information and writing, while the most common activities that AI itself is performing are providing information and assistance, writing, teaching, and advising. Combining these activity classifications with measurements of task success and scope of impact, we compute an AI applicability score for each occupation. We find the highest AI applicability scores for knowledge work occupation groups such as computer and mathematical, and office and administrative support, as well as occupations such as sales whose work activities involve providing and communicating information. Additionally, we characterize the types of work activities performed most successfully, how wage and education correlate with AI applicability, and how real-world usage compares to predictions of occupational AI impact.

[257] Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation

Wei Du, Branislav Kisacanin, George Armstrong, Shubham Toshniwal, Ivan Moshkov, Alexan Ayrapetyan, Sadegh Mahdavi, Dan Zhao, Shizhe Diao, Dragan Masulovic, Marius Stanean, Advaith Avadhanam, Max Wang, Ashmit Dutta, Shitij Govil, Sri Yanamandara, Mihir Tandon, Sriram Ananthakrishnan, Vedant Rathi, David Zhang, Joonseok Kang, Leon Luo, Titu Andreescu, Boris Ginsburg, Igor Gitman

Main category: cs.AI

TL;DR: Light fine-tuning of a base model with 20 long CoT examples from a reasoning model outperforms larger models, showing minimal high-quality data can unlock strong reasoning. Human or non-reasoning model CoT data falls short, highlighting unique expert CoT qualities.

DetailsMotivation: Explore if long CoT can be induced in base models with minimal tuning or prompting, leveraging small high-quality datasets.

Method: Light fine-tuning of Qwen2.5-32B using 20 long CoT examples from QwQ-32B-Preview, and testing CoT data from non-reasoning models/humans with prompt engineering and editing.

Result: Fine-tuned model outperforms Qwen2.5-Math-72B-Instruct. Non-expert CoT data underperforms, suggesting expert CoT has irreplicable qualities.

Conclusion: Small, high-quality CoT datasets can activate reasoning in base models, but expert CoT remains superior. Challenges persist, but human-authored CoT shows promise.

Abstract: Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model \texttt{QwQ-32B-Preview}, we lightly fine-tune the base model \texttt{Qwen2.5-32B}. The resulting model outperforms the much larger \texttt{Qwen2.5-Math-72B-Instruct}, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
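
The "light fine-tuning" step is, in outline, a tiny supervised pass over ~20 long-CoT strings. The sketch below uses Hugging Face Transformers with assumed hyperparameters (the paper does not state them), and a 32B model would of course need multi-GPU or parameter-efficient tricks in practice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B"          # base model named in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Each example concatenates problem, long chain-of-thought trace, and answer.
examples = ["<problem> ... <long chain-of-thought> ... <answer>"]  # ~20 strings
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):                    # "light" tuning: few passes, tiny data
    for text in examples:
        batch = tok(text, return_tensors="pt", truncation=True)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step(); opt.zero_grad()
```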

[258] VerifyBench: A Systematic Benchmark for Evaluating Reasoning Verifiers Across Domains

Xuzhao Li, Xuchen Li, Shiyu Hu, Yongzhen Guo, Wentao Zhang

Main category: cs.AI

TL;DR: The paper proposes VerifyBench, a benchmark to evaluate verifiers for LLM responses, highlighting trade-offs between specialized and general verifiers.

DetailsMotivation: Existing verifiers for LLM responses lack systematic evaluation across domains, hindering reliable RLVR development.

Method: VerifyBench includes 4,000 expert-level questions with reference answers and diverse responses, evaluated under a four-dimensional framework.

Result: Specialized verifiers lead in accuracy but lack recall; general models are inclusive but unstable. Verifiers are sensitive to input structure and struggle with cross-domain generalization.

Conclusion: The study identifies critical bottlenecks in verifier technology, emphasizing the need for balanced solutions in RLVR.

Abstract: Large language models (LLMs) increasingly rely on reinforcement learning (RL) to enhance their reasoning capabilities through feedback. A critical challenge is verifying the consistency of model-generated responses and reference answers, since these responses are often lengthy, diverse, and nuanced. Rule-based verifiers struggle with complexity, prompting the use of model-based verifiers. However, specialized verifiers lack flexibility, while general LLM judges can be inconsistent. Existing research primarily focuses on building better verifiers, yet a systematic evaluation of different types of verifiers’ performance across domains remains lacking, severely constraining the reliable development of Reinforcement Learning with Verifiable Reward (RLVR). To address this, we propose VerifyBench–a cross-domain comprehensive benchmark for systematically evaluating verifiers. We construct 4,000 expert-level questions covering mathematics, physics, chemistry, and biology. Each question is equipped with reference answers and diverse responses. The reliability of the evaluation is ensured through a rigorous annotation process conducted by a multidisciplinary expert team. We design a four-dimensional experimental framework to comprehensively compare the performance boundaries of specialized verifiers and general LLMs under combined conditions of extracted answers vs. complete responses, and short vs. long outputs. Our evaluation uncovers fundamental trade-offs in verifiers: while specialized verifiers achieve leading accuracy, they exhibit deficiencies in recall; general models show stronger inclusivity but unstable precision. More importantly, we discover verifiers’ high sensitivity to input structure and inherent limitations in cross-domain generalization, providing critical insights into the bottlenecks of current verifier technology.

[259] Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures

Sudarshan Babu

Main category: cs.AI

TL;DR: The paper proposes architectures like neural memory and hypernetworks to enable efficient knowledge transfer in data-scarce domains, demonstrating applications in 3D scene generation, segmentation, and molecular property prediction.

DetailsMotivation: Addressing the challenge of transfer learning in data-scarce domains like computational chemistry and medical imaging, where large pre-trained models are infeasible.

Method: Utilizes neural memory for adaptation on non-stationary distributions and hypernetworks with MAML for generalizable priors, applied to 3D tasks and molecular prediction.

Result: Hypernetworks efficiently acquire priors with few samples, enabling faster text-to-3D generation and improved molecular property prediction.

Conclusion: The proposed architectures offer scalable solutions for transfer learning in data-limited settings, with broad applicability across domains.

Abstract: The ability to transfer knowledge from prior experiences to novel tasks stands as a pivotal capability of intelligent agents, including both humans and computational models. This principle forms the basis of transfer learning, where large pre-trained neural networks are fine-tuned to adapt to downstream tasks. Transfer learning has demonstrated tremendous success, both in terms of task adaptation speed and performance. However there are several domains where, due to lack of data, training such large pre-trained models or foundational models is not a possibility - computational chemistry, computational immunology, and medical imaging are examples. To address these challenges, our work focuses on designing architectures to enable efficient acquisition of priors when large amounts of data are unavailable. In particular, we demonstrate that we can use neural memory to enable adaptation on non-stationary distributions with only a few samples. Then we demonstrate that our hypernetwork designs (a network that generates another network) can acquire more generalizable priors than standard networks when trained with Model Agnostic Meta-Learning (MAML). Subsequently, we apply hypernetworks to 3D scene generation, demonstrating that they can acquire priors efficiently on just a handful of training scenes, thereby leading to faster text-to-3D generation. We then extend our hypernetwork framework to perform 3D segmentation on novel scenes with limited data by efficiently transferring priors from earlier viewed scenes. Finally, we repurpose an existing molecular generative method as a pre-training framework that facilitates improved molecular property prediction, addressing critical challenges in computational immunology.

cs.SD

[260] A Survey on Speech Deepfake Detection

Menglu Li, Yasaman Ahmadiadli, Xiao-Ping Zhang

Main category: cs.SD

TL;DR: A survey analyzing over 200 papers on speech Deepfake detection, covering model architectures, evaluation metrics, datasets, and emerging challenges like adversarial defenses and cross-dataset evaluation.

DetailsMotivation: The rise of Deepfake content, especially speech, poses misinformation threats, necessitating advancements in detection techniques.

Method: Systematic review of 200+ papers, analyzing detection pipeline components, performance, and open-source tools.

Result: Comprehensive insights into current techniques, challenges, and baselines for future research.

Conclusion: The survey provides guidance for improving speech Deepfake detection and highlights promising research directions.

Abstract: The availability of smart devices leads to an exponential increase in multimedia content. However, advancements in deep learning have also enabled the creation of highly sophisticated Deepfake content, including speech Deepfakes, which pose a serious threat by generating realistic voices and spreading misinformation. To combat this, numerous challenges have been organized to advance speech Deepfake detection techniques. In this survey, we systematically analyze more than 200 papers published up to March 2024. We provide a comprehensive review of each component in the detection pipeline, including model architectures, optimization techniques, generalizability, evaluation metrics, performance comparisons, available datasets, and open source availability. For each aspect, we assess recent progress and discuss ongoing challenges. In addition, we explore emerging topics such as partial Deepfake detection, cross-dataset evaluation, and defences against adversarial attacks, while suggesting promising research directions. This survey not only identifies the current state of the art to establish strong baselines for future experiments but also offers clear guidance for researchers aiming to enhance speech Deepfake detection systems.

[261] Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition

Mengzhe Geng, Patrick Littell, Aidan Pine, PENÁĆ, Marc Tessier, Roland Kuhn

Main category: cs.SD

TL;DR: The paper proposes an ASR-driven pipeline to support SENĆOTEN language revitalization, addressing data scarcity and vocabulary variation with TTS-augmented data and cross-lingual transfer learning. Results show improved WER and CER.

DetailsMotivation: To support SENĆOTEN language revitalization by overcoming challenges like limited data and vocabulary variation in ASR development.

Method: Uses TTS-augmented speech data, cross-lingual transfer learning with SFMs, and an n-gram language model for shallow fusion or n-best rescoring.

Result: Achieved WER of 19.34% (improved to 14.32%) and CER of 5.09% (improved to 3.45%) on the test set.

Conclusion: The ASR-driven pipeline shows promise for SENĆOTEN documentation, aiding revitalization efforts.

Abstract: The SENĆOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENĆOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENĆOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENĆOTEN language documentation.
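
Shallow fusion and n-best rescoring both boil down to interpolating ASR and LM scores; a minimal sketch, with the interpolation weight `lam` as a tuned hyperparameter (the paper's value is not given).

```python
def shallow_fusion_score(asr_logprob: float, lm_logprob: float,
                         lam: float = 0.3) -> float:
    """Interpolate the ASR hypothesis score with an n-gram LM score."""
    return asr_logprob + lam * lm_logprob

def rescore(nbest, lm_logprob_fn, lam=0.3):
    """n-best rescoring: pick the hypothesis with the best fused score.

    nbest: list of (hypothesis_text, asr_logprob) pairs.
    lm_logprob_fn(text) -> float : n-gram LM log-probability of the text.
    """
    return max(nbest,
               key=lambda h: shallow_fusion_score(h[1], lm_logprob_fn(h[0]), lam))
```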

[262] Pronunciation Deviation Analysis Through Voice Cloning and Acoustic Comparison

Andrew Valdivia, Yueming Zhang, Hailu Xu, Amir Ghasemkhani, Xin Qin

Main category: cs.SD

TL;DR: A novel method detects mispronunciations by comparing original speech to voice-cloned corrected versions, identifying deviations as errors.

DetailsMotivation: To improve pronunciation detection without relying on predefined rules or large training datasets.

Method: Uses voice cloning to create corrected speech, then compares it frame-by-frame with the original to find deviations.

Result: Effectively identifies specific pronunciation errors without needing phonetic rules or extensive data.

Conclusion: The approach offers a scalable and efficient solution for mispronunciation detection.

Abstract: This paper presents a novel approach for detecting mispronunciations by analyzing deviations between a user’s original speech and their voice-cloned counterpart with corrected pronunciation. We hypothesize that regions with maximal acoustic deviation between the original and cloned utterances indicate potential mispronunciations. Our method leverages recent advances in voice cloning to generate a synthetic version of the user’s voice with proper pronunciation, then performs frame-by-frame comparisons to identify problematic segments. Experimental results demonstrate the effectiveness of this approach in pinpointing specific pronunciation errors without requiring predefined phonetic rules or extensive training data for each target language.
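
A sketch of the frame-by-frame comparison step: align the original and voice-cloned utterances with DTW over MFCCs and flag the aligned frames with the largest acoustic deviation. The choice of MFCC features and Euclidean distance is an assumption; the paper's exact features are not stated in the abstract.

```python
import numpy as np
import librosa

def deviation_regions(original_wav, cloned_wav, sr=16000, top_k=5):
    """Return the top_k aligned frame pairs with maximal acoustic deviation,
    i.e. the candidate mispronunciation regions."""
    y1, _ = librosa.load(original_wav, sr=sr)
    y2, _ = librosa.load(cloned_wav, sr=sr)
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13)
    _, wp = librosa.sequence.dtw(X=m1, Y=m2)          # warping path aligns frames
    dists = [np.linalg.norm(m1[:, i] - m2[:, j]) for i, j in wp]
    worst = np.argsort(dists)[-top_k:]                # maximal-deviation frames
    return [tuple(wp[k]) for k in worst]
```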

[263] EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing

Vassilis Sioros, Alexandros Potamianos, Giorgos Paraskevopoulos

Main category: cs.SD

TL;DR: The paper introduces a method for efficient audio editing in auto-regressive models using cross-attention control, outperforming diffusion-based baselines.

DetailsMotivation: To improve audio editing by adapting image editing techniques like Prompt-to-Prompt for audio, leveraging attention mechanisms.

Method: Develops a Prompt-to-Prompt-like approach with cross and self-attention, integrates diffusion-based strategies, and introduces three editing mechanisms (Replacement, Reweighting, Refinement) using MUSICGEN.

Result: The proposed method outperforms diffusion-based baselines in melody, dynamics, and tempo, validated by automatic and human evaluations.

Conclusion: The combination of prompt-to-prompt guidance with auto-regressive models is effective for high-quality audio editing.

Abstract: In this study, we investigate leveraging cross-attention control for efficient audio editing within auto-regressive models. Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms. Integrating a diffusion-based strategy, influenced by Auffusion, we extend the model’s functionality to support refinement edits, establishing a baseline for prompt-guided audio editing. Additionally, we introduce an alternative approach by incorporating MUSICGEN, a pre-trained frozen auto-regressive model, and propose three editing mechanisms, based on Replacement, Reweighting, and Refinement of the attention scores. We employ commonly-used music-specific evaluation metrics and a human study, to gauge time-varying controllability, adherence to global text cues, and overall audio realism. The automatic and human evaluations indicate that the proposed combination of prompt-to-prompt guidance with autoregressive generation models significantly outperforms the diffusion-based baseline in terms of melody, dynamics, and tempo of the generated audio. Our code is available at https://github.com/billsioros/EditGen
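
Of the three editing mechanisms, Reweighting is the simplest to illustrate: scale the cross-attention mass assigned to one text token and renormalize, strengthening or weakening that concept in the generated audio. A hedged sketch; the tensor layout is an assumption.

```python
import torch

def reweight_attention(attn_scores: torch.Tensor, token_idx: int,
                       weight: float) -> torch.Tensor:
    """Prompt-to-Prompt-style Reweighting of cross-attention.

    attn_scores: (heads, audio_steps, text_tokens) attention probabilities.
    weight > 1 amplifies the token's influence; weight < 1 suppresses it.
    """
    scores = attn_scores.clone()
    scores[..., token_idx] = scores[..., token_idx] * weight
    return scores / scores.sum(dim=-1, keepdim=True)   # renormalize per step
```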

[264] Improving Neural Pitch Estimation with SWIPE Kernels

David Marttila, Joshua D. Reiss

Main category: cs.SD

TL;DR: The paper explores using SWIPE kernels as an audio frontend for neural pitch estimation, improving accuracy, noise robustness, and parameter efficiency.

DetailsMotivation: To enhance neural pitch estimators by incorporating task-specific features (SWIPE kernels) instead of relying solely on raw audio or general-purpose representations.

Method: Investigates SWIPE kernels as an audio frontend, evaluating supervised and self-supervised neural architectures on common datasets.

Result: SWIPE frontend reduces network size by an order of magnitude without performance loss and outperforms self-supervised neural estimators.

Conclusion: Task-specific features like SWIPE kernels can significantly improve neural pitch estimation efficiency and accuracy.

Abstract: Neural networks have become the dominant technique for accurate pitch and periodicity estimation. Although a lot of research has gone into improving network architectures and training paradigms, most approaches operate directly on the raw audio waveform or on general-purpose time-frequency representations. We investigate the use of Sawtooth-Inspired Pitch Estimation (SWIPE) kernels as an audio frontend and find that these hand-crafted, task-specific features can make neural pitch estimators more accurate, robust to noise, and more parameter-efficient. We evaluate supervised and self-supervised state-of-the-art architectures on common datasets and show that the SWIPE audio frontend allows for reducing the network size by an order of magnitude without performance degradation. Additionally, we show that the SWIPE algorithm on its own is much more accurate than commonly reported, outperforming state-of-the-art self-supervised neural pitch estimators.
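
To convey the flavor of a SWIPE-style frontend, here is a heavily simplified harmonic-template scorer: each pitch candidate is scored by matching the frame's spectrum against an idealized sawtooth template whose harmonic strengths decay as 1/k. The real SWIPE kernel is considerably more elaborate; this only illustrates the template-matching idea.

```python
import numpy as np

def sawtooth_pitch_score(frame, sr, f0_candidates):
    """Pick the candidate pitch whose sawtooth harmonic template best
    matches the frame's magnitude spectrum (a simplified SWIPE-like score)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    scores = []
    for f0 in f0_candidates:
        template = np.zeros_like(spec)
        for k in range(1, int(freqs[-1] // f0) + 1):
            bin_idx = int(round(k * f0 * len(frame) / sr))
            if bin_idx < len(template):
                template[bin_idx] = 1.0 / k          # sawtooth harmonic rolloff
        scores.append(np.dot(spec, template) / (np.linalg.norm(spec) + 1e-9))
    return f0_candidates[int(np.argmax(scores))]
```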

[265] FasTUSS: Faster Task-Aware Unified Source Separation

Francesco Paissan, Gordon Wichern, Yoshiki Masuyama, Ryo Aihara, François G. Germain, Kohei Saijo, Jonathan Le Roux

Main category: cs.SD

TL;DR: The paper analyzes the Time-Frequency (TF) dual-path model TUSS, optimizing its performance-complexity trade-off, and introduces two efficient variants, FasTUSS-8.3G and FasTUSS-11.7G, reducing operations significantly with minor performance drops.

DetailsMotivation: Address the high computational cost of TF dual-path models like TUSS while maintaining performance for audio source separation tasks.

Method: Analyze TUSS design choices, derive efficient models (FasTUSS-8.3G and FasTUSS-11.7G), and investigate prompt conditioning for a causal TUSS model.

Result: Reduced operations by 81% and 73% with minor performance drops (1.2 dB and 0.4 dB).

Conclusion: Efficient models achieve significant computational savings with minimal performance loss, offering practical improvements for audio source separation.

Abstract: Time-Frequency (TF) dual-path models are currently among the best performing audio source separation network architectures, achieving state-of-the-art performance in speech enhancement, music source separation, and cinematic audio source separation. While they are characterized by a relatively low parameter count, they still require a considerable number of operations, implying a higher execution time. This problem is exacerbated by the trend towards bigger models trained on large amounts of data to solve more general tasks, such as the recently introduced task-aware unified source separation (TUSS) model. TUSS, which aims to solve audio source separation tasks using a single, conditional model, is built upon TF-Locoformer, a TF dual-path model combining convolution and attention layers. The task definition comes in the form of a sequence of prompts that specify the number and type of sources to be extracted. In this paper, we analyze the design choices of TUSS with the goal of optimizing its performance-complexity trade-off. We derive two more efficient models, FasTUSS-8.3G and FasTUSS-11.7G, that reduce the original model’s operations by 81% and 73% with minor performance drops of 1.2 dB and 0.4 dB averaged over all benchmarks, respectively. Additionally, we investigate the impact of prompt conditioning to derive a causal TUSS model.

[266] ReverbMiipher: Generative Speech Restoration meets Reverberation Characteristics Controllability

Wataru Nakata, Yuma Koizumi, Shigeki Karita, Robin Scheibler, Haruko Ishikawa, Adriana Guevara-Rukoz, Heiga Zen, Michiel Bacchiani

Main category: cs.SD

TL;DR: ReverbMiipher is a Speech Restoration model that denoises speech while preserving and controlling reverberation, outperforming traditional methods.

DetailsMotivation: Traditional SR removes reverberation, which encodes spatial information. ReverbMiipher aims to retain and control it.

Method: Uses a ReverbEncoder to extract reverb features, conditions a vocoder for reconstruction, and employs a zero-vector replacement strategy for disentanglement.

Result: Effectively preserves reverberation, removes noise, and allows control via feature manipulation. Outperforms conventional methods.

Conclusion: ReverbMiipher successfully retains and controls reverberation, offering novel effects and superior performance.

Abstract: Reverberation encodes spatial information regarding the acoustic source environment, yet traditional Speech Restoration (SR) usually completely removes reverberation. We propose ReverbMiipher, an SR model extending parametric resynthesis framework, designed to denoise speech while preserving and enabling control over reverberation. ReverbMiipher incorporates a dedicated ReverbEncoder to extract a reverb feature vector from noisy input. This feature conditions a vocoder to reconstruct the speech signal, removing noise while retaining the original reverberation characteristics. A stochastic zero-vector replacement strategy during training ensures the feature specifically encodes reverberation, disentangling it from other speech attributes. This learned representation facilitates reverberation control via techniques such as interpolation between features, replacement with features from other utterances, or sampling from a latent space. Objective and subjective evaluations confirm ReverbMiipher effectively preserves reverberation, removes other artifacts, and outperforms the conventional two-stage SR and convolving simulated room impulse response approach. We further demonstrate its ability to generate novel reverberation effects through feature manipulation.

[267] Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition

Bingshen Mu, Kun Wei, Pengcheng Guo, Lei Xie

Main category: cs.SD

TL;DR: The paper proposes multi-modal and multi-granularity GER methods to improve ASR accuracy for accented speech by integrating pronunciation information and fine-grained phoneme-level details, achieving a 67.35% WER reduction.

DetailsMotivation: ASR performance degrades with speaker accents. Current GER methods lack specificity for accented speech, prompting the need for tailored solutions.

Method: Introduces multi-modal GER (integrates pronunciation from speech) and multi-granularity GER (phoneme-level details). Uses LoRA fine-tuning and HDMoLE for accent diversity.

Result: Achieves 67.35% relative WER reduction compared to Whisper-large-v3 on a multi-accent English dataset.

Conclusion: The proposed methods effectively address accented speech challenges, significantly improving ASR accuracy.

Abstract: Despite substantial improvements in ASR, performance tends to degrade when faced with adverse conditions such as speaker accents. Generative error correction (GER) leverages the rich linguistic knowledge and exceptional reasoning ability of LLMs, significantly outperforming typical LM methods. However, it lacks specificity in accented speech scenarios. In this study, we leverage GER to improve the accuracy of transcription predictions by addressing the two primary features of accented speech recognition. To fully leverage pronunciation information, we propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level information related to pronunciation. These two methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through LoRA fine-tuning. On the one hand, we employ a three-stage training strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge multiple mono-accent LoRA experts within a single multi-modal GER to overcome the challenges posed by accent diversity. On the other hand, multi-granularity GER leverages the N-best word-level and phoneme-level hypotheses generated by the HDMoLE model to predict the final accented speech transcriptions. Experimental results on the multi-accent English dataset demonstrate the efficacy of our proposed methods. Our methods achieve a remarkable relative WER reduction of 67.35% compared to the Whisper-large-v3 baseline.
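
A toy sketch of mixing mono-accent LoRA experts with thresholded gating, loosely following the HDMoLE idea described above; the hierarchical routing is omitted and all dimensions, thresholds, and the softmax gate are assumptions.

```python
import torch
import torch.nn as nn

class LoRAExpertMixture(nn.Module):
    """Mixture of LoRA experts with a dynamic gating threshold (toy version)."""

    def __init__(self, d_in, d_out, n_experts, rank=8, threshold=0.1):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.gate = nn.Linear(d_in, n_experts)
        self.threshold = threshold

    def forward(self, x):                        # x: (batch, d_in)
        w = torch.softmax(self.gate(x), dim=-1)  # routing weights per expert
        w = torch.where(w > self.threshold, w, torch.zeros_like(w))
        w = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize survivors
        delta = torch.einsum("eir,ero->eio", self.A, self.B)  # per-expert LoRA delta
        return torch.einsum("bi,eio,be->bo", x, delta, w)
```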

cs.LG

[268] Tool-to-Tool Matching Analysis Based Difference Score Computation Methods for Semiconductor Manufacturing

Sameera Bharadwaja H., Siddhrath Jandial, Shashank S. Agashe, Rajesh Kumar Reddy Moore, Youngkwan Kim

Main category: cs.LG

TL;DR: The paper proposes novel tool-to-tool matching (TTTM) pipelines for semiconductor manufacturing, addressing limitations of traditional methods in heterogeneous settings.

DetailsMotivation: Traditional TTTM methods rely on static data or golden references, which are hard to obtain and ineffective in heterogeneous equipment setups.

Method: The proposed pipelines analyze variance and modes in data to identify mismatched equipment, using both univariate and multivariate approaches.

Result: Univariate methods achieve high correlation (>0.95 with variance, >0.5 with modes), while multivariate methods correlate >0.75 with top univariate methods.

Conclusion: The proposed methods are effective for TTTM in heterogeneous settings, with multivariate methods showing robustness to hyper-parameters.

Abstract: We consider the problem of tool-to-tool matching (TTTM), also called chamber matching, in the context of semiconductor manufacturing equipment. Traditional TTTM approaches utilize static configuration data or depend on a golden reference, which are difficult to obtain in a commercial manufacturing line. Further, existing methods do not extend very well to a heterogeneous setting, where equipment are of different make-and-model, sourced from different equipment vendors. We propose novel TTTM analysis pipelines to overcome these issues. We hypothesize that a mismatched equipment would have higher variance and/or a higher number of modes in the data. Our best univariate method achieves a correlation coefficient >0.95 and >0.5 with the variance and number of modes, respectively, showing that the proposed methods are effective. Also, the best multivariate method achieves a correlation coefficient >0.75 with the top-performing univariate methods, showing its effectiveness. Finally, we analyze the sensitivity of the multivariate algorithms to the algorithm hyper-parameters.
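
The core hypothesis, that a mismatched chamber shows higher variance and/or more modes, suggests a simple per-tool score; a sketch where mode counting uses KDE peak detection, which is one plausible stand-in for the paper's estimator.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

def tttm_difference_score(sensor_values: np.ndarray) -> dict:
    """Summarize one tool's sensor trace by its variance and mode count;
    higher values of either flag a candidate mismatched chamber."""
    var = float(np.var(sensor_values))
    grid = np.linspace(sensor_values.min(), sensor_values.max(), 512)
    density = gaussian_kde(sensor_values)(grid)
    n_modes = len(find_peaks(density)[0])        # peaks in the estimated density
    return {"variance": var, "n_modes": n_modes}
```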

[269] Enhancing Cross Entropy with a Linearly Adaptive Loss Function for Optimized Classification Performance

Jae Wan Shim

Main category: cs.LG

TL;DR: A novel Linearly Adaptive Cross Entropy Loss function is proposed, outperforming standard cross entropy in classification tasks with one-hot encoded labels.

DetailsMotivation: To enhance optimization in classification tasks by introducing an adaptive term based on predicted probability of the true class.

Method: Derived from information theory, the loss function adds a term dependent on the true class’s predicted probability. Evaluated using ResNet on CIFAR-100.

Result: Consistently outperforms standard cross entropy in accuracy while maintaining similar efficiency.

Conclusion: The proposed loss function shows promise for future research in loss function design.

Abstract: We propose the Linearly Adaptive Cross Entropy Loss function, a novel measure derived from information theory. In comparison to the standard cross entropy loss function, the proposed one has an additional term that depends on the predicted probability of the true class. This feature serves to enhance the optimization process in classification tasks involving one-hot encoded class labels. The proposed loss has been evaluated on a ResNet-based model using the CIFAR-100 dataset. Preliminary results show that it consistently outperforms the standard cross entropy loss function in terms of classification accuracy. Moreover, the proposed loss maintains simplicity, achieving practically the same efficiency as the traditional cross entropy loss. These findings suggest that our approach could broaden the scope for future research into loss function design.
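
The abstract does not give the exact formula, so the sketch below is one plausible reading: standard cross entropy plus a term linear in the predicted probability of the true class, with `alpha` as an assumed weighting; at alpha=0 it reduces to plain cross entropy.

```python
import torch
import torch.nn.functional as F

def linearly_adaptive_ce(logits, targets, alpha=1.0):
    """Hypothetical form of the loss: CE plus a linear true-class-probability
    term (the paper's exact formula is not stated in the abstract)."""
    ce = F.cross_entropy(logits, targets)
    p_true = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ce + alpha * (1.0 - p_true).mean()   # extra push when p_true is low
```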

[270] An Adaptive Volatility-based Learning Rate Scheduler

Kieran Chai Kai Ren

Main category: cs.LG

TL;DR: VolSched is a novel adaptive learning rate scheduler inspired by volatility in stochastic processes, improving generalization in deep neural networks by dynamically adjusting LR based on accuracy volatility.

DetailsMotivation: Pre-defined and adaptive LR schedulers often lead to suboptimal generalization, prompting the need for a more dynamic approach.

Method: VolSched adjusts LR by calculating the ratio between long-term and short-term accuracy volatility, increasing LR to escape plateaus and decreasing it to stabilize training.

Result: On CIFAR-100 with ResNet-18/34, VolSched improves top-1 accuracy by 1.4 and 1.3 percentage points, respectively, and finds flatter minima (38% flatter than baselines).

Conclusion: VolSched enhances exploration and generalization, achieving better performance and flatter minima compared to existing schedulers.

Abstract: Effective learning rate (LR) scheduling is crucial for training deep neural networks. However, popular pre-defined and adaptive schedulers can still lead to suboptimal generalization. This paper introduces VolSched, a novel adaptive LR scheduler inspired by the concept of volatility in stochastic processes like Geometric Brownian Motion to dynamically adjust the learning rate. By calculating the ratio between long-term and short-term accuracy volatility, VolSched increases the LR to escape plateaus and decreases it to stabilize training, allowing the model to explore the loss landscape more effectively. We evaluate VolSched on the CIFAR-100 dataset against a strong baseline using a standard augmentation pipeline. When paired with ResNet-18 and ResNet-34, our scheduler delivers consistent performance gains, improving top-1 accuracy by 1.4 and 1.3 percentage points respectively. Analysis of the loss curves reveals that VolSched promotes a longer exploration phase. A quantitative analysis of the Hessian shows that VolSched finds a final solution that is 38% flatter than the next-best baseline, allowing the model to obtain wider minima and hence better generalization performance.
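
A sketch of the volatility-ratio mechanism: track accuracy over long and short windows, and scale the learning rate by the ratio of their standard deviations. Window sizes, clipping bounds, and the exact update rule are assumptions; only the long/short volatility ratio comes from the abstract.

```python
import numpy as np

class VolSched:
    """Volatility-based LR scheduler sketch (windows and clipping assumed)."""

    def __init__(self, lr=0.1, short=5, long=20, lo=0.5, hi=2.0):
        self.lr, self.short, self.long = lr, short, long
        self.lo, self.hi = lo, hi
        self.history = []

    def step(self, val_acc: float) -> float:
        self.history.append(val_acc)
        if len(self.history) < self.long:
            return self.lr                       # warm-up: not enough history yet
        vol_long = np.std(self.history[-self.long:])
        vol_short = np.std(self.history[-self.short:]) + 1e-12
        ratio = float(np.clip(vol_long / vol_short, self.lo, self.hi))
        self.lr *= ratio   # plateau (low short-term vol) -> ratio > 1 -> raise LR
        return self.lr
```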

[271] Universal Approximation Theorem for a Single-Layer Transformer

Esmail Gumaan

Main category: cs.LG

TL;DR: The paper explores the mathematical foundations of deep learning and Transformers, proving a universal approximation theorem for Transformers and analyzing their theoretical underpinnings.

DetailsMotivation: Despite the success of deep learning and Transformers in various domains, their theoretical understanding remains limited. This paper aims to bridge this gap by examining their mathematical foundations.

Method: The paper reviews key concepts from linear algebra, probability, and optimization, and analyzes the multi-head self-attention mechanism and backpropagation. It proves a universal approximation theorem for Transformers.

Result: The main result is a proof that a single-layer Transformer can approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision.

Conclusion: The findings advance the theoretical understanding of Transformers, bridging the gap between theory and practice, and are supported by practical case studies.

Abstract: Deep learning employs multi-layer neural networks trained via the backpropagation algorithm. This approach has achieved success across many domains and relies on adaptive gradient methods such as the Adam optimizer. Sequence modeling evolved from recurrent neural networks to attention-based models, culminating in the Transformer architecture. Transformers have achieved state-of-the-art performance in natural language processing (for example, BERT and GPT-3) and have been applied in computer vision and computational biology. However, theoretical understanding of these models remains limited. In this paper, we examine the mathematical foundations of deep learning and Transformers and present a novel theoretical result. We review key concepts from linear algebra, probability, and optimization that underpin deep learning, and we analyze the multi-head self-attention mechanism and the backpropagation algorithm in detail. Our main contribution is a universal approximation theorem for Transformers: we prove that a single-layer Transformer, comprising one self-attention layer followed by a position-wise feed-forward network with ReLU activation, can approximate any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision. We provide a formal statement and a complete proof. Finally, we present case studies that demonstrate the practical implications of this result. Our findings advance the theoretical understanding of Transformer models and help bridge the gap between theory and practice.
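
The main result can be stated informally along the following lines (notation is assumed, since the abstract gives only a verbal statement):

```latex
% Informal statement of the single-layer universal approximation result.
\begin{theorem}
Let $f\colon \mathcal{X} \to \mathbb{R}^{n \times d}$ be continuous, where
$\mathcal{X} \subset \mathbb{R}^{n \times d}$ is compact. For every
$\varepsilon > 0$ there exists a single-layer Transformer
$T = \mathrm{FFN} \circ \mathrm{SelfAttn}$, consisting of one multi-head
self-attention layer followed by a position-wise ReLU feed-forward network,
such that
\[
  \sup_{X \in \mathcal{X}} \, \lVert T(X) - f(X) \rVert < \varepsilon .
\]
\end{theorem}
```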

[272] Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs

Gabriel Bo, Koa Chang, Justin Gu

Main category: cs.LG

TL;DR: SPaRK is a reinforcement learning framework that trains LLMs to explore diverse tool usage, optimizing for answer quality and tool diversity, outperforming baselines in MMLU-Pro tasks.

Motivation: To enhance reasoning in LLMs by encouraging diverse tool usage beyond conventional methods.

Method: Uses step-wise RL with a dual-objective reward system, offline PPO, and a rarity-first exploitation strategy guided by GPT-4o.

Result: Achieves competitive performance in MMLU-Pro with higher tool diversity and maintained accuracy.

Conclusion: Explicit tool diversity in RL can improve reasoning without sacrificing accuracy.

Abstract: We present Step-wise Policy for Rare-tool Knowledge (SPaRK), a novel reinforcement learning framework that teaches large language models to explore diverse tool usage patterns beyond conventional high-temperature sampling. Building on recent advances in step-wise reinforcement learning, we introduce a dual-objective reward system that simultaneously optimizes for answer quality and tool diversity, training a Llama-3.1 8B model through offline PPO on synthetically generated trajectories from the MMLU-Pro dataset. Our approach uniquely employs a rarity-first exploitation strategy where a GPT-4o judge scores candidate actions across eight distinct tools plus chain-of-thought reasoning, with the policy favoring less-frequently used but still viable tools to encourage systematic exploration. Empirical results demonstrate that SPaRK achieves competitive performance across 14 MMLU-Pro categories while exhibiting significantly higher entropy in tool selection compared to both baseline and supervised fine-tuning approaches, suggesting that algorithmic exploration through explicit tool diversity can enhance reasoning capabilities without sacrificing accuracy.
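
A hedged sketch of what a dual-objective, rarity-first reward could look like: judge-scored answer quality plus a bonus that decays with a tool's usage frequency, applied only when the tool is still viable. The threshold, weight, and function names are invented for illustration; the paper's exact scoring is not given in the abstract.

```python
from collections import Counter

tool_counts = Counter()  # running usage statistics across trajectories

def spark_reward(tool, quality_score, viability_threshold=0.5, lam=0.3):
    """Illustrative dual-objective reward: answer quality plus tool rarity.

    quality_score: judge score in [0, 1] for the candidate action (in the
    paper a GPT-4o judge scores eight tools plus chain-of-thought).
    """
    total = sum(tool_counts.values()) or 1
    rarity = 1.0 - tool_counts[tool] / total      # rarely used tools score higher
    # Rarity-first exploitation: the bonus applies only to still-viable tools,
    # so the policy never trades away correctness for novelty alone.
    bonus = lam * rarity if quality_score >= viability_threshold else 0.0
    tool_counts[tool] += 1
    return quality_score + bonus

print(spark_reward("calculator", 0.8))  # 1.1: viable and unused -> full bonus
print(spark_reward("calculator", 0.8))  # 0.8: same tool again -> no bonus
```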

[273] MH-FSF: A Unified Framework for Overcoming Benchmarking and Reproducibility Limitations in Feature Selection Evaluation

Vanderson Rocha, Diego Kreutz, Gabriel Canto, Hendrio Bragança, Eduardo Feitosa

Main category: cs.LG

TL;DR: The paper introduces MH-FSF, a modular framework for feature selection, addressing reproducibility and benchmarking gaps in current research. It includes 17 methods and evaluates them on 10 Android malware datasets.

Motivation: Current feature selection research lacks reproducibility and benchmarking due to proprietary datasets and limited evaluations.

Method: The MH-FSF framework is developed, offering 17 feature selection methods (11 classical, 6 domain-specific) and systematic evaluation on 10 public Android malware datasets.

Result: Performance varies across balanced and imbalanced datasets, emphasizing the need for tailored preprocessing and selection criteria.

Conclusion: MH-FSF fosters methodological consistency, broadens literature, and opens new research directions in feature selection, especially for Android malware detection.

Abstract: Feature selection is vital for building effective predictive models, as it reduces dimensionality and emphasizes key features. However, current research often suffers from limited benchmarking and reliance on proprietary datasets. This severely hinders reproducibility and can negatively impact overall performance. To address these limitations, we introduce the MH-FSF framework, a comprehensive, modular, and extensible platform designed to facilitate the reproduction and implementation of feature selection methods. Developed through collaborative research, MH-FSF provides implementations of 17 methods (11 classical, 6 domain-specific) and enables systematic evaluation on 10 publicly available Android malware datasets. Our results reveal performance variations across both balanced and imbalanced datasets, highlighting the critical need for data preprocessing and selection criteria that account for these asymmetries. We demonstrate the importance of a unified platform for comparing diverse feature selection techniques, fostering methodological consistency and rigor. By providing this framework, we aim to significantly broaden the existing literature and pave the way for new research directions in feature selection, particularly within the context of Android malware detection.

[274] Extension OL-MDISF: Online Learning from Mix-Typed, Drifted, and Incomplete Streaming Features

Shengda Zhuo, Di Wu, Yi He, Shuqiang Huang, Xindong Wu

Main category: cs.LG

TL;DR: OL-MDISF addresses challenges in online learning with mixed, drifted, and incomplete features using latent copula-based representation, drift detection, and pseudo-labeling.

Motivation: To tackle heterogeneity, distribution shifts, and labeling constraints in online learning.

Method: Constructs latent copula-based representation, detects drifts via ensemble entropy and latent mismatch, and performs structure-aware pseudo-labeling.

Result: Tested on 14 real-world datasets, showing CER trends, ablation studies, and sensitivity analyses.

Conclusion: Provides a reproducible benchmark for online learning with complex, weakly supervised streaming data.

Abstract: Online learning, where feature spaces can change over time, offers a flexible learning paradigm that has attracted considerable attention. However, it still faces three significant challenges. First, the heterogeneity of real-world data streams with mixed feature types presents challenges for traditional parametric modeling. Second, data stream distributions can shift over time, causing an abrupt and substantial decline in model performance. Third, it is often infeasible to label every data instance due to time and cost constraints. To address these issues, we propose OL-MDISF (Online Learning from Mix-typed, Drifted, and Incomplete Streaming Features), which constructs a latent copula-based representation for heterogeneous features, detects drifts via ensemble entropy and latent mismatch, and performs structure-aware pseudo-labeling. This companion paper serves as a standalone technical reference to OL-MDISF. It provides a contextual discussion of related work in mixed-type modeling, drift adaptation, and weak supervision, as well as a comprehensive set of experiments across 14 real-world datasets under two types of drift scenarios. These include CER trends, ablation studies, sensitivity analyses, and temporal ensemble dynamics. We hope this document offers a reproducible benchmark for online learning on complex, weakly supervised streaming data.

[275] Divide-Then-Rule: A Cluster-Driven Hierarchical Interpolator for Attribute-Missing Graphs

Yaowen Hu, Wenxuan Tu, Yue Liu, Miaomiao Li, Wenpeng Lu, Zhigang Luo, Xinwang Liu, Ping Chen

Main category: cs.LG

TL;DR: DTRGC is a novel method for deep graph clustering in attribute-missing graphs, using hierarchical imputation and clustering to improve accuracy.

Motivation: Existing imputation methods for attribute-missing graphs often fail due to varying neighborhood information, leading to unreliable results.

Method: DTRGC uses Dynamic Cluster-Aware Feature Propagation, Hierarchical Neighborhood-aware Imputation, and Hop-wise Representation Enhancement to iteratively impute missing attributes and refine clustering.

Result: Experiments on six datasets show DTRGC significantly improves clustering performance for attribute-missing graphs.

Conclusion: DTRGC effectively addresses the challenges of attribute-missing graphs by leveraging clustering and hierarchical imputation.

Abstract: Deep graph clustering (DGC) for attribute-missing graphs is an unsupervised task aimed at partitioning nodes with incomplete attributes into distinct clusters. Addressing this challenging issue is vital for practical applications. However, research in this area remains underexplored. Existing imputation methods for attribute-missing graphs often fail to account for the varying amounts of information available across node neighborhoods, leading to unreliable results, especially for nodes with insufficient known neighborhood. To address this issue, we propose a novel method named Divide-Then-Rule Graph Completion (DTRGC). This method first addresses nodes with sufficient known neighborhood information and treats the imputed results as new knowledge to iteratively impute more challenging nodes, while leveraging clustering information to correct imputation errors. Specifically, Dynamic Cluster-Aware Feature Propagation (DCFP) initializes missing node attributes by adjusting propagation weights based on the clustering structure. Subsequently, Hierarchical Neighborhood-aware Imputation (HNAI) categorizes attribute-missing nodes into three groups based on the completeness of their neighborhood attributes. The imputation is performed hierarchically, prioritizing the groups with nodes that have the most available neighborhood information. The cluster structure is then used to refine the imputation and correct potential errors. Finally, Hop-wise Representation Enhancement (HRE) integrates information across multiple hops, thereby enriching the expressiveness of node representations. Experimental results on six widely used graph datasets show that DTRGC significantly improves the clustering performance of various DGC methods under attribute-missing graphs.

[276] RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services

Fei Zhao, Chonggang Lu, Yue Wang, Zheyong Xie, Ziyan Liu, Haofu Qian, JianZhao Huang, Fangcheng Shi, Zijie Meng, Hongcheng Guo, Mingqian He, Xinze Lyu, Yiming Lu, Ziyang Xiang, Zheyu Ye, Chengqiang Lu, Zhe Xu, Yi Wu, Yao Hu, Yan Gao, Jun Fan, Xiaolong Jiang, Weiting Liu, Boyang Wang, Shaosheng Cao

Main category: cs.LG

TL;DR: RedOne is a domain-specific LLM for SNS, outperforming single-task baselines by up to 14.02% in tasks and reducing harmful content exposure by 11.23%.

Motivation: Addressing the limitations of isolated task-focused LLMs in SNS, which struggle with data scaling and adaptability.

Method: Three-stage training: continued pretraining, supervised fine-tuning, and preference optimization using real-world SNS data.

Result: Average improvements of 14.02% in SNS tasks and 7.56% in bilingual benchmarks; reduced harmful content exposure by 11.23%.

Conclusion: RedOne is a robust, adaptable LLM for SNS, excelling in generalization and real-world applicability.

Abstract: As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has posed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions, but existing studies focus on isolated tasks, which not only encounter diminishing benefits from data scaling within individual scenarios but also fail to flexibly adapt to diverse real-world contexts. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for SNS. RedOne was developed through a three-stage training strategy consisting of continued pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities and achieves an average improvement of up to 14.02% across 8 major SNS tasks and 7.56% on an SNS bilingual evaluation benchmark, compared with base models. Furthermore, in online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-task fine-tuned baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.

[277] DALI-PD: Diffusion-based Synthetic Layout Heatmap Generation for ML in Physical Design

Bing-Yue Wu, Vidya A. Chhabria

Main category: cs.LG

TL;DR: DALI-PD is a scalable framework using diffusion models to generate synthetic layout heatmaps for ML in physical design, addressing dataset limitations.

Motivation: Overcoming the scarcity of high-quality, large-scale training datasets for ML in physical design tasks due to computational costs and IP constraints.

Method: Uses a diffusion model to quickly generate diverse synthetic layout heatmaps (power, IR drop, congestion, etc.) in seconds.

Result: Created a dataset of 20,000+ layout configurations resembling real layouts, improving ML accuracy for tasks like IR drop prediction.

Conclusion: DALI-PD provides a scalable solution to dataset generation, enhancing ML research in physical design.

Abstract: Machine learning (ML) has demonstrated significant promise in various physical design (PD) tasks. However, model generalizability remains limited by the availability of high-quality, large-scale training datasets. Creating such datasets is often computationally expensive and constrained by IP restrictions. The few public datasets that exist are typically static, slow to generate, and require frequent updates. To address these limitations, we present DALI-PD, a scalable framework for generating synthetic layout heatmaps to accelerate ML in PD research. DALI-PD uses a diffusion model to generate diverse layout heatmaps via fast inference in seconds. The heatmaps include power, IR drop, congestion, macro placement, and cell density maps. Using DALI-PD, we created a dataset comprising over 20,000 layout configurations with varying macro counts and placements. These heatmaps closely resemble real layouts and improve accuracy on downstream ML tasks such as IR drop and congestion prediction.

[278] LRMR: LLM-Driven Relational Multi-node Ranking for Lymph Node Metastasis Assessment in Rectal Cancer

Yaoxian Dong, Yifan Gao, Haoyue Li, Yanfen Cui, Xin Gao

Main category: cs.LG

TL;DR: LRMR, a two-stage LLM framework, improves lymph node metastasis assessment in rectal cancer by combining multimodal image analysis and relational ranking, outperforming traditional methods.

Motivation: Conventional MRI and black-box AI models lack interpretability and patient-level context for lymph node metastasis assessment.

Method: LRMR uses a multimodal LLM for structured feature analysis and a text-based LLM for pairwise risk ranking.

Result: Achieved AUC of 0.7917 and F1-score of 0.7200, surpassing ResNet50 (AUC 0.7708).

Conclusion: LRMR’s two-stage approach enhances interpretability and performance in lymph node metastasis diagnosis.

Abstract: Accurate preoperative assessment of lymph node (LN) metastasis in rectal cancer guides treatment decisions, yet conventional MRI evaluation based on morphological criteria shows limited diagnostic performance. While some artificial intelligence models have been developed, they often operate as black boxes, lacking the interpretability needed for clinical trust. Moreover, these models typically evaluate nodes in isolation, overlooking the patient-level context. To address these limitations, we introduce LRMR, an LLM-Driven Relational Multi-node Ranking framework. This approach reframes the diagnostic task from a direct classification problem into a structured reasoning and ranking process. The LRMR framework operates in two stages. First, a multimodal large language model (LLM) analyzes a composite montage image of all LNs from a patient, generating a structured report that details ten distinct radiological features. Second, a text-based LLM performs pairwise comparisons of these reports between different patients, establishing a relative risk ranking based on the severity and number of adverse features. We evaluated our method on a retrospective cohort of 117 rectal cancer patients. LRMR achieved an area under the curve (AUC) of 0.7917 and an F1-score of 0.7200, outperforming a range of deep learning baselines, including ResNet50 (AUC 0.7708). Ablation studies confirmed the value of our two main contributions: removing the relational ranking stage or the structured prompting stage led to a significant performance drop, with AUCs falling to 0.6875 and 0.6458, respectively. Our work demonstrates that decoupling visual perception from cognitive reasoning through a two-stage LLM framework offers a powerful, interpretable, and effective new paradigm for assessing lymph node metastasis in rectal cancer.
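
The second stage reduces to turning pairwise LLM judgments into a ranking. A simple win-count (Borda-style) aggregation is one plausible reading, sketched below; `compare` stands in for the text-based LLM judge and is an assumption, not the paper's interface.

```python
from itertools import combinations

def rank_by_pairwise(patients, compare):
    """Aggregate pairwise risk judgments into a ranking.

    compare(a, b) returns whichever of a, b the judge deems higher-risk.
    Win counting is one simple aggregation; the paper's exact scheme may differ.
    """
    wins = {p: 0 for p in patients}
    for a, b in combinations(patients, 2):
        wins[compare(a, b)] += 1
    return sorted(patients, key=lambda p: wins[p], reverse=True)

# Toy stand-in for the LLM judge: here, the higher stored risk score wins.
risk = {"P1": 0.2, "P2": 0.9, "P3": 0.5}
ranking = rank_by_pairwise(list(risk), lambda a, b: a if risk[a] >= risk[b] else b)
print(ranking)  # ['P2', 'P3', 'P1']
```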

[279] A Feed-Forward Artificial Intelligence Pipeline for Sustainable Desalination under Climate Uncertainties: UAE Insights

Obumneme Nwafor, Chioma Nwafor, Amro Zakaria, Nkechi Nwankwo

Main category: cs.LG

TL;DR: The study proposes a predictive modeling framework for optimizing desalination performance in the UAE by forecasting aerosol optical depth (AOD) and efficiency losses, achieving 98% accuracy, and integrating rule-based controls into an interactive dashboard.

Motivation: The UAE's heavy reliance on energy-intensive desalination faces sustainability challenges due to climate uncertainties like rising seawater temperatures and AOD, impacting solar-powered systems.

Method: A two-stage predictive modeling architecture forecasts AOD and desalination efficiency losses, using SHAP for key driver analysis, and proposes dust-aware control logic for system adjustments.

Result: The framework achieved 98% accuracy, with SHAP revealing degradation drivers. Rule-based controls were developed for adaptive management.

Conclusion: The study provides a climate-adaptive decision-support system via an interactive dashboard, enhancing desalination sustainability in the UAE.

Abstract: The United Arab Emirates (UAE) relies heavily on seawater desalination to meet over 90% of its drinking water needs. Desalination processes are highly energy intensive and account for approximately 15% of the UAE’s electricity consumption, contributing to over 22% of the country’s energy-related CO2 emissions. Moreover, these processes face significant sustainability challenges in the face of climate uncertainties such as rising seawater temperatures, salinity, and aerosol optical depth (AOD). AOD greatly affects the operational and economic performance of solar-powered desalination systems through photovoltaic soiling, membrane fouling, and water turbidity cycles. This study proposes a novel pipelined two-stage predictive modelling architecture: the first stage forecasts AOD using satellite-derived time series and meteorological data; the second stage uses the predicted AOD and other meteorological factors to predict desalination performance efficiency losses. The framework achieved 98% accuracy, and SHAP (SHapley Additive exPlanations) was used to reveal key drivers of system degradation. Furthermore, this study proposes a dust-aware rule-based control logic for desalination systems based on predicted values of AOD and solar efficiency. This control logic is used to adjust the desalination plant feed water pressure, adapt maintenance scheduling, and regulate energy source switching. To enhance the practical utility of the research findings, the predictive models and rule-based controls were packaged into an interactive dashboard for scenario and predictive analytics. This provides a management decision-support system for climate-adaptive planning.
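
A minimal sketch of the feed-forward two-stage pattern, using synthetic data and scikit-learn regressors as stand-ins for the paper's models: stage one forecasts AOD from meteorology, stage two consumes the predicted AOD to estimate efficiency losses, and invented thresholds illustrate the dust-aware control logic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
meteo = rng.normal(size=(500, 6))     # synthetic stand-ins: temp, humidity, wind, ...
aod_true = rng.uniform(0.0, 1.5, size=500)
eff_loss_true = 0.3 * aod_true + 0.05 * meteo[:, 0] + rng.normal(0, 0.02, 500)

# Stage 1: forecast AOD from meteorological features.
stage1 = GradientBoostingRegressor().fit(meteo, aod_true)
aod_pred = stage1.predict(meteo)

# Stage 2: predicted AOD is fed forward with meteorology to predict losses.
X2 = np.column_stack([aod_pred, meteo])
stage2 = GradientBoostingRegressor().fit(X2, eff_loss_true)

# Dust-aware rule-based control on the pipeline outputs (thresholds invented).
def control_action(aod, eff_loss):
    if aod > 1.0 or eff_loss > 0.4:
        return "raise feed pressure, schedule panel cleaning"
    if aod > 0.5:
        return "open pre-emptive maintenance window"
    return "normal operation"

print(control_action(aod_pred[0], stage2.predict(X2[:1])[0]))
```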

[280] FedGSCA: Medical Federated Learning with Global Sample Selector and Client Adaptive Adjuster under Label Noise

Mengwen Ye, Yingzi Huangfu, Shujian Gao, Wei Ren, Weifan Liu, Zekuan Yu

Main category: cs.LG

TL;DR: FedGSCA is a novel FL framework addressing label noise in medical image classification by combining a Global Sample Selector and Client Adaptive Adjustment, outperforming existing methods in noisy scenarios.

Motivation: Label noise in FL, caused by inter-institutional data variability, degrades model performance. Existing methods fail to handle noise heterogeneity and data imbalance in medical FL.

Method: FedGSCA uses a Global Sample Selector to aggregate noise knowledge and a Client Adaptive Adjustment mechanism (adaptive threshold pseudo-labeling and Robust Credal Labeling Loss) to manage noisy labels and class imbalance.

Result: FedGSCA outperforms state-of-the-art methods on real-world and synthetic datasets under various noise conditions, excelling in extreme and heterogeneous noise scenarios.

Conclusion: FedGSCA improves model stability and handles complex noise effectively, making it suitable for real-world medical FL applications.

Abstract: Federated Learning (FL) emerged as a solution for collaborative medical image classification while preserving data privacy. However, label noise, which arises from inter-institutional data variability, can cause training instability and degrade model performance. Existing FL methods struggle with noise heterogeneity and the imbalance in medical data. Motivated by these challenges, we propose FedGSCA, a novel framework for enhancing robustness in noisy medical FL. FedGSCA introduces a Global Sample Selector that aggregates noise knowledge from all clients, effectively addressing noise heterogeneity and improving global model stability. Furthermore, we develop a Client Adaptive Adjustment (CAA) mechanism that combines adaptive threshold pseudo-label generation and Robust Credal Labeling Loss. CAA dynamically adjusts to class distributions, ensuring the inclusion of minority samples and carefully managing noisy labels by considering multiple plausible labels. This dual approach mitigates the impact of noisy data and prevents overfitting during local training, which improves the generalizability of the model. We evaluate FedGSCA on one real-world colon slides dataset and two synthetic medical datasets under various noise conditions, including symmetric, asymmetric, extreme, and heterogeneous types. The results show that FedGSCA outperforms the state-of-the-art methods, excelling in extreme and heterogeneous noise scenarios. Moreover, FedGSCA demonstrates significant advantages in improving model stability and handling complex noise, making it well-suited for real-world medical federated learning scenarios.

[281] Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

Main category: cs.LG

TL;DR: The paper revisits scaling laws in NLP, finding that data quality and training strategies, not just size, impact performance. High data density and poor resource allocation cause sub-scaling. A new scaling law is proposed to address this.

Motivation: To understand why large language models show diminishing returns (sub-scaling) despite increased size and data, focusing on data quality and training strategies.

Method: Empirical analysis of over 400 models to study the effects of data density and resource allocation on performance.

Result: Identified high data density (redundant information) and non-optimal resource allocation as key causes of sub-scaling. Proposed a new scaling law for sub-scaling regimes.

Conclusion: Data quality and diversity, along with optimal resource allocation, are critical for sustained performance improvements in large language models.

Abstract: Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate, a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
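
For context, the classical power-law form that such analyses start from (the Chinchilla-style law) can be written as below; the paper's corrected sub-optimal scaling law is not stated in the abstract, so this display is background, not their result.

```latex
% Classical scaling law: loss as a power law in parameters N and tokens D.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
% Sub-scaling is the regime where observed loss decays more slowly than this
% fit predicts (an effective exponent smaller than \alpha or \beta), which the
% paper attributes to high data density and non-optimal resource allocation.
```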

[282] Fine-tuning Large Language Model for Automated Algorithm Design

Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang

Main category: cs.LG

TL;DR: Fine-tuning LLMs for algorithm design improves performance and generalization, outperforming general-purpose models.

Motivation: Explore the need for LLMs tailored to algorithm design and how to obtain them effectively.

Method: Use Diversity-Aware Rank-based sampling and direct preference optimization to fine-tune LLMs.

Result: Fine-tuned LLMs outperform general models, with smaller models matching larger ones in some tasks and showing generalization.

Conclusion: Task-specific adaptation of LLMs is valuable for algorithm design, opening new research directions.

Abstract: The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a first step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank-based (DAR) sampling strategy to balance training data diversity and quality, then leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments, conducted on Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct, span three distinct algorithm design tasks. Results suggest that fine-tuned LLMs can significantly outperform their off-the-shelf counterparts, with the smaller fine-tuned Llama-3.2-1B-Instruct even matching the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs fine-tuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research.
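
The preference-optimization stage presumably minimizes the standard DPO objective (Rafailov et al., 2023), sketched below in PyTorch on sequence log-probabilities; the paper's exact variant and hyperparameters are not given in the abstract.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Standard DPO objective on sequence log-probs (Rafailov et al., 2023).

    Each argument is a tensor of summed token log-probabilities for the
    preferred / dispreferred algorithm designs under the policy and the
    frozen reference model.
    """
    ratio_chosen = logp_chosen - ref_logp_chosen
    ratio_rejected = logp_rejected - ref_logp_rejected
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

# Toy check with made-up log-probs for a batch of two preference pairs.
lc, lr = torch.tensor([-10.0, -12.0]), torch.tensor([-11.5, -12.5])
rc, rr = torch.tensor([-10.5, -12.2]), torch.tensor([-11.0, -12.4])
print(dpo_loss(lc, lr, rc, rr).item())
```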

[283] Fully Data-driven but Interpretable Human Behavioural Modelling with Differentiable Discrete Choice Model

Fumiyasu Makinoshima, Tatsuya Mitomi, Fumiya Makihara, Eigo Segawa

Main category: cs.LG

TL;DR: Diff-DCM is a data-driven, interpretable method for modeling human behaviors using differentiable programming, requiring no prior knowledge and minimal computational resources.

Motivation: Automating and improving interpretability in discrete choice models for human behavior without relying on expert domain knowledge.

Method: Differentiable discrete choice model (Diff-DCM) uses differentiable programming to estimate closed-form utility functions from input features and choice outcomes.

Result: Diff-DCM works with synthetic and real-world data, is computationally efficient, and provides insights like optimal intervention paths.

Conclusion: Diff-DCM enables fully automated, reliable modeling, prediction, and control of human behaviors.

Abstract: Discrete choice models are essential for modelling various decision-making processes in human behaviour. However, the specification of these models has depended heavily on domain knowledge from experts, and the fully automated but interpretable modelling of complex human behaviours has been a long-standing challenge. In this paper, we introduce the differentiable discrete choice model (Diff-DCM), a fully data-driven method for the interpretable modelling, learning, prediction, and control of complex human behaviours, which is realised by differentiable programming. Solely from input features and choice outcomes without any prior knowledge, Diff-DCM can estimate interpretable closed-form utility functions that reproduce observed behaviours. Comprehensive experiments with both synthetic and real-world data demonstrate that Diff-DCM can be applied to various types of data and requires only a small amount of computational resources for the estimations, which can be completed within tens of seconds on a laptop without any accelerators. In these experiments, we also demonstrate that, using its differentiability, Diff-DCM can provide useful insights into human behaviours, such as an optimal intervention path for effective behavioural changes. This study provides a strong basis for the fully automated and reliable modelling, prediction, and control of human behaviours.

[284] Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them

Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov

Main category: cs.LG

TL;DR: Comparative analysis of RL and SFT for training LLMs on maths problems shows RL offers minor in-domain gains but slight out-of-domain degradation, while SFT has more pronounced effects, including greater parameter updates and potential skill replacement.

Motivation: To understand the training dynamics of RL and SFT in LLMs for reasoning tasks, particularly their impact on in-domain and out-of-domain performance.

Method: Comparative analysis of RL and SFT using the same model, maths problems, and similar hyperparameters, with examination of parameter updates and freezing experiments.

Result: RL yields minor in-domain gains but slight degradation on knowledge benchmarks; SFT shows more pronounced trends, with greater parameter updates and potential skill replacement. Freezing parts of the model yields inconclusive results.

Conclusion: RL amplifies existing capabilities, while SFT may replace old skills with new ones, though further investigation is needed to mitigate out-of-domain degradation.

Abstract: Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.

[285] Beyond Predictions: A Participatory Framework for Multi-Stakeholder Decision-Making

Vittoria Vineis, Giuseppe Perelli, Gabriele Tolomei

Main category: cs.LG

TL;DR: A participatory AI framework for multi-stakeholder decision-making, balancing diverse preferences via optimization and synthetic scoring, outperforming predictive baselines.

Motivation: Addressing the oversight of multi-actor complexity in conventional decision-support systems and the limited applicability of participatory AI.

Method: Proposes a modular, model-agnostic framework using k-fold cross-validation, context-dependent reward functions, and compromise functions for stakeholder trade-offs.

Result: Outperforms predictive baselines in real-world case studies, enhancing transparency and accountability.

Conclusion: The framework effectively balances stakeholder preferences and improves decision-making outcomes.

Abstract: Conventional automated decision-support systems, often based on supervised learning, focus on predicting outcomes to recommend actions. However, they typically overlook the complexity of multi-actor environments, where diverse and conflicting stakeholder preferences must be balanced. At the same time, participatory AI approaches remain largely context-specific, limiting their broader applicability. To address these gaps, we propose a participatory framework that reframes decision-making as a multi-stakeholder optimization problem, using context-dependent reward functions to represent each actor's preferences. Our modular, model-agnostic framework employs k-fold cross-validation to fine-tune user-provided prediction models and evaluate decision strategies, including compromise functions that mediate stakeholder trade-offs. A synthetic scoring mechanism aggregates user-defined preferences across multiple metrics to rank strategies and select an optimal decision-maker for generating actionable recommendations on new data. Validated on two high-stakes real-world case studies, the framework consistently produces stakeholder-aware decisions that outperform purely predictive baselines across multiple metrics, while enhancing the transparency and accountability of AI-supported decision-making.
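
A minimal sketch of one plausible synthetic scoring mechanism: min-max normalize each metric, weight by user-defined stakeholder preferences, and rank the strategies. The scheme and names are assumptions for illustration, not the paper's exact mechanism.

```python
def synthetic_scores(strategies, metrics, weights):
    """Aggregate user-weighted, normalized metrics into one score per strategy.

    metrics[s][m] is strategy s's value on metric m (higher = better here);
    weights[m] encodes stakeholder-defined importance.
    """
    scores = {}
    for m, w in weights.items():
        vals = [metrics[s][m] for s in strategies]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        for s in strategies:
            scores[s] = scores.get(s, 0.0) + w * (metrics[s][m] - lo) / span
    return sorted(strategies, key=lambda s: scores[s], reverse=True)

metrics = {
    "A": {"utility": 0.8, "fairness": 0.4},
    "B": {"utility": 0.6, "fairness": 0.9},
    "C": {"utility": 0.7, "fairness": 0.8},
}
print(synthetic_scores(["A", "B", "C"], metrics,
                       {"utility": 0.5, "fairness": 0.5}))
# ['C', 'A', 'B']: the compromise strategy wins under balanced weights
```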

[286] Compute Requirements for Algorithmic Innovation in Frontier AI Models

Peter Barnett

Main category: cs.LG

TL;DR: The paper investigates compute requirements for algorithmic innovations in large language model pretraining, analyzing 36 innovations in Llama 3 and DeepSeek-V3. It finds that compute caps may not significantly slow AI progress.

Motivation: To understand the compute resources needed for algorithmic innovations in pretraining large language models and assess the impact of compute caps on innovation.

Method: Catalog 36 pretraining algorithmic innovations, estimate their FLOP usage and hardware FLOP/s, and analyze the effect of compute caps.

Result: Compute requirements for innovations double yearly. Even stringent caps (e.g., GPT-2’s compute or 8 H100 GPUs) could allow half the innovations.

Conclusion: Compute caps alone are unlikely to dramatically slow AI algorithmic progress, as many innovations can still occur under restrictive conditions.

Abstract: Algorithmic innovation in the pretraining of large language models has driven a massive reduction in the total compute required to reach a given level of capability. In this paper we empirically investigate the compute requirements for developing algorithmic innovations. We catalog 36 pre-training algorithmic innovations used in Llama 3 and DeepSeek-V3. For each innovation we estimate both the total FLOP used in development and the FLOP/s of the hardware utilized. Innovations using significant resources double in their requirements each year. We then use this dataset to investigate the effect of compute caps on innovation. Our analysis suggests that compute caps alone are unlikely to dramatically slow AI algorithmic progress. Even stringent compute caps – such as capping total operations to the compute used to train GPT-2 or capping hardware capacity to 8 H100 GPUs – could still have allowed for half of the cataloged innovations.

[287] Meta-Reinforcement Learning for Fast and Data-Efficient Spectrum Allocation in Dynamic Wireless Networks

Oluwaseyi Giwa, Tobi Awodunmila, Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Ali Jamshed

Main category: cs.LG

TL;DR: A meta-learning framework is proposed for dynamic spectrum allocation in 5G/6G networks, outperforming traditional DRL methods in throughput, latency, and fairness.

Motivation: Traditional DRL methods are impractical due to high sample complexity and safety risks, necessitating a more efficient and safer approach.

Method: Three meta-learning architectures (MAML, RNN, attention-enhanced RNN) are implemented and compared to PPO in a simulated IAB environment.

Result: The attention-based meta-learning agent achieves 48 Mbps peak throughput, reduces SINR/latency violations by 50%, and shows better fairness (0.7 index) than PPO (10 Mbps).

Conclusion: Meta-learning is proven effective and safer for intelligent control in wireless systems, offering rapid adaptation and superior performance.

Abstract: The dynamic allocation of spectrum in 5G / 6G networks is critical to efficient resource utilization. However, applying traditional deep reinforcement learning (DRL) is often infeasible due to its immense sample complexity and the safety risks associated with unguided exploration, which can cause severe network interference. To address these challenges, we propose a meta-learning framework that enables agents to learn a robust initial policy and rapidly adapt to new wireless scenarios with minimal data. We implement three meta-learning architectures, model-agnostic meta-learning (MAML), recurrent neural network (RNN), and an attention-enhanced RNN, and evaluate them against a non-meta-learning DRL algorithm, a proximal policy optimization (PPO) baseline, in a simulated dynamic integrated access/backhaul (IAB) environment. Our results show a clear performance gap: the attention-based meta-learning agent reaches a peak mean network throughput of 48 Mbps, while the PPO baseline degrades to 10 Mbps. Furthermore, our method reduces SINR and latency violations by more than 50% compared to PPO. It also adapts quickly and achieves a fairness index of 0.7, indicating better resource allocation. This work demonstrates that meta-learning is an effective and safer option for intelligent control in complex wireless systems.
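
The MAML component follows a well-known inner/outer-loop structure (Finn et al., 2017). The sketch below is schematic and supervised-style for brevity, with a tiny linear model; the paper applies the same meta-update with an RL objective per wireless scenario.

```python
import torch

# A tiny linear model; the paper's policies are larger, but the meta-update
# structure is identical.
w = torch.zeros(4, requires_grad=True)

def predict(weights, X):
    return X @ weights

def maml_meta_step(w, tasks, inner_lr=0.05, meta_lr=0.01):
    """One MAML update (Finn et al., 2017): adapt per task on support data,
    then update the shared initialization from post-adaptation query losses."""
    meta_loss = 0.0
    for (Xs, ys), (Xq, yq) in tasks:
        support_loss = ((predict(w, Xs) - ys) ** 2).mean()
        (g,) = torch.autograd.grad(support_loss, w, create_graph=True)
        w_fast = w - inner_lr * g                     # inner-loop adaptation
        meta_loss = meta_loss + ((predict(w_fast, Xq) - yq) ** 2).mean()
    (meta_g,) = torch.autograd.grad(meta_loss / len(tasks), w)
    return (w - meta_lr * meta_g).detach().requires_grad_(True)

# Two toy "scenarios" with different linear dynamics.
torch.manual_seed(0)
tasks = []
for true_w in (torch.ones(4), -torch.ones(4)):
    X = torch.randn(16, 4)
    tasks.append(((X[:8], X[:8] @ true_w), (X[8:], X[8:] @ true_w)))
for _ in range(100):
    w = maml_meta_step(w, tasks)
```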

[288] LLMs Meet Cross-Modal Time Series Analytics: Overview and Directions

Chenxi Liu, Hao Miao, Cheng Long, Yan Zhao, Ziyue Li, Panos Kalnis

Main category: cs.LG

TL;DR: The paper provides an overview of LLM-based cross-modal time series analytics, classifying approaches into conversion, alignment, and fusion, and discusses applications, challenges, and future directions.

Motivation: To bridge the cross-modality gap between time series and textual data for LLMs, enabling their practical application in time series analytics.

Method: Introduces a taxonomy of cross-modal modeling strategies (conversion, alignment, fusion) and reviews their applications in downstream tasks.

Result: Summarizes advancements and methodologies, highlighting open challenges and balancing effectiveness with efficiency.

Conclusion: Aims to expand LLM applications in real-world cross-modal time series analytics, providing insights into current progress and future research.

Abstract: Large Language Models (LLMs) have emerged as a promising paradigm for time series analytics, leveraging their massive parameters and the shared sequential nature of textual and time series data. However, a cross-modality gap exists between time series and textual data, as LLMs are pre-trained on textual corpora and are not inherently optimized for time series. In this tutorial, we provide an up-to-date overview of LLM-based cross-modal time series analytics. We introduce a taxonomy that classifies existing approaches into three groups based on their cross-modal modeling strategy, namely conversion, alignment, and fusion, and then discuss their applications across a range of downstream tasks. In addition, we summarize several open challenges. This tutorial aims to expand the practical application of LLMs in solving real-world problems in cross-modal time series analytics while balancing effectiveness and efficiency. Participants will gain a thorough understanding of current advancements, methodologies, and future research directions in cross-modal time series analytics.

[289] Flows and Diffusions on the Neural Manifold

Daniel Saragih, Deyu Cao, Tejas Balaji

Main category: cs.LG

TL;DR: The paper extends diffusion and flow-based generative models to weight space learning, leveraging optimization dynamics for structural priors and unifying trajectory inference techniques under gradient flow matching.

Motivation: To advance generative models in weight space learning by incorporating optimization dynamics as inductive bias, improving weight generation and downstream tasks.

Method: Models gradient descent trajectories as inference problems, uses gradient flow matching, and explores architectural choices like adjoint matching, autoencoders, and task-specific conditioning.

Result: Matches or surpasses baselines in weight generation, improves downstream training initialization, and excels in detecting harmful covariate shifts.

Conclusion: The method effectively applies generative models to weight space, enhancing performance and practical applications like safety-critical systems.

Abstract: Diffusion and flow-based generative models have achieved remarkable success in domains such as image synthesis, video generation, and natural language modeling. In this work, we extend these advances to weight space learning by leveraging recent techniques to incorporate structural priors derived from optimization dynamics. Central to our approach is modeling the trajectory induced by gradient descent as a trajectory inference problem. We unify several trajectory inference techniques under the framework of gradient flow matching, providing a theoretical framework for treating optimization paths as inductive bias. We further explore architectural and algorithmic choices, including reward fine-tuning by adjoint matching, the use of autoencoders for latent weight representation, conditioning on task-specific context data, and adopting informative source distributions such as Kaiming uniform. Experiments demonstrate that our method matches or surpasses baselines in generating in-distribution weights, improves initialization for downstream training, and supports fine-tuning to enhance performance. Finally, we illustrate a practical application in safety-critical systems: detecting harmful covariate shifts, where our method outperforms the closest comparable baseline.

[290] Player-Team Heterogeneous Interaction Graph Transformer for Soccer Outcome Prediction

Lintao Wang, Shiwen Xu, Michael Horton, Joachim Gudmundsson, Zhiyong Wang

Main category: cs.LG

TL;DR: HIGFormer, a graph-augmented transformer model, improves soccer match outcome prediction by capturing player and team interactions, outperforming existing methods.

Motivation: Existing methods overlook heterogeneous player and team interactions, crucial for accurate soccer outcome prediction.

Method: HIGFormer uses a multi-level interaction framework: Player Interaction Network, Team Interaction Network, and Match Comparison Transformer.

Result: HIGFormer outperforms existing methods on the WyScout Open Access Dataset and aids in player performance evaluation.

Conclusion: HIGFormer offers a robust solution for soccer outcome prediction and insights for talent scouting and strategy analysis.

Abstract: Predicting soccer match outcomes is a challenging task due to the inherently unpredictable nature of the game and the numerous dynamic factors influencing results. While it conventionally relies on meticulous feature engineering, deep learning techniques have recently shown great promise in learning effective player and team representations directly for soccer outcome prediction. However, existing methods often overlook the heterogeneous nature of interactions among players and teams, which is crucial for accurately modeling match dynamics. To address this gap, we propose HIGFormer (Heterogeneous Interaction Graph Transformer), a novel graph-augmented transformer-based deep learning model for soccer outcome prediction. HIGFormer introduces a multi-level interaction framework that captures both fine-grained player dynamics and high-level team interactions. Specifically, it comprises (1) a Player Interaction Network, which encodes player performance through heterogeneous interaction graphs, combining local graph convolutions with a global graph-augmented transformer; (2) a Team Interaction Network, which constructs interaction graphs from a team-to-team perspective to model historical match relationships; and (3) a Match Comparison Transformer, which jointly analyzes both team and player-level information to predict match outcomes. Extensive experiments on the WyScout Open Access Dataset, a large-scale real-world soccer dataset, demonstrate that HIGFormer significantly outperforms existing methods in prediction accuracy. Furthermore, we provide valuable insights into leveraging our model for player performance evaluation, offering a new perspective on talent scouting and team strategy analysis.

[291] GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu

Main category: cs.LG

TL;DR: GHPO, a difficulty-aware RL framework, improves training stability and performance in LLMs by dynamically balancing imitation learning and exploration-based RL.

Motivation: Address training instability and inefficiency in RLVR for LLMs due to capacity-difficulty mismatch.

Method: Introduces Guided Hybrid Policy Optimization (GHPO), which uses adaptive prompt refinement to balance imitation learning and RL.

Result: Achieves ~5% performance gain on math benchmarks, outperforming baselines in stability and reasoning.

Conclusion: GHPO offers a scalable, efficient solution for robust reasoning models.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model’s current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model’s reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.

[292] Scalable Unsupervised Segmentation via Random Fourier Feature-based Gaussian Process

Issei Saito, Masatoshi Nagano, Tomoaki Nakamura, Daichi Mochihashi, Koki Mimura

Main category: cs.LG

TL;DR: RFF-GP-HSMM is a fast unsupervised time-series segmentation method using random Fourier features to reduce computational costs of GP-HSMM.

Motivation: GP-HSMM's high computational cost due to kernel matrix inversion limits scalability.

Method: Approximates GP with linear regression using RFF, avoiding kernel matrix inversion.

Result: Achieves comparable segmentation to conventional methods with 278x speedup on 39,200 frames.

Conclusion: RFF-GP-HSMM offers efficient, scalable time-series segmentation without sacrificing performance.

Abstract: In this paper, we propose RFF-GP-HSMM, a fast unsupervised time-series segmentation method that incorporates random Fourier features (RFF) to address the high computational cost of the Gaussian process hidden semi-Markov model (GP-HSMM). GP-HSMM models time-series data using Gaussian processes, requiring inversion of an $N \times N$ kernel matrix during training, where $N$ is the number of data points. As the scale of the data increases, matrix inversion incurs a significant computational cost. To address this, the proposed method approximates the Gaussian process with linear regression using RFF, preserving expressive power while eliminating the need for inversion of the kernel matrix. Experiments on the Carnegie Mellon University (CMU) motion-capture dataset demonstrate that the proposed method achieves segmentation performance comparable to that of conventional methods, with approximately 278 times faster segmentation on time-series data comprising 39,200 frames.
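
The RFF approximation itself is standard (Rahimi & Recht, 2007): random frequencies drawn from the RBF kernel's spectral density give features whose inner products approximate the kernel, so GP regression collapses to linear (ridge) regression with no $N \times N$ inversion. A minimal NumPy sketch of that substitution, not the paper's full segmentation model:

```python
import numpy as np

def rff_features(X, n_features=100, lengthscale=1.0, seed=0):
    """Random Fourier features for the RBF kernel (Rahimi & Recht, 2007):
    z(x)^T z(y) approximates exp(-||x - y||^2 / (2 * lengthscale^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# GP regression becomes ridge regression in feature space: solving a
# D x D system in the feature count D instead of inverting an N x N kernel.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=2000)
Z = rff_features(X)
alpha = 1e-2  # ridge regularizer
weights = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ y)
print(np.abs(Z @ weights - y).mean())  # small residual on the training data
```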

[293] GeoHopNet: Hopfield-Augmented Sparse Spatial Attention for Dynamic UAV Site Location Problem

Jianing Zhi, Xinghua Li, Zidong Chen

Main category: cs.LG

TL;DR: GeoHopNet is a Hopfield-augmented sparse spatial attention network for dynamic UAV site selection, addressing computational bottlenecks with innovations like distance-biased attention and K-nearest neighbor sparse attention.

Motivation: The urban UAV economy's growth demands efficient dynamic site selection, but traditional deep reinforcement learning struggles with computational complexity in large-scale problems.

Method: GeoHopNet introduces four innovations: distance-biased multi-head attention, K-nearest neighbor sparse attention, a Hopfield external memory module, and memory regularization.

Result: GeoHopNet solves large-scale problems (1,000 nodes) in under 0.1s with a 0.22% optimality gap, outperforming baselines in speed and quality.

Conclusion: GeoHopNet advances UAV site selection by significantly improving computational efficiency and solution quality for large-scale urban problems.

Abstract: The rapid development of urban low-altitude unmanned aerial vehicle (UAV) economy poses new challenges for dynamic site selection of UAV landing points and supply stations. Traditional deep reinforcement learning methods face computational complexity bottlenecks, particularly with standard attention mechanisms, when handling large-scale urban-level location problems. This paper proposes GeoHopNet, a Hopfield-augmented sparse spatial attention network specifically designed for dynamic UAV site location problems. Our approach introduces four core innovations: (1) distance-biased multi-head attention mechanism that explicitly encodes spatial geometric information; (2) K-nearest neighbor sparse attention that reduces computational complexity from $O(N^2)$ to $O(NK)$; (3) a modern Hopfield external memory module; and (4) a memory regularization strategy. Experimental results demonstrate that GeoHopNet extends the boundary of solvable problem sizes. For large-scale instances with 1,000 nodes, where standard attention models become prohibitively slow (over 3 seconds per instance) and traditional solvers fail, GeoHopNet finds high-quality solutions (0.22% optimality gap) in under 0.1 seconds. Compared to the state-of-the-art ADNet baseline on 100-node instances, our method improves solution quality by 22.2% and is 1.8$\times$ faster.
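
The two attention ideas are easy to sketch together: restrict each node's attention to its K nearest neighbors and subtract a distance bias from the logits. The parametrization below is illustrative, not the paper's exact layer, and uses a dense distance matrix that a spatial index would replace at scale.

```python
import numpy as np

def knn_sparse_attention(X, coords, K=8, bias_scale=1.0):
    """Distance-biased attention restricted to K nearest neighbors:
    O(NK) score evaluations per pass instead of O(N^2) once neighbors are
    known (a k-d tree would avoid the dense distance matrix in production)."""
    N, d = X.shape
    Q, Kmat, V = X, X, X  # identity projections to keep the sketch short
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    out = np.zeros_like(X)
    for i in range(N):
        nbrs = np.argsort(dist[i])[:K]                 # K nearest neighbors of i
        logits = Q[i] @ Kmat[nbrs].T / np.sqrt(d)
        logits = logits - bias_scale * dist[i, nbrs]   # distance-biased scores
        w = np.exp(logits - logits.max())
        out[i] = (w / w.sum()) @ V[nbrs]
    return out

rng = np.random.default_rng(0)
N = 100
coords = rng.uniform(size=(N, 2))      # candidate UAV site locations
feats = rng.normal(size=(N, 16))
print(knn_sparse_attention(feats, coords).shape)  # (100, 16)
```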

[294] A Simple Baseline for Stable and Plastic Neural Networks

É. Künzel, A. Jaziri, V. Ramesh

Main category: cs.LG

TL;DR: RDBP combines ReLUDown and Decreasing Backpropagation for balanced continual learning, outperforming state-of-the-art methods with lower computational cost.

Motivation: Address the trade-off between plasticity and stability in continual learning for computer vision.

Method: Introduces RDBP with ReLUDown (activation modification) and Decreasing Backpropagation (gradient-scheduling).

Result: Matches or exceeds state-of-the-art performance on Continual ImageNet with reduced computational cost.

Conclusion: RDBP is a practical, efficient benchmark for future continual learning strategies.

Abstract: Continual learning in computer vision requires that models adapt to a continuous stream of tasks without forgetting prior knowledge, yet existing approaches often tip the balance heavily toward either plasticity or stability. We introduce RDBP, a simple, low-overhead baseline that unites two complementary mechanisms: ReLUDown, a lightweight activation modification that preserves feature sensitivity while preventing neuron dormancy, and Decreasing Backpropagation, a biologically inspired gradient-scheduling scheme that progressively shields early layers from catastrophic updates. Evaluated on the Continual ImageNet benchmark, RDBP matches or exceeds the plasticity and stability of state-of-the-art methods while reducing computational cost. RDBP thus provides both a practical solution for real-world continual learning and a clear benchmark against which future continual learning strategies can be measured.

[295] ZClassifier: Temperature Tuning and Manifold Approximation via KL Divergence on Logit Space

Shim Soon Yong

Main category: cs.LG

TL;DR: ZClassifier replaces deterministic logits with Gaussian-distributed logits, unifying uncertainty calibration and latent control via KL divergence minimization. It improves robustness, calibration, and latent separation over softmax classifiers.

Motivation: To address temperature scaling and manifold approximation in classification, while unifying uncertainty calibration and latent control.

Method: Uses diagonal Gaussian-distributed logits, minimizing KL divergence between predicted Gaussians and a unit isotropic Gaussian.

Result: Outperforms softmax classifiers in robustness, calibration, and latent separation on CIFAR-10 and CIFAR-100. Also effective for classifier-guided generation.

Conclusion: ZClassifier provides a principled probabilistic framework for classification, improving performance and interpretability.

Abstract: We introduce a novel classification framework, ZClassifier, that replaces conventional deterministic logits with diagonal Gaussian-distributed logits. Our method simultaneously addresses temperature scaling and manifold approximation by minimizing the Kullback-Leibler (KL) divergence between the predicted Gaussian distributions and a unit isotropic Gaussian. This unifies uncertainty calibration and latent control in a principled probabilistic manner, enabling a natural interpretation of class confidence and geometric consistency. Experiments on CIFAR-10 and CIFAR-100 show that ZClassifier improves over softmax classifiers in robustness, calibration, and latent separation. We also demonstrate its effectiveness for classifier-guided generation by interpreting logits as Gaussian semantic potentials.
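
The KL term has the usual closed form for a diagonal Gaussian against the unit isotropic Gaussian, KL = (1/2) * sum(mu^2 + sigma^2 - 1 - log sigma^2). Below is a PyTorch sketch of a Gaussian-logit head built around it; how the regularizer is weighted against the classification loss is not specified in the abstract.

```python
import torch

def kl_to_unit_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form, per example."""
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)

class GaussianLogitHead(torch.nn.Module):
    """Predicts a mean and log-variance per class instead of a point logit."""
    def __init__(self, d_in, n_classes):
        super().__init__()
        self.mu = torch.nn.Linear(d_in, n_classes)
        self.logvar = torch.nn.Linear(d_in, n_classes)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterized sample: stochastic logits with learned spread.
        z = mu + logvar.mul(0.5).exp() * torch.randn_like(mu)
        return z, kl_to_unit_gaussian(mu, logvar)

head = GaussianLogitHead(32, 10)
z, kl = head(torch.randn(4, 32))
print(z.shape, kl.shape)  # torch.Size([4, 10]) torch.Size([4])
```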

[296] First-of-its-kind AI model for bioacoustic detection using a lightweight associative memory Hopfield neural network

Andrew Gascoyne, Wendy Lomas

Main category: cs.LG

TL;DR: An AI model using Hopfield neural networks for bioacoustic analysis is proposed, addressing data scarcity, environmental impact, and hardware demands. It is fast, lightweight, and accurate, with potential for field deployment.

Motivation: To tackle challenges in bioacoustic analysis, such as limited training data, high energy consumption, and hardware requirements, by developing a sustainable and efficient AI model.

Method: The model employs associative memory via a transparent Hopfield neural network, requiring minimal training data (one signal per target sound) and offering rapid training and classification.

Result: The model processes 10,384 bat recordings in 5.4 s on a standard laptop, uses only 144.09 MB of RAM, and achieves 86% precision with no disagreements against expert identifications.

Conclusion: The proposed AI model is a promising solution for fast, lightweight, sustainable, and accurate bioacoustic analysis, with broad applicability beyond the tested bat dataset.

Abstract: A growing issue within conservation bioacoustics is the task of analysing the vast amount of data generated from the use of passive acoustic monitoring devices. In this paper, we present an alternative AI model which has the potential to help alleviate this problem. Our model formulation addresses the key issues encountered when using current AI models for bioacoustic analysis, namely: the limited training data available; the environmental impact, particularly the energy consumption and carbon footprint of training and implementing these models; and the associated hardware requirements. The model developed in this work uses associative memory via a transparent, explainable Hopfield neural network to store signals and detect similar signals which can then be used to classify species. Training is rapid (3 ms), as only one representative signal is required for each target sound within a dataset. The model is fast, taking only 5.4 s to pre-process and classify all 10,384 publicly available bat recordings on a standard Apple MacBook Air. The model is also lightweight, with a small memory footprint of 144.09 MB of RAM usage. Hence, the low computational demands make the model ideal for use on a variety of standard personal devices, with potential for deployment in the field via edge-processing devices. It is also competitively accurate, with up to 86% precision on the dataset used to evaluate the model. In fact, we could not find a single case of disagreement between model and manual identification via expert field guides. Although a dataset of bat echolocation calls was chosen to demonstrate this first-of-its-kind AI model, trained on only two representative calls, the model is not species specific. In conclusion, we propose an equitable AI model that has the potential to be a game changer for fast, lightweight, sustainable, transparent, explainable and accurate bioacoustic analysis.
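
The storage-and-retrieval mechanism builds on the classical Hopfield associative memory, which a few lines of NumPy can demonstrate: Hebbian weights store binary patterns, and iterated sign updates pull a corrupted probe back to the nearest stored pattern. The bat-call encoding here is a random stand-in, not the paper's signal representation.

```python
import numpy as np

def hopfield_store(patterns):
    """Hebbian weights for binary (+/-1) patterns; diagonal zeroed."""
    P = np.array(patterns, dtype=float)
    W = P.T @ P / P.shape[1]
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_retrieve(W, probe, steps=10):
    """Iterate sign updates until the state settles on a stored pattern."""
    s = probe.copy()
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1.0, -1.0)
    return s

rng = np.random.default_rng(0)
call_a = np.where(rng.normal(size=64) > 0, 1.0, -1.0)  # stand-in "signature"
call_b = np.where(rng.normal(size=64) > 0, 1.0, -1.0)
W = hopfield_store([call_a, call_b])   # one representative signal per class

noisy = call_a.copy()
noisy[:10] *= -1                       # corrupt part of the recording
recovered = hopfield_retrieve(W, noisy)
print(np.array_equal(recovered, call_a))  # expected True: probe snaps back
```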

[297] A Group Theoretic Analysis of the Symmetries Underlying Base Addition and Their Learnability by Neural Networks

Cutter Dawes, Simon Segert, Kamesh Krishnamurthy, Jonathan D. Cohen

Main category: cs.LG

TL;DR: The paper explores how neural networks can learn symmetry functions, like base addition, to achieve radical generalization, focusing on the role of carry functions and their impact on learning efficiency.

DetailsMotivation: To address the challenge of designing neural networks capable of radical generalization by leveraging symmetry functions, exemplified by base addition.

Method: A group theoretic analysis of base addition is conducted, introducing alternative carry functions. Neural networks are trained with different carries to study inductive biases.

Result: Simple neural networks can achieve radical generalization with appropriate input formats and carry functions, with learning speed tied to carry structure.

Conclusion: The findings highlight the importance of carry function structure in symmetry learning, offering insights for cognitive science and machine learning.

Abstract: A major challenge in the use of neural networks both for modeling human cognitive function and for artificial intelligence is the design of systems with the capacity to efficiently learn functions that support radical generalization. At the roots of this is the capacity to discover and implement symmetry functions. In this paper, we investigate a paradigmatic example of radical generalization through the use of symmetry: base addition. We present a group theoretic analysis of base addition, a fundamental and defining characteristic of which is the carry function – the transfer of the remainder, when a sum exceeds the base modulus, to the next significant place. Our analysis exposes a range of alternative carry functions for a given base, and we introduce quantitative measures to characterize these. We then exploit differences in carry functions to probe the inductive biases of neural networks in symmetry learning, by training neural networks to carry out base addition using different carries, and comparing efficacy and rate of learning as a function of their structure. We find that even simple neural networks can achieve radical generalization with the right input format and carry function, and that learning speed is closely correlated with carry function structure. We then discuss the relevance this has for cognitive science and machine learning.
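
For concreteness, here is the standard carry function the abstract refers to, in a short digitwise adder; the paper's contribution is the analysis of alternative carry functions, which this sketch does not implement.

```python
def add_base(x, y, base=10):
    """Digitwise addition in base `base` with the standard carry function.
    Digit lists are least-significant first, e.g. 27 -> [7, 2] in base 10."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    out, carry = [], 0
    for a, b in zip(x, y):
        s = a + b + carry
        out.append(s % base)   # digit written at this place
        carry = s // base      # standard carry to the next significant place
    if carry:
        out.append(carry)
    return out

assert add_base([7, 2], [5, 3]) == [2, 6]  # 27 + 35 = 62
```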

[298] A Simple Approximate Bayesian Inference Neural Surrogate for Stochastic Petri Net Models

Bright Kwaku Manu, Trevor Reckell, Beckett Sterner, Petar Jevtic

Main category: cs.LG

TL;DR: A neural-surrogate framework using a 1D Convolutional Residual Network is introduced for parameter estimation in Stochastic Petri Nets (SPNs) with covariate-dependent rates, outperforming traditional Bayesian methods in accuracy and speed.

DetailsMotivation: Parameter estimation in SPNs is challenging, especially with covariate-dependent rates and unavailable explicit likelihoods, necessitating a robust, data-driven solution.

Method: A lightweight 1D Convolutional Residual Network is trained on Gillespie-simulated SPN realizations to predict rate-function coefficients from noisy, partially observed data, using Monte Carlo dropout for uncertainty.

Result: The surrogate achieves RMSE = 0.108 on synthetic SPNs with 20% missing events and runs faster than Bayesian methods.

Conclusion: Neural surrogates enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems.

Abstract: Stochastic Petri Nets (SPNs) are an increasingly popular tool of choice for modeling discrete-event dynamics in areas such as epidemiology and systems biology, yet their parameter estimation remains challenging in general and in particular when transition rates depend on external covariates and explicit likelihoods are unavailable. We introduce a neural-surrogate (neural-network–based approximation of the posterior distribution) framework that predicts the coefficients of known covariate-dependent rate functions directly from noisy, partially observed token trajectories. Our model employs a lightweight 1D Convolutional Residual Network trained end-to-end on Gillespie-simulated SPN realizations, learning to invert system dynamics under realistic conditions of event dropout. During inference, Monte Carlo dropout provides calibrated uncertainty bounds together with point estimates. On synthetic SPNs with 20% missing events, our surrogate recovers rate-function coefficients with an RMSE of 0.108 and runs substantially faster than traditional Bayesian approaches. These results demonstrate that data-driven, likelihood-free surrogates can enable accurate, robust, and real-time parameter recovery in complex, partially observed discrete-event systems.
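
The Monte Carlo dropout step mentioned in the abstract is a standard recipe; a minimal sketch follows, with the surrogate architecture abstracted behind `model` (an assumption, not the authors' network).

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 100):
    """Monte Carlo dropout: keep dropout active at inference and sample."""
    model.train()  # enables dropout layers (assumes no batch-norm side effects)
    with torch.no_grad():
        draws = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return draws.mean(dim=0), draws.std(dim=0)  # point estimate and uncertainty
```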

[299] Distributionally Robust Optimization with Adversarial Data Contamination

Shuyao Li, Ilias Diakonikolas, Jelena Diakonikolas

Main category: cs.LG

TL;DR: The paper introduces a method to address outliers and distributional uncertainty in Wasserstein-1 DRO for generalized linear models, achieving an estimation error of O(√ϵ) with contaminated data.

DetailsMotivation: To tackle the dual challenges of data contamination and distributional uncertainty in DRO, which can compromise decision-making effectiveness.

Method: A novel modeling framework integrates robustness against data contamination and distributional shifts, using an efficient algorithm inspired by robust statistics.

Result: The method achieves an estimation error of O(√ϵ) for the true DRO objective value with contaminated data under bounded covariance.

Conclusion: This work provides the first rigorous guarantees for learning under data contamination and distributional shifts, with efficient computation.

Abstract: Distributionally Robust Optimization (DRO) provides a framework for decision-making under distributional uncertainty, yet its effectiveness can be compromised by outliers in the training data. This paper introduces a principled approach to simultaneously address both challenges. We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions, where an $\epsilon$-fraction of the training data is adversarially corrupted. Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts, alongside an efficient algorithm inspired by robust statistics to solve the resulting optimization problem. We prove that our method achieves an estimation error of $O(\sqrt{\epsilon})$ for the true DRO objective value using only the contaminated data under the bounded covariance assumption. This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
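
For orientation, Wasserstein-1 DRO over generalized linear models with an $L_\ell$-Lipschitz loss admits a well-known regularized dual reformulation when the transport cost acts on the features; this situates the objective class the paper optimizes, though the paper's contaminated-data objective and algorithm differ from this clean-data identity.

```latex
\min_{\theta}\; \sup_{Q:\, W_1(Q,\widehat{P}_n)\le \rho} \mathbb{E}_{(x,y)\sim Q}\big[\ell(\theta^{\top}x,\,y)\big]
\;=\;
\min_{\theta}\; \mathbb{E}_{(x,y)\sim \widehat{P}_n}\big[\ell(\theta^{\top}x,\,y)\big] \;+\; \rho\, L_{\ell}\, \lVert \theta \rVert_{*}
```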

[300] Ground-Compose-Reinforce: Tasking Reinforcement Learning Agents through Formal Language

Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith

Main category: cs.LG

TL;DR: A neurosymbolic framework, Ground-Compose-Reinforce, is proposed for grounding formal language in perception and action, enabling efficient learning and generalization without manual design.

DetailsMotivation: To address the challenge of grounding language in complex perception and action for situated agents, avoiding manual design or massive datasets.

Method: Uses a neurosymbolic framework combining data-driven learning with compositional formal language semantics to ground language and elicit behaviors via RL.

Result: Achieves reliable mapping of formal language instructions to behaviors with limited data, outperforming end-to-end data-driven approaches.

Conclusion: The framework efficiently grounds language and generalizes compositions, demonstrating success in gridworld and robotics domains.

Abstract: Grounding language in complex perception (e.g. pixels) and action is a key challenge when building situated agents that can interact with humans via language. In past works, this is often solved via manual design of the language grounding or by curating massive datasets relating language to elements of the environment. We propose Ground-Compose-Reinforce, a neurosymbolic framework for grounding formal language from data, and eliciting behaviours by directly tasking RL agents through this language. By virtue of data-driven learning, our framework avoids the manual design of domain-specific elements like reward functions or symbol detectors. By virtue of compositional formal language semantics, our framework achieves data-efficient grounding and generalization to arbitrary language compositions. Experiments on an image-based gridworld and a MuJoCo robotics domain show that our approach reliably maps formal language instructions to behaviours with limited data while end-to-end, data-driven approaches fail.

[301] A Benchmarking Framework for AI models in Automotive Aerodynamics

Kaustubh Tangsali, Rishikesh Ranade, Mohammad Amin Nabian, Alexey Kamenev, Peter Sharpe, Neil Ashton, Ram Cherukuri, Sanjay Choudhry

Main category: cs.LG

TL;DR: A benchmarking framework in NVIDIA PhysicsNeMo-CFD is introduced to evaluate AI models for automotive aerodynamics, focusing on accuracy, performance, scalability, and generalization.

DetailsMotivation: To standardize and improve the assessment of AI models in automotive aerodynamics, enhancing transparency and consistency for better model development.

Method: The framework incorporates diverse metrics and evaluates three AI models (DoMINO, X-MeshGraphNet, FIGConvNet) using the DrivAerML dataset, with guidelines for extensibility.

Result: Demonstrates utility by assessing surface and volumetric flow field predictions, enabling standardized comparisons.

Conclusion: The framework aims to accelerate research and innovation by aiding in the selection and refinement of AI-driven aerodynamic models.

Abstract: In this paper, we introduce a benchmarking framework within the open-source NVIDIA PhysicsNeMo-CFD framework designed to systematically assess the accuracy, performance, scalability, and generalization capabilities of AI models for automotive aerodynamics predictions. The open, extensible framework enables the incorporation of a diverse set of metrics relevant to the Computer-Aided Engineering (CAE) community. By providing a standardized methodology for comparing AI models, the framework enhances transparency and consistency in performance assessment, with the overarching goal of improving the understanding and development of these models to accelerate research and innovation in the field. To demonstrate its utility, the framework includes evaluation of both surface and volumetric flow field predictions on three AI models: DoMINO, X-MeshGraphNet, and FIGConvNet using the DrivAerML dataset. It also includes guidelines for integrating additional models and datasets, making it extensible for physically consistent metrics. This benchmarking study aims to enable researchers and industry professionals in selecting, refining, and advancing AI-driven aerodynamic modeling approaches, ultimately fostering the development of more efficient, accurate, and interpretable solutions in automotive aerodynamics.

[302] Spatial Reasoners for Continuous Variables in Any Domain

Bart Pogodzinski, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen

Main category: cs.LG

TL;DR: Spatial Reasoners is a framework for spatial reasoning using generative denoising models, simplifying research with easy-to-use interfaces.

DetailsMotivation: To address the high effort required for generative reasoning due to diverse denoising formulations and inference strategies.

Method: Provides interfaces for variable mapping, generative model paradigms, and inference strategies.

Result: An openly available framework facilitating research in spatial reasoning with generative models.

Conclusion: Spatial Reasoners streamlines research in generative spatial reasoning, making it accessible and efficient.

Abstract: We present Spatial Reasoners, a software framework to perform spatial reasoning over continuous variables with generative denoising models. Denoising generative models have become the de-facto standard for image generation, due to their effectiveness in sampling from complex, high-dimensional distributions. Recently, they have started being explored in the context of reasoning over multiple continuous variables. Providing infrastructure for generative reasoning with such models requires a high effort, due to a wide range of different denoising formulations, samplers, and inference strategies. Our presented framework aims to facilitate research in this area, providing easy-to-use interfaces to control variable mapping from arbitrary data domains, generative model paradigms, and inference strategies. Spatial Reasoners are openly available at https://spatialreasoners.github.io/

[303] A Generalizable Physics-Enhanced State Space Model for Long-Term Dynamics Forecasting in Complex Environments

Yuchen Wang, Hongjue Zhao, Haohong Lin, Enze Xu, Lifang He, Huajie Shao

Main category: cs.LG

TL;DR: Phy-SSM integrates partial physics knowledge into state space models for long-term dynamics forecasting in noisy, irregularly sampled environments, outperforming baselines in real-world tasks.

DetailsMotivation: Address the challenge of long-term dynamic forecasting in complex, noisy environments by leveraging SSMs' ability to capture long-range dependencies and incorporating physics knowledge for better generalization.

Method: Decompose partially known system dynamics into known and unknown state matrices, integrate them into a Phy-SSM unit, and introduce a physics state regularization term for alignment with system dynamics.

Result: Superior performance in long-term interpolation and extrapolation tasks across vehicle motion, drone state, and COVID-19 epidemiology forecasting.

Conclusion: Phy-SSM effectively combines physics knowledge with SSMs, enhancing long-term forecasting in complex environments.

Abstract: This work aims to address the problem of long-term dynamic forecasting in complex environments where data are noisy and irregularly sampled. While recent studies have introduced some methods to improve prediction performance, these approaches still face a significant challenge in handling long-term extrapolation tasks under such complex scenarios. To overcome this challenge, we propose Phy-SSM, a generalizable method that integrates partial physics knowledge into state space models (SSMs) for long-term dynamics forecasting in complex environments. Our motivation is that SSMs can effectively capture long-range dependencies in sequential data and model continuous dynamical systems, while the incorporation of physics knowledge improves generalization ability. The key challenge lies in how to seamlessly incorporate partially known physics into SSMs. To achieve this, we decompose partially known system dynamics into known and unknown state matrices, which are integrated into a Phy-SSM unit. To further enhance long-term prediction performance, we introduce a physics state regularization term that aligns the estimated latent states with the system dynamics. Besides, we theoretically analyze the uniqueness of the solutions for our method. Extensive experiments on three real-world applications, including vehicle motion prediction, drone state prediction, and COVID-19 epidemiology forecasting, demonstrate the superior performance of Phy-SSM over the baselines in both long-term interpolation and extrapolation tasks. The code is available at https://github.com/511205787/Phy_SSM-ICML2025.
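
A minimal sketch of the known/unknown decomposition the abstract describes, assuming linear latent dynamics and an explicit Euler step; the names are hypothetical, and the actual Phy-SSM unit, regularization term, and solver are more elaborate.

```python
import torch
import torch.nn as nn

class PhysicsSSMCell(nn.Module):
    """Sketch: latent dynamics dx/dt = (A_known + A_learned) x, Euler-stepped.
    `A_known` encodes the available physics; `A_learned` absorbs the rest."""
    def __init__(self, dim: int, A_known: torch.Tensor):
        super().__init__()
        self.register_buffer("A_known", A_known)              # fixed physics prior
        self.A_learned = nn.Parameter(torch.zeros(dim, dim))  # unknown dynamics

    def forward(self, x: torch.Tensor, dt: float) -> torch.Tensor:
        A = self.A_known + self.A_learned
        return x + dt * x @ A.T  # one explicit Euler step per call
```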

[304] Multi-Armed Sampling Problem and the End of Exploration

Mohammad Pedramfar, Siamak Ravanbakhsh

Main category: cs.LG

TL;DR: The paper introduces multi-armed sampling as a counterpart to multi-armed bandits, focusing on the exploration-exploitation trade-off in sampling. It defines regret notions, establishes lower bounds, and proposes an optimal algorithm. Findings show sampling doesn’t require exploration, unlike optimization. The work connects sampling and bandits via a temperature parameter, offering insights for neural samplers and RLHF.

DetailsMotivation: To rigorously study the exploration-exploitation trade-off in sampling, contrasting it with optimization, and to unify multi-armed sampling and bandit problems.

Method: Defines regret notions for multi-armed sampling, establishes lower bounds, and proposes an algorithm achieving optimal regret. Introduces a continuous problem family with a temperature parameter to bridge sampling and bandits.

Result: Demonstrates that sampling doesn’t require exploration, unlike optimization. Provides optimal regret bounds and connects sampling and bandit problems.

Conclusion: The multi-armed sampling framework is foundational for studying sampling, with implications for neural samplers, entropy-regularized RL, and RLHF. Highlights the role of exploration and algorithm convergence in these areas.

Abstract: This paper introduces the framework of multi-armed sampling, as the sampling counterpart to the optimization problem of multi-armed bandits. Our primary motivation is to rigorously examine the exploration-exploitation trade-off in the context of sampling. We systematically define plausible notions of regret for this framework and establish corresponding lower bounds. We then propose a simple algorithm that achieves these optimal regret bounds. Our theoretical results demonstrate that in contrast to optimization, sampling does not require exploration. To further connect our findings with those of multi-armed bandits, we define a continuous family of problems and associated regret measures that smoothly interpolates and unifies multi-armed sampling and multi-armed bandit problems using a temperature parameter. We believe the multi-armed sampling framework, and our findings in this setting, can have a foundational role in the study of sampling, including recent neural samplers, akin to the role of multi-armed bandits in reinforcement learning. In particular, our work sheds light on the need for exploration and the convergence properties of algorithms for entropy-regularized reinforcement learning, fine-tuning of pretrained models, and reinforcement learning with human feedback (RLHF).

[305] Uncovering Causal Relation Shifts in Event Sequences under Out-of-Domain Interventions

Kazi Tasnim Zinat, Yun Zhou, Xiang Lyu, Yawei Wang, Zhicheng Liu, Panpan Xu

Main category: cs.LG

TL;DR: Proposes a causal framework to handle out-of-domain interventions in temporal sequences, introducing a Transformer-based model for improved ATE estimation.

DetailsMotivation: Existing causal inference methods ignore out-of-domain interventions, which can significantly impact causal dynamics in real-world settings.

Method: A new causal framework defines ATE beyond i.i.d. data, with an unbiased estimator and a Transformer-based model to integrate out-of-domain interventions.

Result: Outperforms baselines in ATE estimation and goodness-of-fit on simulated and real-world datasets.

Conclusion: The proposed method effectively captures causal relation shifts under out-of-domain interventions, improving accuracy in temporal process modeling.

Abstract: Inferring causal relationships between event pairs in a temporal sequence is applicable in many domains such as healthcare, manufacturing, and transportation. Most existing work on causal inference primarily focuses on event types within the designated domain, without considering the impact of exogenous out-of-domain interventions. In real-world settings, these out-of-domain interventions can significantly alter causal dynamics. To address this gap, we propose a new causal framework to define the average treatment effect (ATE) beyond the independent and identically distributed (i.i.d.) data of the classic Rubin causal framework, capturing the causal relation shift between events of a temporal process under out-of-domain intervention. We design an unbiased ATE estimator, and devise a Transformer-based neural network model to handle both long-range temporal dependencies and local patterns while integrating out-of-domain intervention information into process modeling. Extensive experiments on both simulated and real-world datasets demonstrate that our method outperforms baselines in ATE estimation and goodness-of-fit under out-of-domain-augmented point processes.

[306] Collaboration Promotes Group Resilience in Multi-Agent RL

Ilai Shraga, Guy Azran, Matthias Gerstgrasser, Ofir Abu, Jeffrey S. Rosenschein, Sarah Keren

Main category: cs.LG

TL;DR: The paper introduces ‘group resilience’ in multi-agent RL, showing collaboration enhances resilience to environmental changes.

DetailsMotivation: To address resilience in multi-agent settings, as prior work focused only on single-agent scenarios.

Method: Formalized group resilience and tested collaboration protocols in MARL.

Result: Collaborative approaches outperformed non-collaborative ones in achieving group resilience.

Conclusion: Collaboration is key to enhancing group resilience in dynamic multi-agent environments.

Abstract: To effectively operate in various dynamic scenarios, RL agents must be resilient to unexpected changes in their environment. Previous work on this form of resilience has focused on single-agent settings. In this work, we introduce and formalize a multi-agent variant of resilience, which we term group resilience. We further hypothesize that collaboration with other agents is key to achieving group resilience; collaborating agents adapt better to environmental perturbations in multi-agent reinforcement learning (MARL) settings. We test our hypothesis empirically by evaluating different collaboration protocols and examining their effect on group resilience. Our experiments show that all the examined collaborative approaches achieve higher group resilience than their non-collaborative counterparts.

[307] Semantic Context for Tool Orchestration

Robert Müller

Main category: cs.LG

TL;DR: The paper shows Semantic Context (SC) is key for tool orchestration, introducing SC-LinUCB for lower regret, validating SC’s role in LLM learning, and proposing the FiReAct pipeline for large-scale tool orchestration.

DetailsMotivation: To improve tool orchestration by leveraging Semantic Context (SC) for better adaptability and efficiency in dynamic and large action spaces.

Method: Theoretical foundation with SC-LinUCB (contextual bandits), empirical validation with LLMs, and the FiReAct pipeline for SC-based retrieval.

Result: SC-LinUCB reduces regret, SC enhances LLM learning, and FiReAct enables effective orchestration over 10,000 tools.

Conclusion: SC is foundational for efficient, adaptive, and scalable tool orchestration agents.

Abstract: This paper demonstrates that Semantic Context (SC), leveraging descriptive tool information, is a foundational component for robust tool orchestration. Our contributions are threefold. First, we provide a theoretical foundation using contextual bandits, introducing SC-LinUCB and proving it achieves lower regret and adapts favourably in dynamic action spaces. Second, we provide parallel empirical validation with Large Language Models, showing that SC is critical for successful in-context learning in both static (efficient learning) and non-stationary (robust adaptation) settings. Third, we propose the FiReAct pipeline, and demonstrate on a benchmark with over 10,000 tools that SC-based retrieval enables an LLM to effectively orchestrate over a large action space. These findings provide a comprehensive guide to building more sample-efficient, adaptive, and scalable orchestration agents.
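
For background, here is plain disjoint LinUCB with each arm's feature vector taken to be a semantic embedding of its tool description; SC-LinUCB's exact construction and regret analysis are in the paper, so treat this as a generic sketch of the bandit machinery it builds on.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB where each arm's feature vector is a semantic
    embedding of its tool description (the 'semantic context')."""
    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward sums

    def select(self, contexts: np.ndarray) -> int:
        """`contexts` has shape (n_arms, dim): one embedding per tool."""
        scores = []
        for a, x in enumerate(contexts):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```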

[308] From Small to Large: A Graph Convolutional Network Approach for Solving Assortment Optimization Problems

Guokai Li, Pin Gao, Stefanus Jasin, Zizhuo Wang

Main category: cs.LG

TL;DR: The paper proposes using Graph Convolutional Networks (GCNs) to solve constrained assortment optimization efficiently, achieving high performance even for large-scale instances.

DetailsMotivation: Assortment optimization is NP-hard and challenging due to its combinatorial and non-linear nature, requiring efficient solutions.

Method: Develop a graph representation of the problem, train a GCN to learn optimal assortment patterns, and propose two inference policies.

Result: GCN-based policies achieve 90%+ optimality on large-scale instances (up to 2,000 products) quickly, outperforming existing heuristics.

Conclusion: The GCN approach is effective and scalable, even in model-free settings with unknown choice models.

Abstract: Assortment optimization involves selecting a subset of substitutable products (subject to certain constraints) to maximize the expected revenue. It is a classic problem in revenue management and finds applications across various industries. However, the problem is usually NP-hard due to its combinatorial and non-linear nature. In this work, we explore how graph convolutional networks (GCNs) can be leveraged to efficiently solve constrained assortment optimization under the mixed multinomial logit choice model. We first develop a graph representation of the assortment problem, then train a GCN to learn the patterns of optimal assortments, and lastly propose two inference policies based on the GCN’s output. Due to the GCN’s inherent ability to generalize across inputs of varying sizes, we can use a GCN trained on small-scale instances to facilitate large-scale instances. Extensive numerical experiments demonstrate that given a GCN trained on small-scale instances (e.g., with 20 products), the proposed policies can achieve superior performance (90%+ optimality) on large-scale instances (with up to 2,000 products) within seconds, which outperform existing heuristic policies in both performance and efficiency. Furthermore, we extend our framework to a model-free setting where the underlying choice model is unknown but transaction data is available. We also conduct numerical experiments to demonstrate the effectiveness and efficiency of our proposed policies in this setting.

[309] Offline Reinforcement Learning with Wasserstein Regularization via Optimal Transport Maps

Motoki Omura, Yusuke Mukuta, Kazuki Ota, Takayuki Osa, Tatsuya Harada

Main category: cs.LG

TL;DR: Proposes a Wasserstein distance-based method for offline RL to address distributional shift, using ICNNs for stable learning without adversarial training.

DetailsMotivation: Distributional shift in offline RL leads to unreliable actions; current methods use density ratio-based measures, but Wasserstein distance offers robustness.

Method: Uses input-convex neural networks (ICNNs) to model optimal transport maps, computing Wasserstein distance without adversarial training.

Result: Achieves comparable or superior performance on the D4RL benchmark dataset.

Conclusion: The method is effective for offline RL, avoiding adversarial training while maintaining performance.

Abstract: Offline reinforcement learning (RL) aims to learn an optimal policy from a static dataset, making it particularly valuable in scenarios where data collection is costly, such as robotics. A major challenge in offline RL is distributional shift, where the learned policy deviates from the dataset distribution, potentially leading to unreliable out-of-distribution actions. To mitigate this issue, regularization techniques have been employed. While many existing methods utilize density ratio-based measures, such as the $f$-divergence, for regularization, we propose an approach that utilizes the Wasserstein distance, which is robust to out-of-distribution data and captures the similarity between actions. Our method employs input-convex neural networks (ICNNs) to model optimal transport maps, enabling the computation of the Wasserstein distance in a discriminator-free manner, thereby avoiding adversarial training and ensuring stable learning. Our approach demonstrates comparable or superior performance to widely used existing methods on the D4RL benchmark dataset. The code is available at https://github.com/motokiomura/Q-DOT.
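
As background, the input-convex network scaffold the method builds on: convexity in the input is preserved by keeping the z-to-z weights nonnegative and using a convex, nondecreasing activation. This is a generic ICNN sketch, not the paper's transport-map parameterization; the optimal transport map is then typically taken as the gradient of such a convex potential.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Scalar potential that is convex in x: z-to-z weights are clamped
    nonnegative and softplus is convex and nondecreasing."""
    def __init__(self, in_dim: int, hidden: int, depth: int = 2):
        super().__init__()
        self.Wx = nn.ModuleList(
            [nn.Linear(in_dim, hidden) for _ in range(depth + 1)])
        self.Wz = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(depth)])
        self.out = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.softplus(self.Wx[0](x))
        for Wz, Wx in zip(self.Wz, self.Wx[1:]):
            # clamp z-weights to be nonnegative to preserve convexity
            z = F.softplus(F.linear(z, Wz.weight.clamp(min=0)) + Wx(x))
        # output layer must also use nonnegative weights on z
        return F.linear(z, self.out.weight.clamp(min=0), self.out.bias)
```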

[310] Visually grounded emotion regulation via diffusion models and user-driven reappraisal

Edoardo Pinzuti, Oliver Tüscher, André Ferreira Castro

Main category: cs.LG

TL;DR: The paper introduces a visually based augmentation of cognitive reappraisal using text-to-image diffusion models, showing that AI-generated visual feedback enhances emotion regulation by reducing negative affect.

DetailsMotivation: Traditional cognitive reappraisal methods are cognitively demanding and rely on verbal processes, which can be ineffective for individuals with trauma or depression. The study aims to improve reappraisal by integrating visual feedback.

Method: A system was developed where users reinterpret negative images via spoken reappraisals, transformed into supportive visualizations using fine-tuned stable diffusion models. A within-subject experiment (N=20) tested this approach with a modified CER task.

Result: AI-assisted reappraisal significantly reduced negative affect compared to non-AI and control conditions. Sentiment alignment between reappraisals and generated images correlated with affective relief.

Conclusion: Generative visual input supports cognitive reappraisal, offering new possibilities for integrating AI, affective computing, and therapeutic technology.

Abstract: Cognitive reappraisal is a key strategy in emotion regulation, involving reinterpretation of emotionally charged stimuli to alter affective responses. Despite its central role in clinical and cognitive science, real-world reappraisal interventions remain cognitively demanding, abstract, and primarily verbal. This reliance on higher-order cognitive and linguistic processes is often impaired in individuals with trauma or depression, limiting the effectiveness of standard approaches. Here, we propose a novel, visually based augmentation of cognitive reappraisal by integrating large-scale text-to-image diffusion models into the emotional regulation process. Specifically, we introduce a system in which users reinterpret emotionally negative images via spoken reappraisals, which are transformed into supportive, emotionally congruent visualizations using stable diffusion models with a fine-tuned IP-adapter. This generative transformation visually instantiates users’ reappraisals while maintaining structural similarity to the original stimuli, externalizing and reinforcing regulatory intent. To test this approach, we conducted a within-subject experiment (N = 20) using a modified cognitive emotion regulation (CER) task. Participants reappraised or described aversive images from the International Affective Picture System (IAPS), with or without AI-generated visual feedback. Results show that AI-assisted reappraisal significantly reduced negative affect compared to both non-AI and control conditions. Further analyses reveal that sentiment alignment between participant reappraisals and generated images correlates with affective relief, suggesting that multimodal coherence enhances regulatory efficacy. These findings demonstrate that generative visual input can support cognitive reappraisal and open new directions at the intersection of generative AI, affective computing, and therapeutic technology.

[311] GALDS: A Graph-Autoencoder-based Latent Dynamics Surrogate model to predict neurite material transport

Tsung Yeh Hsieh, Yongjie Jessica Zhang

Main category: cs.LG

TL;DR: The paper introduces GALDS, a Graph-Autoencoder-based Latent Dynamics Surrogate model, to efficiently simulate material transport in neural trees, achieving high accuracy and speed.

DetailsMotivation: The complex geometry of neurite networks makes material transport simulation computationally challenging, requiring optimization beyond traditional methods.

Method: GALDS uses a graph autoencoder to encode network geometry, velocity fields, and concentration profiles into latent representations, predicting dynamics via a graph latent space model inspired by Neural ODEs.

Result: GALDS achieves a mean relative error of 3%, max error <8%, and a 10x speed improvement over previous surrogate models on test cases.

Conclusion: GALDS offers an efficient, accurate solution for simulating material transport in neural trees, leveraging latent representations and Neural ODEs.

Abstract: Neurons exhibit intricate geometries within their neurite networks, which play a crucial role in processes such as signaling and nutrient transport. Accurate simulation of material transport in the networks is essential for understanding these biological phenomena but poses significant computational challenges because of the complex tree-like structures involved. Traditional approaches are time-intensive and resource-demanding, yet the inherent properties of neuron trees, which consist primarily of pipes with steady-state parabolic velocity profiles and bifurcations, provide opportunities for computational optimization. To address these challenges, we propose a Graph-Autoencoder-based Latent Dynamics Surrogate (GALDS) model, which is specifically designed to streamline the simulation of material transport in neural trees. GALDS employs a graph autoencoder to encode latent representations of the network’s geometry, velocity fields, and concentration profiles. These latent space representations are then assembled into a global graph, which is subsequently used to predict system dynamics in the latent space via a trained graph latent-space system dynamics model, inspired by the Neural Ordinary Differential Equations (Neural ODEs) concept. The integration of an autoencoder allows for the use of smaller graph neural network models with reduced training data requirements. Furthermore, the Neural ODE component effectively mitigates the issue of error accumulation commonly encountered in recurrent neural networks. The effectiveness of the GALDS model is demonstrated through results on eight unseen geometries and four abnormal transport examples, where our approach achieves a mean relative error of 3% with a maximum relative error below 8% and demonstrates a 10-fold speed improvement compared to previous surrogate model approaches.

[312] Domain-Adaptive Small Language Models for Structured Tax Code Prediction

Souvik Nath, Sumit Wadhwa, Luiz Perez

Main category: cs.LG

TL;DR: A domain-adaptive small language model (SLM) with encoder-decoder architecture is proposed for predicting hierarchical tax codes, outperforming flat classifiers and other architectures.

DetailsMotivation: Multinational firms face challenges in accurately determining tax codes (e.g., HSN, SAC) due to varying regulations, necessitating a robust solution to avoid penalties.

Method: An encoder-decoder SLM is used to predict hierarchical tax codes from unstructured product/service data, capturing dependencies within the codes.

Result: The SLM outperforms flat classifiers, decoder-only, and encoder-only architectures in predicting structured tax codes like HSN.

Conclusion: The proposed SLM is effective for tax code prediction and scalable to other government-mandated codes like UNSPSC or NCM.

Abstract: Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC, is a major use case in tax compliance. An accurate determination of such codes is imperative to avoid any tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences using unstructured product and service data. We employ an SLM based upon an encoder-decoder architecture, as this enables sequential generation of tax codes to capture the hierarchical dependencies present within the tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of the domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve superior results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC), or Brazil’s Nomenclatura Comum do Mercosul (NCM).

[313] Learning from Imperfect Data: Robust Inference of Dynamic Systems using Simulation-based Generative Model

Hyunwoo Cho, Hyeontae Jo, Hyung Ju Hwang

Main category: cs.LG

TL;DR: SiGMoID is a simulation-based generative model for inferring nonlinear dynamic systems from noisy, sparse, or partial data, combining physics-informed neural networks and Wasserstein GANs for robust parameter estimation and noise quantification.

DetailsMotivation: Inferring nonlinear dynamic models from imperfect data is challenging, especially with noise, sparsity, or partial observability. SiGMoID addresses this gap.

Method: Integrates physics-informed neural networks (for ODE solving) and Wasserstein GANs (for parameter estimation and noise handling).

Result: SiGMoID successfully quantifies noise, estimates parameters, and infers unobserved components, validated in realistic experiments.

Conclusion: SiGMoID is effective for dynamic system inference, with broad applicability in scientific and engineered systems.

Abstract: System inference for nonlinear dynamic models, represented by ordinary differential equations (ODEs), remains a significant challenge in many fields, particularly when the data are noisy, sparse, or partially observable. In this paper, we propose a Simulation-based Generative Model for Imperfect Data (SiGMoID) that enables precise and robust inference for dynamic systems. The proposed approach integrates two key methods: (1) physics-informed neural networks with hyper-networks that construct an ODE solver, and (2) Wasserstein generative adversarial networks that estimate ODE parameters by effectively capturing noisy data distributions. We demonstrate that SiGMoID quantifies data noise, estimates system parameters, and infers unobserved system components. Its effectiveness is validated through realistic experimental examples, showcasing its broad applicability in various domains, from scientific research to engineered systems, and enabling the discovery of full system dynamics.

[314] How to Protect Models against Adversarial Unlearning?

Patryk Jasiorski, Marek Klonowski, Michał Woźniak

Main category: cs.LG

TL;DR: The paper explores adversarial unlearning, where malicious unlearn requests degrade model performance, and proposes a method to protect models from such effects.

DetailsMotivation: AI models require unlearning for legal compliance (e.g., GDPR) and to address issues like toxic content or data shifts, but unlearning can harm performance. Adversarial unlearning exacerbates this.

Method: Investigates adversarial unlearning, analyzing factors like model backbone and unlearning strategies. Proposes a protection method against performance degradation.

Result: Demonstrates adversarial unlearning’s impact and introduces a method to mitigate performance loss from unlearning, whether spontaneous or adversarial.

Conclusion: The proposed method safeguards model performance against unlearning side effects, addressing both natural and adversarial scenarios.

Abstract: AI models need to be unlearned to fulfill the requirements of legal acts such as the AI Act or GDPR, and also to remove toxic content, debias models, mitigate the impact of malicious instances, or adapt to changes in the structure of the data distribution in which a model works. Unfortunately, removing knowledge may cause undesirable side effects, such as a deterioration in model performance. In this paper, we investigate the problem of adversarial unlearning, where a malicious party intentionally sends unlearn requests to deteriorate the model’s performance maximally. We show that this phenomenon and the adversary’s capabilities depend on many factors, primarily on the backbone model itself and the strategy/limitations in selecting data to be unlearned. The main result of this work is a new method of protecting model performance from these side effects, both in the case of unlearned behavior resulting from spontaneous processes and from adversary actions.

[315] Outbound Modeling for Inventory Management

Riccardo Savorgnan, Udaya Ghai, Carson Eisenach, Dean Foster

Main category: cs.LG

TL;DR: The paper addresses forecasting inventory drain and shipping costs for RL-based regional planning, proposing a probabilistic model and validation scheme for robustness.

DetailsMotivation: Accurate modeling of inventory drain and shipping costs is crucial for RL-based control policies in regional inventory planning, but existing methods are non-differentiable and inefficient.

Method: A probabilistic forecasting model is developed to predict joint distributions of drain and shipping costs, conditioned on inventory and demand, with a validation scheme for RL-induced scenarios.

Result: Preliminary results show the model’s accuracy in in-distribution settings.

Conclusion: The proposed model and validation scheme offer a scalable and differentiable solution for RL-based inventory planning.

Abstract: We study the problem of forecasting the number of units fulfilled (or "drained") from each inventory warehouse to meet customer demand, along with the associated outbound shipping costs. The actual drain and shipping costs are determined by complex production systems that manage the planning and execution of the fulfillment of customers’ orders, i.e. from where and how to ship a unit to be delivered to a customer. Accurately modeling these processes is critical for regional inventory planning, especially when using Reinforcement Learning (RL) to develop control policies. For the RL use case, a drain model is incorporated into a simulator to produce long rollouts, which we desire to be differentiable. While simulated calls to the internal software systems can be used to recover this transition, they are non-differentiable and too slow and costly to run within an RL training environment. Accordingly, we frame this as a probabilistic forecasting problem, modeling the joint distribution of outbound drain and shipping costs across all warehouses at each time period, conditioned on inventory positions and exogenous customer demand. To ensure robustness in an RL environment, the model must handle out-of-distribution scenarios that arise from off-policy trajectories. We propose a validation scheme that leverages production systems to evaluate the drain model on counterfactual inventory states induced by RL policies. Preliminary results demonstrate the model’s accuracy within the in-distribution setting.

[316] Class-Proportional Coreset Selection for Difficulty-Separable Data

Elisa Tsai, Haizhong Zheng, Atul Prakash

Main category: cs.LG

TL;DR: The paper introduces the Class Difficulty Separability Coefficient (CDSC) to measure class-wise data difficulty variation and proposes class-proportional coreset methods, improving data efficiency and performance in high-stakes domains.

DetailsMotivation: Existing coreset methods assume class-wise homogeneity in data difficulty, overlooking variations across classes, which can degrade performance in domains like network intrusion detection and medical imaging.

Method: The authors formalize class-difficulty separability with CDSC and develop class-proportional variants of sampling strategies, evaluated on five datasets.

Result: Class-proportional methods outperform class-agnostic ones, e.g., CCS-CP shows minimal performance drops (2.58% accuracy loss) at 99% pruning, while baselines suffer larger declines (7.59% accuracy loss).

Conclusion: Explicitly modeling class-difficulty separability enhances data pruning effectiveness, robustness, and generalization, especially in critical applications.

Abstract: High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent easy majority classes while neglecting rare but informative ones. To address this, we introduce class-proportional variants of multiple sampling strategies. Evaluated on five diverse datasets spanning security and medical domains, our methods consistently achieve state-of-the-art data efficiency. For instance, on CTU-13, at an extreme 99% pruning rate, a class-proportional variant of Coverage-centric Coreset Selection (CCS-CP) shows remarkable stability, with accuracy dropping only 2.58%, precision 0.49%, and recall 0.19%. In contrast, the class-agnostic CCS baseline, the next best method, suffers sharper declines of 7.59% in accuracy, 4.57% in precision, and 4.11% in recall. We further show that aggressive pruning enhances generalization in noisy, imbalanced, and large-scale datasets. Our results underscore that explicitly modeling class-difficulty separability leads to more effective, robust, and generalizable data pruning, particularly in high-stakes scenarios.
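
One plausible reading of the class-proportional idea, as a minimal sketch: allocate the keep budget per class rather than globally. The function name and its top-k-hardest rule are illustrative assumptions; the paper adapts several sampling strategies (e.g., CCS) within such a per-class budget rather than a single top-k cut.

```python
import numpy as np

def class_proportional_coreset(scores, labels, keep_frac):
    """Keep `keep_frac` of EACH class, taking the highest-difficulty examples
    per class, instead of one global cut across all classes."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(round(keep_frac * len(idx))))   # per-class budget
        top = idx[np.argsort(scores[idx])[-k:]]        # hardest k within class c
        keep.append(top)
    return np.concatenate(keep)
```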

[317] Diffusion Decoding for Peptide De Novo Sequencing

Chi-en Amy Tai, Alexander Wong

Main category: cs.LG

TL;DR: The paper explores diffusion decoders for peptide de novo sequencing, finding potential improvements in amino acid recall over traditional autoregressive methods.

DetailsMotivation: Traditional deep learning methods like Casanovo suffer from cascading errors and inefficient use of high-confidence regions.

Method: The study tests three diffusion decoder designs, knapsack beam search, and various loss functions, comparing them to autoregressive decoders.

Result: The best diffusion decoder with DINOISER loss improved amino acid recall by 0.373, though peptide precision and recall remained 0.

Conclusion: Diffusion decoders show promise for enhancing sensitivity and advancing peptide de novo sequencing.

Abstract: Peptide de novo sequencing is a method used to reconstruct amino acid sequences from tandem mass spectrometry data without relying on existing protein sequence databases. Traditional deep learning approaches, such as Casanovo, mainly utilize autoregressive decoders and predict amino acids sequentially. As a result, they encounter cascading errors and fail to leverage high-confidence regions effectively. To address these issues, this paper investigates using diffusion decoders adapted for the discrete data domain. These decoders provide a different approach, allowing sequence generation to start from any peptide segment, thereby enhancing prediction accuracy. We experiment with three different diffusion decoder designs, knapsack beam search, and various loss functions. We find that knapsack beam search did not improve performance metrics and that simply replacing the transformer decoder with a diffusion decoder lowered performance. Although peptide precision and recall were still 0, the best diffusion decoder design with the DINOISER loss function obtained a statistically significant improvement in amino acid recall by 0.373 compared to the baseline autoregressive decoder-based Casanovo model. These findings highlight the potential of diffusion decoders to not only enhance model sensitivity but also drive significant advancements in peptide de novo sequencing.

[318] Physics-Informed Neural Networks For Semiconductor Film Deposition: A Review

Tao Han, Zahra Taheri, Hyunwoong Ko

Main category: cs.LG

TL;DR: The paper reviews ML applications, especially Physics-Informed Neural Networks (PINNs), for improving semiconductor film deposition processes, identifying trends, gaps, and proposing future research directions.

DetailsMotivation: To address challenges in semiconductor film deposition (e.g., control, quality, predictive modeling) using ML, particularly PINNs, for better precision and efficiency.

Method: Thematic analysis of ML applications in film deposition, focusing on PINNs, their integration of physical laws, and neural network architectures.

Result: Identified key trends, limitations, and gaps in current ML methodologies, proposing novel directions for PINN integration.

Conclusion: PINNs offer significant potential to enhance film deposition processes, with future research needed to address gaps and improve scalability and efficiency.

Abstract: Semiconductor manufacturing relies heavily on film deposition processes, such as Chemical Vapor Deposition and Physical Vapor Deposition. These complex processes require precise control to achieve film uniformity, proper adhesion, and desired functionality. Recent advancements in Physics-Informed Neural Networks (PINNs), an innovative machine learning (ML) approach, have shown significant promise in addressing challenges related to process control, quality assurance, and predictive modeling within semiconductor film deposition and other manufacturing domains. This paper provides a comprehensive review of ML applications targeted at semiconductor film deposition processes. Through a thematic analysis, we identify key trends, existing limitations, and research gaps, offering insights into both the advantages and constraints of current methodologies. Our structured analysis aims to highlight the potential integration of these ML techniques to enhance interpretability, accuracy, and robustness in film deposition processes. Additionally, we examine state-of-the-art PINN methods, discussing strategies for embedding physical knowledge, governing laws, and partial differential equations into advanced neural network architectures tailored for semiconductor manufacturing. Based on this detailed review, we propose novel research directions that integrate the strengths of PINNs to significantly advance film deposition processes. The contributions of this study include establishing a clear pathway for future research in integrating physics-informed ML frameworks, addressing existing methodological gaps, and ultimately improving precision, scalability, and operational efficiency within semiconductor manufacturing.

[319] StellarF: A Lora-Adapter Integrated Large Model Framework for Stellar Flare Forecasting with Historical & Statistical Data

Tianyu Su, Zhiqiang Zou, Ali Luo, Xiao Kong, Qingyu Lu, Min Li

Main category: cs.LG

TL;DR: StellarF introduces a parameter-efficient model for stellar flare forecasting using LoRA and Adapter techniques, outperforming existing methods on Kepler and TESS datasets.

DetailsMotivation: The sparsity of recorded flare events and lack of domain-specific large-scale models hinder stellar flare forecasting.

Method: StellarF combines a flare statistical module with a historical flare record module for multi-scale pattern recognition, using LoRA and Adapter techniques.

Result: StellarF achieves state-of-the-art performance on self-constructed datasets from Kepler and TESS.

Conclusion: The model provides a novel framework for advancing astrophysical research and cross-disciplinary applications.

Abstract: Stellar flare forecasting, a critical research frontier in astronomy, offers profound insights into stellar activity. However, the field is constrained by both the sparsity of recorded flare events and the absence of domain-specific large-scale predictive models. To address these challenges, this study introduces StellarF (Stellar Flare Forecasting), a novel large model that leverages Low-Rank Adaptation (LoRA) and Adapter techniques for parameter-efficient learning in stellar flare forecasting. At its core, StellarF integrates a flare statistical information module with a historical flare record module, enabling multi-scale pattern recognition from observational data. Extensive experiments on our self-constructed datasets (derived from Kepler and TESS light curves) demonstrate that StellarF achieves state-of-the-art performance compared to existing methods. The proposed prediction paradigm establishes a novel methodological framework for advancing astrophysical research and cross-disciplinary applications.
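
For background, the generic LoRA update the abstract refers to, as a minimal sketch; how StellarF attaches LoRA and Adapter modules to its backbone, and its flare-statistics inputs, are specific to the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the large pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r      # so the update starts at exactly zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```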

[320] High-Throughput Distributed Reinforcement Learning via Adaptive Policy Synchronization

Rodney Lafuente-Mercado

Main category: cs.LG

TL;DR: ClusterEnv is a lightweight, modular interface for distributed RL environment execution, decoupling simulation from training with the DETACH pattern and addressing policy staleness via AAPS.

DetailsMotivation: Existing RL frameworks entangle simulation, learning, and orchestration, limiting modularity and reusability.

Method: ClusterEnv uses the DETACH pattern to offload simulation to remote workers, centralizing learning, and introduces AAPS for adaptive policy synchronization.

Result: Experiments show AAPS achieves high sample efficiency with fewer weight updates, integrating seamlessly into existing pipelines.

Conclusion: ClusterEnv offers a modular, efficient solution for distributed RL, reducing synchronization overhead without performance loss.

Abstract: Scaling reinforcement learning (RL) workloads often requires distributing environment simulation across compute clusters. Existing frameworks entangle simulation, learning logic, and orchestration into monolithic systems, limiting modularity and reusability. We present ClusterEnv, a lightweight, learner-agnostic interface for distributed environment execution that mirrors the Gymnasium API. ClusterEnv introduces the DETACH pattern, which decouples simulation from training by offloading reset() and step() operations to remote workers while keeping learning centralized. To address policy staleness in distributed execution, we propose Adaptive Actor Policy Synchronization (AAPS), a divergence-triggered update mechanism that reduces synchronization overhead without sacrificing performance. ClusterEnv integrates cleanly into existing RL pipelines, supports both on-policy and off-policy methods, and requires minimal code changes. Experiments on discrete control tasks demonstrate that AAPS achieves high sample efficiency with significantly fewer weight updates. Source code is available at https://github.com/rodlaf/ClusterEnv.
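
A minimal sketch of the divergence-gated pattern behind AAPS: a worker pulls fresh weights only when its stale policy has drifted measurably from the learner's. The trigger statistic, probe inputs, and threshold here are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def should_sync(worker_probs: np.ndarray, learner_probs: np.ndarray,
                threshold: float = 0.05) -> bool:
    """Divergence-triggered sync: pull fresh weights only when the worker's
    action distribution has drifted too far from the learner's."""
    eps = 1e-8
    kl = float(np.sum(worker_probs * np.log((worker_probs + eps) /
                                            (learner_probs + eps))))
    return kl > threshold
```

In such a scheme, each worker would evaluate the trigger on a small probe batch per rollout and request a weight update only when it fires, which is what cuts synchronization traffic relative to syncing every step.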

[321] Misalignment from Treating Means as Ends

Henrik Marklund, Alex Infanger, Benjamin Van Roy

Main category: cs.LG

TL;DR: The paper highlights how reward functions in reinforcement learning often conflate terminal and instrumental goals, leading to misalignment and poor performance.

DetailsMotivation: To address the problem of reward functions inaccurately representing human goals due to the conflation of terminal (ends) and instrumental (means) goals.

Method: The authors formulate a simple example to demonstrate how slight conflation of these goals causes severe misalignment and analyze environments sensitive to this issue.

Result: Optimizing misspecified reward functions results in poor performance when evaluated against the true reward function.

Conclusion: The paper underscores the sensitivity of reinforcement learning to goal conflation and discusses implications for reward learning and real-world environments.

Abstract: Reward functions, learned or manually specified, are rarely perfect. Instead of accurately expressing human goals, these reward functions are often distorted by human beliefs about how best to achieve those goals. Specifically, these reward functions often express a combination of the human’s terminal goals – those which are ends in themselves – and the human’s instrumental goals – those which are means to an end. We formulate a simple example in which even slight conflation of instrumental and terminal goals results in severe misalignment: optimizing the misspecified reward function results in poor performance when measured by the true reward function. This example distills the essential properties of environments that make reinforcement learning highly sensitive to conflation of instrumental and terminal goals. We discuss how this issue can arise with a common approach to reward learning and how it can manifest in real environments.

[322] First-Order Error Matters: Accurate Compensation for Quantized Large Language Models

Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu

Main category: cs.LG

TL;DR: FOEM is a novel PTQ method for LLMs that incorporates first-order gradient terms to improve quantization error compensation, outperforming existing methods like GPTQ.

DetailsMotivation: Existing PTQ methods assume negligible first-order terms in quantization error, but this assumption is flawed due to accumulated deviations. FOEM addresses this by explicitly including first-order gradients.

Method: FOEM approximates gradients by computing differences between latent and full-precision weights, avoiding costly backpropagation. It uses precomputed Cholesky factors for efficient Hessian submatrix inversion.

Result: FOEM reduces perplexity by 89.6% for Llama3-8B and improves MMLU accuracy from 51.7% to 74.9% for Llama3-70B, nearing full-precision performance. It also integrates well with advanced techniques like GPTAQ and SpinQuant.

Conclusion: FOEM effectively addresses the limitations of existing PTQ methods, offering superior performance with minimal computational overhead and broad applicability.

Abstract: Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
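
As a rough illustration of the first-order idea, the sketch below adds a gradient term, approximated by the latent-minus-full-precision weight difference, to a GPTQ-style column-wise compensation loop. Where exactly the first-order term enters the update and the round-to-nearest quantizer are assumptions here; only the overall shape follows the abstract.

```python
import numpy as np

def quantize(w, scale=0.05):
    # Uniform round-to-nearest quantization (illustrative).
    return np.round(w / scale) * scale

def first_order_compensation(W_full, H_inv, scale=0.05):
    """GPTQ-style column-wise compensation plus a first-order term in the
    spirit of FOEM. W_full: (out, in) weights; H_inv: (in, in) inverse
    Hessian of the layer inputs (precomputed, e.g. via Cholesky factors)."""
    W = W_full.copy()  # latent weights, updated in place as columns quantize
    for j in range(W.shape[1]):
        q = quantize(W[:, j], scale)
        # Second-order error term, as in classical GPTQ.
        err2 = (W[:, j] - q) / H_inv[j, j]
        # First-order term: gradient approximated by the accumulated
        # deviation of latent from full-precision weights (the FOEM idea;
        # how it enters the update is an assumption in this sketch).
        err1 = W[:, j] - W_full[:, j]
        W[:, j] = q
        W[:, j + 1:] -= np.outer(err2 + err1, H_inv[j, j + 1:])
    return W
```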

[323] Crafting Imperceptible On-Manifold Adversarial Attacks for Tabular Data

Zhipeng He, Alexander Stevens, Chun Ouyang, Johannes De Smedt, Alistair Barros, Catarina Moreira

Main category: cs.LG

TL;DR: The paper proposes a VAE-based framework for generating imperceptible adversarial examples on tabular data, addressing challenges like heterogeneous features and distributional deviations.

DetailsMotivation: Adversarial attacks on tabular data face unique challenges due to mixed categorical and numerical features, lacking intuitive similarity metrics and often producing detectable outliers.

Method: A mixed-input Variational Autoencoder (VAE) integrates categorical and numerical features into a latent manifold, enabling perturbations that preserve statistical consistency.

Result: The method achieves lower outlier rates and consistent performance across datasets and models, with In-Distribution Success Rate (IDSR) validating indistinguishability.

Conclusion: VAE-based attacks are practical for tabular data when reconstruction quality is high, emphasizing the importance of on-manifold perturbations for realistic adversarial examples.

Abstract: Adversarial attacks on tabular data present fundamental challenges distinct from image or text domains due to the heterogeneous nature of mixed categorical and numerical features. Unlike images where pixel perturbations maintain visual similarity, tabular data lacks intuitive similarity metrics, making it difficult to define imperceptible modifications. Additionally, traditional gradient-based methods prioritise $\ell_p$-norm constraints, often producing adversarial examples that deviate from the original data distributions, making them detectable. We propose a latent space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate imperceptible adversarial examples. The proposed VAE integrates categorical embeddings and numerical features into a unified latent manifold, enabling perturbations that preserve statistical consistency. We specify In-Distribution Success Rate (IDSR) to measure the proportion of adversarial examples that remain statistically indistinguishable from the input distribution. Evaluation across six publicly available datasets and three model architectures demonstrates that our method achieves substantially lower outlier rates and more consistent performance compared to traditional input-space attacks and other VAE-based methods adapted from image domain approaches. Our comprehensive analysis includes hyperparameter sensitivity, sparsity control mechanisms, and generative architectural comparisons, revealing that VAE-based attacks depend critically on reconstruction quality but offer superior practical utility when sufficient training data is available. This work highlights the importance of on-manifold perturbations for realistic adversarial attacks on tabular data, offering a robust approach for practical deployment. The source code can be accessed through https://github.com/ZhipengHe/VAE-TabAttack.
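
A minimal sketch of the latent-space perturbation loop, assuming a trained VAE with encode/decode methods and a differentiable classifier clf; these interfaces are invented for illustration, not the paper's API.

```python
import torch

def latent_attack(vae, clf, x, y, steps=50, lr=0.05, eps=0.5):
    """Perturb the VAE latent code so the decoded sample fools the
    classifier, keeping the adversarial example on the learned manifold."""
    with torch.no_grad():
        z0 = vae.encode(x)  # assumed to return the latent mean
    delta = torch.zeros_like(z0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = vae.decode(z0 + delta)
        # Maximize classifier loss on the true label.
        loss = -torch.nn.functional.cross_entropy(clf(x_adv), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():  # keep the perturbation small in latent space
            delta.clamp_(-eps, eps)
    return vae.decode(z0 + delta).detach()
```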

[324] AdaMuon: Adaptive Muon Optimizer

Chongjie Si, Debing Zhang, Wei Shen

Main category: cs.LG

TL;DR: AdaMuon is an adaptive learning-rate framework enhancing Muon with per-parameter second-moment modulation and RMS-aligned rescaling, outperforming Muon in convergence and stability.

DetailsMotivation: To improve the efficiency and adaptability of the Muon optimizer for large-scale model training.

Method: AdaMuon adds two modules: per-parameter second-moment modulation for update-level adaptivity and RMS-aligned rescaling for regulating update magnitude.

Result: Empirical results show AdaMuon outperforms Muon in convergence speed and training stability across various model scales and learning-rate regimes.

Conclusion: AdaMuon enhances Muon without extra tuning, offering seamless integration and superior performance.

Abstract: We propose AdaMuon, an adaptive learning-rate framework built upon the recently validated Muon optimizer, which has demonstrated substantial efficiency gains over AdamW in large-scale model training. AdaMuon augments Muon with two mutually dependent modules: (1) a per-parameter second-moment modulation that captures orthogonal gradient updates to ensure update-level adaptivity, and (2) an RMS-aligned rescaling that regulates the overall update magnitude by aligning it with the intrinsic structure of the parameter space. Empirical results on multiple model scales and learning-rate regimes confirm that AdaMuon consistently outperforms the original Muon, delivering faster convergence while maintaining training stability. Our method introduces no additional tuning burden and can be seamlessly integrated into existing Muon training pipelines.
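
For intuition, here is a sketch of how the two modules might wrap a Muon-style orthogonalized update; the exact formulas and the RMS target are assumptions, not the paper's definitions.

```python
import torch

def adamuon_like_step(update, exp_avg_sq, lr, beta2=0.999, eps=1e-8,
                      target_rms=0.2):
    """Sketch of the two AdaMuon ingredients applied to a Muon-style
    orthogonalized update `update`. All formulas are illustrative guesses.

    exp_avg_sq: persistent second-moment state of the same shape as update.
    """
    # Per-parameter second-moment modulation (Adam-like element-wise scaling).
    exp_avg_sq.mul_(beta2).addcmul_(update, update, value=1 - beta2)
    modulated = update / (exp_avg_sq.sqrt() + eps)
    # RMS-aligned rescaling: regulate the overall update magnitude.
    rms = modulated.pow(2).mean().sqrt()
    modulated = modulated * (target_rms / (rms + eps))  # target_rms is arbitrary
    return -lr * modulated
```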

[325] AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air

Shiyi Yang, Xiaoxue Yu, Rongpeng Li, Jianhang Zhu, Zhifeng Zhao, Honggang Zhang

Main category: cs.LG

TL;DR: AirLLM introduces a hierarchical diffusion policy framework for efficient LoRA adaptation in edge devices, combining PPO and DDIM to optimize rank configurations and reduce transmission costs.

DetailsMotivation: Addressing the inefficiency of fixed or heuristic rank configurations in LoRA approaches for edge devices with limited bandwidth and computational resources.

Method: Develops AirLLM, using PPO for coarse-grained decisions and DDIM for refining rank vectors, optimized under CFG to align with PPO rewards.

Result: AirLLM improves fine-tuning performance and reduces transmission costs under varying signal-to-noise ratios.

Conclusion: AirLLM effectively combines reinforcement learning and diffusion models for scalable and efficient remote fine-tuning.

Abstract: Operating Large Language Models (LLMs) on edge devices is increasingly challenged by limited communication bandwidth and strained computational and memory costs. Thus, cloud-assisted remote fine-tuning becomes indispensable. Nevertheless, existing Low-Rank Adaptation (LoRA) approaches typically employ fixed or heuristic rank configurations, and the subsequent over-the-air transmission of all LoRA parameters could be rather inefficient. To address this limitation, we develop AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation. Specifically, AirLLM models the rank configuration as a structured action vector that spans all LoRA-inserted projections. To solve the underlying high-dimensional sequential decision-making problem, a Proximal Policy Optimization (PPO) agent generates coarse-grained decisions by jointly observing wireless states and linguistic complexity, which are then refined via Denoising Diffusion Implicit Models (DDIM) to produce high-resolution, task- and channel-adaptive rank vectors. The two modules are optimized alternatively, with the DDIM trained under the Classifier-Free Guidance (CFG) paradigm to maintain alignment with PPO rewards. Experiments under varying signal-to-noise ratios demonstrate that AirLLM consistently enhances fine-tuning performance while significantly reducing transmission costs, highlighting the effectiveness of reinforcement-driven, diffusion-refined rank adaptation for scalable and efficient remote fine-tuning over the air.

[326] Leveraging Advanced Machine Learning to Predict Turbulence Dynamics from Temperature Observations at an Experimental Prescribed Fire

Dipak Dulal, Joseph J. Charney, Michael R. Gallagher, Pitambar Acharya, Carmeliza Navasca, Nicholas S. Skowronski

Main category: cs.LG

TL;DR: Machine learning models predict turbulent kinetic energy (TKE) from temperature data in fire environments, revealing new insights for fire research and management.

DetailsMotivation: To explore the relationship between temperature and TKE in fire environments and improve fire operations and model predictions.

Method: Used Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor on 10 Hz temperature and turbulence data from a prescribed burn.

Result: Achieved accurate TKE predictions despite weak correlations, with regression models performing particularly well.

Conclusion: Demonstrates a novel numerical approach for understanding fire dynamics and highlights machine learning’s potential in fire research.

Abstract: This study explores the potential for predicting turbulent kinetic energy (TKE) from more readily acquired temperature data using temperature profiles and turbulence data collected concurrently at 10 Hz during a small experimental prescribed burn in the New Jersey Pine Barrens. Machine learning models, including Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor, were employed to assess the potential to predict TKE from temperature perturbations and explore temporal and spatial dynamics of correlations. Data visualization and correlation analyses revealed patterns and relationships between thermocouple temperatures and TKE, providing insight into the underlying dynamics. More accurate predictions of TKE were achieved by employing various machine learning models despite a weak correlation between the predictors and the target variable. The results demonstrate significant success, particularly from regression models, in accurately predicting TKE. The findings of this study demonstrate a novel numerical approach to identifying new relationships between temperature and airflow processes in and around the fire environment. These relationships can help refine our understanding of combustion environment processes and the coupling and decoupling of fire environment processes necessary for improving fire operations strategy and fire and smoke model predictions. The findings of this study additionally highlight the valuable role of machine learning techniques in analyzing the complex, large datasets of fire environments, showcasing their potential to advance fire research and management practices.

[327] Relative Entropy Pathwise Policy Optimization

Claas Voelcker, Axel Brunnbauer, Marcel Hussing, Michal Nauman, Pieter Abbeel, Eric Eaton, Radu Grosu, Amir-massoud Farahmand, Igor Gilitschenski

Main category: cs.LG

TL;DR: The paper introduces REPPO, an on-policy algorithm combining pathwise policy gradients’ efficiency with on-policy learning’s simplicity, reducing variance and improving training stability.

DetailsMotivation: Address the high variance of score-function policy gradients and the unreliability of pathwise policy gradients without accurate value functions, aiming for stable on-policy learning.

Method: Develops REPPO, balancing stochastic policies for exploration with constrained updates, and optimizes value function learning for accurate gradients.

Result: REPPO shows strong performance with reduced sample needs, faster training, lower memory use, and robust hyperparameters in benchmarks.

Conclusion: REPPO effectively merges pathwise policy gradients’ benefits with on-policy learning, offering a practical and efficient solution.

Abstract: Score-function policy gradients have delivered strong results in game-playing, robotics and language-model fine-tuning. Yet their high variance often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance, but are reliable only when driven by an accurate action-conditioned value function, which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient driven, on-policy algorithm that allows training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance with decreased sample requirements, wall-clock time, and memory footprint, as well as high hyperparameter robustness, in a set of experiments on two standard GPU-parallelized benchmarks.

[328] GATE: Graph Attention Neural Networks with Real-Time Edge Construction for Robust Indoor Localization using Mobile Embedded Devices

Danish Gufran, Sudeep Pasricha

Main category: cs.LG

TL;DR: GATE, a novel GNN-based framework, improves indoor localization by addressing non-Euclidean noise and device heterogeneity with adaptive graph representations, achieving significantly lower errors than existing methods.

DetailsMotivation: Current DL models for Wi-Fi RSS fingerprinting assume Euclidean space, ignoring spatial relationships and non-uniform noise, leading to poor generalization across devices. GNNs help but struggle with noise and blind spots.

Method: GATE uses adaptive graph representations, introducing AHV for message passing, MDHV to mitigate blind spots, and RTEC for dynamic graph adaptation.

Result: GATE reduces mean localization errors by 1.6x to 4.72x and worst-case errors by 1.85x to 4.57x compared to state-of-the-art methods.

Conclusion: GATE effectively addresses limitations of existing models, offering robust and accurate indoor localization.

Abstract: Accurate indoor localization is crucial for enabling spatial context in smart environments and navigation systems. Wi-Fi Received Signal Strength (RSS) fingerprinting is a widely used indoor localization approach due to its compatibility with mobile embedded devices. Deep Learning (DL) models improve accuracy in localization tasks by learning RSS variations across locations, but they assume fingerprint vectors exist in a Euclidean space, failing to incorporate spatial relationships and the non-uniform distribution of real-world RSS noise. This results in poor generalization across heterogeneous mobile devices, where variations in hardware and signal processing distort RSS readings. Graph Neural Networks (GNNs) can improve upon conventional DL models by encoding indoor locations as nodes and modeling their spatial and signal relationships as edges. However, GNNs struggle with non-Euclidean noise distributions and suffer from the GNN blind spot problem, leading to degraded accuracy in environments with dense access points (APs). To address these challenges, we propose GATE, a novel framework that constructs an adaptive graph representation of fingerprint vectors while preserving an indoor state-space topology, modeling the non-Euclidean structure of RSS noise to mitigate environmental noise and address device heterogeneity. GATE introduces (1) a novel Attention Hyperspace Vector (AHV) for enhanced message passing, (2) a novel Multi-Dimensional Hyperspace Vector (MDHV) to mitigate the GNN blind spot, and (3) a new Real-Time Edge Construction (RTEC) approach for dynamic graph adaptation. Extensive real-world evaluations across multiple indoor spaces with varying path lengths, AP densities, and heterogeneous devices demonstrate that GATE achieves 1.6x to 4.72x lower mean localization errors and 1.85x to 4.57x lower worst-case errors compared to state-of-the-art indoor localization frameworks.

[329] A Distance Metric for Mixed Integer Programming Instances

Gwen Maudet, Grégoire Danoy

Main category: cs.LG

TL;DR: The paper introduces a mathematical distance metric for MILP instances to improve similarity comparison, outperforming existing methods in accuracy and speed.

DetailsMotivation: MILP lacks a reliable similarity metric for comparing instances, limiting solver guidance and evaluation of instance set heterogeneity.

Method: Proposes a distance metric derived from MILP formulations, using discretization and Earth mover’s distance for constraint comparisons.

Result: The greedy variant is nearly 200 times faster with similar accuracy to the exact version, outperforming non-learned baselines and rivaling supervised classifiers.

Conclusion: The new metric effectively bridges the gap in MILP instance comparison, offering a robust unsupervised solution.

Abstract: Mixed-integer linear programming (MILP) is a powerful tool for addressing a wide range of real-world problems, but it lacks a clear structure for comparing instances. A reliable similarity metric could establish meaningful relationships between instances, enabling more effective evaluation of instance set heterogeneity and providing better guidance to solvers, particularly when machine learning is involved. Existing similarity metrics often lack precision in identifying instance classes or rely heavily on labeled data, which limits their applicability and generalization. To bridge this gap, this paper introduces the first mathematical distance metric for MILP instances, derived directly from their mathematical formulations. By discretizing right-hand sides, weights, and variables into classes, the proposed metric draws inspiration from the Earth mover’s distance to quantify mismatches in weight-variable distributions for constraint comparisons. This approach naturally extends to enable instance-level comparisons. We evaluate both an exact and a greedy variant of our metric under various parameter settings, using the StrIPLIB dataset. Results show that all components of the metric contribute to class identification, and that the greedy version achieves accuracy nearly identical to the exact formulation while being nearly 200 times faster. Compared to state-of-the-art baselines, including feature-based, image-based, and neural network models, our unsupervised method consistently outperforms all non-learned approaches and rivals the performance of a supervised classifier on class and subclass grouping tasks.
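
A toy version of the Earth-mover's-style constraint comparison, discretizing coefficients into classes and comparing the resulting distributions with SciPy; the paper's actual discretization over right-hand sides, weights, and variables is richer than this.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def constraint_distance(coefs_a, coefs_b, n_bins=10):
    """Discretize the coefficient values of two constraints into classes,
    then compare the class distributions with the 1-D Wasserstein distance."""
    lo = min(coefs_a.min(), coefs_b.min())
    hi = max(coefs_a.max(), coefs_b.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    return wasserstein_distance(np.digitize(coefs_a, bins),
                                np.digitize(coefs_b, bins))

# Two constraints with similar coefficient profiles score a small distance.
print(constraint_distance(np.array([1.0, 2.0, 2.1, 5.0]),
                          np.array([0.9, 2.2, 2.0, 5.3])))
```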

[330] LogTinyLLM: Tiny Large Language Models Based Contextual Log Anomaly Detection

Isaiah Thompson Ocansey, Ritwik Bhattacharya, Tanmay Sen

Main category: cs.LG

TL;DR: The paper proposes parameter-efficient finetuning methods (LoRA and adapters) for log anomaly detection, outperforming traditional approaches with significant accuracy improvements.

DetailsMotivation: Log anomaly detection is challenging due to log volume and complexity, necessitating efficient methods for system maintenance.

Method: Uses LoRA and adapter-based finetuning on tiny LLMs, tested on the Thunderbird dataset.

Result: LoRA finetuning improves accuracy by 18-19% over LogBert, achieving 97.76%-98.83% accuracy.

Conclusion: Parameter-efficient finetuning, especially LoRA, is highly effective for log anomaly detection.

Abstract: Log anomaly detection using traditional rule-based or deep learning methods is often challenging due to the large volume and highly complex nature of log sequences, so effective detection of anomalous log sequences is crucial for system maintenance and development. This paper proposes parameter-efficient finetuning, specifically low-rank adaptation (LoRA) and adapter-based approaches, for finding contextual anomalies in log sequences in a large log dataset. It compares different tiny large language models (LLMs) on the Thunderbird dataset. The results show that LoRA-based finetuning provides substantial performance improvements of 18 to 19 percentage points over a LogBert-based full finetuning approach, achieving accuracy scores between 97.76% and 98.83% compared to 79.37%.
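
For reference, LoRA-style parameter-efficient finetuning of a tiny causal LM looks roughly like the following with the peft library; the base model and target modules here are placeholders rather than the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder tiny LM; the paper compares several tiny LLMs on Thunderbird.
base = AutoModelForCausalLM.from_pretrained("distilgpt2")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"])  # GPT-2 attention projection
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```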

[331] Real-Time Bayesian Detection of Drift-Evasive GNSS Spoofing in Reinforcement Learning Based UAV Deconfliction

Deepak Kumar Panda, Weisi Guo

Main category: cs.LG

TL;DR: A Bayesian online change point detection (BOCPD) method is proposed to detect drift-evasive GNSS spoofing attacks on UAVs by monitoring temporal shifts in RL critic network outputs, outperforming traditional methods.

DetailsMotivation: UAVs' reliance on GNSS makes them vulnerable to subtle spoofing attacks like drift-evasive spoofing, which evade conventional detection methods, necessitating robust temporal-scale detection techniques.

Method: The study employs BOCPD to monitor temporal shifts in value estimates from an RL critic network, enabling early detection of behavioral deviations caused by spoofing.

Result: The proposed framework achieves higher detection accuracy and lower false-positive/negative rates compared to traditional GNSS spoofing detectors and other temporal methods.

Conclusion: The BOCPD-based approach enhances UAV resilience against stealthy spoofing attacks by enabling timely detection and contingency planning.

Abstract: Autonomous unmanned aerial vehicles (UAVs) rely on global navigation satellite system (GNSS) pseudorange measurements for accurate real-time localization and navigation. However, this dependence exposes them to sophisticated spoofing threats, where adversaries manipulate pseudoranges to deceive UAV receivers. Among these, drift-evasive spoofing attacks subtly perturb measurements, gradually diverting the UAVs trajectory without triggering conventional signal-level anti-spoofing mechanisms. Traditional distributional shift detection techniques often require accumulating a threshold number of samples, causing delays that impede rapid detection and timely response. Consequently, robust temporal-scale detection methods are essential to identify attack onset and enable contingency planning with alternative sensing modalities, improving resilience against stealthy adversarial manipulations. This study explores a Bayesian online change point detection (BOCPD) approach that monitors temporal shifts in value estimates from a reinforcement learning (RL) critic network to detect subtle behavioural deviations in UAV navigation. Experimental results show that this temporal value-based framework outperforms conventional GNSS spoofing detectors, temporal semi-supervised learning frameworks, and the Page-Hinkley test, achieving higher detection accuracy and lower false-positive and false-negative rates for drift-evasive spoofing attacks.
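
A minimal BOCPD recursion over a scalar stream of critic value estimates, in the Adams-and-MacKay style with a Gaussian observation model; the hyperparameters and observation model are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.stats import norm

def bocpd(values, hazard=1 / 100, mu0=0.0, var0=1.0, var_x=0.1):
    """Run-length posterior for a scalar stream (e.g. critic value
    estimates), Gaussian observations with known variance var_x."""
    T = len(values)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    mu, var = np.array([mu0]), np.array([var0])
    for t, x in enumerate(values):
        pred = norm.pdf(x, mu, np.sqrt(var + var_x))  # predictive per run length
        growth = R[t, :t + 1] * pred * (1 - hazard)   # run length grows
        cp = (R[t, :t + 1] * pred * hazard).sum()     # change point resets it
        R[t + 1, 1:t + 2] = growth
        R[t + 1, 0] = cp
        R[t + 1] /= R[t + 1].sum()
        # Conjugate Gaussian update of the per-run-length posterior mean.
        new_var = 1.0 / (1.0 / var + 1.0 / var_x)
        new_mu = new_var * (mu / var + x / var_x)
        mu = np.concatenate(([mu0], new_mu))
        var = np.concatenate(([var0], new_var))
    return R  # R[t, r]: probability the run length is r at time t

# A drift-induced jump in the value stream shifts mass to short run lengths.
stream = np.concatenate([np.random.normal(0, 0.3, 50),
                         np.random.normal(2, 0.3, 50)])
print(bocpd(stream)[-1].argmax())  # most likely run length at the end (~50)
```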

[332] Gradient Regularization-based Neural Granger Causality

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

Main category: cs.LG

TL;DR: Proposes GRNGC, a neural Granger causality method using gradient regularization, reducing computational costs and improving flexibility.

DetailsMotivation: Existing neural Granger causality models are computationally expensive and limited in capturing complex interactions.

Method: GRNGC applies L1 regularization to gradients between input and output, requiring only one prediction model and supporting diverse architectures.

Result: Outperforms baselines in simulations and real-world datasets, reducing computational overhead.

Conclusion: GRNGC is effective, flexible, and efficient for inferring Granger causality.

Abstract: With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt the component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing the sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model’s ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between model’s input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model’s effectiveness in reconstructing gene regulatory networks.
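
The core penalty is easy to sketch in PyTorch: differentiate the model's output with respect to its input and apply an L1 norm. How gradients are paired with candidate causal series in the paper is assumed away here.

```python
import torch

def granger_penalty(model, x):
    """L1 penalty on input-output gradients, in the spirit of GRNGC.
    x: (batch, lags, num_series) input window; model predicts the next step."""
    x = x.clone().requires_grad_(True)
    y = model(x)
    # create_graph=True so the penalty itself can be backpropagated.
    grads = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    return grads.abs().mean()

# Typical use alongside the prediction loss:
# total_loss = mse(model(x), target) + lam * granger_penalty(model, x)
```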

[333] Mixture of Experts in Large Language Models

Danyang Zhang, Junhao Song, Ziqian Bi, Yingfang Yuan, Tianyang Wang, Joe Yeong, Junfeng Hao

Main category: cs.LG

TL;DR: A review of Mixture-of-Experts (MoE) architecture in large language models, emphasizing performance enhancement with minimal computational cost, covering theory, design, applications, and challenges.

DetailsMotivation: To explore MoE's potential in improving model performance and efficiency in large language models.

Method: Systematic analysis of MoE’s theoretical foundations, architectural designs, gating/routing mechanisms, configurations, and real-world applications.

Result: Identified MoE’s advantages: superior model capacity, task-specific performance, scalability, and highlighted key challenges like expert diversity and calibration.

Conclusion: MoE shows promise for innovation in large language models, but challenges remain in diversity, calibration, and inference aggregation.

Abstract: This paper presents a comprehensive review of the Mixture-of-Experts (MoE) architecture in large language models, highlighting its ability to significantly enhance model performance while maintaining minimal computational overhead. Through a systematic analysis spanning theoretical foundations, core architectural designs, and large language model (LLM) applications, we examine expert gating and routing mechanisms, hierarchical and sparse MoE configurations, meta-learning approaches, multimodal and multitask learning scenarios, real-world deployment cases, and recent advances and challenges in deep learning. Our analysis identifies key advantages of MoE, including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and the ability to scale model capacity efficiently. We also underscore the importance of ensuring expert diversity, accurate calibration, and reliable inference aggregation, as these are essential for maximizing the effectiveness of MoE architectures. Finally, this review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.
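
As background for the routing machinery the review surveys, a minimal top-k gated MoE layer looks like this; load balancing, capacity limits, and expert parallelism are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer with top-k gating."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim)) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.gate(x)                  # router logits
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoE(dim=32)
print(layer(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```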

[334] Quantized Rank Reduction: A Communications-Efficient Federated Learning Scheme for Network-Critical Applications

Dimitrios Kritsiolis, Constantine Kotropoulos

Main category: cs.LG

TL;DR: Proposes a communication-efficient federated learning scheme using low-rank approximation and quantization to reduce network load while maintaining accuracy.

DetailsMotivation: Address the challenge of high communication overhead in federated learning due to frequent model updates.

Method: Uses low-rank approximation of neural network gradients and quantization.

Result: Significantly reduces network load with minimal impact on model accuracy.

Conclusion: The proposed scheme effectively balances communication efficiency and model performance in federated learning.

Abstract: Federated learning is a machine learning approach that enables multiple devices (i.e., agents) to train a shared model cooperatively without exchanging raw data. This technique keeps data localized on user devices, ensuring privacy and security, while each agent trains the model on their own data and only shares model updates. The communication overhead is a significant challenge due to the frequent exchange of model updates between the agents and the central server. In this paper, we propose a communication-efficient federated learning scheme that utilizes low-rank approximation of neural network gradients and quantization to significantly reduce the network load of the decentralized learning process with minimal impact on the model’s accuracy.
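
A sketch of the scheme's two ingredients, truncated SVD of a gradient matrix followed by uniform quantization of the factors; the rank, bit width, and encoding here are illustrative choices, not the paper's.

```python
import numpy as np

def quant(M, n_bits=8):
    # Symmetric uniform quantization to signed integers (illustrative).
    scale = max(np.abs(M).max() / (2 ** (n_bits - 1) - 1), 1e-12)
    return np.round(M / scale).astype(np.int8), scale

def compress_gradient(G, rank=4):
    """Low-rank approximation plus quantization of a gradient matrix."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    qU, sU = quant(U[:, :rank] * s[:rank])  # fold singular values into U
    qV, sV = quant(Vt[:rank])
    return qU, sU, qV, sV

def decompress(qU, sU, qV, sV):
    return (qU * sU) @ (qV * sV)

G = np.random.randn(64, 64)
rec = decompress(*compress_gradient(G))
print(np.linalg.norm(G - rec) / np.linalg.norm(G))  # relative reconstruction error
```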

[335] An Explainable AI-Enhanced Machine Learning Approach for Cardiovascular Disease Detection and Risk Assessment

Md. Emon Akter Sourov, Md. Sabbir Hossen, Pabon Shaha, Mohammad Minoar Hossain, Md Sadiq Iqbal

Main category: cs.LG

TL;DR: A machine learning framework combining classification and regression models for heart disease diagnosis and risk prediction, achieving high accuracy and interpretability.

DetailsMotivation: Heart disease diagnosis is often inaccurate in resource-limited regions, necessitating improved methods.

Method: Used classification and regression models on the Heart Disease dataset (1,035 cases), applied SMOTE for class imbalance, and evaluated performance with multiple metrics.

Result: Random Forest achieved 97.2% accuracy (real data) and 97.6% (synthetic data); Linear Regression had the highest R2 values (0.992 and 0.984).

Conclusion: Machine learning can revolutionize heart disease diagnosis and risk prediction, aiding early intervention and clinical decisions.

Abstract: Heart disease remains a major global health concern, particularly in regions with limited access to medical resources and diagnostic facilities. Traditional diagnostic methods often fail to accurately identify and manage heart disease risks, leading to adverse outcomes. Machine learning has the potential to significantly enhance the accuracy, efficiency, and speed of heart disease diagnosis. In this study, we proposed a comprehensive framework that combines classification models for heart disease detection and regression models for risk prediction. We employed the Heart Disease dataset, which comprises 1,035 cases. To address the issue of class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied, resulting in the generation of an additional 100,000 synthetic data points. Performance metrics, including accuracy, precision, recall, F1-score, R2, MSE, RMSE, and MAE, were used to evaluate the model’s effectiveness. Among the classification models, Random Forest emerged as the standout performer, achieving an accuracy of 97.2% on real data and 97.6% on synthetic data. For regression tasks, Linear Regression demonstrated the highest R2 values of 0.992 and 0.984 on real and synthetic datasets, respectively, with the lowest error metrics. Additionally, Explainable AI techniques were employed to enhance the interpretability of the models. This study highlights the potential of machine learning to revolutionize heart disease diagnosis and risk prediction, thereby facilitating early intervention and enhancing clinical decision-making.
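
The classification pipeline can be sketched with scikit-learn and imbalanced-learn; the synthetic data below merely stands in for the Heart Disease dataset, and SMOTE is applied to the training split only.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Heart Disease dataset (1,035 cases in the paper).
X, y = make_classification(n_samples=1035, n_features=13,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training data only.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(accuracy_score(y_te, clf.predict(X_te)))
```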

[336] Striking the Perfect Balance: Preserving Privacy While Boosting Utility in Collaborative Medical Prediction Platforms

Shao-Bo Lin, Xiaotong Liu, Yao Wang

Main category: cs.LG

TL;DR: The paper addresses privacy and quality concerns in online collaborative medical prediction platforms by proposing a privacy-preserving mechanism integrated into a one-shot distributed learning framework, ensuring both privacy and performance.

DetailsMotivation: Growing privacy concerns and low prediction quality in medical prediction platforms hinder patient participation and doctor cooperation.

Method: Proposes a privacy-preserving mechanism integrated into a one-shot distributed learning framework, supported by statistical learning theory.

Result: The framework achieves optimal prediction performance under privacy constraints, validated by simulations and real-world data.

Conclusion: The proposed solution effectively balances privacy and performance in collaborative medical prediction.

Abstract: Online collaborative medical prediction platforms offer convenience and real-time feedback by leveraging massive electronic health records. However, growing concerns about privacy and low prediction quality can deter patient participation and doctor cooperation. In this paper, we first clarify the privacy attacks, namely attribute attacks targeting patients and model extraction attacks targeting doctors, and specify the corresponding privacy principles. We then propose a privacy-preserving mechanism and integrate it into a novel one-shot distributed learning framework, aiming to simultaneously meet both privacy requirements and prediction performance objectives. Within the framework of statistical learning theory, we theoretically demonstrate that the proposed distributed learning framework can achieve the optimal prediction performance under specific privacy requirements. We further validate the developed privacy-preserving collaborative medical prediction platform through both toy simulations and real-world data experiments.

[337] Gradient Descent on Logistic Regression: Do Large Step-Sizes Work with Data on the Sphere?

Si Yi Meng, Baptiste Goujaud, Antonio Orvieto, Christopher De Sa

Main category: cs.LG

TL;DR: The paper examines if equal-magnitude data ensures global convergence of gradient descent (GD) in logistic regression, proving it works in 1D but not in higher dimensions, where cycling can occur.

DetailsMotivation: To understand if restricting data to equal magnitude guarantees global convergence of GD in logistic regression under any step size below the stability threshold.

Method: Analyzed GD behavior in logistic regression, focusing on separable and non-separable cases, and tested convergence in one-dimensional and higher-dimensional spaces.

Result: Proved global convergence in 1D with equal-magnitude data, but cycling behavior persists in higher dimensions despite step sizes below stability threshold.

Conclusion: Equal-magnitude data ensures global convergence in 1D but not in higher dimensions, highlighting the need for further research on cycling behavior and convergence conditions.

Abstract: Gradient descent (GD) on logistic regression has many fascinating properties. When the dataset is linearly separable, it is known that the iterates converge in direction to the maximum-margin separator regardless of how large the step size is. In the non-separable case, however, it has been shown that GD can exhibit a cycling behaviour even when the step size is still below the stability threshold $2/\lambda$, where $\lambda$ is the largest eigenvalue of the Hessian at the solution. This short paper explores whether restricting the data to have equal magnitude is a sufficient condition for global convergence, under any step size below the stability threshold. We prove that this is true in a one dimensional space, but in higher dimensions cycling behaviour can still occur. We hope to inspire further studies on quantifying how common these cycles are in realistic datasets, as well as finding sufficient conditions to guarantee global convergence with large step sizes.
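
A quick numerical illustration of the 1-D positive result: equal-magnitude, non-separable data, a deliberately large step size still below the $2/\lambda$ threshold, and GD reaching the global minimizer anyway. The data and step size are chosen for illustration only.

```python
import numpy as np

# 1-D logistic regression with |x_i| = 1; labels make the data non-separable.
x = np.array([1.0, 1.0, -1.0, 1.0])
y = np.array([1.0, 1.0, 1.0, 1.0])
s = y * x  # signed margins' slope: three +1, one -1 => minimizer at log(3)

def grad(w):
    # d/dw of mean(log(1 + exp(-s * w)))
    return -np.mean(s / (1.0 + np.exp(s * w)))

w, eta = 5.0, 3.5  # large step size, still below 2/lambda (~10.7 here)
for _ in range(300):
    w -= eta * grad(w)
print(w, np.log(3))  # converges to the finite minimizer instead of cycling
```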

[338] Generative Click-through Rate Prediction with Applications to Search Advertising

Lingwei Kong, Lu Wang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao

Main category: cs.LG

TL;DR: A novel model combines generative and discriminative approaches for CTR prediction, improving accuracy through a two-stage training process and validated by experiments and A/B testing.

DetailsMotivation: To enhance CTR prediction precision by leveraging generative models' expressive power beyond traditional discriminative models.

Method: Two-stage training: 1) Generative pre-training for next-item prediction, 2) Fine-tuning within a discriminative CTR framework.

Result: Improved CTR prediction accuracy, validated by experiments and online A/B testing, with deployment on a major e-commerce platform.

Conclusion: The hybrid generative-discriminative model effectively enhances CTR prediction, demonstrating practical utility in real-world applications.

Abstract: Click-Through Rate (CTR) prediction models are integral to a myriad of industrial settings, such as personalized search advertising. Current methods typically involve feature extraction from users’ historical behavior sequences combined with product information, feeding into a discriminative model that is trained on user feedback to estimate CTR. With the success of models such as GPT, the potential for generative models to enrich expressive power beyond discriminative models has become apparent. In light of this, we introduce a novel model that leverages generative models to enhance the precision of CTR predictions in discriminative models. To reconcile the disparate data aggregation needs of both model types, we design a two-stage training process: (1) generative pre-training for next-item prediction with the given item category in user behavior sequences; and (2) fine-tuning the well-trained generative model within a discriminative CTR prediction framework. Our method’s efficacy is substantiated through extensive experiments on a new dataset, and its significant utility is further corroborated by online A/B testing results. Currently, the model is deployed on one of the world’s largest e-commerce platforms, and we intend to release the associated code and dataset in the future.

[339] LyAm: Robust Non-Convex Optimization for Stable Learning in Noisy Environments

Elmira Mirzabeigi, Sepehr Rezaee, Kourosh Parand

Main category: cs.LG

TL;DR: LyAm, a novel optimizer combining Adam with Lyapunov stability, improves deep learning robustness, convergence, and accuracy.

DetailsMotivation: Noisy gradients and unstable convergence in deep neural networks hinder performance and generalization.

Method: LyAm integrates Adam’s adaptive moment estimation with Lyapunov stability theory to dynamically adjust learning rates.

Result: LyAm outperforms state-of-the-art optimizers in accuracy, convergence speed, and stability on datasets like CIFAR-10 and CIFAR-100.

Conclusion: LyAm is a robust optimizer for deep learning, backed by theoretical guarantees and empirical success.

Abstract: Training deep neural networks, particularly in computer vision tasks, often suffers from noisy gradients and unstable convergence, which hinder performance and generalization. In this paper, we propose LyAm, a novel optimizer that integrates Adam’s adaptive moment estimation with Lyapunov-based stability mechanisms. LyAm dynamically adjusts the learning rate using Lyapunov stability theory to enhance convergence robustness and mitigate training noise. We provide a rigorous theoretical framework proving the convergence guarantees of LyAm in complex, non-convex settings. Extensive experiments on datasets such as CIFAR-10 and CIFAR-100 show that LyAm consistently outperforms state-of-the-art optimizers in terms of accuracy, convergence speed, and stability, establishing it as a strong candidate for robust deep learning optimization.

[340] Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

Tal Fiskus, Uri Shaham

Main category: cs.LG

TL;DR: A novel DRL method uses the Neyman-Rubin framework to improve sample efficiency by bounding factual loss, reducing buffer size by 96% and boosting rewards.

DetailsMotivation: DRL agents demand high computational resources due to large training steps and buffer sizes.

Method: Leverages the Neyman-Rubin framework to bound factual loss, reusing past value network outputs.

Result: Achieves up to 2,427% higher reward ratio and reduces buffer size by 96%.

Conclusion: The method significantly improves DRL efficiency with minimal cost.

Abstract: Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that leverages the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. Extensive experiments across the Atari 2600 and MuJoCo domains on various agents, such as DQN and SAC, achieve up to 2,427% higher reward ratio, outperforming the same agents without our proposed term, and reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
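
A sketch of the data-recycling ingredient: a replay buffer that also stores the value network's output at insertion time, so those otherwise-discarded estimates can feed an auxiliary term during updates. The paper's causal bound itself is not reproduced here, and all names are illustrative.

```python
import random
from collections import deque

class ValueTaggedReplay:
    """Replay buffer that records the critic's value estimate v_old at
    insertion time alongside each transition."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, v_old):
        self.buf.append((state, action, reward, next_state, v_old))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)

# During the critic update, an auxiliary term such as (v_theta(s) - v_old)^2
# can anchor the current estimates to the stored on-policy ones.
```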

[341] Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime

Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren

Main category: cs.LG

TL;DR: The paper analyzes the convergence of SGD for smooth convex objectives in the interpolation regime, providing improved rates for the last iterate under specific conditions.

DetailsMotivation: Understanding the behavior of SGD in over-parameterized models, continual learning, and linear systems, especially with large stepsizes.

Method: Analyzing SGD on β-smooth convex loss functions with stepsize η ≤ 1/β, focusing on the last iterate’s expected excess risk.

Result: Derived expected excess risk rates, including near-optimal Õ(1/T + σ⋆/√T) for tuned stepsizes and O(1/√T) when σ⋆=0.

Conclusion: The results extend and improve prior work, offering better convergence guarantees for SGD in the interpolation regime.

Abstract: We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting – particularly with large (constant) stepsizes – has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $\beta$-smooth convex loss functions with stepsize $\eta \leq 1/\beta$, the last iterate exhibits expected excess risk $\widetilde{O}(1/(\eta T^{1-\beta\eta/2}) + \eta T^{\beta\eta/2} \sigma_\star^2)$, where $\sigma_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + \sigma_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $\sigma_\star=0$ we obtain a rate of $O(1/\sqrt{T})$ with $\eta=1/\beta$, improving upon the best-known $O(T^{-1/4})$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.

[342] Guiding LLM Decision-Making with Fairness Reward Models

Zara Hall, Melanie Subbiah, Thomas P Zollo, Kathleen McKeown, Richard Zemel

Main category: cs.LG

TL;DR: A framework for training a Fairness Reward Model (FRM) is proposed to mitigate bias in LLM reasoning for high-stakes decisions, improving fairness without sacrificing accuracy.

DetailsMotivation: Address the challenge of unfair bias amplification in LLM reasoning for high-stakes decisions like bail or loans.

Method: Train a Fairness Reward Model (FRM) on weakly supervised, LLM-annotated examples to score fairness in reasoning, enabling biased trajectories to be down-weighted.

Result: The FRM transfers across tasks, domains, and model families without fine-tuning, improving fairness while maintaining or surpassing baseline accuracy.

Conclusion: The FRM framework enables trustworthy use of reasoning models in high-stakes decision-making by balancing fairness and accuracy.

Abstract: Large language models are increasingly used to support high-stakes decisions, potentially influencing who is granted bail or receives a loan. Naive chain-of-thought sampling can improve average decision accuracy, but has also been shown to amplify unfair bias. To address this challenge and enable the trustworthy use of reasoning models in high-stakes decision-making, we propose a framework for training a generalizable Fairness Reward Model (FRM). Our model assigns a fairness score to LLM reasoning, enabling the system to down-weight biased trajectories and favor equitable ones when aggregating decisions across reasoning chains. We show that a single Fairness Reward Model, trained on weakly supervised, LLM-annotated examples of biased versus unbiased reasoning, transfers across tasks, domains, and model families without additional fine-tuning. Applied to real-world decision-making tasks including recidivism prediction and social media moderation, we show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
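
The aggregation step can be sketched as a fairness-weighted vote over reasoning chains; here frm_score stands in for the trained Fairness Reward Model, and all names are illustrative.

```python
from collections import defaultdict

def fair_aggregate(chains, frm_score):
    """Weight each chain's answer by its fairness score so biased
    trajectories are down-weighted in the final decision."""
    votes = defaultdict(float)
    for reasoning, answer in chains:
        votes[answer] += frm_score(reasoning)  # score assumed in [0, 1]
    return max(votes, key=votes.get)

chains = [("...reasoning A...", "deny"),
          ("...reasoning B...", "grant"),
          ("...reasoning C...", "grant")]
# Toy scorer: pretend chains B and C were judged fair, A biased.
print(fair_aggregate(chains, lambda r: 0.2 if "A" in r else 0.9))  # grant
```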

[343] Neurosymbolic Reasoning Shortcuts under the Independence Assumption

Emile van Krieken, Pasquale Minervini, Edoardo Ponti, Antonio Vergari

Main category: cs.LG

TL;DR: The paper challenges the independence assumption in neurosymbolic predictors, showing it limits uncertainty modeling and causes reasoning shortcuts.

DetailsMotivation: To address skepticism about the impact of the independence assumption in neurosymbolic predictors and demonstrate its limitations.

Method: Formal analysis of the independence assumption’s effects on uncertainty modeling and reasoning shortcuts.

Result: Independence assumption prevents representation of uncertainty over certain concept combinations, leading to reasoning shortcuts.

Conclusion: The independence assumption in neurosymbolic predictors hinders proper uncertainty modeling and awareness of reasoning shortcuts.

Abstract: The ubiquitous independence assumption among symbolic concepts in neurosymbolic (NeSy) predictors is a convenient simplification: NeSy predictors use it to speed up probabilistic reasoning. Recent works like van Krieken et al. (2024) and Marconato et al. (2024) argued that the independence assumption can hinder learning of NeSy predictors and, more crucially, prevent them from correctly modelling uncertainty. There is, however, scepticism in the NeSy community around the scenarios in which the independence assumption actually limits NeSy systems (Faronius and Dos Martires, 2025). In this work, we settle this question by formally showing that assuming independence among symbolic concepts entails that a model can never represent uncertainty over certain concept combinations. Thus, the model fails to be aware of reasoning shortcuts, i.e., the pathological behaviour of NeSy predictors that predict correct downstream tasks but for the wrong reasons.

[344] Local Pairwise Distance Matching for Backpropagation-Free Reinforcement Learning

Daniel Tanneberg

Main category: cs.LG

TL;DR: A novel RL method trains neural networks layer-wise using local signals during forward passes, eliminating backpropagation and activation storage, while achieving competitive performance and improved stability.

DetailsMotivation: Backpropagation in RL suffers from vanishing/exploding gradients and requires storing activations, degrading learning performance and stability.

Method: Layer-wise training with local losses based on multi-dimensional scaling, optionally guided by rewards, during forward passes.

Result: Competitive performance to BP-based methods, enhanced stability, and improved performance in challenging environments.

Conclusion: The proposed backpropagation-free method is effective, stable, and scalable for RL tasks.

Abstract: Training neural networks with reinforcement learning (RL) typically relies on backpropagation (BP), necessitating storage of activations from the forward pass for subsequent backward updates. Furthermore, backpropagating error signals through multiple layers often leads to vanishing or exploding gradients, which can degrade learning performance and stability. We propose a novel approach that trains each layer of the neural network using local signals during the forward pass in RL settings. Our approach introduces local, layer-wise losses leveraging the principle of matching pairwise distances from multi-dimensional scaling, enhanced with optional reward-driven guidance. This method allows each hidden layer to be trained using local signals computed during forward propagation, thus eliminating the need for backward passes and storing intermediate activations. Our experiments, conducted with policy gradient methods across common RL benchmarks, demonstrate that this backpropagation-free method achieves competitive performance compared to their classical BP-based counterpart. Additionally, the proposed method enhances stability and consistency within and across runs, and improves performance especially in challenging environments.
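
A sketch of one layer-local loss, matching pairwise distances between a layer's input and output representations; the paper's exact normalization and optional reward-driven guidance are omitted.

```python
import torch

def local_mds_loss(h_in, h_out):
    """Match pairwise distance structure between a layer's input and output
    (the multi-dimensional scaling idea), computed during the forward pass."""
    d_in = torch.cdist(h_in, h_in)
    d_out = torch.cdist(h_out, h_out)
    # Scale-normalize so the layer may change the overall magnitude freely.
    d_in = d_in / (d_in.mean() + 1e-8)
    d_out = d_out / (d_out.mean() + 1e-8)
    return ((d_in - d_out) ** 2).mean()

# Each layer is trained from its own loss, so no gradients flow between
# layers and no activations need to be stored for a backward pass.
h = torch.randn(16, 32)        # detached input from the previous layer
layer = torch.nn.Linear(32, 32)
local_mds_loss(h, layer(h)).backward()  # touches only this layer's parameters
```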

[345] A Neural Network Model of Complementary Learning Systems: Pattern Separation and Completion for Continual Learning

James P Jun, Vijay Marupudi, Raj Sanjay Shah, Sashank Varma

Main category: cs.LG

TL;DR: A neurally plausible continual learning model combines VAEs and MHNs to reduce catastrophic forgetting, achieving ~90% accuracy on Split-MNIST by leveraging pattern separation (MHN) and completion (VAE).

DetailsMotivation: Address catastrophic forgetting in neural networks by mimicking human memory systems (CLS theory) for continual learning.

Method: Combine variational autoencoders (VAEs) for pattern completion and Modern Hopfield networks (MHNs) for pattern separation into a continual learning model.

Result: Achieves ~90% accuracy on Split-MNIST, reducing forgetting; VAEs handle pattern completion, MHNs drive pattern separation.

Conclusion: The model provides a scalable template for memory consolidation and continual learning in biological and artificial systems.

Abstract: Learning new information without forgetting prior knowledge is central to human intelligence. In contrast, neural network models suffer from catastrophic forgetting: a significant degradation in performance on previously learned tasks when acquiring new information. The Complementary Learning Systems (CLS) theory offers an explanation for this human ability, proposing that the brain has distinct systems for pattern separation (encoding distinct memories) and pattern completion (retrieving complete memories from partial cues). To capture these complementary functions, we leverage the representational generalization capabilities of variational autoencoders (VAEs) and the robust memory storage properties of Modern Hopfield networks (MHNs), combining them into a neurally plausible continual learning model. We evaluate this model on the Split-MNIST task, a popular continual learning benchmark, and achieve close to state-of-the-art accuracy (~90%), substantially reducing forgetting. Representational analyses empirically confirm the functional dissociation: the VAE underwrites pattern completion, while the MHN drives pattern separation. By capturing pattern separation and completion in scalable architectures, our work provides a functional template for modeling memory consolidation, generalization, and continual learning in both biological and artificial systems.

[346] Toward Improving fNIRS Classification: A Study on Activation Functions in Deep Neural Architectures

Behtom Adeli, John McLinden, Pankaj Pandey, Ming Shao, Yalda Shahriari

Main category: cs.LG

TL;DR: The study evaluates activation functions for fNIRS classification, finding symmetrical functions like Tanh and Abs(x) outperform ReLU, with MAF analysis supporting symmetry’s role.

DetailsMotivation: The impact of activation functions on DL performance in fNIRS is underexplored, despite challenges like nonlinearity and low SNR.

Method: Tested conventional and field-specific activation functions on fNIRSNet, AbsoluteNet, MDNN, and shallowConvNet using standardized preprocessing and training parameters.

Result: Symmetrical functions (Tanh, Abs(x)) outperformed ReLU, with MAF analysis confirming symmetry’s effectiveness.

Conclusion: Proper activation function selection, aligned with fNIRS signal characteristics, is crucial for performance gains.

Abstract: Activation functions are critical to the performance of deep neural networks, particularly in domains such as functional near-infrared spectroscopy (fNIRS), where nonlinearity, low signal-to-noise ratio (SNR), and signal variability pose significant challenges to model accuracy. However, the impact of activation functions on deep learning (DL) performance in the fNIRS domain remains underexplored and lacks systematic investigation in the current literature. This study evaluates a range of conventional and field-specific activation functions for fNIRS classification tasks using multiple deep learning architectures, including the domain-specific fNIRSNet, AbsoluteNet, MDNN, and shallowConvNet (as the baseline), all tested on a single dataset recorded during an auditory task. To ensure a fair comparison, all networks were trained and tested using standardized preprocessing and consistent training parameters. The results show that symmetrical activation functions such as Tanh and the Absolute value function Abs(x) can outperform commonly used functions like the Rectified Linear Unit (ReLU), depending on the architecture. Additionally, a focused analysis of the role of symmetry was conducted using a Modified Absolute Function (MAF), with results further supporting the performance benefits of symmetrical activation functions. These findings underscore the importance of selecting proper activation functions that align with the signal characteristics of fNIRS data.
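
Such symmetrical activations are trivial to drop into a network; for example, an absolute-value activation as a PyTorch module. The surrounding block is illustrative, and the paper's MAF variant is not reproduced here.

```python
import torch
import torch.nn as nn

class Abs(nn.Module):
    """Absolute-value activation, one of the symmetrical functions the
    study found competitive with ReLU on fNIRS data."""
    def forward(self, x):
        return torch.abs(x)

# Drop-in replacement for ReLU inside a conv block (architecture is made up).
block = nn.Sequential(nn.Conv1d(8, 16, kernel_size=5), Abs(), nn.BatchNorm1d(16))
print(block(torch.randn(2, 8, 100)).shape)  # torch.Size([2, 16, 96])
```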

[347] Robust-Multi-Task Gradient Boosting

Seyedsaman Emami, Gonzalo Martínez-Muñoz, Daniel Hernández-Lobato

Main category: cs.LG

TL;DR: R-MTGB is a robust multi-task gradient boosting framework that handles outlier tasks effectively while promoting knowledge transfer among related tasks.

DetailsMotivation: Real-world multi-task learning (MTL) often includes outlier tasks that degrade performance. R-MTGB addresses this by detecting and penalizing outliers while enhancing shared learning.

Method: R-MTGB uses a three-block architecture: learning shared patterns, partitioning tasks into outliers/non-outliers, and fine-tuning task-specific predictors within gradient boosting.

Result: Experiments show R-MTGB isolates outliers, transfers knowledge, and reduces prediction errors, achieving overall performance gains.

Conclusion: R-MTGB is robust, adaptable, and reliable in challenging MTL environments, outperforming traditional methods.

Abstract: Multi-task learning (MTL) has shown effectiveness in exploiting shared information across tasks to improve generalization, under the assumption that tasks share beneficial similarities. In addition, boosting algorithms have demonstrated exceptional performance across diverse learning problems, primarily due to their ability to focus on hard-to-learn instances and iteratively reduce residual errors. This makes them a promising approach for learning multi-task problems. However, real-world MTL scenarios often involve tasks that are not well-aligned (known as outlier or adversarial tasks), which do not share beneficial similarities with others and can, in fact, deteriorate the performance of the overall model. To overcome this challenge, we propose Robust-Multi-Task Gradient Boosting (R-MTGB), a novel boosting framework that explicitly models and adapts to task heterogeneity during training. R-MTGB structures the learning process into three sequential blocks: (1) learning shared patterns, (2) partitioning tasks into outliers and non-outliers with regularized parameters, and (3) fine-tuning task-specific predictors. This architecture enables R-MTGB to automatically detect and penalize outlier tasks while promoting effective knowledge transfer among related tasks. Our method integrates these mechanisms seamlessly within gradient boosting, allowing robust handling of noisy or adversarial tasks without sacrificing accuracy. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that our approach successfully isolates outliers, transfers knowledge, consistently reduces prediction errors for each task individually, and achieves overall performance gains across all tasks. These results highlight the robustness, adaptability, and reliable convergence of R-MTGB in challenging MTL environments.
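As a rough sketch of the shared-then-specific structure (omitting R-MTGB's outlier-partitioning block and regularized parameters, which the abstract does not detail), one can fit a pooled booster first and then per-task boosters on its residuals; `fit_shared_then_specific` and its arguments are illustrative, not the paper's API:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_shared_then_specific(task_data, n_shared=100, n_specific=50):
    """Minimal sketch of the shared-then-task-specific structure.

    task_data: dict mapping task id -> (X, y)
    """
    # Block 1: a shared booster fit on all tasks pooled together
    X_all = np.vstack([X for X, _ in task_data.values()])
    y_all = np.concatenate([y for _, y in task_data.values()])
    shared = GradientBoostingRegressor(n_estimators=n_shared).fit(X_all, y_all)

    # Block 3: per-task boosters fit on the residuals of the shared model
    # (Block 2, outlier/non-outlier partitioning, is omitted in this sketch)
    specific = {}
    for t, (X, y) in task_data.items():
        residual = y - shared.predict(X)
        specific[t] = GradientBoostingRegressor(n_estimators=n_specific).fit(X, residual)
    return shared, specific

def predict(shared, specific, t, X):
    return shared.predict(X) + specific[t].predict(X)

rng = np.random.default_rng(0)
tasks = {t: (rng.normal(size=(100, 3)), rng.normal(size=100)) for t in range(3)}
shared, specific = fit_shared_then_specific(tasks)
```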

[348] Data Augmentation in Time Series Forecasting through Inverted Framework

Hongming Tan, Ting Chen, Ruochong Jin, Wai Kin Chan

Main category: cs.LG

TL;DR: DAIF introduces real-time data augmentation for iTransformer to address its limitations in capturing temporal interdependency and noise from nonsignificant correlations.

DetailsMotivation: The inverted framework of iTransformer, while effective for multivariate correlation, diminishes temporal interdependency and introduces noise in nonsignificant correlations.

Method: Proposes DAIF with two strategies: Frequency Filtering and Cross-variation Patching, tailored for the inverted framework.

Result: Experiments show DAIF effectively improves performance across datasets and inverted models.

Conclusion: DAIF successfully addresses the limitations of the inverted framework, enhancing its effectiveness in MTS forecasting.

Abstract: Currently, iTransformer is one of the most popular and effective models for multivariate time series (MTS) forecasting. Thanks to its inverted framework, iTransformer effectively captures multivariate correlation. However, the inverted framework still has some limitations: it diminishes temporal interdependency information and introduces noise in cases of nonsignificant variable correlation. To address these limitations, we introduce a novel data augmentation method on the inverted framework, called DAIF. Unlike previous data augmentation methods, DAIF stands out as the first real-time augmentation specifically designed for the inverted framework in MTS forecasting. We first define the structure of the inverted sequence-to-sequence framework, then propose two different DAIF strategies, Frequency Filtering and Cross-variation Patching, to address the existing challenges of the inverted framework. Experiments across multiple datasets and inverted models have demonstrated the effectiveness of our DAIF.
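The abstract names the Frequency Filtering strategy without specifying the filter, so the following is a hedged stand-in: an FFT-based augmentation that keeps only the highest-magnitude components of each variate (`keep_ratio` is an invented knob, not the paper's parameter):

```python
import numpy as np

def frequency_filter_augment(series, keep_ratio=0.25):
    """Hypothetical frequency-filtering augmentation for one variate:
    keep only the highest-magnitude rFFT components and reconstruct.
    The exact DAIF filtering rule is not specified in the abstract."""
    spectrum = np.fft.rfft(series)
    k = max(1, int(keep_ratio * len(spectrum)))
    # zero out everything but the k largest-magnitude components
    idx = np.argsort(np.abs(spectrum))[:-k]
    spectrum[idx] = 0.0
    return np.fft.irfft(spectrum, n=len(series))

t = np.linspace(0, 4 * np.pi, 256)
noisy = np.sin(t) + 0.4 * np.random.randn(256)
augmented = frequency_filter_augment(noisy)   # smoothed, structure-preserving copy
```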

[349] D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data

Harsha Varun Marisetty, Manik Gupta, Yogesh Simmhan

Main category: cs.LG

TL;DR: The paper investigates the impact of non-linear, non-stationary time-series data distributions on federated learning (FL) performance and evaluates detrending techniques to improve FL accuracy in IoT applications.

DetailsMotivation: Traditional centralized data analysis for IoT introduces delays and costs. FL offers a distributed alternative, but non-linear, non-stationary data variations challenge prediction accuracy.

Method: The study uses synthetic and real-world datasets with non-linear distributions, trains an LSTM model with centralized and FL approaches, and tests detrending techniques.

Result: FL underperforms centralized methods for non-linear data, but detrending techniques enhance FL performance by reducing loss.

Conclusion: Appropriate detrending can mitigate FL’s limitations with non-linear data, improving forecasting accuracy in IoT applications.

Abstract: With advancements in computing and communication technologies, the Internet of Things (IoT) has seen significant growth. IoT devices typically collect data from various sensors, such as temperature, humidity, and energy meters. Much of this data is temporal in nature. Traditionally, data from IoT devices is centralized for analysis, but this approach introduces delays and increased communication costs. Federated learning (FL) has emerged as an effective alternative, allowing for model training across distributed devices without the need to centralize data. In many applications, such as smart home energy and environmental monitoring, the data collected by IoT devices across different locations can exhibit significant variation in trends and seasonal patterns. Accurately forecasting such non-stationary, non-linear time-series data is crucial for applications like energy consumption estimation and weather forecasting. However, these data variations can severely impact prediction accuracy. The key contributions of this paper are: (1) Investigating how non-linear, non-stationary time-series data distributions, like generalized extreme value (gen-extreme) and log-normal distributions, affect FL performance. (2) Analyzing how different detrending techniques for non-linear time-series data influence the forecasting model’s performance in an FL setup. We generated several synthetic time-series datasets using non-linear data distributions and trained an LSTM-based forecasting model using both centralized and FL approaches. Additionally, we evaluated the impact of detrending on real-world datasets with non-linear time-series data distributions. Our experimental results show that: (1) FL performs worse than centralized approaches when dealing with non-linear data distributions. (2) The use of appropriate detrending techniques improves FL performance, reducing loss across different data distributions.
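The specific detrending techniques are not named in the abstract; as a stand-in, here are two common choices applied to a client's local series before LSTM training (`preprocess_for_fl` is an illustrative helper, not the paper's code):

```python
import numpy as np
from scipy.signal import detrend

def preprocess_for_fl(series, method="linear"):
    """Detrend a client's local series before local model training.
    'linear' removes a least-squares line; 'diff' removes trend by
    first differencing."""
    if method == "linear":
        return detrend(series, type="linear")
    if method == "diff":
        return np.diff(series, prepend=series[0])
    raise ValueError(method)

rng = np.random.default_rng(1)
trend = np.linspace(0, 10, 500)                       # strong non-stationary trend
series = trend + rng.gumbel(size=500)                 # heavy-tailed noise
print(preprocess_for_fl(series).std(), series.std())  # variance drops after detrending
```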

[350] Exploring the robustness of TractOracle methods in RL-based tractography

Jeremi Levesque, Antoine Théberge, Maxime Descoteaux, Pierre-Marc Jodoin

Main category: cs.LG

TL;DR: The paper explores extensions of the TractOracle-RL framework for tractography, integrating RL advances and introducing Iterative Reward Training (IRT). Results show improved accuracy and anatomical validity over traditional methods.

DetailsMotivation: To enhance tractography by leveraging RL and anatomical priors, reducing false positives and improving reliability.

Method: Extends TractOracle-RL with RL advancements, evaluates across five datasets, and introduces IRT for iterative reward refinement.

Result: RL methods with oracle feedback outperform traditional techniques in accuracy and anatomical validity.

Conclusion: Combining RL with oracle guidance yields robust tractography, with IRT further enhancing performance.

Abstract: Tractography algorithms leverage diffusion MRI to reconstruct the fibrous architecture of the brain’s white matter. Among machine learning approaches, reinforcement learning (RL) has emerged as a promising framework for tractography, outperforming traditional methods in several key aspects. TractOracle-RL, a recent RL-based approach, reduces false positives by incorporating anatomical priors into the training process via a reward-based mechanism. In this paper, we investigate four extensions of the original TractOracle-RL framework by integrating recent advances in RL, and we evaluate their performance across five diverse diffusion MRI datasets. Results demonstrate that combining an oracle with the RL framework consistently leads to robust and reliable tractography, regardless of the specific method or dataset used. We also introduce a novel RL training scheme called Iterative Reward Training (IRT), inspired by the Reinforcement Learning from Human Feedback (RLHF) paradigm. Instead of relying on human input, IRT leverages bundle filtering methods to iteratively refine the oracle’s guidance throughout training. Experimental results show that RL methods trained with oracle feedback significantly outperform widely used tractography techniques in terms of accuracy and anatomical validity.

[351] A parametric activation function based on Wendland RBF

Majid Darehmiraki

Main category: cs.LG

TL;DR: A novel parametric activation function using Wendland RBFs is introduced for deep neural networks, combining locality, smoothness, and adaptability to outperform traditional functions like ReLU in certain tasks.

DetailsMotivation: Address limitations of traditional activation functions (ReLU, sigmoid, tanh) by leveraging Wendland RBFs' compact support and smoothness for better gradient propagation and stability.

Method: Proposes an enhanced Wendland activation function combining standard Wendland RBFs with linear and exponential terms, analyzed theoretically and tested on synthetic tasks (sine wave) and benchmarks (MNIST, Fashion-MNIST).

Result: Superior accuracy in regression tasks, competitive performance on benchmarks, and improved generalization due to localized, smooth transformations.

Conclusion: Wendland activations bridge classical RBF theory with deep learning, offering potential for hybrid architectures and domain-specific adaptations to mitigate overfitting.

Abstract: This paper introduces a novel parametric activation function based on Wendland radial basis functions (RBFs) for deep neural networks. Wendland RBFs, known for their compact support, smoothness, and positive definiteness in approximation theory, are adapted to address limitations of traditional activation functions like ReLU, sigmoid, and tanh. The proposed enhanced Wendland activation combines a standard Wendland component with linear and exponential terms, offering tunable locality, improved gradient propagation, and enhanced stability during training. Theoretical analysis highlights its mathematical properties, including smoothness and adaptability, while empirical experiments on synthetic tasks (e.g., sine wave approximation) and benchmark datasets (MNIST, Fashion-MNIST) demonstrate competitive performance. Results show that the Wendland-based activation achieves superior accuracy in certain scenarios, particularly in regression tasks, while maintaining computational efficiency. The study bridges classical RBF theory with modern deep learning, suggesting that Wendland activations can mitigate overfitting and improve generalization through localized, smooth transformations. Future directions include hybrid architectures and domain-specific adaptations.
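The Wendland C2 function itself is standard, phi(r) = (1 - r)^4 (4r + 1) on [0, 1] and zero otherwise, but the paper's exact combination with linear and exponential terms is not given in the abstract; the weights a, b, c below are hypothetical learnable parameters in a sketch of the idea:

```python
import torch
import torch.nn as nn

class WendlandActivation(nn.Module):
    """Sketch of a Wendland-RBF-based activation. The compactly supported
    Wendland C2 function is phi(r) = (1 - r)^4 (4r + 1) for 0 <= r <= 1.
    The paper augments it with linear and exponential terms; their exact
    combination is not given in the abstract, so a, b, c are hypothetical."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(1.0))  # Wendland component weight
        self.b = nn.Parameter(torch.tensor(0.1))  # linear term weight
        self.c = nn.Parameter(torch.tensor(0.1))  # exponential term weight

    def forward(self, x):
        r = x.abs().clamp(max=1.0)                 # radial argument, compact support
        wendland = (1 - r) ** 4 * (4 * r + 1)
        return self.a * wendland + self.b * x + self.c * torch.exp(-x ** 2)

x = torch.linspace(-2, 2, 9)
print(WendlandActivation()(x))  # localized smooth bump plus a global linear term
```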

[352] Langevin Flows for Modeling Neural Latent Dynamics

Yue Song, T. Anderson Keller, Yisong Yue, Pietro Perona, Max Welling

Main category: cs.LG

TL;DR: LangevinFlow is a physics-inspired VAE model using Langevin dynamics for neural population dynamics, outperforming benchmarks in accuracy and prediction.

DetailsMotivation: To capture intrinsic and external influences in neural dynamics, leveraging physical priors like inertia and damping.

Method: Uses a sequential VAE with Langevin dynamics, a recurrent encoder, Transformer decoder, and oscillator-based potential function.

Result: Outperforms baselines on synthetic and real datasets (NLB), achieving high accuracy in firing rates and behavioral decoding.

Conclusion: LangevinFlow is a flexible, high-performing framework for modeling neural dynamics and unobserved influences.

Abstract: Neural populations exhibit latent dynamical structures that drive time-evolving spiking activities, motivating the search for models that capture both intrinsic network dynamics and external unobserved influences. In this work, we introduce LangevinFlow, a sequential Variational Auto-Encoder where the time evolution of latent variables is governed by the underdamped Langevin equation. Our approach incorporates physical priors – such as inertia, damping, a learned potential function, and stochastic forces – to represent both autonomous and non-autonomous processes in neural systems. Crucially, the potential function is parameterized as a network of locally coupled oscillators, biasing the model toward oscillatory and flow-like behaviors observed in biological neural populations. Our model features a recurrent encoder, a one-layer Transformer decoder, and Langevin dynamics in the latent space. Empirically, our method outperforms state-of-the-art baselines on synthetic neural populations generated by a Lorenz attractor, closely matching ground-truth firing rates. On the Neural Latents Benchmark (NLB), the model achieves superior held-out neuron likelihoods (bits per spike) and forward prediction accuracy across four challenging datasets. It also matches or surpasses alternative methods in decoding behavioral metrics such as hand velocity. Overall, this work introduces a flexible, physics-inspired, high-performing framework for modeling complex neural population dynamics and their unobserved influences.
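As a sketch of the latent dynamics (the learned coupled-oscillator potential is replaced by an arbitrary differentiable U, and the discretization scheme is an assumption), one Euler-Maruyama step of the underdamped Langevin equation looks like:

```python
import torch

def langevin_step(x, v, potential, gamma=0.5, sigma=0.1, dt=0.01):
    """One Euler-Maruyama step of the underdamped Langevin equation:
        dx = v dt
        dv = (-grad U(x) - gamma * v) dt + sigma dW
    `potential` stands in for the paper's learned coupled-oscillator
    potential; any differentiable U: R^d -> R works here."""
    x = x.detach().requires_grad_(True)
    grad_U = torch.autograd.grad(potential(x).sum(), x)[0]
    v_next = v + (-grad_U - gamma * v) * dt + sigma * (dt ** 0.5) * torch.randn_like(v)
    x_next = x + v_next * dt   # semi-implicit: update position with new velocity
    return x_next.detach(), v_next.detach()

# toy usage with a quadratic potential (a single harmonic oscillator)
U = lambda z: 0.5 * (z ** 2).sum(-1)
x, v = torch.randn(8, 2), torch.zeros(8, 2)
for _ in range(100):
    x, v = langevin_step(x, v, U)
```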

[353] GRAPES: Learning to Sample Graphs for Scalable Graph Neural Networks

Taraneh Younesian, Daniel Daza, Emile van Krieken, Thiviyan Thanapalasingam, Peter Bloem

Main category: cs.LG

TL;DR: GRAPES introduces an adaptive sampling method for GNNs, improving scalability and accuracy, especially in heterophilous graphs.

DetailsMotivation: Existing GNN sampling methods rely on static heuristics, which may not generalize across graphs or tasks. Adaptive sampling is needed to handle diverse graph structures.

Method: GRAPES uses a second GNN to predict node sampling probabilities, optimizing for the downstream task. It is evaluated on homophilous and heterophilous graphs.

Result: GRAPES achieves high accuracy and scalability, even with smaller sample sizes, outperforming static methods.

Conclusion: GRAPES is a scalable, adaptive solution for GNN sampling, effective in diverse graph types.

Abstract: Graph neural networks (GNNs) learn to represent nodes by aggregating information from their neighbors. As GNNs increase in depth, their receptive field grows exponentially, leading to high memory costs. Several existing methods address this by sampling a small subset of nodes, scaling GNNs to much larger graphs. These methods are primarily evaluated on homophilous graphs, where neighboring nodes often share the same label. However, most of these methods rely on static heuristics that may not generalize across different graphs or tasks. We argue that the sampling method should be adaptive, adjusting to the complex structural properties of each graph. To this end, we introduce GRAPES, an adaptive sampling method that learns to identify the set of nodes crucial for training a GNN. GRAPES trains a second GNN to predict node sampling probabilities by optimizing the downstream task objective. We evaluate GRAPES on various node classification benchmarks, involving homophilous as well as heterophilous graphs. We demonstrate GRAPES’ effectiveness in accuracy and scalability, particularly in multi-label heterophilous graphs. Unlike other sampling methods, GRAPES maintains high accuracy even with smaller sample sizes and, therefore, can scale to massive graphs. Our code is publicly available at https://github.com/dfdazac/grapes.
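A minimal sketch of the sampling step follows; the training signal that makes the sampler adaptive, which is GRAPES' core contribution, is omitted, and `sample_neighbors` with its arguments is illustrative:

```python
import torch

def sample_neighbors(sampler_scores, candidate_nodes, k):
    """A second GNN scores candidate nodes, and k of them are sampled for
    the next layer. How the sampler is trained against the downstream
    loss is not shown here.

    sampler_scores:  (n_candidates,) logits from the sampler GNN
    candidate_nodes: (n_candidates,) node ids
    """
    probs = torch.softmax(sampler_scores, dim=0)
    idx = torch.multinomial(probs, num_samples=min(k, len(candidate_nodes)),
                            replacement=False)
    return candidate_nodes[idx], probs[idx]  # keep probs for the gradient estimator

scores = torch.randn(100)            # hypothetical sampler-GNN outputs
nodes = torch.arange(100)
picked, p = sample_neighbors(scores, nodes, k=16)
```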

[354] EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning

Shuo Wang, Keke Gai, Jing Yu, Liehuang Zhu, Kim-Kwang Raymond Choo, Bin Xiao

Main category: cs.LG

TL;DR: The paper proposes VFedMH, a vertical federated learning method for training heterogeneous models while protecting local data privacy through embedding protection and gradient assistance.

DetailsMotivation: Existing vertical federated learning methods struggle with heterogeneous local models, impacting convergence and generalization.

Method: VFedMH aggregates local embeddings with blinding factors, involves active and passive parties for gradient computation, and trains models using heterogeneous gradients.

Result: VFedMH successfully trains multiple heterogeneous models and outperforms recent methods in performance.

Conclusion: VFedMH addresses heterogeneity in vertical federated learning, ensuring privacy and improving model performance.

Abstract: Vertical federated learning has garnered significant attention as it allows clients to train machine learning models collaboratively without sharing local data, which protects the client’s local private data. However, existing VFL methods face challenges when dealing with heterogeneous local models among participants, which affects optimization convergence and generalization. To address this challenge, this paper proposes a novel approach called Vertical federated learning for training multiple Heterogeneous models (VFedMH). VFedMH focuses on aggregating the local embeddings of each participant’s knowledge during forward propagation. To protect the participants’ local embedding values, we propose an embedding protection method based on lightweight blinding factors. In particular, participants obtain local embedding using local heterogeneous models. Then the passive party, who owns only features of the sample, injects the blinding factor into the local embedding and sends it to the active party. The active party aggregates local embeddings to obtain global knowledge embeddings and sends them to passive parties. The passive parties then utilize the global embeddings to propagate forward on their local heterogeneous networks. However, the passive party does not own the sample labels, so the local model gradient cannot be calculated locally. To overcome this limitation, the active party assists the passive party in computing its local heterogeneous model gradients. Then, each participant trains their local model using the heterogeneous model gradients. The objective is to minimize the loss value of their respective local heterogeneous models. Extensive experiments are conducted to demonstrate that VFedMH can simultaneously train multiple heterogeneous models with heterogeneous optimization and outperform some recent methods in model performance.
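The abstract describes lightweight blinding factors without giving the scheme, so the sketch below is an assumption: pairwise-cancelling additive masks, a standard construction in which each masked embedding leaks nothing individually while the active party's aggregate is exact:

```python
import numpy as np

def pairwise_masks(n_parties, dim, seed=0):
    """Toy additive blinding in the spirit of lightweight blinding factors
    (the paper's actual scheme may differ): parties i < j share a random
    mask s_ij; i adds it, j subtracts it, so all masks cancel when the
    active party sums the blinded embeddings."""
    rng = np.random.default_rng(seed)
    masks = [np.zeros(dim) for _ in range(n_parties)]
    for i in range(n_parties):
        for j in range(i + 1, n_parties):
            s = rng.standard_normal(dim)  # in practice, from a pairwise-agreed seed
            masks[i] += s
            masks[j] -= s
    return masks

embeddings = [np.random.randn(4) for _ in range(3)]   # local embeddings
masks = pairwise_masks(3, 4)
blinded = [e + m for e, m in zip(embeddings, masks)]  # what passive parties send
aggregate = sum(blinded)                               # active party's sum
assert np.allclose(aggregate, sum(embeddings))         # masks cancel exactly
```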

[355] Learning Safe Numeric Planning Action Models

Argaman Mordoch, Shahaf S. Shperberg, Roni Stern, Brendan Juba

Main category: cs.LG

TL;DR: The paper introduces N-SAM and N-SAM*, algorithms for learning safe numeric action models in planning domains, ensuring safety and applicability even with limited observations.

DetailsMotivation: Real-world planning requires accurate models, especially in mission-critical domains where trial-and-error is unsafe. Existing methods are limited to Boolean variables, leaving a gap for numeric domains.

Method: Proposes N-SAM for learning safe numeric preconditions and effects, and N-SAM* to address its limitation of requiring many observations. Both algorithms guarantee safety and are evaluated against a state-of-the-art method.

Result: N-SAM runs in linear time and guarantees safety under certain conditions. N-SAM* ensures applicability even with single observations while maintaining safety and is proven optimal in sample complexity.

Conclusion: N-SAM and N-SAM* advance safe action model learning for numeric domains, with N-SAM* offering a practical solution for limited observations. The work highlights the impact of numerical accuracy on learning.

Abstract: A significant challenge in applying planning technology to real-world problems lies in obtaining a planning model that accurately represents the problem’s dynamics. Obtaining a planning model is even more challenging in mission-critical domains, where a trial-and-error approach to learning how to act is not an option. In such domains, the action model used to generate plans must be safe, in the sense that plans generated with it must be applicable and achieve their goals. Learning safe action models for planning has mostly been explored for domains in which states are sufficiently described with Boolean variables. In this work, we go beyond this limitation and present the Numeric Safe Action Models Learning (N-SAM) algorithm, an action model learning algorithm capable of learning safe numeric preconditions and effects. We prove that N-SAM runs in linear time in the number of observations and, under certain conditions, is guaranteed to return safe action models. However, to preserve this safety guarantee, N-SAM must observe a substantial number of examples for each action before including it in the learned model. We address this limitation of N-SAM and propose N-SAM*, an extension to the N-SAM algorithm that always returns an action model where every observed action is applicable at least in some states, even if it was observed only once. N-SAM* does so without compromising the safety of the returned action model. We prove that N-SAM* is optimal in terms of sample complexity compared to any other algorithm that guarantees safety. N-SAM and N-SAM* are evaluated over an extensive benchmark of numeric planning domains, and their performance is compared to a state-of-the-art numeric action model learning algorithm. We also provide a discussion on the impact of numerical accuracy on the learning process.
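As a simplified illustration of safe numeric precondition learning (N-SAM itself learns tighter convex-hull preconditions; axis-aligned intervals are used here only to convey the safety idea):

```python
def learn_safe_interval_preconditions(observations):
    """Simplified illustration: for each action, take the tightest
    axis-aligned box around the numeric states in which the action was
    observed. States inside the box resemble observed ones, so applying
    the action there is treated as safe; states outside are
    conservatively rejected.

    observations: dict action -> list of dicts {numeric_var: value}
    """
    model = {}
    for action, states in observations.items():
        variables = states[0].keys()
        model[action] = {
            var: (min(s[var] for s in states), max(s[var] for s in states))
            for var in variables
        }
    return model

def is_applicable(model, action, state):
    return all(lo <= state[v] <= hi for v, (lo, hi) in model[action].items())

obs = {"move": [{"fuel": 5.0, "speed": 1.0}, {"fuel": 9.0, "speed": 3.0}]}
m = learn_safe_interval_preconditions(obs)
print(is_applicable(m, "move", {"fuel": 7.0, "speed": 2.0}))  # True
print(is_applicable(m, "move", {"fuel": 2.0, "speed": 2.0}))  # False: never seen
```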

[356] FairTargetSim: An Interactive Simulator for Understanding and Explaining the Fairness Effects of Target Variable Definition

Dalia Gala, Milo Phillips-Brown, Naman Goel, Carina Prunkl, Laura Alvarez Jubete, Medb Corcoran, Ray Eitel-Porter

Main category: cs.LG

TL;DR: FairTargetSim (FTS) is an interactive, simulation-based tool to address biases in target variable definition in machine learning, demonstrated in algorithmic hiring.

DetailsMotivation: Biases in target variable definitions can lead to unfair outcomes in ML systems, necessitating tools to evaluate and mitigate these biases.

Method: FTS is an open-source, interactive simulation tool that allows users to explore the impacts of target variable definitions on fairness, using real-world data and user-defined targets.

Result: FTS is demonstrated in algorithmic hiring, showing its utility for developers, stakeholders, researchers, and educators.

Conclusion: FTS provides a practical approach to responsibly develop and deploy ML systems by addressing biases in target variable definitions.

Abstract: Machine learning requires defining one’s target variable for predictions or decisions, a process that can have profound implications for fairness, since biases are often encoded in the target variable definition itself, before any data collection or training. The downstream impacts of target variable definition must be taken into account in order to responsibly develop, deploy, and use algorithmic systems. We propose FairTargetSim (FTS), an interactive and simulation-based approach for this. We demonstrate FTS using the example of algorithmic hiring, grounded in real-world data and user-defined target variables. FTS is open-source; it can be used by algorithm developers, non-technical stakeholders, researchers, and educators in a number of ways. FTS is available at: http://tinyurl.com/ftsinterface. The video accompanying this paper is here: http://tinyurl.com/ijcaifts.

[357] Unified ODE Analysis of Smooth Q-Learning Algorithms

Donghwan Lee

Main category: cs.LG

TL;DR: The paper presents a unified convergence analysis for Q-learning and its smooth variants, improving upon the restrictive switching system framework.

DetailsMotivation: To generalize and simplify the convergence analysis of Q-learning and its smooth variants, addressing limitations of the switching system approach.

Method: Uses a more general ODE model inspired by $p$-norm Lyapunov functions, covering asynchronous Q-learning and smooth versions.

Result: The proposed framework provides a simpler and more general convergence proof for Q-learning and its variants.

Conclusion: The analysis offers a unified and less restrictive approach to proving convergence in reinforcement learning algorithms.

Abstract: Convergence of Q-learning has been the focus of extensive research over the past several decades. Recently, an asymptotic convergence analysis for Q-learning was introduced using a switching system framework. This approach applies the so-called ordinary differential equation (ODE) approach to prove the convergence of the asynchronous Q-learning modeled as a continuous-time switching system, where notions from switching system theory are used to prove its asymptotic stability without using explicit Lyapunov arguments. However, to prove stability, restrictive conditions, such as quasi-monotonicity, must be satisfied for the underlying switching systems, which makes it hard to easily generalize the analysis method to other reinforcement learning algorithms, such as the smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and can analyze Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning based on $p$-norm serving as a Lyapunov function. However, the proposed analysis addresses more general ODE models that can cover both asynchronous Q-learning and its smooth versions with simpler frameworks.
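As background on the analysis style the paper generalizes (details beyond the abstract are not reproduced here), the ODE approach studies the continuous-time limit of the Q-learning iterates and certifies stability with a Lyapunov function; a $p$-norm candidate around the fixed point $q^*$ of the (smoothed) Bellman operator $F$ takes the form

$$\dot{q}(t) = F\bigl(q(t)\bigr) - q(t), \qquad V(q) = \lVert q - q^* \rVert_p^p = \sum_i \lvert q_i - q_i^* \rvert^p,$$

and asymptotic convergence follows when $\frac{d}{dt} V(q(t)) < 0$ along all trajectories with $q(t) \neq q^*$.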

[358] SimAD: A Simple Dissimilarity-based Approach for Time Series Anomaly Detection

Zhijie Zhong, Zhiwen Yu, Xing Xi, Yue Xu, Wenming Cao, Yiyuan Yang, Kaixiang Yang, Jane You

Main category: cs.LG

TL;DR: SimAD is a simple dissimilarity-based approach for time series anomaly detection, addressing limitations of existing methods with extended temporal context, robust feature extraction, and improved evaluation metrics.

DetailsMotivation: Existing reconstruction-based methods struggle with limited temporal contexts, insufficient normal pattern representation, and flawed evaluation metrics, hindering effective anomaly detection.

Method: SimAD uses a patching-based feature extractor, EmbedPatch encoder, and ContrastFusion module to highlight differences between normal and abnormal data. It also introduces two new metrics, UAff and NAff.

Result: SimAD outperforms state-of-the-art methods, achieving significant improvements in F1, Aff-F1, NAff-F1, and AUC on diverse datasets.

Conclusion: SimAD provides a robust and effective solution for time series anomaly detection, validated by theoretical and experimental results.

Abstract: Despite the prevalence of reconstruction-based deep learning methods, time series anomaly detection remains a tremendous challenge. Existing approaches often struggle with limited temporal contexts, insufficient representation of normal patterns, and flawed evaluation metrics, all of which hinder their effectiveness in detecting anomalous behavior. To address these issues, we introduce a Simple dissimilarity-based approach for time series Anomaly Detection, referred to as SimAD. Specifically, SimAD first incorporates a patching-based feature extractor capable of processing extended temporal windows and employs the EmbedPatch encoder to fully integrate normal behavioral patterns. Second, we design an innovative ContrastFusion module in SimAD, which strengthens the robustness of anomaly detection by highlighting the distributional differences between normal and abnormal data. Third, we introduce two robust enhanced evaluation metrics, Unbiased Affiliation (UAff) and Normalized Affiliation (NAff), designed to overcome the limitations of existing metrics by providing better distinctiveness and semantic clarity. The reliability of these two metrics has been demonstrated by both theoretical and experimental analyses. Experiments conducted on seven diverse time series datasets clearly demonstrate SimAD’s superior performance compared to state-of-the-art methods, achieving relative improvements of 19.85% on F1, 4.44% on Aff-F1, 77.79% on NAff-F1, and 9.69% on AUC on six multivariate datasets. Code and pre-trained models are available at https://github.com/EmorZz1G/SimAD.

[359] SA-GDA: Spectral Augmentation for Graph Domain Adaptation

Jinhui Pang, Zixuan Wang, Jiliang Tang, Mingyan Xiao, Nan Yin

Main category: cs.LG

TL;DR: The paper introduces Spectral Augmentation for Graph Domain Adaptation (SA-GDA) to address domain adaptation in graph node classification by aligning category feature spaces in the spectral domain, leveraging a dual graph convolutional network and adversarial learning.

DetailsMotivation: Existing GNNs rely on supervised training with abundant labels, limiting transferability. Domain adaptation for graph node classification lacks focus on category-level feature alignment, causing classification confusion in target domains.

Method: SA-GDA aligns category feature spaces in the spectral domain, uses a dual graph convolutional network for local and global consistency, and employs adversarial learning for knowledge transfer.

Result: Experiments on public datasets demonstrate SA-GDA’s effectiveness in improving domain adaptation for graph node classification.

Conclusion: SA-GDA successfully addresses domain adaptation challenges by spectral alignment and adversarial learning, enhancing classification performance in target domains.

Abstract: Graph neural networks (GNNs) have achieved impressive results on graph-related tasks. However, most GNNs are primarily studied in the single-domain case with supervised training, which requires abundant task-specific labels and is difficult to transfer to other domains. There are few works focused on domain adaptation for graph node classification. They mainly focus on aligning the feature space of the source and target domains, without considering the feature alignment between different categories, which may lead to confusion of classification in the target domain. However, due to the scarcity of labels in the target domain, we cannot directly perform effective alignment of categories from different domains, which makes the problem more challenging. In this paper, we present Spectral Augmentation for Graph Domain Adaptation (SA-GDA) for graph node classification. First, we observe that nodes with the same category in different domains exhibit similar characteristics in the spectral domain, while different classes are quite different. Following this observation, we align the category feature spaces of different domains in the spectral domain instead of aligning the whole feature space, and we theoretically prove the stability of the proposed SA-GDA. Then, we develop a dual graph convolutional network to jointly exploit local and global consistency for feature aggregation. Last, we utilize a domain classifier with an adversarial learning submodule to facilitate knowledge transfer between different domain graphs. Experimental results on a variety of publicly available datasets reveal the effectiveness of our SA-GDA.

[360] Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos

Main category: cs.LG

TL;DR: ONI proposes a scalable method for synthesizing dense rewards from natural language using LLMs, avoiding the need for large offline datasets or per-observation LLM annotations.

DetailsMotivation: Addressing limitations in prior work, which either lacks scalability or requires extensive offline datasets, by leveraging LLMs for reward synthesis in RL.

Method: ONI uses a distributed architecture to learn RL policies and intrinsic rewards via asynchronous LLM feedback, distilling annotations into a reward model with varying complexity (hashing, classification, ranking).

Result: Achieves state-of-the-art performance in the NetHack Learning Environment without needing large offline datasets.

Conclusion: ONI offers a scalable and efficient solution for dense reward synthesis in RL, overcoming prior limitations.

Abstract: Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent’s collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment, while removing the need for large offline datasets required by prior work. We make our code available at https://github.com/facebookresearch/oni .
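A hedged sketch of the distillation step for the classification variant follows; the observation encoding, network shape, and training-loop details are placeholders, not ONI's implementation:

```python
import torch
import torch.nn as nn

class IntrinsicRewardModel(nn.Module):
    """Minimal sketch of reward distillation: binary labels from an
    asynchronous LLM annotator ('interesting' vs. not) are distilled into
    a small classifier whose probability serves as the intrinsic reward.
    ONI also explores hashing and ranking variants; obs_dim and the MLP
    shape here are placeholders."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, obs):
        return torch.sigmoid(self.net(obs)).squeeze(-1)  # reward in [0, 1]

model = IntrinsicRewardModel(obs_dim=32)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def distill_step(obs_batch, llm_labels):
    """obs_batch: (B, 32) observations; llm_labels: (B,) 0/1 LLM judgments."""
    loss = nn.functional.binary_cross_entropy(model(obs_batch), llm_labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

distill_step(torch.randn(64, 32), torch.randint(0, 2, (64,)).float())
```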

[361] ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification

Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack

Main category: cs.LG

TL;DR: ReVISE enables LLMs to self-correct outputs via self-verification and curriculum learning, improving reasoning efficiency.

DetailsMotivation: Replicating human-like self-awareness in LLMs is challenging; ReVISE addresses this without extensive external verifiers.

Method: Uses self-verification and curriculum learning with preference pairs for training, plus confidence-aware decoding during inference.

Result: Achieves efficient self-correction and significant reasoning performance improvement.

Conclusion: ReVISE is an effective framework for enhancing LLM self-awareness and reasoning.

Abstract: Self-awareness, i.e., the ability to assess and correct one’s own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or rather relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on its verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.

[362] Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Devdhar Patel, Hava Siegelmann

Main category: cs.LG

TL;DR: SRL is a new RL algorithm that generates action sequences for lower decision frequencies, reducing sample complexity and outperforming traditional RL in variable-frequency tasks.

DetailsMotivation: Current RL algorithms require fast reaction times, which are impractical for real-world applications. SRL aims to enable effective control at slower decision frequencies.

Method: SRL uses a model and actor-critic architecture with a ’temporal recall’ mechanism to learn action sequences. The critic estimates intermediate states for learning signals.

Result: SRL matches state-of-the-art performance while reducing sample complexity and excels in variable-frequency tasks, as shown by the new FAS metric.

Conclusion: SRL is a practical solution for RL in real-world settings with slower decision frequencies, offering comparable performance to model-based planning.

Abstract: Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a “temporal recall” mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Furthermore, we compare SRL with model-based online planning, showing that SRL achieves comparable FAS while leveraging the same model during training that online planners use for planning.

[363] Representation Bending for Large Language Model Safety

Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi

Main category: cs.LG

TL;DR: RepBend is a novel method to enhance LLM safety by disrupting harmful representations, outperforming existing techniques with minimal impact on usability.

DetailsMotivation: Addressing the limitations of current safety-enhancing techniques for LLMs, which struggle with generalization and manual defenses against adversarial attacks.

Method: RepBend introduces activation steering into loss-based fine-tuning to disrupt harmful behaviors in LLMs.

Result: Achieves up to 95% reduction in attack success rates across benchmarks, with negligible impact on usability.

Conclusion: RepBend offers a scalable and effective solution for inherent LLM safety, outperforming prior methods.

Abstract: Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by the recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model’s behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
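Activation steering, the ingredient RepBend lifts into loss-based fine-tuning, is simple vector arithmetic at inference time; the sketch below shows only that inference-time form, with `v_harm` standing in for a harm-associated direction (how RepBend obtains and trains against such directions is its actual contribution):

```python
import torch

def steer_hidden_state(h, v_harm, alpha=1.0):
    """Shift hidden states away from a 'harm direction' with plain vector
    arithmetic; with alpha = 1 this removes the component of h along v.

    h:      (batch, seq, d) hidden states at some layer
    v_harm: (d,) direction associated with harmful behavior
    """
    v = v_harm / v_harm.norm()
    projection = (h @ v).unsqueeze(-1) * v   # component of h along v
    return h - alpha * projection            # bend representations away from v

h = torch.randn(2, 10, 512)
v = torch.randn(512)
h_safe = steer_hidden_state(h, v)
```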

[364] Large Language Models Engineer Too Many Simple Features For Tabular Data

Jaris Küken, Lennart Purucker, Frank Hutter

Main category: cs.LG

TL;DR: The paper investigates bias in LLMs for feature engineering, finding a preference for simple operators and proposing a method to detect such biases.

DetailsMotivation: To explore if LLMs exhibit biases in feature engineering that could hinder automated data science, despite not being ethically concerning.

Method: Proposes detecting anomalies in operator frequency suggested by LLMs for feature engineering, tested on four LLMs across 27 datasets.

Result: LLMs show bias toward simple operators (e.g., addition) and underuse complex ones, negatively impacting predictive performance.

Conclusion: Mitigating bias in LLMs for feature engineering is necessary to improve their utility in automated data science.

Abstract: Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method to detect potential biases by detecting anomalies in the frequency of operators (e.g., adding two features) suggested by LLMs when engineering new features. Our experiments evaluate the bias of four LLMs, two big frontier and two small open-source models, across 27 tabular datasets. Our results indicate that LLMs are biased toward simple operators, such as addition, and can fail to utilize more complex operators, such as grouping followed by aggregations. Furthermore, the bias can negatively impact the predictive performance when using LLM-generated features. Our results call for mitigating bias when using LLMs for feature engineering.
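A minimal sketch of the detection idea: count the operators in LLM-suggested features and compare against a reference distribution (the feature-string format and reference frequencies below are illustrative):

```python
from collections import Counter

def operator_frequency_bias(suggested_features, reference_freq):
    """Compare how often an LLM uses each operator against a reference
    distribution; large positive gaps flag over-used operators.

    suggested_features: list of strings like 'add(age, income)'
    reference_freq: dict operator -> expected relative frequency
    """
    counts = Counter(f.split("(")[0] for f in suggested_features)
    total = sum(counts.values())
    return {op: counts.get(op, 0) / total - reference_freq.get(op, 0.0)
            for op in set(counts) | set(reference_freq)}

features = ["add(age, income)", "add(height, weight)", "groupby_mean(city, income)"]
print(operator_frequency_bias(features, {"add": 0.3, "groupby_mean": 0.3}))
# positive gap for 'add' -> the simple-operator bias the paper reports
```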

[365] BMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection

Yize Zhou, Jie Zhang, Meijie Wang, Lun Yu

Main category: cs.LG

TL;DR: BMDetect is a multimodal deep learning framework for detecting academic misconduct in biomedical research, outperforming single-modality methods by integrating journal metadata, semantic embeddings, and GPT-4o-mined textual attributes.

DetailsMotivation: Addressing the challenges of algorithmic narrowness and fragmented pipelines in academic misconduct detection in biomedical research.

Method: Uses multimodal fusion of journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies).

Result: Achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields.

Conclusion: BMDetect advances scalable, interpretable tools for safeguarding research integrity in biomedical research.

Abstract: Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.

[366] ComFairGNN: Community Fair Graph Neural Network

Yonas Sium, Qi Li

Main category: cs.LG

TL;DR: The paper addresses bias in Graph Neural Networks (GNNs) by introducing a community-level fairness evaluation and a debiasing framework, ComFairGNN, which improves accuracy and fairness.

DetailsMotivation: GNNs often produce biased predictions due to node attributes and neighborhood structures, and existing fairness metrics are oversimplified, leading to misleading evaluations.

Method: The paper proposes a community-level strategy to measure bias and introduces ComFairGNN, a framework using a learnable coreset-based debiasing function to address bias in neighborhood aggregation.

Result: Evaluations on three benchmark datasets show ComFairGNN’s effectiveness in improving both accuracy and fairness.

Conclusion: The study highlights the importance of community-level fairness evaluation and presents ComFairGNN as a robust solution for mitigating bias in GNNs.

Abstract: Graph Neural Networks (GNNs) have become the leading approach for addressing graph analytical problems in various real-world scenarios. However, GNNs may produce biased predictions against certain demographic subgroups due to node attributes and neighbors surrounding a node. Most current research on GNN fairness focuses predominantly on debiasing GNNs using oversimplified fairness evaluation metrics, which can give a misleading impression of fairness. Understanding the potential evaluation paradoxes due to the complicated nature of the graph structure is crucial for developing effective GNN debiasing mechanisms. In this paper, we examine the effectiveness of current GNN debiasing methods in terms of unfairness evaluation. Specifically, we introduce a community-level strategy to measure bias in GNNs and evaluate debiasing methods at this level. Further, we introduce ComFairGNN, a novel framework designed to mitigate community-level bias in GNNs. Our approach employs a learnable coreset-based debiasing function that addresses bias arising from diverse local neighborhood distributions during GNN neighborhood aggregation. Comprehensive evaluations on three benchmark datasets demonstrate our model’s effectiveness in both accuracy and fairness metrics.

[367] The Pragmatic Frames of Spurious Correlations in Machine Learning: Interpreting How and Why They Matter

Samuel J. Bell, Skyler Wang

Main category: cs.LG

TL;DR: The paper explores how spurious correlations in ML are judged using pragmatic frames (relevance, generalizability, human-likeness, harmfulness) rather than just statistical definitions.

DetailsMotivation: To understand how spuriousness is negotiated in ML research beyond formal definitions, focusing on practical impacts and ethical considerations.

Method: Survey of ML literature to identify pragmatic frames for assessing spurious correlations.

Result: Identified four frames: relevance, generalizability, human-likeness, and harmfulness, showing spuriousness is a situated judgment.

Conclusion: Spuriousness in ML is context-dependent, shaped by technical, epistemic, and ethical factors, contributing to broader discussions on concept operationalization.

Abstract: Learning correlations from data forms the foundation of today’s machine learning (ML) and artificial intelligence (AI) research. While contemporary methods enable the automatic discovery of complex patterns, they are prone to failure when unintended correlations are captured. This vulnerability has spurred a growing interest in interrogating spuriousness, which is often seen as a threat to model performance, fairness, and robustness. In this article, we trace departures from the conventional statistical definition of spuriousness – which denotes a non-causal relationship arising from coincidence or confounding – to examine how its meaning is negotiated in ML research. Rather than relying solely on formal definitions, researchers assess spuriousness through what we call pragmatic frames: judgments based on what a correlation does in practice – how it affects model behavior, supports or impedes task performance, or aligns with broader normative goals. Drawing on a broad survey of ML literature, we identify four such frames: relevance (“Models should use correlations that are relevant to the task”), generalizability (“Models should use correlations that generalize to unseen data”), human-likeness (“Models should use correlations that a human would use to perform the same task”), and harmfulness (“Models should use correlations that are not socially or ethically harmful”). These representations reveal that correlation desirability is not a fixed statistical property but a situated judgment informed by technical, epistemic, and ethical considerations. By examining how a foundational ML conundrum is problematized in research literature, we contribute to broader conversations on the contingent practices through which technical concepts like spuriousness are defined and operationalized.

[368] Searching Latent Program Spaces

Matthew V Macfarlane, Clément Bonnet

Main category: cs.LG

TL;DR: LPN combines neural and symbolic methods for efficient program synthesis, outperforming traditional approaches on generalization tasks.

DetailsMotivation: Address the limitations of program synthesis (scaling issues) and deep learning (lack of structured adaptation) by integrating test-time search into neural models.

Method: Proposes Latent Program Network (LPN), which learns a latent space of implicit programs and searches it using gradients at test time.

Result: LPN outperforms or matches in-context learning and test-time training methods, doubling performance on out-of-distribution tasks with test-time search.

Conclusion: LPN effectively bridges symbolic and neural approaches, enabling efficient adaptation and generalization without predefined DSLs.

Abstract: General intelligence requires systems that acquire new skills efficiently and generalize beyond their training distributions. Although program synthesis approaches have strong generalization power, they face scaling issues due to large combinatorial spaces that quickly make them impractical and require human-generated DSLs or pre-trained priors to narrow this search space. On the other hand, deep learning methods have achieved great success, but they lack structured test-time adaptation and rely on heavy stochastic sampling or expensive gradient updates for fine-tuning. In this work, we propose the Latent Program Network (LPN), a new architecture that builds test-time search directly into neural models. LPN learns a latent space of implicit programs (neurally mapping inputs to outputs) through which it can search using gradients at test time. LPN combines the adaptability of symbolic approaches and the scalability of neural methods. It searches through a compact latent space at test time and bypasses the need for pre-defined domain-specific languages. On a range of programming-by-examples tasks, LPN outperforms or matches in-context learning and test-time training methods. Tested on the ARC-AGI benchmark, we demonstrate that LPN can both learn a compact program space and search through it at test time to adapt to novel tasks. LPN doubles its performance on out-of-distribution tasks when test-time search is switched on.
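A hedged sketch of the test-time search follows; the decoder signature `decoder(x, z)` and the plain-MSE objective are assumptions, and LPN's trained encoder/decoder and task losses are not reproduced here:

```python
import torch

def test_time_latent_search(decoder, examples, d_latent=64, steps=100, lr=0.1):
    """Find a latent z whose implicit program maps the support inputs to
    their outputs, by plain gradient descent in latent space.

    examples: list of (x, y) tensors defining the task
    """
    z = torch.zeros(d_latent, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = sum(((decoder(x, z) - y) ** 2).mean() for x, y in examples)
        opt.zero_grad(); loss.backward(); opt.step()
    return z.detach()  # apply decoder(x_query, z) to the held-out query

# toy decoder: a linear map of x modulated by z (purely illustrative)
W = torch.randn(64, 8)
decoder = lambda x, z: x * (z @ W)[: x.shape[-1]]
examples = [(torch.randn(8), torch.randn(8)) for _ in range(3)]
z_star = test_time_latent_search(decoder, examples)
```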

[369] Few-Shot Radar Signal Recognition through Self-Supervised Learning and Radio Frequency Domain Adaptation

Zi Huang, Simon Denman, Akila Pemasiri, Clinton Fookes, Terrence Martin

Main category: cs.LG

TL;DR: A self-supervised learning method using masked signal modeling and RF domain adaptation improves few-shot radar signal recognition in low-data scenarios.

DetailsMotivation: Addressing the challenge of scarce annotated RF data in electronic warfare, which limits deep learning effectiveness.

Method: Two-step approach: pre-training masked autoencoders on diverse RF signals, then transferring learned representations to radar signals.

Result: Achieves up to 17.5% improvement in 1-shot classification accuracy with in-domain pre-training and 16.31% with out-of-domain pre-training.

Conclusion: Sets a new benchmark for few-shot radar signal classification, demonstrating SSL’s effectiveness in low-data EW scenarios.

Abstract: Radar signal recognition (RSR) plays a pivotal role in electronic warfare (EW), as accurately classifying radar signals is critical for informing decision-making. Recent advances in deep learning have shown significant potential in improving RSR in domains with ample annotated data. However, these methods fall short in EW scenarios where annotated radio frequency (RF) data are scarce or impractical to obtain. To address these challenges, we introduce a self-supervised learning (SSL) method which utilises masked signal modelling and RF domain adaptation to perform few-shot RSR and enhance performance in environments with limited RF samples and annotations. We propose a two-step approach, first pre-training masked autoencoders (MAE) on baseband in-phase and quadrature (I/Q) signals from diverse RF domains, and then transferring the learned representations to the radar domain, where annotated data are scarce. Empirical results show that our lightweight self-supervised ResNet1D model with domain adaptation achieves up to a 17.5% improvement in 1-shot classification accuracy when pre-trained on in-domain signals (i.e., radar signals) and up to a 16.31% improvement when pre-trained on out-of-domain signals (i.e., comm signals), compared to its baseline without using SSL. We also present reference results for several MAE designs and pre-training strategies, establishing a new benchmark for few-shot radar signal classification.

[370] Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors

Jingyang Ke, Feiyang Wu, Jiyi Wang, Jeffrey Markowitz, Anqi Wu

Main category: cs.LG

TL;DR: SWIRL extends traditional IRL by incorporating history-dependent reward functions to model complex, long-term animal behaviors, outperforming non-history-dependent models.

DetailsMotivation: Traditional methods limit understanding of decision-making to short-term, explicit goals, ignoring intrinsic motivations and history-dependent behaviors in natural settings.

Method: SWIRL introduces time-varying, history-dependent reward functions to model transitions between short-term decision-making processes.

Result: SWIRL outperforms non-history-dependent models in simulated and real-world datasets, both quantitatively and qualitatively.

Conclusion: SWIRL advances IRL by incorporating history dependency, improving the modeling of naturalistic animal decision-making.

Abstract: Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.

[371] Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs

Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma

Main category: cs.LG

TL;DR: The paper introduces synthetic datasets based on PDEs to address data scarcity in spatio-temporal graph modeling, showcasing applications in disasters and hazards, and benchmarks ML models on epidemiological data.

DetailsMotivation: To bridge the gap between PDE-based physical processes and temporal graph ML by addressing data scarcity and enabling customized dataset creation.

Method: Creation of synthetic datasets using PDEs for spatio-temporal graph modeling, applied to epidemiology, atmospheric particles, and tsunami waves. Benchmarking ML models on the epidemiological dataset and demonstrating pre-training benefits.

Result: Successful creation of PDE-based datasets and improved model performance on real-world data through pre-training.

Conclusion: The work provides a framework for generating customizable datasets and benchmarks, advancing PDE-based temporal graph ML.

Abstract: Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, there are currently only a few existing datasets. Our work aims to make advancements in the field of PDE-modeling accessible to the temporal graph machine learning community, while addressing the data scarcity problem, by creating and utilizing datasets based on PDEs. In this work, we create and use synthetic datasets based on PDEs to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found at https://github.com/github-usr-ano/Temporal_Graph_Data_PDEs.

[372] Gram-Schmidt Methods for Unsupervised Feature Extraction and Selection

Bahram Yaghooti, Netanel Raviv, Bruno Sinopoli

Main category: cs.LG

TL;DR: The paper proposes a Gram-Schmidt (GS) type orthogonalization process for feature extraction and selection in unsupervised learning, addressing nonlinear dependencies. It outperforms state-of-the-art linear and some nonlinear methods.

DetailsMotivation: To tackle the challenge of feature extraction and selection in data with nonlinear dependencies, which is fundamental in unsupervised learning.

Method: Uses a GS-type orthogonalization process over function spaces to construct covariance matrices for identifying or removing dependencies. Provides linear feature extraction and selection algorithms.

Result: The method shows superior performance over state-of-the-art linear algorithms and competes with nonlinear methods like autoencoders, kernel PCA, and UMAP. It also generalizes a recent Fourier-based feature selection mechanism with reduced complexity.

Conclusion: The proposed GS-based approach effectively handles nonlinear dependencies, offering competitive or better performance than existing methods, and generalizes prior work with improved efficiency.

Abstract: Feature extraction and selection in the presence of nonlinear dependencies among the data is a fundamental challenge in unsupervised learning. We propose using a Gram-Schmidt (GS) type orthogonalization process over function spaces to detect and map out such dependencies. Specifically, by applying the GS process over some family of functions, we construct a series of covariance matrices that can either be used to identify new large-variance directions, or to remove those dependencies from known directions. In the former case, we provide information-theoretic guarantees in terms of entropy reduction. In the latter, we provide precise conditions by which the chosen function family eliminates existing redundancy in the data. Each approach provides both a feature extraction and a feature selection algorithm. Our feature extraction methods are linear, and can be seen as a natural generalization of principal component analysis (PCA). We provide experimental results for synthetic and real-world benchmark datasets which show superior performance over state-of-the-art (linear) feature extraction and selection algorithms. Surprisingly, our linear feature extraction algorithms are comparable to and often outperform several important nonlinear feature extraction methods such as autoencoders, kernel PCA, and UMAP. Furthermore, one of our feature selection algorithms strictly generalizes a recent Fourier-based feature selection mechanism (Heidari et al., IEEE Transactions on Information Theory, 2022), yet at significantly reduced complexity.
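
The core idea lends itself to a compact sketch. Below is a toy rendition (our illustration, not the paper's algorithm) of Gram-Schmidt-style feature selection over a small function family: each round keeps the feature with the most variance left after projecting out everything predictable, linearly or through the chosen nonlinear functions, from already-selected features.

```python
import numpy as np

# Toy Gram-Schmidt feature selection over a function family: greedily
# keep the feature whose variance survives after removing everything
# the already-selected features explain via the family.

def residualize(target, basis):
    """One L2 Gram-Schmidt step: project out the span of basis columns."""
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return target - basis @ coef

def gs_select(X, k, funcs=(lambda t: t, lambda t: t ** 2, np.sin)):
    n, d = X.shape
    selected = []
    basis = np.ones((n, 1))                 # constant term removes means
    for _ in range(k):
        resid_var = np.full(d, -np.inf)
        for j in set(range(d)) - set(selected):
            resid_var[j] = residualize(X[:, [j]], basis).var()
        best = int(np.argmax(resid_var))
        selected.append(best)
        # extend the basis with nonlinear functions of the chosen feature,
        # so later rounds discount nonlinear redundancy as well
        basis = np.column_stack([basis] + [f(X[:, best]) for f in funcs])
    return selected

X = np.random.randn(500, 6)
X[:, 3] = np.sin(X[:, 0])                   # nonlinearly redundant feature
print(gs_select(X, k=3))                    # tends to skip column 3
```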

[373] X Hacking: The Threat of Misguided AutoML

Rahul Sharma, Sergey Redyuk, Sumantrak Mukherjee, Andrea Šipka, Eyke Hüllermeier, Sebastian Vollmer, David Selby

Main category: cs.LG

TL;DR: The paper introduces X-hacking, a manipulation of XAI metrics like SHAP values, and demonstrates how automated pipelines can exploit model multiplicity to achieve desired explanations. It also explores detection and prevention methods.

DetailsMotivation: To highlight the vulnerability of XAI metrics to manipulation and the ethical implications for trust and reproducibility in AI.

Method: Formulates X-hacking as a multi-objective optimization problem, uses Bayesian optimization to accelerate it, and tests on real-world datasets.

Result: Bayesian optimization speeds up X-hacking 3-fold for susceptible features, and dataset vulnerability is linked to feature redundancy.

Conclusion: The paper calls for methods to detect and prevent X-hacking and discusses its ethical impact on XAI credibility.

Abstract: Explainable AI (XAI) and interpretable machine learning methods help to build trust in model predictions and derived insights, yet also present a perverse incentive for analysts to manipulate XAI metrics to support pre-specified conclusions. This paper introduces the concept of X-hacking, a form of p-hacking applied to XAI metrics such as SHAP values. We show how easily an automated machine learning pipeline can be adapted to exploit model multiplicity at scale: searching a Rashomon set of ‘defensible’ models with similar predictive performance to find a desired explanation. We formulate the trade-off between explanation and accuracy as a multi-objective optimisation problem, and illustrate empirically on familiar real-world datasets that, on average, Bayesian optimisation accelerates X-hacking 3-fold for features susceptible to it, versus random sampling. We show the vulnerability of a dataset to X-hacking can be determined by information redundancy among features. Finally, we suggest possible methods for detection and prevention, and discuss ethical implications for the credibility and reproducibility of XAI.
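
To see how little machinery the attack needs, here is a bare-bones version of the search loop, with sklearn feature importances standing in for SHAP values and random seeds standing in for the paper's Bayesian-optimised AutoML pipeline; the dataset, tolerance, and target feature are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bare-bones X-hacking loop: search many near-equivalent models (the
# Rashomon set) and report the one that most flatters a chosen feature.

X = np.random.randn(600, 5)
X[:, 1] = X[:, 0] + 0.1 * np.random.randn(600)   # redundant feature
y = (X[:, 0] + np.random.randn(600) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = []
for seed in range(30):                           # exploit model multiplicity
    m = RandomForestClassifier(n_estimators=50, random_state=seed)
    m.fit(X_tr, y_tr)
    candidates.append((m.score(X_te, y_te), m.feature_importances_[1], seed))

best_acc = max(acc for acc, _, _ in candidates)
rashomon = [c for c in candidates if c[0] >= best_acc - 0.02]  # 'defensible'
print(max(rashomon, key=lambda c: c[1]))  # seed that most flatters feature 1
```

Note that the redundant column is exactly what makes the attack work here: because feature 1 mirrors feature 0, different seeds split the credit between them differently, matching the paper's finding that vulnerability tracks information redundancy among features.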

[374] Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions

Tejas Jayashankar, J. Jon Ryu, Gregory Wornell

Main category: cs.LG

TL;DR: Score-of-Mixture Training (SMT) trains one-step generative models by minimizing α-skew Jensen-Shannon divergence, estimating scores of mixture distributions. It supports training from scratch or distillation (SMD) and performs competitively on benchmarks.

DetailsMotivation: To simplify and stabilize training of one-step generative models while maintaining or improving performance compared to existing methods.

Method: SMT minimizes α-skew Jensen-Shannon divergence by estimating scores of mixture distributions between real and fake samples across noise levels. Supports training from scratch (SMT) or distillation (SMD).

Result: Competitive or superior performance on CIFAR-10 and ImageNet 64x64 compared to existing methods.

Conclusion: SMT/SMD offers a simple, stable, and effective approach for training one-step generative models, with potential for outperforming current techniques.

Abstract: We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the $\alpha$-skew Jensen–Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even outperform existing methods.

[375] TorchCP: A Python Library for Conformal Prediction

Jianguo Huang, Jianqing Song, Xuanning Zhou, Bingyi Jing, Hongxin Wei

Main category: cs.LG

TL;DR: TorchCP is a PyTorch-native library for integrating conformal prediction (CP) into deep learning, offering guaranteed uncertainty estimation with modularity and scalability.

DetailsMotivation: Deep learning lacks reliable uncertainty estimation, making CP essential for robust predictions.

Method: TorchCP provides state-of-the-art CP algorithms, modular design, and GPU acceleration for tasks like classification and regression.

Result: TorchCP is widely adopted with extensive code, tests, and documentation, supporting diverse applications.

Conclusion: TorchCP bridges statistics and computer science, advancing CP in deep learning.

Abstract: Conformal prediction (CP) is a robust statistical framework that generates prediction intervals or sets with guaranteed coverage probability, addressing the challenge of quantifying predictive uncertainty in deep learning. Despite advancements in deep learning architectures and datasets, reliable uncertainty estimation remains elusive, making CP increasingly vital. This paper introduces TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP algorithms into deep learning tasks, including classification, regression, graph neural networks, and large language models. TorchCP offers a comprehensive suite of advanced methodologies, a modular design for easy customization, and full GPU-accelerated scalability. Released under the LGPL-3.0 license, TorchCP has gained widespread adoption with over 12,582 PyPI downloads. It is supported by approximately 16,132 lines of code, 564 unit tests achieving 100% coverage, and comprehensive documentation. By bridging statistics and computer science, TorchCP empowers researchers and practitioners to advance conformal prediction in diverse deep learning applications.
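
For readers new to conformal prediction, the following from-scratch sketch shows the split-conformal recipe that libraries of this kind implement for classification; it is written for intuition only and does not use TorchCP's actual API. The nonconformity score (one minus the true-class probability) and the miscoverage level alpha are standard choices, but choices nonetheless.

```python
import numpy as np

# Split conformal prediction for classification, from scratch:
# calibrate a score threshold on held-out data, then return, for each
# test point, the set of labels whose score falls under the threshold.

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # finite-sample-corrected quantile gives >= 1 - alpha coverage
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)   # stand-in softmax outputs
cal_labels = rng.integers(0, 10, size=500)
test_probs = rng.dirichlet(np.ones(10), size=5)
for s in conformal_sets(cal_probs, cal_labels, test_probs):
    print(s)  # each prediction set covers the true label w.p. ~90%
```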

[376] Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations

Stefan Balauca, Mark Niklas Müller, Yuhao Mao, Maximilian Baader, Marc Fischer, Martin Vechev

Main category: cs.LG

TL;DR: Gaussian Loss Smoothing (GLS) improves certified adversarial robustness training by addressing issues like discontinuity and non-smoothness in loss surfaces. Two GLS variants, PGPE and RGS, outperform state-of-the-art methods with tight relaxations.

DetailsMotivation: Training neural networks with high certified adversarial robustness is challenging due to issues like discontinuity and sensitivity in loss surfaces induced by tight relaxations.

Method: Proposes Gaussian Loss Smoothing (GLS) with two variants: PGPE (zeroth-order optimization) for non-differentiable relaxations and RGS (first-order optimization) for efficiency.

Result: GLS with tight relaxations surpasses state-of-the-art methods in certified adversarial robustness training.

Conclusion: GLS shows promise for training certifiably robust networks and enables leveraging tighter relaxations effectively.

Abstract: Training neural networks with high certified accuracy against adversarial examples remains an open challenge despite significant efforts. While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods, perhaps surprisingly, can perform worse than looser relaxations. Prior work hypothesized that this phenomenon is caused by the discontinuity, non-smoothness, and perturbation sensitivity of the loss surface induced by tighter relaxations. In this work, we theoretically show that applying Gaussian Loss Smoothing (GLS) on the loss surface can alleviate these issues. We confirm this empirically by instantiating GLS with two variants: a zeroth-order optimization algorithm, called PGPE, which allows training with non-differentiable relaxations, and a first-order optimization algorithm, called RGS, which requires gradients of the relaxation but is much more efficient than PGPE. Extensive experiments show that when combined with tight relaxations, these methods surpass state-of-the-art methods when training on the same network architecture for many settings. Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks and pave a path towards leveraging tighter relaxations for certified training.
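
The zeroth-order variant is easy to sketch. Below, a PGPE-style estimator optimizes the Gaussian-smoothed loss $L_s(\theta) = \mathbb{E}_{\epsilon \sim N(0,\sigma^2 I)}[L(\theta+\epsilon)]$ using antithetic perturbation pairs; the quadratic stand-in loss, population size, and step size are illustrative, and the paper applies this to certified-training losses rather than this toy.

```python
import torch

# Zeroth-order Gaussian Loss Smoothing in the spirit of PGPE: the loss
# is treated as a black box (it may be non-differentiable), and the
# gradient of the smoothed loss is estimated from antithetic samples.

def pgpe_step(params, loss_fn, sigma=0.05, pop=16, lr=1e-2):
    with torch.no_grad():
        grad = torch.zeros_like(params)
        for _ in range(pop):
            eps = sigma * torch.randn_like(params)
            # antithetic pair reduces estimator variance
            diff = loss_fn(params + eps) - loss_fn(params - eps)
            grad += diff / (2 * sigma ** 2) * eps
        params -= lr * grad / pop
    return params

theta = torch.randn(10)
loss = lambda p: ((p - 1.0) ** 2).sum()   # any black-box loss works
for _ in range(200):
    theta = pgpe_step(theta, loss)
print(theta)  # drifts towards the minimizer at 1.0
```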

[377] Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach

Jingwei Zhang, Mohammad Jalali, Cheuk Ting Li, Farzan Farnia

Main category: cs.LG

TL;DR: The paper proposes FINC, a spectral method for differential clustering to identify sample types generated differently by two generative models, using scalable Fourier-based techniques.

DetailsMotivation: Existing quantitative scores for comparing generative models lack nuance in revealing differences in sample types produced by each model.

Method: Develops FINC, a scalable spectral method using random Fourier features to estimate kernel covariance eigenspaces and identify dominant sample types.

Result: Demonstrates FINC’s scalability on large-scale datasets, effectively highlighting sample type frequency differences between models.

Conclusion: FINC provides a practical tool for nuanced comparison of generative models, with code available for public use.

Abstract: A fine-grained comparison of generative models requires the identification of sample types generated differently by each of the involved models. While quantitative scores have been proposed in the literature to rank different generative models, score-based evaluation and ranking do not reveal the nuanced differences between the generative models in producing different sample types. In this work, we propose solving a differential clustering problem to detect sample types generated differently by two generative models. To solve the differential clustering problem, we develop a spectral method called Fourier-based Identification of Novel Clusters (FINC) to identify modes produced by a generative model with a higher frequency in comparison to a reference distribution. FINC provides a scalable algorithm based on random Fourier features to estimate the eigenspace of kernel covariance matrices of two generative models and utilize the principal eigendirections to detect the sample types present more dominantly in each model. We demonstrate the application of the FINC method to large-scale computer vision datasets and generative modeling frameworks. Our numerical results suggest the scalability of the developed Fourier-based method in highlighting the sample types produced with different frequencies by generative models. The project code is available at https://github.com/buyeah1109/FINC.
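
The random-Fourier-feature scaffolding that FINC builds on can be shown in a few lines (an illustration of the estimation trick, not the authors' full differential-clustering method): approximate the Gaussian kernel's covariance eigenspace from samples of each model, then compare the dominant eigendirections. The bandwidth, feature count, and Gaussian stand-in data are assumptions.

```python
import numpy as np

# Random Fourier features approximate a Gaussian kernel, so the D x D
# feature covariance stands in for the kernel covariance operator; its
# top eigenvectors capture each model's dominant sample modes.

def rff(X, W, b):
    """Random Fourier feature map for the Gaussian kernel."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

d, D = 32, 512                            # data dim, number of features
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0, size=(d, D))    # scale ~ 1 / kernel bandwidth
b = rng.uniform(0, 2 * np.pi, size=D)

def top_eigvecs(X, k=5):
    Z = rff(X, W, b)
    C = Z.T @ Z / len(Z)                  # D x D kernel covariance estimate
    _, vecs = np.linalg.eigh(C)
    return vecs[:, -k:]                   # principal eigendirections

X_model_a = rng.normal(size=(2000, d))           # stand-ins for samples
X_model_b = rng.normal(loc=0.5, size=(2000, d))  # from two generative models
Va, Vb = top_eigvecs(X_model_a), top_eigvecs(X_model_b)
# singular values = cosines of principal angles between the eigenspaces;
# values near 1 mean the two models emphasize similar modes
print(np.linalg.svd(Va.T @ Vb, compute_uv=False))
```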

[378] Multi-View Node Pruning for Accurate Graph Representation

Jiseong Park, Hanjin Kim, Seojin Kim, Jueun Choi, Doheon Lee, Sung Ju Hwang

Main category: cs.LG

TL;DR: MVP is a multi-view pruning method for graph pooling that improves performance by considering node importance from diverse perspectives and reconstruction loss.

DetailsMotivation: Existing graph pooling methods often drop nodes based on attention scores, ignoring feature-level relevance to the task. MVP addresses this by incorporating multi-view frameworks and reconstruction loss.

Method: MVP constructs multiple graphs for different views, learns node scores using reconstruction and task loss, and integrates with hierarchical pooling frameworks.

Result: MVP significantly improves base pooling methods’ performance on benchmark datasets, outperforming baselines.

Conclusion: MVP’s success lies in multi-view encoding and reconstruction loss, effectively identifying less important nodes.

Abstract: Graph pooling, which compresses a whole graph into a smaller coarsened graph, is an essential component of graph representation learning. To efficiently compress a given graph, graph pooling methods often drop their nodes with attention-based scoring with the task loss. However, this often results in simply removing nodes with lower degrees without consideration of their feature-level relevance to the given task. To fix this problem, we propose a Multi-View Pruning(MVP), a graph pruning method based on a multi-view framework and reconstruction loss. Given a graph, MVP first constructs multiple graphs for different views either by utilizing the predefined modalities or by randomly partitioning the input features, to consider the importance of each node in diverse perspectives. Then, it learns the score for each node by considering both the reconstruction and the task loss. MVP can be incorporated with any hierarchical pooling framework to score the nodes. We validate MVP on multiple benchmark datasets by coupling it with two graph pooling methods, and show that it significantly improves the performance of the base graph pooling method, outperforming all baselines. Further analysis shows that both the encoding of multiple views and the consideration of reconstruction loss are the key to the success of MVP, and that it indeed identifies nodes that are less important according to domain knowledge.

[379] Group-wise oracle-efficient algorithms for online multi-group learning

Samuel Deng, Daniel Hsu, Jingwen Liu

Main category: cs.LG

TL;DR: The paper introduces oracle-efficient algorithms for online multi-group learning with sublinear regret, addressing large group families without explicit enumeration.

DetailsMotivation: To tackle scenarios where the family of groups is too large to enumerate explicitly, focusing on fairness applications with expressive demographic subpopulations.

Method: Designs oracle-efficient algorithms for three settings: i.i.d., adversarial with smoothed context distributions, and adversarial transductive.

Result: Achieves sublinear regret in all considered settings.

Conclusion: Proposes practical solutions for online multi-group learning with large group families, enhancing fairness in predictions.

Abstract: We study the problem of online multi-group learning, a learning model in which an online learner must simultaneously achieve small prediction regret on a large collection of (possibly overlapping) subsequences corresponding to a family of groups. Groups are subsets of the context space, and in fairness applications, they may correspond to subpopulations defined by expressive functions of demographic attributes. In contrast to previous work on this learning model, we consider scenarios in which the family of groups is too large to explicitly enumerate, and hence we seek algorithms that only access groups via an optimization oracle. In this paper, we design such oracle-efficient algorithms with sublinear regret under a variety of settings, including: (i) the i.i.d. setting, (ii) the adversarial setting with smoothed context distributions, and (iii) the adversarial transductive setting.

[380] LaCoOT: Layer Collapse through Optimal Transport

Victor Quétu, Zhu Liao, Nour Hezbri, Fabio Pizzati, Enzo Tartaglione

Main category: cs.LG

TL;DR: The paper introduces an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, using Max-Sliced Wasserstein distance for regularization, achieving better performance/depth trade-offs.

DetailsMotivation: Deep neural networks' high computational demands limit their deployment on resource-constrained devices. The paper aims to reduce this burden by optimizing network depth.

Method: Proposes a regularization strategy using Max-Sliced Wasserstein distance to minimize intermediate feature distribution distances, enabling layer removal.

Result: The method outperforms existing techniques in performance/depth trade-off, validated on image classification and generative models.

Conclusion: The approach effectively reduces computational burden while maintaining performance, with potential for broader adoption in resource-limited settings.

Abstract: Although deep neural networks are well-known for their outstanding performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle, posing energy-consumption issues and restricting their deployment on resource-constrained devices, which prevents their widespread adoption. In this paper, we present an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, achieving a better performance/depth trade-off compared to existing techniques. We assess the effectiveness of our method on traditional image classification setups and extend it to generative image models. Our code is available at https://github.com/VGCQ/LaCoOT.
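
A Monte-Carlo sketch of the regularizer's main ingredient is below: the Max-Sliced Wasserstein distance between two feature batches, with random unit directions standing in for the exact maximization over the sphere. Batch shapes and the number of directions are illustrative assumptions.

```python
import torch

# Approximate Max-Sliced W2 between two feature batches: project onto
# many random unit directions, compute the closed-form 1-D Wasserstein
# distance per direction, and take the worst direction.

def max_sliced_w2(a, b, n_dirs=256):
    """a, b: (n, d) feature batches, e.g. from two consecutive layers."""
    dirs = torch.randn(a.shape[1], n_dirs)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)     # unit directions
    pa, pb = a @ dirs, b @ dirs                      # 1-D projections
    # 1-D W2 is the L2 distance between sorted projections
    w2 = ((pa.sort(dim=0).values - pb.sort(dim=0).values) ** 2).mean(0)
    return w2.max().sqrt()

feat_l, feat_lp1 = torch.randn(128, 64), torch.randn(128, 64)
print(max_sliced_w2(feat_l, feat_lp1))  # small value => layer is removable
```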

[381] Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks

Lukas Gosch, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar, Stephan Günnemann

Main category: cs.LG

TL;DR: The paper certifies Graph Neural Networks (GNNs) against poisoning attacks using white-box methods, leveraging neural tangent kernel and a novel bilevel optimization reformulation.

DetailsMotivation: To address the vulnerability of GNNs to data poisoning and provide certifiable robustness guarantees.

Method: Uses neural tangent kernel for training dynamics and reformulates poisoning as a mixed-integer linear program.

Result: Provides insights into graph structure’s impact on robustness for convolution-based and PageRank-based GNNs.

Conclusion: The framework is general and the first to offer white-box poisoning certificates for neural networks, extending beyond graph tasks.

Abstract: Generalization of machine learning models can be severely compromised by data poisoning, where adversarial changes are applied to the training data. This vulnerability has led to interest in certifying (i.e., proving) that such changes up to a certain magnitude do not affect test predictions. We, for the first time, certify Graph Neural Networks (GNNs) against poisoning attacks, including backdoors, targeting the node features of a given graph. Our certificates are white-box and based upon $(i)$ the neural tangent kernel, which characterizes the training dynamics of sufficiently wide networks; and $(ii)$ a novel reformulation of the bilevel optimization problem describing poisoning as a mixed-integer linear program. Consequently, we leverage our framework to provide fundamental insights into the role of graph structure and its connectivity on the worst-case robustness behavior of convolution-based and PageRank-based GNNs. We note that our framework is more general and constitutes the first approach to derive white-box poisoning certificates for NNs, which can be of independent interest beyond graph-related tasks.

[382] Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Embedding

Haiping Liu, Lijing Lin, Jingyuan Sun, Zhegong Shangguan, Mauricio A. Alvarez, Hongpeng Zhou

Main category: cs.LG

TL;DR: The paper proposes a mathematical framework for Rotary Position Embedding (RoPE) using Lie group theory, unifying its application in higher-dimensional domains like 2D images.

DetailsMotivation: Existing RoPE designs lack a unified theoretical framework, especially for higher-dimensional inputs.

Method: The framework is grounded in Lie group and Lie algebra theory, deriving conditions for valid RoPE based on relativity and reversibility. It characterizes RoPE as a basis of a maximal abelian subalgebra.

Result: The study shows RoPE can be resolved by learning an orthogonal transformation, balancing inter-dimensional interactions with local structure preservation.

Conclusion: The framework unifies existing RoPE designs and enables principled extensions to higher-dimensional tasks.

Abstract: Rotary Position Embedding (RoPE) is widely adopted in large language models (LLMs) due to its efficient encoding of relative positions with strong extrapolation capabilities. However, while its application in higher-dimensional input domains, such as 2D images, has been explored in several attempts, a unified theoretical framework is still lacking. To address this, we propose a systematic mathematical framework for RoPE grounded in Lie group and Lie algebra theory. We derive the necessary and sufficient conditions for any valid $N$-dimensional RoPE based on two core properties of RoPE: relativity and reversibility. We demonstrate that RoPE can be characterized as a basis of a maximal abelian subalgebra (MASA) in the special orthogonal Lie algebra, and that the commonly used axis-aligned block-diagonal RoPE, where each input axis is encoded by an independent 2x2 rotation block, corresponds to the maximal toral subalgebra. Furthermore, we reduce spatial inter-dimensional interactions to a change of basis, resolved by learning an orthogonal transformation. Our experiment results suggest that inter-dimensional interactions should be balanced with local structure preservation. Overall, our framework unifies and explains existing RoPE designs while enabling principled extensions to higher-dimensional modalities and tasks.
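
For reference, here is the axis-aligned block-diagonal case the paper identifies with the maximal toral subalgebra: standard 1-D RoPE, where each pair of channels is rotated by an angle proportional to the position. The sketch uses the common rotate-half channel pairing; the original formulation pairs adjacent channels instead, which differs only by a permutation.

```python
import torch

# Standard 1-D RoPE: every channel pair (x1_i, x2_i) is rotated by
# position * freq_i, i.e. a block-diagonal matrix of 2x2 rotations.

def rope(x, base=10000.0):
    """x: (seq_len, dim) with even dim; returns position-rotated x."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # 2x2 rotation applied to each (x1_i, x2_i) channel pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)
print(rope(q).shape)  # only relative positions survive in dot products
```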

[383] Contrast All the Time: Learning Time Series Representation from Temporal Consistency

Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor

Main category: cs.LG

TL;DR: CaTT introduces an unsupervised contrastive learning method for time series by contrasting all time steps in parallel, improving efficiency and downstream task performance.

DetailsMotivation: To enhance representation learning for time series by leveraging temporal dynamics more effectively than existing contrastive methods.

Method: CaTT uses a scalable NT-pair formulation to contrast all time steps in parallel, avoiding data augmentations or pair selection heuristics.

Result: Produces superior embeddings, improves downstream task performance, and trains faster than other contrastive approaches.

Conclusion: CaTT is efficient, scalable, and suitable for large-scale time series applications.

Abstract: Representation learning for time series using contrastive learning has emerged as a critical technique for improving the performance of downstream tasks. To advance this effective approach, we introduce CaTT (Contrast All The Time), a new approach to unsupervised contrastive learning for time series, which takes advantage of dynamics between temporally similar moments more efficiently and effectively than existing methods. CaTT departs from conventional time-series contrastive approaches that rely on data augmentations or selected views. Instead, it uses the full temporal dimension by contrasting all time steps in parallel. This is made possible by a scalable NT-pair formulation, which extends the classic N-pair loss across both batch and temporal dimensions, making the learning process end-to-end and more efficient. CaTT learns directly from the natural structure of temporal data, using repeated or adjacent time steps as implicit supervision, without the need for pair selection heuristics. We demonstrate that this approach produces superior embeddings which allow better performance in downstream tasks. Additionally, training is faster than other contrastive learning approaches, making it suitable for large-scale and real-world time series applications. The source code is publicly available at https://github.com/sfi-norwai/CaTT.

[384] Compositional Flows for 3D Molecule and Synthesis Pathway Co-design

Tony Shen, Seonghwan Seo, Ross Irwin, Kieran Didi, Simon Olsson, Woo Youn Kim, Martin Ester

Main category: cs.LG

TL;DR: CGFlow extends flow matching to generate compositional objects with continuous features, achieving state-of-the-art results in drug design.

DetailsMotivation: To improve generative modeling of compositional objects with continuous features, particularly in synthesizable drug design.

Method: Extends flow matching to model compositional state transitions and integrates GFlowNets for reward-guided sampling.

Result: Achieves top binding affinity on LIT-PCBA and CrossDocked benchmarks, with 5.8x sampling efficiency improvement.

Conclusion: CGFlow is a novel, efficient framework for compositional generative tasks, excelling in drug design.

Abstract: Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features. Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process. We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule’s synthetic pathway with its 3D binding pose. Our approach achieves state-of-the-art binding affinity on all 15 targets from the LIT-PCBA benchmark, and a 5.8$\times$ improvement in sampling efficiency compared to a 2D synthesis-based baseline. To the best of our knowledge, our method is also the first to achieve state-of-the-art performance in both Vina Dock (-9.38) and AiZynth success rate (62.2%) on the CrossDocked benchmark.
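
As background for the "straightforward extension" claim, the snippet below implements plain conditional flow matching on a continuous state: sample a time $t$, interpolate linearly between noise and data, and regress the model's velocity onto the constant target $x_1 - x_0$. The small MLP and 3-dimensional state are stand-ins, not the paper's joint pathway/pose model.

```python
import torch

# Plain conditional flow matching on a continuous state (the base
# interpolation process that CGFlow extends to compositional steps).

def fm_loss(v_net, x0, x1):
    """Regress the model velocity onto the straight-line target x1 - x0."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1          # linear interpolation path
    target = x1 - x0                    # its constant velocity
    return ((v_net(xt, t) - target) ** 2).mean()

class VNet(torch.nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d + 1, 128), torch.nn.SiLU(),
            torch.nn.Linear(128, d))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

d = 3                                    # e.g. one atom's coordinates
v_net = VNet(d)
x0, x1 = torch.randn(64, d), torch.randn(64, d)  # noise and data samples
loss = fm_loss(v_net, x0, x1)
loss.backward()               # train; then sample by integrating v_net in t
```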

[385] Learning from Label Proportions and Covariate-shifted Instances

Sagalpreet Singh, Navodita Sharma, Shreyas Havaldar, Rishi Saket, Aravindan Raghuveer

Main category: cs.LG

TL;DR: The paper addresses the hybrid LLP problem, combining covariate-shifted supervised source data with target bag-labels to improve instance-level prediction.

DetailsMotivation: Leverage fully supervised but covariate-shifted source data alongside target bag-labels for better predictive performance in the LLP setting.

Method: Develop hybrid LLP methods incorporating target bag-labels and source instance-labels within a domain adaptation framework.

Result: Theoretical guarantees for target generalization error and experimental outperformance over LLP and domain adaptation baselines.

Conclusion: Proposed hybrid LLP methods effectively utilize source data and target bag-labels, improving predictive performance.

Abstract: In many applications, especially due to lack of supervision or privacy concerns, the training data is grouped into bags of instances (feature-vectors) and for each bag we have only an aggregate label derived from the instance-labels in the bag. In learning from label proportions (LLP) the aggregate label is the average of the instance-labels in a bag, and a significant body of work has focused on training models in the LLP setting to predict instance-labels. In practice however, the training data may have fully supervised albeit covariate-shifted source data, along with the usual target data with bag-labels, and we wish to train a good instance-level predictor on the target domain. We call this the covariate-shifted hybrid LLP problem. Fully supervised covariate shifted data often has useful training signals and the goal is to leverage them for better predictive performance in the hybrid LLP setting. To achieve this, we develop methods for hybrid LLP which naturally incorporate the target bag-labels along with the source instance-labels, in the domain adaptation framework. Apart from proving theoretical guarantees bounding the target generalization error, we also conduct experiments on several publicly available datasets showing that our methods outperform LLP and domain adaptation baselines as well as techniques from previous related work.
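
The bag-level training signal at the heart of LLP fits in a few lines; the sketch below is a generic formulation, not the paper's hybrid method, which additionally exploits the covariate-shifted source instance-labels. It matches each bag's mean predicted probability to the bag's observed label proportion.

```python
import torch

# Generic LLP objective: per bag, compare the mean predicted positive
# probability to the bag's label proportion via cross-entropy.

def bag_proportion_loss(model, bags, proportions):
    losses = []
    for bag, prop in zip(bags, proportions):
        p_mean = torch.sigmoid(model(bag)).mean()
        losses.append(torch.nn.functional.binary_cross_entropy(
            p_mean.unsqueeze(0), prop.unsqueeze(0)))
    return torch.stack(losses).mean()

model = torch.nn.Linear(10, 1)
bags = [torch.randn(32, 10) for _ in range(4)]    # 4 bags of instances
proportions = torch.tensor([0.1, 0.4, 0.7, 0.9])  # bag-level labels only
loss = bag_proportion_loss(model, bags, proportions)
loss.backward()
```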

[386] Rethinking the Foundations for Continual Reinforcement Learning

Esraa Elelimy, David Szepesvari, Martha White, Michael Bowling

Main category: cs.LG

TL;DR: The paper critiques traditional reinforcement learning foundations for continual reinforcement learning, identifying four unsuitable pillars, and proposes a new formalism with revised metrics and approaches.

DetailsMotivation: To evaluate if traditional reinforcement learning foundations align with continual learning goals and propose alternatives for better suitability.

Method: Examines four key pillars of traditional reinforcement learning, identifies their limitations for continual learning, and introduces a new formalism with history process and deviation regret.

Result: Identifies mismatches between traditional and continual learning paradigms and proposes a revised formalism and metrics.

Conclusion: Traditional reinforcement learning foundations are unsuitable for continual learning; a new formalism and metrics are proposed to better align with continual learning goals.

Abstract: In the traditional view of reinforcement learning, the agent’s goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, the learning ends. This view contrasts with \emph{continual reinforcement learning}, where learning does not end, and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two paradigms of learning, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of the traditional reinforcement learning foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and the third foundations and replaces them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shed the other two foundations.

[387] Improving sub-seasonal wind-speed forecasts in Europe with a non-linear model

Ganglin Tian, Camille Le Coz, Anastase Alexandre Charantonis, Alexis Tantet, Naveen Goutham, Riwal Plougonven

Main category: cs.LG

TL;DR: The study improves sub-seasonal wind speed forecasts in Europe by leveraging non-linear relationships between 500 hPa geopotential height (Z500) and surface wind speed, using MLR and CNN models. CNN outperforms MLR due to non-linearity, and stochastic perturbations address under-dispersive issues.

DetailsMotivation: Sub-seasonal wind speed forecasts are crucial for wind power planning, but forecast skills decline after two weeks. Large-scale variables like Z500 offer better predictability, motivating the use of non-linear relationships to enhance forecasts.

Method: The study employs MLR and CNN to regress surface wind speed from Z500. Models are evaluated on ERA5 reanalysis and sub-seasonal forecasts, with stochastic perturbations introduced to address under-dispersive behavior.

Result: CNN performs better than MLR due to non-linearity. Perturbed CNN excels in early weeks, while perturbed MLR catches up after two weeks. Stochastic perturbations improve model spread.

Conclusion: Non-linearity and stochastic perturbations enhance sub-seasonal wind speed forecasts, with CNN’s advantage diminishing over time. The approach addresses under-dispersive issues in statistical models.

Abstract: Sub-seasonal wind speed forecasts provide valuable guidance for wind power system planning and operations, yet the forecast skills of surface winds decrease sharply after two weeks. However, large-scale variables exhibit greater predictability on this time scale. This study explores the potential of leveraging non-linear relationships between 500 hPa geopotential height (Z500) and surface wind speed to improve sub-seasonal wind speed forecast skills in Europe. Our proposed framework uses a Multiple Linear Regression (MLR) or a Convolutional Neural Network (CNN) to regress surface wind speed from Z500. Evaluations on ERA5 reanalysis indicate that the CNN performs better due to its non-linearity. Applying these models to sub-seasonal forecasts from the European Centre for Medium-Range Weather Forecasts, various verification metrics demonstrate the advantages of non-linearity. Yet, this is partly explained by the fact that these statistical models are under-dispersive since they explain only a fraction of the target variable variance. Introducing stochastic perturbations to represent the stochasticity of the unexplained part from the signal helps compensate for this issue. Results show that the perturbed CNN performs better than the perturbed MLR only in the first weeks, while the perturbed MLR’s performance converges towards that of the perturbed CNN after two weeks. The study finds that introducing stochastic perturbations can address the issue of insufficient spread in these statistical models, with improvements from the non-linearity varying with the lead time of the forecasts.

[388] DFRot: Achieving Outlier-Free and Massive Activation-Free for Rotated LLMs with Refined Rotation

Jingyang Xiang, Sai Qian Zhang

Main category: cs.LG

TL;DR: The paper introduces DFRot, a method to improve quantization in LLMs by rotating activation and weight matrices, focusing on reducing outliers and massive activations.

DetailsMotivation: Prior methods like randomized Hadamard transforms outperform orthogonal transforms in low-precision quantization, but the reason was unclear. The study aims to address this gap and optimize quantization for rare but critical tokens.

Method: Proposes a weighted loss function and an alternating optimization strategy for the rotation matrix, using orthogonal Procrustes transforms to refine it. The goal is to make activation distributions more quantization-friendly.

Result: DFRot achieves significant perplexity improvements (0.98 and 0.95) on challenging models like LLaMA3-70B, even with minimal tuning.

Conclusion: DFRot effectively addresses long-tail optimization in quantization, enhancing model accuracy and efficiency, especially for tokens with massive activations.

Abstract: Rotating the activation and weight matrices to reduce the influence of outliers in large language models (LLMs) has recently attracted significant attention, particularly in the context of model quantization. Prior studies have shown that in low-precision quantization scenarios, such as 4-bit weights and 4-bit activations (W4A4), randomized Hadamard transforms can achieve significantly higher accuracy than randomized orthogonal transforms. Notably, the reason behind this phenomenon remains unknown. In this paper, we find that these transformations show substantial improvement in eliminating outliers for common tokens and achieve similar quantization error. The primary reason for the accuracy difference lies in the fact that randomized Hadamard transforms can slightly reduce the quantization error for tokens with massive activations while randomized orthogonal transforms increase the quantization error. Due to the extreme rarity of these tokens and their critical impact on model accuracy, we consider this a long-tail optimization problem, and therefore construct a simple yet effective method: a weighted loss function. Additionally, we propose an optimization strategy for the rotation matrix that involves alternating optimization of quantization parameters while employing orthogonal Procrustes transforms to refine the rotation matrix. This makes the distribution of the rotated activation values more conducive to quantization, especially for tokens with massive activations. Our method enhances rotated LLMs by achieving dual freedom, being both Outlier-Free and Massive Activation-Free, and is hence dubbed DFRot. Extensive experiments demonstrate the effectiveness and efficiency of DFRot. By tuning the rotation matrix using just a single sample, DFRot achieves a perplexity improvement of 0.98 and 0.95 on W4A4KV4 and W4A4KV16, respectively, for LLaMA3-70B, a model known for its quantization challenges.
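
For context, the randomized Hadamard transform that DFRot starts from is a standard construction, sketched below: a Sylvester Hadamard matrix scaled to be orthogonal, with random column signs to decorrelate it from activation outliers. The refinement step the paper adds (the weighted loss plus orthogonal Procrustes updates) is not shown.

```python
import torch

# Randomized Hadamard rotation: orthogonal after scaling by 1/sqrt(n),
# with random +-1 column signs. Applying the same rotation to weights
# and activations leaves the layer's output unchanged.

def randomized_hadamard(n):
    """Return an n x n randomized Hadamard matrix, n a power of two."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:                 # Sylvester doubling
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    signs = torch.randint(0, 2, (n,)).float() * 2 - 1
    return (H / n ** 0.5) * signs         # random +-1 per column

R = randomized_hadamard(64)
print(torch.allclose(R @ R.T, torch.eye(64), atol=1e-5))  # orthogonal
# with x_rot = x @ R and W_rot = W @ R, x_rot @ W_rot.T == x @ W.T
```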

[389] Temporal Chunking Enhances Recognition of Implicit Sequential Patterns

Jayanta Dey, Nicholas Soures, Miranda Gonzales, Itamar Lerner, Christopher Kanan, Dhireesha Kudithipudi

Main category: cs.LG

TL;DR: A neuro-inspired method compresses temporal sequences into context-tagged chunks during an offline sleep phase, improving learning efficiency and addressing limitations of traditional RNNs in multi-timescale tasks.

DetailsMotivation: To overcome the limitations of traditional neural networks like RNNs in handling temporal patterns across multiple timescales and enhance learning efficiency in resource-constrained settings.

Method: Proposes temporal chunking, where sequences are compressed into context-tagged chunks during an offline phase, tested in synthetic and human pilot studies.

Result: Preliminary results show improved learning efficiency, with evidence of context tag transferability across related tasks.

Conclusion: The study provides early proof-of-concept for temporal chunking, suggesting potential for future applications in transfer learning.

Abstract: In this pilot study, we propose a neuro-inspired approach that compresses temporal sequences into context-tagged chunks, where each tag represents a recurring structural unit or “community” in the sequence. These tags are generated during an offline sleep phase and serve as compact references to past experience, allowing the learner to incorporate information beyond its immediate input range. We evaluate this idea in a controlled synthetic environment designed to reveal the limitations of traditional neural-network-based sequence learners, such as recurrent neural networks (RNNs), when facing temporal patterns on multiple timescales. Our results, while preliminary, suggest that temporal chunking can significantly enhance learning efficiency under resource-constrained settings. A small-scale human pilot study using a Serial Reaction Time Task further motivates the idea of structural abstraction. Although limited to synthetic tasks, this work serves as an early proof-of-concept, with initial evidence that learned context tags can transfer across related tasks, offering potential for future applications in transfer learning.

[390] Generalising Battery Control in Net-Zero Buildings via Personalised Federated RL

Nicolas M Cuadrado Avila, Samuel Horváth, Martin Takáč

Main category: cs.LG

TL;DR: The paper proposes a privacy-preserving framework for optimal energy management in microgrids using federated RL (TRPO and PPO), showing comparable performance to state-of-the-art methods without hyperparameter tuning.

DetailsMotivation: To address the challenge of efficient energy management in microgrids while ensuring privacy and reducing costs and emissions.

Method: Evaluated PPO and TRPO in collaborative setups using a customized CityLearn environment and synthetic data to simulate net-zero energy scenarios.

Result: Federated TRPO performs comparably to state-of-the-art federated RL methods without hyperparameter tuning, achieving efficient energy management.

Conclusion: Collaborative learning is feasible for optimal control in energy systems, advancing sustainable smart grids.

Abstract: This work studies the challenge of optimal energy management in building-based microgrids through a collaborative and privacy-preserving framework. We evaluated two common RL algorithms (PPO and TRPO) in different collaborative setups to manage distributed energy resources (DERs) efficiently. Using a customized version of the CityLearn environment and synthetically generated data, we simulate and design net-zero energy scenarios for microgrids composed of multiple buildings. Our approach emphasizes reducing energy costs and carbon emissions while ensuring privacy. Experimental results demonstrate that Federated TRPO is comparable with state-of-the-art federated RL methodologies without hyperparameter tuning. The proposed framework highlights the feasibility of collaborative learning for achieving optimal control policies in energy systems, advancing the goals of sustainable and efficient smart grids. Our code is accessible at https://github.com/Optimization-and-Machine-Learning-Lab/energy_fed_trpo.git.

[391] Matrix Is All You Need

Yuzhou Zhu

Main category: cs.LG

TL;DR: A unified matrix-order framework simplifies neural architectures by representing convolutional, recurrent, and self-attention operations as sparse matrix multiplications, matching or outperforming native models.

DetailsMotivation: To address the proliferation of specialized neural architectures by uncovering their underlying commonalities through a unified mathematical framework.

Method: Introduces a sparse matrix multiplication approach for convolutional, recurrent, and self-attention operations, proving algebraic isomorphism with standard CNN, RNN, and Transformer layers.

Result: Empirical evaluations show sparse-matrix formulations match or exceed native model performance across tasks like image classification, time-series forecasting, and language modeling.

Conclusion: The framework provides a rigorous mathematical foundation for diverse architectures, enabling hardware-aware design and leveraging algebraic optimization.

Abstract: Deep neural networks employ specialized architectures for vision, sequential and language tasks, yet this proliferation obscures their underlying commonalities. We introduce a unified matrix-order framework that casts convolutional, recurrent and self-attention operations as sparse matrix multiplications. Convolution is realized via an upper-triangular weight matrix performing first-order transformations; recurrence emerges from a lower-triangular matrix encoding stepwise updates; attention arises naturally as a third-order tensor factorization. We prove algebraic isomorphism with standard CNN, RNN and Transformer layers under mild assumptions. Empirical evaluations on image classification (MNIST, CIFAR-10/100, Tiny ImageNet), time-series forecasting (ETTh1, Electricity Load Diagrams) and language modeling/classification (AG News, WikiText-2, Penn Treebank) confirm that sparse-matrix formulations match or exceed native model performance while converging in comparable or fewer epochs. By reducing architecture design to sparse pattern selection, our matrix perspective aligns with GPU parallelism and leverages mature algebraic optimization tools. This work establishes a mathematically rigorous substrate for diverse neural architectures and opens avenues for principled, hardware-aware network design.
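
The paper's unifying claim is easy to verify in miniature for the convolutional case: a 1-D "valid" convolution is exactly multiplication by a sparse banded (Toeplitz) matrix, as the assertion below checks numerically.

```python
import numpy as np

# Convolution as matrix multiplication: each output row holds the
# flipped kernel shifted one position to the right (a banded Toeplitz
# structure), so conv(x) == M @ x.

def conv_matrix(kernel, n):
    """Banded matrix whose product realizes 'valid' 1-D convolution."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = kernel[::-1]     # flipped kernel = convolution
    return M

x = np.random.randn(10)
kernel = np.array([1.0, -2.0, 1.0])
assert np.allclose(conv_matrix(kernel, len(x)) @ x,
                   np.convolve(x, kernel, mode="valid"))
```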

[392] Are DeepSeek R1 And Other Reasoning Models More Faithful?

James Chua, Owain Evans

Main category: cs.LG

TL;DR: Reasoning models (e.g., Qwen-2.5, Gemini-2, DeepSeek-V3-Base) are more faithful in describing cue influences than non-reasoning models, with DeepSeek-R1 outperforming non-reasoning models (59% vs. 7%). Reward models may reduce faithfulness. Limitations include artificial tasks and narrow faithfulness measurement.

DetailsMotivation: To evaluate whether reasoning models' Chains of Thought (CoTs) are more faithful than traditional models in describing how cues influence their answers.

Method: Tested three reasoning models and non-reasoning models (e.g., Claude-3.5-Sonnet, GPT-4o) on a faithfulness test involving seven cue types (e.g., misleading examples, suggestive questions). Measured how often models described cue influences.

Result: Reasoning models described cue influences more reliably (e.g., DeepSeek-R1: 59%) than non-reasoning models (e.g., 7%). Reward models may reduce faithfulness.

Conclusion: Reasoning models show promise for explainability, but limitations (artificial tasks, narrow measurement) suggest need for broader future research.

Abstract: Language models trained to solve reasoning tasks via reinforcement learning have achieved striking results. We refer to these models as reasoning models. Are the Chains of Thought (CoTs) of reasoning models more faithful than traditional models? We evaluate three reasoning models (based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base) on an existing test of faithful CoT. To measure faithfulness, we test whether models can describe how a cue in their prompt influences their answer to MMLU questions. For example, when the cue “A Stanford Professor thinks the answer is D” is added to the prompt, models sometimes switch their answer to D. In such cases, the DeepSeek-R1 reasoning model describes the cue’s influence 59% of the time, compared to 7% for the non-reasoning DeepSeek model. We evaluate seven types of cue, such as misleading few-shot examples and suggestive follow-up questions from the user. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested (including Claude-3.5-Sonnet and GPT-4o). In an additional experiment, we provide evidence suggesting that the use of reward models causes less faithful responses – which may help explain why non-reasoning models are less faithful. Our study has two main limitations. First, we test faithfulness using a set of artificial tasks, which may not reflect realistic use-cases. Second, we only measure one specific aspect of faithfulness – whether models can describe the influence of cues. Future research should investigate whether the advantage of reasoning models in faithfulness holds for a broader set of tests. Still, we think this increase in faithfulness is promising for the explainability of language models.

[393] Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Milind Naphade, Sambit Sahu, Irina Rish, Ekaterina Lobacheva

Main category: cs.LG

TL;DR: Scaling mitigates loss deceleration in language models by reducing its occurrence and improving post-deceleration loss rates, attributed to zero-sum learning dynamics.

DetailsMotivation: To understand how scaling improves language models, particularly in training dynamics and loss behavior.

Method: Analyzed loss deceleration and zero-sum learning (ZSL) in language models, studying their impact on training dynamics.

Result: Scaling decreases loss deceleration occurrence and improves post-deceleration loss rates, linked to ZSL dynamics.

Conclusion: Loss deceleration and ZSL offer insights into training dynamics, potentially improving models independent of scale.

Abstract: This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
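
The ZSL definition suggests a simple diagnostic, sketched below as our illustration rather than the paper's measurement protocol: compute per-example gradients and check whether their pairwise cosine similarities concentrate below zero, indicating destructive interference.

```python
import torch

# Per-example gradient opposition as a zero-sum-learning diagnostic:
# if gradients are systematically opposed, the mean off-diagonal
# cosine similarity drops below zero.

def pairwise_grad_cosine(model, loss_fn, xs, ys):
    grads = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads.append(torch.cat([p.grad.flatten()
                                for p in model.parameters()]))
    G = torch.nn.functional.normalize(torch.stack(grads), dim=1)
    C = G @ G.T                                  # cosine similarity matrix
    off_diag = C[~torch.eye(len(xs), dtype=torch.bool)]
    return off_diag.mean()  # << 0 suggests destructive interference

model = torch.nn.Linear(5, 1)
xs, ys = torch.randn(8, 5), torch.randn(8, 1)
print(pairwise_grad_cosine(model, torch.nn.functional.mse_loss, xs, ys))
```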

[394] An All-digital 8.6-nJ/Frame 65-nm Tsetlin Machine Image Classification Accelerator

Svein Anders Tunheim, Yujin Zheng, Lei Jiao, Rishad Shafik, Alex Yakovlev, Ole-Christoffer Granmo

Main category: cs.LG

TL;DR: An all-digital programmable machine learning accelerator chip for image classification, based on Tsetlin machine principles, achieves high energy-efficiency and accuracy on datasets like MNIST.

DetailsMotivation: To demonstrate the energy-efficiency of the Tsetlin machine (TM) by developing a dedicated hardware accelerator for image classification.

Method: The accelerator implements a coalesced TM version with convolution, using 128 clauses in a parallel architecture. Clause evaluation is optimized by storing weights and Tsetlin automata signals in registers.

Result: The chip, fabricated in 65 nm CMOS, achieves 60.3k classifications per second at 8.6 nJ per classification, with accuracies of 97.42% (MNIST), 84.54% (Fashion-MNIST), and 82.55% (Kuzushiji-MNIST).

Conclusion: The TM-based accelerator is energy-efficient and matches software model accuracies, validating its potential for hardware deployment.

Abstract: We present an all-digital programmable machine learning accelerator chip for image classification, built on Tsetlin machine (TM) principles. The TM is an emerging machine learning algorithm founded on propositional logic, utilizing sub-pattern recognition expressions called clauses. The accelerator implements the coalesced TM version with convolution, and classifies booleanized images of 28$\times$28 pixels with 10 categories. A configuration with 128 clauses is used in a highly parallel architecture. Fast clause evaluation is achieved by keeping all clause weights and Tsetlin automata (TA) action signals in registers. The chip is implemented in a 65 nm low-leakage CMOS technology, and occupies an active area of 2.7 mm$^2$. At a clock frequency of 27.8 MHz, the accelerator achieves 60.3k classifications per second, and consumes 8.6 nJ per classification. This demonstrates the energy-efficiency of the TM, which was the main motivation for developing this chip. The latency for classifying a single image is 25.4 $\mu$s which includes system timing overhead. The accelerator achieves 97.42%, 84.54% and 82.55% test accuracies for the datasets MNIST, Fashion-MNIST and Kuzushiji-MNIST, respectively, matching the TM software models.
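
The clause logic the chip parallelizes is simple enough to state in software (a bit-level sketch; on the chip, TA include/exclude actions sit in registers and all 128 clauses are evaluated concurrently): a clause is the AND of its included literals over the booleanized input and its negations.

```python
import numpy as np

# Tsetlin machine clause evaluation in miniature: literals are the
# booleanized input features and their negations; the TA include mask
# selects which literals participate in the conjunction.

def evaluate_clause(literals, include_mask):
    """A clause is the AND of its included literals.

    literals: booleanized input and its negation, shape (2 * n_features,)
    include_mask: TA actions, True where the literal is in the clause.
    """
    if not include_mask.any():
        return True                       # empty clause outputs 1
    return bool(np.all(literals[include_mask]))

x = np.array([1, 0, 1], dtype=bool)
literals = np.concatenate([x, ~x])        # [x1..xn, not x1..not xn]
include = np.array([1, 0, 0, 0, 1, 0], dtype=bool)  # clause: x1 AND not x2
print(evaluate_clause(literals, include))           # True
# class score = weighted sum of clause outputs, as held in the registers
```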

[395] Imitation Learning from a Single Temporally Misaligned Video

William Huey, Huaxiaoyue Wang, Anne Wu, Yoav Artzi, Sanjiban Choudhury

Main category: cs.LG

TL;DR: The paper introduces ORCA, a method for learning sequential tasks from visual demonstrations by focusing on sequence-level matching rather than frame-level alignment, achieving significant performance improvements.

DetailsMotivation: Existing frame-level matching methods fail to enforce temporal ordering or consistent progress in sequential task learning from visual demonstrations.

Method: Proposes ORCA, a dense per-timestep reward function that measures the probability of covering demonstration frames in the correct order.

Result: ORCA achieves 4.5x improvement for Meta-world tasks and 6.6x for Humanoid-v4 tasks compared to frame-level methods.

Conclusion: ORCA is robust to temporal misalignment and outperforms existing approaches in sequential task learning.

Abstract: We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. Our code is available at https://github.com/portal-cornell/orca/
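
The sequence-level matching idea can be illustrated with a toy sketch. The greedy coverage pointer below is an assumption for illustration only; ORCA's actual reward is a dense per-timestep probability of ordered coverage, not a threshold rule:

```python
import numpy as np

def ordered_coverage_step(agent_emb, demo_embs, ptr, threshold=0.9):
    """Loose sketch of sequence-level matching: reward similarity to the
    *next uncovered* demonstration frame and advance a coverage pointer,
    so progress must respect temporal order (unlike frame-level matching)."""
    d = demo_embs[ptr]
    sim = float(agent_emb @ d / (np.linalg.norm(agent_emb) * np.linalg.norm(d) + 1e-8))
    if sim > threshold and ptr < len(demo_embs) - 1:
        ptr += 1                      # subgoal covered, move on to the next one
    return sim, ptr

rng = np.random.default_rng(0)
demo = rng.normal(size=(10, 16))      # hypothetical demo frame embeddings
reward, ptr = ordered_coverage_step(rng.normal(size=16), demo, ptr=0)
```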

[396] Constrained Online Convex Optimization with Polyak Feasibility Steps

Spencer Hutchinson, Mahnoosh Alizadeh

Main category: cs.LG

TL;DR: The paper improves online convex optimization with fixed constraints, ensuring anytime constraint satisfaction and matching $O(\sqrt{T})$ regret using Polyak feasibility steps.

DetailsMotivation: Prior work achieved $O(\sqrt{T})$ regret and cumulative constraint satisfaction but lacked anytime constraint guarantees. This work aims to strengthen these guarantees.

Method: The approach combines online gradient descent with Polyak feasibility steps, ensuring constraints are met at every step without compromising regret.

Result: The method achieves anytime constraint satisfaction $g(x_t) \leq 0$ and maintains $O(\sqrt{T})$ regret, validated through experiments.

Conclusion: The proposed algorithm successfully ensures constraints are met at every step while maintaining optimal regret bounds.

Abstract: In this work, we study online convex optimization with a fixed constraint function $g : \mathbb{R}^d \rightarrow \mathbb{R}$. Prior work on this problem has shown $O(\sqrt{T})$ regret and cumulative constraint satisfaction $\sum_{t=1}^{T} g(x_t) \leq 0$, while only accessing the constraint value and subgradient at the played actions $g(x_t), \partial g(x_t)$. Using the same constraint information, we show a stronger guarantee of anytime constraint satisfaction $g(x_t) \leq 0 \ \forall t \in [T]$, and matching $O(\sqrt{T})$ regret guarantees. These contributions are thanks to our approach of using Polyak feasibility steps to ensure constraint satisfaction, without sacrificing regret. Specifically, after each step of online gradient descent, our algorithm applies a subgradient descent step on the constraint function where the step-size is chosen according to the celebrated Polyak step-size. We further validate this approach with numerical experiments.
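
The Polyak feasibility step itself is fully specified by the abstract: a subgradient step on $g$ with the Polyak step-size, applied after each online gradient descent step. A minimal sketch, assuming a simple half-space constraint for the demo:

```python
import numpy as np

def polyak_feasibility_step(x, g, g_subgrad):
    """One Polyak feasibility step: if the constraint g(x) <= 0 is violated,
    take a subgradient step on g with the Polyak step-size g(x)/||s||^2,
    which (for convex g) moves x toward the feasible set."""
    gx = g(x)
    if gx <= 0:
        return x  # already feasible, nothing to do
    s = g_subgrad(x)
    return x - (gx / (s @ s)) * s

# Hypothetical example: half-space constraint g(x) = a.x - b <= 0.
a, b = np.array([1.0, 2.0]), 1.0
x = np.array([3.0, 3.0])
x = polyak_feasibility_step(x, lambda x: a @ x - b, lambda x: a)
print(x, a @ x - b)  # lands exactly on the boundary: g(x) = 0
```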

[397] The Price of Freedom: Exploring Expressivity and Runtime Tradeoffs in Equivariant Tensor Products

YuQing Xie, Ameya Daigavane, Mit Kotak, Tess Smidt

Main category: cs.LG

TL;DR: Analysis of tensor product operations in $E(3)$-equivariant neural networks, highlighting trade-offs between speed and expressivity, and proposing optimizations.

DetailsMotivation: To systematically evaluate tensor product operations, emphasizing differences in expressivity and interactability, and to optimize implementations like the Gaunt tensor product (GTP).

Method: Introduces measures for expressivity and interactability, simplifies GTP implementation using a spherical grid, and conducts microbenchmarks.

Result: Simplified GTP is 30% faster in benchmarks and training; theoretical runtime guarantees often mismatch empirical performance.

Conclusion: Careful benchmarking is crucial due to trade-offs between speed and expressivity in tensor product operations.

Abstract: $E(3)$-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. For example, Luo et al. (2024) recently proposed the Gaunt tensor product (GTP) which promises a significant speedup. In this work, we provide a careful, systematic analysis of a number of tensor product operations. In particular, we emphasize that different tensor products are not performing the same operation. The reported speedups typically come at the cost of expressivity. We introduce measures of expressivity and interactability to characterize these differences. In addition, we realized that the original implementation of GTP can be greatly simplified by directly using a spherical grid, at no cost in asymptotic runtime. This spherical grid approach is 30% faster both on our benchmarks and in actual training of the MACE interatomic potential. Finally, we provide the first systematic microbenchmarks of the various tensor product operations. We find that the theoretical runtime guarantees can differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking. Code is available at https://github.com/atomicarchitects/PriceofFreedom.

[398] CLA: Latent Alignment for Online Continual Self-Supervised Learning

Giacomo Cignoni, Andrea Cossu, Alexandra Gomez-Villa, Joost van de Weijer, Antonio Carta

Main category: cs.LG

TL;DR: CLA is a self-supervised learning method for online continual learning, improving convergence and performance under computational constraints.

DetailsMotivation: Address the lack of SSL techniques for online continual learning with small minibatches, fixed budgets, and no task boundaries.

Method: Continual Latent Alignment (CLA) aligns current and past representations to reduce forgetting.

Result: CLA speeds up training convergence and outperforms state-of-the-art methods. It also enhances final performance when used as pretraining.

Conclusion: CLA is effective for online continual learning, offering faster convergence and better performance than existing approaches.

Abstract: Self-supervised learning (SSL) is able to build latent representations that generalize well to unseen data. However, only a few SSL techniques exist for the online CL setting, where data arrives in small minibatches, the model must comply with a fixed computational budget, and task boundaries are absent. We introduce Continual Latent Alignment (CLA), a novel SSL strategy for Online CL that aligns the representations learned by the current model with past representations to mitigate forgetting. We found that our CLA is able to speed up the convergence of the training process in the online scenario, outperforming state-of-the-art approaches under the same computational budget. Surprisingly, we also discovered that using CLA as a pretraining protocol in the early stages of pretraining leads to better final performance compared to full i.i.d. pretraining.

[399] Patch-wise Structural Loss for Time Series Forecasting

Dilfira Kudrat, Zongxia Xie, Yanru Sun, Tianyu Jia, Qinghua Hu

Main category: cs.LG

TL;DR: The paper introduces a Patch-wise Structural (PS) loss to improve time-series forecasting by addressing structural dependencies often ignored by traditional point-wise loss functions.

DetailsMotivation: Existing forecasting models rely on point-wise loss functions like Mean Square Error, which neglect structural dependencies in time series data, limiting their ability to capture complex temporal patterns.

Method: The proposed PS loss compares time series at the patch level, leveraging local statistical properties (correlation, variance, mean) to capture structural discrepancies. It integrates with point-wise loss to address both local structural inconsistencies and individual time-step errors.

Result: PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets.

Conclusion: The PS loss provides a novel benchmark for time series modeling and offers a new perspective on loss function design, enhancing forecasting accuracy.

Abstract: Time-series forecasting has gained significant attention in machine learning due to its crucial role in various domains. However, most existing forecasting models rely heavily on point-wise loss functions like Mean Square Error, which treat each time step independently and neglect the structural dependencies inherent in time series data, making it challenging to capture complex temporal patterns accurately. To address these challenges, we propose a novel Patch-wise Structural (PS) loss, designed to enhance structural alignment by comparing time series at the patch level. Through leveraging local statistical properties, such as correlation, variance, and mean, PS loss captures nuanced structural discrepancies overlooked by traditional point-wise losses. Furthermore, it integrates seamlessly with point-wise loss, simultaneously addressing local structural inconsistencies and individual time-step errors. PS loss establishes a novel benchmark for accurately modeling complex time series data and provides a new perspective on time series loss function design. Extensive experiments demonstrate that PS loss significantly improves the performance of state-of-the-art models across diverse real-world datasets.
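
A hedged PyTorch sketch of the patch-level comparison: split forecasts into non-overlapping patches and penalize discrepancies in patch mean, variance, and Pearson correlation. The exact statistics, weighting, and patch length in the paper's PS loss may differ:

```python
import torch

def patchwise_structural_loss(pred, target, patch_len=16):
    """Compare forecasts patch-by-patch on mean, variance, and correlation,
    rather than point-by-point (illustrative form of a PS-style loss)."""
    B, T = pred.shape
    n = T // patch_len
    p = pred[:, : n * patch_len].reshape(B, n, patch_len)
    t = target[:, : n * patch_len].reshape(B, n, patch_len)

    mean_loss = (p.mean(-1) - t.mean(-1)).abs().mean()
    var_loss = (p.var(-1) - t.var(-1)).abs().mean()

    pc = p - p.mean(-1, keepdim=True)
    tc = t - t.mean(-1, keepdim=True)
    corr = (pc * tc).sum(-1) / (pc.norm(dim=-1) * tc.norm(dim=-1) + 1e-8)
    corr_loss = (1 - corr).mean()

    return mean_loss + var_loss + corr_loss

# Integrates with a point-wise loss, as the paper describes:
pred, target = torch.randn(4, 96), torch.randn(4, 96)
total = torch.nn.functional.mse_loss(pred, target) + 0.5 * patchwise_structural_loss(pred, target)
```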

[400] FLAME: Towards Federated Fine-Tuning Large Language Models Through Adaptive SMoE

Khiem Le, Tuan Tran, Ting Hua, Nitesh V. Chawla

Main category: cs.LG

TL;DR: FLAME introduces a federated learning framework using Sparse Mixture-of-Experts (SMoE) to avoid performance loss from LoRA compression, addressing challenges like output mismatch and expert imbalance.

DetailsMotivation: Existing LoRA compression methods in federated learning lead to suboptimal performance due to information loss.

Method: FLAME uses SMoE architecture, retaining full global LoRA matrices and varying activated experts per client, with rescaling and activation-aware aggregation.

Result: FLAME outperforms existing methods across diverse computational settings.

Conclusion: FLAME provides a robust, effective solution for resource-adaptive federated learning.

Abstract: Existing resource-adaptive LoRA federated fine-tuning methods enable clients to fine-tune models using compressed versions of global LoRA matrices, in order to accommodate various compute resources across clients. This compression requirement will lead to suboptimal performance due to information loss. To address this, we propose FLAME, a novel federated learning framework based on the Sparse Mixture-of-Experts (SMoE) architecture. Unlike prior approaches, FLAME retains full (uncompressed) global LoRA matrices and achieves client-side adaptability by varying the number of activated experts per client. However, incorporating SMoE into federated learning introduces unique challenges, specifically, the mismatch in output magnitude from partial expert activation and the imbalance in expert training quality across clients. FLAME tackles these challenges through a lightweight rescaling mechanism and an activation-aware aggregation scheme. Empirical results across diverse computational settings demonstrate that FLAME consistently outperforms existing methods, providing a robust and effective solution for resource-adaptive federated learning.

[401] Structured Preconditioners in Adaptive Optimization: A Unified Analysis

Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

Main category: cs.LG

TL;DR: A unified analysis shows structured preconditioners in adaptive optimization can outperform less structured ones, challenging the view that approximations sacrifice performance.

DetailsMotivation: To analyze and compare the performance of structured preconditioners in adaptive optimization algorithms, challenging the assumption that approximations trade performance for efficiency.

Method: A novel unified analysis framework for adaptive optimization algorithms with structured preconditioners, applied to online regret minimization and offline convex optimization.

Result: Structured preconditioners (e.g., diagonal AdaGrad, one-sided Shampoo) can outperform less structured ones (e.g., full-matrix AdaGrad) despite using less space and computation.

Conclusion: Structured preconditioners are not just efficient approximations but can also deliver superior performance, redefining their role in optimization.

Abstract: We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rates for several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of the original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal AdaGrad and AdaGrad-Norm, which use less space and compute) are often presented as computationally efficient approximations to full-matrix AdaGrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is substantially cheaper than full-matrix AdaGrad, can outperform it both theoretically and experimentally.
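
For readers unfamiliar with the structured/unstructured distinction, a small NumPy sketch contrasting the two AdaGrad preconditioners in their standard textbook forms (not the paper's code):

```python
import numpy as np

def diag_adagrad_step(x, grad, state, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: the preconditioner is the inverse sqrt of the
    running sum of squared gradient coordinates -- O(d) space and time."""
    state += grad ** 2
    return x - lr * grad / (np.sqrt(state) + eps), state

def full_adagrad_step(x, grad, G, lr=0.1, eps=1e-8):
    """Full-matrix AdaGrad: preconditions with (sum of grad outer products)^(-1/2),
    O(d^2) space and O(d^3) time per step -- the less structured baseline that
    the paper shows can actually be *outperformed* by structured variants."""
    G += np.outer(grad, grad)
    vals, vecs = np.linalg.eigh(G + eps * np.eye(len(x)))
    P = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return x - lr * P @ grad, G

x, s = np.zeros(3), np.zeros(3)
x, s = diag_adagrad_step(x, np.array([1.0, -2.0, 0.5]), s)
```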

[402] DeInfoReg: A Decoupled Learning Framework for Better Training Throughput

Zih-Hao Huang, You-Teng Lin, Hung-Hsuan Chen

Main category: cs.LG

TL;DR: DeInfoReg introduces shorter gradient flows to combat vanishing gradients, enabling parallel GPU training for improved throughput and performance.

DetailsMotivation: Address the vanishing gradient problem and enhance training efficiency through parallel computing.

Method: Decouples supervised learning into shorter gradient flows and integrates a pipeline strategy for GPU parallelization.

Result: Outperforms standard backpropagation and other techniques in performance and noise resistance while efficiently using parallel resources.

Conclusion: DeInfoReg is effective for mitigating vanishing gradients and optimizing training throughput with parallel computing.

Abstract: This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: https://github.com/ianzih/Decoupled-Supervised-Learning-for-Information-Regularization/.
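
The core decoupling trick, turning one long gradient flow into several short ones via local losses, can be sketched in a few lines of PyTorch. The block/head shapes are hypothetical, and the paper's information-regularization term is omitted here:

```python
import torch
import torch.nn as nn

# Each block gets its own local loss, and detach() stops gradients from
# flowing back into earlier blocks -- shorter gradient flows, and blocks
# that can in principle be trained in parallel on different GPUs.
blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])
heads = nn.ModuleList([nn.Linear(32, 10) for _ in range(3)])
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
for block, head, opt in zip(blocks, heads, opts):
    x = block(x.detach())              # detach: the gradient flow starts fresh here
    loss = nn.functional.cross_entropy(head(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```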

[403] Fast Fourier Correlation is a Highly Efficient and Accurate Feature Attribution Algorithm from the Perspective of Control Theory and Game Theory

Zechen Liu, Feiyang Zhang, Wei Song, Xiang Li, Wei Wei

Main category: cs.LG

TL;DR: Proposes a Fourier feature attribution method for neural networks, showing superior feature selection and efficiency compared to spatial domain methods.

DetailsMotivation: Existing research lacks clear methods to identify learned Fourier features in neural networks, despite their known low-frequency bias.

Method: Introduces a Fourier feature attribution method based on signal decomposition theory and compares it with spatial domain methods using game-theoretic metrics.

Result: Fourier feature attribution requires only 8% of features to maintain 80% prediction accuracy on ImageNet with ViTs, showing better intra-class concentration and inter-class distinctiveness.

Conclusion: Fourier features offer more efficient classification and explainability, making them promising for AI algorithms.

Abstract: The study of neural networks from the perspective of Fourier features has garnered significant attention. While existing analytical research suggests that neural networks tend to learn low-frequency features, a clear attribution method for identifying the specific learned Fourier features has remained elusive. To bridge this gap, we propose a novel Fourier feature attribution method grounded in signal decomposition theory. Additionally, we analyze the differences between game-theoretic attribution metrics for Fourier and spatial domain features, demonstrating that game-theoretic evaluation metrics are better suited for Fourier-based feature attribution. Our experiments show that Fourier feature attribution exhibits superior feature selection capabilities compared to spatial domain attribution methods. For instance, in the case of Vision Transformers (ViTs) on the ImageNet dataset, only $8\%$ of the Fourier features are required to maintain the original predictions for $80\%$ of the samples. Furthermore, we compare the specificity of features identified by our method against traditional spatial domain attribution methods. Results reveal that Fourier features exhibit greater intra-class concentration and inter-class distinctiveness, indicating their potential for more efficient classification and explainable AI algorithms.
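
A toy NumPy sketch of the kind of frequency-domain feature selection the result describes. Using raw coefficient magnitude as the attribution score is an illustrative stand-in; the paper derives scores from signal decomposition theory instead:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(28, 28))       # stand-in for a real input image

# Keep only the top 8% of Fourier coefficients by attribution score -- the
# regime in which the paper reports predictions are preserved for 80% of
# samples (here, score = |coefficient| as a simple placeholder).
F = np.fft.fft2(img)
k = int(0.08 * F.size)
thresh = np.sort(np.abs(F).ravel())[-k]
F_masked = np.where(np.abs(F) >= thresh, F, 0)
img_recon = np.fft.ifft2(F_masked).real  # feed to the model; compare predictions
```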

[404] Geometric Learning Dynamics

Vitaly Vanchurin

Main category: cs.LG

TL;DR: A unified geometric framework models learning dynamics in physical, biological, and machine learning systems, identifying three regimes based on the power-law relationship between the metric tensor and noise covariance matrix.

DetailsMotivation: To understand and unify learning dynamics across diverse systems (physical, biological, and machine learning) by exploring the relationship between the metric tensor and noise covariance.

Method: The study uses a geometric framework to analyze the power-law relationship $g \propto \kappa^\alpha$, identifying three regimes (quantum, efficient learning, and equilibration) based on the value of $\alpha$.

Result: Three fundamental regimes emerge: quantum ($\alpha = 1$), efficient learning ($\alpha = \tfrac{1}{2}$), and equilibration ($\alpha = 0$). The intermediate regime ($\alpha = \tfrac{1}{2}$) is key to biological complexity.

Conclusion: The framework reveals how different learning dynamics arise from the relationship between $g$ and $\kappa$, with the intermediate regime playing a crucial role in biological complexity.

Abstract: We present a unified geometric framework for modeling learning dynamics in physical, biological, and machine learning systems. The theory reveals three fundamental regimes, each emerging from the power-law relationship $g \propto \kappa^\alpha$ between the metric tensor $g$ in the space of trainable variables and the noise covariance matrix $\kappa$. The quantum regime corresponds to $\alpha = 1$ and describes Schr"odinger-like dynamics that emerges from a discrete shift symmetry. The efficient learning regime corresponds to $\alpha = \tfrac{1}{2}$ and describes very fast machine learning algorithms. The equilibration regime corresponds to $\alpha = 0$ and describes classical models of biological evolution. We argue that the emergence of the intermediate regime $\alpha = \tfrac{1}{2}$ is a key mechanism underlying the emergence of biological complexity.

[405] FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation

Xenia Heilmann, Luca Corbucci, Mattia Cerrato, Anna Monreale

Main category: cs.LG

TL;DR: The paper introduces FeDa4Fair, a library for benchmarking fairness-aware Federated Learning (FL) methods, addressing biases in heterogeneous client data.

DetailsMotivation: Fairness in FL is challenged by biases in local datasets, often overlooked in existing solutions focusing on single binary attributes.

Method: The authors propose FeDa4Fair, a library to generate and evaluate fairness-aware FL datasets, including four bias-heterogeneous datasets and benchmarking tools.

Result: The paper provides datasets and tools for consistent fairness evaluation in FL, enabling controlled comparisons of mitigation methods.

Conclusion: FeDa4Fair supports robust fairness research in FL by offering reproducible benchmarks and evaluation functions.

Abstract: Federated Learning (FL) enables collaborative model training across multiple clients without sharing clients’ private data. However, fairness remains a key concern, as biases in local clients’ datasets can impact the entire federated system. Heterogeneous data distributions across clients may lead to models that are fairer for some clients than others. Although several fairness-enhancing solutions are present in the literature, most focus on mitigating bias for a single sensitive attribute, typically binary, overlooking the diverse and sometimes conflicting fairness needs of different clients. This narrow perspective can limit the effectiveness of fairness interventions for the different clients. To support more robust and reproducible fairness research in FL, we aim to enable a consistent benchmarking of fairness-aware FL methods at both the global and client levels. In this paper, we contribute in three ways: (1) We introduce FeDa4Fair, a library to generate tabular datasets tailored to evaluating fair FL methods under heterogeneous client bias; (2) we release four bias-heterogeneous datasets and corresponding benchmarks to compare fairness mitigation methods in a controlled environment; (3) we provide ready-to-use functions for evaluating fairness outcomes for these datasets.

[406] Parameter-Efficient Fine-Tuning with Circulant and Diagonal Vectors

Xinyu Ding, Lexuan Chen, Siyu Liao, Zhongfeng Wang

Main category: cs.LG

TL;DR: Proposes a method to reduce complexity of fine-tuning foundation models by factorizing weights into interleaved circulant and diagonal matrices, using 1D FFT for efficiency.

DetailsMotivation: Foundation models are computationally expensive, limiting practical applicability. Existing Fourier domain training methods still need optimization.

Method: Factorizes fine-tuning weights into interleaved circulant and diagonal matrices, partitions circulant matrices for non-square weights, and uses 1D FFT.

Result: Achieves similar or better performance with fewer FLOPs and trainable parameters.

Conclusion: The method effectively reduces complexity while maintaining or improving model performance.

Abstract: Foundation models have achieved tremendous success in different domains. However, their huge computation and storage complexity make these models difficult to fine-tune and also less applicable in practice. Recent studies show that training in the Fourier domain can be an effective fine-tuning method in terms of both model performance and number of training parameters. In this work, we propose to further reduce the complexity by factorizing the fine-tuning weights into a product of interleaved circulant and diagonal matrices. In addition, we address the case of non-square fine-tuning weights by partitioning the circulant matrix into blocks. Our method avoids the construction of the weight change matrix and utilizes the 1D fast Fourier transform (FFT) instead of the 2D FFT. Experimental results show that our method achieves similar or better performance across various tasks with far fewer floating-point operations (FLOPs) and trainable parameters.
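
The efficiency claim rests on a standard identity: multiplication by a circulant matrix is a circular convolution, computable with the 1D FFT in O(n log n) without ever materializing the matrix. A minimal sketch of an interleaved diagonal-circulant product, with hypothetical factor values:

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply by the circulant matrix whose first column is c, in
    O(n log n) via the 1D FFT (circular convolution theorem)."""
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

# Interleaved diagonal-circulant product: y = D2 C2 D1 C1 x (sketch).
n = 8
rng = np.random.default_rng(0)
x = rng.normal(size=n)
for _ in range(2):
    c, d = rng.normal(size=n), rng.normal(size=n)
    x = d * circulant_matvec(c, x)    # diagonal factor is just elementwise scaling
```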

[407] EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Arnab Sanyal, Gourav Datta, Prithwish Mukherjee, Sandeep P. Chinchali, Michael Orshansky

Main category: cs.LG

TL;DR: EntroLLM is a compression framework for LLMs that combines mixed quantization and entropy coding to reduce storage and improve inference speed on edge devices without losing accuracy.

DetailsMotivation: Large storage and computational demands of LLMs limit their deployment on edge devices, necessitating efficient compression methods.

Method: Proposes layer-wise mixed quantization (symmetric/asymmetric) and Huffman encoding for lossless compression, with parallel Huffman decoding for efficient inference.

Result: Achieves 30-65% storage reduction and 31.9-146.6% faster inference on edge devices while maintaining model accuracy.

Conclusion: EntroLLM is a practical, re-training-free solution for deploying LLMs on edge devices.

Abstract: Large Language Models (LLMs) demonstrate exceptional performance across various tasks, but their large storage and computational requirements constrain their deployment on edge devices. To address this, we propose EntroLLM, a novel compression framework that integrates mixed quantization with entropy coding to reduce storage overhead while maintaining model accuracy. Our method applies a layer-wise mixed quantization scheme - choosing between symmetric and asymmetric quantization based on individual layer weight distributions - to optimize compressibility. We then employ Huffman encoding for lossless compression of the quantized weights, significantly reducing memory bandwidth requirements. Furthermore, we introduce parallel Huffman decoding, which enables efficient retrieval of encoded weights during inference, ensuring minimal latency impact. Our experiments on edge-compatible LLMs, including smolLM-1.7B-Instruct, phi3-mini-4k-Instruct, and mistral-7B-Instruct, demonstrate that EntroLLM achieves up to $30\%$ storage reduction compared to uint8 models and up to $65\%$ storage reduction compared to uint4 models, while preserving perplexity and accuracy, on language benchmark tasks. We further show that our method enables $31.9\%$ to $146.6\%$ faster inference throughput on memory-bandwidth-limited edge devices, such as NVIDIA Jetson P3450, by reducing the required data movement. The proposed approach requires no additional re-training and is fully compatible with existing post-training quantization methods, making it a practical solution for edge LLMs.
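
A compact sketch of the two-stage pipeline: symmetric 8-bit quantization followed by Huffman coding. The paper additionally chooses symmetric vs. asymmetric quantization per layer and decodes in parallel at inference; both are omitted here, and the weight values are synthetic:

```python
import heapq
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
w = rng.normal(size=4096)             # stand-in for one layer's weights

# Symmetric 8-bit quantization (asymmetric variant omitted).
scale = np.abs(w).max() / 127
q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)

# Huffman code over the quantized symbols: entropy coding is lossless,
# so accuracy is untouched while skewed distributions compress well.
heap = [[cnt, i, {sym: ""}] for i, (sym, cnt) in enumerate(Counter(q.tolist()).items())]
heapq.heapify(heap)
i = len(heap)
while len(heap) > 1:
    lo, hi = heapq.heappop(heap), heapq.heappop(heap)
    codes = {s: "0" + c for s, c in lo[2].items()} | {s: "1" + c for s, c in hi[2].items()}
    heapq.heappush(heap, [lo[0] + hi[0], i, codes])
    i += 1
codes = heap[0][2]
bits = sum(len(codes[s]) for s in q.tolist())
print(f"{bits / (8 * q.size):.2f}x of uint8 size")  # < 1.0 when the distribution is skewed
```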

[408] Prediction via Shapley Value Regression

Amr Alkhatib, Roman Bresson, Henrik Boström, Michalis Vazirgiannis

Main category: cs.LG

TL;DR: ViaSHAP is a novel method to compute Shapley values directly, avoiding post-hoc computation. It uses two approaches (universal approximation and Kolmogorov-Arnold) and outperforms FastSHAP in accuracy.

DetailsMotivation: Traditional Shapley value computation is post-hoc and computationally expensive. ViaSHAP aims to reduce this cost by learning a function to compute Shapley values directly.

Method: ViaSHAP learns a function to compute Shapley values using two approaches: universal approximation theorem and Kolmogorov-Arnold representation theorem.

Result: ViaSHAP performs on par with state-of-the-art algorithms for tabular data and significantly outperforms FastSHAP in accuracy for both tabular data and images.

Conclusion: ViaSHAP offers a computationally efficient and accurate alternative to traditional Shapley value computation methods.

Abstract: Shapley values have several desirable, theoretically well-supported, properties for explaining black-box model predictions. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, a novel method, called ViaSHAP, is proposed, that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. Two approaches to implement the proposed method are explored; one based on the universal approximation theorem and the other on the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, showing that ViaSHAP using Kolmogorov-Arnold Networks performs on par with state-of-the-art algorithms for tabular data. It is also shown that the explanations of ViaSHAP are significantly more accurate than the popular approximator FastSHAP on both tabular data and images.
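
The "prediction by summation" structure can be sketched as follows; the plain MLP is an illustrative stand-in for the paper's universal-approximation or Kolmogorov-Arnold instantiations:

```python
import torch
import torch.nn as nn

class ViaSHAPStyleNet(nn.Module):
    """Sketch of the ViaSHAP idea: the network outputs one value per input
    feature, and the prediction is their sum plus a learned base value, so
    the per-feature values double as (approximate) Shapley values at no
    extra inference cost."""
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, d))
        self.base = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        contributions = self.phi(x)            # one value per input feature
        return contributions.sum(-1) + self.base, contributions

model = ViaSHAPStyleNet(d=10)
pred, shap_values = model(torch.randn(4, 10))  # explanation comes for free
```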

[409] Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why

Chenhao Li, Marco Hutter, Andreas Krause

Main category: cs.LG

TL;DR: A comparative analysis of feature-based and GAN-based approaches in learning from demonstrations, highlighting trade-offs in reward functions and policy learning.

DetailsMotivation: To understand the strengths and limitations of feature-based and GAN-based methods for learning from demonstrations, focusing on reward function design and policy learning implications.

Method: Comparative analysis of feature-based (dense, interpretable rewards) and GAN-based (scalable, flexible) approaches, emphasizing structured motion representations.

Result: Feature-based methods excel in high-fidelity motion imitation but struggle with generalization, while GAN-based methods offer flexibility but face training instability.

Conclusion: The choice between paradigms depends on task-specific priorities like fidelity, diversity, interpretability, and adaptability, with no clear dominance of one over the other.

Abstract: This survey provides a comparative analysis of feature-based and GAN-based approaches to learning from demonstrations, with a focus on the structure of reward functions and their implications for policy learning. Feature-based methods offer dense, interpretable rewards that excel at high-fidelity motion imitation, yet often require sophisticated representations of references and struggle with generalization in unstructured settings. GAN-based methods, in contrast, use implicit, distributional supervision that enables scalability and adaptation flexibility, but are prone to training instability and coarse reward signals. Recent advancements in both paradigms converge on the importance of structured motion representations, which enable smoother transitions, controllable synthesis, and improved task integration. We argue that the dichotomy between feature-based and GAN-based methods is increasingly nuanced: rather than one paradigm dominating the other, the choice should be guided by task-specific priorities such as fidelity, diversity, interpretability, and adaptability. This work outlines the algorithmic trade-offs and design considerations that underlie method selection, offering a framework for principled decision-making in learning from demonstrations.

[410] On the Similarities of Embeddings in Contrastive Learning

Chungpa Lee, Sehee Lim, Kibok Lee, Jy-yong Sohn

Main category: cs.LG

TL;DR: The paper analyzes contrastive learning via cosine similarity, showing limitations in full-batch and mini-batch settings and proposing solutions.

DetailsMotivation: To understand and improve contrastive learning by addressing alignment issues in full-batch settings and variance in mini-batch settings.

Method: Proposes a unified framework for contrastive learning using cosine similarity, with theoretical insights and an auxiliary loss for mini-batch settings.

Result: Shows that perfect alignment is unattainable in full-batch settings without within-view negatives, and smaller batches degrade representation quality due to higher variance. The auxiliary loss improves performance in small-batch settings.

Conclusion: The proposed framework and auxiliary loss enhance contrastive learning by addressing alignment and variance issues in different batch settings.

Abstract: Contrastive learning operates on a simple yet effective principle: Embeddings of positive pairs are pulled together, while those of negative pairs are pushed apart. In this paper, we propose a unified framework for understanding contrastive learning through the lens of cosine similarity, and present two key theoretical insights derived from this framework. First, in full-batch settings, we show that perfect alignment of positive pairs is unattainable when negative-pair similarities fall below a threshold, and this misalignment can be mitigated by incorporating within-view negative pairs into the objective. Second, in mini-batch settings, smaller batch sizes induce stronger separation among negative pairs in the embedding space, i.e., higher variance in their similarities, which in turn degrades the quality of learned representations compared to full-batch settings. To address this, we propose an auxiliary loss that reduces the variance of negative-pair similarities in mini-batch settings. Empirical results show that incorporating the proposed loss improves performance in small-batch settings.
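
A hedged sketch of the auxiliary idea: in a mini-batch, penalize the variance of negative-pair cosine similarities, the quantity the paper identifies as inflated by small batches. The exact form of their loss may differ:

```python
import torch
import torch.nn.functional as F

def negative_variance_penalty(z1, z2):
    """Variance of off-diagonal (negative-pair) cosine similarities between
    two views' embeddings; add to a standard contrastive loss with a small
    weight in small-batch settings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T                              # (B, B) cosine similarities
    B = sim.size(0)
    neg = sim[~torch.eye(B, dtype=torch.bool)]   # off-diagonal = negative pairs
    return neg.var()

z1 = torch.randn(32, 128, requires_grad=True)
z2 = torch.randn(32, 128)
aux = negative_variance_penalty(z1, z2)          # e.g., total = info_nce + 0.1 * aux
```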

[411] Reinforcement Learning with Action Chunking

Qiyang Li, Zhiyuan Zhou, Sergey Levine

Main category: cs.LG

TL;DR: Q-chunking improves RL for long-horizon tasks by using action chunking in offline-to-online settings, enhancing exploration and sample efficiency.

DetailsMotivation: Addressing challenges in offline-to-online RL, particularly exploration and sample efficiency, by leveraging offline data effectively.

Method: Applies action chunking to TD-based RL, enabling temporally consistent behaviors and unbiased n-step backups.

Result: Outperforms prior methods in offline performance and online sample efficiency on sparse-reward tasks.

Conclusion: Q-chunking is a simple, effective solution for improving RL in long-horizon, sparse-reward settings.

Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a ‘chunked’ action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
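
The "unbiased $n$-step backup" point has a simple concrete form: because a chunk commits the policy for $h$ steps, the TD target within the chunk is an ordinary $h$-step return with no off-policy correction needed. A toy sketch with hypothetical numbers:

```python
import numpy as np

# Sketch of the Q-chunking backup with chunk length h: the target bootstraps
# from the Q-value of the *next chunk*, after accumulating h in-chunk rewards.
gamma, h = 0.99, 4
rewards = np.array([0.0, 0.0, 1.0, 0.0])        # rewards observed inside one chunk
q_next = 2.5                                     # Q(s', next chunk) from a target net
target = sum(gamma**k * rewards[k] for k in range(h)) + gamma**h * q_next
```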

[412] Exploring and Improving Initialization for Deep Graph Neural Networks: A Signal Propagation Perspective

Senmiao Wang, Yupeng Chen, Yushun Zhang, Ruoyu Sun, Tian Ding

Main category: cs.LG

TL;DR: The paper introduces SPoGInit, a weight initialization method for GNNs to improve signal propagation, addressing performance degradation in deep networks.

DetailsMotivation: GNNs suffer from performance degradation as depth increases, prompting the need for better initialization methods to enhance signal propagation.

Method: Proposes three metrics for signal propagation in GNNs (forward/backward propagation and GEV) and introduces SPoGInit, a method optimizing these metrics through weight initialization.

Result: SPoGInit outperforms common initialization methods, enabling performance improvements in deep GNNs across various tasks.

Conclusion: SPoGInit effectively addresses depth-related challenges in GNNs, validating the SP analysis framework and offering a practical solution.

Abstract: Graph Neural Networks (GNNs) often suffer from performance degradation as the network depth increases. This paper addresses this issue by introducing initialization methods that enhance signal propagation (SP) within GNNs. We propose three key metrics for effective SP in GNNs: forward propagation, backward propagation, and graph embedding variation (GEV). While the first two metrics derive from classical SP theory, the third is specifically designed for GNNs. We theoretically demonstrate that a broad range of commonly used initialization methods for GNNs, which exhibit performance degradation with increasing depth, fail to control these three metrics simultaneously. To deal with this limitation, a direct exploitation of the SP analysis, searching for weight initialization variances that optimize the three metrics, is shown to significantly enhance the SP in deep GCNs. This approach is called Signal Propagation on Graph-guided Initialization (SPoGInit). Our experiments demonstrate that SPoGInit outperforms commonly used initialization methods on various tasks and architectures. Notably, SPoGInit enables performance improvements as GNNs deepen, which represents a significant advancement in addressing depth-related challenges and highlights the validity and effectiveness of the SP analysis framework.

[413] EXPO: Stable Reinforcement Learning with Expressive Policies

Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn

Main category: cs.LG

TL;DR: The paper introduces Expressive Policy Optimization (EXPO), an online RL algorithm for training expressive policies with offline datasets, improving sample efficiency by 2-3x over prior methods.

DetailsMotivation: Training expressive policies (e.g., diffusion, flow-matching) with online RL is challenging due to unstable gradient propagation. The goal is to achieve stable value maximization without direct optimization.

Method: EXPO uses two policies: a stable expressive base policy trained via imitation learning and a lightweight Gaussian edit policy. The on-the-fly policy maximizes Q-value by editing base actions toward higher value.

Result: EXPO achieves 2-3x better sample efficiency than prior methods in fine-tuning pretrained policies and leveraging offline data for online training.

Conclusion: EXPO effectively addresses stable value maximization for expressive policies, significantly improving sample efficiency in RL tasks.

Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies: a larger expressive base policy trained with a stable imitation learning objective and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value-maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to a 2-3x improvement in sample efficiency on average over prior methods, both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.
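
A hedged sketch of EXPO's on-the-fly action selection: sample from the expressive base policy, apply a small Gaussian edit, and keep the highest-Q candidate. All function names and the dummy stand-ins are hypothetical:

```python
import torch

def expo_action(base_policy, edit_policy, q_fn, s, n=8):
    """Sample n actions from the base policy, edit each with a Gaussian
    edit policy, and return the value-maximizing action among base and
    edited candidates (sketch of EXPO's on-the-fly policy)."""
    a_base = base_policy(s, n)                  # (n, act_dim) candidate actions
    a_edit = a_base + edit_policy(s, a_base)    # small edits toward higher value
    cands = torch.cat([a_base, a_edit], dim=0)
    return cands[q_fn(s, cands).argmax()]

# Dummy stand-ins to make the sketch executable:
base_policy = lambda s, n: torch.randn(n, 2)
edit_policy = lambda s, a: 0.1 * torch.randn_like(a)
q_fn = lambda s, a: a.sum(-1)
a = expo_action(base_policy, edit_policy, q_fn, s=torch.zeros(3))
```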

[414] PhysiX: A Foundation Model for Physics Simulations

Tung Nguyen, Arsh Koneru, Shufan Li, Aditya Grover

Main category: cs.LG

TL;DR: PhysiX is a 4.5B parameter foundation model for physics simulation, addressing data scarcity and outperforming task-specific baselines by leveraging autoregressive tokenization and refinement.

DetailsMotivation: Foundation models excel in video, image, and language but lag in physics simulation due to data scarcity and scale variability. PhysiX aims to bridge this gap.

Method: PhysiX uses a discrete tokenizer to encode physical processes into tokens and an autoregressive model for prediction, with a refinement module to reduce discretization errors.

Result: PhysiX outperforms task-specific baselines and state-of-the-art approaches on The Well benchmark, showing successful knowledge transfer from natural videos.

Conclusion: Joint training across diverse physics tasks enables synergistic learning, demonstrating the potential of foundation models in physics simulation.

Abstract: Foundation models have achieved remarkable success across video, image, and language domains. By scaling up the number of parameters and training datasets, these models acquire generalizable world knowledge and often surpass task-specific approaches. However, such progress has yet to extend to the domain of physics simulation. A primary bottleneck is data scarcity: while millions of images, videos, and textual resources are readily available on the internet, the largest physics simulation datasets contain only tens of thousands of samples. This data limitation hinders the use of large models, as overfitting becomes a major concern. As a result, physics applications typically rely on small models, which struggle with long-range prediction due to limited context understanding. Additionally, unlike images, videos, or text, which typically exhibit fixed granularity, physics datasets often vary drastically in scale, amplifying the challenges of scaling up multitask training. We introduce PhysiX, the first large-scale foundation model for physics simulation. PhysiX is a 4.5B parameter autoregressive generative model. It uses a discrete tokenizer to encode physical processes at different scales into a sequence of discrete tokens, and employs an autoregressive next-token prediction objective to model such processes in the token space. To mitigate the rounding error in the discretization process, PhysiX incorporates a specialized refinement module. Through extensive experiments, we show that PhysiX effectively addresses the data bottleneck, outperforming task-specific baselines under comparable settings as well as the previous absolute state-of-the-art approaches on The Well benchmark. Our results indicate that knowledge learned from natural videos can be successfully transferred to physics simulation, and that joint training across diverse simulation tasks enables synergistic learning.

[415] Tree-Structured Parzen Estimator Can Solve Black-Box Combinatorial Optimization More Efficiently

Kenshin Abe, Yunzhuo Wang, Shuhei Watanabe

Main category: cs.LG

TL;DR: The paper proposes an efficient combinatorial optimization algorithm for Tree-structured Parzen estimator (TPE), enhancing its applicability beyond deep learning to domains like chemistry and biology.

DetailsMotivation: Combinatorial optimization is crucial in fields like chemistry and biology but remains underexplored in TPE. The paper aims to address this gap.

Method: The authors generalize the categorical kernel with the numerical kernel in TPE, introduce a distance structure, and modify the kernel to handle large combinatorial search spaces efficiently.

Result: Experiments on synthetic problems show the proposed method finds better solutions with fewer evaluations than the original TPE.

Conclusion: The enhanced TPE algorithm is effective for combinatorial optimization and is integrated into the open-source HPO framework Optuna.

Abstract: Tree-structured Parzen estimator (TPE) is a versatile hyperparameter optimization (HPO) method supported by popular HPO tools. Since these HPO tools have been developed in line with the trend of deep learning (DL), the problem setups often used in the DL domain have been discussed for TPE such as multi-objective optimization and multi-fidelity optimization. However, the practical applications of HPO are not limited to DL, and black-box combinatorial optimization is actively utilized in some domains, e.g., chemistry and biology. As combinatorial optimization has been an untouched, yet very important, topic in TPE, we propose an efficient combinatorial optimization algorithm for TPE. In this paper, we first generalize the categorical kernel with the numerical kernel in TPE, enabling us to introduce a distance structure to the categorical kernel. Then we discuss modifications for the newly developed kernel to handle a large combinatorial search space. These modifications reduce the time complexity of the kernel calculation with respect to the size of a combinatorial search space. In the experiments using synthetic problems, we verified that our proposed method identifies better solutions with fewer evaluations than the original TPE. Our algorithm is available in Optuna, an open-source framework for HPO.
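
A hedged sketch of the kernel idea: instead of a standard categorical kernel that treats all non-matching categories as equally far, weight each category by a user-provided distance, so "nearby" categories share probability mass. The exact kernel form and normalization in the paper may differ:

```python
import numpy as np

def distance_aware_categorical_kernel(c, choices, dist, bandwidth=1.0):
    """Probability mass over categorical choices, concentrated on categories
    close (under `dist`) to the observed category c -- the distance structure
    the paper introduces into TPE's categorical kernel (illustrative form)."""
    w = np.exp(-np.array([dist(c, c2) for c2 in choices]) / bandwidth)
    return w / w.sum()

# Hypothetical example: integers treated as categories with |a - b| distance.
probs = distance_aware_categorical_kernel(2, [0, 1, 2, 3, 4], lambda a, b: abs(a - b))
print(probs)  # mass peaks at category 2 and decays with distance
```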

[416] TAB: Unified Benchmarking of Time Series Anomaly Detection Methods

Xiangfei Qiu, Zhe Li, Wanghui Qiu, Shiyan Hu, Lekui Zhou, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Aoying Zhou, Zhenli Sheng, Jilin Hu, Christian S. Jensen, Bin Yang

Main category: cs.LG

TL;DR: The paper introduces TAB, a new benchmark for time series anomaly detection (TSAD), addressing evaluation deficiencies by providing diverse datasets, unified evaluation protocols, and automated pipelines.

DetailsMotivation: Current TSAD evaluation methods lack reliability due to inconsistent datasets and protocols, hindering progress in developing better anomaly detection methods.

Method: TAB includes 29 multivariate and 1,635 univariate datasets, covers various TSAD methods (Non-learning, ML, DL, LLM-based, pre-trained), and features a unified, automated evaluation pipeline.

Result: TAB enables comprehensive and fair evaluation of TSAD methods, offering insights into their performance across diverse datasets.

Conclusion: TAB serves as a reliable benchmark for TSAD evaluation, facilitating future research and method comparisons.

Abstract: Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at https://github.com/decisionintelligence/TAB.

[417] On Equivariant Model Selection through the Lens of Uncertainty

Putri A. van der Linden, Alexander Timans, Dharmesh Tailor, Erik J. Bekkers

Main category: cs.LG

TL;DR: The paper explores uncertainty-aware methods for selecting equivariant models with varying symmetry biases, comparing frequentist, Bayesian, and calibration-based measures. Bayesian model evidence is inconsistent, attributed to complexity mismatches.

DetailsMotivation: To address the challenge of selecting among pretrained equivariant models with varying symmetry biases, which can harm performance if misspecified.

Method: Compares frequentist (Conformal Prediction), Bayesian (marginal likelihood), and calibration-based measures to naive error-based evaluation.

Result: Uncertainty metrics generally align with predictive performance, but Bayesian model evidence does so inconsistently due to complexity mismatches.

Conclusion: Uncertainty metrics show promise for guiding symmetry-aware model selection, though Bayesian methods require refinement.

Abstract: Equivariant models leverage prior knowledge on symmetries to improve predictive performance, but misspecified architectural constraints can harm it instead. While work has explored learning or relaxing constraints, selecting among pretrained models with varying symmetry biases remains challenging. We examine this model selection task from an uncertainty-aware perspective, comparing frequentist (via Conformal Prediction), Bayesian (via the marginal likelihood), and calibration-based measures to naive error-based evaluation. We find that uncertainty metrics generally align with predictive performance, but Bayesian model evidence does so inconsistently. We attribute this to a mismatch in Bayesian and geometric notions of model complexity for the employed last-layer Laplace approximation, and discuss possible remedies. Our findings point towards the potential of uncertainty in guiding symmetry-aware model selection.
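
On the frequentist side, the Conformal Prediction ingredient is standard. A minimal split-conformal sketch of how one might compare pretrained models by prediction-set size at fixed coverage (the calibration scores here are synthetic):

```python
import numpy as np

def split_conformal_quantile(cal_scores, alpha=0.1):
    """Standard split-conformal quantile: prediction sets built with this
    threshold cover the truth with probability >= 1 - alpha."""
    n = len(cal_scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(level, 1.0), method="higher")

# Nonconformity scores on a held-out calibration set (synthetic here);
# at fixed coverage, a smaller quantile means tighter sets, i.e. a model
# whose uncertainty better matches its errors.
scores = np.abs(np.random.default_rng(0).normal(size=500))
q = split_conformal_quantile(scores)
```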

[418] Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment

Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao

Main category: cs.LG

TL;DR: The paper introduces Multi-Preference Lambda-weighted Listwise DPO, a method extending Direct Preference Optimization (DPO) to handle multiple preference dimensions dynamically, improving alignment without re-training.

DetailsMotivation: Aligning large language models (LLMs) with human preferences is challenging. Existing methods like RLHF are costly and unstable, while DPO is limited to fixed, single-dimensional preferences.

Method: Proposes Multi-Preference Lambda-weighted Listwise DPO, generalizing DPO to support multiple preference dimensions and dynamic interpolation via a simplex-weighted lambda vector, enabling listwise supervision.

Result: Empirical results show the method matches or surpasses standard DPO on alignment benchmarks, offering improved adaptability, even on smaller (1B-2B) models.

Conclusion: The method provides a flexible, efficient alternative for aligning LLMs with human preferences, particularly useful in compute-constrained settings.

Abstract: While large language models (LLMs) excel at text generation, aligning them with human preferences remains challenging. Reinforcement learning from human feedback (RLHF) improves alignment but is costly and unstable. Direct Preference Optimization (DPO) offers a simpler alternative, yet assumes a fixed, single-dimensional preference. We propose Multi-Preference Lambda-weighted Listwise DPO, a generalization of DPO that supports multiple preference dimensions and dynamic interpolation via a simplex-weighted lambda vector. Our method enables listwise supervision and flexible alignment without re-training. While our experiments are conducted on 1B-2B scale models, this is an intentional choice: smaller models provide a more stringent testbed where performance improvements more clearly reflect the effectiveness of the alignment strategy itself. Moreover, such models are widely used in compute-constrained applications, making our improvements both methodologically meaningful and practically valuable. Empirical results show that our approach matches or surpasses standard DPO on alignment benchmarks while offering improved adaptability.
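
A hedged sketch of the lambda-weighting in its simplest pairwise form; the paper's listwise objective generalizes this, and the input convention (per-dimension DPO log-ratios for chosen vs. rejected responses) is an assumption for illustration:

```python
import torch

def lambda_weighted_dpo_loss(logratios, lam, beta=0.1):
    """Combine per-dimension preference log-ratios -- shape (batch, num_dims),
    each entry the usual DPO policy/reference log-ratio margin for one
    preference dimension -- with a simplex weight vector `lam`, then apply
    the standard DPO logistic loss. Changing `lam` re-targets the alignment
    trade-off without re-training."""
    combined = (logratios * lam).sum(-1)         # interpolate preference dimensions
    return -torch.nn.functional.logsigmoid(beta * combined).mean()

lam = torch.tensor([0.5, 0.3, 0.2])              # weights on the probability simplex
loss = lambda_weighted_dpo_loss(torch.randn(8, 3), lam)
```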

[419] Energy Efficiency in AI for 5G and Beyond: A DeepRx Case Study

Amine Lbath, Ibtissam Labriji

Main category: cs.LG

TL;DR: The study explores balancing energy efficiency and performance in AI/ML models using DeepRX, a deep learning receiver. It evaluates energy consumption, applies knowledge distillation (KD) for efficiency, and shows KD’s effectiveness in reducing energy while maintaining performance.

DetailsMotivation: The challenge of balancing energy efficiency with performance in AI/ML models, particularly for DeepRX, a deep learning receiver.

Method: Evaluates energy consumption (FLOPs/Watt, FLOPs/clock), compares training vs. inference energy, and applies knowledge distillation (KD) to train a compact student model.

Result: Distilled models achieve lower error floors across SINR levels, demonstrating KD’s effectiveness for energy-efficient AI solutions.

Conclusion: Knowledge distillation successfully balances energy efficiency and performance in DeepRX, offering a viable approach for energy-efficient AI models.

Abstract: This study addresses the challenge of balancing energy efficiency with performance in AI/ML models, focusing on DeepRX, a deep learning receiver based on a fully convolutional ResNet architecture. We evaluate the energy consumption of DeepRX, considering factors including FLOPs/Watt and FLOPs/clock, and find consistency between estimated and actual energy usage, influenced by memory access patterns. The research extends to comparing energy dynamics during training and inference phases. A key contribution is the application of knowledge distillation (KD) to train a compact DeepRX student model that emulates the performance of the teacher model but with reduced energy consumption. We experiment with different student model sizes, optimal teacher sizes, and KD hyperparameters. Performance is measured by comparing the Bit Error Rate (BER) performance versus Signal-to-Interference & Noise Ratio (SINR) values of the distilled model and a model trained from scratch. The distilled models demonstrate a lower error floor across SINR levels, highlighting the effectiveness of KD in achieving energy-efficient AI solutions.
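
The KD objective in such studies is typically the standard Hinton-style temperature-softened KL. A minimal PyTorch sketch; the study's exact loss weighting and temperature are tuned hyperparameters:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Knowledge-distillation loss: KL divergence between temperature-softened
    teacher and student distributions, scaled by T^2 to keep gradient
    magnitudes comparable across temperatures."""
    return T * T * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )

loss = kd_loss(torch.randn(16, 10), torch.randn(16, 10))  # synthetic logits
```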

[420] Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift

Gautam Sreekumar, Vishnu Naresh Boddeti

Main category: cs.LG

TL;DR: The paper introduces RepLIn, a method for learning robust representations of causally-related latent variables by leveraging interventional data and enforcing independence conditions derived from the causal model.

DetailsMotivation: Existing methods treat interventional data like observational data, ignoring causal independence relations, leading to performance disparities. The paper aims to address this gap.

Method: The paper proposes RepLIn, an algorithm that enforces statistical independence during interventions, derived from theoretical conditions for linear models.

Result: RepLIn reduces performance disparities and improves robustness on synthetic and real datasets (facial attribute classification, toxicity detection).

Conclusion: RepLIn is scalable and effective for learning robust representations against interventional distribution shifts.

Abstract: We consider the problem of learning robust discriminative representations of causally-related latent variables. In addition to observational data, the training dataset also includes interventional data obtained through targeted interventions on some of these latent variables to learn representations robust against the resulting interventional distribution shifts. Existing approaches treat interventional data like observational data, even when the underlying causal model is known, and ignore the independence relations that arise from these interventions. Since these approaches do not fully exploit the causal relational information resulting from interventions, they learn representations that produce large disparities in predictive performance on observational and interventional data, which worsens when the number of interventional training samples is limited. In this paper, (1) we first identify a strong correlation between this performance disparity and adherence of the representations to the independence conditions induced by the interventional causal model. (2) For linear models, we derive sufficient conditions on the proportion of interventional data in the training dataset, for which enforcing interventional independence between representations corresponding to the intervened node and its non-descendants lowers the error on interventional data. Combining these insights, (3) we propose RepLIn, a training algorithm to explicitly enforce this statistical independence during interventions. We demonstrate the utility of RepLIn on a synthetic dataset and on real image and text datasets on facial attribute classification and toxicity detection, respectively. Our experiments show that RepLIn is scalable with the number of nodes in the causal graph and is suitable to improve the robust representations against interventional distribution shifts of both continuous and discrete latent variables.
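
One simple estimator for "enforcing statistical independence" on interventional batches is a cross-covariance penalty between the intervened node's representation and those of its non-descendants. This is an illustrative stand-in; RepLIn's actual independence loss may use a different estimator:

```python
import torch

def cross_covariance_penalty(z_a, z_b):
    """Squared cross-covariance between two representation blocks; driving
    this to zero removes linear dependence between z_a (intervened node)
    and z_b (its non-descendants) on interventional samples."""
    z_a = z_a - z_a.mean(0)
    z_b = z_b - z_b.mean(0)
    cov = z_a.T @ z_b / (z_a.size(0) - 1)
    return (cov ** 2).sum()

penalty = cross_covariance_penalty(torch.randn(64, 8), torch.randn(64, 8))
```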

[421] Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Sukjun Hwang, Brandon Wang, Albert Gu

Main category: cs.LG

TL;DR: The paper introduces H-Net, a hierarchical model that replaces traditional tokenization with dynamic chunking, enabling end-to-end learning and outperforming token-based Transformers.

DetailsMotivation: Tokenization remains a barrier to true end-to-end foundation models, limiting the potential of language models.

Method: Proposes H-Net, a hierarchical network with dynamic chunking, learned jointly with the model, replacing tokenization-LM-detokenization pipelines.

Result: H-Net outperforms token-based Transformers, scales better with data, and shows robustness in character-level tasks and diverse languages/modalities.

Conclusion: H-Net demonstrates the potential of end-to-end models by improving performance and efficiency, especially in scenarios with weaker tokenization heuristics.

Abstract: Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching the token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net’s improvement over tokenized pipelines is further increased in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.
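As a toy illustration of learned chunking (not the paper's mechanism, which is trained fully end-to-end with the language model), the sketch below scores a boundary probability per byte and mean-pools each resulting chunk. All names and the hard threshold are assumptions; a real system needs a differentiable chunking decision.

```python
import torch
import torch.nn as nn

class ToyDynamicChunker(nn.Module):
    """Score a boundary probability per byte, then mean-pool each chunk."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)   # raw bytes 0..255
        self.boundary = nn.Linear(d_model, 1)     # boundary logit per byte

    def forward(self, byte_ids: torch.Tensor, thresh: float = 0.5):
        h = self.embed(byte_ids)                           # (T, d)
        p = torch.sigmoid(self.boundary(h)).squeeze(-1)    # (T,)
        # Hard threshold for illustration only: cut a new chunk wherever
        # the boundary probability exceeds thresh.
        cut = (p > thresh).nonzero().squeeze(-1).tolist()
        starts = [0] + [i + 1 for i in cut if i + 1 < len(byte_ids)]
        ends = starts[1:] + [len(byte_ids)]
        chunks = [h[s:e].mean(dim=0) for s, e in zip(starts, ends)]
        return torch.stack(chunks), p    # (num_chunks, d), boundary scores

ids = torch.tensor(list("dynamic chunking".encode("utf-8")))
chunks, scores = ToyDynamicChunker()(ids)
```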

cs.MA

[422] A Learning Framework For Cooperative Collision Avoidance of UAV Swarms Leveraging Domain Knowledge

Shuangyao Huang, Haibo Zhang, Zhiyi Huang

Main category: cs.MA

TL;DR: A MARL framework for UAV swarm collision avoidance uses domain knowledge-driven rewards from image processing to model obstacles as field maxima, ensuring smooth, energy-efficient paths without collisions.

DetailsMotivation: To address the challenges of cooperative collision avoidance in UAV swarms by minimizing agent interaction and eliminating complex credit assignment or observation sharing mechanisms.

Method: The framework leverages domain knowledge (image processing) to derive rewards, modeling obstacles as maxima on a 2D field. Contours avoid peaks, ensuring collision-free paths. Training adapts UAVs to complex environments.

Result: The framework outperforms state-of-the-art MARL algorithms, enabling large swarm training and adaptability to non-viable or non-existent contours.

Conclusion: The proposed MARL framework effectively avoids collisions in UAV swarms, simplifies training, and adapts to complex environments, outperforming existing methods.

Abstract: This paper presents a multi-agent reinforcement learning (MARL) framework for cooperative collision avoidance of UAV swarms leveraging a domain knowledge-driven reward. The reward is derived from knowledge in the domain of image processing, approximating contours on a two-dimensional field. By modeling obstacles as maxima on the field, collisions are inherently avoided, as contours never pass through peaks or intersect. Additionally, contours are smooth and energy-efficient. Our framework enables training with large swarm sizes, as agent interaction is minimized and the need for the complex credit assignment schemes or observation sharing mechanisms of state-of-the-art MARL approaches is eliminated. Moreover, through intensive training, UAVs acquire the ability to adapt to complex environments where contours may be non-viable or non-existent. Extensive experiments are conducted to evaluate the performance of our framework against state-of-the-art MARL algorithms.
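Gaussian peaks with an iso-contour reward, as sketched below, are one illustrative reading of "obstacles as maxima on a field"; the paper's exact field construction and reward shaping may differ.

```python
import numpy as np

OBSTACLES = [(3.0, 4.0), (7.0, 2.0)]   # assumed obstacle centres

def field(p: np.ndarray, sigma: float = 1.0) -> float:
    """2D field with a Gaussian maximum at every obstacle."""
    return sum(np.exp(-np.sum((p - np.array(c)) ** 2) / (2 * sigma ** 2))
               for c in OBSTACLES)

def reward(p: np.ndarray, target_level: float = 0.1) -> float:
    """Reward staying near a low iso-contour of the field: contour lines
    never pass through a peak, so following one avoids all obstacles."""
    return -abs(field(p) - target_level)

print(reward(np.array([5.0, 5.0])))   # near the target contour -> mild penalty
```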

[423] DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving

Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse

Main category: cs.MA

TL;DR: DroidSpeak enables KV cache reuse across different LLMs with the same architecture, improving throughput and prefill time with minimal quality loss.

DetailsMotivation: Efficiently reuse prefix KV caches across different LLMs in compound AI systems to enhance performance.

Method: Selectively recomputes a few layers of KV cache from one LLM and reuses the rest, pipelining computations for efficiency.

Result: Achieves up to 4x throughput improvement and 3.1x faster prefill with negligible quality loss.

Conclusion: DroidSpeak effectively enables KV cache sharing across LLMs, significantly boosting performance without compromising quality.

Abstract: Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much prior work has enabled the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question. We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if/when such sharing affects quality. Inspired by these findings, DroidSpeak selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Moreover, carefully pipelining the layer-wise re-computation and the loading of reused KV cache further improves the inference performance. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4x throughput improvement and about 3.1x faster prefill (time to first token), with negligible loss of quality in F1 scores, Rouge-L or code similarity score, compared to the baseline which does not allow any sharing across models.
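A minimal sketch of the selective-reuse pattern described above, with all names hypothetical: keep most of the sender model's per-layer (K, V) prefix cache and recompute only the layers where reuse would hurt quality. The actual system additionally pipelines layer recomputation with cache loading.

```python
from typing import Callable, List, Tuple
import torch

KV = Tuple[torch.Tensor, torch.Tensor]   # per-layer (keys, values)

def blend_kv_caches(sender_cache: List[KV],
                    recompute_layers: List[int],
                    recompute_fn: Callable[[int], KV]) -> List[KV]:
    """Reuse another same-architecture model's prefix KV cache,
    selectively recomputing the layers sensitive to the model swap."""
    cache = list(sender_cache)
    for layer in recompute_layers:
        cache[layer] = recompute_fn(layer)   # fresh (K, V) from the receiver
    return cache
```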

[424] Voting or Consensus? Decision-Making in Multi-Agent Debate

Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp

Main category: cs.MA

TL;DR: The paper evaluates the impact of seven decision protocols in multi-agent debates, showing voting protocols boost reasoning tasks by 13.2% and consensus protocols improve knowledge tasks by 2.8%. Two new methods, AAD and CI, further enhance performance by up to 7.4%.

DetailsMotivation: To systematically analyze how decision-making protocols influence multi-agent debates, as prior studies often altered multiple parameters, making it hard to isolate protocol effects.

Method: The study changes only the decision protocol (e.g., majority voting, unanimity consensus) while keeping other variables constant, measuring effects on agent collaboration, knowledge, and reasoning tasks.

Result: Voting protocols excel in reasoning tasks (+13.2%), consensus in knowledge tasks (+2.8%). More agents improve performance, but extra discussion rounds reduce it. AAD and CI methods further boost performance by up to 7.4%.

Conclusion: Decision-making protocols significantly impact multi-agent debates, with voting and consensus excelling in different tasks. Proposed methods (AAD, CI) enhance performance, highlighting the importance of protocol choice beyond scaling.

Abstract: Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
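At decision time the two protocol families reduce to very simple rules; a sketch with hypothetical agent answers (the paper evaluates seven protocols with more machinery around them):

```python
from collections import Counter
from typing import List, Optional

def majority_vote(answers: List[str]) -> str:
    """Voting: the most frequent answer wins immediately."""
    return Counter(answers).most_common(1)[0][0]

def unanimity_consensus(answers: List[str]) -> Optional[str]:
    """Consensus: decide only when every agent agrees; else keep debating."""
    return answers[0] if len(set(answers)) == 1 else None

print(majority_vote(["A", "B", "A"]))        # 'A'
print(unanimity_consensus(["A", "B", "A"]))  # None -> another debate round
```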

[425] Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery

Lina Zhao, Jiaxing Bai, Zihao Bian, Qingyue Chen, Yafang Li, Guangbo Li, Min He, Huaiyuan Yao, Zongjiu Zhang

Main category: cs.MA

TL;DR: FUAS-Agents, an LLM-driven system, improves FUAS treatment planning by integrating multimodal data and AI tools, achieving high expert ratings in clinical scenarios.

DetailsMotivation: FUAS requires complex tasks like image interpretation and personalized planning, needing intelligent assistance for efficiency and reliability.

Method: FUAS-Agents uses LLMs to integrate patient profiles and MRI data, orchestrating AI tools for segmentation, dose prediction, and guideline retrieval to create personalized plans.

Result: In uterine fibroid treatment, 82.5-97.5% of plans scored 4+ (5-point scale) for completeness, accuracy, fluency, and compliance.

Conclusion: LLM-driven agents enhance clinical decision-making, combining general-purpose models with expert systems for healthcare challenges.

Abstract: Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non-invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real-time intraoperative decision-making processes that demand intelligent assistance to improve efficiency and reliability. We introduce FUAS-Agents, an autonomous agent system that leverages the multimodal understanding and tool-using capabilities of large language models (LLMs). By integrating patient profiles and MRI data, FUAS-Agents orchestrates a suite of specialized medical AI tools, including segmentation, treatment dose prediction, and clinical guideline retrieval, to generate personalized treatment plans comprising MRI images, dose parameters, and therapeutic strategies. We evaluate the system in a uterine fibroid treatment scenario. Human assessment by four senior FUAS experts indicates that 82.5%, 82.5%, 87.5%, and 97.5% of the generated plans were rated 4 or above (on a 5-point scale) in terms of completeness, accuracy, fluency, and clinical compliance, respectively. These results demonstrate the potential of LLM-driven agents in enhancing decision-making across complex clinical workflows, and exemplify a translational paradigm that combines general-purpose models with specialized expert systems to solve practical challenges in vertical healthcare domains.

[426] MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications

Aleksandr Algazinov, Matt Laing, Paul Laban

Main category: cs.MA

TL;DR: MATE is a multimodal accessibility MAS that converts data formats based on user needs, aiding people with disabilities. It supports various models, ensures privacy, and integrates with institutional tech. ModCon-Task-Identifier outperforms other models in task extraction.

DetailsMotivation: Existing MAS lack customization for accessibility, creating barriers for users with disabilities. MATE aims to provide tailored assistance by converting data to accessible formats.

Method: MATE uses multimodal conversions (e.g., image to audio) and supports diverse models (LLM APIs, custom ML classifiers). ModCon-Task-Identifier extracts precise tasks from user input.

Result: MATE is adaptable, privacy-focused, and integrates well with institutional tech. ModCon-Task-Identifier outperforms other models in task extraction.

Conclusion: MATE offers a flexible, privacy-preserving solution for accessibility, with ModCon-Task-Identifier enhancing task precision. Publicly available for broader use.

Abstract: Accessibility remains a critical concern in today’s society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs the modality conversions based on the user’s needs. The system is useful for assisting people with disabilities by ensuring that data will be converted to an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare service) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
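As an illustration of the conversion-routing step that a task identifier feeds, everything below is hypothetical: ModCon-Task-Identifier is a trained model rather than a lookup, and MATE's actual converters live in the linked repository.

```python
from typing import Callable, Dict

def image_to_text(path: str) -> str:
    raise NotImplementedError("stand-in for an image-captioning model")

def text_to_speech(text: str) -> bytes:
    raise NotImplementedError("stand-in for a local TTS engine")

# Map the task string emitted by a task identifier to a converter.
CONVERTERS: Dict[str, Callable] = {
    "image->text": image_to_text,
    "text->speech": text_to_speech,
}

def route(task: str, payload):
    """Dispatch an identified modality-conversion task, e.g. 'image->text'."""
    if task not in CONVERTERS:
        raise ValueError(f"unsupported conversion: {task}")
    return CONVERTERS[task](payload)
```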

[427] Simulation for All: A Step-by-Step Cookbook for Developing Human-Centered Multi-Agent Transportation Simulators

Shiva Azimi, Arash Tavakoli

Main category: cs.MA

TL;DR: A human-centered multi-agent simulation platform for multimodal transportation studies, integrating immersive environments and open-source tools for accessibility and replication.

DetailsMotivation: Addressing limitations in existing simulation tools, which often exclude diverse road users, lack real-time interaction, and overlook accessibility.

Method: Develops a modular, extensible platform with immersive virtual environments, hardware integration (e.g., treadmill, biosensors), and multimodal data collection (e.g., fNIRS, eye tracking).

Result: Three use cases demonstrate the platform’s usability for studying multimodal mobility, supporting interdisciplinary experimentation.

Conclusion: The platform lowers barriers to high-fidelity transportation simulation, enhancing understanding of urban mobility.

Abstract: As cities evolve toward more complex and multimodal transportation systems, the need for human-centered multi-agent simulation tools has never been more urgent. Yet most existing platforms remain limited - they often separate different types of road users, rely on scripted or pre-defined behaviors, overlook public transit users as active participants, and are rarely designed with accessibility in mind for non-technical users. To address this gap, this paper presents the specifications of a multi-agent simulation platform designed to support real-time, human-centered, and immersive studies of all road users, accompanied by open-source scripts for replication. Using high-fidelity immersive virtual environments, our platform enables interaction across public transit users, pedestrians, cyclists, automated vehicles, and drivers. The architecture is modular, extensible, and designed for accessibility. The system integrates hardware-specific modules - including an omnidirectional treadmill, a seating arrangement, a smart trainer, and an actuated cockpit. Additionally, the platform collects multimodal physiological, neurological, and behavioral data through embedded sensing devices such as functional near-infrared spectroscopy (fNIRS), eye tracking, and wrist-based biosensors. To show the usability of this system, we present three use cases. Simulation for All aims to lower the barrier to entry for high-fidelity transportation simulation, support experimentation across disciplines, and advance our understanding of multimodal mobility in complex urban environments.

cs.MM

[428] MultiVox: Benchmarking Voice Assistants for Multimodal Interactions

Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha

Main category: cs.MM

TL;DR: MultiVox is a new benchmark for evaluating voice assistants’ ability to integrate spoken and visual cues, revealing current models’ limitations in context-aware responses.

DetailsMotivation: Current benchmarks inadequately assess multimodal understanding, especially fine-grained speech and visual cue integration.

Method: MultiVox includes 1000 annotated speech dialogues with diverse paralinguistic and visual cues, tested on 9 state-of-the-art models.

Result: Current models struggle with contextually grounded responses despite human proficiency.

Conclusion: MultiVox highlights the need for improved multimodal integration in voice assistants.

Abstract: The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.

eess.AS

[429] Standardized Evaluation of Fetal Phonocardiography Processing Methods

Kristóf Müller, Janka Hatvani, Márton Áron Goda, Miklós Koller

Main category: eess.AS

TL;DR: The paper evaluates fetal heart sound detection and heart rate estimation methods, finding no single best method. Simpler methods can match complex ones, and standardization is needed for future evaluations.

DetailsMotivation: Phonocardiography provides passive, radiation-free access to fetal heart rate and sound data.

Method: Comparison of available methods using a common benchmarking platform and standardized tests, including accuracy, error rates, and statistical measures.

Result: No method excelled in all tests. Best models achieved high F1-scores (97.6% for first heart sound, 91.4% for second) and low mean absolute errors (12.2±8.0 ms, 17.3±12.2 ms). Heart rate estimation had a 0.644 mean square error.

Conclusion: Further standardization is needed for evaluating fetal heart rate and sound detection methods. Tools and implementations are openly available.

Abstract: Motivation. Phonocardiography can give access to the fetal heart rate as well as direct heart sound data, and is entirely passive, using no radiation of any kind. Approach. We discuss the currently available methods for fetal heart sound detection and heart rate estimation and compare them using a common benchmarking platform and a pre-selected testing dataset. Compared to previous reviews, we evaluated the discussed methods in a standardized manner for a fair comparison. Our tests included tolerance-based detection accuracy, error rates for label insertions, deletions, and substitutions, and statistical measures for heart rate mean square error. Results. Based on our results, there is no definite best method that can achieve the highest scores in all of the tests, and simpler methods could perform comparably to more complex ones. The best model for first heart sound detection achieved 97.6% F1-score, 97.4% positive predictive value, and 12.2±8.0 ms mean absolute error. In terms of second heart sound detection the best model had 91.4% F1-score, 91.3% positive predictive value, and 17.3±12.2 ms mean absolute error. For fetal heart rate a 0.644 mean square error was achieved by the best method. Significance. Our main conclusion is that further standardization is required in fetal heart rate and heart sound detection method evaluation. The tests and algorithm implementations are openly available at: https://github.com/mulkr/standard-fpcg-evaluation.
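A sketch of tolerance-based scoring as described above: greedily match detected heart-sound onsets to reference onsets within a tolerance window, then compute precision, recall, and F1. The 60 ms tolerance is an assumed value, not necessarily the benchmark's.

```python
def tolerance_f1(detected_ms, reference_ms, tol_ms=60.0):
    """Greedy one-to-one matching of detections to reference onsets."""
    used, tp = set(), 0
    for d in sorted(detected_ms):
        for j, r in enumerate(sorted(reference_ms)):
            if j not in used and abs(d - r) <= tol_ms:
                used.add(j)
                tp += 1
                break
    prec = tp / len(detected_ms) if detected_ms else 0.0
    rec = tp / len(reference_ms) if reference_ms else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(tolerance_f1([100, 480, 910], [95, 500, 900, 1300]))
# -> (1.0, 0.75, ~0.857): three matches, one missed reference onset
```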

[430] Physics-Informed Transfer Learning for Data-Driven Sound Source Reconstruction in Near-Field Acoustic Holography

Xinmeng Luan, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti

Main category: eess.AS

TL;DR: A transfer learning framework for sound source reconstruction in NAH, using physics-informed fine-tuning to adapt pre-trained models across different sound sources.

DetailsMotivation: To improve reconstruction accuracy and generalization across different sound sources in NAH by leveraging transfer learning and physics-informed adaptation.

Method: Two-stage approach: (1) supervised pre-training of a CV-CNN on a large dataset, (2) physics-informed fine-tuning on a single sample using Kirchhoff-Helmholtz integral.

Result: Improved accuracy when transferring from rectangular plate to violin top plate datasets, outperforming pre-trained models and matching C-ESM. Fine-tuned models excel in successful modes.

Conclusion: The framework effectively generalizes across sound sources, enhancing reconstruction accuracy through physics-informed transfer learning.

Abstract: We propose a transfer learning framework for sound source reconstruction in Near-field Acoustic Holography (NAH), which adapts a well-trained data-driven model from one type of sound source to another using a physics-informed procedure. The framework comprises two stages: (1) supervised pre-training of a complex-valued convolutional neural network (CV-CNN) on a large dataset, and (2) purely physics-informed fine-tuning on a single data sample based on the Kirchhoff-Helmholtz integral. This method follows the principles of transfer learning by enabling generalization across different datasets through physics-informed adaptation. The effectiveness of the approach is validated by transferring a pre-trained model from a rectangular plate dataset to a violin top plate dataset, where it shows improved reconstruction accuracy compared to the pre-trained model and delivers performance comparable to that of Compressive-Equivalent Source Method (C-ESM). Furthermore, for successful modes, the fine-tuned model outperforms both the pre-trained model and C-ESM in accuracy.
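A skeleton of the two-stage recipe, with the Kirchhoff-Helmholtz forward operator abstracted as an assumed, given `forward_operator`; names, loss choices, and hyperparameters are placeholders rather than the authors' settings.

```python
import torch
import torch.nn.functional as F

def pretrain(model, loader, epochs=10, lr=1e-3):
    """Stage 1: supervised pre-training on (hologram, source) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for hologram, source in loader:
            loss = F.mse_loss(model(hologram), source)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune_physics(model, hologram, forward_operator, steps=500, lr=1e-4):
    """Stage 2: label-free fine-tuning on ONE sample -- the predicted
    source, pushed through a Kirchhoff-Helmholtz forward operator, must
    reproduce the measured hologram."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(forward_operator(model(hologram)), hologram)
        opt.zero_grad()
        loss.backward()
        opt.step()
```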

[431] Array-Aware Ambisonics and HRTF Encoding for Binaural Reproduction With Wearable Arrays

Yhonatan Gayer, Vladimir Tourbabin, Zamir Ben Hur, David Lou Alon, Boaz Rafaely

Main category: eess.AS

TL;DR: A novel method for binaural reproduction from arbitrary microphone arrays improves spatial accuracy by optimizing Ambisonics encoding with HRTF pre-processing. It outperforms conventional methods in objective and perceptual evaluations.

DetailsMotivation: To enhance spatial accuracy in binaural rendering for applications like VR, AR, and wearable audio capture by integrating array-specific information into HRTF processing.

Method: Array-aware optimization of Ambisonics encoding through HRTF pre-processing, incorporating array-specific details.

Result: Superior performance in objective evaluations and higher perceptual ratings in timbre and spatial quality compared to conventional methods.

Conclusion: The method provides a practical, fully compatible solution for spatial audio rendering, suitable for modern audio applications.

Abstract: This work introduces a novel method for binaural reproduction from arbitrary microphone arrays, based on array-aware optimization of Ambisonics encoding through Head-Related Transfer Function (HRTF) pre-processing. The proposed approach integrates array-specific information into the HRTF processing pipeline, leading to improved spatial accuracy in binaural rendering. Objective evaluations demonstrate superior performance under simulated wearable-array and head rotations compared to the conventional Ambisonics encoding method. A listening experiment further confirms that the method achieves significantly higher perceptual ratings in both timbre and spatial quality. Fully compatible with standard Ambisonics, the proposed method offers a practical solution for spatial audio rendering in applications such as virtual reality, augmented reality, and wearable audio capture.

[432] P.808 Multilingual Speech Enhancement Testing: Approach and Results of URGENT 2025 Challenge

Marvin Sach, Yihui Fu, Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Anurag Kumar, Wei Wang, Yanmin Qian, Shinji Watanabe, Tim Fingscheidt

Main category: eess.AS

TL;DR: The paper discusses the challenges in speech quality estimation for SE systems, critiques current objective metrics, and proposes localized crowdsourced subjective testing for multilingual evaluations. It also highlights the need for combining subjective and objective metrics for generative SE methods.

DetailsMotivation: The influx of generative or hybrid SE methods has revealed flaws in objective metrics, necessitating reliable subjective testing, especially for multilingual datasets.

Method: The paper recaps ITU-T P.808 crowdsourced testing and introduces a localized version of Naderi and Cutler’s ACR method for TTS. It also analyzes URGENT Challenge results.

Result: Findings suggest combining subjective ACR MOS with objective metrics (DNSMOS, NISQA) and phone fidelity metrics to detect hallucinations in generative SE methods.

Conclusion: The paper concludes by advocating for localized, multilingual subjective testing and releasing tools for easy deployment of such evaluations.

Abstract: In speech quality estimation for speech enhancement (SE) systems, subjective listening tests so far are considered as the gold standard. This should be even more true considering the large influx of new generative or hybrid methods into the field, revealing issues of some objective metrics. Efforts such as the Interspeech 2025 URGENT Speech Enhancement Challenge also involving non-English datasets add the aspect of multilinguality to the testing procedure. In this paper, we provide a brief recap of the ITU-T P.808 crowdsourced subjective listening test method. A first novel contribution is our proposed process of localizing both text and audio components of Naderi and Cutler’s implementation of crowdsourced subjective absolute category rating (ACR) listening tests involving text-to-speech (TTS). Further, we provide surprising analyses of and insights into URGENT Challenge results, tackling the reliability of (P.808) ACR subjective testing as gold standard in the age of generative AI. Particularly, it seems that for generative SE methods, subjective (ACR MOS) and objective (DNSMOS, NISQA) reference-free metrics should be accompanied by objective phone fidelity metrics to reliably detect hallucinations. Finally, in the accepted version, we will release our localization scripts and methods for easy deployment for new multilingual speech enhancement subjective evaluations according to ITU-T P.808.

[433] Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models

Paul A. Bereuter, Benjamin Stahl, Mark D. Plumbley, Alois Sontacchi

Main category: eess.AS

TL;DR: The paper critiques traditional BSS-Eval metrics for evaluating nonlinear generative models in singing voice separation, proposing alternative metrics like embedding-based MSE and multi-resolution STFT loss for better correlation with human perception (DMOS).

DetailsMotivation: Traditional BSS-Eval metrics are unreliable for nonlinear generative models, necessitating new evaluation methods aligned with human perception.

Method: Conducted a Degradation Category Rating test (DMOS) and analyzed correlations with objective metrics, comparing discriminative and generative models.

Result: Embedding-based metrics (e.g., MSE on Music2Latent or MERT-L12) and multi-resolution STFT loss showed higher DMOS correlation than BSS-Eval.

Conclusion: BSS-Eval is inadequate for generative models; embedding-based metrics are more reliable for singing voice separation evaluation.

Abstract: Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval. For discriminative models, the highest correlation is achieved by the MSE computed on Music2Latent embeddings. When it comes to the evaluation of generative models, the strongest correlations are evident for the multi-resolution STFT loss and the MSE calculated on MERT-L12 embeddings, with the latter also providing the most balanced correlation across both model types. Our results highlight the limitations of BSS-Eval metrics for evaluating generative singing voice separation models and emphasize the need for careful selection and validation of alternative evaluation metrics for the task of singing voice separation.
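The core measurement in such a study is the correlation between subjective scores and each objective metric, computed per separated system or per stimulus; with hypothetical numbers:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: subjective DMOS vs. an objective metric
dmos   = [4.1, 3.2, 2.5, 4.5, 3.8]
metric = [0.12, 0.31, 0.44, 0.08, 0.20]   # e.g. embedding MSE (lower = better)

r, _ = pearsonr(dmos, metric)
rho, _ = spearmanr(dmos, metric)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")   # strongly negative
```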

eess.IV

[434] Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays

Gaurav Singh

Main category: eess.IV

TL;DR: The study compares traditional ML and deep learning for pneumonia detection in CXRs, finding Vision Transformers (especially Cross-ViT) superior with 88.25% accuracy and 99.42% recall.

DetailsMotivation: Pneumonia, including COVID-19 cases, demands fast, accurate diagnosis, prompting a need for effective automated detection methods.

Method: Evaluated traditional ML (PCA, Logistic Regression, SVM) and deep learning (CNNs, ViTs) on 5,856 pediatric CXR images.

Result: Cross-ViT outperformed others with 88.25% accuracy and 99.42% recall, showing architectural impact > model size.

Conclusion: Vision Transformers, particularly Cross-ViT, are promising for automated pneumonia detection, balancing precision and recall.

Abstract: Pneumonia, particularly when induced by diseases like COVID-19, remains a critical global health challenge requiring rapid and accurate diagnosis. This study presents a comprehensive comparison of traditional machine learning and state-of-the-art deep learning approaches for automated pneumonia detection using chest X-rays (CXRs). We evaluate multiple methodologies, ranging from conventional machine learning techniques (PCA-based clustering, Logistic Regression, and Support Vector Classification) to advanced deep learning architectures including Convolutional Neural Networks (Modified LeNet, DenseNet-121) and various Vision Transformer (ViT) implementations (Deep-ViT, Compact Convolutional Transformer, and Cross-ViT). Using a dataset of 5,856 pediatric CXR images, we demonstrate that Vision Transformers, particularly the Cross-ViT architecture, achieve superior performance with 88.25% accuracy and 99.42% recall, surpassing traditional CNN approaches. Our analysis reveals that architectural choices impact performance more significantly than model size, with Cross-ViT’s 75M parameters outperforming larger models. The study also addresses practical considerations including computational efficiency, training requirements, and the critical balance between precision and recall in medical diagnostics. Our findings suggest that Vision Transformers offer a promising direction for automated pneumonia detection, potentially enabling more rapid and accurate diagnosis during health crises.

[435] A Survey on Medical Image Compression: From Traditional to Learning-Based

Guofeng Tong, Sixuan Liu, Yang Lv, Hanyu Pei, Feng-Lei Fan

Main category: eess.IV

TL;DR: A survey on medical image compression, comparing traditional and deep learning methods, focusing on preserving diagnostic quality and handling diverse imaging modalities.

DetailsMotivation: The rapid growth of medical imaging necessitates efficient compression methods that preserve diagnostic details while addressing storage and transmission challenges.

Method: The paper categorizes compression techniques into traditional (mathematical transforms) and learning-based (deep learning) approaches, further dividing them by data structure (2D, 3D/4D).

Result: Traditional methods offer predictability and standardization, while deep learning excels in adaptability and capturing complex image characteristics.

Conclusion: The survey highlights the need for balancing computational efficiency and clinical quality, suggesting future research directions in medical image compression.

Abstract: The exponential growth of medical imaging has created significant challenges in data storage, transmission, and management for healthcare systems. In this vein, efficient compression becomes increasingly important. Unlike natural image compression, medical image compression prioritizes preserving diagnostic details and structural integrity, imposing stricter quality requirements and demanding fast, memory-efficient algorithms that balance computational complexity with clinically acceptable reconstruction quality. Meanwhile, the medical imaging family includes a plethora of modalities, each possessing different requirements. For example, 2D medical image (e.g., X-rays, histopathological images) compression focuses on exploiting intra-slice spatial redundancy, while volumetric medical image compression requires handling intra-slice and inter-slice spatial correlations, and 4D dynamic imaging (e.g., time-series CT/MRI, 4D ultrasound) additionally demands processing temporal correlations between consecutive time frames. Traditional compression methods, grounded in mathematical transforms and information theory principles, provide solid theoretical foundations, predictable performance, and high standardization levels, with extensive validation in clinical environments. In contrast, deep learning-based approaches demonstrate remarkable adaptive learning capabilities and can capture complex statistical characteristics and semantic information within medical images. This comprehensive survey establishes a two-facet taxonomy based on data structure (2D vs 3D/4D) and technical approaches (traditional vs learning-based), thereby systematically presenting the complete technological evolution, analyzing the unique technical challenges, and prospecting future directions in medical image compression.

[436] Focus on Texture: Rethinking Pre-training in Masked Autoencoders for Medical Image Classification

Chetan Madan, Aarjav Satia, Soumen Basu, Pankaj Gupta, Usha Dutta, Chetan Arora

Main category: eess.IV

TL;DR: GLCM-MAE improves self-supervised pre-training for medical images by replacing pixel-wise MSE with a GLCM-based loss, preserving texture cues and outperforming state-of-the-art in four tasks.

DetailsMotivation: Standard MAEs with MSE loss blur textures crucial for medical imaging. GLCM captures intensity and spatial relationships, better preserving morphological features.

Method: Proposes GLCM-MAE, using a differentiable GLCM-based reconstruction loss instead of MSE. Includes a novel formulation to convert GLCM matrices into a loss function.

Result: GLCM-MAE outperforms state-of-the-art in gallbladder cancer (2.1%), breast cancer (3.1%), pneumonia (0.5%), and COVID detection (0.6%).

Conclusion: GLCM-based loss enhances unsupervised pre-training for medical imaging, improving downstream task performance by preserving critical texture features.

Abstract: Masked Autoencoders (MAEs) have emerged as a dominant strategy for self-supervised representation learning in natural images, where models are pre-trained to reconstruct masked patches with a pixel-wise mean squared error (MSE) between original and reconstructed RGB values as the loss. We observe that MSE encourages blurred image reconstruction, but still works for natural images as it preserves dominant edges. However, in medical imaging, where texture cues are often more important for classifying a visual abnormality, the strategy fails. Taking inspiration from the Gray Level Co-occurrence Matrix (GLCM) feature in Radiomics studies, we propose a novel MAE-based pre-training framework, GLCM-MAE, using a reconstruction loss based on matching GLCM. GLCM captures intensity and spatial relationships in an image, hence the proposed loss helps preserve morphological features. Further, we propose a novel formulation to convert matching GLCM matrices into a differentiable loss function. We demonstrate that unsupervised pre-training on medical images with the proposed GLCM loss improves representations for downstream tasks. GLCM-MAE outperforms the current state-of-the-art across four tasks - gallbladder cancer detection from ultrasound images by 2.1%, breast cancer detection from ultrasound by 3.1%, pneumonia detection from x-rays by 0.5%, and COVID detection from CT by 0.6%. Source code and pre-trained models are available at: https://github.com/ChetanMadan/GLCM-MAE.
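The paper's exact differentiable formulation is not reproduced in the summary; the sketch below shows one standard way to make a GLCM differentiable (soft-binning pixel intensities, then accumulating co-occurrences for a single horizontal offset) and to turn it into a reconstruction loss. The gray-level count, bin width, and single offset are assumptions.

```python
import torch

def soft_glcm(img: torch.Tensor, levels: int = 8, sigma: float = 0.05):
    """Differentiable GLCM for an (H, W) image in [0, 1]: softly assign
    each pixel to gray levels, then accumulate co-occurrences at offset
    (0, 1) via an outer product of neighbouring assignments."""
    centers = torch.linspace(0.0, 1.0, levels)
    w = torch.exp(-((img.unsqueeze(-1) - centers) ** 2) / (2 * sigma ** 2))
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)       # (H, W, levels)
    left, right = w[:, :-1, :], w[:, 1:, :]            # horizontal pairs
    glcm = torch.einsum("hwi,hwj->ij", left, right)
    return glcm / glcm.sum()

def glcm_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return ((soft_glcm(recon) - soft_glcm(target)) ** 2).mean()

x = torch.rand(32, 32)
x_hat = torch.rand(32, 32, requires_grad=True)
glcm_loss(x_hat, x).backward()   # gradients flow through the soft GLCM
```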

[437] Real-Time Foreign Object Recognition Based on Improved Wavelet Scattering Deep Network and Edge Computing

He Zhichao, Shen Xiangyu, Zhang Yong, Xie Nan

Main category: eess.IV

TL;DR: A lightweight model using improved wavelet scattering networks enables real-time foreign object detection on UAV edge devices with high accuracy and speed.

DetailsMotivation: The rise of new energy in power systems demands efficient substation and transmission line maintenance. UAVs can detect foreign objects, but edge devices lack computational power for real-time image processing.

Method: Proposes a model with an improved wavelet scattering network for feature extraction, replacing CNN layers, followed by a simplified MLP for classification.

Result: Achieves over 90% accuracy and under 7ms inference time for 720P images on edge devices, outperforming YOLOv5s and YOLOv8s.

Conclusion: The model is effective for real-time foreign object detection on resource-limited UAV edge devices.

Abstract: The increasing penetration rate of new energy in the power system has put forward higher requirements for the operation and maintenance of substations and transmission lines. Using Unmanned Aerial Vehicles (UAVs) to identify foreign objects in real time can quickly and effectively eliminate potential safety hazards. However, due to limited computation power, captured images cannot be processed in real time locally on the UAV's edge devices. To overcome this problem, a lightweight model based on an improved wavelet scattering deep network is proposed. This model contains an improved wavelet scattering network that extracts the scattering coefficients and modulus coefficients of a single image channel, replacing the role of the convolutional and pooling layers in a convolutional neural network. The following three fully connected layers, which constitute a simplified Multilayer Perceptron (MLP), are used to classify the extracted features. Experiments show that the model constructed with a biorthogonal wavelet basis is able to recognize and classify foreign objects on edge devices such as the Raspberry Pi and Jetson Nano, with accuracy higher than 90% and inference time less than 7 ms for 720P (1280*720) images. Further experiments demonstrate that the recognition accuracy of our model is 1.1% higher than YOLOv5s and 0.3% higher than YOLOv8s.
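A hedged sketch of the scattering-then-MLP pattern, using the kymatio package's Morlet-based Scattering2D as a stand-in for the paper's improved biorthogonal scattering network; the input size, layer widths, and class count are placeholders.

```python
import torch
import torch.nn as nn
from kymatio.torch import Scattering2D   # assumes the kymatio package

class ScatterMLP(nn.Module):
    """Fixed wavelet-scattering features (replacing conv/pool layers)
    followed by three fully connected layers, per the paper's outline."""
    def __init__(self, img_size: int = 32, n_classes: int = 4):
        super().__init__()
        self.scatter = Scattering2D(J=2, shape=(img_size, img_size))
        with torch.no_grad():   # probe the flattened feature size once
            d = self.scatter(torch.zeros(1, 1, img_size, img_size)).numel()
        self.mlp = nn.Sequential(
            nn.Linear(d, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.scatter(x).flatten(start_dim=1))

logits = ScatterMLP()(torch.rand(2, 1, 32, 32))   # (2, n_classes)
```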

[438] Using Continual Learning for Real-Time Detection of Vulnerable Road Users in Complex Traffic Scenarios

Faryal Aurooj Nasir, Salman Liaquat, Nor Muzlifah Mahyuddin

Main category: eess.IV

TL;DR: The paper introduces an intelligent adaptive system using YOLOv8-Dynamic (YOLOv8-D) for real-time detection and prevention of accidents involving vulnerable road users (VRUs). It outperforms other models like YOLOv5 and YOLOv7 in F1 score and mAP, integrates continual learning, and addresses catastrophic forgetting.

DetailsMotivation: To enhance safety for pedestrians and bicyclists (VRUs) by improving real-time detection and adaptability in dynamic traffic scenarios.

Method: Uses YOLOv8-D algorithm, compares it with Faster-RCNN, YOLOv5, and YOLOv7, optimizes gradient descent, and trains on diverse datasets to prevent catastrophic forgetting.

Result: YOLOv8x shows significant improvements: 12.14% in F1 score and 45.61% in mAP over YOLOv5x, and 21.26% in F1 score and 128.44% in mAP over YOLOv7x. The optimized framework achieves 21.08% better F1 score and 31.86% better mAP.

Conclusion: The proposed system effectively detects VRUs, adapts to evolving conditions, and overcomes catastrophic forgetting, enhancing real-time safety.

Abstract: Pedestrians and bicyclists are among the vulnerable road users (VRUs) that are inherently exposed to intricate traffic scenarios, which puts them at increased risk of sustaining injuries or facing fatal outcomes. This study presents an intelligent adaptive system that uses the YOLOv8-Dynamic (YOLOv8-D) algorithm that detects vulnerable road users and adapts in real time to prevent accidents before they occur. We select YOLOv8x as the detector by comparing it with other state-of-the-art object detection models, including Faster-RCNN, YOLOv5, YOLOv7, and variants. Compared to YOLOv5x, YOLOv8x shows improvements of 12.14% in F1 score and 45.61% in mean Average Precision (mAP). Against YOLOv7x, the improvements are 21.26% in F1 score and 128.44% in mAP. Our algorithm integrates continual learning ability in the architecture of the YOLOv8 detector to adjust to evolving road conditions flexibly, ensuring adaptability across multiple dataset domains and facilitating continuous enhancement of detection and tracking accuracy for VRUs, embracing the dynamic nature of real-world environments. In our proposed framework, we optimized the gradient descent mechanism of YOLOv8 model and train our optimized algorithm on two statistically different datasets in terms of image viewpoint and number of classes to achieve a 21.08% improvement in F1 score and a 31.86% improvement in mAP as compared to a custom YOLOv8 framework trained on a new dataset, thus overcoming the issue of catastrophic forgetting, which occurs when deep models are trained on statistically different types of datasets.

[439] U-RWKV: Lightweight medical image segmentation with direction-adaptive RWKV

Hongbo Ye, Fenghe Tang, Peiang Zhao, Zhen Huang, Dexin Zhao, Minghao Bian, S. Kevin Zhou

Main category: eess.IV

TL;DR: U-RWKV is a novel medical image segmentation framework leveraging RWKV architecture for efficient long-range modeling, introducing DARM and SASE modules to improve performance and computational efficiency.

DetailsMotivation: Addressing the limitations of existing methods like U-Net in capturing long-range dependencies due to limited global ERFs, especially in resource-limited healthcare settings.

Method: Proposes U-RWKV with DARM (Direction-Adaptive RWKV Module) for contextual cue aggregation and SASE (Stage-Adaptive Squeeze-and-Excitation Module) for dynamic architecture adaptation.

Result: Achieves state-of-the-art segmentation performance with high computational efficiency.

Conclusion: U-RWKV offers a practical solution for democratizing advanced medical imaging in resource-constrained environments.

Abstract: Achieving equity in healthcare accessibility requires lightweight yet high-performance solutions for medical image segmentation, particularly in resource-limited settings. Existing methods like U-Net and its variants often suffer from limited global Effective Receptive Fields (ERFs), hindering their ability to capture long-range dependencies. To address this, we propose U-RWKV, a novel framework leveraging the Recurrent Weighted Key-Value (RWKV) architecture, which achieves efficient long-range modeling at O(N) computational cost. The framework introduces two key innovations: the Direction-Adaptive RWKV Module (DARM) and the Stage-Adaptive Squeeze-and-Excitation Module (SASE). DARM employs Dual-RWKV and QuadScan mechanisms to aggregate contextual cues across images, mitigating directional bias while preserving global context and maintaining high computational efficiency. SASE dynamically adapts its architecture to different feature extraction stages, balancing high-resolution detail preservation and semantic relationship capture. Experiments demonstrate that U-RWKV achieves state-of-the-art segmentation performance with high computational efficiency, offering a practical solution for democratizing advanced medical imaging technologies in resource-constrained environments. The code is available at https://github.com/hbyecoding/U-RWKV.
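The summary does not spell out QuadScan; as a rough, explicitly speculative illustration of a four-direction scan, the sketch below flattens a feature map along four raster orderings so a 1-D sequence model (such as RWKV) sees complementary directions. This is an assumption about the mechanism, not the authors' code.

```python
import torch

def quad_scan(x: torch.Tensor):
    """Flatten a (C, H, W) feature map along four scan directions."""
    fwd = x.flatten(1)                    # row-major, left-to-right
    bwd = fwd.flip(-1)                    # reversed raster order
    vert = x.transpose(1, 2).flatten(1)   # column-major, top-to-bottom
    vert_bwd = vert.flip(-1)
    return fwd, bwd, vert, vert_bwd

fwd, bwd, vert, vert_bwd = quad_scan(torch.rand(8, 4, 4))
```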

[440] Precision Spatio-Temporal Feature Fusion for Robust Remote Sensing Change Detection

Buddhi Wijenayake, Athulya Ratnayake, Praveen Sumanasekara, Nichula Wasalathilaka, Mathivathanan Piratheepan, Roshan Godaliyadda, Mervyn Ekanayake, Vijitha Herath

Main category: eess.IV

TL;DR: The paper proposes an enhanced method for remote sensing change detection using precision fusion blocks and an optimized decoder pipeline, outperforming state-of-the-art models in accuracy and efficiency.

DetailsMotivation: Traditional methods and early deep learning models struggle with long-range dependencies and noise sensitivity in remote sensing change detection. Transformer-based models, while effective, are computationally heavy.

Method: The approach leverages ChangeMamba architecture with precision fusion blocks for temporal variations and per-pixel differences, an enhanced decoder pipeline, and an optimized loss function combining Cross Entropy, Dice, and Lovasz objectives.

Result: Evaluations on SYSU-CD, LEVIR-CD+, and WHU-CD datasets show superior precision, recall, F1 score, IoU, and overall accuracy compared to existing methods.

Conclusion: The proposed method is robust and efficient for remote sensing change detection, with publicly available code and pretrained models.

Abstract: Remote sensing change detection is vital for monitoring environmental and urban transformations but faces challenges like manual feature extraction and sensitivity to noise. Traditional methods and early deep learning models, such as convolutional neural networks (CNNs), struggle to capture long-range dependencies and global context essential for accurate change detection in complex scenes. While Transformer-based models mitigate these issues, their computational complexity limits their applicability in high-resolution remote sensing. Building upon ChangeMamba architecture, which leverages state space models for efficient global context modeling, this paper proposes precision fusion blocks to capture channel-wise temporal variations and per-pixel differences for fine-grained change detection. An enhanced decoder pipeline, incorporating lightweight channel reduction mechanisms, preserves local details with minimal computational cost. Additionally, an optimized loss function combining Cross Entropy, Dice and Lovasz objectives addresses class imbalance and boosts Intersection-over-Union (IoU). Evaluations on SYSU-CD, LEVIR-CD+, and WHU-CD datasets demonstrate superior precision, recall, F1 score, IoU, and overall accuracy compared to state-of-the-art methods, highlighting the approach’s robustness for remote sensing change detection. For complete transparency, the codes and pretrained models are accessible at https://github.com/Buddhi19/MambaCD.git
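A minimal sketch of the composite objective for a binary change map: cross-entropy plus soft Dice, with the Lovasz term left to a reference implementation. The weights and the binary setting are assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice on the foreground probability; robust to the class
    imbalance typical of sparse change maps."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)

def combined_loss(logits, target, w_ce=1.0, w_dice=1.0):
    """Cross-entropy + Dice; the paper additionally adds a Lovasz term,
    which would come from a reference implementation."""
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return w_ce * ce + w_dice * dice_loss(logits, target)

logits = torch.randn(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.9).float()   # sparse change pixels
print(combined_loss(logits, target))
```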

[441] The Utility of the Virtual Imaging Trials Methodology for Objective Characterization of AI Systems and Training Data

Fakrul Islam Tushar, Lavsen Dahal, Saman Sotoudeh-Paima, Ehsan Abadi, W. Paul Segars, Ehsan Samei, Joseph Y. Lo

Main category: eess.IV

TL;DR: The study explores Virtual Imaging Trials (VIT) to assess AI model credibility in medical imaging, focusing on COVID-19 diagnosis using CT and CXR with CNNs. Diverse datasets improved performance, but external validation showed significant drops, emphasizing the need for comprehensive data. VIT provided insights into model utility and factors affecting AI efficacy.

DetailsMotivation: To address the challenge of AI model credibility in medical imaging by leveraging VIT methodologies for objective assessment.

Method: Developed and tested multiple AI models (3D ResNet-like and 2D EfficientNetv2) using diverse datasets for COVID-19 diagnosis via CT and CXR. Evaluated performance using AUC and DeLong method.

Result: Models trained on diverse datasets performed best (AUC 0.73-0.76 for CT, 0.70-0.73 for CXR). External validation showed performance drops (AUC 0.77-0.85 for CT, 0.77-1.0 for CXR), highlighting data diversity’s importance.

Conclusion: VIT enhances AI model transparency and reliability, offering insights into performance drivers and bridging experimental-clinical gaps.

Abstract: The credibility of Artificial Intelligence (AI) models for medical imaging continues to be a challenge, affected by the diversity of models, the data used to train the models, and applicability of their combination to produce reproducible results for new data. In this work we aimed to explore if the emerging Virtual Imaging Trials (VIT) methodologies can provide an objective resource to approach this challenge. The study was conducted for the case example of COVID-19 diagnosis using clinical and virtual computed tomography (CT) and chest radiography (CXR) processed with convolutional neural networks (CNNs). Multiple AI models were developed and tested using 3D ResNet-like and 2D EfficientNetv2 architectures across diverse datasets. The performance differences were evaluated in terms of the area under the curve (AUC) and the DeLong method for AUC confidence intervals. The models trained on the most diverse datasets showed the highest external testing performance, with AUC values ranging from 0.73 to 0.76 for CT and 0.70 to 0.73 for CXR. Internal testing yielded higher AUC values (0.77 to 0.85 for CT and 0.77 to 1.0 for CXR), highlighting a substantial drop in performance during external validation, which underscores the importance of diverse and comprehensive training and testing data. Most notably, the VIT approach provided objective assessment of the utility of diverse models and datasets while further providing insight into the influence of dataset characteristics, patient factors, and imaging physics on AI efficacy. The VIT approach can be used to enhance model transparency and reliability, offering nuanced insights into the factors driving AI performance and bridging the gap between experimental and clinical settings.
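The study reports AUCs with DeLong confidence intervals; the sketch below computes the AUC with a bootstrap CI instead, a deliberately simpler stand-in, since DeLong needs a dedicated implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_bootstrap_ci(y_true, y_score, n_boot=2000, seed=0):
    """AUC with a bootstrap 95% CI (stand-in for the DeLong method)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # resample must keep both classes
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return roc_auc_score(y_true, y_score), (lo, hi)

auc, ci = auc_with_bootstrap_ci([0, 0, 1, 1, 1, 0],
                                [0.1, 0.4, 0.8, 0.65, 0.9, 0.3])
print(f"AUC = {auc:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```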

[442] Moner: Motion Correction in Undersampled Radial MRI with Unsupervised Neural Representation

Qing Wu, Chenhe Du, Xuanyu Tian, Jingyi Yu, Yuyao Zhang, Hongjiang Wei

Main category: eess.IV

TL;DR: Moner is an unsupervised motion correction method for radial MRI that avoids the need for pre-training data by using implicit neural representation and a coarse-to-fine hash encoding strategy.

DetailsMotivation: Current motion correction methods require large datasets for pre-training, increasing costs and limiting generalization. Moner addresses this by eliminating the need for training data.

Method: Moner integrates a quasi-static motion model into implicit neural representation (INR) and reformulates reconstruction using the Fourier-slice theorem. A coarse-to-fine hash encoding strategy enhances accuracy.

Result: Moner matches state-of-the-art performance on in-domain data and outperforms on out-of-domain data.

Conclusion: Moner provides a cost-effective, generalizable solution for motion correction in radial MRI without requiring pre-training data.

Abstract: Motion correction (MoCo) in radial MRI is a particularly challenging problem due to the unpredictability of subject movement. Current state-of-the-art (SOTA) MoCo algorithms often rely on extensive high-quality MR images to pre-train neural networks, which constrains the solution space and leads to outstanding image reconstruction results. However, the need for large-scale datasets significantly increases costs and limits model generalization. In this work, we propose Moner, an unsupervised MoCo method that jointly reconstructs artifact-free MR images and estimates accurate motion from undersampled, rigid motion-corrupted k-space data, without requiring any training data. Our core idea is to leverage the continuous prior of implicit neural representation (INR) to constrain this ill-posed inverse problem, facilitating optimal solutions. Specifically, we integrate a quasi-static motion model into the INR, granting its ability to correct subject’s motion. To stabilize model optimization, we reformulate radial MRI reconstruction as a back-projection problem using the Fourier-slice theorem. Additionally, we propose a novel coarse-to-fine hash encoding strategy, significantly enhancing MoCo accuracy. Experiments on multiple MRI datasets show our Moner achieves performance comparable to SOTA MoCo techniques on in-domain data, while demonstrating significant improvements on out-of-domain data. The code is available at: https://github.com/iwuqing/Moner

[443] From Real Artifacts to Virtual Reference: A Robust Framework for Translating Endoscopic Images

Junyang Wu, Fangfang Xie, Jiayuan Sun, Yun Gu, Guang-Zhong Yang

Main category: eess.IV

TL;DR: The paper introduces an artifact-resilient image translation method for aligning noisy endoscopic videos with clean virtual images in medical domain adaptation, featuring a local-global framework and noise-resilient feature extraction.

DetailsMotivation: Addressing distribution shifts caused by in vivo artifacts in endoscopic imaging to improve surgical planning and navigation by aligning noisy intra-operative videos with pre-operative data.

Method: Proposes a local-global translation framework (local feature denoising and global style transfer) and a noise-resilient contrastive learning strategy for feature extraction.

Result: Validation on clinical datasets shows significant performance improvement over state-of-the-art methods.

Conclusion: The method effectively bridges domain gaps in multimodal medical imaging, enhancing robustness for intraoperative guidance.

Abstract: Domain adaptation, which bridges the distributions across different modalities, plays a crucial role in multimodal medical image analysis. In endoscopic imaging, combining pre-operative data with intra-operative imaging is important for surgical planning and navigation. However, existing domain adaptation methods are hampered by distribution shift caused by in vivo artifacts, necessitating robust techniques for aligning noisy, artifact-abundant patient endoscopic videos with clean virtual images reconstructed from pre-operative tomographic data for pose estimation during intraoperative guidance. This paper presents an artifact-resilient image translation method and an associated benchmark for this purpose. The method incorporates a novel "local-global" translation framework and a noise-resilient feature extraction strategy. For the former, it decouples the image translation process into a local step for feature denoising, and a global step for global style transfer. For feature extraction, a new contrastive learning strategy is proposed, which can extract noise-resilient features for establishing robust correspondence across domains. Detailed validation on both public and in-house clinical datasets has been conducted, demonstrating significantly improved performance compared to the current state-of-the-art.
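
The noise-resilient feature extraction builds on contrastive learning. Below is a generic InfoNCE sketch that pulls the features of a clean view and an artifact-corrupted view of the same frame together; the paper's actual positive/negative construction may differ, so treat this as the family of objectives rather than the method itself.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Standard InfoNCE: match each anchor to its positive within the batch."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.T / temperature      # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0))   # i-th anchor pairs with i-th positive
    return F.cross_entropy(logits, targets)

# Toy usage: clean features vs. a noise-perturbed view of the same samples
feats_clean = torch.randn(8, 128)
feats_noisy = feats_clean + 0.1 * torch.randn(8, 128)
print(info_nce(feats_noisy, feats_clean).item())
```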

[444] 360-Degree Video Super Resolution and Quality Enhancement Challenge: Methods and Results

Ahmed Telili, Wassim Hamidouche, Ibrahim Farhat, Hadi Amirpour, Christian Timmerer, Ibrahim Khadraoui, Jiajie Lu, The Van Le, Jeonneung Baek, Jin Young Lee, Yiying Wei, Xiaopeng Sun, Yu Gao, JianCheng Huangl, Yujie Zhong

Main category: eess.IV

TL;DR: The paper introduces a challenge to enhance 360-degree video quality for real-time streaming, focusing on super-resolution solutions to overcome bandwidth and latency constraints.

DetailsMotivation: The rise of immersive technologies like VR and XR demands high-quality 360-degree video streaming, but current methods compromise quality due to bandwidth and latency issues.

Method: The 360-degree Video Super Resolution and Quality Enhancement Challenge was initiated, with two tracks (2x and 4x SR), to develop machine learning solutions for improving low-bitrate compressed videos.

Result: Top-performing models were evaluated for quality enhancement, bitrate gain, and computational efficiency within a unified framework.

Conclusion: The challenge aims to foster innovation in real-time 360-degree video streaming, enhancing the quality and accessibility of immersive experiences.

Abstract: Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, especially in live mobile scenarios like unmanned aerial vehicles (UAVs), is challenged by limited bandwidth and strict latency constraints. Traditional methods, such as compression and adaptive resolution, help but often compromise video quality and introduce artifacts that degrade the viewer experience. Additionally, the unique spherical geometry of 360-degree video presents challenges not encountered in traditional 2D video. To address these issues, we initiated the 360-degree Video Super Resolution and Quality Enhancement Challenge. This competition encourages participants to develop efficient machine learning solutions to enhance the quality of low-bitrate compressed 360-degree videos, with two tracks focusing on 2x and 4x super-resolution (SR). In this paper, we outline the challenge framework, detailing the two competition tracks and highlighting the SR solutions proposed by the top-performing models. We assess these models within a unified framework, considering quality enhancement, bitrate gain, and computational efficiency. This challenge aims to drive innovation in real-time 360-degree video streaming, improving the quality and accessibility of immersive visual experiences.
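
Bitrate gain in codec challenges is conventionally reported as a Bjontegaard Delta rate (BD-rate). A standard cubic-fit computation is sketched below; this is the textbook procedure, not the challenge's own evaluation code.

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """BD-rate between two rate-distortion curves.

    Fits log10(rate) as a cubic polynomial of PSNR, integrates both fits
    over the overlapping quality range, and converts the mean log-rate
    difference to a percentage. Negative = test codec saves bitrate.
    """
    p_ref = np.polyfit(psnr_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnr_test, np.log10(rates_test), 3)

    lo = max(min(psnr_ref), min(psnr_test))  # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))

    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)

    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10 ** avg_diff - 1) * 100

# Toy usage: the test codec needs ~10% less rate at every quality point
print(bd_rate([100, 200, 400, 800], [30, 33, 36, 39],
              [90, 180, 360, 720], [30, 33, 36, 39]))  # ~ -10.0
```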

[445] Partition Map-Based Fast Block Partitioning for VVC Inter Coding

Xinmin Feng, Zhuoyuan Li, Li Li, Dong Liu, Feng Wu

Main category: eess.IV

TL;DR: Proposes a partition map-based algorithm for fast block partitioning in VVC inter coding, reducing encoder complexity while maintaining performance.

DetailsMotivation: The QT+MTT block structure in VVC increases encoder complexity due to recursive partition search, necessitating efficient solutions.

Method: Develops a neural network using spatial/temporal features for partition map prediction, with MTT mask for early termination and dual-threshold decision for complexity-performance trade-off.

Result: Achieves 51.30% encoding time saving with 2.12% BDBR increase under random access configuration.

Conclusion: The method effectively balances complexity reduction and RD performance in VVC inter coding.

Abstract: Among the new techniques of Versatile Video Coding (VVC), the quadtree with nested multi-type tree (QT+MTT) block structure yields significant coding gains by providing more flexible block partitioning patterns. However, the recursive partition search in the VVC encoder increases the encoder complexity substantially. To address this issue, we propose a partition map-based algorithm for fast block partitioning in inter coding. Based on our previous work on partition map-based methods for intra coding, we analyze the characteristics of VVC inter coding, and thus improve the partition map by incorporating an MTT mask for early termination. Next, we develop a neural network that uses both spatial and temporal features to predict the partition map. It consists of several special designs including stacked top-down and bottom-up processing, quantization parameter modulation layers, and partitioning-adaptive warping. Furthermore, we present a dual-threshold decision scheme to achieve a fine-grained trade-off between complexity reduction and rate-distortion (RD) performance loss. The experimental results demonstrate that the proposed method achieves an average 51.30% encoding time saving with a 2.12% Bjontegaard Delta Bit Rate (BDBR) increase under the random access configuration.
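
The dual-threshold decision scheme can be pictured as follows: confident network predictions skip part of the recursive RD search, while uncertain ones fall back to the full search. The threshold values and names here are invented for illustration; widening the uncertain band trades away complexity reduction for less RD loss.

```python
def partition_decision(p_split, t_low=0.2, t_high=0.8):
    """Map a predicted split probability to a hypothetical encoder action."""
    if p_split >= t_high:
        return "split"           # commit to splitting, skip non-split RD check
    if p_split <= t_low:
        return "no_split"        # terminate partitioning early
    return "full_rd_search"      # uncertain: evaluate both modes, no speedup

for p in (0.05, 0.5, 0.95):
    print(p, "->", partition_decision(p))
```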

[446] petBrain: A New Pipeline for Amyloid, Tau Tangles and Neurodegeneration Quantification Using PET and MRI

Pierrick Coupé, Boris Mansencal, Floréal Morandat, Sergio Morell-Ortega, Nicolas Villain, Jose V. Manjón, Vincent Planche

Main category: eess.IV

TL;DR: petBrain is a new web-based pipeline for AD biomarker analysis, offering fast, reliable quantification of amyloid-PET, tau-PET, and MRI data without requiring local infrastructure.

DetailsMotivation: Existing pipelines for AD biomarker analysis are slow, inconsistent, and struggle with multimodal integration, necessitating a more efficient solution.

Method: petBrain uses deep learning for segmentation, standardized biomarker quantification (Centiloid, CenTauR, HAVAs), and simultaneous A/T2/N estimation, all accessible via a web platform.

Result: petBrain matches existing pipelines in reliability for A and T2, aligns with ADNI data, and correlates well with CSF/plasma biomarkers and clinical outcomes.

Conclusion: petBrain is a robust, accessible tool for standardized AD biomarker analysis, enhancing clinical research.

Abstract: INTRODUCTION: Quantification of amyloid plaques (A), neurofibrillary tangles (T2), and neurodegeneration (N) using PET and MRI is critical for Alzheimer’s disease (AD) diagnosis and prognosis. Existing pipelines face limitations regarding processing time, variability in tracer types, and challenges in multimodal integration. METHODS: We developed petBrain, a novel end-to-end processing pipeline for amyloid-PET, tau-PET, and structural MRI. It leverages deep learning-based segmentation, standardized biomarker quantification (Centiloid, CenTauR, HAVAs), and simultaneous estimation of A, T2, and N biomarkers. The pipeline is implemented as a web-based platform, requiring no local computational infrastructure or specialized software knowledge. RESULTS: petBrain provides reliable and rapid biomarker quantification, with results comparable to existing pipelines for A and T2. It shows strong concordance with data processed in ADNI databases. The staging and quantification of A/T2/N by petBrain demonstrated good agreement with CSF/plasma biomarkers, clinical status, and cognitive performance. DISCUSSION: petBrain represents a powerful and openly accessible platform for standardized AD biomarker analysis, facilitating applications in clinical research.
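
For context on the Centiloid values petBrain outputs, the Centiloid scale (Klunk et al., 2015) is a linear rescaling of a tracer-specific SUVR so that young controls average 0 and typical AD patients average 100. The anchor SUVRs in this sketch are made up; real calibrations are tracer- and pipeline-specific.

```python
def centiloid(suvr, suvr_yc, suvr_ad):
    """Standard Centiloid linear scaling: 0 = young-control mean SUVR,
    100 = typical-AD mean SUVR, for a given tracer calibration."""
    return 100.0 * (suvr - suvr_yc) / (suvr_ad - suvr_yc)

# Toy usage with illustrative (not petBrain's) anchor values
print(centiloid(suvr=1.45, suvr_yc=1.05, suvr_ad=2.05))  # -> 40.0
```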

[447] IM-LUT: Interpolation Mixing Look-Up Tables for Image Super-Resolution

Sejin Park, Sangmin Lee, Kyong Hwan Jin, Seung-Won Jung

Main category: eess.IV

TL;DR: The paper introduces IM-LUT, a framework for arbitrary-scale image super-resolution (ASISR) that blends interpolation functions efficiently using LUTs, outperforming existing methods in quality and efficiency.

DetailsMotivation: Existing LUT-based SR methods are limited to fixed scales, while ASISR techniques using implicit neural representations are computationally expensive. IM-LUT addresses these issues by combining interpolation functions for flexibility and efficiency.

Method: Proposes IM-LUT, which learns to blend interpolation functions via IM-Net (predicting mixing weights) and replaces costly operations with LUTs for fast CPU inference.

Result: IM-LUT achieves superior balance between image quality and efficiency on benchmark datasets, outperforming existing methods.

Conclusion: IM-LUT is a promising solution for resource-constrained ASISR applications due to its lightweight and efficient performance.

Abstract: Super-resolution (SR) has been a pivotal task in image processing, aimed at enhancing image resolution across various applications. Recently, look-up table (LUT)-based approaches have attracted interest due to their efficiency and performance. However, these methods are typically designed for fixed scale factors, making them unsuitable for arbitrary-scale image SR (ASISR). Existing ASISR techniques often employ implicit neural representations, which come with considerable computational cost and memory demands. To address these limitations, we propose Interpolation Mixing LUT (IM-LUT), a novel framework that performs ASISR by learning to blend multiple interpolation functions to maximize their representational capacity. Specifically, we introduce IM-Net, a network trained to predict mixing weights for interpolation functions based on local image patterns and the target scale factor. To enhance the efficiency of interpolation-based methods, IM-Net is transformed into IM-LUT, where LUTs are employed to replace computationally expensive operations, enabling lightweight and fast inference on CPUs while preserving reconstruction quality. Experimental results on several benchmark datasets demonstrate that IM-LUT consistently achieves a superior balance between image quality and efficiency compared to existing methods, highlighting its potential as a promising solution for resource-constrained applications.
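
The core blending step can be sketched in a few lines: upscale with two fixed interpolators and mix them per output pixel. In IM-LUT the mixing weights come from IM-Net and are later baked into look-up tables; the random weights below are placeholders that only show the convex combination.

```python
import numpy as np
from scipy.ndimage import zoom

img = np.random.rand(32, 32)   # toy low-resolution image
scale = 2.0

nearest = zoom(img, scale, order=0)   # order-0: nearest-neighbour upscaling
cubic = zoom(img, scale, order=3)     # order-3: cubic-spline upscaling

w = np.random.rand(*cubic.shape)      # stand-in for IM-Net's predicted weights
sr = w * cubic + (1.0 - w) * nearest  # per-pixel convex combination
print(sr.shape)                       # (64, 64)
```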

[448] Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS) in Edge Iterative MRI Lesion Localization System (EdgeIMLocSys)

Guohao Huo, Ruiting Dai, Hao Tang

Main category: eess.IV

TL;DR: EdgeIMLocSys integrates continuous learning from clinician feedback to improve brain tumor segmentation across MRI scanners, using GMLN-BTS for efficient, high-accuracy results.

DetailsMotivation: Variability in MRI scanner imaging quality challenges model generalization in brain tumor segmentation.

Method: Proposes GMLN-BTS with M2AE for feature extraction, G2MCIM for cross-modal interaction, and VRUM for boundary refinement.

Result: Achieves 85.1% Dice score on BraTS2017 with 98% fewer parameters than 3D Transformers.

Conclusion: Demonstrates high-accuracy, resource-efficient segmentation suitable for clinical deployment.

Abstract: Brain tumor segmentation plays a critical role in clinical diagnosis and treatment planning, yet the variability in imaging quality across different MRI scanners presents significant challenges to model generalization. To address this, we propose the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which integrates Continuous Learning from Human Feedback to adaptively fine-tune segmentation models based on clinician feedback, thereby enhancing robustness to scanner-specific imaging characteristics. Central to this system is the Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS), which employs a Modality-Aware Adaptive Encoder (M2AE) to extract multi-scale semantic features efficiently, and a Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) to model complementary cross-modal relationships via graph structures. Additionally, we introduce a novel Voxel Refinement UpSampling Module (VRUM) that synergistically combines linear interpolation and multi-scale transposed convolutions to suppress artifacts while preserving high-frequency details, improving segmentation boundary accuracy. Our proposed GMLN-BTS model achieves a Dice score of 85.1% on the BraTS2017 dataset with only 4.58 million parameters, representing a 98% reduction compared to mainstream 3D Transformer models, and significantly outperforms existing lightweight approaches. This work demonstrates a synergistic breakthrough in achieving high-accuracy, resource-efficient brain tumor segmentation suitable for deployment in resource-constrained clinical environments.
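
For reference, the Dice score behind the reported 85.1% is twice the overlap between predicted and ground-truth masks divided by their total foreground, as in this minimal NumPy version:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Toy usage: two independent random masks score ~0.5
a = np.random.rand(64, 64, 64) > 0.5
b = np.random.rand(64, 64, 64) > 0.5
print(dice_score(a, b))
```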

[449] Latent Space Consistency for Sparse-View CT Reconstruction

Duoyou Chen, Yunqing Chen, Can Zhang, Zhou Wang, Cheng Chen, Ruoxiu Xiao

Main category: eess.IV

TL;DR: The paper proposes CLS-DM, a model for 3D CT reconstruction from sparse 2D X-ray images, addressing latent space misalignment and outperforming existing methods.

DetailsMotivation: CT imaging faces challenges like high time and radiation costs. Sparse-view reconstruction offers a solution, but latent space misalignment hinders performance.

Method: The CLS-DM model uses cross-modal feature contrastive learning to align 2D X-ray and 3D CT latent spaces for effective reconstruction.

Result: CLS-DM achieves superior performance in voxel-level metrics (PSNR, SSIM) on LIDC-IDRI and CTSpine1K datasets.

Conclusion: CLS-DM improves sparse X-ray CT reconstruction and can be generalized to other cross-modal tasks, with code made available for further research.

Abstract: Computed Tomography (CT) is a widely utilized imaging modality in clinical settings. Using densely acquired rotational X-ray arrays, CT can capture 3D spatial features. However, it is confronted with challenges such as significant time consumption and high radiation exposure. CT reconstruction methods based on sparse-view X-ray images have garnered substantial attention from researchers as they present a means to mitigate costs and risks. In recent years, diffusion models, particularly the Latent Diffusion Model (LDM), have demonstrated promising potential in the domain of 3D CT reconstruction. Nonetheless, due to the substantial differences between the 2D latent representation of X-ray modalities and the 3D latent representation of CT modalities, the vanilla LDM is incapable of achieving effective alignment within the latent space. To address this issue, we propose the Consistent Latent Space Diffusion Model (CLS-DM), which incorporates cross-modal feature contrastive learning to efficiently extract latent 3D information from 2D X-ray images and achieve latent space alignment between modalities. Experimental results indicate that CLS-DM outperforms classical and state-of-the-art generative models in terms of standard voxel-level metrics (PSNR, SSIM) on the LIDC-IDRI and CTSpine1K datasets. This methodology not only aids in enhancing the effectiveness and economic viability of sparse-view X-ray CT reconstruction but can also be generalized to other cross-modal transformation tasks, such as text-to-image synthesis. We have made our code publicly available at https://anonymous.4open.science/r/CLS-DM-50D6/ to facilitate further research and applications in other domains.
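
The cited voxel-level metrics can be computed with scikit-image as below; the volumes are synthetic stand-ins, and this is not the paper's evaluation protocol.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt = np.random.rand(32, 32, 32)  # stand-in "ground-truth" CT volume
recon = np.clip(gt + 0.05 * np.random.randn(*gt.shape), 0, 1)  # noisy recon

print("PSNR:", peak_signal_noise_ratio(gt, recon, data_range=1.0))
print("SSIM:", structural_similarity(gt, recon, data_range=1.0))
```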

[450] 3D Magnetic Inverse Routine for Single-Segment Magnetic Field Images

J. Senthilnath, Chen Hao, F. C. Wellstood

Main category: eess.IV

TL;DR: The paper introduces 3D MIR, a method combining deep learning and physics-driven optimization to accurately recover 3D current flow parameters in semiconductor packaging using magnetic field images.

DetailsMotivation: Accurate 3D information recovery is essential for non-destructive testing to locate circuit defects in semiconductor packaging.

Method: 3D MIR uses a CNN for initial predictions, spatial-physics constraints for parameter estimates, and optimization to refine parameters.

Result: The method achieves high precision in 3D information recovery, setting a new benchmark for magnetic image reconstruction.

Conclusion: Combining DL and physics-driven optimization shows great potential for practical applications in semiconductor testing.

Abstract: In semiconductor packaging, accurately recovering 3D information is crucial for non-destructive testing (NDT) to localize circuit defects. This paper presents a novel approach called the 3D Magnetic Inverse Routine (3D MIR), which leverages Magnetic Field Images (MFI) to retrieve the parameters for the 3D current flow of a single segment. The 3D MIR integrates a deep learning (DL)-based Convolutional Neural Network (CNN), spatial-physics-based constraints, and optimization techniques. The method operates in three stages: i) The CNN model processes the MFI data to predict the ratio $\ell/z_o$, where $\ell$ is the wire length and $z_o$ is the wire's vertical depth beneath the magnetic sensors, and to classify the segment type ($c$). ii) By leveraging spatial-physics-based constraints, the routine provides initial estimates for the position ($x_o$, $y_o$, $z_o$), length ($\ell$), current ($I$), and current flow direction (positive or negative) of the current segment. iii) An optimizer then adjusts these five parameters ($x_o$, $y_o$, $z_o$, $\ell$, $I$) to minimize the difference between the reconstructed MFI and the actual MFI. The results demonstrate that the 3D MIR method accurately recovers 3D information with high precision, setting a new benchmark for magnetic image reconstruction in semiconductor packaging. This method highlights the potential of combining DL and physics-driven optimization in practical applications.
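
Stage iii) is a nonlinear least-squares refinement. The sketch below fits the five segment parameters to a measured field map with SciPy; the forward model is a crude dipole-like placeholder (the paper's Biot-Savart forward model is not reproduced here), so in this toy the current and length enter only through their product and are not separately identifiable.

```python
import numpy as np
from scipy.optimize import least_squares

# 16x16 grid of sensor positions in the imaging plane
GRID = np.stack(np.meshgrid(np.linspace(-1, 1, 16),
                            np.linspace(-1, 1, 16)), axis=-1)

def forward_mfi(params):
    """Placeholder field model: Bz-like map from a segment at (x0, y0, z0)."""
    x0, y0, z0, length, current = params
    r2 = (GRID[..., 0] - x0) ** 2 + (GRID[..., 1] - y0) ** 2 + z0 ** 2
    return current * length * z0 / r2 ** 1.5

true_params = np.array([0.1, -0.2, 0.3, 0.5, 1.0])
measured = forward_mfi(true_params)  # pretend this is the recorded MFI

def residuals(params):
    return (forward_mfi(params) - measured).ravel()

init = np.array([0.0, 0.0, 0.25, 0.4, 0.8])  # stage i)/ii) initial estimates
fit = least_squares(residuals, init)
print(fit.x)  # reproduces the field; current*length recovered only as a product
```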

[451] HANS-Net: Hyperbolic Convolution and Adaptive Temporal Attention for Accurate and Generalizable Liver and Tumor Segmentation in CT Imaging

Arefin Ittesafun Abian, Ripon Kumar Debnath, Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Asif Karim, Reem E. Mohamed, Sami Azam

Main category: eess.IV

TL;DR: HANS-Net is a novel framework for liver and tumor segmentation on CT images, combining hyperbolic convolutions, wavelet-inspired decomposition, synaptic plasticity, and implicit neural representation. It achieves high accuracy and generalization across datasets.

DetailsMotivation: Accurate liver and tumor segmentation is crucial for diagnosis and treatment but is challenging due to anatomical complexity, tumor variability, and limited annotated data.

Method: HANS-Net integrates hyperbolic convolutions, wavelet-inspired decomposition, synaptic plasticity, implicit neural representation, uncertainty-aware Monte Carlo dropout, and lightweight temporal attention.

Result: Achieves mean Dice score of 93.26%, IoU of 88.09%, ASSD of 0.72 mm, and VOE of 11.91% on LiTS dataset, with strong cross-dataset performance.

Conclusion: HANS-Net is effective and robust for accurate, anatomically consistent, and confident liver and tumor segmentation.

Abstract: Accurate liver and tumor segmentation on abdominal CT images is critical for reliable diagnosis and treatment planning, but remains challenging due to complex anatomical structures, variability in tumor appearance, and limited annotated data. To address these issues, we introduce the Hyperbolic-convolutions Adaptive-temporal-attention with Neural-representation and Synaptic-plasticity Network (HANS-Net), a novel segmentation framework that synergistically combines hyperbolic convolutions for hierarchical geometric representation, a wavelet-inspired decomposition module for multi-scale texture learning, a biologically motivated synaptic plasticity mechanism for adaptive feature enhancement, and an implicit neural representation branch to model fine-grained and continuous anatomical boundaries. Additionally, we incorporate uncertainty-aware Monte Carlo dropout to quantify prediction confidence and lightweight temporal attention to improve inter-slice consistency without sacrificing efficiency. Extensive evaluations on the LiTS dataset demonstrate that HANS-Net achieves a mean Dice score of 93.26%, an IoU of 88.09%, an average symmetric surface distance (ASSD) of 0.72 mm, and a volume overlap error (VOE) of 11.91%. Furthermore, cross-dataset validation on the 3D-IRCADb-01 dataset obtains an average Dice of 87.45%, an IoU of 80.30%, an ASSD of 1.525 mm, and a VOE of 19.71%, indicating strong generalization across different datasets. These results confirm the effectiveness and robustness of HANS-Net in providing anatomically consistent, accurate, and confident liver and tumor segmentation.
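
The uncertainty-aware Monte Carlo dropout component has a well-known generic form: keep dropout sampling at inference, average several stochastic forward passes, and read the per-voxel variance as a confidence map. The tiny 3D model below is a stand-in, not HANS-Net.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv3d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Dropout3d(p=0.2),
    torch.nn.Conv3d(8, 1, 3, padding=1), torch.nn.Sigmoid(),
)

x = torch.randn(1, 1, 16, 16, 16)  # toy single-channel CT patch

model.train()  # keeps Dropout3d stochastic at test time (no BatchNorm here)
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(20)])

mean_pred = samples.mean(dim=0)   # segmentation probability map
uncertainty = samples.var(dim=0)  # high variance = low confidence
print(mean_pred.shape, uncertainty.max().item())
```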

Last updated: 2025-08-22