Daily arXiv Papers - 2025-08-11

Summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] PEACH: A sentence-aligned Parallel English-Arabic Corpus for Healthcare

Rania Al-Sabbagh

Main category: cs.CL

TL;DR: PEACH is a manually aligned English-Arabic healthcare corpus with 51,671 parallel sentences, useful for linguistics, translation, and NLP tasks.

Motivation: To provide a high-quality parallel corpus for healthcare texts to support research and applications in contrastive linguistics, translation studies, and NLP.

Method: Creation of a manually aligned corpus (PEACH) with 51,671 parallel sentences from patient information leaflets and educational materials.

Result: PEACH contains 590,517 English and 567,707 Arabic word tokens, with average sentence lengths of 9.52 to 11.83 words.

Conclusion: PEACH is a valuable, publicly accessible resource for bilingual lexicons, machine translation, readability assessment, and education in translation studies.

Abstract: This paper introduces PEACH, a sentence-aligned parallel English-Arabic corpus of healthcare texts encompassing patient information leaflets and educational materials. The corpus contains 51,671 parallel sentences, totaling approximately 590,517 English and 567,707 Arabic word tokens. Sentence lengths vary between 9.52 and 11.83 words on average. As a manually aligned corpus, PEACH is a gold-standard corpus, aiding researchers in contrastive linguistics, translation studies, and natural language processing. It can be used to derive bilingual lexicons, adapt large language models for domain-specific machine translation, evaluate user perceptions of machine translation in healthcare, assess patient information leaflets and educational materials’ readability and lay-friendliness, and as an educational resource in translation studies. PEACH is publicly accessible.

[2] Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation

Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, Zhuo Lu

Main category: cs.CL

TL;DR: The paper reviews the dual role of LLMs in beneficial applications and harmful content generation, proposing a taxonomy of harms and defenses, and assessing mitigation techniques like RLHF and prompt engineering.

Motivation: To address the sociotechnical challenge posed by LLMs, which offer powerful benefits but also risks of toxic or biased content.

Method: Systematic review of recent studies on LLM-related harms (toxicity, jailbreaking) and defenses (RLHF, prompt engineering, safety alignment).

Result: A unified taxonomy of LLM harms and defenses, analysis of emerging threats, and evaluation of mitigation efforts.

Conclusion: The paper highlights the evolving landscape of LLM safety, identifies current limitations, and suggests future research directions for ethical alignment.

Abstract: Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.

[3] FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Xiangyan Chen, Yufeng Li, Yujian Gan, Arkaitz Zubiaga, Matthew Purver

Main category: cs.CL

TL;DR: The paper introduces FineDialFact, a benchmark for fine-grained dialogue fact verification, addressing the challenge of detecting hallucinations in LLM-generated responses. It evaluates baseline methods, showing CoT reasoning improves performance, but the task remains difficult (best F1-score: 0.75).

Motivation: LLMs often produce factually incorrect or fabricated information (hallucinations), posing challenges for NLP applications like dialogue systems. Current detection methods are too coarse-grained.

Method: The authors create FineDialFact, a benchmark for verifying atomic facts in dialogue responses, using a dataset from public dialogue sources. They evaluate baseline methods, including those with Chain-of-Thought reasoning.

Result: CoT reasoning improves performance, but the best F1-score is only 0.75 on HybriDialogue, indicating the task’s difficulty.

Conclusion: FineDialFact is a challenging benchmark for future research. The dataset and code will be publicly available.

Abstract: Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which pose significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or unverifiable facts, making one factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate various baseline methods on it. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be made public on GitHub.

[4] Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

Abishek Thamma, Micha Heilbron

Main category: cs.CL

TL;DR: Fleeting memory improves language learning in transformers but impairs prediction of human reading times, challenging prior assumptions.

Motivation: To investigate whether fleeting memory benefits language learning, as suggested by cognitive science, and how it impacts transformer models.

Method: Training transformers with and without fleeting memory on a developmentally realistic dataset, evaluating language modeling and human reading time prediction.

Result: Fleeting memory improved language modeling but worsened surprisal-based reading time prediction, contradicting prior explanations.

Conclusion: Memory limitations aid neural network language learning but not behavioral prediction, highlighting a nuanced role of memory in models.

Abstract: Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow-up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading time worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.

[5] “Mirror” Language AI Models of Depression are Criterion-Contaminated

Tong Li, Rasiq Hussain, Mehak Gupta, Joshua R. Oltmanns

Main category: cs.CL

TL;DR: The study compares Mirror and Non-Mirror LLM models for predicting depression scores, finding Mirror models inflate effect sizes due to criterion contamination, while Non-Mirror models offer more generalizable insights.

Motivation: To address the issue of criterion contamination in Mirror models, which artificially inflates effect sizes and reduces generalizability in depression prediction.

Method: Used GPT-4, GPT-4o, and LLaMA3-70B to predict depression scores from structured diagnostic (Mirror) and life history (Non-Mirror) interviews, comparing their performance and correlations with self-reported symptoms.

Result: Mirror models showed inflated effect sizes (R² = .80), while Non-Mirror models had smaller but still significant effect sizes (R² = .27). Both correlated similarly with self-reported symptoms (r ≈ .54).

Conclusion: Non-Mirror models provide more generalizable and interpretable features for real-world depression assessment, avoiding the bias of Mirror models.

Abstract: A growing number of studies show near-perfect LLM language-based prediction of depression assessment scores (up to R² of .70). However, many develop these models directly from language responses to depression assessments. These “Mirror models” suffer from “criterion contamination”, which arises when a predicted score depends in part on the predictors themselves. This causes artificial effect size inflation which reduces model generalizability. The present study compares the performance of Mirror models versus “Non-Mirror models”, which are developed from language that does not mirror the assessment they are developed to predict. N = 110 research participants completed two different interviews: structured diagnostic and life history interviews. GPT-4, GPT-4o and LLaMA3-70B were then prompted to predict structured diagnostic interview depression scores from the two transcripts separately. Mirror models (using structured diagnostic data) showed very large effect sizes (e.g., R² = .80). As expected, Non-Mirror models (using life history data) demonstrated smaller effect sizes, but were relatively large (e.g., R² = .27). When Mirror and Non-Mirror model-predicted structured interview depression scores were correlated with self-reported depression symptoms, Mirror and Non-Mirror performed the same (e.g., r = ~.54), indicating that Mirror models contain bias perhaps due to criterion contamination. Topic modeling identified clusters across Mirror and Non-Mirror models, as well as between true-positive and false-positive predictions. In this head-to-head comparison study, Mirror language AI models of depression showed artificially inflated effect sizes and less generalizability. As language AI models for depression continue to evolve, incorporating Non-Mirror models may identify interpretable and generalizable semantic features that have unique utility in real-world psychological assessment.

[6] Discovering Properties of Inflectional Morphology in Neural Emergent Communication

Miles Gilberti, Shane Storks, Huteng Dai

Main category: cs.CL

TL;DR: The paper reinterprets emergent communication (EmCom) by introducing a small-vocabulary constraint to simulate double articulation and explores inflectional morphology, revealing insights into natural language tendencies.

Motivation: To address the limitations of current EmCom research, which focuses on narrow goals and metrics, by simulating naturalistic inflectional morphology for meaningful comparisons to human language.

Method: Reinterpret the attribute-value reconstruction game with a small-vocabulary constraint, introduce new metrics, and explore variations inspired by inflectional morphology (concatenativity and fusionality).

Result: Simulated phonological constraints promote concatenative morphology, and emergent languages mimic natural languages’ tendency to fuse grammatical attributes.

Conclusion: The study advances EmCom by aligning it more closely with natural language properties, offering new metrics and insights into language emergence.

Abstract: Emergent communication (EmCom) with deep neural network-based agents promises to yield insights into the nature of human language, but remains focused primarily on a few subfield-specific goals and metrics that prioritize communication schemes which represent attributes with unique characters one-to-one and compose them syntactically. We thus reinterpret a common EmCom setting, the attribute-value reconstruction game, by imposing a small-vocabulary constraint to simulate double articulation, and formulating a novel setting analogous to naturalistic inflectional morphology (enabling meaningful comparison to natural language communication schemes). We develop new metrics and explore variations of this game motivated by real properties of inflectional morphology: concatenativity and fusionality. Through our experiments, we discover that simulated phonological constraints encourage concatenative morphology, and emergent languages replicate the tendency of natural languages to fuse grammatical attributes.

[7] Large Language Model Data Generation for Enhanced Intent Recognition in German Speech

Theresa Pekarek Rosin, Burak Can Kaplan, Stefan Wermter

Main category: cs.CL

TL;DR: The paper proposes a novel intent recognition (IR) approach for elderly German speakers, combining a fine-tuned Whisper ASR model with Transformer-based language models trained on synthetic data from LLMs (LeoLM, Llama3, ChatGPT). Synthetic data boosts performance, with LeoLM outperforming ChatGPT.

Motivation: Address limitations of existing IR systems, which are limited to short commands and English, by focusing on elderly German speakers.

Method: Combine fine-tuned Whisper ASR with Transformer-based models trained on synthetic text from LLMs (LeoLM, Llama3, ChatGPT). Evaluate robustness using synthetic speech and cross-dataset testing.

Result: Synthetic LLM-generated data improves classification and robustness. LeoLM (13B) outperforms ChatGPT (175B) for German IR.

Conclusion: Generative AI can bridge data gaps in low-resource domains. The approach is transparent and reproducible.

Abstract: Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems; however, most existing approaches are limited to short commands and are predominantly developed for English. This paper addresses these limitations by focusing on IR from speech by elderly German speakers. We propose a novel approach that combines an adapted Whisper ASR model, fine-tuned on elderly German speech (SVC-de), with Transformer-based language models trained on synthetic text datasets generated by three well-known large language models (LLMs): LeoLM, Llama3, and ChatGPT. To evaluate the robustness of our approach, we generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing. Our results show that synthetic LLM-generated data significantly boosts classification performance and robustness to different speaking styles and unseen vocabulary. Notably, we find that LeoLM, a smaller, domain-specific 13B LLM, surpasses the much larger ChatGPT (175B) in dataset quality for German intent recognition. Our approach demonstrates that generative AI can effectively bridge data gaps in low-resource domains. We provide detailed documentation of our data generation and training process to ensure transparency and reproducibility.
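
Code sketch: a minimal illustration of the data-generation step, prompting an instruction-tuned LLM (LeoLM, Llama3, or ChatGPT in the paper) for synthetic German utterances per intent class. The prompt wording and the commented client call are assumptions, not the authors' exact setup.

```python
# Hypothetical prompt builder for synthetic German intent-recognition data.
def build_generation_prompt(intent: str, n: int = 20) -> str:
    """Ask an LLM for n varied German utterances expressing one intent."""
    return (
        f"Erzeuge {n} verschiedene Äußerungen, mit denen eine ältere "
        f"deutschsprachige Person die Absicht '{intent}' ausdrücken könnte. "
        "Variiere Wortwahl, Satzlänge und Höflichkeitsform. "
        "Gib eine Äußerung pro Zeile aus."
    )

# Usage with any chat-completion style client (placeholder call):
# reply = llm_client.complete(build_generation_prompt("Licht einschalten"))
# utterances = [line.strip() for line in reply.splitlines() if line.strip()]
```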

[8] Do Machines Think Emotionally? Cognitive Appraisal Analysis of Large Language Models

Sree Bhattacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, James Z. Wang

Main category: cs.CL

TL;DR: The paper investigates how LLMs reason about emotions using cognitive appraisal theory, introducing the CoRE benchmark to evaluate their implicit cognitive structures for emotional reasoning.

Motivation: To move beyond superficial emotion tasks and explore deeper cognitive reasoning in LLMs for emotionally charged stimuli.

Method: Introduces the CoRE benchmark, evaluates LLMs using cognitive appraisal theory, and conducts experiments to analyze reasoning patterns.

Result: Reveals diverse reasoning patterns across LLMs and identifies key cognitive dimensions for specific emotions.

Conclusion: The study provides insights into LLMs’ cognitive reasoning about emotions and releases the CoRE benchmark for public use.

Abstract: Affective Computing has been established as a crucial field of inquiry to advance the holistic development of Artificial Intelligence (AI) systems. Foundation models – especially Large Language Models (LLMs) – have been evaluated, trained, or instruction-tuned in several past works, to become better predictors or generators of emotion. Most of these studies, however, approach emotion-related tasks in a supervised manner, assessing or training the capabilities of LLMs using discrete emotion labels associated with stimuli (e.g., text, images, video, audio). Evaluation studies, in particular, have often been limited to standard and superficial emotion-related tasks, such as the recognition of evoked or expressed emotions. In this paper, we move beyond surface-level emotion tasks to investigate how LLMs reason about emotions through cognitive dimensions. Drawing from cognitive appraisal theory, we examine whether LLMs produce coherent and plausible cognitive reasoning when reasoning about emotionally charged stimuli. We introduce a large-scale benchmark on Cognitive Reasoning for Emotions - CoRE - to evaluate internal cognitive structures implicitly used by LLMs for emotional reasoning. Through a plethora of evaluation experiments and analysis, we seek to answer: (a) Are models more likely to implicitly rely on specific cognitive appraisal dimensions?, (b) What cognitive dimensions are important for characterizing specific emotions?, and, (c) Can the internal representations of different emotion categories in LLMs be interpreted through cognitive appraisal dimensions? Our results and analyses reveal diverse reasoning patterns across different LLMs. Our benchmark and code will be made publicly available.

[9] Spectrum Projection Score: Aligning Retrieved Summaries with Reader Models in Retrieval-Augmented Generation

Zhanghao Hu, Qinglin Zhu, Siya Qi, Yulan He, Hanqi Yan, Lin Gui

Main category: cs.CL

TL;DR: The paper introduces Spectrum Projection Score (SPS) to measure retrieval relevance in RAG systems and presents xCompress for dynamic summary compression, improving performance in QA tasks.

Motivation: Prior work evaluates RAG holistically, making it hard to isolate retrieval's contribution due to LLM prompt sensitivity.

Method: Introduces SPS, a metric for semantic alignment of retrieved summaries, and xCompress, a framework for dynamic summary compression.

Result: Experiments on QA benchmarks show SPS enhances performance and provides insights into retrieval-generation interaction.

Conclusion: SPS and xCompress improve RAG systems by better measuring retrieval relevance and optimizing summary usage.

Abstract: Large Language Models (LLMs) have shown improved generation performance through retrieval-augmented generation (RAG) following the retriever-reader paradigm, which supplements model inputs with externally retrieved knowledge. However, prior work often evaluates RAG holistically, assessing the retriever and reader jointly, making it difficult to isolate the true contribution of retrieval, particularly given the prompt sensitivity of LLMs used as readers. We introduce the Spectrum Projection Score (SPS), a lightweight, supervision-free metric that lets the reader gauge the semantic alignment of a retrieved summary with its hidden representation by comparing the area formed by the summary’s generated tokens against the principal directions of the reader’s representation subspace, thereby measuring relevance. Building on SPS, we present xCompress, an inference-time controller framework that dynamically samples, ranks, and compresses retrieval summary candidates. Extensive experiments on five QA benchmarks with four open-source LLMs show that SPS not only enhances performance across a range of tasks but also provides a principled perspective on the interaction between retrieval and generation.
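
Code sketch: a toy numpy version of the idea behind SPS: project a retrieved summary's token representations onto the principal directions of the reader's hidden-state subspace and measure how much of their norm the subspace captures. The paper's actual score (an area over generated tokens) differs; this shows only the projection intuition.

```python
import numpy as np

def toy_spectrum_projection_score(reader_states, summary_states, k=8):
    """reader_states: (N, d) reader hidden states; summary_states: (M, d)."""
    # Top-k principal directions of the reader subspace via SVD.
    _, _, vt = np.linalg.svd(reader_states - reader_states.mean(0),
                             full_matrices=False)
    basis = vt[:k]                            # (k, d)
    proj = summary_states @ basis.T           # coordinates in the subspace
    # Fraction of each token's norm captured by the subspace, averaged.
    ratios = np.linalg.norm(proj, axis=1) / np.linalg.norm(summary_states, axis=1)
    return float(ratios.mean())

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(128, 8)) @ rng.normal(size=(8, 64))  # rank-8 states
aligned = low_rank[:16] + 0.05 * rng.normal(size=(16, 64))
unrelated = rng.normal(size=(16, 64))
print(toy_spectrum_projection_score(low_rank, aligned))    # high: good alignment
print(toy_spectrum_projection_score(low_rank, unrelated))  # lower: poor alignment
```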

[10] Prosocial Behavior Detection in Player Game Chat: From Aligning Human-AI Definitions to Efficient Annotation at Scale

Rafal Kocielnik, Min Kim, Penphob Boonyarungsrit, Fereshteh Soltani, Deshawn Sambrano, Animashree Anandkumar, R. Michael Alvarez

Main category: cs.CL

TL;DR: A three-stage pipeline for scalable, high-precision prosocial content classification, combining LLM-based labeling, human-AI refinement, and cost-efficient inference.

Motivation: Prosociality detection is a novel challenge lacking definitions and labeled data, requiring innovative approaches for trust and safety systems.

Method: A pipeline with LLM-based labeling, human-AI refinement, and a two-stage inference system (lightweight classifier + GPT-4o for ambiguous cases).

Result: Achieves high precision (~0.90) with ~70% cost reduction by escalating only ~35% of ambiguous cases to GPT-4o.

Conclusion: Targeted human-AI interaction and deployment-aware design enable scalable solutions for novel responsible AI tasks.

Abstract: Detecting prosociality in text (communication intended to affirm, support, or improve others’ behavior) is a novel and increasingly important challenge for trust and safety systems. Unlike toxic content detection, prosociality lacks well-established definitions and labeled data, requiring new approaches to both annotation and deployment. We present a practical, three-stage pipeline that enables scalable, high-precision prosocial content classification while minimizing human labeling effort and inference costs. First, we identify the best LLM-based labeling strategy using a small seed set of human-labeled examples. We then introduce a human-AI refinement loop, where annotators review high-disagreement cases between GPT-4 and humans to iteratively clarify and expand the task definition, a critical step for emerging annotation tasks like prosociality. This process results in improved label quality and definition alignment. Finally, we synthesize 10k high-quality labels using GPT-4 and train a two-stage inference system: a lightweight classifier handles high-confidence predictions, while only ~35% of ambiguous instances are escalated to GPT-4o. This architecture reduces inference costs by ~70% while achieving high precision (~0.90). Our pipeline demonstrates how targeted human-AI interaction, careful task formulation, and deployment-aware architecture design can unlock scalable solutions for novel responsible AI tasks.
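
Code sketch: the two-stage inference design in miniature: a cheap classifier handles confident cases, and only the ambiguous band (about 35% of traffic in the paper) is escalated to a stronger model (GPT-4o). The thresholds and the `ask_gpt4o` callable are illustrative placeholders.

```python
def classify_prosocial(text, light_clf, ask_gpt4o, low=0.2, high=0.8):
    """Return (is_prosocial, route). light_clf is any sklearn-style model."""
    p = light_clf.predict_proba([text])[0][1]  # P(prosocial) from the cheap model
    if p <= low:
        return False, "light"                  # confidently not prosocial
    if p >= high:
        return True, "light"                   # confidently prosocial
    return ask_gpt4o(text), "escalated"        # ambiguous band -> strong LLM
```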

[11] Adversarial Topic-aware Prompt-tuning for Cross-topic Automated Essay Scoring

Chunyun Zhang, Hongyan Zhao, Chaoran Cui, Qilong Song, Zhiqing Lu, Shuai Gong, Kailin Liu

Main category: cs.CL

TL;DR: ATOP is a novel method for cross-topic AES, combining topic-shared and topic-specific features via adversarial prompt-tuning, outperforming existing methods.

Motivation: Existing AES methods neglect topic-specific features, limiting their ability to assess critical traits like topic adherence.

Method: ATOP uses adversarial training and topic-aware prompts to learn shared and specific features from PLMs, with a neighbor-based classifier for pseudo-labeling.

Result: ATOP significantly outperforms state-of-the-art methods on the ASAP++ dataset in holistic and multi-trait scoring.

Conclusion: ATOP effectively addresses topic discrepancies in AES, enhancing performance through joint feature learning and adversarial robustness.

Abstract: Cross-topic automated essay scoring (AES) aims to develop a transferable model capable of effectively evaluating essays on a target topic. A significant challenge in this domain arises from the inherent discrepancies between topics. While existing methods predominantly focus on extracting topic-shared features through distribution alignment of source and target topics, they often neglect topic-specific features, limiting their ability to assess critical traits such as topic adherence. To address this limitation, we propose an Adversarial TOpic-aware Prompt-tuning (ATOP), a novel method that jointly learns topic-shared and topic-specific features to improve cross-topic AES. ATOP achieves this by optimizing a learnable topic-aware prompt, comprising both shared and specific components, to elicit relevant knowledge from pre-trained language models (PLMs). To enhance the robustness of topic-shared prompt learning and mitigate feature scale sensitivity introduced by topic alignment, we incorporate adversarial training within a unified regression and classification framework. In addition, we employ a neighbor-based classifier to model the local structure of essay representations and generate pseudo-labels for target-topic essays. These pseudo-labels are then used to guide the supervised learning of topic-specific prompts tailored to the target topic. Extensive experiments on the publicly available ASAP++ dataset demonstrate that ATOP significantly outperforms existing state-of-the-art methods in both holistic and multi-trait essay scoring. The implementation of our method is publicly available at: https://anonymous.4open.science/r/ATOP-A271.

[12] Scaling Personality Control in LLMs with Big Five Scaler Prompts

Gunhee Cho, Yun-Gyung Cheong

Main category: cs.CL

TL;DR: Big5-Scaler is a prompt-based framework for controlling Big Five personality traits in LLMs without extra training, showing consistent results across tasks.

Motivation: To enable fine-grained personality control in large language models (LLMs) without requiring additional training.

Method: Embedding numeric trait values into natural language prompts to condition LLMs.

Result: Induces consistent and distinguishable personality traits, with performance varying by prompt type and scale. Concise prompts and lower trait intensities are most effective.

Conclusion: Big5-Scaler provides an efficient approach for creating personality-aware dialogue agents.

Abstract: We present Big5-Scaler, a prompt-based framework for conditioning large language models (LLMs) with controllable Big Five personality traits. By embedding numeric trait values into natural language prompts, our method enables fine-grained personality control without additional training. We evaluate Big5-Scaler across trait expression, dialogue generation, and human trait imitation tasks. Results show that it induces consistent and distinguishable personality traits across models, with performance varying by prompt type and scale. Our analysis highlights the effectiveness of concise prompts and lower trait intensities, providing an efficient approach for building personality-aware dialogue agents.
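
Code sketch: an illustrative Big5-Scaler style prompt builder that embeds numeric trait values directly into a system prompt, so no extra training is needed. The 1-5 scale and the wording are assumptions; the paper's exact template may differ.

```python
BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

def big5_prompt(scores: dict) -> str:
    """Render numeric Big Five trait values into a conditioning prompt."""
    traits = ", ".join(f"{t}: {scores[t]}/5" for t in BIG_FIVE)
    return (
        "You are a dialogue agent with the following Big Five personality "
        f"profile ({traits}). Let these trait intensities shape the tone, "
        "word choice, and content of every reply."
    )

print(big5_prompt({"openness": 5, "conscientiousness": 2, "extraversion": 4,
                   "agreeableness": 3, "neuroticism": 1}))
```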

[13] Crisp Attention: Regularizing Transformers via Structured Sparsity

Sagar Gandhi, Vishal Gandhi

Main category: cs.CL

TL;DR: Introducing structured sparsity in attention during fine-tuning improves DistilBERT’s accuracy on SST-2, challenging the assumption that sparsity harms model performance.

Motivation: Address the quadratic cost of self-attention in Transformers and explore whether sparsity can improve accuracy, not just efficiency.

Method: Applied structured, post-hoc sparsity to DistilBERT’s attention during fine-tuning on SST-2 sentiment analysis.

Result: 80% sparsity achieved 91.59% validation accuracy, a 0.97% improvement over the dense baseline.

Conclusion: Attention sparsity can enhance model generalization and performance, acting as an implicit regularizer.

Abstract: The quadratic computational cost of the self-attention mechanism is a primary challenge in scaling Transformer models. While attention sparsity is widely studied as a technique to improve computational efficiency, it is almost universally assumed to come at the cost of model accuracy. In this paper, we report a surprising counter-example to this common wisdom. By introducing structured, post-hoc sparsity to the attention mechanism of a DistilBERT model during fine-tuning on the SST-2 sentiment analysis task, we find that model accuracy improves significantly. Our model with 80% attention sparsity achieves a validation accuracy of 91.59%, a 0.97% absolute improvement over the dense baseline. We hypothesize that this phenomenon is due to sparsity acting as a powerful implicit regularizer, preventing the model from overfitting by forcing it to make predictions with a more constrained and robust set of features. Our work recasts attention sparsity not just as a tool for computational efficiency, but as a potential method for improving the generalization and performance of Transformer models.
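
Code sketch: the masking idea behind the paper's structured attention sparsity, in PyTorch: keep only the top-k attention logits per query (k = 20% of keys yields the reported 80% sparsity) and zero the rest through the softmax. How this is wired into DistilBERT during fine-tuning is not shown here.

```python
import torch

def sparse_attention_weights(scores: torch.Tensor, keep_frac: float = 0.2):
    """scores: (..., q_len, k_len) raw attention logits."""
    k = max(1, int(scores.size(-1) * keep_frac))
    kth = scores.topk(k, dim=-1).values[..., -1:]         # k-th largest per query
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(masked, dim=-1)                  # pruned weights are exactly 0

attn = sparse_attention_weights(torch.randn(2, 8, 16, 16))  # batch, heads, q, k
print((attn == 0).float().mean())                           # ~0.81 with k_len=16
```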

[14] ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Main category: cs.CL

TL;DR: ProsodyLM improves speech language models by introducing a tokenization scheme that better captures prosody, enabling diverse prosody processing capabilities through pre-training.

Motivation: Existing speech language models struggle to capture the interdependency between content and prosody due to sub-optimal tokenization methods.

Method: ProsodyLM transcribes speech into text and adds word-level prosody tokens, retaining more prosody information and making it understandable to text-based LLMs.

Result: ProsodyLM learns diverse prosody processing capabilities, such as handling contrastive focus, understanding emotion and stress, and maintaining prosody consistency in long contexts.

Conclusion: ProsodyLM demonstrates that a simple tokenization scheme can significantly enhance prosody learning in speech language models.

Abstract: Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody. The existing mainstream paradigm of training speech language models, which converts speech into discrete tokens before feeding them into LLMs, is sub-optimal in learning prosody information – we find that the resulting LLMs do not exhibit obvious emerging prosody processing capabilities via pre-training alone. To overcome this, we propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. Each speech utterance is first transcribed into text, followed by a sequence of word-level prosody tokens. Compared with conventional speech tokenization schemes, the proposed tokenization scheme retains more complete prosody information, and is more understandable to text-based LLMs. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone, ranging from harnessing the prosody nuances in generated speech, such as contrastive focus, understanding emotion and stress in an utterance, to maintaining prosody consistency in long contexts.
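
Code sketch: what a ProsodyLM-style token stream might look like: each transcribed word followed by word-level prosody tokens. The actual token inventory and binning are not specified in this summary; the pitch/duration/energy bins below are assumptions used only to show the format.

```python
def to_prosody_stream(words):
    """words: list of (word, pitch_bin, duration_bin, energy_bin) tuples."""
    stream = []
    for w, p, d, e in words:
        stream += [w, f"<pitch_{p}>", f"<dur_{d}>", f"<energy_{e}>"]
    return " ".join(stream)

print(to_prosody_stream([("I", 2, 1, 1), ("didn't", 4, 2, 3), ("say", 2, 1, 2)]))
# I <pitch_2> <dur_1> <energy_1> didn't <pitch_4> <dur_2> <energy_3> say ...
```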

[15] Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang

Main category: cs.CL

TL;DR: The paper introduces Temporal Self-Rewarding Language Models to address the limitation of synchronized improvement in existing Self-Rewarding paradigms, improving preference learning by coordinating past, present, and future model generations.

Motivation: Existing Self-Rewarding paradigms suffer from synchronized improvement of chosen and rejected responses, reducing representational differences and undermining preference learning.

Method: Proposes a dual-phase framework: (1) Anchored Rejection (fixing rejected responses using past model outputs) and (2) Future-Guided Chosen (dynamically curating chosen samples using next-generation model predictions).

Result: Significant improvements across model families (Llama, Qwen, Mistral) and sizes, e.g., Llama3.1-8B achieves a 29.44 win rate on AlpacaEval 2.0, outperforming the baseline by 9.75. Superior generalization is shown in mathematical reasoning, QA, and code generation tasks.

Conclusion: The proposed method effectively sustains learning signals and outperforms existing Self-Rewarding paradigms, demonstrating robust generalization without task-specific training data.

Abstract: Self-Rewarding Language Models propose an architecture in which the Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through iterative Direct Preference Optimization (DPO). However, our analysis reveals a critical limitation in existing Self-Rewarding paradigms: the synchronized improvement of chosen and rejected responses progressively narrows the representational difference between contrasting samples, undermining effective preference learning. We propose Temporal Self-Rewarding Language Models that strategically coordinate past, present, and future model generations to sustain learning signals. Our dual-phase framework introduces: (1) Anchored Rejection - fixing rejected responses using the past initial model’s outputs and (2) Future-Guided Chosen - dynamically curating chosen samples using next-generation model predictions. Extensive experiments across three model families (Llama, Qwen, Mistral) and different model sizes (Llama 3B/8B/70B) demonstrate significant improvements when trained with our method compared to Self-Rewarding using the same computation resources. For example, Llama3.1-8B reaches a 29.44 win rate on AlpacaEval 2.0 with our method, outperforming the Self-Rewarding baseline (19.69) by 9.75. Notably, our method also demonstrates superior out-of-distribution generalization across mathematical reasoning (GSM8K), knowledge-based QA (ARC, TruthfulQA), and code generation (HumanEval) tasks, even though we do not specifically collect such training data.
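
Code sketch: the dual-phase pair construction in schematic Python: rejected responses are anchored to the past (initial) model, and chosen responses are curated with a next-generation model as the guide. All objects (`generate`, `judge`) are placeholders; the resulting pairs would feed standard DPO training.

```python
def build_temporal_dpo_pairs(prompts, past_model, current_model,
                             future_model, judge):
    """Construct preference pairs per the Anchored Rejection /
    Future-Guided Chosen scheme (hedged sketch, not the exact recipe)."""
    pairs = []
    for x in prompts:
        rejected = past_model.generate(x)            # Anchored Rejection: fixed
        candidates = current_model.generate(x, n=4)  # present model's samples
        # Future-Guided Chosen: keep the candidate the next-generation
        # model scores highest under LLM-as-a-Judge prompting.
        chosen = max(candidates, key=lambda y: judge(future_model, x, y))
        pairs.append({"prompt": x, "chosen": chosen, "rejected": rejected})
    return pairs  # then optimize with DPO on these pairs
```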

[16] Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: The paper introduces Memp, a method to enhance LLM-based agents with a learnable, updatable procedural memory, improving task success and efficiency.

Motivation: LLM-based agents lack robust procedural memory, which is often manually engineered or static, limiting adaptability and performance.

Method: Memp distills past agent trajectories into fine-grained instructions and higher-level abstractions, with strategies for Build, Retrieval, and Update. A dynamic regimen continuously refines the memory.

Result: Agents with refined memory achieve higher success rates and efficiency. Memory from stronger models boosts weaker models’ performance.

Conclusion: Memp demonstrates the value of dynamic procedural memory for enhancing LLM-based agents’ adaptability and task performance.

Abstract: Large Language Model (LLM)-based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.
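
Code sketch: a minimal Build / Retrieve / Update store of the kind the paper explores for procedural memory. The distillation call, similarity function, and deprecation rule are illustrative assumptions, not Memp's exact strategies.

```python
class ProceduralMemory:
    def __init__(self):
        # Each entry: distilled script plus usage statistics.
        self.entries = []

    def build(self, task, trajectory, distill):
        """Distill a finished trajectory into step-wise instructions/script."""
        self.entries.append({"task": task, "script": distill(trajectory),
                             "wins": 0, "uses": 0})

    def retrieve(self, task, similarity):
        """Return the stored procedure for the most similar past task."""
        return max(self.entries,
                   key=lambda e: similarity(task, e["task"]), default=None)

    def update(self, entry, success, min_rate=0.3):
        """Reinforce useful procedures; deprecate persistently failing ones."""
        entry["uses"] += 1
        entry["wins"] += int(success)
        if entry["uses"] >= 5 and entry["wins"] / entry["uses"] < min_rate:
            self.entries.remove(entry)
```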

[17] Efficient Knowledge Probing of Large Language Models by Adapting Pre-trained Embeddings

Kartik Sharma, Yiqiao Jin, Rakshit Trivedi, Srijan Kumar

Main category: cs.CL

TL;DR: The paper introduces PEEK, a method using proxy embeddings to estimate LLM knowledge without costly forward passes, achieving up to 90% accuracy.

Motivation: LLMs acquire vast knowledge, but probing their understanding is computationally expensive. PEEK aims to address this by leveraging pre-trained embeddings as proxies.

Method: PEEK identifies facts known by LLMs, adapts embedding models to predict LLM outputs, and evaluates performance on Wikipedia datasets and multiple LLMs.

Result: Embeddings predict LLM knowledge with up to 90% accuracy, with sentence embeddings outperforming graph embeddings.

Conclusion: PEEK offers a scalable way to identify LLM knowledge gaps and insights into their inductive biases, with code and data publicly available.

Abstract: Large language models (LLMs) acquire knowledge across diverse domains such as science, history, and geography encountered during generative pre-training. However, due to their stochasticity, it is difficult to predict what LLMs have acquired. Prior work has developed different ways to probe this knowledge by investigating the hidden representations, crafting specific task prompts, curating representative samples, and estimating their uncertainty. However, these methods require making forward passes through the underlying model to probe the LLM’s knowledge about a specific fact, making them computationally expensive and time-consuming. To bridge this gap, we propose PEEK, or Proxy Embeddings to Estimate Knowledge of LLMs, by leveraging pre-trained embedding models that effectively encode factual knowledge as text or graphs as proxies for LLMs. First, we identify a training set of facts known by LLMs through various probing strategies and then adapt embedding models to predict the LLM outputs with a linear decoder layer. Comprehensive evaluation on 3 Wikipedia-derived datasets, 4 LLMs, and 7 embedding models shows that embeddings can predict LLM knowledge on a held-out set with up to 90% accuracy. Furthermore, we find that sentence embedding models are more suitable than graph embeddings to predict LLM knowledge, shedding light on the underlying representation of the factual landscape. Thus, we believe that knowledge-adapted embeddings can be used to identify knowledge gaps in LLMs at scale and can provide deeper insights into LLMs’ internal inductive bias. The code and data are made available at https://github.com/claws-lab/peek.
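
Code sketch: PEEK's core step reduced to a few lines: fit a linear decoder on frozen fact embeddings to predict whether the target LLM knows each fact. The embedding source and 0/1 knowledge labels are stand-ins; the full pipeline is in the linked repository.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_knowledge_probe(fact_embeddings, llm_knows):
    """fact_embeddings: (N, d) array; llm_knows: (N,) 0/1 labels from probing."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        fact_embeddings, llm_knows, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)  # held-out accuracy (up to ~90% reported)
```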

[18] EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Xinda Wang, Zhengxu Hou, Yangshijie Zhang, Bingren Yan, Zhibo Yang, Xingsheng Zhang, Luxi Xing, Qiang Zhou, Chen Zhang

Main category: cs.CL

TL;DR: The paper proposes the EvolvR framework to improve LLM-based story evaluation by using self-synthesized, score-aligned Chain-of-Thought data and multi-agent filtering, achieving SOTA results and enhancing story generation.

Motivation: Current LLM-based methods for story evaluation struggle with adaptability (closed-source models) or reasoning (open-source models), limiting their effectiveness in open-ended tasks.

Method: The EvolvR framework uses pairwise comparison, self-synthesized CoT data, and multi-agent filtering to train an evaluator, which then guides story generation.

Result: EvolvR achieves SOTA performance on benchmarks (StoryER, HANNA, OpenMEVA) and improves generated story quality as a reward model.

Conclusion: The self-evolving approach of EvolvR is superior, validating its effectiveness for story evaluation and generation enhancement.

Abstract: Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.

[19] MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation

George Wang, Jiaqian Hu, Safinah Ali

Main category: cs.CL

TL;DR: MAATS is a Multi Agent Automated Translation System using MQM for error detection, outperforming single-agent methods in accuracy and fluency.

Motivation: To improve translation quality by leveraging specialized AI agents for distinct error categories, moving beyond surface fluency to deeper semantic fidelity.

Method: Uses multiple AI agents, each focused on a specific MQM category (e.g., Accuracy, Fluency), and a synthesis agent to refine translations iteratively.

Result: Outperforms zero-shot and single-agent baselines, excelling in semantic accuracy, locale adaptation, and distant language pairs.

Conclusion: MAATS bridges the gap between black-box LLMs and human workflows, emphasizing semantic and contextual fidelity over surface fluency.

Abstract: We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.
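
Code sketch: the MAATS agent layout in schematic form: one annotator agent per MQM dimension plus a synthesis agent that folds the annotations back into the translation over a few rounds. `llm` is any chat-completion callable; the prompts are paraphrased assumptions.

```python
MQM_DIMENSIONS = ["Accuracy", "Fluency", "Style", "Terminology"]

def maats_translate(llm, source, draft, rounds=2):
    for _ in range(rounds):
        annotations = {
            dim: llm(f"As an MQM {dim} reviewer, list errors of this type.\n"
                     f"Source: {source}\nTranslation: {draft}")
            for dim in MQM_DIMENSIONS
        }
        # Synthesis agent integrates all annotations into a revision.
        draft = llm("Revise the translation to fix every annotated error.\n"
                    f"Source: {source}\nTranslation: {draft}\n"
                    f"Annotations: {annotations}")
    return draft
```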

[20] ConlangCrafter: Constructing Languages with a Multi-Hop LLM Pipeline

Morris Alper, Moran Yanuka, Raja Giryes, Gašper Beguš

Main category: cs.CL

TL;DR: ConlangCrafter uses modern LLMs to automate the creation of constructed languages (conlangs) through a modular pipeline, ensuring coherence and diversity without human expertise.

Motivation: To leverage LLMs for computational creativity in conlang creation, addressing the complexity of language design.

Method: A multi-hop pipeline (phonology, morphology, syntax, lexicon, translation) using LLMs for meta-linguistic reasoning, randomness, and self-refinement.

Result: Produces coherent and typologically diverse conlangs without human input.

Conclusion: ConlangCrafter demonstrates the potential of LLMs as tools for automated, creative language design.

Abstract: Constructed languages (conlangs) such as Esperanto and Quenya have played diverse roles in art, philosophy, and international communication. Meanwhile, large-scale foundation models have revolutionized creative generation in text, images, and beyond. In this work, we leverage modern LLMs as computational creativity aids for end-to-end conlang creation. We introduce ConlangCrafter, a multi-hop pipeline that decomposes language design into modular stages – phonology, morphology, syntax, lexicon generation, and translation. At each stage, our method leverages LLMs’ meta-linguistic reasoning capabilities, injecting randomness to encourage diversity and leveraging self-refinement feedback to encourage consistency in the emerging language description. We evaluate ConlangCrafter on metrics measuring coherence and typological diversity, demonstrating its ability to produce coherent and varied conlangs without human linguistic expertise.
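
Code sketch: the multi-hop pipeline as a loop over the stages named in the abstract, with sampling temperature for diversity and a self-refinement pass for consistency. The prompts and the two-temperature scheme are assumptions.

```python
STAGES = ["phonology", "morphology", "syntax", "lexicon", "translation"]

def craft_conlang(llm, seed_brief):
    description = seed_brief
    for stage in STAGES:
        # Randomness encourages typological diversity.
        draft = llm(f"Design the {stage} of this constructed language:\n"
                    f"{description}", temperature=1.0)
        # Self-refinement feedback encourages internal consistency.
        fixed = llm(f"Revise this {stage} so it is consistent with the "
                    f"language description so far:\n{description}\n"
                    f"Draft: {draft}", temperature=0.2)
        description += f"\n\n{stage.capitalize()}:\n{fixed}"
    return description
```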

[21] Few-Shot Prompting for Extractive Quranic QA with Instruction-Tuned LLMs

Mohamed Basem, Islam Oshallah, Ali Hamdi, Ammar Mohammed

Main category: cs.CL

TL;DR: The paper introduces two methods for extractive QA on the Quran, leveraging large language models with Arabic prompts and post-processing to improve accuracy.

Motivation: Addressing challenges like complex language and deep meaning in the Quran for QA tasks.

Method: Uses few-shot prompting with models like Gemini and DeepSeek, along with a specialized Arabic prompt framework and post-processing techniques.

Result: Large language models with Arabic instructions outperform traditional methods, achieving a pAP10 score of 0.637.

Conclusion: Prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.

Abstract: This paper presents two effective approaches for Extractive Question Answering (QA) on the Quran. It addresses challenges related to complex language, unique terminology, and deep meaning in the text. The second uses few-shot prompting with instruction-tuned large language models such as Gemini and DeepSeek. A specialized Arabic prompt framework is developed for span extraction. A strong post-processing system integrates subword alignment, overlap suppression, and semantic filtering. This improves precision and reduces hallucinations. Evaluations show that large language models with Arabic instructions outperform traditional fine-tuned models. The best configuration achieves a pAP10 score of 0.637. The results confirm that prompt-based instruction tuning is effective for low-resource, semantically rich QA tasks.

[22] You Don’t Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures

Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qinggang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, Xiao Huang

Main category: cs.CL

TL;DR: LogicRAG dynamically extracts reasoning structures at inference time to guide adaptive retrieval, avoiding costly pre-built graphs and improving efficiency.

Motivation: Addressing the limitations of GraphRAG, such as high token costs and misalignment with query logic, by enabling dynamic reasoning structure extraction.

Method: Decomposes queries into subproblems, constructs a DAG for logical dependencies, linearizes it, and prunes redundant retrieval and irrelevant context.

Result: Outperforms state-of-the-art baselines in performance and efficiency.

Conclusion: LogicRAG offers a scalable and effective solution for retrieval-augmented generation without relying on pre-built graphs.

Abstract: Large language models (LLMs) often suffer from hallucination, generating factually incorrect statements when handling questions beyond their knowledge and perception. Retrieval-augmented generation (RAG) addresses this by retrieving query-relevant contexts from knowledge bases to support LLM reasoning. Recent advances leverage pre-constructed graphs to capture the relational connections among distributed documents, showing remarkable performance in complex tasks. However, existing Graph-based RAG (GraphRAG) methods rely on a costly process to transform the corpus into a graph, introducing overwhelming token cost and update latency. Moreover, real-world queries vary in type and complexity, requiring different logic structures for accurate reasoning. The pre-built graph may not align with these required structures, resulting in ineffective knowledge retrieval. To this end, we propose a Logic-aware Retrieval-Augmented Generation framework (LogicRAG) that dynamically extracts reasoning structures at inference time to guide adaptive retrieval without any pre-built graph. LogicRAG begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order. Besides, LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. Extensive experiments demonstrate that LogicRAG achieves both superior performance and efficiency compared to state-of-the-art baselines.
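
Code sketch: LogicRAG's inference-time flow in miniature: decompose the query into subproblems, model their dependencies as a DAG, topologically sort, then retrieve and solve in that order. The `decompose`, `retrieve`, and `solve` callables are placeholders, and the graph/context pruning steps are omitted.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def logic_rag(query, decompose, retrieve, solve):
    # decompose(query) -> {subproblem: set(prerequisite subproblems)}
    dag = decompose(query)
    answers = {}
    for sub in TopologicalSorter(dag).static_order():      # consistent order
        context = retrieve(sub)                            # adaptive retrieval
        prior = {dep: answers[dep] for dep in dag.get(sub, ())}
        answers[sub] = solve(sub, context, prior)
    return answers

# e.g. decompose might return:
# {"Who directed film X?": set(),
#  "What else did that director make?": {"Who directed film X?"}}
```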

[23] AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models

Sayantan Adak, Pratyush Chatterjee, Somnath Banerjee, Rima Hazra, Somak Aditya, Animesh Mukherjee

Main category: cs.CL

TL;DR: AURA is a multi-layered framework using Process Reward Models (PRMs) to enhance LLM safety by evaluating logical coherence and safety-awareness at each reasoning step, outperforming traditional methods.

Motivation: Current LLMs struggle with affordance-based safety risks, where outputs unintentionally enable harm due to overlooked logical implications. Existing safety methods lack granularity and proactivity.

Method: AURA integrates introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically guide models toward safer reasoning.

Result: Empirical results show AURA significantly improves logical integrity and safety-awareness in model outputs, surpassing traditional approaches.

Conclusion: AURA sets a new benchmark for safer, more responsible AI, advancing alignment-sensitive applications.

Abstract: Present-day LLMs face the challenge of managing affordance-based safety risks: situations where outputs inadvertently facilitate harmful actions due to overlooked logical implications. Traditional safety solutions, such as scalar outcome-based reward models, parameter tuning, or heuristic decoding strategies, lack the granularity and proactive nature needed to reliably detect and intervene during subtle yet crucial reasoning steps. Addressing this fundamental gap, we introduce AURA, an innovative, multi-layered framework centered around Process Reward Models (PRMs), providing comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding to dynamically and proactively guide models toward safer reasoning trajectories. Empirical evidence demonstrates that this approach surpasses existing methods, significantly improving the logical integrity and affordance-sensitive safety of model outputs. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
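
Code sketch: PRM-guided, safety-aware decoding in the spirit of AURA: candidate reasoning steps are scored for coherence and safety by a process reward model, and decoding follows the best-scored step. The `lm`/`prm` interfaces and the weighting are illustrative assumptions.

```python
def safety_aware_decode(lm, prm, prompt, max_steps=10, n_candidates=4, alpha=0.5):
    """Greedy step-level search guided by a process reward model (sketch)."""
    steps = []
    for _ in range(max_steps):
        candidates = lm.sample_steps(prompt, steps, n=n_candidates)
        # prm.* return scores in [0, 1]; alpha trades coherence vs. safety.
        best = max(candidates,
                   key=lambda s: (1 - alpha) * prm.coherence(steps, s)
                                 + alpha * prm.safety(steps, s))
        steps.append(best)
        if lm.is_terminal(best):
            break
    return steps
```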

[24] Less is More: Selective Reflection for Compatible and Efficient Knowledge Distillation in Large Language Models

Lingyuan Liu, Mengxiang Zhang

Main category: cs.CL

TL;DR: Selective Reflection Distillation (SRD) improves Knowledge Distillation (KD) by refining training data quality and student-model compatibility, reducing computational costs and enhancing performance.

Motivation: Existing KD methods overlook training data quality and student-model compatibility, limiting effectiveness.

Method: SRD uses student model reflections to curate high-quality, compatible training data and employs curriculum scheduling.

Result: SRD improves distilled model performance and reduces training runtime by up to 39%.

Conclusion: Data quality and compatibility are crucial for effective KD; SRD provides a practical framework to achieve both.

Abstract: Knowledge Distillation (KD) is a fundamental technique for compressing large language models (LLMs) into compact, efficient student models. However, existing white-box KD methods mainly focus on balancing ground truth and student-generated responses while overlooking two critical factors: training data quality and student-model compatibility. To address these limitations, we propose Selective Reflection Distillation (SRD), a novel data curation framework that leverages reflections from student models to systematically refine training data. SRD dynamically evaluates and selects prompt-response pairs by comparing ground truth data with student model outputs, selectively curating high-quality, student-compatible training instances through automated ranking based on difficulty. Furthermore, after selecting the training data, a curriculum scheduling strategy is employed to incrementally introduce these curated subsets into the distillation process at fixed intervals. As a plug-and-play enhancement, SRD consistently improves distillation outcomes across diverse white-box KD approaches and model architectures, and significantly decreases computational cost during KD training. Experiments on a range of language model benchmarks demonstrate SRD’s consistent improvements in distilled model performance, as well as a reduction in training runtime by up to 39%, under diverse KD methods and model families. Notably, SRD operates as a plug-and-play module, enhancing sample efficiency without modifying underlying KD algorithms. Our findings highlight that data quality and compatibility are pivotal to effective and efficient distillation of LLMs, and SRD provides a principled framework to achieve both. This work advances the understanding of data-centric factors in KD and offers practical insights for enhancing the capability and efficiency of compressed LLMs.
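
Code sketch: SRD's curation idea in a few lines: score each prompt-response pair by how hard it is for the student, keep the most student-compatible portion, and release it in easy-to-hard curriculum chunks. The scoring function and schedule are illustrative assumptions.

```python
def select_and_schedule(pairs, student_difficulty, keep_frac=0.5, stages=3):
    """pairs: list of (prompt, reference); student_difficulty: lower = easier."""
    ranked = sorted(pairs, key=lambda p: student_difficulty(*p))  # easy -> hard
    kept = ranked[: int(len(ranked) * keep_frac)]    # drop incompatible tail
    step = max(1, len(kept) // stages)
    # Curriculum: introduce curated subsets incrementally at fixed intervals.
    return [kept[: (i + 1) * step] for i in range(stages)]
```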

[25] Semantic and Structural Analysis of Implicit Biases in Large Language Models: An Interpretable Approach

Renhan Zhang, Lian Lian, Zhen Qi, Guiran Liu

Main category: cs.CL

TL;DR: The paper proposes an interpretable method for detecting implicit social biases in large language model outputs, using nested semantic representation and contextual contrast. It validates the method on the StereoSet dataset, showing high accuracy and interpretability.

Motivation: Address implicit stereotypes in language model outputs that are not easily captured by explicit features, aiming for transparent and reliable bias detection.

Method: Combines nested semantic representation with contextual contrast, extracts latent bias features from vector space, and uses attention weight perturbation to analyze bias pathways.

Result: Achieves strong detection performance across multiple stereotype dimensions, with high accuracy, semantic consistency, and interpretability.

Conclusion: The method provides a transparent and reliable foundation for bias detection, suitable for real-world applications requiring trustworthy content.

Abstract: This paper addresses the issue of implicit stereotypes that may arise during the generation process of large language models. It proposes an interpretable bias detection method aimed at identifying hidden social biases in model outputs, especially those semantic tendencies that are not easily captured through explicit linguistic features. The method combines nested semantic representation with a contextual contrast mechanism. It extracts latent bias features from the vector space structure of model outputs. Using attention weight perturbation, it analyzes the model’s sensitivity to specific social attribute terms, thereby revealing the semantic pathways through which bias is formed. To validate the effectiveness of the method, this study uses the StereoSet dataset, which covers multiple stereotype dimensions including gender, profession, religion, and race. The evaluation focuses on several key metrics, such as bias detection accuracy, semantic consistency, and contextual sensitivity. Experimental results show that the proposed method achieves strong detection performance across various dimensions. It can accurately identify bias differences between semantically similar texts while maintaining high semantic alignment and output stability. The method also demonstrates high interpretability in its structural design. It helps uncover the internal bias association mechanisms within language models. This provides a more transparent and reliable technical foundation for bias detection. The approach is suitable for real-world applications where high trustworthiness of generated content is required.
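The attention-perturbation probe can be illustrated on toy attention matrices: down-weight the attention logits at social-attribute token positions and measure how far the attended representation moves. The perturbation form and sensitivity measure below are assumptions for illustration, not the paper's published procedure.

```python
# Sketch of an attention-perturbation bias probe in the spirit of the paper
# (the exact perturbation and latent bias features are not specified here).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def perturbed_output(scores, values, attr_positions, delta):
    """Down-weight attention logits at social-attribute token positions by
    `delta` and return the resulting attended representation."""
    scores = scores.copy()
    scores[:, attr_positions] -= delta
    return softmax(scores) @ values

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 8))     # toy attention logits: 4 queries x 8 keys
values = rng.normal(size=(8, 16))    # toy value vectors

base = perturbed_output(scores, values, attr_positions=[2, 5], delta=0.0)
pert = perturbed_output(scores, values, attr_positions=[2, 5], delta=2.0)

# Sensitivity: how much the representation moves when attribute terms are muted.
sensitivity = np.linalg.norm(base - pert) / np.linalg.norm(base)
print(f"attribute sensitivity: {sensitivity:.3f}")
```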

[26] One Size Does Not Fit All: A Distribution-Aware Sparsification for More Precise Model Merging

Yingfeng Luo, Dingyang Lin, Junxin Wang, Ziqiang Xu, Kaiyan Chang, Tong Zheng, Bei Li, Anxiang Ma, Tong Xiao, Zhengtao Yu, Jingbo Zhu

Main category: cs.CL

TL;DR: TADrop introduces an adaptive sparsification strategy for model merging, improving performance by tailoring sparsity levels to parameter tensors.

DetailsMotivation: Existing model merging methods use uniform sparsity ratios, ignoring parameter heterogeneity, leading to suboptimal results.

Method: TADrop assigns custom sparsity levels to each parameter tensor based on distributional properties, preserving critical parameters.

Result: TADrop boosts performance across diverse tasks and models, e.g., a 2.0% gain for a leading merging method on ViT-B/32 tasks.

Conclusion: TADrop offers a more effective, structure-aware approach to model merging, setting a new performance benchmark.

Abstract: Model merging has emerged as a compelling data-free paradigm for multi-task learning, enabling the fusion of multiple fine-tuned models into a single, powerful entity. A key technique in merging methods is sparsification, which prunes redundant parameters from task vectors to mitigate interference. However, prevailing approaches employ a “one-size-fits-all” strategy, applying a uniform sparsity ratio that overlooks the inherent structural and statistical heterogeneity of model parameters. This often leads to a suboptimal trade-off, where critical parameters are inadvertently pruned while less useful ones are retained. To address this limitation, we introduce TADrop (Tensor-wise Adaptive Drop), an adaptive sparsification strategy that respects this heterogeneity. Instead of a global ratio, TADrop assigns a tailored sparsity level to each parameter tensor based on its distributional properties. The core intuition is that tensors with denser, more redundant distributions can be pruned aggressively, while sparser, more critical ones are preserved. As a simple and plug-and-play module, we validate TADrop by integrating it with foundational, classic, and SOTA merging methods. Extensive experiments across diverse tasks (vision, language, and multimodal) and models (ViT, BEiT) demonstrate that TADrop consistently and significantly boosts their performance. For instance, when enhancing a leading merging method, it achieves an average performance gain of 2.0% across 8 ViT-B/32 tasks. TADrop provides a more effective way to mitigate parameter interference by tailoring sparsification to the model’s structure, offering a new baseline for high-performance model merging.
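A rough sketch of the tensor-wise idea, assuming one plausible mapping from a tensor's distribution to its sparsity: near-Gaussian (dense, redundant) tensors are pruned harder, while heavy-tailed tensors, whose mass sits in a few large entries, are preserved. The kurtosis-to-sparsity rule is an illustrative assumption, not the paper's formula.

```python
# Sketch of tensor-wise adaptive sparsification in the spirit of TADrop.
import torch

def adaptive_sparsity(tensor, lo=0.3, hi=0.9):
    """Map excess kurtosis to a sparsity ratio: near-Gaussian (dense, redundant)
    tensors are pruned more aggressively; heavy-tailed tensors, whose mass sits
    in a few large entries, are treated as critical and preserved."""
    x = tensor.flatten().float()
    z = (x - x.mean()) / (x.std() + 1e-8)
    kurt = (z ** 4).mean() - 3.0                 # excess kurtosis
    frac = torch.sigmoid(-kurt / 10.0)           # high kurtosis -> low sparsity
    return lo + (hi - lo) * frac.item()

def magnitude_prune(tensor, sparsity):
    """Zero out the smallest-magnitude entries at the given ratio."""
    k = int(tensor.numel() * sparsity)
    if k == 0:
        return tensor
    thresh = tensor.abs().flatten().kthvalue(k).values
    return torch.where(tensor.abs() > thresh, tensor, torch.zeros_like(tensor))

task_vector = {"attn.weight": torch.randn(64, 64),       # near-Gaussian: prune more
               "mlp.weight": torch.randn(64, 64) ** 3}   # heavy-tailed: preserve
for name, t in task_vector.items():
    s = adaptive_sparsity(t)
    task_vector[name] = magnitude_prune(t, s)
    print(f"{name}: sparsity {s:.2f}")
```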

[27] UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

Main category: cs.CL

TL;DR: UR2 unifies retrieval and reasoning in LLMs through reinforcement learning, outperforming existing RAG and RL methods and matching GPT-4o-mini and GPT-4.1-mini on several benchmarks.

DetailsMotivation: Existing RAG and RLVR methods are isolated, limiting generalization. UR2 aims to integrate them for broader applicability.

Method: UR2 uses difficulty-aware curriculum training and hybrid knowledge access (domain corpora + LLM summaries) for dynamic retrieval-reasoning coordination.

Result: UR2 outperforms existing RAG and RL methods, matching GPT-4o-mini and GPT-4.1-mini on open-domain QA, MMLU-Pro, medical, and math tasks.

Conclusion: UR2 bridges the gap between retrieval and reasoning, offering a flexible, high-performing framework for diverse tasks.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR), which optimizes complex reasoning abilities. However, these two capabilities are often developed in isolation, and existing efforts to unify them remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings and task-specific assumptions. This lack of integration constrains generalization and limits the applicability of RAG-RL methods to broader domains. To bridge this gap, we propose UR2 (Unified RAG and Reasoning), a general framework that unifies retrieval and reasoning through reinforcement learning. UR2 introduces two key contributions: a difficulty-aware curriculum training that selectively invokes retrieval only for challenging problems, and a hybrid knowledge access strategy combining domain-specific offline corpora with LLM-generated summaries. These components are designed to enable dynamic coordination between retrieval and reasoning, improving adaptability across a diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks demonstrate that UR2 (built on Qwen2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL methods, achieving comparable performance to GPT-4o-mini and GPT-4.1-mini on several benchmarks. We have released all code, models, and data at https://github.com/Tsinghua-dhy/UR2.
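The difficulty-aware routing can be pictured as a self-consistency gate: answer directly when sampled answers agree, and fall back to retrieval when they do not. The agreement signal, threshold, and stub `llm`/`retriever` callables below are illustrative assumptions, not UR2's trained policy.

```python
# Minimal sketch of difficulty-aware retrieval routing: retrieve only for
# problems the model finds hard, approximated here by low self-consistency.

def solve(question, llm, retriever, n_samples=4, agree_threshold=0.75):
    """Answer directly when self-consistency is high; otherwise retrieve."""
    samples = [llm(question) for _ in range(n_samples)]
    top = max(set(samples), key=samples.count)
    agreement = samples.count(top) / n_samples
    if agreement >= agree_threshold:           # easy: skip retrieval
        return top
    docs = retriever(question)                 # hard: ground with retrieval
    context = "\n".join(docs[:3])
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

# Toy usage with stub components.
answer = solve(
    "Who wrote Dune?",
    llm=lambda q: "Frank Herbert",
    retriever=lambda q: ["Dune is a 1965 novel by Frank Herbert."],
)
print(answer)
```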

[28] Pragmatics beyond humans: meaning, communication, and LLMs

Vít Gvoždiak

Main category: cs.CL

TL;DR: The paper redefines pragmatics as a dynamic interface for social action, critiques traditional semiotic hierarchies in light of LLMs, and proposes the Human-Machine Communication framework. It highlights tensions between human-centric pragmatics and machine-centric LLMs, advocating for probabilistic pragmatics. It also addresses substitutionalism biases and introduces ‘context frustration’ to describe LLM-user interactions, suggesting updates to pragmatic theory for AI communication.

DetailsMotivation: The paper aims to refine pragmatics to better accommodate the role of large language models (LLMs) in communication, challenging traditional frameworks that are human-centric and ill-suited for machine interactions.

Method: The paper critiques traditional semiotic hierarchies, proposes the Human-Machine Communication (HMC) framework, and evaluates probabilistic pragmatics (e.g., Rational Speech Act framework) as more compatible with LLMs. It also analyzes substitutionalism biases and introduces ‘context frustration.’

Result: The paper argues that traditional pragmatic theories are inadequate for LLMs, advocating for probabilistic approaches and highlighting biases in LLM evaluation. It identifies ‘context frustration’ as a key issue in human-AI communication.

Conclusion: Pragmatic theory must evolve to address communication involving generative AI, incorporating probabilistic frameworks and recognizing the unique challenges posed by LLMs, such as context frustration and substitutionalism biases.

Abstract: The paper reconceptualizes pragmatics not as a subordinate, third dimension of meaning, but as a dynamic interface through which language operates as a socially embedded tool for action. With the emergence of large language models (LLMs) in communicative contexts, this understanding needs to be further refined and methodologically reconsidered. The first section challenges the traditional semiotic trichotomy, arguing that connectionist LLM architectures destabilize established hierarchies of meaning, and proposes the Human-Machine Communication (HMC) framework as a more suitable alternative. The second section examines the tension between human-centred pragmatic theories and the machine-centred nature of LLMs. While traditional, Gricean-inspired pragmatics continue to dominate, it relies on human-specific assumptions ill-suited to predictive systems like LLMs. Probabilistic pragmatics, particularly the Rational Speech Act framework, offers a more compatible teleology by focusing on optimization rather than truth-evaluation. The third section addresses the issue of substitutionalism in three forms - generalizing, linguistic, and communicative - highlighting the anthropomorphic biases that distort LLM evaluation and obscure the role of human communicative subjects. Finally, the paper introduces the concept of context frustration to describe the paradox of increased contextual input paired with a collapse in contextual understanding, emphasizing how users are compelled to co-construct pragmatic conditions both for the model and themselves. These arguments suggest that pragmatic theory may need to be adjusted or expanded to better account for communication involving generative AI.

[29] Comparing Knowledge Injection Methods for LLMs in a Low-Resource Regime

Hugo Abonizio, Thales Almeida, Roberto Lotufo, Rodrigo Nogueira

Main category: cs.CL

TL;DR: The paper explores methods for injecting small, unstructured information into LLMs, addressing challenges like catastrophic forgetting and limited data. It compares continued pre-training with synthetic data augmentation, finding diversity improves learning. It also examines RAG sensitivity and suggests self-generated synthetic data as a solution.

DetailsMotivation: To address the challenge of updating LLMs with limited data while avoiding catastrophic forgetting and improving knowledge acquisition.

Method: The study uses a dataset of recent news to evaluate knowledge acquisition via question-answer pairs. It compares continued pre-training with synthetic data augmentation and analyzes RAG-based approaches.

Result: Diverse textual variations improve learning, while RAG methods degrade performance. Self-generated synthetic data shows promise for model updates.

Conclusion: The study highlights the balance between learning new content and retaining existing knowledge, proposing synthetic data as a viable solution for efficient knowledge injection.

Abstract: Large language models (LLMs) often require vast amounts of text to effectively acquire new knowledge. While continuing pre-training on large corpora or employing retrieval-augmented generation (RAG) has proven successful, updating an LLM with only a few thousand or million tokens remains challenging. In this work, we investigate the task of injecting small, unstructured information into LLMs and its relation to the catastrophic forgetting phenomenon. We use a dataset of recent news – ensuring no overlap with the model’s pre-training data – to evaluate the knowledge acquisition by probing the model with question-answer pairs related to the learned information. Starting from a continued pre-training baseline, we explored different augmentation algorithms to generate synthetic data to improve the knowledge acquisition capabilities. Our experiments show that simply continuing pre-training on limited data yields modest improvements, whereas exposing the model to diverse textual variations significantly improves the learning of new facts – particularly with methods that induce greater variability through diverse prompting. Furthermore, we shed light on the forgetting phenomenon in small-data regimes, illustrating the delicate balance between learning new content and retaining existing capabilities. We also confirm the sensitivity of RAG-based approaches for knowledge injection, which often lead to greater degradation on control datasets compared to parametric methods. Finally, we demonstrate that models can generate effective synthetic training data themselves, suggesting a pathway toward self-improving model updates. All code and generated data used in our experiments are publicly available, providing a resource for studying efficient knowledge injection in LLMs with limited data at https://github.com/hugoabonizio/knowledge-injection-methods.
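The diverse-prompting augmentation amounts to rewriting each new document under many instructions so the same facts appear in varied surface forms. A minimal sketch, with `rewrite` standing in for an LLM call and hypothetical templates:

```python
# Sketch of synthetic augmentation by diverse prompting: rewrite each news
# snippet under several instructions so the same fact appears in varied forms.

TEMPLATES = [
    "Paraphrase the following text: {doc}",
    "Summarize the following text in one sentence: {doc}",
    "Rewrite the following text as a question and answer: {doc}",
    "Explain the following text to a child: {doc}",
]

def rewrite(prompt: str) -> str:
    """Placeholder for an LLM generation call."""
    return prompt.split(": ", 1)[1]  # dummy: echo the document back

def augment(docs, templates=TEMPLATES):
    """Yield (original, variant) pairs for continued pre-training."""
    for doc in docs:
        for tpl in templates:
            yield doc, rewrite(tpl.format(doc=doc))

news = ["On 2025-08-04, the city council approved the new transit line."]
for original, variant in augment(news):
    print(variant)
```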

[30] DKG-LLM: A Framework for Medical Diagnosis and Personalized Treatment Recommendations via Dynamic Knowledge Graph and Large Language Model Integration

Ali Sarabadani, Maryam Abdollahi Shamami, Hamidreza Sadeghsalehi, Borhan Asadi, Saba Hesaraki

Main category: cs.CL

TL;DR: The paper introduces DKG-LLM, a framework combining dynamic knowledge graphs with LLMs for medical diagnosis and treatment recommendations, achieving high accuracy and semantic coverage.

DetailsMotivation: To enhance medical diagnosis and personalized treatment by integrating dynamic knowledge graphs with large language models, addressing noisy data and complex diseases.

Method: Uses the Adaptive Semantic Fusion Algorithm (ASFA) to dynamically generate and update a knowledge graph from heterogeneous medical data, integrating it with the Grok 3 LLM.

Result: Achieves 84.19% diagnostic accuracy, 89.63% treatment recommendation accuracy, and 93.48% semantic coverage.

Conclusion: DKG-LLM is a reliable, scalable tool for medical applications, capable of handling complex data and improving with physician feedback.

Abstract: Large Language Models (LLMs) have grown exponentially since the release of ChatGPT. These models have gained attention due to their robust performance on various tasks, including language processing tasks. These models achieve understanding and comprehension of tasks by training billions of parameters. The development of these models is a transformative force in enhancing natural language understanding and has taken a significant step towards artificial general intelligence (AGI). In this study, we aim to present the DKG-LLM framework. The DKG-LLM framework introduces a groundbreaking approach to medical diagnosis and personalized treatment recommendations by integrating a dynamic knowledge graph (DKG) with the Grok 3 large language model. Using the Adaptive Semantic Fusion Algorithm (ASFA), heterogeneous medical data (including clinical reports and PubMed articles) and patient records dynamically generate a knowledge graph consisting of 15,964 nodes in 13 distinct types (e.g., diseases, symptoms, treatments, patient profiles) and 127,392 edges in 26 relationship types (e.g., causal, therapeutic, association). ASFA utilizes advanced probabilistic models, Bayesian inference, and graph optimization to extract semantic information, dynamically updating the graph with approximately 150 new nodes and edges in each data category while maintaining scalability with up to 987,654 edges. Real-world datasets, including MIMIC-III and PubMed, were utilized to evaluate the proposed architecture. The evaluation results show that DKG-LLM achieves a diagnostic accuracy of 84.19%. The model also has a treatment recommendation accuracy of 89.63% and a semantic coverage of 93.48%. DKG-LLM is a reliable and transformative tool that handles noisy data and complex multi-symptom diseases, along with feedback-based learning from physician input.
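The dynamic-graph side can be sketched with networkx: typed nodes and confidence-weighted typed edges are inserted as new evidence arrives, and repeated observations raise edge confidence. The max-confidence update below is a simple stand-in for ASFA's probabilistic and Bayesian machinery, and the node/edge types are illustrative.

```python
# Sketch of a dynamic medical knowledge-graph update loop (not the ASFA algorithm).
import networkx as nx

graph = nx.MultiDiGraph()

def add_fact(g, head, head_type, relation, tail, tail_type, confidence):
    """Insert typed nodes and a typed, confidence-weighted edge; on repeat
    observations keep the higher confidence (a crude stand-in for Bayesian updating)."""
    g.add_node(head, type=head_type)
    g.add_node(tail, type=tail_type)
    for _, v, data in g.edges(head, data=True):
        if v == tail and data.get("relation") == relation:
            data["confidence"] = max(data["confidence"], confidence)
            return
    g.add_edge(head, tail, relation=relation, confidence=confidence)

add_fact(graph, "influenza", "disease", "causes", "fever", "symptom", 0.90)
add_fact(graph, "oseltamivir", "treatment", "treats", "influenza", "disease", 0.85)
add_fact(graph, "influenza", "disease", "causes", "fever", "symptom", 0.95)  # update

print(graph.number_of_nodes(), graph.number_of_edges())  # 3 nodes, 2 edges
```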

[31] Beyond Uniform Criteria: Scenario-Adaptive Multi-Dimensional Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang, Youtao Ding, Li Pan

Main category: cs.CL

TL;DR: SceneJailEval introduces a scenario-adaptive multi-dimensional framework for jailbreak evaluation, outperforming existing methods with higher precision and adaptability.

DetailsMotivation: Current jailbreak evaluation methods lack precision due to binary classification and uniform criteria, leading to scenario-specific mismatches.

Method: SceneJailEval uses a scenario-adaptive framework and a 14-scenario dataset for evaluation.

Result: Achieves F1 scores of 0.917 (full-scenario) and 0.995 (JBB), surpassing prior methods.

Conclusion: SceneJailEval addresses limitations of existing methods, offering superior accuracy and adaptability for jailbreak evaluation.

Abstract: Precise jailbreak evaluation is vital for LLM red teaming and jailbreak research. Current approaches employ binary classification (e.g., string matching, toxic text classifiers, LLM-driven methods), yielding only “yes/no” labels without quantifying harm intensity. Existing multi-dimensional frameworks (e.g., Security Violation, Relative Truthfulness, Informativeness) apply uniform evaluation criteria across scenarios, resulting in scenario-specific mismatches that compromise evaluation precision; for instance, “Relative Truthfulness” is irrelevant to “hate speech”. To tackle these limitations, we introduce SceneJailEval, with key contributions: (1) A groundbreaking scenario-adaptive multi-dimensional framework for jailbreak evaluation, overcoming the critical “one-size-fits-all” constraint of existing multi-dimensional methods, and featuring strong extensibility to flexibly adapt to customized or emerging scenarios. (2) A comprehensive 14-scenario dataset with diverse jailbreak variants and regional cases, filling the long-standing gap in high-quality, holistic benchmarks for scenario-adaptive evaluation. (3) SceneJailEval achieves state-of-the-art results, with an F1 score of 0.917 on our full-scenario dataset (+6% over prior SOTA) and 0.995 on JBB (+3% over prior SOTA), surpassing accuracy limits of existing evaluation methods in heterogeneous scenarios and confirming its advantage.
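The scenario-adaptive scoring reduces to giving each scenario its own set of dimensions and weights, so irrelevant axes never enter the score. A minimal sketch with invented dimensions and weights, not SceneJailEval's published configuration:

```python
# Sketch of scenario-adaptive multi-dimensional jailbreak scoring.

SCENARIO_DIMS = {
    "hate_speech":    {"security_violation": 0.7, "informativeness": 0.3},
    "misinformation": {"security_violation": 0.4, "relative_truthfulness": 0.4,
                       "informativeness": 0.2},
}

def jailbreak_score(scenario: str, dim_scores: dict) -> float:
    """Weighted harm score over only the dimensions relevant to the scenario."""
    weights = SCENARIO_DIMS[scenario]
    return sum(w * dim_scores[d] for d, w in weights.items())

# A hate-speech response is scored without the irrelevant truthfulness axis.
print(jailbreak_score("hate_speech",
                      {"security_violation": 0.9, "informativeness": 0.6}))
```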

[32] EICAP: Deep Dive in Assessment and Enhancement of Large Language Models in Emotional Intelligence through Multi-Turn Conversations

Nizi Nazar, Ehsaneddin Asgari

Main category: cs.CL

TL;DR: The paper introduces a four-layer EI taxonomy for LLMs, presents the EICAP-Bench benchmark, evaluates six LLMs, and fine-tunes Qwen2.5 models, finding limited EI improvement with current methods.

DetailsMotivation: To address the underexplored role of Emotional Intelligence (EI) in LLMs by creating a taxonomy and benchmark for evaluation.

Method: Develops a four-layer EI taxonomy, introduces EICAP-Bench, evaluates six LLMs, and fine-tunes Qwen2.5 models using LoRA adapters on UltraChat.

Result: Qwen2.5-Instruct performs best; fine-tuning improves only the Appraisal layer, revealing limitations in current EI training.

Conclusion: Current methods fall short in enhancing EI in LLMs, calling for targeted data and modeling strategies.

Abstract: Emotional Intelligence (EI) is a critical yet underexplored dimension in the development of human-aligned LLMs. To address this gap, we introduce a unified, psychologically grounded four-layer taxonomy of EI tailored for large language models (LLMs), encompassing emotional tracking, cause inference, appraisal, and emotionally appropriate response generation. Building on this framework, we present EICAP-Bench, a novel MCQ-style multi-turn benchmark designed to evaluate EI capabilities in open-source LLMs across diverse linguistic and cultural contexts. We evaluate six LLMs: LLaMA3 (8B), LLaMA3-Instruct, Gemma (9B), Gemma-Instruct, Qwen2.5 (7B), and Qwen2.5-Instruct on EICAP-Bench, identifying Qwen2.5-Instruct as the strongest baseline. To assess the potential for enhancing EI capabilities, we fine-tune both Qwen2.5-Base and Qwen2.5-Instruct using LoRA adapters on UltraChat (UC), a large-scale, instruction-tuned dialogue dataset, in both English and Arabic. Our statistical analysis reveals that among the EI layers, only the Appraisal layer shows significant improvement through UC-based fine-tuning. These findings highlight the limitations of existing pretraining and instruction-tuning paradigms in equipping LLMs with deeper emotional reasoning and underscore the need for targeted data and modeling strategies for comprehensive EI alignment.

[33] Classification is a RAG problem: A case study on hate speech detection

Richard Willats, Josh Pennington, Aravind Mohan, Bertie Vidgen

Main category: cs.CL

TL;DR: The paper introduces a Retrieval-Augmented Generation (RAG) approach for content moderation, enabling dynamic policy updates without retraining, with strong accuracy and explainability.

DetailsMotivation: To create adaptable classification systems for content moderation that can evolve with policies without costly retraining.

Method: Uses a Contextual Policy Engine (CPE), an agentic RAG system, to evaluate content against retrieved contextual knowledge instead of pre-trained parameters.

Result: Demonstrates robust accuracy, explainability via policy segments, and dynamic updates without retraining, with fine-grained policy control in experiments.

Conclusion: RAG transforms classification into a flexible, transparent, and adaptable process for moderation and broader applications.

Abstract: Robust content moderation requires classification systems that can quickly adapt to evolving policies without costly retraining. We present classification using Retrieval-Augmented Generation (RAG), which shifts traditional classification tasks from determining the correct category in accordance with pre-trained parameters to evaluating content in relation to contextual knowledge retrieved at inference. In hate speech detection, this transforms the task from “is this hate speech?” to “does this violate the hate speech policy?” Our Contextual Policy Engine (CPE) - an agentic RAG system - demonstrates this approach and offers three key advantages: (1) robust classification accuracy comparable to leading commercial systems, (2) inherent explainability via retrieved policy segments, and (3) dynamic policy updates without model retraining. Through three experiments, we demonstrate strong baseline performance and show that the system can apply fine-grained policy control by correctly adjusting protection for specific identity groups without requiring retraining or compromising overall performance. These findings establish that RAG can transform classification into a more flexible, transparent, and adaptable process for content moderation and wider classification problems.
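The core CPE loop, classification as retrieval plus judgment, can be sketched in a few lines. The toy lexical retriever, the two-rule policy, and the stub judge below are all illustrative; the real system is agentic rather than single-shot.

```python
# Sketch of RAG-style policy classification: retrieve the most relevant policy
# segments, then ask an LLM whether the content violates them.

POLICY = [
    "Rule 1: Content attacking a person or group based on protected attributes "
    "such as religion or ethnicity is prohibited.",
    "Rule 2: Quoting hateful content for the purpose of reporting or education "
    "is permitted with clear context.",
]

def retrieve(query: str, segments, k=2):
    """Toy lexical retriever: rank segments by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(segments, key=lambda s: -len(q & set(s.lower().split())))[:k]

def classify(content: str, llm) -> str:
    segments = retrieve(content, POLICY)
    prompt = ("Policy:\n" + "\n".join(segments) +
              f"\n\nContent: {content}\n\nDoes the content violate the policy? "
              "Answer VIOLATES or ALLOWED, citing the rule.")
    return llm(prompt)

print(classify("a slur targeting an ethnic group",
               llm=lambda p: "VIOLATES (Rule 1)"))  # stub judge
```

Because the policy lives in the retrieval index rather than the model weights, editing a rule changes classification behavior immediately, which is the property the paper exploits.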

[34] InfoCausalQA: Can Models Perform Non-explicit Causal Reasoning Based on Infographic?

Keummin Ka, Junhyeong Park, Jahyun Jeon, Youngjae Yu

Main category: cs.CL

TL;DR: The paper introduces InfoCausalQA, a benchmark to evaluate causal reasoning in multimodal settings using infographics. Current VLMs show limited capabilities in causal inference compared to humans.

DetailsMotivation: To address the underexplored ability of VLMs in causal reasoning, particularly in multimodal contexts.

Method: Developed InfoCausalQA with two tasks (quantitative and semantic causal reasoning) using 494 infographic-text pairs and 1,482 GPT-4o-generated QA pairs, later refined by humans.

Result: Current VLMs perform poorly in computational and semantic causal reasoning, lagging behind human performance.

Conclusion: InfoCausalQA underscores the need to improve causal reasoning in multimodal AI systems.

Abstract: Recent advances in Vision-Language Models (VLMs) have demonstrated impressive capabilities in perception and reasoning. However, the ability to perform causal inference – a core aspect of human cognition – remains underexplored, particularly in multimodal settings. In this study, we introduce InfoCausalQA, a novel benchmark designed to evaluate causal reasoning grounded in infographics that combine structured visual data with textual context. The benchmark comprises two tasks: Task 1 focuses on quantitative causal reasoning based on inferred numerical trends, while Task 2 targets semantic causal reasoning involving five types of causal relations: cause, effect, intervention, counterfactual, and temporal. We manually collected 494 infographic-text pairs from four public sources and used GPT-4o to generate 1,482 high-quality multiple-choice QA pairs. These questions were then carefully revised by humans to ensure they cannot be answered based on surface-level cues alone but instead require genuine visual grounding. Our experimental results reveal that current VLMs exhibit limited capability in computational reasoning and even more pronounced limitations in semantic causal reasoning. Their significantly lower performance compared to humans indicates a substantial gap in leveraging infographic-based information for causal inference. Through InfoCausalQA, we highlight the need for advancing the causal reasoning abilities of multimodal AI systems.

[35] Matrix-Driven Instant Review: Confident Detection and Reconstruction of LLM Plagiarism on PC

Ruichong Zhang

Main category: cs.CL

TL;DR: MDIR is a new method for detecting plagiarism in LLMs using matrix analysis and Large Deviation Theory, offering accurate weight reconstruction, p-value estimation, and efficiency.

DetailsMotivation: Growing concerns about IP theft in LLMs and shortcomings of existing plagiarism detection methods.

Method: Matrix-Driven Instant Review (MDIR) leverages matrix analysis and Large Deviation Theory to analyze weight relationships without full model inference.

Result: MDIR reliably detects plagiarism even after extensive transformations and operates efficiently on a single PC.

Conclusion: MDIR addresses key limitations of existing methods, providing a robust and accessible solution for LLM plagiarism detection.

Abstract: In recent years, concerns about intellectual property (IP) in large language models (LLMs) have grown significantly. Plagiarizing other LLMs (through direct weight copying, upcycling, pruning, or continual pretraining) and claiming authorship without properly attributing to the original license, is a serious misconduct that can lead to significant financial and reputational harm to the original developers. However, existing methods for detecting LLM plagiarism fall short in key areas. They fail to accurately reconstruct weight correspondences, lack the ability to compute statistical significance measures such as $p$-values, and may mistakenly flag models trained on similar data as being related. To address these limitations, we propose Matrix-Driven Instant Review (MDIR), a novel method that leverages matrix analysis and Large Deviation Theory. MDIR achieves accurate reconstruction of weight relationships, provides rigorous $p$-value estimation, and focuses exclusively on weight similarity without requiring full model inference. Experimental results demonstrate that MDIR reliably detects plagiarism even after extensive transformations, such as random permutations and continual pretraining with trillions of tokens. Moreover, all detections can be performed on a single PC within an hour, making MDIR both efficient and accessible.
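The flavor of MDIR's weight-correspondence test can be shown with a toy statistic: recover the best neuron permutation between two weight matrices via assignment matching, then estimate significance against a random-matrix null. The Monte Carlo p-value below is only a stand-in for the paper's Large Deviation Theory result.

```python
# Sketch of weight-correspondence testing in the spirit of MDIR (toy statistic).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_score(A, B):
    """Best-permutation alignment score between rows of A and rows of B."""
    sim = A @ B.T                                  # row-by-row similarity
    r, c = linear_sum_assignment(-sim)             # maximize total similarity
    return sim[r, c].sum(), c

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 64))
B = A[rng.permutation(32)] + 0.1 * rng.normal(size=(32, 64))  # permuted copy of A

observed, perm = match_score(A, B)

# Null: alignment scores between A and unrelated Gaussian matrices.
null = [match_score(A, rng.normal(size=B.shape))[0] for _ in range(200)]
p_value = (1 + sum(s >= observed for s in null)) / 201
print(f"observed={observed:.1f}, p={p_value:.4f}")   # tiny p: B is derived from A
```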

[36] Harnessing Adaptive Topology Representations for Zero-Shot Graph Question Answering

Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James T. Kwok, Yu Zhang

Main category: cs.CL

TL;DR: The paper introduces DynamicTRF, a framework to improve zero-shot graph QA by tailoring graph representations and introducing a new metric (GRE) for balancing performance and brevity.

DetailsMotivation: Current approaches use a single graph representation, leading to incorrect or overly long responses. The paper aims to address this by analyzing weaknesses and designing tailored representations.

Method: The authors design a set of tailored graph representations ($F_{ZS}$), introduce the GRE metric, and develop DynamicTRF, which includes a TRF Preference dataset and a TRF router for adaptive representation selection.

Result: DynamicTRF significantly improves zero-shot graph QA accuracy across 7 in-domain and 2 out-of-domain tasks.

Conclusion: DynamicTRF enhances LMMs’ zero-shot graph QA by dynamically selecting the best graph representation, improving both accuracy and conciseness.

Abstract: Large Multimodal Models (LMMs) have shown generalized zero-shot capabilities in diverse domain question-answering (QA) tasks, including graph QA that involves complex graph topologies. However, most current approaches use only a single type of graph representation, namely Topology Representation Form (TRF), such as prompt-unified text descriptions or style-fixed visual styles. Those “one-size-fits-all” approaches fail to consider the specific preferences of different models or tasks, often leading to incorrect or overly long responses. To address this, we first analyze the characteristics and weaknesses of existing TRFs, and then design a set of TRFs, denoted by $F_{ZS}$, tailored to zero-shot graph QA. We then introduce a new metric, Graph Response Efficiency (GRE), which measures the balance between the performance and the brevity in graph QA. Built on these, we develop the DynamicTRF framework, which aims to improve both the accuracy and conciseness of graph QA. To be specific, DynamicTRF first creates a TRF Preference (TRFP) dataset that ranks TRFs based on their GRE scores, to probe the question-specific TRF preferences. Then it trains a TRF router on the TRFP dataset, to adaptively assign the best TRF from $F_{ZS}$ for each question during the inference. Extensive experiments across 7 in-domain algorithmic graph QA tasks and 2 out-of-domain downstream tasks show that DynamicTRF significantly enhances the zero-shot graph QA of LMMs in terms of accuracy and conciseness.

[37] Cyberbullying Detection via Aggression-Enhanced Prompting

Aisha Saeid, Anu Sabu, Girish A. Koushik, Ferrante Neri, Diptesh Kanojia

Main category: cs.CL

TL;DR: The study explores integrating aggression detection as an auxiliary task to improve LLMs’ performance in cyberbullying detection, proposing an enriched prompt pipeline that outperforms standard methods.

DetailsMotivation: Cyberbullying detection is challenging due to subtle expressions; leveraging aggression detection could enhance LLM performance.

Method: Evaluated zero-shot, few-shot, LoRA fine-tuning, and MTL; proposed an enriched prompt pipeline embedding aggression predictions.

Result: Enriched prompt pipeline outperforms standard LoRA fine-tuning, showing aggression context improves detection.

Conclusion: Auxiliary tasks like aggression detection can enhance LLM generalization for safety-critical social media applications.

Abstract: Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
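The enriched prompt pipeline is simple to sketch: run the aggression model first and splice its label into the cyberbullying prompt as auxiliary context. The label set and prompt wording below are assumptions for illustration.

```python
# Sketch of aggression-enriched prompting: an auxiliary classifier's prediction
# is embedded into the cyberbullying detection prompt. Both models are stubs.

def aggression_label(text: str, clf) -> str:
    return clf(text)  # e.g., "overtly aggressive" / "covertly aggressive" / "none"

def enriched_prompt(text: str, clf) -> str:
    label = aggression_label(text, clf)
    return (f"Auxiliary signal: an aggression detector labeled this post as "
            f"'{label}'.\nPost: {text}\n"
            "Is this post cyberbullying? Answer yes or no with a brief reason.")

print(enriched_prompt("nobody likes you, just leave",
                      clf=lambda t: "overtly aggressive"))
```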

[38] Evaluating Style-Personalized Text Generation: Challenges and Directions

Anubhav Jangra, Bahareh Sarrafzadeh, Adrian de Wynter, Silviu Cucerzan, Sujay Kumar Jauhar

Main category: cs.CL

TL;DR: The paper critiques existing metrics like BLEU and ROUGE for evaluating style-personalized text generation, proposing alternatives like style embeddings and LLM-as-judge, and advocates for ensemble metrics.

DetailsMotivation: Limited exploration of evaluation in low-resource author style personalized text generation prompted the need for better metrics.

Method: Evaluated BLEU, ROUGE, style embeddings, and LLM-as-judge using a style discrimination benchmark across eight writing tasks and three settings.

Result: Found ensemble of diverse metrics most effective for evaluating style-personalized text generation.

Conclusion: Recommends adopting ensemble metrics for holistic evaluation of style-personalized text generation.

Abstract: While prior research has built tools and benchmarks towards style personalized text generation, there has been limited exploration of evaluation in the low-resource author style personalized text generation space. Through this work, we question the effectiveness of the widely adopted evaluation metrics like BLEU and ROUGE, and explore other evaluation paradigms such as style embeddings and LLM-as-judge to holistically evaluate the style personalized text generation task. We evaluate these metrics and their ensembles using our style discrimination benchmark, which spans eight writing tasks and evaluates across three settings: domain discrimination, authorship attribution, and LLM personalized vs. non-personalized discrimination. We provide conclusive evidence to adopt an ensemble of diverse evaluation metrics to effectively evaluate style personalized text generation.

[39] LLMs vs. Chinese Anime Enthusiasts: A Comparative Study on Emotionally Supportive Role-Playing

Lanlan Qiu, Xiao Pu, Yeqi Feng, Tianxing He

Main category: cs.CL

TL;DR: The paper introduces ChatAnime, a dataset for evaluating LLMs in emotionally supportive role-playing (ESRP) with anime characters, showing top LLMs outperform humans in role-playing and emotional support but lag in diversity.

DetailsMotivation: To bridge the gap in combining LLMs' role-playing and emotional support capabilities, focusing on anime characters for their defined traits and fan appeal.

Method: Created ChatAnime dataset with 20 anime characters, 60 scenarios, and 40 enthusiasts. Collected dialogue from 10 LLMs and humans, evaluated using 9 metrics.

Result: Top LLMs excel in role-playing and emotional support, while humans lead in response diversity.

Conclusion: The work provides resources for optimizing LLMs in ESRP, with datasets available for future research.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing conversations and providing emotional support as separate research directions. However, there remains a significant research gap in combining these capabilities to enable emotionally supportive interactions with virtual characters. To address this research gap, we focus on anime characters as a case study because of their well-defined personalities and large fan bases. This choice enables us to effectively evaluate how well LLMs can provide emotional support while maintaining specific character traits. We introduce ChatAnime, the first Emotionally Supportive Role-Playing (ESRP) dataset. We first thoughtfully select 20 top-tier characters from popular anime communities and design 60 emotion-centric real-world scenario questions. Then, we execute a nationwide selection process to identify 40 Chinese anime enthusiasts with profound knowledge of specific characters and extensive experience in role-playing. Next, we systematically collect two rounds of dialogue data from 10 LLMs and these 40 Chinese anime enthusiasts. To evaluate the ESRP performance of LLMs, we design a user experience-oriented evaluation system featuring 9 fine-grained metrics across three dimensions: basic dialogue, role-playing and emotional support, along with an overall metric for response diversity. In total, the dataset comprises 2,400 human-written and 24,000 LLM-generated answers, supported by over 132,000 human annotations. Experimental results show that top-performing LLMs surpass human fans in role-playing and emotional support, while humans still lead in response diversity. We hope this work can provide valuable resources and insights for future research on optimizing LLMs in ESRP. Our datasets are available at https://github.com/LanlanQiu/ChatAnime.

[40] Quantifying Conversation Drift in MCP via Latent Polytope

Haoran Shi, Hongwei Yao, Shuo Shao, Shaopeng Jiao, Ziqi Peng, Zhan Qin, Cong Wang

Main category: cs.CL

TL;DR: SecMCP is a secure framework addressing security risks in Model Context Protocol (MCP) by detecting and quantifying adversarial-induced conversation drift in LLMs.

DetailsMotivation: MCP's non-isolated execution context introduces security and privacy risks like tool poisoning and indirect prompt injection, which existing defenses fail to address adequately.

Method: SecMCP models LLM activation vectors in a latent polytope space to detect anomalous shifts in conversational dynamics.

Result: Evaluated on three LLMs and benchmark datasets, SecMCP achieves AUROC scores >0.915, proving robust detection without compromising usability.

Conclusion: SecMCP effectively mitigates MCP security threats, offering a novel methodology for quantifying conversation drift and validated empirical efficacy.

Abstract: The Model Context Protocol (MCP) enhances large language models (LLMs) by integrating external tools, enabling dynamic aggregation of real-time data to improve task execution. However, its non-isolated execution context introduces critical security and privacy risks. In particular, adversarially crafted content can induce tool poisoning or indirect prompt injection, leading to conversation hijacking, misinformation propagation, or data exfiltration. Existing defenses, such as rule-based filters or LLM-driven detection, remain inadequate due to their reliance on static signatures, computational inefficiency, and inability to quantify conversational hijacking. To address these limitations, we propose SecMCP, a secure framework that detects and quantifies conversation drift, deviations in latent space trajectories induced by adversarial external knowledge. By modeling LLM activation vectors within a latent polytope space, SecMCP identifies anomalous shifts in conversational dynamics, enabling proactive detection of hijacking, misleading, and data exfiltration. We evaluate SecMCP on three state-of-the-art LLMs (Llama3, Vicuna, Mistral) across benchmark datasets (MS MARCO, HotpotQA, FinQA), demonstrating robust detection with AUROC scores exceeding 0.915 while maintaining system usability. Our contributions include a systematic categorization of MCP security threats, a novel latent polytope-based methodology for quantifying conversation drift, and empirical validation of SecMCP’s efficacy.
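The latent-drift idea can be approximated with a simpler region than a polytope: fit the distribution of benign activations and flag turns whose activations land far outside it. The Mahalanobis ellipsoid below is an illustrative stand-in for SecMCP's latent polytope construction, with synthetic activations instead of real LLM states.

```python
# Sketch of latent drift scoring: fit a benign region in activation space and
# flag turns that fall far outside it.
import numpy as np

rng = np.random.default_rng(0)
benign = rng.normal(size=(500, 32))                # stand-in for clean-turn activations

mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-6 * np.eye(32)
cov_inv = np.linalg.inv(cov)

def drift_score(activation):
    """Mahalanobis distance of one activation vector from the benign region."""
    d = activation - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Calibrate a threshold on benign data, then score a shifted (adversarial) turn.
threshold = np.quantile([drift_score(a) for a in benign], 0.99)
hijacked = rng.normal(loc=1.5, size=32)            # shifted, adversarial turn
print(drift_score(hijacked) > threshold)           # True: flag the conversation
```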

[41] Learning the Topic, Not the Language: How LLMs Classify Online Immigration Discourse Across Languages

Andrea Nasuto, Stefano Maria Iacus, Francisco Rowe, Devika Jain

Main category: cs.CL

TL;DR: LLMs fine-tuned in one or two languages can classify immigration-related content in unseen languages, but multilingual fine-tuning improves stance detection. Minimal exposure to underrepresented languages corrects pre-training biases.

DetailsMotivation: To investigate if knowledge from fine-tuning in a few languages transfers to unseen languages and if pre-training biases can be corrected with minimal intervention.

Method: Fine-tuned lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual datasets to classify immigration-related tweets across 13 languages.

Result: LLMs fine-tuned in one or two languages reliably classify content in unseen languages. Multilingual fine-tuning improves stance detection. Minimal exposure to underrepresented languages corrects biases.

Conclusion: Limited language coverage suffices for topic-level generalization, and lightweight interventions can correct structural biases. Open-source models offer scalable, cost-effective alternatives to proprietary LLMs.

Abstract: Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62\times10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.

[42] Echoes of Automation: The Increasing Use of LLMs in Newsmaking

Abolfazl Ansari, Delvin Ce Zhang, Nafis Irtiza Tripto, Dongwon Lee

Main category: cs.CL

TL;DR: The study analyzes AI-generated content in news articles, finding increased GenAI use, especially in local and college media, with AI often used in introductions but not conclusions. GenAI improves readability but reduces formality.

DetailsMotivation: To assess the impact of Generative AI (GenAI) on journalistic integrity and authorship by examining its prevalence and effects in news articles.

Method: Analyzed over 40,000 news articles using three AI-text detectors (Binoculars, Fast-DetectGPT, GPTZero) and conducted sentence-level and linguistic analysis.

Result: Substantial rise in GenAI use, especially in local and college news; AI often used in introductions, not conclusions; GenAI boosts readability but reduces formality.

Conclusion: GenAI is increasingly used in journalism, altering writing styles but raising concerns about uniformity and integrity.

Abstract: The rapid rise of Generative AI (GenAI), particularly LLMs, poses concerns for journalistic integrity and authorship. This study examines AI-generated content across over 40,000 news articles from major, local, and college news media, in various media formats. Using three advanced AI-text detectors (Binoculars, Fast-DetectGPT, and GPTZero), we find a substantial increase in GenAI use in recent years, especially in local and college news. Sentence-level analysis reveals LLMs are often used in the introductions of news articles, while conclusions are usually written manually. Linguistic analysis shows GenAI boosts word richness and readability but lowers formality, leading to more uniform writing styles, particularly in local media.

[43] SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang

Main category: cs.CL

TL;DR: SlimInfer accelerates LLM inference by pruning less critical prompt tokens dynamically, achieving significant speedup without performance loss.

DetailsMotivation: High computational demands limit long-context inference in LLMs, even with optimized attention methods. SlimInfer addresses this by pruning redundant tokens.

Method: Proposes a dynamic fine-grained pruning mechanism for hidden states, leveraging information diffusion to maintain semantic integrity while removing redundant tokens.

Result: Achieves up to 2.53× TTFT speedup and 1.88× latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, with no performance drop on LongBench.

Conclusion: SlimInfer effectively reduces computational overhead and memory usage, offering a practical solution for efficient long-context LLM inference.

Abstract: Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
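The scoring-and-dropping step at an intermediate layer can be sketched as follows: rank prompt tokens by the attention they receive from recent queries and keep only the top fraction, never dropping the most recent tokens. This omits SlimInfer's asynchronous KV-cache management and uses an assumed attention-based importance score rather than the paper's exact criterion.

```python
# Sketch of attention-guided token pruning at an intermediate layer.
import torch

def prune_tokens(hidden, attn, keep_ratio=0.5, protect_last=4):
    """hidden: (seq, dim); attn: (heads, seq, seq) attention probabilities.
    Score each token by the attention it receives from the last few queries."""
    seq = hidden.size(0)
    scores = attn[:, -protect_last:, :].mean(dim=(0, 1))   # (seq,)
    scores[-protect_last:] = float("inf")                  # never drop recent tokens
    k = max(protect_last, int(seq * keep_ratio))
    keep = scores.topk(k).indices.sort().values            # preserve token order
    return hidden[keep], keep

hidden = torch.randn(16, 64)
attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)
pruned, kept = prune_tokens(hidden, attn)
print(hidden.shape, "->", pruned.shape)                    # (16, 64) -> (8, 64)
```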

[44] GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5 Team: Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge, Chenghua Huang, Chenhui Zhang, Chenxi Xu, Chenzheng Zhu, Chuang Li, Congfeng Yin, Daoyan Lin, Dayong Yang, Dazhi Jiang, Ding Ai, Erle Zhu, Fei Wang, Gengzheng Pan, Guo Wang, Hailong Sun, Haitao Li, Haiyang Li, Haiyi Hu, Hanyu Zhang, Hao Peng, Hao Tai, Haoke Zhang, Haoran Wang, Haoyu Yang, He Liu, He Zhao, Hongwei Liu, Hongxi Yan, Huan Liu, Huilong Chen, Ji Li, Jiajing Zhao, Jiamin Ren, Jian Jiao, Jiani Zhao, Jianyang Yan, Jiaqi Wang, Jiayi Gui, Jiayue Zhao, Jie Liu, Jijie Li, Jing Li, Jing Lu, Jingsen Wang, Jingwei Yuan, Jingxuan Li, Jingzhao Du, Jinhua Du, Jinxin Liu, Junkai Zhi, Junli Gao, Ke Wang, Lekang Yang, Liang Xu, Lin Fan, Lindong Wu, Lintao Ding, Lu Wang, Man Zhang, Minghao Li, Minghuan Xu, Mingming Zhao, Mingshu Zhai, Pengfan Du, Qian Dong, Shangde Lei, Shangqing Tu, Shangtong Yang, Shaoyou Lu, Shijie Li, Shuang Li, Shuang-Li, Shuxun Yang, Sibo Yi, Tianshu Yu, Wei Tian, Weihan Wang, Wenbo Yu, Weng Lam Tam, Wenjie Liang, Wentao Liu, Xiao Wang, Xiaohan Jia, Xiaotao Gu, Xiaoying Ling, Xin Wang, Xing Fan, Xingru Pan, Xinyuan Zhang, Xinze Zhang, Xiuqing Fu, Xunkai Zhang, Yabo Xu, Yandong Wu, Yida Lu, Yidong Wang, Yilin Zhou, Yiming Pan, Ying Zhang, Yingli Wang, Yingru Li, Yinpei Su, Yipeng Geng, Yitong Zhu, Yongkun Yang, Yuhang Li, Yuhao Wu, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yuxuan Zhang, Zezhen Liu, Zhen Yang, Zhengda Zhou, Zhongpei Qiao, Zhuoer Feng, Zhuorui Liu, Zichen Zhang, Zihan Wang, Zijun Yao, Zikang Wang, Ziqiang Liu, Ziwei Chai, Zixuan Li, Zuodong Zhao, Wenguang Chen, Jidong Zhai, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang

Main category: cs.CL

TL;DR: GLM-4.5 is an open-source MoE model with 355B parameters, excelling in reasoning and agentic tasks, ranking 3rd overall and 2nd in agentic benchmarks.

DetailsMotivation: To advance research in reasoning and agentic AI systems by developing a high-performing, efficient model.

Method: Multi-stage training on 23T tokens, hybrid reasoning, expert iteration, and reinforcement learning.

Result: Scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified.

Conclusion: GLM-4.5 and its compact version, GLM-4.5-Air, are released to foster further research.

Abstract: We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.

[45] HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

Guimin Hu, Daniel Hershcovich, Hasti Seifi

Main category: cs.CL

TL;DR: The paper introduces HapticLLaMA, a multimodal model for generating natural language descriptions from haptic signals, achieving strong performance in haptic captioning tasks.

DetailsMotivation: To address the underexplored area of haptic signals in multimodal research, particularly for applications like virtual reality and accessibility.

Method: Proposes HapticLLaMA, using frequency-based and EnCodec-based tokenizers, trained via supervised fine-tuning and RLHF.

Result: Achieves METEOR score of 59.98 and BLEU-4 score of 32.06, with human ratings showing 61% above 3.5 (7-point scale).

Conclusion: Demonstrates the potential of large language models to adapt to sensory data, improving alignment with human perception.

Abstract: Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
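A frequency-based tokenizer of the kind described can be sketched by quantizing short-time spectra of the vibration signal into discrete units. The frame size, band count, and energy quantizer below are illustrative choices, not the paper's configuration.

```python
# Sketch of a frequency-based haptic tokenizer: short-time spectra of a
# vibration signal are quantized into discrete token IDs.
import numpy as np

def haptic_tokens(signal, frame=256, n_bands=8, n_levels=4):
    """Map a 1-D vibration signal to integer tokens in [0, n_bands * n_levels)."""
    tokens = []
    for start in range(0, len(signal) - frame + 1, frame):
        spectrum = np.abs(np.fft.rfft(signal[start:start + frame]))
        bands = np.array_split(spectrum, n_bands)
        energy = np.array([b.sum() for b in bands])
        dominant = int(energy.argmax())                    # strongest frequency band
        level = int(np.clip(np.log1p(energy[dominant]), 0, n_levels - 1))
        tokens.append(dominant * n_levels + level)         # one discrete unit
    return tokens

t = np.linspace(0, 1, 8000, endpoint=False)
vibration = np.sin(2 * np.pi * 250 * t) * np.exp(-3 * t)   # decaying 250 Hz buzz
print(haptic_tokens(vibration)[:10])
```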

[46] Post-training for Efficient Communication via Convention Formation

Yilun Hua, Evan Wang, Yoav Artzi

Main category: cs.CL

TL;DR: Post-training process improves LLMs’ ability to form conventions in multi-turn interactions, evaluated via new benchmarks.

DetailsMotivation: Humans adapt language and form conventions efficiently in interactions, but LLMs lack this ability naturally.

Method: Targeted fine-tuning on heuristically identified demonstrations of convention formation.

Result: Post-trained LLMs show significantly improved convention formation abilities in two new benchmarks.

Conclusion: The post-training process effectively enhances LLMs’ convention formation capabilities.

Abstract: Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We develop a post-training process to develop this ability through targeted fine-tuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.

[47] Indian Legal NLP Benchmarks: A Survey

Prathamesh Kalamkar, Janani Venugopalan Ph.D., Vivek Raghavan Ph.D.

Main category: cs.CL

TL;DR: The paper emphasizes the need for specialized NLP benchmarks for Indian Legal Text to advance AI in this field.

DetailsMotivation: Legal Text differs significantly from normal English, requiring tailored NLP benchmarks to address specific legal tasks and spur innovation.

Method: The authors review existing work and propose ideas for creating new benchmarks for Indian Legal NLP.

Result: The paper highlights the gap in benchmarks for Indian Legal Text and suggests solutions.

Conclusion: Creating specialized benchmarks will benefit both the AI community and the legal fraternity by fostering innovation in NLP for Indian Legal Text.

Abstract: Availability of challenging benchmarks is the key to advancement of AI in a specific field. Since Legal Text is significantly different from normal English text, there is a need to create separate Natural Language Processing benchmarks for Indian Legal Text which are challenging and focus on tasks specific to Legal Systems. This will spur innovation in applications of Natural Language Processing for Indian Legal Text and will benefit the AI community and the Legal fraternity. We review the existing work in this area and propose ideas to create new benchmarks for Indian Legal Natural Language Processing.

[48] Benchmarking LLMs on the Semantic Overlap Summarization Task

John Salvador, Naman Bansal, Mousumi Akter, Souvika Sarkar, Anupam Das, Shubhra Kanti Karmaker

Main category: cs.CL

TL;DR: The paper benchmarks LLMs on Semantic Overlap Summarization (SOS), introduces the PrivacyPolicyPairs dataset, and evaluates 905,216 summaries using TELeR prompting, with human validation on 540 samples.

DetailsMotivation: To assess LLMs' performance on SOS tasks and expand benchmarks with diverse datasets.

Method: Used TELeR prompting to generate and evaluate summaries across two SOS datasets, including human evaluation.

Result: Analyzed LLM performances and reliability of automatic evaluation.

Conclusion: The study provides insights into LLM capabilities for SOS tasks and evaluation reliability, with open-sourced code and datasets.

Abstract: Semantic Overlap Summarization (SOS) is a constrained multi-document summarization task, where the constraint is to capture the common/overlapping information between two alternative narratives. In this work, we perform a benchmarking study of popular Large Language Models (LLMs) exclusively on the SOS task. Additionally, we introduce the PrivacyPolicyPairs (3P) dataset to expand the space of SOS benchmarks in terms of quantity and variety. This dataset provides 135 high-quality SOS data samples sourced from privacy policy documents. We then use a standard prompting taxonomy called TELeR to create and evaluate 905,216 distinct LLM-generated summaries over two SOS datasets from different domains, and we further conduct human evaluation on a subset of 540 samples. We conclude the paper by analyzing models’ performances and the reliability of automatic evaluation. The code and datasets used to conduct this study are available at https://anonymous.4open.science/r/llm_eval-E16D.

[49] Towards Pareto Optimal Throughput in Small Language Model Serving

Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral

Main category: cs.CL

TL;DR: Benchmarking Small Language Models (SLMs) for efficient inference, showing their Pareto-optimal throughput and benefits of model replication.

DetailsMotivation: To explore the potential of SLMs for resource-constrained users, offering high performance with lower computational and memory demands compared to LLMs.

Method: Conducted experiments benchmarking SLM inference, focusing on performance and energy levels, and analyzed model replication for resource utilization.

Result: SLMs achieve Pareto-optimal throughput within a single accelerator’s capacity, and model replication enhances resource utilization.

Conclusion: SLMs are a viable alternative to LLMs for efficient serving, with model replication further optimizing resource use.

Abstract: Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy consumption. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows for reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
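
As a rough illustration of the replication idea, the sketch below measures aggregate token throughput as the number of model replicas grows. `serve_batch` is a toy stand-in for a real inference-engine call and the timings are simulated, so this shows only the measurement harness, not the paper's results:

```python
# Hedged sketch: estimating aggregate throughput when replicating a small
# model within one accelerator's memory. serve_batch simulates decoding work.
import time
from concurrent.futures import ThreadPoolExecutor

def serve_batch(replica_id: int, num_requests: int, tokens_per_req: int) -> int:
    time.sleep(0.01 * num_requests)        # placeholder for real decoding work
    return num_requests * tokens_per_req   # tokens produced by this replica

def aggregate_throughput(num_replicas: int, requests: int,
                         tokens_per_req: int = 128) -> float:
    per_replica = requests // num_replicas
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_replicas) as pool:
        totals = pool.map(lambda r: serve_batch(r, per_replica, tokens_per_req),
                          range(num_replicas))
        total_tokens = sum(totals)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed          # tokens per second

for n in (1, 2, 4):
    print(f"{n} replica(s): {aggregate_throughput(n, requests=64):,.0f} tok/s")
```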

[50] Extract-and-Abstract: Unifying Extractive and Abstractive Summarization within Single Encoder-Decoder Framework

Yuping Wu, Hao Li, Goran Nenadic, Xiao-Jun Zeng

Main category: cs.CL

TL;DR: The paper introduces a parameter-free highlight method and a novel extract-and-abstract paradigm (ExtAbs) to improve abstractive summarization by reducing error accumulation and training costs.

DetailsMotivation: To address the issues of error accumulation and additional training costs in the Extract-then-Abstract paradigm by integrating extractive and abstractive tasks seamlessly.

Method: Proposes a saliency mask for the encoder-decoder framework and the ExtAbs paradigm, which jointly performs extractive and abstractive tasks within a single model.

Result: ExtAbs outperforms baselines on extractive tasks and matches or surpasses vanilla models on abstractive tasks across three datasets.

Conclusion: The ExtAbs paradigm and saliency mask effectively improve summarization performance while reducing training complexity.

Abstract: Extract-then-Abstract is a naturally coherent paradigm for conducting abstractive summarization with the help of salient information identified by an extractive model. Previous works that adopt this paradigm train the extractor and abstractor separately and introduce extra parameters to highlight the extracted salient content to the abstractor, which results in error accumulation and additional training costs. In this paper, we first introduce a parameter-free highlight method into the encoder-decoder framework: replacing the encoder attention mask with a saliency mask in the cross-attention module to force the decoder to focus only on salient parts of the input. A preliminary analysis compares different highlight methods, demonstrating the effectiveness of our saliency mask. We further propose the novel extract-and-abstract paradigm, ExtAbs, which jointly and seamlessly performs extractive and abstractive summarization tasks within a single encoder-decoder model to reduce error accumulation. In ExtAbs, the vanilla encoder is augmented to extract salient content, and the vanilla decoder is modified with the proposed saliency mask to generate summaries. Built upon BART and PEGASUS, experiments on three datasets show that ExtAbs achieves better performance than baselines on the extractive task and performs comparably to, or even better than, the vanilla models on the abstractive task.
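
A minimal sketch of the saliency-mask idea, assuming a single-head cross-attention layer and an externally supplied list of salient encoder positions; the paper's actual integration into BART/PEGASUS is more involved:

```python
# Hedged sketch: in cross-attention, non-salient encoder positions are masked
# out so the decoder attends only to extracted content. Shapes and the mask
# construction are illustrative.
import torch
import torch.nn.functional as F

def cross_attention_with_saliency(q, k, v, salient_positions, src_len):
    """q: (tgt, d); k, v: (src, d); salient_positions: indices kept visible."""
    mask = torch.full((src_len,), float("-inf"))
    mask[salient_positions] = 0.0                   # keep only salient tokens
    scores = q @ k.T / (q.size(-1) ** 0.5) + mask   # broadcast over tgt rows
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(4, 16)   # 4 decoder positions
k = torch.randn(10, 16)  # 10 encoder positions
v = torch.randn(10, 16)
out = cross_attention_with_saliency(q, k, v, salient_positions=[1, 2, 7], src_len=10)
print(out.shape)  # torch.Size([4, 16])
```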

[51] Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

Rachneet Sachdeva, Rima Hazra, Iryna Gurevych

Main category: cs.CL

TL;DR: POATE is a novel jailbreak technique exploiting contrastive reasoning to provoke unethical responses in large language models, achieving high success rates. Countermeasures like Intent-Aware CoT and Reverse Thinking CoT are proposed to detect and reject harmful outputs.

DetailsMotivation: Large language models, despite alignment efforts, remain vulnerable to subtle reasoning-driven jailbreak attacks, which existing safety measures fail to address.

Method: POATE uses polar opposite query generation, adversarial template construction, and elaboration to exploit reasoning vulnerabilities. Countermeasures involve decomposing queries (Intent-Aware CoT) and reverse reasoning (Reverse Thinking CoT).

Result: POATE achieves ~44% attack success rate across six model families, outperforming existing methods. Countermeasures improve reasoning robustness and defense.

Conclusion: POATE highlights the need for advanced defenses against reasoning-driven attacks, with proposed methods offering effective mitigation strategies.

Abstract: Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model’s defense against adversarial exploits.

[52] The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs

Nitay Calderon, Roi Reichart, Rotem Dror

Main category: cs.CL

TL;DR: The paper introduces the Alternative Annotator Test (alt-test) to rigorously evaluate if LLMs can replace human annotators, using a subset of annotated examples. It also proposes a measure for comparing LLM annotators and judges, demonstrating effectiveness with diverse datasets and LLMs.

DetailsMotivation: LLMs are increasingly used as annotators and judges, but there's no standard to assess their suitability for replacing humans. This gap motivates the need for a rigorous evaluation method.

Method: The authors propose the Alternative Annotator Test (alt-test) and a versatile measure for comparing LLM annotators. They test these on ten diverse datasets using six LLMs and four prompting techniques.

Result: Results show closed-source LLMs (e.g., GPT-4o) can sometimes replace humans, outperforming open-source LLMs, and prompting techniques vary in judge quality.

Conclusion: The study advocates for more rigorous practices in using LLMs as annotators and judges, demonstrating the alt-test’s effectiveness in justifying their use.

Abstract: The “LLM-as-an-annotator” and “LLM-as-a-judge” paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that closed-source LLMs (such as GPT-4o) can sometimes replace humans, outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
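
The core comparison behind the alt-test can be sketched as follows. This simplified version computes only a raw "winning rate" of the LLM against each held-out human annotator; the paper's actual procedure adds a paired statistical test and a cost-advantage margin, which are omitted here:

```python
# Hedged, simplified sketch: for each human annotator, check whether the LLM
# agrees with the *other* humans at least as well as that annotator does.
import numpy as np

def winning_rate(human_labels: np.ndarray, llm_labels: np.ndarray) -> float:
    """human_labels: (n_annotators, n_items); llm_labels: (n_items,)."""
    n_annotators, _ = human_labels.shape
    wins = 0
    for i in range(n_annotators):
        others = np.delete(human_labels, i, axis=0)   # leave annotator i out
        llm_agree = (others == llm_labels).mean()     # LLM vs the rest
        human_agree = (others == human_labels[i]).mean()  # annotator i vs the rest
        wins += llm_agree >= human_agree
    return wins / n_annotators

rng = np.random.default_rng(0)
humans = rng.integers(0, 2, size=(5, 40))   # 5 annotators, 40 binary items
llm = humans[0].copy()                      # toy LLM that mimics annotator 0
print(f"winning rate: {winning_rate(humans, llm):.2f}")
```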

[53] Neural Contextual Reinforcement Framework for Logical Structure Language Generation

Marcus Irvin, William Cooper, Edward Hughes, Jessica Morgan, Christopher Hamilton

Main category: cs.CL

TL;DR: The Neural Contextual Reinforcement Framework improves text coherence in large language models using reinforcement learning, custom rewards, and dynamic context alignment, outperforming baselines in coherence, perplexity, and semantic alignment.

DetailsMotivation: To enhance logical coherence and structural consistency in text generated by large language models, addressing challenges like long-range dependencies.

Method: Integrates reinforcement learning with custom reward functions, dynamic context alignment, multi-head attention, and hierarchical encoding.

Result: Shows improvements in coherence metrics, perplexity reduction, semantic alignment, narrative clarity, and robustness to noisy inputs.

Conclusion: The framework is versatile, scalable, and efficient, with strong cross-lingual performance and practical deployment potential.

Abstract: The Neural Contextual Reinforcement Framework introduces an innovative approach to enhancing the logical coherence and structural consistency of text generated by large language models. Leveraging reinforcement learning principles, the framework integrates custom reward functions and dynamic context alignment mechanisms to address challenges inherent in maintaining long-range dependencies across extended sequences. The architecture incorporates multi-head attention layers and hierarchical encoding modules, enabling the model to produce outputs that align closely with human expectations of logical structure and semantic flow. Quantitative evaluations across diverse datasets demonstrate substantial improvements in coherence metrics, perplexity reduction, and semantic alignment, showcasing the framework’s ability to outperform baseline models in both general and domain-specific tasks. Qualitative analyses further highlight the framework’s capacity to generate text with improved narrative clarity and reduced redundancy, reflecting its effectiveness in balancing fluency with structural precision. In addition to its performance gains, the framework exhibits robustness in handling noisy input data and scalability across varying model sizes, reinforcing its versatility in practical applications. Experimental results reveal that optimal context window sizes significantly influence coherence outcomes, showing the importance of architectural flexibility in adapting to diverse linguistic structures. Cross-lingual performance evaluations affirm the framework’s adaptability to multiple languages, extending its utility beyond monolingual contexts. Resource efficiency analyses indicate a reduction in computational overhead compared to traditional approaches, emphasizing the practicality of the framework for large-scale deployment.

[54] Architectural Fusion Through Contextual Partitioning in Large Language Models: A Novel Approach to Parameterized Knowledge Integration

Offa Kingsleigh, Alfred Abercrombie, David Woolstencroft, Beorhtric Meadowcroft, Marcus Irvin

Main category: cs.CL

TL;DR: Contextual Partitioning dynamically segments parameters in computational models for task-specific specialization, improving accuracy, efficiency, and scalability without external fine-tuning.

DetailsMotivation: Enhance architectural design of large-scale models by addressing redundancy and inefficiency in conventional parameter optimization techniques.

Method: Uses adaptive parameter allocation aligned with linguistic features and gradient-driven segmentation for dynamic recalibration.

Result: Substantial improvements in accuracy, perplexity, and contextual coherence; reduced memory usage and training times.

Conclusion: Contextual Partitioning redefines scalability and adaptability of language architectures, expanding applications in complex domains.

Abstract: Contextual Partitioning introduces an innovative approach to enhancing the architectural design of large-scale computational models through the dynamic segmentation of parameters into context-aware regions. This methodology emphasizes the importance of task-specific specialization, achieved through adaptive parameter allocation mechanisms that align with the linguistic features of input data. Experimental evaluations demonstrated substantial improvements in accuracy, perplexity, and contextual coherence across a variety of linguistic tasks, highlighting the adaptability and scalability of the proposed framework. By reducing redundancy and enhancing computational efficiency, Contextual Partitioning not only streamlines model operations but also expands the scope of applications for advanced language processing systems. The approach operates autonomously, requiring no external fine-tuning, thereby addressing a significant limitation in conventional parameter optimization techniques. Empirical results demonstrate the effectiveness of gradient-driven segmentation, enabling models to dynamically recalibrate and specialize in response to task-specific demands. Furthermore, resource utilization metrics reveal notable reductions in memory usage and training times, confirming the efficiency of the approach. Observations from qualitative analyses illustrate improved contextual coherence and logical flow in generated outputs, reinforcing the practical value of this technique. The findings collectively demonstrate the potential for Contextual Partitioning to redefine the scalability and adaptability of computational language architectures in diverse and complex domains.

[55] Autonomous Structural Memory Manipulation for Large Language Models Using Hierarchical Embedding Augmentation

Derek Yotheringhay, Alistair Kirkland, Humphrey Kirkbride, Josiah Whitesteeple

Main category: cs.CL

TL;DR: The paper introduces hierarchical embedding augmentation and dynamic memory manipulation to improve token representation and computational efficiency, showing significant gains in accuracy and adaptability.

DetailsMotivation: To address limitations of static memory architectures and enhance adaptability to complex linguistic inputs and diverse tasks.

Method: Uses hierarchical embedding augmentation and autonomous structural memory manipulation for dynamic token representation and memory reallocation.

Result: Substantial improvements in computational efficiency, accuracy, and interpretability, especially for complex contextual tasks.

Conclusion: The proposed framework effectively combines embedding and memory strategies, offering scalability and robustness for diverse applications.

Abstract: Transformative innovations in model architectures have introduced hierarchical embedding augmentation as a means to redefine the representation of tokens through multi-level semantic structures, offering enhanced adaptability to complex linguistic inputs. Autonomous structural memory manipulation further advances this paradigm through dynamic memory reallocation mechanisms that prioritize critical contextual features while suppressing less relevant information, enabling scalable and efficient performance across diverse tasks. Experimental results reveal substantial improvements in computational efficiency, with marked reductions in processing overhead for longer input sequences, achieved through memory reorganization strategies that adapt to evolving contextual requirements. Hierarchical embeddings not only improved contextual alignment but also facilitated task generalization by capturing relationships at varying semantic granularities, ensuring coherence across layers without introducing significant computational redundancies. Comparative analysis against baseline models demonstrated unique advantages in accuracy, efficiency, and interpretability, particularly in tasks requiring complex contextual understanding or domain-specific adaptability. The ability to dynamically adjust token representations and memory configurations contributed to the model’s robustness under varied and unpredictable input conditions. Applications benefiting from these advancements include multi-domain generalization, interactive systems, and scenarios involving real-time decision-making, where traditional static memory architectures often face limitations. The proposed methodology combines advanced embedding and memory management strategies into a cohesive framework that addresses scalability challenges while preserving task-specific relevance.

[56] Contextual Reinforcement in Multimodal Token Compression for Large Language Models

Naderdel Piero, Zacharias Cromwell, Nathaniel Wainwright, Matthias Nethercott

Main category: cs.CL

TL;DR: A novel contextual reinforcement mechanism dynamically adjusts token importance, reducing token usage while preserving quality and coherence, with improved accuracy and efficiency across domains.

DetailsMotivation: Addressing the challenge of effective token compression for scaling models to handle complex and diverse datasets.

Method: Uses contextual reinforcement with graph-based algorithms and adaptive weighting to capture contextual relationships in textual and multimodal data.

Result: Significant improvements in accuracy, semantic retention, and computational efficiency, with reduced semantic loss and syntactic inconsistencies.

Conclusion: Contextual reinforcement redefines token management and advances large-scale model design, with scalable implementation potential.

Abstract: Effective token compression remains a critical challenge for scaling models to handle increasingly complex and diverse datasets. A novel mechanism based on contextual reinforcement is introduced, dynamically adjusting token importance through interdependencies and semantic relevance. This approach enables substantial reductions in token usage while preserving the quality and coherence of information representation. Incorporating graph-based algorithms and adaptive weighting, the method captures subtle contextual relationships across textual and multimodal data, ensuring robust alignment and performance in downstream tasks. Evaluations across varied domains reveal significant improvements in accuracy and semantic retention, particularly for tasks requiring detailed cross-modal interactions. Memory usage analyses demonstrate improved computational efficiency, with minimal overhead despite the additional reinforcement processes. Performance gains are further validated through error distribution analyses, showing reduced semantic loss and syntactic inconsistencies compared to baseline models. The modular architecture ensures compatibility with a wide range of open-source frameworks, facilitating scalable implementation for real-world applications. These findings highlight the potential of contextual reinforcement in redefining token management strategies and advancing large-scale model design.
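
The abstract does not specify the graph algorithm, but one plausible reading is centrality scoring over a token-similarity graph; the sketch below is an illustration of that reading only, not the paper's method:

```python
# Hedged sketch: graph-based token-importance scoring for compression. Tokens
# are scored by weighted degree in a cosine-similarity graph (a simple
# stand-in for the unspecified graph algorithm), and the top fraction is kept.
import torch

def compress_tokens(emb: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """emb: (seq, d). Returns the retained rows in original order."""
    sim = torch.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
    importance = sim.sum(dim=-1)                      # weighted degree centrality
    k = max(1, int(keep_ratio * emb.size(0)))
    keep = importance.topk(k).indices.sort().values   # preserve token order
    return emb[keep]

emb = torch.randn(12, 32)
print(compress_tokens(emb).shape)  # torch.Size([6, 32])
```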

[57] Structural Embedding Projection for Contextual Large Language Model Inference

Vincent Enoasmo, Cedric Featherstonehaugh, Xavier Konstantinopoulos, Zacharias Huntington

Main category: cs.CL

TL;DR: Structural Embedding Projection (SEP) refines token representations using projection matrices, improving semantic fidelity and coherence in language models without major computational overhead.

DetailsMotivation: To enhance the efficiency and coherence of language model inference by integrating hierarchical and relational dependencies into token representations.

Method: Introduces Structural Embedding Projection (SEP), a mechanism using projection matrices to refine embeddings, capturing structured contextual relationships.

Result: SEP reduced perplexity, improved contextual coherence, and enhanced narrative consistency in generated text, with dataset-dependent trade-offs in efficiency.

Conclusion: SEP effectively refines language model outputs by balancing representational richness and computational efficiency, though requiring precise optimization.

Abstract: Structured embedding transformations offer a promising approach for enhancing the efficiency and coherence of language model inference. The introduction of Structural Embedding Projection (SEP) provides a mechanism for refining token representations through projection matrices that integrate hierarchical and relational dependencies. The mathematical formulation of SEP enables embedding spaces to capture structured contextual relationships, thereby improving semantic fidelity without significantly increasing computational overhead. Experimental evaluations conducted on a range of linguistic datasets revealed that SEP contributed to reductions in perplexity and enhanced contextual coherence, demonstrating its potential to refine language model outputs. Computational efficiency assessments highlighted variations across different datasets, suggesting that the integration of structured embeddings introduced dataset-dependent trade-offs between inference speed and representational richness. The qualitative analysis of generated responses indicated that SEP enhanced narrative consistency and topic alignment, leading to improved fluency in multi-sentence text generation. The modifications to embedding layers required precise optimization to ensure stable training dynamics, as the introduction of structured transformations altered the traditional representation-learning process. The architectural adjustments necessary for SEP implementation influenced inference latency and memory consumption, requiring a balance between efficiency gains and additional processing demands. The impact of SEP on lexical diversity suggested that embedding modifications influenced the model’s vocabulary usage, reflecting a more context-aware selection of generated tokens.
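
The abstract leaves the exact formulation open; below is one plausible reading of SEP, sketched under the assumption that structural relations are supplied as a token-adjacency matrix. The class name and shapes are illustrative:

```python
# Hedged sketch: token embeddings refined by a learned projection that mixes
# in structured relations, followed by a residual update. Illustrative only.
import torch
import torch.nn as nn

class StructuralEmbeddingProjection(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model, bias=False)  # projection matrix
        self.norm = nn.LayerNorm(d_model)

    def forward(self, embeddings: torch.Tensor, relation: torch.Tensor) -> torch.Tensor:
        """embeddings: (seq, d); relation: (seq, seq) structural adjacency."""
        structured = relation @ self.proj(embeddings)   # aggregate related tokens
        return self.norm(embeddings + structured)       # residual refinement

sep = StructuralEmbeddingProjection(d_model=32)
emb = torch.randn(6, 32)
rel = torch.eye(6) + torch.diag(torch.ones(5), 1)       # toy dependency structure
print(sep(emb, rel).shape)  # torch.Size([6, 32])
```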

[58] Context-Preserving Tensorial Reconfiguration in Large Language Model Training

Larin Tonix, Morgana Baskerville, Nathaniel Stourton, Ophelia Tattershall

Main category: cs.CL

TL;DR: CPTR introduces dynamic tensorial reconfiguration to improve long-range dependency handling in neural models, enhancing efficiency and performance.

DetailsMotivation: Addressing challenges in long-range dependency handling and inefficient contextual retention in neural architectures.

Method: Context-Preserving Tensorial Reconfiguration (CPTR) via structured factorization and adaptive contraction.

Result: Improved coherence retention, reduced perplexity, better recall accuracy, and enhanced computational efficiency.

Conclusion: CPTR refines neural architectures for long-range tasks, offering stability and efficiency.

Abstract: Handling long-range dependencies in neural architectures has remained a persistent challenge due to computational limitations and inefficient contextual retention mechanisms. Tensorial operations have provided a foundation for restructuring model representations, yet conventional architectures have struggled to incorporate such techniques without introducing excessive complexity. A novel approach, Context-Preserving Tensorial Reconfiguration (CPTR), enables dynamic reorganization of weight tensors through structured factorization and adaptive contraction, allowing for enhanced contextual integration without substantial computational overhead. Empirical evaluations demonstrate that CPTR improves coherence retention across extended sequences, leading to measurable reductions in perplexity and improved recall accuracy for long-context tasks. Performance comparisons reveal that CPTR-enhanced models exhibit greater computational efficiency and reduced memory consumption while maintaining competitive language generation fluency and accuracy. Gradient stability metrics further validate the improved training efficiency, revealing more controlled variance in weight updates. Comparative studies across baseline and CPTR-enhanced models confirm that tensorial reconfiguration contributes to more stable and computationally efficient language modeling. The findings support the potential of CPTR in refining contemporary neural architectures for tasks requiring long-range contextual understanding and efficient memory utilization.

[59] Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations

Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

Main category: cs.CL

TL;DR: Contextual morphogenesis introduces dynamic tokenization, improving perplexity and stability in language models while balancing computational overhead.

DetailsMotivation: Conventional tokenization lacks adaptability to evolving contextual relationships, limiting model efficiency and accuracy.

Method: A self-organizing mechanism dynamically adjusts token boundaries based on contextual dependencies, with iterative embedding updates.

Result: Reduced perplexity, improved representational stability, and better alignment with contextual cues, especially in complex linguistic domains.

Conclusion: Contextual morphogenesis is a viable alternative to static tokenization, with hybrid static-dynamic strategies suggested for optimal efficiency.

Abstract: Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The introduction of contextual morphogenesis establishes a self-organizing mechanism that restructures token boundaries based on learned contextual dependencies, allowing embeddings to evolve progressively across iterative processing steps. Empirical evaluations demonstrate that dynamically adjusted tokenization contributes to reductions in perplexity while maintaining representational stability, particularly in linguistically complex domains where static segmentation fails to capture nuanced dependencies. Computational trade-offs associated with self-organizing token structures indicate that additional processing overhead remains within feasible limits, provided that optimization strategies account for segmentation update efficiency. Comparative assessments across different linguistic corpora suggest that adaptive tokenization preserves interpretability while improving alignment with contextual cues, reinforcing the potential of morphogenetic segmentation mechanisms to refine predictive accuracy. Stability analyses confirm that evolving token structures maintain consistent segmentation behaviors across varied text distributions, ensuring that representational adaptations remain linguistically coherent. The effectiveness of contextual morphogenesis in refining structural stability and predictive performance highlights its viability as an alternative to traditional tokenization methods. Further analysis of computational efficiency considerations suggests that hybrid strategies integrating both static and dynamic segmentation techniques may offer a balanced approach to optimizing representational flexibility while maintaining inference efficiency.

[60] Context-Aware Hierarchical Merging for Long Document Summarization

Litu Ou, Mirella Lapata

Main category: cs.CL

TL;DR: The paper proposes methods to reduce hallucinations in hierarchical merging for long-text summarization by enriching intermediate summaries with source context, showing improved performance over baselines.

DetailsMotivation: To address the issue of amplified hallucinations in hierarchical merging due to recursive summarization, the paper aims to mitigate inaccuracies by incorporating source document context.

Method: The authors propose contextual augmentation techniques: replacing, refining, or aligning intermediate summaries with source context. Experiments use legal and narrative datasets with the Llama 3.1 model.

Result: Contextual augmentation outperforms zero-shot and hierarchical merging baselines, with refinement methods performing best when paired with extractive summarization.

Conclusion: Enriching hierarchical merging with source context effectively reduces hallucinations, with refinement methods showing the most promise.

Abstract: Hierarchical Merging is a technique commonly used to summarize very long texts (>100K tokens) by breaking down the input into smaller sections, summarizing those sections individually, and then merging or combining those summaries into a final coherent summary. Although it helps address the limitations of large language models (LLMs) with fixed input length constraints, the recursive merging process can amplify LLM hallucinations, increasing the risk of factual inaccuracies. In this paper, we seek to mitigate hallucinations by enriching hierarchical merging with context from the source document. Specifically, we propose different approaches to contextual augmentation ranging from *replacing* intermediate summaries with relevant input context, to *refining* them while using the context as supporting evidence, and *aligning* them implicitly (via citations) to the input. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines for the Llama 3.1 model family. Our analysis further reveals that refinement methods tend to perform best when paired with extractive summarization for identifying relevant input.
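
A minimal sketch of the *refining* variant, assuming a generic text-generation callable and naive keyword-overlap retrieval in place of the paper's actual components:

```python
# Hedged sketch: merge intermediate summaries while feeding relevant source
# sentences back in as supporting evidence. The llm callable and retrieval
# heuristic are stand-ins for real components (e.g., a Llama 3.1 endpoint).
from typing import Callable

def retrieve_context(summary: str, source: str, k: int = 2) -> str:
    """Naive keyword-overlap retrieval of supporting sentences."""
    sents = source.split(". ")
    scored = sorted(sents, key=lambda s: -len(set(s.split()) & set(summary.split())))
    return ". ".join(scored[:k])

def merge_with_refinement(summaries: list[str], source: str,
                          llm: Callable[[str], str]) -> str:
    merged = summaries[0]
    for nxt in summaries[1:]:
        evidence = retrieve_context(merged + " " + nxt, source)
        merged = llm(
            "Merge the two summaries into one coherent summary, using the "
            "evidence to correct any unsupported claims.\n"
            f"Evidence: {evidence}\nSummary 1: {merged}\nSummary 2: {nxt}"
        )
    return merged

# Toy stand-in for a real model call, so the sketch runs end to end.
echo_llm = lambda prompt: prompt.splitlines()[-1]
print(merge_with_refinement(["Summary A.", "Summary B."],
                            "Source one. Source two.", echo_llm))
```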

[61] Gradient-Regularized Latent Space Modulation in Large Language Models for Structured Contextual Synthesis

Derek Yotheringhay, Beatrix Nightingale, Maximilian Featherstone, Edmund Worthington, Hugo Ashdown

Main category: cs.CL

TL;DR: The paper introduces Gradient-Regularized Latent Space Modulation (GRLSM) to improve structured text generation by enforcing constraints in latent space, enhancing coherence and stability.

DetailsMotivation: Conventional methods for structured text generation lack flexibility and generalizability, prompting the need for a novel approach like GRLSM.

Method: GRLSM applies gradient-based regularization in latent space to ensure smoother encoding, structural consistency, and logical progression in generated text.

Result: GRLSM reduces perplexity, increases coherence scores, improves structural alignment, and enhances semantic consistency under perturbations.

Conclusion: GRLSM refines text organization, boosts interpretability, and maintains generative flexibility while reducing structural inconsistencies.

Abstract: Generating structured textual content requires mechanisms that enforce coherence, stability, and adherence to predefined constraints while maintaining semantic fidelity. Conventional approaches often rely on rule-based heuristics or fine-tuning strategies that lack flexibility and generalizability across diverse tasks. The incorporation of Gradient-Regularized Latent Space Modulation (GRLSM) introduces a novel paradigm for guiding text generation through the application of structured constraints within the latent space. The integration of gradient-based regularization mitigates abrupt variations in latent representations, ensuring a smoother encoding process that enhances structural consistency and logical progression within generated sequences. Comparative evaluations demonstrate that latent space modulation leads to a reduction in perplexity, increased coherence scores, and improved structural alignment across multiple domains. Stability assessments further indicate that the imposition of spectral norm constraints facilitates more controlled variations in generated text, preserving semantic consistency under input perturbations. Empirical results confirm that structured latent space constraints not only refine the organization of generated outputs but also enhance interpretability through more predictable and reliable synthesis patterns. Performance metrics illustrate that the GRLSM framework substantially reduces structural inconsistencies while preserving the generative flexibility inherent in neural models.
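
One generic way to realize a gradient-based latent regularizer of this kind is a penalty on the gradient of the task loss with respect to the latent representations. The sketch below is an assumption-laden illustration of that generic idea, not the paper's stated objective:

```python
# Hedged sketch: penalize the gradient of a task loss w.r.t. latents,
# discouraging abrupt variation in the latent space. Illustrative only.
import torch

def gradient_regularized_loss(task_loss: torch.Tensor,
                              latents: torch.Tensor,
                              lam: float = 0.1) -> torch.Tensor:
    grads = torch.autograd.grad(task_loss, latents, create_graph=True)[0]
    penalty = grads.pow(2).sum(dim=-1).mean()   # smoothness penalty
    return task_loss + lam * penalty

latents = torch.randn(8, 16, requires_grad=True)
task_loss = (latents.tanh() - 1).pow(2).mean()  # stand-in for a decoder loss
total = gradient_regularized_loss(task_loss, latents)
total.backward()
print(float(total))
```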

[62] Latent Structure Modulation in Large Language Models Through Stochastic Concept Embedding Transitions

Stefan Whitaker, Colin Sisate, Marcel Windsor, Nikolai Fairweather, Tarquin Goldborough, Oskar Lindenfeld

Main category: cs.CL

TL;DR: Stochastic embedding transitions dynamically adjust token representations probabilistically during inference, improving lexical diversity, coherence, and low-frequency vocabulary retention while maintaining semantic integrity.

DetailsMotivation: To address the limitations of static or deterministic embeddings by introducing adaptability without losing semantic coherence.

Method: A probabilistic transition framework for token embeddings, evaluated empirically for lexical diversity, coherence, and embedding drift.

Result: Improved generative coherence, lexical diversity, and text completion accuracy with minor computational overhead.

Conclusion: Stochastic transitions enhance representation expressiveness and adaptability while preserving linguistic coherence, making them feasible for large-scale applications.

Abstract: Stochastic embedding transitions introduce a probabilistic mechanism for adjusting token representations dynamically during inference, mitigating the constraints imposed through static or deterministic embeddings. A transition framework was proposed in which each token embedding evolved through probabilistic updates, ensuring adaptability while preserving semantic integrity across linguistic contexts. Empirical evaluations demonstrated that models incorporating stochastic transitions exhibited greater lexical diversity, improved generative coherence, and enhanced retention of low-frequency vocabulary, contributing to more varied sentence structures and reduced reliance on high-probability token selections. Statistical analyses of embedding drift across transformer layers indicated that representations evolved more flexibly without losing coherence, supporting the hypothesis that controlled stochasticity facilitated context-sensitive representation learning. Experimental results revealed that probabilistic embeddings introduced minor computational overhead while maintaining generative efficiency, reinforcing their feasibility in large-scale applications. A comparative study with traditional embedding approaches highlighted measurable gains in text completion accuracy, dialogue coherence, and structural complexity, confirming the effectiveness of stochastic transitions in enhancing representation expressiveness. Clustering patterns in the embedding space suggested that probabilistic updates preserved meaningful semantic groupings while enabling context-driven shifts, further validating the stability of the transition mechanism. Performance metrics indicated that stochastic transitions balanced adaptability and control, ensuring that generative outputs remained linguistically coherent without excessive randomness.
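
The transition model itself is not spelled out in the abstract; below is a minimal illustration of per-step stochastic embedding updates, assuming a simple drift-plus-noise rule as one possible instantiation:

```python
# Hedged sketch: probabilistically perturb token embeddings toward the context
# mean during inference. The drift/noise rule is an illustrative assumption.
import torch

def stochastic_transition(emb: torch.Tensor, context: torch.Tensor,
                          step: float = 0.05, noise: float = 0.02) -> torch.Tensor:
    """emb, context: (seq, d). Drift toward the context mean plus Gaussian noise."""
    drift = context.mean(dim=0, keepdim=True) - emb
    return emb + step * drift + noise * torch.randn_like(emb)

emb = torch.randn(6, 32)
evolved = stochastic_transition(emb, context=emb)
print(float((evolved - emb).norm()))  # small, controlled shift
```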

[63] Structural Perturbation in Large Language Model Representations through Recursive Symbolic Regeneration

Kathlyn Eaglewood, Tobias Featherington, Dorian Mayfair, Sylvester Grimshaw, James Pettigrew

Main category: cs.CL

TL;DR: Symbolic perturbations influence neural representations by altering latent embeddings, improving attention dynamics and lexical diversity without retraining.

DetailsMotivation: To explore how symbolic-level modifications can adjust model behavior and enhance adaptability in domain-specific applications without direct parameter changes.

Method: Recursive regeneration of symbolic structures introduces structured variations in latent embeddings, affecting attention dynamics and lexical diversity.

Result: Symbolic perturbations induce distinct variations in contextual sensitivity, maintain fluency, and refine long-form text generation.

Conclusion: Symbolic-level modifications offer interpretable and controlled adjustments in automated text generation, enhancing adaptability without retraining.

Abstract: Symbolic perturbations offer a novel approach for influencing neural representations without requiring direct modification of model parameters. The recursive regeneration of symbolic structures introduces structured variations in latent embeddings, leading to controlled shifts in attention dynamics and lexical diversity across sequential generations. A comparative analysis with conventional fine-tuning techniques reveals that structural modifications at the symbolic level induce distinct variations in contextual sensitivity while maintaining overall model fluency and coherence. Shifts in attention weight distributions highlight the role of symbolic modifications in adjusting token dependencies, influencing response variability, and refining long-form text generation. Experimental findings suggest that symbolic perturbations can enhance adaptability in domain-specific applications, allowing modifications in model behavior without retraining. Evaluations of semantic drift indicate that recursive regeneration alters long-range token dependencies, affecting topic coherence across extended text sequences. Results from lexical variability assessments further support the conclusion that symbolic-level modifications introduce interpretable variations in generated responses, potentially enabling more controlled stylistic adjustments in automated text generation.

[64] Structural Reformation of Large Language Model Neuron Encapsulation for Divergent Information Aggregation

Denis Bakushev, Gideon Boultinghouse, Harriet Oppenheimer, Sebastian Gillingwater, Valentina Ashington, Wilfred Stanborough

Main category: cs.CL

TL;DR: Structured neuron encapsulation improves deep learning models by enhancing information aggregation and specialization, leading to better language representation and generation.

DetailsMotivation: To address inefficiencies in deep learning architectures by introducing a modular framework for structured parameter distribution.

Method: Implemented structured neuron encapsulation to modify models, analyzed perplexity scores, lexical variability, logical reasoning, attention weights, and computational trade-offs.

Result: Improved perplexity, lexical variability, logical consistency, and reduced redundancy in text generation. Attention weights showed specialized neuron roles.

Conclusion: Structured encapsulation enhances language models by promoting specialization and efficiency, despite minor computational overhead.

Abstract: Structured neuron encapsulation introduces a modular framework that enables more effective aggregation and specialization of information within deep learning architectures. A model modified through this framework demonstrated improved perplexity scores, greater lexical variability, and enhanced consistency in logical reasoning, suggesting that structured parameter distribution contributes to more efficient language representation. Statistical analyses of generated text highlighted a wider range of sentence structures and reduced redundancy in token selection, indicating that encapsulation fosters more adaptable language generation. A detailed evaluation of attention weight distributions revealed that the experimental model exhibited greater divergence in cross-layer activations, supporting the hypothesis that encapsulated neurons assume specialized processing roles. Logical consistency assessments further demonstrated that modular architectures mitigate contradictory outputs, reducing internal conflicts in inferred relationships between linguistic constructs. Computational trade-offs were analyzed, with results showing a minor increase in processing overhead, though improvements in parameter efficiency and structured decision-making compensated for the additional complexity. The mathematical formulation of the encapsulation mechanism confirmed that modular aggregation maintains stable convergence properties while promoting distinct functional roles for different neuron clusters.

[65] Structured Convergence in Large Language Model Representations via Hierarchical Latent Space Folding

Fenella Harcourt, Naderdel Piero, Gilbert Sutherland, Daphne Holloway, Harriet Bracknell, Julian Ormsby

Main category: cs.CL

TL;DR: Hierarchical latent space folding improves token representation efficiency and coherence in high-dimensional spaces, enhancing computational efficiency and contextual alignment.

DetailsMotivation: Address redundancy in token representations and improve structural coherence across model layers for better computational efficiency and contextual distinctions.

Method: Introduces dynamic folding operations for multi-scale organization of embeddings, refining compactness and preserving context.

Result: Reduces representational variance, stabilizes perplexity, improves predictive confidence, and optimizes computational resource allocation.

Conclusion: Hierarchical latent space folding enhances model performance by structuring representations and boosting computational efficiency.

Abstract: Token representations in high-dimensional latent spaces often exhibit redundancy, limiting computational efficiency and reducing structural coherence across model layers. Hierarchical latent space folding introduces a structured transformation mechanism that enforces a multi-scale organization within learned embeddings, refining representational compactness while preserving essential contextual distinctions. The proposed approach incorporates dynamic folding operations that iteratively adjust token embeddings through structured transformations, influencing both short-range and long-range dependencies in sequential processing tasks. Empirical evaluation demonstrates a reduction in representational variance across layers, contributing to more stable perplexity distributions and enhancing predictive confidence in text generation. The structured redistribution of attention head utilization leads to more efficient allocation of computational resources, particularly in deeper layers, where hierarchical refinements improve contextual abstraction. Comparative analysis of activation sparsity patterns suggests that hierarchical adjustments selectively reinforce critical pathways while reducing computational overhead in non-essential regions of the model. Statistical assessments of token reordering frequencies reveal that hierarchical modifications introduce subtle shifts in sequential dependencies, improving contextual alignment while maintaining syntactic correctness. Computational trade-offs associated with hierarchical folding introduce marginal increases in training time per epoch, yet empirical findings indicate that inference efficiency benefits from the structured representation adjustments. The results highlight the impact of hierarchical latent space folding on optimizing model performance through improved representation structuring and computational efficiency.

[66] Statistical Coherence Alignment for Large Language Model Representation Learning Through Tensor Field Convergence

Jonathan Gale, Godfrey Aldington, Harriet Thistlewood, Thomas Tattershall, Basil Wentworth, Vincent Enoasmo

Main category: cs.CL

TL;DR: The paper introduces Statistical Coherence Alignment to improve token representations in language models by enforcing structured embeddings through tensor field convergence, enhancing coherence and contextual consistency.

DetailsMotivation: The motivation is to improve the coherence and contextual consistency of generated text by structuring internal embeddings to better capture linguistic statistical properties.

Method: The method involves a mathematical framework for coherence alignment, integrating a loss function to optimize representational consistency during training.

Result: Empirical results show improved perplexity, classification accuracy, and rare word embeddings, along with a more interpretable internal structure and balanced organization of embeddings.

Conclusion: The coherence alignment method effectively optimizes token representations, leveraging statistical dependencies to enhance language model training, despite additional computational costs.

Abstract: Representation learning plays a central role in structuring internal embeddings to capture the statistical properties of language, influencing the coherence and contextual consistency of generated text. Statistical Coherence Alignment is introduced as a method to enforce structured token representations through tensor field convergence, guiding embeddings to reflect statistical dependencies inherent in linguistic data. A mathematical framework is established to quantify coherence alignment, integrating a loss function that optimizes representational consistency across training iterations. Empirical evaluations demonstrate that applying coherence constraints improves perplexity, enhances classification accuracy, and refines rare word embeddings, contributing to a more stable representation space. Comparative analyses with baseline models reveal that the proposed method fosters a more interpretable internal structure, ensuring that embeddings retain contextual dependencies while mitigating representation collapse. The impact on coherence score distributions suggests that the alignment mechanism strengthens semantic integrity across diverse linguistic constructs, leading to a more balanced organization of learned embeddings. Computational assessments indicate that while the method introduces additional memory and training costs, the structured optimization process justifies the trade-offs in applications requiring heightened contextual fidelity. Experimental results validate the effectiveness of coherence alignment in optimizing token representations, providing insights into how statistical dependencies can be leveraged to improve language model training.

[67] Exploring Synaptic Resonance in Large Language Models: A Novel Approach to Contextual Memory Integration

George Applegarth, Christian Weatherstone, Maximilian Hollingsworth, Henry Middlebrook, Marcus Irvin

Main category: cs.CL

TL;DR: The paper introduces Synaptic Resonance, a novel mechanism inspired by synaptic plasticity, to enhance long-range contextual memory in language models, improving coherence and reducing perplexity.

DetailsMotivation: Addressing the challenge of maintaining coherence in long sequences, traditional methods like self-attention often fail in long-range dependencies, leading to fragmentation.

Method: Proposes Synaptic Resonance, dynamically adjusting synaptic weights based on contextual relevance to reinforce memory pathways during training and inference.

Result: Demonstrates reduced perplexity, improved coherence, and robustness against noise, with higher memory retention efficiency than baselines.

Conclusion: Synaptic Resonance offers a scalable and effective alternative for long-term contextual modeling, benefiting applications like dialogue systems and summarization.

Abstract: Contextual memory integration remains a major challenge in the development of language models, particularly in tasks that require maintaining coherence over extended sequences. Traditional approaches, such as self-attention mechanisms and memory-augmented architectures, often prioritize short-term dependencies, leading to fragmentation and inconsistency in long-range contextual understanding. Inspired by principles of synaptic plasticity observed in biological neural systems, a novel mechanism, Synaptic Resonance, is introduced to dynamically reinforce relevant memory pathways during training and inference. Unlike static memory representations, this mechanism continuously adjusts synaptic weight matrices based on contextual relevance, allowing for improved information retention without excessive computational overhead. Evaluations conducted on an open-source language model demonstrate reductions in perplexity, enhancements in contextual coherence, and increased robustness against input noise, highlighting the effectiveness of reinforcement-driven memory modulation. Comparative analysis against baseline models further reveals that the proposed approach achieves higher memory retention efficiency while maintaining computational feasibility. The architectural modifications integrate seamlessly into existing transformer-based frameworks, ensuring stable convergence and efficient inference without sacrificing scalability. Applications benefiting from improved long-term contextual consistency, such as dialogue systems and document summarization, stand to gain from this approach. Empirical findings suggest that dynamically reinforced memory pathways offer a promising alternative to conventional memory mechanisms, addressing longstanding limitations in extended sequence modeling.

[68] Exploring Contextual Flux in Large Language Models: A Novel Approach to Self-Modulating Semantic Networks

Henry Evidail, Zachary Mountebank, Alistair Hathersage, Peter Stanhope, Basil Ravenscroft, Tobias Waddingham

Main category: cs.CL

TL;DR: The paper explores self-modulating mechanisms in language models, using Contextual Flux to dynamically adjust token embeddings, improving text generation consistency and thematic retention while addressing computational demands.

DetailsMotivation: To enhance language models' dynamic adaptation capabilities by modulating token embeddings based on evolving contextual dependencies.

Method: Integrates an auxiliary gating mechanism within self-attention to dynamically adjust token representations, evaluated through entropy variations, latent space realignments, and coherence stability.

Result: Embedding shifts improve structured adaptation in long sequences, reducing redundancy and enhancing thematic retention, though stability varies with linguistic structures. Computational demands are noted.

Conclusion: Adaptive embedding updates enhance coherence but depend on model capacity and input complexity, requiring optimization for scalability.

Abstract: Self-modulating mechanisms introduce dynamic adaptation capabilities within language models through contextual realignment strategies that influence token embedding trajectories across extended sequences. Contextual Flux is explored as an approach to embedding modulation, integrating an auxiliary gating mechanism within the self-attention framework to dynamically adjust token representations based on evolving contextual dependencies. The empirical analysis evaluates entropy variations, latent space realignments, and coherence stability to assess the extent to which self-regulation enhances text generation consistency while preserving generative flexibility. Quantitative assessments suggest that embedding shifts contribute to more structured adaptation in long-form sequences, with measured reductions in redundant phrase repetitions and improvements in thematic retention. Variability in contextual weight computation affects modulation stability, leading to differing levels of adaptation across diverse linguistic structures. The computational demands introduced through real-time embedding reconfiguration are examined in relation to model scalability, emphasizing the need for optimization strategies in high-volume generative applications. The findings suggest that while adaptive embedding updates improve certain aspects of coherence, their impact remains contingent on model capacity and input complexity.

[69] Topic Over Source: The Key to Effective Data Mixing for Language Models Pre-training

Jiahui Peng, Xinlin Zhuang, Jiantao Qiu, Ren Ma, Jing Yu, He Zhu, Conghui He

Main category: cs.CL

TL;DR: The paper proposes a topic-based data mixing strategy for LLMs, showing it outperforms source-based methods across various mixing techniques.

DetailsMotivation: Address the gap in prior research by focusing on topic-level data characteristics for better LLM performance.

Method: Uses a multi-stage process (unsupervised clustering, LLM-based summarization, supervised classifier training) to generate topic labels for data mixing.

Result: Topic-based mixing consistently outperforms source-based methods, achieving lower validation loss and better optimization.

Conclusion: Topic-based data mixing is superior for LLM pre-training, and the study provides resources (code, datasets, models) for further research.

Abstract: The performance of large language models (LLMs) is significantly affected by the quality and composition of their pre-training data, which is inherently diverse, spanning various languages, sources, and topics. Effectively integrating these heterogeneous data groups is crucial for optimizing LLM performance. Previous research has predominantly concentrated on source-based data mixing, often neglecting the nuanced topic-level characteristics of the data. To address this gap, we propose a topic-based data mixing strategy that utilizes detailed topic labels generated through a multi-stage process combining unsupervised clustering, LLM-based summarization, and supervised classifier training. With this strategy, we conduct the first comprehensive comparison of topic-based versus source-based partitioning across multiple mixing strategies. We demonstrate that language models pretrained on data mixed by topics consistently outperform those trained on data mixed by sources across multiple methods including RegMix, DoReMi, temperature-based sampling, and a manual mixing method based on downstream task performance. Our theoretical analysis reveals that topic-based data achieves significantly lower validation loss compared to source-based approaches, creating a better optimization landscape for model training. We will make our code, annotated datasets, and topic classification models publicly available to facilitate further research.
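
A compact sketch of the three-stage labeling pipeline, with random vectors standing in for real document embeddings and the LLM-based cluster-naming stage stubbed out:

```python
# Hedged sketch of the multi-stage topic-labeling pipeline: cluster document
# embeddings, name clusters with an LLM (stubbed here), then train a cheap
# supervised classifier to label new documents at scale.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(500, 64))  # stand-in for real embeddings

# Stage 1: unsupervised clustering.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(doc_embeddings)

# Stage 2 (stubbed): an LLM would summarize sample documents per cluster into
# human-readable topic names; here we keep integer cluster ids as labels.

# Stage 3: supervised classifier for scalable topic labeling.
clf = LogisticRegression(max_iter=1000).fit(doc_embeddings, clusters)
print("train accuracy:", clf.score(doc_embeddings, clusters))
```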

[70] One ruler to measure them all: Benchmarking multilingual long-context language models

Yekyung Kim, Jenna Russell, Marzena Karpinska, Mohit Iyyer

Main category: cs.CL

TL;DR: ONERULER is a multilingual benchmark for evaluating long-context language models across 26 languages, revealing performance gaps and surprising language rankings.

DetailsMotivation: To extend the English-only RULER benchmark to multilingual contexts and assess long-context model performance across diverse languages.

Method: Adapts RULER by adding synthetic tasks, translating instructions into 25 languages, and testing models with varying context lengths.

Result: Performance gaps widen between low- and high-resource languages; Polish outperforms English; models often incorrectly predict answer absence.

Conclusion: ONERULER aids research in multilingual and cross-lingual long-context training, highlighting challenges and opportunities.

Abstract: We present ONERULER, a multilingual benchmark designed to evaluate long-context language models across 26 languages. ONERULER adapts the English-only RULER benchmark (Hsieh et al., 2024) by including seven synthetic tasks that test both retrieval and aggregation, including new variations of the “needle-in-a-haystack” task that allow for the possibility of a nonexistent needle. We create ONERULER through a two-step process, first writing English instructions for each task and then collaborating with native speakers to translate them into 25 additional languages. Experiments with both open-weight and closed LLMs reveal a widening performance gap between low- and high-resource languages as context length increases from 8K to 128K tokens. Surprisingly, English is not the top-performing language on long-context tasks (ranked 6th out of 26), with Polish emerging as the top language. Our experiments also show that many LLMs (particularly OpenAI’s o3-mini-high) incorrectly predict the absence of an answer, even in high-resource languages. Finally, in cross-lingual scenarios where instructions and context appear in different languages, performance can fluctuate by up to 20% depending on the instruction language. We hope the release of ONERULER will facilitate future research into improving multilingual and cross-lingual long-context training pipelines.
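
A toy version of the "possibly nonexistent needle" probe; the filler text, needle format, and instruction wording are illustrative, not the benchmark's own:

```python
# Hedged sketch: a needle-in-a-haystack example generator that allows the
# needle to be absent, in the spirit of ONERULER's task variation.
import random

def build_haystack(num_fillers: int, needle: str | None) -> str:
    fillers = [f"Filler sentence number {i}." for i in range(num_fillers)]
    if needle is not None:
        fillers.insert(random.randrange(len(fillers)), needle)
    return " ".join(fillers)

def make_example(has_needle: bool) -> tuple[str, str]:
    needle = "The secret code is 7421." if has_needle else None
    context = build_haystack(200, needle)
    prompt = (context + "\n\nWhat is the secret code? "
              "If it is not mentioned, answer 'none'.")
    return prompt, ("7421" if has_needle else "none")

prompt, gold = make_example(has_needle=False)
print(gold)  # 'none' -- the case where models often wrongly assert an answer
```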

[71] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, Boris Ginsburg

Main category: cs.CL

TL;DR: The paper introduces a superior supervised fine-tuning (SFT) dataset for distilling reasoning capabilities into student models, achieving state-of-the-art coding results without reinforcement learning. It analyzes data sources, filtering impacts, and instruction diversity, prioritizing the latter over correctness.

DetailsMotivation: To bridge the gap in reasoning-based LLMs for coding tasks by addressing the lack of transparency in proprietary datasets and detailing data curation and training methods.

Method: Constructs a high-quality SFT dataset, evaluates its impact on model performance, and analyzes factors like execution filtering and instruction diversity.

Result: Distilled models achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, outperforming reinforcement learning alternatives. Execution filtering is found to reduce accuracy.

Conclusion: Instruction diversity is prioritized over solution correctness for better performance. The dataset and models will be open-sourced.

Abstract: Since the advent of reasoning-based large language models, many have found great success in distilling reasoning capabilities into student models. Such techniques have significantly bridged the gap between reasoning and standard LLMs on coding tasks. Despite this, much of the progress on distilling reasoning models remains locked behind proprietary datasets or lacks details on data curation, filtering, and subsequent training. To address this, we construct a superior supervised fine-tuning (SFT) dataset that we use to achieve state-of-the-art coding capability results in models of various sizes. Our distilled models use only SFT to achieve 61.8% on LiveCodeBench and 24.6% on CodeContests, surpassing alternatives trained with reinforcement learning. We then perform analysis on the data sources used to construct our dataset, the impact of code execution filtering, and the importance of instruction/solution diversity. We observe that execution filtering negatively affected benchmark accuracy, leading us to prioritize instruction diversity over solution correctness. Finally, we also analyze the token efficiency and reasoning patterns utilized by these models. We will open-source these datasets and distilled models to the community.

[72] Single-Pass Document Scanning for Question Answering

Weili Cao, Jianyou Wang, Youze Zheng, Longtian Bao, Qirui Zheng, Taylor Berg-Kirkpatrick, Ramamohan Paturi, Leon Bergen

Main category: cs.CL

TL;DR: A single-pass document scanning method for QA outperforms chunk-based embeddings and competes with large models at lower cost by preserving global coherence.

DetailsMotivation: Handling large documents in QA is challenging due to lost global context in chunk-based methods and high computational costs of full-context transformers.

Method: Proposes a single-pass document scanning approach that processes text linearly, preserving global coherence and selecting relevant sentences for the query.

Result: Outperforms chunk-based methods on 41 QA benchmarks and competes with large models at reduced computational cost.

Conclusion: Single-pass scanning is a simple, effective solution for QA over massive text, with code and resources available.

Abstract: Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
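The abstract describes the scanner only at a high level; the sketch below shows the overall shape of such a linear pass, with `score_fn` and `init_state` as hypothetical stand-ins for the paper's learned model.

```python
def single_pass_scan(sentences, query, score_fn, init_state, top_k=5):
    """Shape of a single-pass scanner: one left-to-right pass carries a
    running state over the whole document (no chunk boundaries) and scores
    each sentence's relevance to the query.
    score_fn(state, sentence, query) -> (relevance, new_state)"""
    state, scored = init_state, []
    for idx, sentence in enumerate(sentences):
        relevance, state = score_fn(state, sentence, query)
        scored.append((relevance, idx))
    scored.sort(reverse=True)  # keep the highest-scoring sentences
    return [sentences[idx] for _, idx in scored[:top_k]]
```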

[73] Not All Data Are Unlearned Equally

Aravind Krishnan, Siva Reddy, Marius Mosbach

Main category: cs.CL

TL;DR: The paper explores how the frequency of knowledge in pre-training data affects unlearning in LLMs, showing frequent knowledge is harder to unlearn and highlighting evaluation challenges.

DetailsMotivation: To address the assumption that all data points are equally easy to unlearn in LLMs, focusing on privacy concerns.

Method: Study the impact of knowledge frequency in pre-training data on unlearning success and evaluate misalignment in unlearning metrics.

Result: Frequent knowledge is harder to unlearn, and evaluation misalignment worsens with larger models.

Conclusion: Better evaluation practices and frequency-aware unlearning methods are needed for LLMs.

Abstract: Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this “all data is equal” assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.

[74] Self-Steering Language Models

Gabriel Grand, Joshua B. Tenenbaum, Vikash K. Mansinghka, Alexander K. Lew, Jacob Andreas

Main category: cs.CL

TL;DR: DisCIPL enables language models to self-steer by generating task-specific inference programs, improving reasoning efficiency and verifiability without finetuning.

DetailsMotivation: Language models struggle with precise reasoning steps but excel at describing problem structures. DisCIPL leverages this to enhance reasoning.

Method: DisCIPL uses a Planner model to create recursive search procedures executed by Follower models, enabling efficient and verifiable reasoning.

Result: DisCIPL matches or outperforms larger models like GPT-4o on constrained generation tasks, using smaller models like Llama-3.2-1B.

Conclusion: DisCIPL introduces a scalable, parallelized Monte Carlo inference strategy, improving LM reasoning without finetuning.

Abstract: While test-time reasoning enables language models (LMs) to tackle complex tasks, searching or planning in natural language can be slow, costly, and error-prone. But even when LMs struggle to emulate the precise reasoning steps needed to solve a problem, they often excel at describing its abstract structure–both how to verify solutions and how to search for them. This paper introduces DisCIPL, a method for “self-steering” LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models. Our approach equips LMs with the ability to write recursive search procedures that guide LM inference, enabling new forms of verifiable and efficient reasoning. When instantiated with a small Follower (e.g., Llama-3.2-1B or Qwen3-1.7B), DisCIPL matches (and sometimes outperforms) much larger models, including GPT-4o and o1, on challenging constrained generation tasks. Our work opens up a design space of highly-parallelized Monte Carlo inference strategies that outperform standard best-of-N sampling, require no finetuning, and can be implemented automatically by existing LMs.

[75] Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Christopher Wolfram, Aaron Schein

Main category: cs.CL

TL;DR: The study explores how latent spaces in independently-trained LLMs relate, finding shared nearest neighbor relationships across models but variability within layers.

DetailsMotivation: To understand the relationships between latent spaces in different LLMs and how they evolve across layers.

Method: Analyzed nearest neighbor relationships in activations across 24 open-weight LLMs at various layers.

Result: Nearest neighbor relationships vary within a model’s layers but are shared across corresponding layers of different models.

Conclusion: LLMs generate a progression of activation geometries shared across models, adapted to different architectures.

Abstract: How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not “obvious” either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.
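The shared-geometry claim can be made concrete with a simple metric: compute each layer's k-nearest-neighbor sets over the same inputs and measure their overlap across models. The sketch below is one natural instantiation, not necessarily the authors' exact measure.

```python
import numpy as np

def knn_sets(acts, k=10):
    """k nearest neighbors (cosine) for each row of an activation matrix."""
    x = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return [set(np.argsort(-row)[:k]) for row in sims]

def neighbor_overlap(acts_a, acts_b, k=10):
    """Mean Jaccard overlap between the neighbor sets two layers induce over
    the same inputs, which works even when hidden sizes differ."""
    na, nb = knn_sets(acts_a, k), knn_sets(acts_b, k)
    return float(np.mean([len(a & b) / len(a | b) for a, b in zip(na, nb)]))

# Stand-in activations for 200 shared inputs from two different models.
rng = np.random.default_rng(0)
print(neighbor_overlap(rng.normal(size=(200, 64)), rng.normal(size=(200, 32))))
```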

[76] EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers

Jianyou Wang, Weili Cao, Kaicheng Wang, Xiaoyue Wang, Ashish Dalvi, Gino Prasad, Qishan Liang, Hsuan-lin Her, Ming Wang, Qin Yang, Gene W. Yeo, David E. Neal, Maxim Khan, Christopher D. Rosin, Ramamohan Paturi, Leon Bergen

Main category: cs.CL

TL;DR: The paper introduces EvidenceBench, a benchmark for evaluating models’ ability to find relevant evidence for biomedical hypotheses, validated by expert annotations. It also releases a larger dataset, EvidenceBench-100k, to aid model development.

DetailsMotivation: Automating the process of finding evidence for biomedical hypotheses is crucial for researchers, but existing models underperform compared to human experts.

Method: A novel pipeline for hypothesis generation and sentence-by-sentence annotation of biomedical papers, validated by expert annotations, is used to create EvidenceBench.

Result: Models evaluated on EvidenceBench perform significantly below expert level. A larger dataset, EvidenceBench-100k, is also introduced.

Conclusion: The proposed pipeline is scalable and valid, providing a valuable resource for improving evidence-finding models in biomedicine.

Abstract: We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure models' performance on this task, which is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create a larger EvidenceBench-100k with 107,461 fully annotated papers with hypotheses to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench

[77] Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance

Takuya Tamura, Taro Yano, Masafumi Enomoto, Masafumi Oyamada

Main category: cs.CL

TL;DR: A novel Lineage-Regularized Matrix Factorization (LRMF) framework improves LLM performance forecasting by incorporating lineage relationships, outperforming baselines and addressing cold-start issues.

DetailsMotivation: To reduce computational costs and development time by accurately predicting LLM performance before fine-tuning or merging, addressing gaps in prior methods that ignore lineage relationships.

Method: Proposes LRMF, which uses a graph Laplacian regularizer to encode ancestral ties among LLMs, leveraging multi-hop parent-child connections for better predictions.

Result: LRMF outperforms conventional methods, achieving 0.15-0.30 higher Pearson correlation coefficients, and effectively handles cold-start scenarios.

Conclusion: LRMF offers a resource-efficient approach for hyperparameter tuning, data selection, and model combination in LLM development.

Abstract: Accurately forecasting the performance of Large Language Models (LLMs) before extensive fine-tuning or merging can substantially reduce both computational expense and development time. Although prior approaches like scaling laws account for global factors such as parameter size or training tokens, they often overlook explicit lineage relationships-i.e., which models are derived or merged from which parents. In this work, we propose a novel Lineage-Regularized Matrix Factorization (LRMF) framework that encodes ancestral ties among LLMs via a graph Laplacian regularizer. By leveraging multi-hop parent-child connections, LRMF consistently outperforms conventional matrix factorization and collaborative filtering methods in both instance-level and benchmark-level performance prediction. Our large-scale study includes 2,934 publicly available Hugging Face models and 21,000+ instances across 6 major benchmarks, showing that the introduction of lineage constraints yields up to 0.15-0.30 higher Pearson correlation coefficients with actual performance compared to baseline methods. Moreover, LRMF effectively addresses the cold-start problem, providing accurate estimates for newly derived or merged models even with minimal data. This lineage-guided strategy thus offers a resource-efficient way to inform hyperparameter tuning, data selection, and model combination in modern LLM development.
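From the abstract, the LRMF objective plausibly combines a masked factorization loss with a graph-Laplacian penalty over the lineage graph. The sketch below is a reconstruction under that assumption, not the authors' code.

```python
import numpy as np

def lrmf_objective(P, observed, U, V, L, lam=0.1):
    """Plausible lineage-regularized matrix factorization objective:
    factorize the (models x benchmarks) performance matrix P ~= U @ V.T on
    observed entries, while a Laplacian term pulls the latent factors of
    lineage-related models (parents, children, merge sources) together.
    L = D - A is the Laplacian of the model lineage graph."""
    recon = np.sum(observed * (P - U @ V.T) ** 2)  # fit known scores
    lineage = lam * np.trace(U.T @ L @ U)          # smoothness over lineage
    return recon + lineage
```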

[78] No Query, No Access

Wenqiang Wang, Siyuan Liang, Yangshijie Zhang, Xiaojun Jia, Hao Lin, Xiaochun Cao

Main category: cs.CL

TL;DR: VDBA is a novel adversarial attack method using only victim texts, outperforming state-of-the-art methods with a 52.08% ASR improvement and zero queries.

DetailsMotivation: Existing adversarial attacks require model knowledge, extensive queries, or training data, limiting practicality. VDBA aims to overcome these constraints.

Method: VDBA uses victim texts, creates a shadow dataset with pre-trained models, and employs hierarchical substitution models and diverse adversarial example generation.

Result: VDBA achieves a 52.08% ASR improvement and 45.99% ASR on LLMs like Qwen2 and GPT, with zero queries.

Conclusion: VDBA demonstrates serious security risks for advanced NLP models, even without API access.

Abstract: Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the Victim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while significantly reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our codes can be found at https://anonymous.4open.science/r/VDBA-Victim-Data-based-Adversarial-Attack-36EC/

[79] The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks

Benedikt Ebing, Goran Glavaš

Main category: cs.CL

TL;DR: The paper revisits word aligners (WAs) for label projection in cross-lingual transfer (XLT) for token classification, optimizing design choices and introducing a new ensembling strategy that outperforms marker-based methods.

DetailsMotivation: To systematically investigate the impact of low-level design decisions in word aligners for label projection in XLT and improve performance.

Method: Revisits WAs, examining (i) label projection algorithms, (ii) filtering strategies, and (iii) pre-tokenization. Introduces an ensembling strategy combining translate-train and translate-test predictions.

Result: Optimized WAs perform comparably to marker-based methods. The new ensembling strategy outperforms marker-based projection and reduces sensitivity to design choices.

Conclusion: Optimized WAs and the proposed ensembling strategy enhance robustness and performance in XLT for token classification tasks.

Abstract: Translation-based strategies for cross-lingual transfer (XLT), such as translate-train – training on noisy target language data translated from the source language – and translate-test – evaluating on noisy source language data translated from the target language – are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
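As a concrete illustration of step (i), the sketch below shows one simple span-projection algorithm of the kind the paper compares; the drop-unaligned-spans rule is a stand-in for the filtering strategies in (ii).

```python
def project_span_labels(src_spans, alignments, tgt_len):
    """Project each labeled source span to the contiguous target range
    spanned by its aligned tokens. `alignments` is a set of
    (src_idx, tgt_idx) pairs produced by a word aligner."""
    tgt_labels = ["O"] * tgt_len
    for start, end, label in src_spans:          # inclusive token indices
        hits = sorted(t for s, t in alignments if start <= s <= end)
        if not hits:
            continue                             # unaligned span: filter out
        for t in range(hits[0], hits[-1] + 1):   # close gaps inside the span
            tgt_labels[t] = label
    return tgt_labels

# "New York" (source tokens 0-1, LOC) aligned to target tokens 2 and 3.
print(project_span_labels([(0, 1, "LOC")], {(0, 2), (1, 3)}, tgt_len=5))
```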

[80] CUB: Benchmarking Context Utilisation Techniques for Language Models

Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein

Main category: cs.CL

TL;DR: The paper introduces CUB, a benchmark for evaluating context utilisation manipulation techniques (CMTs) in retrieval-augmented generation (RAG), revealing gaps in current methods and evaluation practices.

DetailsMotivation: To address the lack of systematic comparison of CMTs in handling diverse context conditions in knowledge-intensive tasks like question answering and fact checking.

Method: Developed CUB, a comprehensive benchmark, and evaluated seven state-of-the-art CMTs across three datasets and nine language models.

Result: Existing CMTs struggle with diverse context types and show inflated performance on synthetic datasets compared to realistic ones.

Conclusion: Highlights the need for holistic testing and robust CMTs capable of handling multiple context types effectively.

Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) - the first comprehensive benchmark designed to help practitioners within retrieval-augmented generation (RAG) diagnose CMTs under different context conditions. With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world retrieval-augmented scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Our findings expose critical gaps in current CMT evaluation practices and demonstrate the need for holistic testing and the development of CMTs that can robustly handle multiple context types.

[81] Automated Privacy Information Annotation in Large Language Model Interactions

Hang Zeng, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Shaojie Tang, Guihai Chen

Main category: cs.CL

TL;DR: The paper addresses privacy risks in real-name LLM interactions by creating a multilingual dataset and automated annotation pipeline for privacy detection, highlighting performance gaps for future research.

DetailsMotivation: Users risk disclosing private info in real-name LLM interactions, but existing PII detection methods are inadequate for this scenario.

Method: A large-scale dataset (249K queries, 154K annotated phrases) was built using an automated annotation pipeline with LLMs. Evaluation metrics and baseline methods (tuning-free/tuning-based) were developed.

Result: Performance gaps were identified, showing current methods fall short of real-world LLM application needs.

Conclusion: The work motivates future research into more effective local privacy detection methods using the provided dataset.

Abstract: Users interacting with large language models (LLMs) under their real identifiers often unknowingly risk disclosing private information. Automatically notifying users whether their queries leak privacy and which phrases leak what private information has therefore become a practical need. Existing privacy detection methods, however, were designed for different objectives and application domains, typically tagging personally identifiable information (PII) in anonymous content, which is insufficient in real-name interaction scenarios with LLMs. In this work, to support the development and evaluation of privacy detection models for LLM interactions that are deployable on local user devices, we construct a large-scale multilingual dataset with 249K user queries and 154K annotated privacy phrases. In particular, we build an automated privacy annotation pipeline with strong LLMs to automatically extract privacy phrases from dialogue datasets and annotate leaked information. We also design evaluation metrics at the levels of privacy leakage, extracted privacy phrase, and privacy information. We further establish baseline methods using lightweight LLMs in both tuning-free and tuning-based settings, and report a comprehensive evaluation of their performance. Evaluation results reveal a gap between current performance and the requirements of real-world LLM applications, motivating future research into more effective local privacy detection methods grounded in our dataset.

[82] DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations

Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Jieping Ye

Main category: cs.CL

TL;DR: DrVoice introduces a parallel speech-text model using joint autoregressive modeling and dual-resolution speech representations, achieving SOTA performance with less data.

DetailsMotivation: To improve speech-text generation by enabling mutual modality awareness and reducing input frequency for LLMs.

Method: Joint autoregressive modeling with dual-resolution speech representations (5Hz input frequency).

Result: DrVoice achieves SOTA performance on Spoken Question Answering benchmarks with limited data.

Conclusion: DrVoice demonstrates the effectiveness of joint modeling and dual-resolution for speech-text generation.

Abstract: Recent studies on end-to-end speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Whereas current methods mainly utilize a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz. Experimental results on Spoken Question Answering benchmarks demonstrate that DrVoice establishes new state-of-the-art (SOTA) performance among similar-size speech foundation models with a relatively small amount of data.

[83] No Universal Prompt: Unifying Reasoning through Adaptive Prompting for Temporal Table Reasoning

Abhishek Rajgaria, Kushagra Dixit, Mayank Vyas, Harshavardhan Kalalbandi, Dan Roth, Vivek Gupta

Main category: cs.CL

TL;DR: SEAR, an adaptive prompting framework, outperforms baseline methods in temporal table reasoning by dynamically adjusting to context and integrating structured reasoning.

DetailsMotivation: Existing prompting methods for temporal table reasoning lack exploration of their impact, and performance varies widely across table types and contexts.

Method: Investigates multiple prompting techniques on diverse table types and introduces SEAR, an adaptive framework inspired by human reasoning.

Result: SEAR achieves superior performance across all table types; table structure refactoring also enhances reasoning.

Conclusion: No single method consistently outperforms others, but SEAR’s adaptive approach and unified table representation improve performance.

Abstract: Temporal Table Reasoning is a critical challenge for Large Language Models (LLMs), requiring effective reasoning to extract relevant insights. Despite the existence of multiple prompting methods, their impact on table reasoning remains largely unexplored. Furthermore, model performance varies drastically across different table and context structures, making it difficult to determine an optimal approach. This work investigates multiple prompting techniques on diverse table types and determines that performance depends on factors such as entity type, table structure, requirement of additional context, and question complexity, with “NO” single method consistently outperforming others. To address this, we introduce SEAR, an adaptive prompting framework inspired by human reasoning that dynamically adjusts to context and integrates structured reasoning. Our results demonstrate that SEAR achieves superior performance across all table types compared to baseline prompting techniques. Additionally, we explore the impact of table structure refactoring, finding that a unified representation enhances model reasoning.

[84] Decompositional Reasoning for Graph Retrieval with Large Language Models

Valentin Six, Evan Dufraisse, Gaël de Chalendar

Main category: cs.CL

TL;DR: A novel retrieval approach integrates textual knowledge graphs into LLMs for multi-hop QA, improving reasoning and factual consistency.

DetailsMotivation: LLMs struggle with multi-hop reasoning and factual consistency in knowledge-intensive tasks like QA. Combining KGs and LLMs shows promise but lacks efficient graph reasoning.

Method: Decomposes complex questions into sub-questions, retrieves relevant textual subgraphs, and composes a question-specific KG for answer generation using a weighted similarity function.

Result: Achieves comparable or superior performance on multi-hop QA benchmarks with smaller models and fewer LLM calls.

Conclusion: The method enhances factual grounding and interpretability while leveraging LLMs’ generative strengths.

Abstract: Large Language Models (LLMs) excel at many NLP tasks, but struggle with multi-hop reasoning and factual consistency, limiting their effectiveness on knowledge-intensive tasks like complex question answering (QA). Linking Knowledge Graphs (KG) and LLMs has shown promising results, but LLMs generally lack the ability to reason efficiently over graph-structured information. To tackle this problem, we propose a novel retrieval approach that integrates textual knowledge graphs into the LLM reasoning process via query decomposition. Our method decomposes complex questions into sub-questions, retrieves relevant textual subgraphs, and composes a question-specific knowledge graph to guide answer generation. For that, we use a weighted similarity function that focuses on both the complex question and the generated subquestions to extract a relevant subgraph, which allows efficient and precise retrieval for complex questions and improves the performance of LLMs on multi-hop QA tasks. This structured reasoning pipeline enhances factual grounding and interpretability while leveraging the generative strengths of LLMs. We evaluate our method on standard multi-hop QA benchmarks and show that it achieves comparable or superior performance to competitive existing methods, using smaller models and fewer LLM calls.
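A weighted similarity of the kind the abstract describes can be sketched as follows; `sim`, `alpha`, and the max-aggregation over sub-questions are assumptions for illustration, not the paper's exact function.

```python
def relevance(candidate, question, sub_questions, sim, alpha=0.5):
    """Score a candidate graph element (triple or sentence) against both the
    full complex question and its decomposed sub-questions. `sim` is any
    text-similarity function, e.g. embedding cosine similarity."""
    full_score = sim(question, candidate)
    sub_score = max(sim(sq, candidate) for sq in sub_questions)
    return alpha * full_score + (1 - alpha) * sub_score
```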

[85] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy

Main category: cs.CL

TL;DR: FiSCo is a statistical framework for evaluating group-level fairness in LLMs by detecting subtle semantic biases in long-form responses, outperforming existing methods.

DetailsMotivation: Existing evaluation methods overlook biases in long-form LLM responses and fail to address intrinsic variability, undermining reliability.

Method: FiSCo decomposes outputs into claims, uses entailment checks for semantic consistency, and applies statistical testing to compare inter- and intra-group similarities.

Result: FiSCo reliably identifies nuanced biases across demographic groups (gender, race, age) and reduces the impact of LLM variability.

Conclusion: FiSCo provides a robust, fine-grained approach to fairness evaluation in LLMs, addressing limitations of prior work.

Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
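The statistical core, comparing inter-group with intra-group similarities, can be sketched briefly; `sim` stands in for FiSCo's claim-level, entailment-based similarity, and Welch's t-test is an assumed choice of test.

```python
import itertools
from scipy import stats

def group_difference_test(responses_a, responses_b, sim):
    """Compare similarity of response pairs within each demographic group
    against pairs across groups; a significant gap (inter < intra) suggests
    the model treats the two groups differently."""
    intra = [sim(x, y)
             for group in (responses_a, responses_b)
             for x, y in itertools.combinations(group, 2)]
    inter = [sim(x, y) for x in responses_a for y in responses_b]
    return stats.ttest_ind(intra, inter, equal_var=False)  # Welch's t-test
```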

[86] AALC: Large Language Model Efficient Reasoning via Adaptive Accuracy-Length Control

Ruosen Li, Ziming Luo, Quan Zhang, Ruochen Li, Ben Zhou, Ali Payani, Xinya Du

Main category: cs.CL

TL;DR: AALC, a lightweight accuracy-aware length reward, reduces response length by 50% while maintaining or improving accuracy in large reasoning models (LRMs).

DetailsMotivation: LRMs generate lengthy chain-of-thoughts, causing high latency and cost without proportional accuracy gains.

Method: AALC integrates validation accuracy into reinforcement learning, dynamically balancing correctness and brevity with a scheduled length penalty.

Result: Response length reduced by over 50% with maintained or improved accuracy; redundant reasoning patterns are curbed.

Conclusion: Reward-based strategies like AALC can guide LRMs toward efficient, generalizable reasoning, though interpretability may decrease.

Abstract: Large reasoning models (LRMs) achieve impressive reasoning capabilities by generating lengthy chain-of-thoughts, but this “overthinking” incurs high latency and cost without commensurate accuracy gains. In this work, we introduce AALC, a lightweight, accuracy-aware length reward integrated into reinforcement learning that dynamically balances correctness and brevity during training. By incorporating validation accuracy into the reward and employing a smooth, dynamically scheduled length penalty, AALC delays the length penalty until target performance is met. Through extensive experiments across standard and out-of-distribution math benchmarks, we show that our approach reduces response length by over 50% while maintaining or even improving the original accuracy. Furthermore, qualitative analysis reveals that our method curbs redundant reasoning patterns such as excessive subgoal setting and verification, leading to structurally refined outputs rather than naive truncation. We also identify that efficiency gains are accompanied by reduced interpretability: models trained with AALC omit some narrative framing and explanatory context. These findings highlight the potential of reward-based strategies to guide LRMs toward more efficient, generalizable reasoning paths.
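A minimal version of such an accuracy-gated length reward might look like the following; the gating schedule and constants are assumptions consistent with the abstract, not the paper's exact reward.

```python
def aalc_reward(correct, length, val_accuracy,
                target_accuracy=0.9, max_length=4096):
    """Accuracy-aware length reward: the length penalty is phased in as
    validation accuracy approaches the target, so brevity is never bought
    with correctness early in training."""
    base = 1.0 if correct else 0.0
    gate = min(1.0, val_accuracy / target_accuracy)  # scheduled penalty weight
    return base - gate * (length / max_length)
```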

[87] Humans overrely on overconfident language models, across languages

Neil Rathi, Dan Jurafsky, Kaitlyn Zhou

Main category: cs.CL

TL;DR: The paper examines the risks of multilingual linguistic miscalibration in LLMs, showing overconfidence and overreliance across languages, with variations in epistemic marker usage and human reliance behaviors.

DetailsMotivation: To evaluate LLM safety globally by assessing linguistic calibration, overconfidence, and overreliance risks across languages, given prior evidence of overconfidence in English.

Method: Analyzed LLM-generated epistemic markers in five languages, measured human reliance rates, and compared cross-linguistic variations in marker usage and reliance behaviors.

Result: LLMs exhibit overconfidence across languages, with varying usage of epistemic markers (e.g., more uncertainty in Japanese, more certainty in German/Mandarin). Human reliance behaviors also differ, with higher discounting of uncertainty in Japanese.

Conclusion: Multilingual linguistic calibration is challenging, and culturally/linguistically contextualized safety evaluations are crucial to mitigate overreliance risks.

Abstract: As large language models (LLMs) are deployed globally, it is crucial that their responses are calibrated across languages to accurately convey uncertainty and limitations. Prior work shows that LLMs are linguistically overconfident in English, leading users to overrely on confident generations. However, the usage and interpretation of epistemic markers (e.g., ‘I think it’s’) differs sharply across languages. Here, we study the risks of multilingual linguistic (mis)calibration, overconfidence, and overreliance across five languages to evaluate LLM safety in a global context. Our work finds that overreliance risks are high across languages. We first analyze the distribution of LLM-generated epistemic markers and observe that LLMs are overconfident across languages, frequently generating strengtheners even as part of incorrect responses. Model generations are, however, sensitive to documented cross-linguistic variation in usage: for example, models generate the most markers of uncertainty in Japanese and the most markers of certainty in German and Mandarin. Next, we measure human reliance rates across languages, finding that reliance behaviors differ cross-linguistically: for example, participants are significantly more likely to discount expressions of uncertainty in Japanese than in English (i.e., ignore their ‘hedging’ function and rely on generations that contain them). Taken together, these results indicate a high risk of reliance on overconfident model generations across languages. Our findings highlight the challenges of multilingual linguistic calibration and stress the importance of culturally and linguistically contextualized model safety evaluations.

[88] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks

Linbo Cao, Jinman Zhao

Main category: cs.CL

TL;DR: A debate-driven evaluation paradigm transforms QA datasets into adversarial debates to assess model reasoning, reducing memorization and contamination risks while reusing existing data.

DetailsMotivation: Addressing concerns of data contamination, memorization, and high costs in QA benchmarks by introducing a more robust evaluation method.

Method: Convert QA tasks into structured adversarial debates where models defend or challenge answers, adjudicated by a blind judge model.

Result: Validated robustness; fine-tuned models showed improved accuracy but performed worse in debates, and weaker judges reliably identified stronger debaters.

Conclusion: The framework offers a sustainable way to measure genuine reasoning in advanced language models, reducing reliance on new benchmarks.

Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates–where one model is given the official answer to defend, and another constructs and defends an alternative answer–adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm’s effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination–a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% -> 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that “pretraining on the test set is no longer all you need,” offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
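The debate protocol can be sketched as a simple loop; `proponent`, `opponent`, and `judge` are hypothetical wrappers around LLM calls, and the round structure is an assumption based on the abstract.

```python
def debate_eval(question, official, alternative,
                proponent, opponent, judge, rounds=3):
    """One model defends the official answer, another defends a constructed
    alternative, and a judge blind to the ground truth adjudicates."""
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + proponent(question, official, transcript))
        transcript.append("B: " + opponent(question, alternative, transcript))
    return judge(question, transcript) == "A"  # True if official answer wins
```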

[89] PaPaformer: Language Model from Pre-trained Parallel Paths

Joonas Tapaninaho, Mourad Oussala

Main category: cs.CL

TL;DR: The paper introduces PaPaformer, a decoder-only transformer variant, to reduce training time and parameters while improving performance by training parallel paths individually.

DetailsMotivation: Modern language models require excessive computation and time, even for smaller variants like SLMs. The goal is to reduce training time from days/weeks to hours.

Method: Introduces PaPaformer, a decoder-only transformer with parallel paths trained individually on diverse data and combined into a larger model.

Result: Reduces total parameters and training time while increasing performance. Allows customization of paths for specific tasks.

Conclusion: PaPaformer offers a scalable and efficient method for training language models, with potential for task-specific customization.

Abstract: The training of modern large-language models requires an increasing amount of computational power and time. Even smaller variants, such as small-language models (SLMs), take several days to train in the best-case scenarios, often requiring multiple GPUs. This paper explores methods to train and evaluate decoder-only transformer-based language models in hours instead of days or weeks. We introduce PaPaformer, a decoder-only transformer architecture variant whose lower-dimensional parallel paths are combined into a larger model. The paper shows that these lower-dimensional paths can be trained individually with different types of training data and then combined into one larger model. This method gives the option to reduce the total number of model parameters and the training time while increasing performance. Moreover, the use of the parallel path structure opens interesting possibilities to customize paths to accommodate specific task requirements.

[90] CoAct-1: Computer-using Agents with Coding as Actions

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

Main category: cs.CL

TL;DR: CoAct-1 introduces a hybrid multi-agent system combining GUI control and programmatic execution, improving efficiency and reliability in computer automation tasks.

DetailsMotivation: Current GUI-based autonomous agents are inefficient and brittle for complex tasks, prompting the need for a more robust approach.

Method: CoAct-1 uses an Orchestrator to delegate tasks between a GUI Operator and a Programmer agent, enabling direct coding (Python/Bash) alongside GUI actions.

Result: Achieves 60.76% success rate on OSWorld benchmark, with 10.15 average steps per task, outperforming GUI-only agents.

Conclusion: Integrating coding as an action enhances power, efficiency, and scalability in computer automation.

Abstract: Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

[91] Evaluation of LLMs in AMR Parsing

Shu Han Ho

Main category: cs.CL

TL;DR: Finetuning decoder-only LLMs like LLaMA 3.2 achieves competitive AMR parsing performance, matching SOTA parsers.

DetailsMotivation: To explore a straightforward finetuning approach for AMR parsing using decoder-only LLMs, avoiding complex architectures.

Method: Finetuned four LLM architectures (Phi 3.5, Gemma 2, LLaMA 3.2, DeepSeek R1 LLaMA Distilled) and evaluated them on the LDC2020T02 gold AMR3.0 test set.

Result: LLaMA 3.2 achieved SMATCH F1: 0.804, comparable to SOTA parsers, with Phi 3.5 excelling in structural validity.

Conclusion: Decoder-only LLMs, especially LLaMA 3.2, offer a promising, simpler alternative to complex AMR parsers.

Abstract: AMR (Abstract Meaning Representation) is a semantic formalism that encodes sentence meaning as rooted, directed, acyclic graphs, where nodes represent concepts and edges denote semantic relations. Finetuning decoder-only Large Language Models (LLMs) represents a promising, straightforward direction for AMR parsing. This paper presents a comprehensive evaluation of finetuning four distinct LLM architectures, Phi 3.5, Gemma 2, LLaMA 3.2, and DeepSeek R1 LLaMA Distilled, using the LDC2020T02 Gold AMR3.0 test set. Our results show that straightforward finetuning of decoder-only LLMs can achieve comparable performance to complex State of the Art (SOTA) AMR parsers. Notably, LLaMA 3.2 demonstrates competitive performance against SOTA AMR parsers given a straightforward finetuning approach. We achieved SMATCH F1: 0.804 on the full LDC2020T02 test split, on par with APT + Silver (IBM) at 0.804 and approaching Graphene Smatch (MBSE) at 0.854. Across our analysis, we also observed a consistent pattern where LLaMA 3.2 leads in semantic performance while Phi 3.5 excels in structural validity.

[92] MyCulture: Exploring Malaysia’s Diverse Culture under Low-Resource Language Constraints

Zhong Ken Hew, Jia Xin Low, Sze Jue Yang, Chee Seng Chan

Main category: cs.CL

TL;DR: MyCulture benchmark evaluates LLMs on Malaysian culture in Bahasa Melayu using an open-ended multiple-choice format to reduce bias and improve fairness.

DetailsMotivation: Address cultural biases in LLMs caused by training data dominated by high-resource languages, ensuring accurate representation of diverse cultural contexts, especially in low-resource settings.

Method: Introduce MyCulture benchmark with six cultural pillars, using an open-ended multiple-choice question format without predefined options to mitigate guessing and format bias. Analyze structural and language biases through varied prompts.

Result: Significant disparities in cultural comprehension among LLMs, emphasizing the need for culturally grounded and linguistically inclusive benchmarks.

Conclusion: MyCulture highlights the importance of culturally aware and fair evaluation tools for LLMs to ensure inclusivity and reduce biases.

Abstract: Large Language Models (LLMs) often exhibit cultural biases due to training data dominated by high-resource languages like English and Chinese. This poses challenges for accurately representing and evaluating diverse cultural contexts, particularly in low-resource language settings. To address this, we introduce MyCulture, a benchmark designed to comprehensively evaluate LLMs on Malaysian culture across six pillars (arts, attire, customs, entertainment, food, and religion), presented in Bahasa Melayu. Unlike conventional benchmarks, MyCulture employs a novel open-ended multiple-choice question format without predefined options, thereby reducing guessing and mitigating format bias. We provide a theoretical justification for the effectiveness of this open-ended structure in improving both fairness and discriminative power. Furthermore, we analyze structural bias by comparing model performance on structured versus free-form outputs, and assess language bias through multilingual prompt variations. Our evaluation across a range of regional and international LLMs reveals significant disparities in cultural comprehension, highlighting the urgent need for culturally grounded and linguistically inclusive benchmarks in the development and assessment of LLMs.

cs.CV

[93] Boosting Adversarial Transferability via Residual Perturbation Attack

Jinjia Peng, Zeze Tao, Huibing Wang, Meng Wang, Yang Wang

Main category: cs.CV

TL;DR: ResPA improves adversarial example transferability by using residual gradients to guide perturbations toward flat loss regions.

DetailsMotivation: Adversarial examples in flat loss landscapes transfer better, but prior methods ignore perturbation direction, limiting effectiveness.

Method: ResPA uses residual gradients (difference between current and reference gradients) to guide perturbations, leveraging global direction changes.

Result: ResPA outperforms existing transfer-based attacks and further improves when combined with input transformation methods.

Conclusion: ResPA enhances adversarial transferability by better capturing global perturbation directions, validated by superior experimental results.

Abstract: Deep neural networks are susceptible to adversarial examples, yielding incorrect predictions under imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to target models under black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes exhibit superior transferability, which alleviates overfitting on surrogate models. However, the prior arts overlook the influence of perturbation directions, resulting in limited transferability. In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average on the input gradients to obtain the first moment as the reference gradient, which encompasses the direction of historical gradients. Instead of heavily relying on the local flatness that stems from the current gradients as the perturbation direction, ResPA further considers the residual between the current gradient and the reference gradient to capture the changes in the global perturbation direction. The experimental results demonstrate the better transferability of ResPA than the existing typical transfer-based attack methods, while the transferability can be further improved by combining ResPA with the current input transformation methods. The code is available at https://github.com/ZezeTao/ResPA.
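Reading the abstract literally, the update resembles a momentum-style iterative attack with a residual direction; the sketch below reconstructs it under that reading, with the step sizes, sign step, and projection as assumptions rather than the paper's exact formulation.

```python
import torch

def respa(model, loss_fn, x, y, eps=8 / 255, steps=10, decay=0.9):
    """Keep an exponential moving average of input gradients as a reference
    direction, then step along the residual between the current gradient and
    that reference."""
    alpha = eps / steps
    x_adv = x.clone().detach()
    reference = torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(model(x_adv), y), x_adv)[0]
        reference = decay * reference + (1 - decay) * grad  # first moment
        residual = grad - reference          # change in global direction
        x_adv = (x_adv + alpha * residual.sign()).detach()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project to the L_inf ball
    return x_adv.clamp(0, 1).detach()  # assume image inputs in [0, 1]
```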

[94] Generalized Few-Shot Out-of-Distribution Detection

Pinxuan Li, Bing Cao, Changqing Zhang, Qinghua Hu

Main category: cs.CV

TL;DR: The paper introduces a Generalized Few-shot OOD Detection (GOOD) framework to improve generalization in OOD detection by leveraging a General Knowledge Model (GKM) and a Knowledge Dynamic Embedding (KDE) mechanism.

DetailsMotivation: Existing Few-shot OOD detection methods overfit to limited training data, leading to poor generalization. The paper aims to address this by incorporating general knowledge.

Method: The GOOD framework uses GKM to provide general knowledge and KDE to dynamically align output distributions, balancing generality and specificity.

Result: Experiments show the framework outperforms existing methods on real-world OOD benchmarks.

Conclusion: The GOOD framework effectively enhances generalization in Few-shot OOD detection, with theoretical and empirical validation.

Abstract: Few-shot Out-of-Distribution (OOD) detection has emerged as a critical research direction in machine learning for practical deployment. Most existing few-shot OOD detection methods suffer from insufficient generalization capability for the open world. Due to the few-shot learning paradigm, the OOD detection ability is often overfit to the limited training data itself, thus degrading the performance on generalized data and performing inconsistently across different scenarios. To address this challenge, we propose a Generalized Few-shot OOD Detection (GOOD) framework, which empowers the OOD detection model with general knowledge from an auxiliary General Knowledge Model (GKM), instead of directly learning from few-shot data. We then examine few-shot OOD detection from a generalization perspective and theoretically derive the Generality-Specificity balance (GS-balance) for OOD detection, which provably reduces the upper bound of the generalization error with a general knowledge model. Accordingly, we propose a Knowledge Dynamic Embedding (KDE) mechanism to adaptively modulate the guidance of general knowledge. KDE dynamically aligns the output distributions of the OOD detection model to the general knowledge model based on the Generalized Belief (G-Belief) of GKM, thereby boosting the GS-balance. Experiments on real-world OOD benchmarks demonstrate our superiority. Code will be available.

[95] UnGuide: Learning to Forget with LoRA-Guided Diffusion Models

Agnieszka Polowczyk, Alicja Polowczyk, Dawid Malarz, Artur Kasymov, Marcin Mazur, Jacek Tabor, Przemysław Spurek

Main category: cs.CV

TL;DR: UnGuide introduces UnGuidance, a dynamic inference mechanism for precise control in machine unlearning, outperforming LoRA-based methods in concept removal while preserving content fidelity.

DetailsMotivation: Address concerns about misuse of text-to-image diffusion models by enabling effective unlearning of harmful or misleading content without compromising overall performance.

Method: Uses UnGuidance, a dynamic inference mechanism leveraging Classifier-Free Guidance (CFG) to modulate the guidance scale based on denoising stability, combined with LoRA for selective unlearning.

Result: UnGuide achieves controlled concept removal, retains model expressive power, and outperforms LoRA-based methods in erasure tasks.

Conclusion: UnGuide offers a robust solution for targeted unlearning in diffusion models, balancing erasure and content fidelity.

Abstract: Recent advances in large-scale text-to-image diffusion models have heightened concerns about their potential misuse, especially in generating harmful or misleading content. This underscores the urgent need for effective machine unlearning, i.e., removing specific knowledge or concepts from pretrained models without compromising overall performance. One possible approach is Low-Rank Adaptation (LoRA), which offers an efficient means to fine-tune models for targeted unlearning. However, LoRA often inadvertently alters unrelated content, leading to diminished image fidelity and realism. To address this limitation, we introduce UnGuide – a novel approach which incorporates UnGuidance, a dynamic inference mechanism that leverages Classifier-Free Guidance (CFG) to exert precise control over the unlearning process. UnGuide modulates the guidance scale based on the stability of a few first steps of denoising processes, enabling selective unlearning by LoRA adapter. For prompts containing the erased concept, the LoRA module predominates and is counterbalanced by the base model; for unrelated prompts, the base model governs generation, preserving content fidelity. Empirical results demonstrate that UnGuide achieves controlled concept removal and retains the expressive power of diffusion models, outperforming existing LoRA-based methods in both object erasure and explicit content removal tasks.
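The underlying classifier-free guidance combination is standard; how UnGuide maps early-step stability to a guidance scale is the paper's contribution, so the linear mapping below is only an assumed placeholder.

```python
def unguide_step(eps_base, eps_lora, stability, max_scale=7.5, min_scale=1.0):
    """Classifier-free guidance between the base model's and the LoRA
    (unlearning) branch's noise predictions. Low stability in the first
    denoising steps (prompt touches the erased concept) raises the scale so
    the LoRA branch dominates; high stability keeps the base model in charge.
    The stability measure and mapping here are illustrative assumptions."""
    scale = min_scale + (max_scale - min_scale) * (1.0 - stability)
    return eps_base + scale * (eps_lora - eps_base)  # CFG combination
```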

[96] Can Multimodal Large Language Models Understand Spatial Relations?

Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, Tong Ruan

Main category: cs.CV

TL;DR: SpatialMQA is a new benchmark for spatial relation reasoning in MLLMs, addressing flaws in existing benchmarks and showing current models lag behind human performance.

DetailsMotivation: Current benchmarks for spatial relation reasoning in MLLMs have limitations like reliance on bounding boxes or prior knowledge, hindering true image understanding.

Method: SpatialMQA is introduced, a human-annotated benchmark based on COCO2017, with a tailored annotation procedure producing 5,392 samples.

Result: State-of-the-art MLLMs achieve only 48.14% accuracy on SpatialMQA, compared to human-level accuracy of 98.40%.

Conclusion: SpatialMQA highlights the gap in MLLMs’ spatial reasoning and suggests future research directions, with the benchmark and code publicly available.

Abstract: Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model’s prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are evaluated, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://github.com/ziyan-xiaoyu/SpatialMQA.git.

[97] Improving Masked Style Transfer using Blended Partial Convolution

Seyed Hadi Seyed, Ayberk Cansever, David Hart

Main category: cs.CV

TL;DR: The paper introduces a partial-convolution-based style transfer network for applying artistic styles to specific image regions, improving accuracy over traditional masking methods.

DetailsMotivation: Existing methods apply style transfer to entire images, but users often need stylization for specific regions. Post-stylization masking fails to accurately capture style features in the region of interest.

Method: A partial-convolution-based network is proposed, along with internal blending techniques to handle imperfect region selections.

Result: The approach visually and quantitatively outperforms traditional methods, demonstrated using the SA-1B dataset.

Conclusion: The proposed method effectively targets style transfer to specific regions, addressing limitations of post-stylization masking.

Abstract: Artistic style transfer has long been possible with the advancements of convolution- and transformer-based neural networks. Most algorithms apply artistic style transfer to the whole image, but individual users may only need to apply a style transfer to a specific region in the image. The standard practice is to simply mask the image after the stylization. This work shows that this approach tends to improperly capture the style features in the region of interest. We propose a partial-convolution-based style transfer network that accurately applies the style features exclusively to the region of interest. Additionally, we present network-internal blending techniques that account for imperfections in the region selection. We show that this visually and quantitatively improves stylization using examples from the SA-1B dataset. Code is publicly available at https://github.com/davidmhart/StyleTransferMasked.
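
For reference, a minimal partial convolution in the style of Liu et al., which is the building block the paper adapts: outputs are computed only over valid mask pixels and renormalized by the local mask area. Shapes and padding handling are simplified, and this is not the paper's network.

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None, padding=1):
    """Partial convolution: convolve only over valid (mask == 1) pixels
    and renormalize by the local mask area, so features inside the
    region of interest are not polluted by content outside it.

    x:      (B, C_in, H, W) input features
    mask:   (B, 1, H, W) binary (float) region-of-interest mask
    weight: (C_out, C_in, kH, kW) convolution kernel
    """
    ones = torch.ones_like(weight[:1, :1])          # (1, 1, kH, kW)
    # Local count of valid pixels under each kernel window
    valid = F.conv2d(mask, ones, padding=padding)
    out = F.conv2d(x * mask, weight, padding=padding)
    # Renormalize: scale by (window size / number of valid pixels)
    out = out * (ones.numel() / valid.clamp(min=1.0))
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    new_mask = (valid > 0).float()                  # mask for the next layer
    return out, new_mask
```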

[98] MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss

Can Zhao, Pengfei Guo, Dong Yang, Yucheng Tang, Yufan He, Benjamin Simon, Mason Belue, Stephanie Harmon, Baris Turkbey, Daguang Xu

Main category: cs.CV

TL;DR: MAISI-v2 is an accelerated 3D medical image synthesis framework using rectified flow for fast, high-quality generation and a novel contrastive loss for better condition fidelity.

DetailsMotivation: Addressing limitations in existing diffusion models, such as slow inference, limited generalizability, and weak alignment with input conditions in medical imaging.

Method: Integrates rectified flow for acceleration and introduces a region-specific contrastive loss to enhance condition fidelity.

Result: Achieves state-of-the-art image quality with 33× acceleration and demonstrates utility in downstream tasks like segmentation.

Conclusion: MAISI-v2 improves speed and condition alignment, facilitating reproducibility and further development in medical image synthesis.

Abstract: Medical image synthesis is an important topic for both clinical and research applications. Recently, diffusion models have become a leading approach in this area. Despite their strengths, many existing methods struggle with (1) limited generalizability that only works for specific body regions or voxel spacings, (2) slow inference, which is a common issue for diffusion models, and (3) weak alignment with input conditions, which is a critical issue for medical imaging. MAISI, a previously proposed framework, addresses generalizability issues but still suffers from slow inference and limited condition consistency. In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high-quality generation. To further enhance condition fidelity, we introduce a novel region-specific contrastive loss that increases sensitivity to the region of interest. Our experiments show that MAISI-v2 can achieve SOTA image quality with a $33\times$ acceleration over the latent diffusion model. We also conduct a downstream segmentation experiment to show that the synthetic images can be used for data augmentation. We release our code, training details, model weights, and a GUI demo to facilitate reproducibility and promote further development within the community.
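
The inference speedup comes from rectified flow's near-straight probability paths, which a plain Euler integrator can follow in a few steps. Below is a generic sketch assuming the model predicts a velocity field; this is not the MAISI-v2 code.

```python
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_model, x_noise, num_steps=10, cond=None):
    """Euler integration of a rectified-flow ODE from t=0 (noise) to
    t=1 (data). Because rectified flow learns nearly straight paths,
    a handful of steps suffices, which is where the speedup over
    standard diffusion sampling comes from.

    velocity_model: assumed to predict dx/dt given (x_t, t, cond)
    x_noise:        (B, ...) initial Gaussian noise sample
    """
    x = x_noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        v = velocity_model(x, t, cond)  # predicted velocity field
        x = x + v * dt                  # straight-line Euler step
    return x
```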

[99] Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks

Mengyu Li, Guoyao Shen, Chad W. Farris, Xin Zhang

Main category: cs.CV

TL;DR: A framework for few-shot deployment of pretrained MRI transformers using MAE pretraining, achieving state-of-the-art results in brain imaging tasks with minimal supervision.

DetailsMotivation: Addressing the scarcity of annotated data in medical imaging by leveraging pretrained transformers for improved real-world applicability.

Method: Utilizes MAE pretraining on a large-scale brain MRI dataset, combining frozen MAE encoders with lightweight heads for classification and hybrid architectures (MAE-FUnet) for segmentation.

Result: Achieves state-of-the-art accuracy in MRI sequence identification and outperforms baselines in segmentation tasks under data-limited conditions.

Conclusion: The framework is efficient, stable, and scalable, making it suitable for low-resource clinical and neuroimaging applications.

Abstract: Machine learning using transformers has shown great potential in medical imaging, but its real-world applicability remains limited due to the scarcity of annotated data. In this study, we propose a practical framework for the few-shot deployment of pretrained MRI transformers in diverse brain imaging tasks. By utilizing the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset comprising over 31 million slices, we obtain highly transferable latent representations that generalize well across tasks and datasets. For high-level tasks such as classification, a frozen MAE encoder combined with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, we propose MAE-FUnet, a hybrid architecture that fuses multiscale CNN features with pretrained MAE embeddings. This model consistently outperforms other strong baselines in both skull stripping and multi-class anatomical segmentation under data-limited conditions. With extensive quantitative and qualitative evaluations, our framework demonstrates efficiency, stability, and scalability, suggesting its suitability for low-resource clinical environments and broader neuroimaging applications.
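
A minimal sketch of the frozen-encoder-plus-linear-head recipe used for the classification setting; the encoder output shape and pooling are assumptions, and the MAE-FUnet segmentation variant (which fuses multiscale CNN features) is not shown.

```python
import torch
import torch.nn as nn

class FrozenMAEProbe(nn.Module):
    """Few-shot classification head on top of a pretrained MAE encoder:
    the encoder stays frozen and only a lightweight linear head is
    trained on the handful of labeled examples."""

    def __init__(self, mae_encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = mae_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False      # keep pretrained weights fixed
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            tokens = self.encoder(x)     # (B, N, D) patch embeddings (assumed)
        feat = tokens.mean(dim=1)        # global average pool over patches
        return self.head(feat)
```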

[100] Optimization-Free Style Transfer for 3D Gaussian Splats

Raphael Du Sablon, David Hart

Main category: cs.CV

TL;DR: A novel, optimization-free method for stylizing 3D Gaussian splats using a graph structure and surface-based stylization, achieving fast results without training.

DetailsMotivation: Existing 3D Gaussian splat style transfer methods require reconstruction, fine-tuning, or optimization, which are time-consuming and restrictive.

Method: Generates a graph structure on the splat’s implicit surface, applies a feed-forward stylization method, and interpolates back to the splats.

Result: Fast stylization (under 2 minutes on consumer hardware) with high quality, compatible with any style image and splat.

Conclusion: The proposed method is efficient, flexible, and avoids the need for training or optimization, outperforming existing approaches.

Abstract: The task of style transfer for 3D Gaussian splats has been explored in many previous works, but these require reconstructing or fine-tuning the splat while incorporating style information or optimizing a feature extraction network on the splat representation. We propose a reconstruction- and optimization-free approach to stylizing 3D Gaussian splats. This is done by generating a graph structure across the implicit surface of the splat representation. A feed-forward, surface-based stylization method is then used and interpolated back to the individual splats in the scene. This allows for any style image and 3D Gaussian splat to be used without any additional training or optimization. This also allows for fast stylization of splats, achieving runtimes under 2 minutes even on consumer-grade hardware. We demonstrate the quality of the results this approach achieves and compare to other 3D Gaussian splat style transfer methods. Code is publicly available at https://github.com/davidmhart/FastSplatStyler.

[101] A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

Main category: cs.CV

TL;DR: The paper explores integrating classification objectives into super-resolution (SR) processes to enhance both image quality and downstream classification accuracy, proposing a novel method for synthetic aperture radar imagery.

DetailsMotivation: Low-resolution images limit automated analysis accuracy, and traditional SR methods focus on pixel-level metrics without considering downstream classification performance.

Method: A novel SR methodology optimizes loss functions for both image quality and classification performance, applied to synthetic aperture radar imagery.

Result: The approach improves image quality and enhances classification accuracy.

Conclusion: Integrating classification objectives into SR processes can simultaneously improve image fidelity and classification performance.

Abstract: High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to reconstruct high-resolution images from low-resolution inputs. Related traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we address this question by investigating the relationship between super-resolution and classification through a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by established image quality indicators, while also enhancing classification accuracy.
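
The core idea reduces to a joint objective: a pixel-level reconstruction term plus a classification term evaluated on the super-resolved image. A hedged sketch with an assumed L1 reconstruction loss and weighting, not the paper's exact losses:

```python
import torch.nn.functional as F

def classification_aware_sr_loss(sr_image, hr_image, classifier, label,
                                 lambda_cls=0.1):
    """Joint objective for classification-aware super-resolution.

    sr_image:   super-resolved output of the SR network
    hr_image:   ground-truth high-resolution image
    classifier: downstream classifier applied to the SR output
    lambda_cls: assumed trade-off between fidelity and accuracy
    """
    recon = F.l1_loss(sr_image, hr_image)        # pixel-level fidelity
    logits = classifier(sr_image)                # downstream prediction
    cls = F.cross_entropy(logits, label)         # classification objective
    return recon + lambda_cls * cls
```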

[102] MZEN: Multi-Zoom Enhanced NeRF for 3-D Reconstruction with Unknown Camera Poses

Jong-Ik Park, Carlee Joe-Wong, Gary K. Fedder

Main category: cs.CV

TL;DR: MZEN enhances NeRF for industrial inspection by handling multi-zoom images, improving detail capture without losing global accuracy.

DetailsMotivation: NeRF lacks fine-detail capture for industrial tasks like defect detection. Multi-zoom images disrupt NeRF's multi-view consistency.

Method: MZEN introduces a learnable zoom scalar and a pose strategy: wide-field images establish a global frame, while zoom-in images are pose-primed and refined.

Result: MZEN outperforms baselines, boosting PSNR by up to 28%, SSIM by up to 10%, and reducing LPIPS by up to 222%.

Conclusion: MZEN extends NeRF to industrial settings, capturing micron-level details while maintaining global accuracy.

Abstract: Neural Radiance Fields (NeRF) methods excel at 3D reconstruction from multiple 2D images, even those taken with unknown camera poses. However, they still miss the fine-detailed structures that matter in industrial inspection, e.g., detecting sub-micron defects on a production line or analyzing chips with Scanning Electron Microscopy (SEM). In these scenarios, the sensor resolution is fixed and compute budgets are tight, so the only way to expose fine structure is to add zoom-in images; yet, this breaks the multi-view consistency that pose-free NeRF training relies on. We propose Multi-Zoom Enhanced NeRF (MZEN), the first NeRF framework that natively handles multi-zoom image sets. MZEN (i) augments the pin-hole camera model with an explicit, learnable zoom scalar that scales the focal length, and (ii) introduces a novel pose strategy: wide-field images are solved first to establish a global metric frame, and zoom-in images are then pose-primed to the nearest wide-field counterpart via a zoom-consistent crop-and-match procedure before joint refinement. Across eight forward-facing scenes – synthetic TCAD models, real SEM of micro-structures, and BLEFF objects – MZEN consistently outperforms pose-free baselines and even high-resolution variants, boosting PSNR by up to 28%, SSIM by 10%, and reducing LPIPS by up to 222%. MZEN, therefore, extends NeRF to real-world factory settings, preserving global accuracy while capturing the micron-level details essential for industrial inspection.
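
A sketch of a pin-hole projection augmented with an explicit, learnable zoom scalar that scales the focal length, as the abstract describes; the single zoom parameter and log-parameterization below are illustrative choices, not MZEN's implementation.

```python
import torch
import torch.nn as nn

class ZoomAwarePinhole(nn.Module):
    """Pin-hole projection with a learnable zoom scalar scaling the
    base focal length. One zoom parameter per image group is an
    assumption made here for brevity."""

    def __init__(self, base_focal, cx, cy):
        super().__init__()
        self.log_zoom = nn.Parameter(torch.zeros(()))  # zoom = exp(log_zoom) > 0
        self.base_focal, self.cx, self.cy = base_focal, cx, cy

    def project(self, xyz_cam):
        """xyz_cam: (N, 3) points in camera coordinates with z > 0."""
        f = self.base_focal * self.log_zoom.exp()      # zoom-scaled focal length
        u = f * xyz_cam[:, 0] / xyz_cam[:, 2] + self.cx
        v = f * xyz_cam[:, 1] / xyz_cam[:, 2] + self.cy
        return torch.stack([u, v], dim=-1)             # (N, 2) pixel coords
```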

[103] TSMS-SAM2: Multi-scale Temporal Sampling Augmentation and Memory-Splitting Pruning for Promptable Video Object Segmentation and Tracking in Surgical Scenarios

Guoping Xu, Hua-Chieh Shao, You Zhang

Main category: cs.CV

TL;DR: TSMS-SAM2 enhances promptable video object segmentation and tracking in surgical videos by addressing motion dynamics and memory redundancy in SAM2, achieving top performance on EndoVis datasets.

DetailsMotivation: Existing foundation models like SAM2 struggle with surgical video analysis due to complex motion and memory redundancy, limiting their effectiveness.

Method: TSMS-SAM2 introduces multi-temporal-scale video sampling augmentation for motion robustness and a memory splitting/pruning mechanism for efficient feature handling.

Result: Achieved highest mean Dice scores of 95.24 and 86.73 on EndoVis2017 and EndoVis2018, outperforming prior methods.

Conclusion: TSMS-SAM2 is effective for robust and efficient segmentation in surgical scenarios, with potential for broader applications.

Abstract: Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework’s potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.

[104] Temporal Cluster Assignment for Efficient Real-Time Video Segmentation

Ka-Wai Yung, Felix J. S. Bragman, Jialang Xu, Imanol Luengo, Danail Stoyanov, Evangelos B. Mazomenos

Main category: cs.CV

TL;DR: TCA (Temporal Cluster Assignment) improves video segmentation by leveraging temporal coherence to refine token clusters, reducing computation while retaining detail.

DetailsMotivation: Swin Transformer's high computational cost in video segmentation limits real-time applications, and existing token reduction methods fail to exploit temporal redundancy.

Method: TCA refines token clusters using temporal correlations across frames, avoiding indiscriminate token dropping.

Result: TCA enhances accuracy-speed trade-off on multiple datasets (YouTube-VIS 2019/2021, OVIS, surgical videos).

Conclusion: TCA effectively generalizes across natural and domain-specific videos, optimizing performance without fine-tuning.

Abstract: Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight, effective, and fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes well across both natural and domain-specific videos.
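
A minimal sketch of the temporal idea: rather than re-clustering every frame from scratch, tokens of frame t are assigned to the centroids carried over from frame t-1, and the centroids are refreshed. This illustrates the principle, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def temporal_cluster_assign(tokens_t, prev_centroids):
    """Assign the current frame's tokens to the previous frame's
    cluster centroids, exploiting temporal coherence between frames.

    tokens_t:       (N, D) tokens of the current frame
    prev_centroids: (K, D) centroids carried over from frame t-1
    """
    # Cosine similarity between tokens and carried-over centroids
    sim = F.normalize(tokens_t, dim=-1) @ F.normalize(prev_centroids, dim=-1).T
    assign = sim.argmax(dim=-1)                  # (N,) cluster ids
    # Refresh centroids as the mean of their newly assigned tokens
    centroids = prev_centroids.clone()
    for k in range(prev_centroids.shape[0]):
        members = tokens_t[assign == k]
        if members.numel() > 0:
            centroids[k] = members.mean(dim=0)
    return assign, centroids
```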

[105] VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments

Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang

Main category: cs.CV

TL;DR: A vision-language framework predicts driver gaze shifts using natural language, outperforming general-purpose models in attention detection and interpretability.

DetailsMotivation: To address the limitation of prior studies focusing on static attention estimation by modeling dynamic gaze behavior through language.

Method: Uses few-shot and zero-shot learning on RGB images, fine-tunes LLaVA with human-curated captions from BDD-A, and integrates low-level cues and top-down context.

Result: Fine-tuned model excels in attention shift detection and interpretability, surpassing general-purpose VLMs.

Conclusion: Pioneers language-based gaze prediction, advancing explainable AI for autonomous driving and enabling downstream applications.

Abstract: Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers’ gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.

[106] Multi-view Gaze Target Estimation

Qiaomu Miao, Vivek Raju Golani, Jingyi Xu, Progga Paromita Dutta, Minh Hoai, Dimitris Samaras

Main category: cs.CV

TL;DR: A multi-view camera method for gaze target estimation (GTE) improves accuracy by addressing single-view limitations like occlusion and ambiguity. It uses head information aggregation, uncertainty-based gaze selection, and epipolar-based scene attention.

DetailsMotivation: Existing single-view GTE methods struggle with face occlusion, target ambiguity, and out-of-view targets, limiting accuracy and applicability.

Method: The approach integrates two camera views, using a Head Information Aggregation module, Uncertainty-based Gaze Selection, and Epipolar-based Scene Attention for cross-view information sharing.

Result: The method outperforms single-view baselines, especially when the second camera provides a clear face view, and can estimate gaze targets using only the second view.

Conclusion: The proposed multi-view GTE method enhances accuracy and expands capabilities, supported by a new multi-view dataset for future research.

Abstract: This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person’s face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. Furthermore, the paper introduces a multi-view dataset for developing and evaluating multi-view GTE methods. Data and code are available at https://www3.cs.stonybrook.edu/~cvl/multiview_gte.html

[107] ETTA: Efficient Test-Time Adaptation for Vision-Language Models through Dynamic Embedding Updates

Hamidreza Dastmalchi, Aijun An, Ali cheraghian

Main category: cs.CV

TL;DR: ETTA improves test-time adaptation for VLMs by dynamically integrating all test samples and adaptively combining prompt-based and cache-based methods, achieving better accuracy and efficiency.

DetailsMotivation: Current cache-based TTA methods for VLMs are limited by storing only high-confidence samples, ignoring broader test data influence, which restricts generalization.

Method: ETTA introduces a Recursive Updating module for dynamic integration of all test samples and an Adaptive Ensemble module to reduce prompt dependency, combining both adaptively.

Result: ETTA outperforms state-of-the-art TTA models in accuracy and computational efficiency on benchmarks.

Conclusion: ETTA sets a new standard for efficient and effective test-time adaptation, with released code for reproducibility.

Abstract: Pretrained vision-language models (VLMs) like CLIP show strong zero-shot performance but struggle with generalization under distribution shifts. Test-Time Adaptation (TTA) addresses this by adapting VLMs to unlabeled test data in new domains. While some TTA methods rely on prompt-tuning, training-free cache-based approaches are preferred for efficiency. However, current cache-based TTA models store only a limited set of high-confidence samples, restricting the decision boundary to these samples and ignoring the influence of other incoming test data. To address this, we propose Efficient Test-Time Adaptation (ETTA), introducing a Recursive Updating module that integrates all incoming test samples, progressively refining the decision boundary. This strategy mimics an unbounded cache, dynamically updating contextual embeddings for improved accuracy with minimal memory and computational overhead. ETTA also includes an Adaptive Ensemble module to reduce prompt dependency in image-to-text scores by dynamically selecting optimal prompts for each class. Furthermore, ETTA adaptively combines scores from both modules based on confidence levels, leveraging their complementary strengths. Extensive experiments on two benchmarks confirm that ETTA surpasses state-of-the-art TTA models in both accuracy and computational efficiency, setting a new standard for effective, efficient test-time adaptation. The code has been released at https://github.com/hamidreza-dastmalchi/ETTA.
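
A sketch of a recursive update that mimics an unbounded cache in O(1) memory: each incoming test sample folds into a running class embedding instead of being stored. The hard-assignment running mean below is an assumed simplification of the paper's module.

```python
import torch
import torch.nn.functional as F

class RecursiveClassEmbedding:
    """Maintain per-class embeddings updated by every incoming test
    sample, rather than caching a fixed set of high-confidence ones."""

    def __init__(self, text_embeddings):
        # Initialize class embeddings from CLIP text embeddings (C, D)
        self.emb = F.normalize(text_embeddings.clone(), dim=-1)
        self.count = torch.ones(text_embeddings.shape[0])

    def update(self, image_feat):
        """image_feat: (D,) L2-normalized feature of one test image."""
        probs = (self.emb @ image_feat).softmax(dim=-1)  # (C,) class scores
        c = probs.argmax().item()                        # predicted class
        n = self.count[c]
        # Running mean over all samples assigned to class c so far
        self.emb[c] = F.normalize((self.emb[c] * n + image_feat) / (n + 1),
                                  dim=-1)
        self.count[c] = n + 1
        return probs
```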

[108] HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing

Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch

Main category: cs.CV

TL;DR: HOLODECK 2.0 is a vision-language-guided framework for generating and editing 3D scenes from text, supporting diverse styles and interactive feedback.

DetailsMotivation: Current 3D scene generation relies heavily on manual effort and lacks flexibility for open-domain scenes and editing.

Method: HOLODECK 2.0 uses vision-language models to parse objects, generate assets, and apply spatial constraints for coherent layouts.

Result: It produces high-quality, semantically aligned scenes, outperforming baselines in evaluations.

Conclusion: The framework enhances efficiency in applications like game modeling, offering flexible editing and immersive environments.

Abstract: 3D scene generation plays a crucial role in gaming, artistic creation, virtual reality and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. As a result, generating 3D worlds directly from text has garnered increasing attention. In this paper, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. It then iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Human evaluations and CLIP-based assessments demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, we provide editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling, generating visually rich and immersive environments, potentially boosting efficiency.

[109] Robust Image Stitching with Optimal Plane

Lang Nie, Yuan Mei, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao

Main category: cs.CV

TL;DR: RopStitch is an unsupervised deep image stitching framework that ensures robustness and naturalness through a dual-branch architecture and virtual optimal planes.

DetailsMotivation: To address the challenges of robustness and naturalness in image stitching, especially in diverse real-world scenes.

Method: Uses a dual-branch model (pretrained and learnable branches) to capture coarse and fine features, and introduces virtual optimal planes for content alignment and structural preservation.

Result: Significantly outperforms existing methods in scene robustness and content naturalness.

Conclusion: RopStitch provides a robust and natural solution for image stitching, validated by extensive experiments.

Abstract: We present RopStitch, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of RopStitch, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into RopStitch by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at https://github.com/MmelodYy/RopStitch.

[110] Neural Field Representations of Mobile Computational Photography

Ilya Chugunov

Main category: cs.CV

TL;DR: Neural field models enable compact representation of complex geometry and lighting effects in mobile imaging, outperforming state-of-the-art methods without complex pre-processing or labeled data.

DetailsMotivation: To leverage the versatility of smartphones as computational imaging platforms and address challenges in scene reconstruction and image processing.

Method: Uses carefully designed neural field models trained with stochastic gradient descent to fit raw smartphone measurements.

Result: Outperforms existing approaches in tasks like depth estimation, layer separation, and image stitching.

Conclusion: Neural fields offer a powerful, self-regularized solution for mobile computational imaging without relying on traditional data representations or priors.

Abstract: Over the past two decades, mobile imaging has experienced a profound transformation, with cell phones rapidly eclipsing all other forms of digital photography in popularity. Today’s cell phones are equipped with a diverse range of imaging technologies - laser depth ranging, multi-focal camera arrays, and split-pixel sensors - alongside non-visual sensors such as gyroscopes, accelerometers, and magnetometers. This, combined with on-board integrated chips for image and signal processing, makes the cell phone a versatile pocket-sized computational imaging platform. Parallel to this, we have seen in recent years how neural fields - small neural networks trained to map continuous spatial input coordinates to output signals - enable the reconstruction of complex scenes without explicit data representations such as pixel arrays or point clouds. In this thesis, I demonstrate how carefully designed neural field models can compactly represent complex geometry and lighting effects, enabling applications such as depth estimation, layer separation, and image stitching directly from in-the-wild mobile photography data. These methods outperform state-of-the-art approaches without relying on complex pre-processing steps, labeled ground truth data, or machine learning priors. Instead, they leverage well-constructed, self-regularized models that tackle challenging inverse problems through stochastic gradient descent, fitting directly to raw measurements from a smartphone.
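
A minimal coordinate-network sketch of the kind the thesis builds on: a small MLP with Fourier features mapping continuous pixel coordinates to RGB, fit by stochastic gradient descent to raw measurements. Layer sizes and the encoding are generic choices, not the thesis models.

```python
import torch
import torch.nn as nn

class NeuralField(nn.Module):
    """Continuous (x, y) -> RGB coordinate network. Fourier features
    give the MLP the high-frequency capacity that a plain coordinate
    input lacks."""

    def __init__(self, num_freqs=8, hidden=128):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)
        in_dim = 2 * 2 * num_freqs                 # sin & cos per axis
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords):
        """coords: (N, 2) pixel coordinates scaled to [-1, 1]."""
        ang = coords[..., None] * self.freqs       # (N, 2, F)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1)
        return self.mlp(enc)                       # (N, 3) RGB values
```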

[111] Enhancing Construction Site Analysis and Understanding with 3D Segmentation

Sri Ramana Saketh Vasanthawada, Pengkun Liu, Pingbo Tang

Main category: cs.CV

TL;DR: The paper evaluates SAM and Mask3D for 3D segmentation in construction monitoring, highlighting their adaptability and the lack of outdoor benchmarks.

DetailsMotivation: Traditional methods struggle with construction sites' complexity, prompting the need for efficient computer-vision-based solutions.

Method: Comparative analysis of SAM and Mask3D in real-world construction settings, initially trained on indoor datasets.

Result: Identifies gaps in segmentation approaches and showcases the models’ effectiveness in challenging conditions.

Conclusion: Emphasizes the need for tailored workflows to improve automated and precise construction monitoring.

Abstract: Monitoring construction progress is crucial yet resource-intensive, prompting the exploration of computer-vision-based methodologies for enhanced efficiency and scalability. Traditional data acquisition methods, primarily focusing on indoor environments, falter in construction sites' complex, cluttered, and dynamically changing conditions. This paper critically evaluates the application of two advanced 3D segmentation methods, Segment Anything Model (SAM) and Mask3D, in challenging outdoor and indoor conditions. Trained initially on indoor datasets, both models’ adaptability and performance are assessed in real-world construction settings, highlighting the gap in current segmentation approaches due to the absence of benchmarks for outdoor scenarios. Through a comparative analysis, this study not only showcases the relative effectiveness of SAM and Mask3D but also addresses the critical need for tailored segmentation workflows capable of extracting actionable insights from construction site data, thereby advancing the field towards more automated and precise monitoring techniques.

[112] A 3DGS-Diffusion Self-Supervised Framework for Normal Estimation from a Single Image

Yanxing Liang, Yinghui Wang, Jinlong Yang, Wei Li

Main category: cs.CV

TL;DR: SINGAD is a self-supervised framework for normal estimation from a single image, using 3D Gaussian splatting and diffusion to address multi-view inconsistency and data dependency.

DetailsMotivation: Existing methods rely on data-driven priors and lack explicit light-surface interaction modeling, causing multi-view conflicts and dependency on dense annotations.

Method: Integrates physics-driven light-interaction modeling and differentiable rendering to convert 3D geometric errors into optimization signals. Uses a conditional diffusion model with cross-domain feature fusion.

Result: Outperforms state-of-the-art methods on the Google Scanned Objects dataset.

Conclusion: SINGAD solves multi-view inconsistency and reduces reliance on annotated data, advancing single-image normal estimation.

Abstract: The lack of spatial dimensional information remains a challenge in normal estimation from a single image. Recent diffusion-based methods have demonstrated significant potential in 2D-to-3D implicit mapping, but they rely on data-driven statistical priors and miss the explicit modeling of light-surface interaction, leading to multi-view normal direction conflicts. Moreover, the discrete sampling mechanism of diffusion models causes gradient discontinuity in differentiable rendering reconstruction modules, preventing 3D geometric errors from being backpropagated to the normal generation network, thereby forcing existing methods to depend on dense normal annotations. This paper proposes SINGAD, a novel Self-supervised framework from a single Image for Normal estimation via 3D GAussian splatting guided Diffusion. By integrating physics-driven light-interaction modeling and a differentiable rendering-based reprojection strategy, our framework directly converts 3D geometric errors into normal optimization signals, solving the challenges of multi-view geometric inconsistency and data dependency. Specifically, the framework constructs a light-interaction-driven 3DGS reparameterization model to generate multi-scale geometric features consistent with light transport principles, ensuring multi-view normal consistency. A cross-domain feature fusion module is designed within a conditional diffusion model, embedding geometric priors to constrain normal generation while maintaining accurate geometric error propagation. Furthermore, a differentiable 3D reprojection loss strategy is introduced for self-supervised optimization that minimizes geometric error between the reconstructed and input image, eliminating dependence on annotated normal datasets. Quantitative evaluations on the Google Scanned Objects dataset demonstrate that our method outperforms state-of-the-art approaches across multiple metrics.

[113] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal

Main category: cs.CV

TL;DR: Bifrost-1 integrates pretrained MLLMs and diffusion models using patch-level CLIP embeddings for efficient, high-fidelity image generation without compromising reasoning.

DetailsMotivation: To enable high-fidelity visual synthesis in LLMs without costly training or loss of reasoning capabilities.

Method: Uses patch-level CLIP embeddings as latents, adapts ControlNet, and adds a visual generation branch to MLLMs.

Result: Achieves comparable/better performance in fidelity and understanding with lower compute.

Conclusion: Bifrost-1 is an efficient, effective framework for integrating MLLMs and diffusion models.

Abstract: There is growing interest in integrating high-fidelity visual synthesis capabilities into large language models (LLMs) without compromising their strong reasoning capabilities. Existing methods that directly train LLMs or bridge LLMs and diffusion models usually suffer from costly training since the backbone LLMs have not seen image representations during pretraining. We present Bifrost-1, a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models using patch-level CLIP image embeddings as latent variables, which are natively aligned with the MLLM’s CLIP visual encoder. These patch-level image embeddings are integrated into the diffusion model with a lightweight adaptation of its ControlNet. To retain the original multimodal reasoning capabilities of MLLMs, we equip the MLLM with a visual generation branch initialized from the original MLLM parameters when predicting the patch-level image embeddings. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation with significant training efficiency. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding, with substantially lower compute during training. We also provide comprehensive ablation studies showing the effectiveness of our design choices.

[114] PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation

Zhihao Zhu, Yifan Zheng, Siyu Pan, Yaohui Jin, Yao Mu

Main category: cs.CV

TL;DR: PASG is a framework for robotic manipulation that bridges semantic and geometric features by automating primitive extraction and coupling them with affordances using VLMs.

DetailsMotivation: Addressing the gap between high-level task semantics and low-level geometric features in robotic manipulation, which limits dynamic semantic-affordance relationships.

Method: Introduces PASG with automatic primitive extraction, VLM-driven semantic anchoring, and a spatial-semantic reasoning benchmark.

Result: Achieves performance comparable to manual annotations in diverse robotic tasks, enabling finer-grained semantic-affordance understanding.

Conclusion: PASG successfully unifies geometric primitives with task semantics, improving robotic manipulation.

Abstract: The fragmentation between high-level task semantics and low-level geometric features remains a persistent challenge in robotic manipulation. While vision-language models (VLMs) have shown promise in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). We demonstrate PASG’s effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations. PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.

[115] AnimateScene: Camera-controllable Animation in Any Scene

Qingyang Liu, Bingjie Gao, Weiheng Huang, Jun Zhang, Zhongqian Sun, Yang Wei, Zelin Peng, Qianli Ma, Shuai Yang, Zhaohe Liao, Haonan Zhao, Li Niu

Main category: cs.CV

TL;DR: AnimateScene integrates 4D human animation with 3D scenes, addressing placement, style alignment, and camera trajectory challenges.

DetailsMotivation: Seamlessly integrating 4D human animation into 3D scenes is challenging due to placement, lighting/style mismatches, and camera movement needs.

Method: AnimateScene uses an accurate placement module, training-free style alignment, and joint post-reconstruction for camera trajectories.

Result: The framework produces high-detail, coherent videos with dynamic scenes and human animations under various conditions.

Conclusion: AnimateScene effectively unifies human animation and scene reconstruction, overcoming key integration challenges.

Abstract: 3D scene reconstruction and 4D human animation have seen rapid progress and broad adoption in recent years. However, seamlessly integrating reconstructed scenes with 4D human animation to produce visually engaging results remains challenging. One key difficulty lies in placing the human at the correct location and scale within the scene while avoiding unrealistic interpenetration. Another challenge is that the human and the background may exhibit different lighting and style, leading to unrealistic composites. In addition, appealing character motion videos are often accompanied by camera movements, which means that the viewpoints need to be reconstructed along a specified trajectory. We present AnimateScene, which addresses the above issues in a unified framework. First, we design an accurate placement module that automatically determines a plausible 3D position for the human and prevents any interpenetration within the scene during motion. Second, we propose a training-free style alignment method that adapts the 4D human representation to match the background’s lighting and style, achieving coherent visual integration. Finally, we design a joint post-reconstruction method for both the 4D human and the 3D scene that allows camera trajectories to be inserted, enabling the final rendered video to feature visually appealing camera movements. Extensive experiments show that AnimateScene generates dynamic scene videos with high geometric detail and spatiotemporal coherence across various camera and action combinations.

[116] ETA: Energy-based Test-time Adaptation for Depth Completion

Younjoon Chung, Hyoungseob Park, Patrick Rim, Xiaoran Zhang, Jihe He, Ziyao Zeng, Safa Cicek, Byung-Woo Hong, James S. Duncan, Alex Wong

Main category: cs.CV

TL;DR: The paper introduces Energy-based Test-time Adaptation (ETA), a method to adapt pretrained depth completion models to novel environments by minimizing prediction energy, improving accuracy over prior methods.

DetailsMotivation: Addresses the issue of erroneous predictions when depth completion models are applied to new environments due to covariate shifts, without prior access to target data.

Method: Uses adversarial perturbations to explore data space and trains an energy model to score predictions as in- or out-of-distribution, updating model parameters at test time to minimize energy.

Result: ETA outperforms the previous state-of-the-art by 6.94% outdoors and 10.23% indoors across six datasets.

Conclusion: ETA effectively aligns test-time predictions to the source distribution, demonstrating significant improvements in depth completion accuracy for novel environments.

Abstract: We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some "source" data, often predict erroneous outputs when transferred to "target" data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method "Energy-based Test-time Adaptation", or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improves over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
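
A sketch of the test-time loop implied by the abstract: score the prediction with a pretrained, frozen energy model and take a gradient step on the depth model to lower that energy. Interfaces and the plain SGD update are assumptions; training the energy model with adversarial perturbations is not shown here.

```python
import torch

def eta_step(depth_model, energy_model, image, sparse_depth, lr=1e-5):
    """One test-time adaptation step in the spirit of ETA: low energy
    means the prediction looks in-distribution with respect to the
    source data, so we descend the energy at test time."""
    pred = depth_model(image, sparse_depth)   # dense depth prediction
    energy = energy_model(pred).mean()        # scalar score of the prediction
    depth_model.zero_grad()
    energy.backward()
    with torch.no_grad():
        for p in depth_model.parameters():
            if p.grad is not None:
                p -= lr * p.grad              # plain SGD update at test time
    return pred.detach(), energy.item()
```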

[117] Fast Motion Estimation and Context-Aware Refinement for Efficient Bayer-Domain Video Vision

Haichao Wang, Xinyue Xi, Jiangtao Wen, Yuxing Han

Main category: cs.CV

TL;DR: The paper proposes an efficient video computer vision system by eliminating the image signal processor and using Bayer-format data directly, along with a fast block matching-based motion estimation algorithm and context-aware refinement, achieving significant speedup with minimal performance loss.

DetailsMotivation: Existing methods fail to fully reduce temporal redundancy and overlook front-end computation overhead in video computer vision systems.

Method: The system removes the image signal processor, uses Bayer-format data directly, employs a fast block matching-based motion estimation algorithm with MV refinement, and introduces a context-aware block refinement network. A frame selection strategy balances accuracy and efficiency.

Result: Experiments show the method achieves significant acceleration with only slight performance degradation across multiple video computer vision tasks.

Conclusion: The proposed system effectively reduces computation overhead and temporal redundancy, offering a practical solution for efficient video computer vision.

Abstract: The efficiency of video computer vision systems remains a challenging task due to the high temporal redundancy inside a video. Existing works have been proposed for efficient video computer vision. However, they do not fully reduce temporal redundancy and neglect the front-end computation overhead. In this paper, we propose an efficient video computer vision system. First, the image signal processor is removed and Bayer-format data is directly fed into video computer vision models, thus saving front-end computation. Second, instead of optical flow models and video codecs, a fast block matching-based motion estimation algorithm is proposed specifically for efficient video computer vision, with an MV refinement module. To correct residual errors, a context-aware block refinement network is introduced to refine regions with large error. To further balance accuracy and efficiency, a frame selection strategy is employed. Experiments on multiple video computer vision tasks demonstrate that our method achieves significant acceleration with only slight performance loss.
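
For concreteness, a baseline exhaustive block-matching motion estimator with a sum-of-absolute-differences (SAD) cost, the classical starting point for the fast algorithm described; the paper's speedups, MV refinement, and Bayer-domain handling are not reproduced here.

```python
import numpy as np

def block_match(prev, curr, block=16, search=8):
    """For each block in the current frame, find the offset in a small
    search window of the previous frame that minimizes the SAD cost.

    prev, curr: (H, W) grayscale frames as numpy arrays
    returns:    (H//block, W//block, 2) motion vectors (dy, dx)
    """
    h, w = curr.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=np.int32)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tgt = curr[by:by + block, bx:bx + block].astype(np.int32)
            best, best_mv = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        ref = prev[y:y + block, x:x + block].astype(np.int32)
                        sad = np.abs(tgt - ref).sum()
                        if best is None or sad < best:
                            best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs
```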

[118] ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge

Juewen Hu, Yexin Li, Jiulin Li, Shuo Chen, Pring Wong

Main category: cs.CV

TL;DR: A novel multimodal emotion recognition framework is proposed, leveraging pre-trained models and innovative fusion strategies to improve performance on the MER2025-SEMI dataset.

DetailsMotivation: Enhancing human-computer interaction by addressing data scarcity and improving emotion recognition accuracy.

Method: Uses dual-branch visual encoders, context-enriched textual methods, and a fusion strategy with self-attention and residual connections. Noisy labels are refined via multi-source labeling.

Result: Achieves a weighted F-score of 87.49%, outperforming the baseline of 78.63%.

Conclusion: The framework is effective for multimodal emotion recognition, validated by significant performance gains.

Abstract: Emotion recognition plays a vital role in enhancing human-computer interaction. In this study, we tackle the MER-SEMI challenge of the MER2025 competition by proposing a novel multimodal emotion recognition framework. To address the issue of data scarcity, we leverage large-scale pre-trained models to extract informative features from visual, audio, and textual modalities. Specifically, for the visual modality, we design a dual-branch visual encoder that captures both global frame-level features and localized facial representations. For the textual modality, we introduce a context-enriched method that employs large language models to enrich emotional cues within the input text. To effectively integrate these multimodal features, we propose a fusion strategy comprising two key components, i.e., self-attention mechanisms for dynamic modality weighting, and residual connections to preserve original representations. Beyond architectural design, we further refine noisy labels in the training set by a multi-source labeling strategy. Our approach achieves a substantial performance improvement over the official baseline on the MER2025-SEMI dataset, attaining a weighted F-score of 87.49% compared to 78.63%, thereby validating the effectiveness of the proposed framework.
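
A sketch of the described fusion pattern: stack per-modality features as a short token sequence, weight them dynamically with self-attention, and keep a residual path to the original representations. Dimensions and layer choices are illustrative, not the ECMF architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Self-attention over modality tokens with a residual connection,
    in the spirit of the fusion strategy described."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, audio, text):
        """Each input: (B, dim) pooled features of one modality."""
        tokens = torch.stack([visual, audio, text], dim=1)  # (B, 3, dim)
        attended, _ = self.attn(tokens, tokens, tokens)     # dynamic weighting
        fused = self.norm(tokens + attended)                # residual preserves originals
        return fused.mean(dim=1)                            # (B, dim) fused feature
```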

[119] EvoMakeup: High-Fidelity and Controllable Makeup Editing with MakeupQuad

Huadong Wu, Yi Fu, Yunhao Li, Yuan Gao, Kang Du

Main category: cs.CV

TL;DR: The paper introduces MakeupQuad, a dataset for facial makeup editing, and EvoMakeup, a framework that improves makeup transfer quality by iterative training, outperforming existing methods.

DetailsMotivation: Existing methods for facial makeup editing produce low-quality results due to lack of structured paired data and struggle with identity and makeup fidelity.

Method: Proposes MakeupQuad dataset and EvoMakeup framework, using multi-stage distillation for iterative improvement of data and model quality.

Result: EvoMakeup outperforms prior methods on real-world benchmarks, achieving high-fidelity, controllable makeup editing with superior identity preservation.

Conclusion: The method effectively balances makeup fidelity and identity preservation, supporting diverse editing tasks within a single model.

Abstract: Facial makeup editing aims to realistically transfer makeup from a reference to a target face. Existing methods often produce low-quality results with coarse makeup details and struggle to preserve both identity and makeup fidelity, mainly due to the lack of structured paired data – where source and result share identity, and reference and result share identical makeup. To address this, we introduce MakeupQuad, a large-scale, high-quality dataset with non-makeup faces, references, edited results, and textual makeup descriptions. Building on this, we propose EvoMakeup, a unified training framework that mitigates image degradation during multi-stage distillation, enabling iterative improvement of both data and model quality. Although trained solely on synthetic data, EvoMakeup generalizes well and outperforms prior methods on real-world benchmarks. It supports high-fidelity, controllable, multi-task makeup editing – including full-face and partial reference-based editing, as well as text-driven makeup editing – within a single model. Experimental results demonstrate that our method achieves superior makeup fidelity and identity preservation, effectively balancing both aspects. Code and dataset will be released upon acceptance.

[120] MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Jun Feng, Zixin Wang, Zhentao Zhang, Yue Guo, Zhihan Zhou, Xiuyi Chen, Zhenyang Li, Dawei Yin

Main category: cs.CV

TL;DR: MathReal introduces a dataset of 2,000 real-world K-12 math questions with images to evaluate MLLMs, revealing their limitations in authentic educational scenarios.

DetailsMotivation: Existing benchmarks for MLLMs use clean inputs, lacking real-world educational data. MathReal fills this gap with realistic images.

Method: Curated 2,000 questions with mobile-captured images, classified into 3 main and 14 subcategories, spanning 5 knowledge areas and 3 difficulty levels.

Result: MLLMs struggle with real-world educational contexts, showing significant challenges in problem-solving.

Conclusion: The study highlights MLLMs’ limitations in real-world math reasoning and suggests future improvements.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, which encompass three question types and are divided into three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on this, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvements. Data and code: https://github.com/junfeng0288/MathReal.

[121] ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors

Minsu Kim, Subin Jeon, In Cho, Mijin Yoo, Seon Joo Kim

Main category: cs.CV

TL;DR: A 3DGS-based pipeline improves novel view synthesis by generating additional training views and refining results with video diffusion priors, outperforming existing methods on challenging scenes.

Motivation: Existing 3DGS methods struggle with artifacts and missing regions when rendering from viewpoints outside the training trajectory, limiting seamless scene exploration.

Method: Proposes a pipeline with an information-gain-driven virtual camera placement strategy and video diffusion priors to refine rendered results, followed by fine-tuning 3D Gaussians.

Result: Outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints.

Conclusion: The proposed method enhances reconstruction quality and scene exploration, validated by the Wild-Explore benchmark.

Abstract: Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering from viewpoints that deviate from the training trajectory, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints. https://exploregs.github.io
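
The camera-placement idea lends itself to a compact sketch: greedily pick virtual viewpoints that observe the most under-covered parts of the scene. The snippet below is a deliberately simplified stand-in for the paper's information-gain criterion, using distance-based visibility over a point cloud; the scoring rule, radius, and data layout are all illustrative assumptions.

```python
import numpy as np

def greedy_camera_placement(candidates, scene_points, seen_counts,
                            fov_radius=2.0, num_cams=5):
    """Greedily pick virtual camera positions that see the most
    under-observed scene points. `candidates`: (M, 3) candidate positions;
    `scene_points`: (N, 3); `seen_counts`: (N,) views per point so far.
    Distance-based visibility is a crude stand-in for true information gain."""
    counts = seen_counts.astype(float).copy()
    chosen = []
    for _ in range(num_cams):
        best_gain, best_idx = -1.0, 0
        for i, cam in enumerate(candidates):
            visible = np.linalg.norm(scene_points - cam, axis=1) < fov_radius
            gain = np.sum(1.0 / (1.0 + counts[visible]))  # under-seen points weigh more
            if gain > best_gain:
                best_gain, best_idx = gain, i
        cam = candidates[best_idx]
        counts[np.linalg.norm(scene_points - cam, axis=1) < fov_radius] += 1
        chosen.append(cam)
    return chosen

cands = np.random.uniform(-5, 5, size=(64, 3))     # candidate virtual poses
points = np.random.uniform(-5, 5, size=(2000, 3))  # coarse scene geometry
views = np.random.randint(0, 3, size=2000)         # how often each point was seen
print(len(greedy_camera_placement(cands, points, views)))  # 5
```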

[122] Improved Sub-Visible Particle Classification in Flow Imaging Microscopy via Generative AI-Based Image Synthesis

Utku Ozbulak, Michaela Cohrs, Hristo L. Svilenov, Joris Vankerschaver, Wesley De Neve

Main category: cs.CV

TL;DR: A diffusion model is developed to generate high-fidelity images for addressing data imbalance in sub-visible particle analysis, improving multi-class deep neural network training.

Motivation: The scarcity and imbalance of particle type data hinder effective multi-class classification, especially for rare particle types like silicone oil and air bubbles.

Method: A state-of-the-art diffusion model generates synthetic images to augment training datasets, validated by visual and structural similarity to real images.

Result: Experiments on 500,000 protein particle images show improved classification performance with negligible downsides.

Conclusion: The approach enhances classification, and the models are publicly released for open research and reproducibility.

Abstract: Sub-visible particle analysis using flow imaging microscopy combined with deep learning has proven effective in identifying particle types, enabling the distinction of harmless components such as silicone oil from protein particles. However, the scarcity of available data and severe imbalance between particle types within datasets remain substantial hurdles when applying multi-class classifiers to such problems, often forcing researchers to rely on less effective methods. The aforementioned issue is particularly challenging for particle types that appear unintentionally and in lower numbers, such as silicone oil and air bubbles, as opposed to protein particles, where obtaining large numbers of images through controlled settings is comparatively straightforward. In this work, we develop a state-of-the-art diffusion model to address data imbalance by generating high-fidelity images that can augment training datasets, enabling the effective training of multi-class deep neural networks. We validate this approach by demonstrating that the generated samples closely resemble real particle images in terms of visual quality and structure. To assess the effectiveness of using diffusion-generated images in training datasets, we conduct large-scale experiments on a validation dataset comprising 500,000 protein particle images and demonstrate that this approach improves classification performance with negligible downsides. Finally, to promote open research and reproducibility, we publicly release both our diffusion models and the trained multi-class deep neural network classifiers, along with a straightforward interface for easy integration into future studies, at https://github.com/utkuozbulak/svp-generative-ai.
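
The augmentation loop itself reduces to topping up minority classes with generated samples until the class counts are level. A minimal sketch, assuming a hypothetical `sample_class_conditional(label, n)` wrapper around a trained class-conditional diffusion sampler (not the authors' released models):

```python
from collections import Counter

def rebalance_with_synthetic(images, labels, sample_class_conditional):
    """Top up minority classes (e.g., silicone oil, air bubbles) with
    generated images until every class matches the majority count.
    `sample_class_conditional(label, n)` is a hypothetical wrapper around
    a trained class-conditional diffusion sampler."""
    counts = Counter(labels)
    target = max(counts.values())  # majority class, e.g., protein particles
    for label, n in counts.items():
        deficit = target - n
        if deficit > 0:
            images.extend(sample_class_conditional(label, deficit))
            labels.extend([label] * deficit)
    return images, labels
```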

[123] Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts

Kiran Chhatre, Christopher Peters, Srikrishna Karanam

Main category: cs.CV

TL;DR: Spectrum is a unified network for detailed human parsing, leveraging a fine-tuned Image-to-Texture diffusion model to improve alignment with body parts and clothing, outperforming baselines in segmentation tasks.

Motivation: Existing methods lack fine-grained clothing and body part distinctions, and open-vocabulary segmentation often groups humans into a single category, missing detailed parsing.

Method: Spectrum repurposes an Image-to-Texture diffusion model, fine-tuned on 3D human texture maps, to extract features and generate semantically valid masks via prompt-guided grounding.

Result: Spectrum consistently outperforms baselines in cross-dataset experiments for body parts, clothing, unseen categories, and full-body masks.

Conclusion: Spectrum effectively addresses the limitations of existing methods by providing detailed, prompt-based semantic segmentation for human parsing.

Abstract: Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model – obtained by fine-tuning a T2I model on 3D human texture maps – for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments – separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks – and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.

[124] InstantEdit: Text-Guided Few-Step Image Editing with Piecewise Rectified Flow

Yiming Gong, Zhen Zhu, Minjia Zhang

Main category: cs.CV

TL;DR: InstantEdit is a fast text-guided image editing method using RectifiedFlow, featuring PerRFI inversion, Inversion Latent Injection, Disentangled Prompt Guidance, and ControlNet for better results.

Motivation: To enable fast and precise text-guided image editing while preserving critical content and following textual instructions closely.

Method: Leverages RectifiedFlow with PerRFI inversion, Inversion Latent Injection, Disentangled Prompt Guidance, and Canny-conditioned ControlNet.

Result: Achieves better qualitative and quantitative results on the PIE dataset compared to state-of-the-art methods.

Conclusion: InstantEdit is efficient and outperforms existing few-step editing techniques.

Abstract: We propose a fast text-guided image editing method called InstantEdit based on the RectifiedFlow framework, which is structured as a few-step editing process that preserves critical content while closely following textual instructions. Our approach leverages the straight sampling trajectories of RectifiedFlow by introducing a specialized inversion strategy called PerRFI. To maintain consistent yet editable results for the RectifiedFlow model, we further propose a novel regeneration method, Inversion Latent Injection, which effectively reuses latent information obtained during inversion to facilitate more coherent and detailed regeneration. Additionally, we propose a Disentangled Prompt Guidance technique to balance editability with detail preservation, and integrate a Canny-conditioned ControlNet to incorporate structural cues and suppress artifacts. Evaluation on the PIE image editing dataset demonstrates that InstantEdit is not only fast but also achieves better qualitative and quantitative results compared to state-of-the-art few-step editing methods.

[125] More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference Alignment

Jun Xie, Yingjian Zhu, Feng Chen, Zhenghao Zhang, Xiaohui Fan, Hongzhu Yi, Xinming Wang, Chen Yu, Yue Bi, Zhaoran Zhao, Xiongjun Guan, Zhepeng Wang

Main category: cs.CV

TL;DR: A semi-supervised learning framework for emotion recognition, combining diverse input modalities and pseudo-labeling, achieving 2nd place in MER2025-SEMI with an F1-score of 0.8772.

Motivation: To address the challenge of semi-supervised learning in emotion recognition by leveraging diverse modalities and unlabeled data effectively.

Method: Integrates multiple input modalities as experts, uses consensus-based pseudo-labeling, and employs a two-stage training paradigm with a multi-expert voting ensemble.

Result: Achieved an F1-score of 0.8772 on the MER2025-SEMI test set, ranking 2nd.

Conclusion: The proposed framework effectively combines diverse modalities and pseudo-labeling for robust semi-supervised emotion recognition.

Abstract: In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that “more is better,” to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
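
The consensus-based pseudo-labeling step can be sketched as a simple agreement filter; the confidence threshold and the two predictor callables below are illustrative assumptions, not the challenge submission's actual interfaces.

```python
def consensus_pseudo_labels(clips, baseline_predict, gemini_predict,
                            min_confidence=0.8):
    """Keep an unlabeled clip only when the baseline model and Gemini
    agree on the emotion and the baseline is confident enough.
    `baseline_predict(clip)` -> (label, confidence); `gemini_predict(clip)`
    -> label. Both callables and the 0.8 threshold are assumptions."""
    pseudo_labeled = []
    for clip in clips:
        label, confidence = baseline_predict(clip)
        if label == gemini_predict(clip) and confidence >= min_confidence:
            pseudo_labeled.append((clip, label))
    return pseudo_labeled
```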

[126] Fourier-VLM: Compressing Vision Tokens in the Frequency Domain for Large Vision-Language Models

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

Main category: cs.CV

TL;DR: Fourier-VLM compresses visual representations in the frequency domain to reduce computational overhead and latency in Vision-Language Models (VLMs) without compromising performance.

Motivation: The large number of vision tokens in VLMs increases context length, causing high computational overhead and latency. Existing methods either compromise performance or add extra costs.

Method: Proposes Fourier-VLM, which compresses visual features using a low-pass filter via 2D Discrete Cosine Transform (DCT), computed efficiently via Fast Fourier Transform (FFT).

Result: Achieves competitive performance, reduces inference FLOPs by up to 83.8%, and boosts generation speed by 31.2% compared to LLaVA-v1.5.

Conclusion: Fourier-VLM offers superior efficiency and practicality for VLMs without additional parameters or significant performance trade-offs.

Abstract: Vision-Language Models (VLMs) typically replace the predefined image placeholder token in textual instructions with visual features from an image encoder, forming the input to a backbone Large Language Model (LLM). However, the large number of vision tokens significantly increases the context length, leading to high computational overhead and inference latency. While previous efforts mitigate this by selecting only important visual features or leveraging learnable queries to reduce token count, they often compromise performance or introduce substantial extra costs. In response, we propose Fourier-VLM, a simple yet efficient method that compresses visual representations in the frequency domain. Our approach is motivated by the observation that vision features output from the vision encoder exhibit concentrated energy in low-frequency components. Leveraging this, we apply a low-pass filter to the vision features using a two-dimensional Discrete Cosine Transform (DCT). Notably, the DCT is efficiently computed via the Fast Fourier Transform (FFT) operator with a time complexity of $\mathcal{O}(n\log n)$, minimizing the extra computational cost while introducing no additional parameters. Extensive experiments across various image-based benchmarks demonstrate that Fourier-VLM achieves competitive performance with strong generalizability across both LLaVA and Qwen-VL architectures. Crucially, it reduces inference FLOPs by up to 83.8% and boosts generation speed by 31.2% compared to LLaVA-v1.5, highlighting its superior efficiency and practicality.
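
The core compression step is easy to reproduce: treat the vision tokens as a spatial grid, take a 2D DCT across the grid, keep only the low-frequency block, and invert. A minimal NumPy/SciPy sketch, assuming a square token grid and an orthonormal transform (the paper's exact filter design may differ):

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_vision_tokens(token_grid, keep):
    """Frequency-domain low-pass compression of vision tokens.
    token_grid: (h, w, d) vision features; keep: side length of the
    retained low-frequency block. Returns a (keep, keep, d) grid."""
    coeffs = dctn(token_grid, axes=(0, 1), norm="ortho")  # 2D DCT per channel
    low = coeffs[:keep, :keep, :]                         # top-left = low frequencies
    return idctn(low, axes=(0, 1), norm="ortho")          # back to a smaller grid

grid = np.random.randn(24, 24, 1024).astype(np.float32)  # 576 tokens, LLaVA-like
compressed = compress_vision_tokens(grid, keep=12)       # 576 -> 144 tokens
print(compressed.shape)  # (12, 12, 1024)
```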

[127] NEP: Autoregressive Image Editing via Next Editing Token Prediction

Huimin Wu, Xiaojian Ma, Haozhe Zhao, Yanpeng Zhao, Qing Li

Main category: cs.CV

TL;DR: The paper proposes Next Editing-token Prediction (NEP) for text-guided image editing, focusing only on regions needing edits to reduce computational costs and improve edit quality.

Motivation: Existing methods regenerate entire images, leading to unnecessary costs and compromised edit quality by reconstructing non-edited areas.

Method: Formulates editing as NEP using autoregressive image generation and pre-trains an any-order autoregressive text-to-image model for zero-shot editing.

Result: Achieves state-of-the-art performance on benchmarks and supports test-time scaling (TTS) for iterative refinement.

Conclusion: NEP offers efficient, high-quality image editing by selectively regenerating only necessary regions, outperforming existing approaches.

Abstract: Text-guided image editing involves modifying a source image based on a language instruction and, typically, requires changes to only small local regions. However, existing approaches generate the entire target image rather than selectively regenerate only the intended editing areas. This results in (1) unnecessary computational costs and (2) a bias toward reconstructing non-editing regions, which compromises the quality of the intended edits. To resolve these limitations, we propose to formulate image editing as Next Editing-token Prediction (NEP) based on autoregressive image generation, where only regions that need to be edited are regenerated, thus avoiding unintended modification to the non-editing areas. To enable any-region editing, we propose to pre-train an any-order autoregressive text-to-image (T2I) model. Once trained, it is capable of zero-shot image editing and can be easily adapted to NEP for image editing, which achieves a new state-of-the-art on widely used image editing benchmarks. Moreover, our model naturally supports test-time scaling (TTS) through iteratively refining its generation in a zero-shot manner. The project page is: https://nep-bigai.github.io/

[128] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning

Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, Xiongkuo Min

Main category: cs.CV

TL;DR: VQAThinker is a reasoning-based VQA framework using LMMs and reinforcement learning to improve generalization and explainability in video quality assessment.

Motivation: Existing VQA models struggle with poor generalization to OOD videos and limited explainability, limiting real-world applicability.

Method: Uses GRPO, a rule-guided reinforcement learning algorithm, with three VQA-specific rewards: bell-shaped regression, pairwise ranking, and temporal consistency.

Result: Achieves state-of-the-art performance on in-domain and OOD benchmarks, with superior distortion attribution and quality description.

Conclusion: Reinforcement learning with score-level supervision can build generalizable and explainable VQA models.

Abstract: Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a bell-shaped regression reward that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a pairwise ranking reward that guides the model to correctly determine the relative quality between video pairs; and (3) a temporal consistency reward that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.
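
The three rewards follow directly from their descriptions; the specific functional forms below (e.g., the Gaussian width of the bell curve) are illustrative assumptions rather than the paper's constants.

```python
import math

def bell_regression_reward(pred, gt, sigma=5.0):
    """Bell-shaped reward: near 1 at the ground-truth score, decaying
    smoothly (and ever less sensitively) as the error grows."""
    return math.exp(-((pred - gt) ** 2) / (2.0 * sigma ** 2))

def pairwise_ranking_reward(pred_a, pred_b, gt_a, gt_b):
    """1 when the predicted ordering of a video pair matches the
    ground-truth ordering, else 0."""
    return float((pred_a - pred_b) * (gt_a - gt_b) > 0)

def temporal_consistency_reward(score_coherent, score_perturbed):
    """1 when the model scores the temporally coherent video above its
    temporally perturbed counterpart."""
    return float(score_coherent > score_perturbed)
```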

[129] LV-Net: Anatomy-aware lateral ventricle shape modeling with a case study on Alzheimer’s disease, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing

Wonjung Park, Suhyun Ahn, Jinah Park

Main category: cs.CV

TL;DR: LV-Net is a framework for 3D LV mesh reconstruction from MRI, improving accuracy and robustness by using a joint LV-hippocampus template. It enhances shape analysis for neurological diseases like Alzheimer’s.

Motivation: Lateral ventricle (LV) shape analysis is promising for neurological disease biomarkers, but challenges like shape variability and MRI resolution limitations persist.

Method: LV-Net deforms an anatomy-aware joint LV-hippocampus template mesh, incorporating anatomical relationships to reduce segmentation artifacts and improve point correspondence.

Result: LV-Net achieves superior reconstruction accuracy, robust performance despite segmentation issues, and reliable shape descriptors. It identifies Alzheimer’s-associated LV subregions.

Conclusion: LV-Net advances LV shape analysis, offering improved biomarkers for neurological diseases like Alzheimer’s, with publicly available code.

Abstract: Lateral ventricle (LV) shape analysis holds promise as a biomarker for neurological diseases; however, challenges remain due to substantial shape variability across individuals and segmentation difficulties arising from limited MRI resolution. We introduce LV-Net, a novel framework for producing individualized 3D LV meshes from brain MRI by deforming an anatomy-aware joint LV-hippocampus template mesh. By incorporating anatomical relationships embedded within the joint template, LV-Net reduces boundary segmentation artifacts and improves reconstruction robustness. In addition, by classifying the vertices of the template mesh based on their anatomical adjacency, our method enhances point correspondence across subjects, leading to more accurate LV shape statistics. We demonstrate that LV-Net achieves superior reconstruction accuracy, even in the presence of segmentation imperfections, and delivers more reliable shape descriptors across diverse datasets. Finally, we apply LV-Net to Alzheimer’s disease analysis, identifying LV subregions that show significant associations with the disease relative to cognitively normal controls. The code for LV shape modeling is available at https://github.com/PWonjung/LV_Shape_Modeling.

[130] AGI for the Earth, the path, possibilities and how to evaluate intelligence of models that work with Earth Observation Data?

Mojtaba Valipour, Kelly Zheng, James Lowman, Spencer Szabados, Mike Gartner, Bobby Braswell

Main category: cs.CV

TL;DR: The paper advocates for the inclusion of satellite spectral imagery in AGI research, highlighting its potential and current limitations in benchmarks. It proposes a comprehensive benchmark for evaluating Earth observation models.

Motivation: Satellite spectral imagery is underutilized in AGI research despite its potential to enhance understanding of the natural world. Existing benchmarks lack the ability to evaluate generalization in this domain.

Method: The paper reviews current benchmarks and their limitations, then proposes a set of tasks for a more comprehensive benchmark to assess Earth observation models.

Result: The study identifies gaps in existing benchmarks and suggests tasks to better evaluate models’ interaction with Earth observation data.

Conclusion: A more robust benchmark is needed to advance AGI capabilities in Earth observation, and the proposed tasks aim to address this gap.

Abstract: Artificial General Intelligence (AGI) is closer than ever to becoming a reality, sparking widespread enthusiasm in the research community to collect and work with various modalities, including text, image, video, and audio. Despite recent efforts, satellite spectral imagery, as an additional modality, has yet to receive the attention it deserves. This area presents unique challenges, but also holds great promise in advancing the capabilities of AGI in understanding the natural world. In this paper, we argue why Earth Observation data is useful for an intelligent model, and then we review existing benchmarks and highlight their limitations in evaluating the generalization ability of foundation models in this domain. This paper emphasizes the need for a more comprehensive benchmark to evaluate Earth observation models. To facilitate this, we propose a comprehensive set of tasks that a benchmark should encompass to effectively assess a model’s ability to understand and interact with Earth observation data.

[131] Lightweight Quad Bayer HybridEVS Demosaicing via State Space Augmented Cross-Attention

Shiyang Zhou, Haijin Zeng, Yunfan Lu, Yongyong Chen, Jie Liu, Jingyong Su

Main category: cs.CV

TL;DR: TSANet, a lightweight two-stage network, addresses demosaicing challenges in HybridEVS cameras by separating event pixel inpainting and demosaicing, outperforming state-of-the-art methods with reduced computational costs.

Motivation: Combining Quad Bayer CFA sensors with event pixels in HybridEVS cameras causes aliasing and artifacts, which current methods fail to address efficiently, especially on mobile devices.

Method: TSANet uses a two-stage approach with state space augmented cross-attention and a Cross-Swin State Block for demosaicing, leveraging positional priors and global dependencies with linear complexity.

Result: TSANet achieves superior demosaicing performance on HybridEVS data, outperforming DemosaicFormer in PSNR and SSIM while reducing parameters and computation costs significantly.

Conclusion: TSANet offers an efficient solution for mobile device demosaicing, demonstrating better performance and lower resource usage than existing methods.

Abstract: Event cameras like the Hybrid Event-based Vision Sensor (HybridEVS) camera capture brightness changes as asynchronous “events” instead of frames, offering advanced applications in mobile photography. However, challenges arise from combining a Quad Bayer Color Filter Array (CFA) sensor with event pixels lacking color information, resulting in aliasing and artifacts in the demosaicing process before downstream applications. Current methods struggle to address these issues, especially on resource-limited mobile devices. In response, we introduce TSANet, a lightweight Two-stage network via State space augmented cross-Attention, which can handle event pixel inpainting and demosaicing separately, leveraging the benefits of dividing complex tasks into manageable subtasks. Furthermore, we introduce a lightweight Cross-Swin State Block that uniquely utilizes positional prior for demosaicing and enhances global dependencies through the state space model with linear complexity. In summary, TSANet demonstrates excellent demosaicing performance on both simulated and real data of HybridEVS while maintaining a lightweight model, averaging better results than the previous state-of-the-art method DemosaicFormer across seven diverse datasets in both PSNR and SSIM, while reducing parameter and computation costs by $1.86\times$ and $3.29\times$, respectively. Our approach presents new possibilities for efficient image demosaicing on mobile devices. Code is available in the supplementary materials.

[132] Distribution-Specific Learning for Joint Salient and Camouflaged Object Detection

Chao Hao, Zitong Yu, Xin Liu, Yuhao Wang, Weicheng Xie, Jingang Shi, Huanjing Yue, Jingyu Yang

Main category: cs.CV

TL;DR: The paper proposes SCJoint, a joint learning scheme for Salient Object Detection (SOD) and Camouflaged Object Detection (COD), showing that with proper learning, networks can excel at both tasks. It introduces a saliency-based sampling strategy (SBSS) and trains a generalist network, JoNet, achieving competitive results.

Motivation: Previous works assumed joint learning of SOD and COD would confuse networks, but this paper argues the opposite—proper learning can enhance both tasks.

Method: SCJoint learns task-specific decoding distributions with minimal parameters in a shared network, decoupling contradictory attributes. SBSS balances and improves training data quality.

Result: JoNet, trained with SCJoint and SBSS, performs competitively in both SOD and COD tasks.

Conclusion: Joint learning of SOD and COD is feasible and beneficial with the right approach, as demonstrated by SCJoint and JoNet.

Abstract: Salient object detection (SOD) and camouflaged object detection (COD) are two closely related but distinct computer vision tasks. Although both are class-agnostic segmentation tasks that map from RGB space to binary space, the former aims to identify the most salient objects in the image, while the latter focuses on detecting perfectly camouflaged objects that blend into the background in the image. These two tasks exhibit strong contradictory attributes. Previous works have mostly believed that joint learning of these two tasks would confuse the network, reducing its performance on both tasks. However, here we present an opposite perspective: with the correct approach to learning, the network can simultaneously possess the capability to find both salient and camouflaged objects, allowing both tasks to benefit from joint learning. We propose SCJoint, a joint learning scheme for SOD and COD tasks, assuming that the decoding processes of SOD and COD have different distribution characteristics. The key to our method is to learn the respective means and variances of the decoding processes for both tasks by inserting a minimal amount of task-specific learnable parameters within a fully shared network structure, thereby decoupling the contradictory attributes of the two tasks at a minimal cost. Furthermore, we propose a saliency-based sampling strategy (SBSS) to sample the training set of the SOD task to balance the training set sizes of the two tasks. In addition, SBSS improves the training set quality and shortens the training time. Based on the proposed SCJoint and SBSS, we train a powerful generalist network, named JoNet, which has the ability to simultaneously capture both "salient" and "camouflaged" objects. Extensive experiments demonstrate the competitive performance and effectiveness of our proposed method. The code is available at https://github.com/linuxsino/JoNet.
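
The "minimal task-specific parameters in a fully shared network" idea resembles conditional feature modulation. A small PyTorch sketch under that assumption, with per-task learnable scale/shift standing in for the task-specific variances and means (the paper's exact parameterization may differ):

```python
import torch
import torch.nn as nn

class TaskModulatedBlock(nn.Module):
    """Fully shared conv block with per-task learnable scale/shift
    (task 0 = SOD, task 1 = COD). Only 2*C parameters per task are
    task-specific; everything else is shared between the two tasks."""
    def __init__(self, channels, num_tasks=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.scale = nn.Parameter(torch.ones(num_tasks, channels))   # "variance"
        self.shift = nn.Parameter(torch.zeros(num_tasks, channels))  # "mean"

    def forward(self, x, task_id):
        h = torch.relu(self.conv(x))
        return h * self.scale[task_id].view(1, -1, 1, 1) \
                 + self.shift[task_id].view(1, -1, 1, 1)

block = TaskModulatedBlock(channels=64)
feat = torch.randn(2, 64, 32, 32)
sod_out = block(feat, task_id=0)  # salient-object decoding statistics
cod_out = block(feat, task_id=1)  # camouflaged-object decoding statistics
```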

[133] Can Large Models Fool the Eye? A New Turing Test for Biological Animation

Zijian Chen, Lirong Deng, Zhengyu Chen, Kaiwei Zhang, Qi Jia, Yuan Tian, Yucheng Zhu, Guangtao Zhai

Main category: cs.CV

TL;DR: BioMotion Arena introduces a visual animation framework to evaluate LLMs and MLLMs, highlighting performance gaps via point-light motion patterns.

Motivation: Current benchmarks lack intuitive feedback on model performance differences, prompting the need for a more perceptible evaluation method.

Method: Uses pairwise comparison and point-light imaging to collect 45k votes on 53 models, analyzing biological motion variants.

Result: Crowd-sourced votes align with expert ratings, revealing 90% of models fail basic humanoid motion tasks.

Conclusion: BioMotion Arena serves as a discriminative, ground-truth-free benchmark for visualizing model performance.

Abstract: Evaluating the abilities of large models and manifesting their gaps are challenging. Current benchmarks adopt either ground-truth-based score-form evaluation on static datasets or indistinct textual chatbot-style human preferences collection, which may not provide users with immediate, intuitive, and perceptible feedback on performance differences. In this paper, we introduce BioMotion Arena, a novel framework for evaluating large language models (LLMs) and multimodal large language models (MLLMs) via visual animation. Our methodology draws inspiration from the inherent visual perception of motion patterns characteristic of living organisms, and utilizes point-light source imaging to amplify the performance discrepancies between models. Specifically, we employ a pairwise comparison evaluation and collect more than 45k votes for 53 mainstream LLMs and MLLMs on 90 biological motion variants. Data analyses show that the crowd-sourced human votes are in good agreement with those of expert raters, demonstrating the superiority of our BioMotion Arena in offering discriminative feedback. We also find that over 90% of evaluated models, including the cutting-edge open-source InternVL3 and proprietary Claude-4 series, fail to produce fundamental humanoid point-light groups, much less smooth and biologically plausible motions. This enables BioMotion Arena to serve as a challenging benchmark for performance visualization and a flexible evaluation framework without restrictions on ground-truth.
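
Pairwise votes like these are commonly aggregated into a leaderboard with an Elo-style update; the paper does not spell out its aggregation, so the standard rule below is purely illustrative.

```python
def elo_update(ratings, winner, loser, k=32.0):
    """One Elo update from a single pairwise vote; `ratings` maps a
    model name to its rating (new entrants start at 1000)."""
    ra = ratings.setdefault(winner, 1000.0)
    rb = ratings.setdefault(loser, 1000.0)
    expected = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)

ratings = {}
for winner, loser in [("model_a", "model_b"), ("model_a", "model_c"),
                      ("model_b", "model_c")]:
    elo_update(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```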

[134] Towards MR-Based Trochleoplasty Planning

Michael Wehrli, Alicia Durrer, Paul Friedrich, Sidaty El Hadramy, Edwin Li, Luana Brahaj, Carol C. Hasler, Philippe C. Cattin

Main category: cs.CV

TL;DR: A pipeline for generating super-resolved 3D pseudo-healthy morphologies from clinical MR scans improves Trochlear Dysplasia treatment by enhancing surgical planning and outcomes.

Motivation: Current TD treatments rely on low-resolution MR scans and surgeon intuition, leading to inconsistent results. The goal is to provide high-resolution, patient-specific 3D targets for better surgical planning.

Method: The pipeline involves super-resolving MR scans with INR, segmenting bones with a custom network, and generating pseudo-healthy morphologies using a Wavelet Diffusion Model.

Result: The approach improved sulcus angle and trochlear groove depth in 25 TD patients, offering sub-millimeter resolved 3D shapes without requiring CT scans.

Conclusion: The proposed method enhances TD treatment by providing high-resolution, radiation-free 3D targets for precise surgical planning.

Abstract: To treat Trochlear Dysplasia (TD), current approaches rely mainly on low-resolution clinical Magnetic Resonance (MR) scans and surgical intuition. The surgeries are planned based on surgeons' experience, have limited adoption of minimally invasive techniques, and lead to inconsistent outcomes. We propose a pipeline that generates super-resolved, patient-specific 3D pseudo-healthy target morphologies from conventional clinical MR scans. First, we compute an isotropic super-resolved MR volume using an Implicit Neural Representation (INR). Next, we segment femur, tibia, patella, and fibula with a multi-label custom-trained network. Finally, we train a Wavelet Diffusion Model (WDM) to generate pseudo-healthy target morphologies of the trochlear region. In contrast to prior work producing pseudo-healthy low-resolution 3D MR images, our approach enables the generation of sub-millimeter resolved 3D shapes suitable for pre- and intraoperative use. These can serve as preoperative blueprints for reshaping the femoral groove while preserving the native patella articulation. Furthermore, and in contrast to other work, we do not require a CT for our pipeline, reducing the amount of radiation. We evaluated our approach on 25 TD patients and showed that our target morphologies significantly improve the sulcus angle (SA) and trochlear groove depth (TGD). The code and interactive visualization are available at https://wehrlimi.github.io/sr-3d-planning/.

[135] DreamVE: Unified Instruction-based Image and Video Editing

Bin Xia, Jiyang Liu, Yuechen Zhang, Bohao Peng, Ruihang Chu, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: DreamVE is a unified model for instruction-based image and video editing, trained in two stages (image then video) and using diverse data synthesis pipelines for improved performance and flexibility.

Motivation: Instruction-based editing is limited by scarce training data, especially for video, hindering practical use. DreamVE addresses this by leveraging scalable image data and diverse synthesis methods.

Method: A two-stage training strategy (image then video) with collage-based and generative model-based data synthesis pipelines. Uses a token concatenation with early drop approach for consistency and editability.

Result: DreamVE achieves strong performance in key editing types and enhances generalization, though collage-based data lacks some attribute editing cases. Generative model-based fine-tuning addresses this.

Conclusion: DreamVE offers a scalable, flexible solution for instruction-based editing, unifying image and video tasks with efficient training and diverse data synthesis.

Abstract: Instruction-based editing holds vast potential due to its simple and efficient interactive editing format. However, instruction-based editing, particularly for video, has been constrained by limited training data, hindering its practical application. To this end, we introduce DreamVE, a unified model for instruction-based image and video editing. Specifically, we propose a two-stage training strategy: first image editing, then video editing. This offers two main benefits: (1) Image data scales more easily, and models are more efficient to train, providing useful priors for faster and better video editing training. (2) Unifying image and video generation is natural and aligns with current trends. Moreover, we present comprehensive training data synthesis pipelines, including collage-based and generative model-based data synthesis. The collage-based data synthesis combines foreground objects and backgrounds to generate diverse editing data, such as object manipulation, background changes, and text modifications. It can easily generate billions of accurate, consistent, realistic, and diverse editing pairs. We pretrain DreamVE on extensive collage-based data to achieve strong performance in key editing types and enhance generalization and transfer capabilities. However, collage-based data lacks some attribute editing cases, leading to a relative drop in performance. In contrast, the generative model-based pipeline, despite being hard to scale up, offers flexibility in handling attribute editing cases. Therefore, we use generative model-based data to further fine-tune DreamVE. In addition, we design an efficient and powerful editing framework for DreamVE. We build on the SOTA T2V model and use a token concatenation with early drop approach to inject source image guidance, ensuring strong consistency and editability. The code and models will be released.

[136] SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment

Yanxiao Sun, Jiafu Wu, Yun Cao, Chengming Xu, Yabiao Wang, Weijian Cao, Donghao Luo, Chengjie Wang, Yanwei Fu

Main category: cs.CV

TL;DR: SwiftVideo is a unified distillation framework combining trajectory-preserving and distribution-matching strategies to accelerate video generation while maintaining quality.

Motivation: Existing distillation methods for video synthesis suffer from performance breakdown or artifacts under few-step settings, limiting efficiency.

Method: SwiftVideo uses continuous-time consistency distillation for ODE trajectory preservation and dual-perspective alignment (distribution and trajectory alignment).

Result: The method outperforms existing approaches in few-step video generation on the OpenVid-1M benchmark.

Conclusion: SwiftVideo effectively balances speed and quality in video synthesis, addressing limitations of prior methods.

Abstract: Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps, which incurs substantial computational overhead. While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. Subsequently, we propose a dual-perspective alignment that includes distribution alignment between synthetic and real data along with trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.

[137] AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance

Weichen Zhang, Zhui Zhu, Ningbo Li, Kebin Liu, Yunhao Liu

Main category: cs.CV

TL;DR: AdaptInfer is a plug-and-play framework for adaptive vision token pruning in VLMs, reducing CUDA latency by 61.3% while maintaining high accuracy.

Motivation: Existing pruning methods for VLMs fail to exploit dynamic internal signals during inference, leading to inefficiencies.

Method: Introduces a dynamic text-guided pruning mechanism and a principled pruning schedule based on cross-modal attention shifts.

Result: Reduces CUDA latency by 61.3% with 92.9% accuracy on LLaVA-1.5-7B, outperforming SOTA under the same token budget.

Conclusion: AdaptInfer is effective, lightweight, and generalizable for multi-modal tasks.

Abstract: Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering (VQA), but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight and plug-and-play, also generalizable across multi-modal tasks. Experimental results have verified the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 92.9% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.
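
The scoring mechanism can be sketched in a few lines: read a soft prior over text tokens off the text-to-text attention, use it to weight the attention each vision token receives, and keep the top-k. The PyTorch snippet below is a simplified single-layer illustration; the paper's layer-wise priors and pruning schedule are more involved.

```python
import torch

def prune_vision_tokens(vision_tokens, t2t_attn, t2v_attn, keep_ratio=0.25):
    """vision_tokens: (Nv, d); t2t_attn: (Nt, Nt) text self-attention;
    t2v_attn: (Nt, Nv) text-to-vision attention. Text-token importance
    (column mass of the text self-attention) serves as a soft prior that
    weights the attention each vision token receives."""
    text_importance = t2t_attn.mean(dim=0)          # (Nt,) soft prior
    scores = text_importance @ t2v_attn             # (Nv,) vision-token scores
    keep = max(1, int(keep_ratio * vision_tokens.size(0)))
    idx = scores.topk(keep).indices.sort().values   # preserve original order
    return vision_tokens[idx], idx

tokens = torch.randn(576, 4096)                     # LLaVA-1.5-like token count
t2t = torch.softmax(torch.randn(32, 32), dim=-1)
t2v = torch.softmax(torch.randn(32, 576), dim=-1)
kept, kept_idx = prune_vision_tokens(tokens, t2t, t2v)
print(kept.shape)  # torch.Size([144, 4096])
```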

[138] Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation

Yachun Mi, Yu Li, Yanting Li, Shixin Sun, Chen Hui, Tong Zhang, Yuanyuan Liu, Chenyue Song, Shaohui Liu

Main category: cs.CV

TL;DR: Q-CLIP is a Vision-Language Model (VLM)-based framework for Video Quality Assessment (VQA) that reduces computational costs and enhances sensitivity to quality variations using a Shared Cross-Modal Adapter and learnable prompts.

Motivation: Current VQA methods rely on pretraining on large datasets, which is computationally expensive and insufficient for capturing video quality factors like semantics, distortion, motion, and aesthetics. VLMs offer a promising alternative due to their generalization capabilities.

Method: Q-CLIP uses a Shared Cross-Modal Adapter (SCMA) with minimal trainable parameters and introduces learnable quality-level prompts. It also explores frame-difference-based sampling for better generalization.

Result: Q-CLIP achieves excellent performance on multiple VQA datasets while significantly reducing computational costs.

Conclusion: Q-CLIP demonstrates the potential of VLMs for efficient and accurate VQA, addressing the limitations of traditional pretraining methods.

Abstract: Accurate and efficient Video Quality Assessment (VQA) has long been a key research challenge. Current mainstream VQA methods typically improve performance by pretraining on large-scale classification datasets (e.g., ImageNet, Kinetics-400), followed by fine-tuning on VQA datasets. However, this strategy presents two significant challenges: (1) merely transferring semantic knowledge learned from pretraining is insufficient for VQA, as video quality depends on multiple factors (e.g., semantics, distortion, motion, aesthetics); (2) pretraining on large-scale datasets demands enormous computational resources, often dozens or even hundreds of times greater than training directly on VQA datasets. Recently, Vision-Language Models (VLMs) have shown remarkable generalization capabilities across a wide range of visual tasks, and have begun to demonstrate promising potential in quality assessment. In this work, we propose Q-CLIP, the first fully VLMs-based framework for VQA. Q-CLIP enhances both visual and textual representations through a Shared Cross-Modal Adapter (SCMA), which contains only a minimal number of trainable parameters and is the only component that requires training. This design significantly reduces computational cost. In addition, we introduce a set of five learnable quality-level prompts to guide the VLMs in perceiving subtle quality variations, thereby further enhancing the model’s sensitivity to video quality. Furthermore, we investigate the impact of different frame sampling strategies on VQA performance, and find that frame-difference-based sampling leads to better generalization performance across datasets. Extensive experiments demonstrate that Q-CLIP exhibits excellent performance on several VQA datasets.
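
Frame-difference-based sampling is straightforward to sketch: score each frame by how much it changes from its predecessor and keep the frames after the largest changes. A minimal NumPy illustration (the clip length and keep count are arbitrary):

```python
import numpy as np

def frame_difference_sampling(frames, num_keep=8):
    """Keep the first frame plus the frames that follow the num_keep-1
    largest frame-to-frame changes. frames: (T, H, W, C) uint8 array."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    after_changes = np.argsort(diffs)[-(num_keep - 1):] + 1
    return np.sort(np.concatenate(([0], after_changes)))

video = np.random.randint(0, 256, size=(64, 224, 224, 3), dtype=np.uint8)
print(frame_difference_sampling(video, num_keep=8))  # 8 frame indices
```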

[139] E-React: Towards Emotionally Controlled Synthesis of Human Reactions

Chen Zhu, Buzhen Huang, Zijing Wu, Binghui Zuo, Yangang Wang

Main category: cs.CV

TL;DR: The paper introduces a method for generating diverse human reaction motions based on emotional cues, addressing the gap in existing frameworks that ignore emotions. It uses a semi-supervised emotion prior in an actor-reactor diffusion model for realistic reaction synthesis.

Motivation: Existing motion generation frameworks lack emotional consideration, reducing naturalness and limiting interactive applications like human reaction synthesis.

Method: A semi-supervised emotion prior is integrated into an actor-reactor diffusion model, leveraging shared emotions in short motion sequences for training.

Result: The model generates realistic reactions under various emotional conditions, outperforming existing methods.

Conclusion: The approach successfully addresses the challenge of emotion-driven reaction synthesis, with potential for broader interactive applications.

Abstract: Emotion serves as an essential component in daily human interactions. Existing human motion generation frameworks do not consider the impact of emotions, which reduces naturalness and limits their application in interactive tasks, such as human reaction synthesis. In this work, we introduce a novel task: generating diverse reaction motions in response to different emotional cues. However, learning emotion representation from limited motion data and incorporating it into a motion generation framework remains a challenging problem. To address the above obstacles, we introduce a semi-supervised emotion prior in an actor-reactor diffusion model to facilitate emotion-driven reaction synthesis. Specifically, based on the observation that motion clips within a short sequence tend to share the same emotion, we first devise a semi-supervised learning framework to train an emotion prior. With this prior, we further train an actor-reactor diffusion model to generate reactions by considering both spatial interaction and emotional response. Finally, given a motion sequence of an actor, our approach can generate realistic reactions under various emotional conditions. Experimental results demonstrate that our model outperforms existing reaction generation methods. The code and data will be made publicly available at https://ereact.github.io/

[140] UGD-IML: A Unified Generative Diffusion-based Framework for Constrained and Unconstrained Image Manipulation Localization

Yachun Mi, Xingyang He, Shixin Sun, Yu Li, Yanting Li, Zhixuan Li, Jian Jin, Chen Hui, Shaohui Liu

Main category: cs.CV

TL;DR: UGD-IML, a novel generative framework using diffusion models, unifies Image Manipulation Localization (IML) and Constrained IML (CIML) tasks, reducing reliance on large datasets and outperforming SOTA methods.

Motivation: Addressing the limitations of existing IML methods, which require large annotated datasets and lack efficiency in annotation processes, by proposing a unified generative approach.

Method: A generative framework based on diffusion models, incorporating class embedding and parameter-sharing to switch between IML and CIML modes efficiently.

Result: UGD-IML outperforms SOTA methods by 9.66 and 4.36 F1 metrics for IML and CIML, respectively, and excels in uncertainty estimation and robustness.

Conclusion: The proposed UGD-IML framework effectively addresses dataset and efficiency challenges in IML and CIML, demonstrating superior performance and versatility.

Abstract: In the digital age, advanced image editing tools pose a serious threat to the integrity of visual content, making image forgery detection and localization a key research focus. Most existing Image Manipulation Localization (IML) methods rely on discriminative learning and require large, high-quality annotated datasets. However, current datasets lack sufficient scale and diversity, limiting model performance in real-world scenarios. To overcome this, recent studies have explored Constrained IML (CIML), which generates pixel-level annotations through algorithmic supervision. However, existing CIML approaches often depend on complex multi-stage pipelines, making the annotation process inefficient. In this work, we propose a novel generative framework based on diffusion models, named UGD-IML, which for the first time unifies both IML and CIML tasks within a single framework. By learning the underlying data distribution, generative diffusion models inherently reduce the reliance on large-scale labeled datasets, allowing our approach to perform effectively even under limited data conditions. In addition, by leveraging a class embedding mechanism and a parameter-sharing design, our model seamlessly switches between IML and CIML modes without extra components or training overhead. Furthermore, the end-to-end design enables our model to avoid cumbersome steps in the data annotation process. Extensive experimental results on multiple datasets demonstrate that UGD-IML outperforms the SOTA methods by an average of 9.66 and 4.36 in terms of F1 metrics for IML and CIML tasks, respectively. Moreover, the proposed method also excels in uncertainty estimation, visualization and robustness.

[141] MCA: 2D-3D Retrieval with Noisy Labels via Multi-level Adaptive Correction and Alignment

Gui Zou, Chaofan Gan, Chern Hong Lim, Supavadee Aramvith, Weiyao Lin

Main category: cs.CV

TL;DR: A robust 2D-3D cross-modal retrieval framework (MCA) is proposed to handle noisy labels, using multimodal joint label correction and multi-level adaptive alignment for improved performance.

Motivation: Imperfect annotations in 2D-3D data pose challenges for cross-modal retrieval, requiring robust solutions to avoid overfitting on noisy labels.

Method: MCA includes a Multimodal Joint label Correction (MJC) mechanism for label refinement and a Multi-level Adaptive Alignment (MAA) strategy for feature enhancement.

Result: MCA achieves state-of-the-art performance on both conventional and noisy 3D benchmarks.

Conclusion: The proposed MCA framework is effective and generalizable for robust 2D-3D cross-modal retrieval under noisy label conditions.

Abstract: With the increasing availability of 2D and 3D data, significant advancements have been made in the field of cross-modal retrieval. Nevertheless, the existence of imperfect annotations presents considerable challenges, demanding robust solutions for 2D-3D cross-modal retrieval in the presence of noisy label conditions. Existing methods generally address the issue of noise by dividing samples independently within each modality, making them susceptible to overfitting on corrupted labels. To address these issues, we propose a robust 2D-3D Multi-level cross-modal adaptive Correction and Alignment framework (MCA). Specifically, we introduce a Multimodal Joint label Correction (MJC) mechanism that leverages multimodal historical self-predictions to jointly model the modality prediction consistency, enabling reliable label refinement. Additionally, we propose a Multi-level Adaptive Alignment (MAA) strategy to effectively enhance cross-modal feature semantics and discrimination across different levels. Extensive experiments demonstrate the superiority of our method, MCA, which achieves state-of-the-art performance on both conventional and realistic noisy 3D benchmarks, highlighting its generality and effectiveness.
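
The joint correction mechanism can be approximated as an agreement-and-confidence filter over each modality's historical predictions; the averaging scheme and threshold below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def joint_label_correction(hist_2d, hist_3d, labels, agree_conf=0.7):
    """hist_2d / hist_3d: (T, N, C) softmax predictions of the 2D and 3D
    branches over the last T epochs; labels: (N,) possibly noisy labels.
    A label is refined only when both modalities' historical mean
    predictions agree on a class with confidence above `agree_conf`."""
    mean_2d, mean_3d = hist_2d.mean(axis=0), hist_3d.mean(axis=0)  # (N, C)
    cls_2d, cls_3d = mean_2d.argmax(axis=1), mean_3d.argmax(axis=1)
    joint_conf = np.minimum(mean_2d.max(axis=1), mean_3d.max(axis=1))
    mask = (cls_2d == cls_3d) & (joint_conf > agree_conf)
    corrected = labels.copy()
    corrected[mask] = cls_2d[mask]
    return corrected, mask  # mask flags the reliably refined samples
```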

[142] Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention

Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu

Main category: cs.CV

TL;DR: A self-supervised learning framework for handwritten mathematical expression recognition (HMER) using contrastive loss and progressive spatial masking, outperforming existing methods.

Motivation: HMER is challenging due to 2D structure and complex symbol relationships; labeled data is expensive.

Method: Combines global/local contrastive loss for pretraining, self-supervised attention with progressive masking, and supervised fine-tuning with a transformer.

Result: Outperforms SSL and supervised baselines on CROHME benchmarks.

Conclusion: The progressive attention mechanism enhances HMER performance without needing labeled data.

Abstract: Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
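
The progressive masking curriculum amounts to a masking ratio that grows over training while masked patches are dropped at random. A minimal sketch, assuming a linear schedule and 16-pixel patches (both assumptions):

```python
import numpy as np

def progressive_mask(image, step, total_steps, max_ratio=0.5, patch=16):
    """Patch-wise spatial masking whose ratio ramps linearly from 0 to
    `max_ratio` over training. image: (H, W) grayscale expression crop,
    with H and W divisible by `patch`."""
    ratio = max_ratio * step / total_steps
    h, w = image.shape[0] // patch, image.shape[1] // patch
    masked = image.copy()
    for f in np.random.choice(h * w, size=int(ratio * h * w), replace=False):
        r, c = divmod(int(f), w)
        masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return masked

crop = np.random.randint(0, 256, size=(128, 256), dtype=np.uint8)
early = progressive_mask(crop, step=1, total_steps=100)   # barely masked
late = progressive_mask(crop, step=90, total_steps=100)   # heavily masked
```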

[143] FMCE-Net++: Feature Map Convergence Evaluation and Training

Zhibo Zhu, Renyu Huang, Lei He

Main category: cs.CV

TL;DR: FMCE-Net++ is a training framework integrating FMCE-Net for feature convergence, improving model performance without architecture changes.

Motivation: Addressing the interpretability and validation gaps in Feature Map Convergence Evaluation (FMCE) for DNNs.

Method: Uses a pretrained FMCE-Net as an auxiliary head to generate Feature Map Convergence Scores (FMCS) and combines them with task labels via Representation Auxiliary Loss (RAL).

Result: Achieves accuracy gains (e.g., +1.16 pp on ResNet-50/CIFAR-10) across multiple datasets.

Conclusion: FMCE-Net++ effectively enhances model performance by optimizing feature convergence.

Abstract: Deep Neural Networks (DNNs) face interpretability challenges due to their opaque internal representations. While Feature Map Convergence Evaluation (FMCE) quantifies module-level convergence via Feature Map Convergence Scores (FMCS), it lacks experimental validation and closed-loop integration. To address this limitation, we propose FMCE-Net++, a novel training framework that integrates a pretrained, frozen FMCE-Net as an auxiliary head. This module generates FMCS predictions, which, combined with task labels, jointly supervise backbone optimization through a Representation Auxiliary Loss (RAL). The RAL dynamically balances the primary classification loss and feature convergence optimization via a tunable Representation Abstraction Factor. Extensive experiments conducted on MNIST, CIFAR-10, FashionMNIST, and CIFAR-100 demonstrate that FMCE-Net++ consistently enhances model performance without architectural modifications or additional data. Key experimental outcomes include accuracy gains of $+1.16$ pp (ResNet-50/CIFAR-10) and $+1.08$ pp (ShuffleNet v2/CIFAR-100), validating that FMCE-Net++ can effectively elevate state-of-the-art performance ceilings.
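
The training objective reduces to a weighted sum of the task loss and the convergence-supervision term. A short PyTorch sketch, where the MSE form of the Representation Auxiliary Loss and the weight value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def fmce_training_loss(logits, labels, fmcs_pred, fmcs_target, alpha=0.1):
    """Joint objective: the primary classification loss plus a
    Representation Auxiliary Loss that supervises the frozen FMCE-Net
    head's convergence-score predictions. `alpha` plays the role of the
    tunable Representation Abstraction Factor."""
    cls_loss = F.cross_entropy(logits, labels)
    ral = F.mse_loss(fmcs_pred, fmcs_target)
    return cls_loss + alpha * ral
```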

[144] GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving

Jian Wang, Chaokang Jiang, Haitao Xu

Main category: cs.CV

TL;DR: GMF-Drive introduces a gated Mamba fusion framework for autonomous driving, replacing transformers with efficient state-space models to improve performance and efficiency.

DetailsMotivation: Current diffusion-based models for autonomous driving rely on transformer-based fusion, which has quadratic computational complexity and lacks spatial priors for BEV representations.

Method: GMF-Drive uses a geometrically-augmented pillar format for LiDAR and a hierarchical gated Mamba fusion (GM-Fusion) architecture with state-space models (SSMs) for efficient, spatially-aware processing.

Result: GMF-Drive achieves state-of-the-art performance on the NAVSIM benchmark, outperforming DiffusionDrive, with ablation studies confirming its efficacy.

Conclusion: Task-specific SSMs can surpass general-purpose transformers in autonomous driving, offering better performance and efficiency.

Abstract: Diffusion-based models are redefining the state-of-the-art in end-to-end autonomous driving, yet their performance is increasingly hampered by a reliance on transformer-based fusion. These architectures face fundamental limitations: quadratic computational complexity restricts the use of high-resolution features, and a lack of spatial priors prevents them from effectively modeling the inherent structure of Bird’s Eye View (BEV) representations. This paper introduces GMF-Drive (Gated Mamba Fusion for Driving), an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format encoding shape descriptors and statistical features, preserving critical 3D geometric details. Second, we propose a novel hierarchical gated mamba fusion (GM-Fusion) architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM leverages directional sequencing and adaptive fusion mechanisms to capture long-range dependencies with linear complexity, while explicitly respecting the unique spatial properties of the driving scene. Extensive experiments on the challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new state-of-the-art performance, significantly outperforming DiffusionDrive. Comprehensive ablation studies validate the efficacy of each component, demonstrating that task-specific SSMs can surpass a general-purpose transformer in both performance and efficiency for autonomous driving.
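
The gating idea can be illustrated independently of the state-space backbone. The sketch below shows a generic sigmoid-gated fusion of camera and LiDAR BEV features; it is a simplification under assumed dimensions, not the paper's GM-Fusion block, which adds directional sequencing and a BEV-SSM.

```python
import torch
import torch.nn as nn

class GatedBEVFusion(nn.Module):
    """Per-feature sigmoid gate deciding how much to trust each modality."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        g = torch.sigmoid(self.gate(torch.cat([cam_bev, lidar_bev], dim=-1)))
        return g * cam_bev + (1.0 - g) * lidar_bev

fused = GatedBEVFusion()(torch.randn(2, 200, 128), torch.randn(2, 200, 128))
```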

[145] SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation

Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu

Main category: cs.CV

TL;DR: SynSeg introduces Multi-Category Contrastive Learning (MCCL) and Feature Synergy Structure (FSS) to improve weakly-supervised semantic segmentation, outperforming SOTA methods.

DetailsMotivation: Addressing challenges in open-vocabulary semantic segmentation, such as semantic misalignment and poor performance in weakly-supervised settings.

Method: Uses MCCL for intra- and inter-category alignment and FSS for feature reconstruction to avoid foreground bias.

Result: Achieves higher accuracy than SOTA baselines (e.g., +4.5% on VOC, +8.9% on Context).

Conclusion: SynSeg enhances semantic localization and discrimination under weak supervision, demonstrating superior performance.

Abstract: Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, the MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the correlations among different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. Overall, SynSeg effectively improves semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) baselines. For instance, SynSeg achieves higher accuracy than SOTA baselines by 4.5% on VOC, 8.9% on Context, 2.6% on Object and 2.0% on City.

[146] Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang, Zeyu Zhang, Yunzhong Hou, Zhuowan Li, Gaowen Liu, Ali Payani, Yuan-Sen Ting, Liang Zheng

Main category: cs.CV

TL;DR: The paper introduces a modular and diversified approach to generating synthetic charts for improving multimodal large language models’ (MLLMs) chart understanding, resulting in the Effective Chart Dataset (ECD) with 10k+ images and 300k+ QA pairs.

DetailsMotivation: Existing MLLMs perform poorly (30%-50% success rate) on chart understanding tasks due to inadequate synthetic chart similarity to real charts.

Method: A five-step data synthesis pipeline is designed, separating data/function creation, conditioning subplot generation, diversifying visuals, filtering low-quality data, and generating QA pairs with GPT-4o.

Result: ECD improves MLLM performance on real-world and synthetic test sets.

Conclusion: The modular and diversified approach enhances chart understanding, with ECD serving as a valuable resource for fine-tuning MLLMs.

Abstract: Reading scientific plots effectively, i.e., chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the Effective Chart Dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets. Code, data and models are available at: https://github.com/yuweiyang-anu/ECD.
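
The pipeline's core modularity (separating data creation from the plotting function so visual details can vary independently) can be sketched in a few lines of matplotlib. Everything below (data model, style dictionaries, file names) is an illustrative toy, not the released ECD code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def make_data(rng):
    """Step 1: data creation, decoupled from plotting."""
    x = np.linspace(0, 10, 50)
    return x, np.sin(x) + rng.normal(scale=0.3, size=x.size)

def render_chart(x, y, style, path):
    """Step 3: visual diversification happens purely through `style`."""
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.plot(x, y, color=style["color"], linestyle=style["ls"])
    ax.set_title(style["title"])
    fig.savefig(path, dpi=150)
    plt.close(fig)

rng = np.random.default_rng(0)
styles = [{"color": "C0", "ls": "-", "title": "run A"},
          {"color": "C3", "ls": "--", "title": "run B"}]
for i, style in enumerate(styles):
    x, y = make_data(rng)
    render_chart(x, y, style, f"chart_{i}.png")
    # the final step (QA generation with GPT-4o) would consume x, y and the image
```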

[147] Learning Representations of Satellite Images with Evaluations on Synoptic Weather Events

Ting-Shuo Yo, Shih-Hao Su, Chien-Ming Wu, Wei-Ting Chen, Jung-Lien Chu, Chiao-Wei Chang, Hung-Chi Kuo

Main category: cs.CV

TL;DR: The study compares PCA, CAE, and PT for weather event classification using satellite images, finding CAE superior in performance but lacking physical interpretability.

DetailsMotivation: To evaluate the effectiveness of representation learning algorithms (PCA, CAE, PT) for classifying weather events from satellite images.

Method: Applied PCA, convolutional autoencoder (CAE), and pre-trained residual network (PT) to satellite images, comparing their latent spaces for weather event classification.

Result: CAE consistently outperformed PCA and PT in classification tasks, though PT excelled in tropical cyclone recognition. Higher-resolution datasets improved deep-learning performance, and latent space size impacted false-alarm rates.

Conclusion: CAE is effective but lacks physical interpretability; future work could focus on physics-informed CAE models.

Abstract: This study applied representation learning algorithms to satellite images and evaluated the learned latent spaces with classifications of various weather events. The algorithms investigated include the classical linear transformation, i.e., principal component analysis (PCA), a state-of-the-art deep learning method, i.e., convolutional autoencoder (CAE), and a residual network pre-trained with large image datasets (PT). The experiment results indicated that the latent space learned by CAE consistently showed higher threat scores for all classification tasks. The classifications with PCA yielded high hit rates but also high false-alarm rates. In addition, the PT performed exceptionally well at recognizing tropical cyclones but was inferior in other tasks. Further experiments suggested that representations learned from higher-resolution datasets are superior in all classification tasks for deep-learning algorithms, i.e., CAE and PT. We also found that smaller latent space sizes had a minor impact on the classification task’s hit rate. Still, a latent space dimension smaller than 128 caused a significantly higher false-alarm rate. Though the CAE can learn latent spaces effectively and efficiently, the interpretation of the learned representation lacks direct connections to physical attributes. Therefore, developing a physics-informed version of the CAE is a promising direction for future work.
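
The evaluation protocol (embed images into a latent space, then train a simple classifier and read off hit and false-alarm rates) can be sketched with the PCA baseline and scikit-learn. The data here are random stand-ins for satellite imagery, and the 128-dimensional latent size echoes the threshold the study flags.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.random((500, 32, 32)).reshape(500, -1)   # toy "satellite images"
y = rng.integers(0, 2, size=500)                 # toy binary event labels

z = PCA(n_components=128).fit_transform(X)       # latent space
clf = LogisticRegression(max_iter=1000).fit(z[:400], y[:400])
tn, fp, fn, tp = confusion_matrix(y[400:], clf.predict(z[400:])).ravel()
print("hit rate:", tp / (tp + fn))               # probability of detection
print("false-alarm rate:", fp / (fp + tn))       # probability of false detection
```

Swapping the PCA embedding for a CAE bottleneck or pooled pre-trained ResNet features reproduces the paper's three-way comparison.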

[148] SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning

Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, Tao Chen

Main category: cs.CV

TL;DR: SC-Captioner is a reinforcement learning framework for self-correcting image caption models, using a novel reward function and refined metrics for better caption quality.

DetailsMotivation: Improving image caption models by enabling self-correction capabilities and addressing limitations in existing evaluation metrics.

Method: Decomposes captions into object, attribute, and relation sets, calculates set differences for corrections, and uses a reward function for accuracy. Introduces refined metrics and a new dataset, RefinedCaps.

Result: SC-Captioner outperforms direct preference optimization, generating better captions across diverse scenarios.

Conclusion: The framework effectively enhances caption quality through self-correction and improved evaluation metrics.

Abstract: We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of initial and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate correctness bonuses for accurate refinements and mistake punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Furthermore, we collect a fine-grained annotated image caption dataset, RefinedCaps, consisting of 6.5K diverse images from the COCO dataset. Experiments show that applying SC-Captioner on large vision-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
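
The set-difference reward is concrete enough to state directly. Below is a toy version for a single element type (objects, say); the bonus/penalty magnitudes and function name are illustrative assumptions.

```python
def correction_reward(initial: set, corrected: set, reference: set,
                      bonus: float = 1.0, penalty: float = 1.0) -> float:
    """Reward additions that appear in the reference and removals of elements
    absent from it; punish hallucinated additions and wrongly removed facts."""
    added = corrected - initial
    removed = initial - corrected
    return (bonus * len(added & reference)         # correct additions
            - penalty * len(added - reference)     # hallucinated additions
            + bonus * len(removed - reference)     # correct removals
            - penalty * len(removed & reference))  # facts wrongly removed

print(correction_reward({"dog", "frisbee"}, {"dog", "grass"},
                        reference={"dog", "grass", "park"}))  # 2.0
```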

[149] SAM Encoder Breach by Adversarial Simplicial Complex Triggers Downstream Model Failures

Yi Qin, Rui Wang, Tao Huang, Tong Xiao, Liping Jing

Main category: cs.CV

TL;DR: VeSCA is a novel adversarial attack method targeting SAM’s vulnerabilities, improving transferability by 12.7% over state-of-the-art methods.

DetailsMotivation: SAM's vulnerabilities pose risks to downstream applications, necessitating proactive evaluation of transferable weaknesses.

Method: VeSCA leverages SAM’s encoder to identify shared vulnerable regions via a parametric simplicial complex, refined iteratively and adapted across domains.

Result: VeSCA outperforms existing methods by 12.7% across three downstream model categories and five datasets.

Conclusion: The study underscores the risks of SAM’s vulnerabilities and calls for more robust foundation models.

Abstract: While the Segment Anything Model (SAM) transforms interactive segmentation with zero-shot abilities, its inherent vulnerabilities present a single-point risk, potentially leading to the failure of numerous downstream applications. Proactively evaluating these transferable vulnerabilities is thus imperative. Prior adversarial attacks on SAM often present limited transferability due to insufficient exploration of common weaknesses across domains. To address this, we propose the Vertex-Refining Simplicial Complex Attack (VeSCA), a novel method that leverages only the encoder of SAM for generating transferable adversarial examples. Specifically, it achieves this by explicitly characterizing the shared vulnerable regions between SAM and downstream models through a parametric simplicial complex. Our goal is to identify such complexes within adversarially potent regions by iterative vertex-wise refinement. A lightweight domain re-adaptation strategy is introduced to bridge domain divergence using minimal reference data during the initialization of the simplicial complex. Ultimately, VeSCA generates consistently transferable adversarial examples through random simplicial complex sampling. Extensive experiments demonstrate that VeSCA improves performance by 12.7% over state-of-the-art methods across three downstream model categories and five domain-specific datasets. Our findings further highlight the downstream model risks posed by SAM’s vulnerabilities and emphasize the urgency of developing more robust foundation models.
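
The final sampling step, drawing perturbations from the convex hull spanned by refined vertices, has a compact expression. The sketch below assumes the vertices are already optimized perturbation tensors; the flat Dirichlet weighting is one plausible reading of "random simplicial complex sampling", not the paper's exact procedure.

```python
import torch

def sample_from_simplex(vertices: torch.Tensor) -> torch.Tensor:
    """Convex combination of k perturbation vertices of shape (k, C, H, W)."""
    w = torch.distributions.Dirichlet(torch.ones(vertices.shape[0])).sample()
    return torch.einsum("k,kchw->chw", w, vertices)

vertices = torch.randn(4, 3, 224, 224) * 0.03   # stand-in refined vertices
delta = sample_from_simplex(vertices)           # one adversarial perturbation
```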

[150] Roll Your Eyes: Gaze Redirection via Explicit 3D Eyeball Rotation

YoungChan Choi, HengFei Wang, YiHua Cheng, Boeun Kim, Hyung Jin Chang, YoungGeun Choi, Sang-Il Choi

Main category: cs.CV

TL;DR: A novel 3D gaze redirection framework using explicit 3D eyeball structure and 3D Gaussian Splatting outperforms NeRF-based methods in image quality and gaze accuracy.

DetailsMotivation: Existing gaze redirection methods rely on implicit neural representations (NeRF), lacking explicit modeling of 3D eyeball rotation and translation.

Method: Uses a dedicated 3D eyeball structure with 3D Gaussian Splatting (3DGS) and an adaptive deformation module for muscle movements.

Result: Achieves superior image quality and gaze estimation accuracy on the ETH-XGaze dataset.

Conclusion: The framework effectively generates photorealistic gaze images with explicit 3D control, outperforming prior methods.

Abstract: We propose a novel 3D gaze redirection framework that leverages an explicit 3D eyeball structure. Existing gaze redirection methods are typically based on neural radiance fields, which employ implicit neural representations via volume rendering. Unlike these NeRF-based approaches, where the rotation and translation of 3D representations are not explicitly modeled, we introduce a dedicated 3D eyeball structure to represent the eyeballs with 3D Gaussian Splatting (3DGS). Our method generates photorealistic images that faithfully reproduce the desired gaze direction by explicitly rotating and translating the 3D eyeball structure. In addition, we propose an adaptive deformation module that enables the replication of subtle muscle movements around the eyes. Through experiments conducted on the ETH-XGaze dataset, we demonstrate that our framework is capable of generating diverse novel gaze images, achieving superior image quality and gaze estimation accuracy compared to previous state-of-the-art methods.

[151] DiffCap: Diffusion-based Real-time Human Motion Capture using Sparse IMUs and a Monocular Camera

Shaohua Pan, Xinyu Yi, Yan Zhou, Weihua Jian, Yuan Zhang, Pengfei Wan, Feng Xu

Main category: cs.CV

TL;DR: A diffusion-based method combines sparse IMUs and a monocular camera for real-time human motion capture, leveraging sequential visual features and frame-wise IMU data for robust performance.

DetailsMotivation: To address challenges like occlusions or camera view loss in human motion capture by fusing sparse IMUs and monocular camera data effectively.

Method: Uses a diffusion model to integrate sequential visual features (as a condition embedding) and frame-wise IMU measurements (concatenated with noisy poses).

Result: Demonstrates robustness to visual degenerations and achieves state-of-the-art performance in pose estimation.

Conclusion: The proposed framework effectively fuses IMU and camera data, outperforming previous methods, with code available for research.

Abstract: Combining sparse IMUs and a monocular camera is a new promising setting to perform real-time human motion capture. This paper proposes a diffusion-based solution to learn human motion priors and fuse the two modalities of signals together seamlessly in a unified framework. By delicately considering the characteristics of the two signals, the sequential visual information is considered as a whole and transformed into a condition embedding, while the inertial measurement is concatenated with the noisy body pose frame by frame to construct a sequential input for the diffusion model. Firstly, we observe that the visual information may be unavailable in some frames due to occlusions or subjects moving out of the camera view. Thus, incorporating the sequential visual features as a whole to get a single feature embedding is robust to the occasional degenerations of visual information in those frames. On the other hand, the IMU measurements are robust to occlusions and remain stable as long as signal transmission is reliable. Incorporating them frame by frame therefore better exploits the temporal information for the system. Experiments have demonstrated the effectiveness of the system design and its state-of-the-art performance in pose estimation compared with previous works. Our codes are available for research at https://shaohua-pan.github.io/diffcap-page.
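
The asymmetric conditioning is the interesting design choice, and a minimal sketch makes it concrete: the visual sequence is pooled into one embedding, while IMU readings are concatenated with the noisy pose at every frame. All dimensions and the mean-pooling are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class FusionCondition(nn.Module):
    def __init__(self, vis_dim=512, imu_dim=72, pose_dim=144, hid=256):
        super().__init__()
        self.vis_pool = nn.Sequential(nn.Linear(vis_dim, hid), nn.ReLU(),
                                      nn.Linear(hid, hid))
        self.frame_proj = nn.Linear(imu_dim + pose_dim, hid)

    def forward(self, vis_seq, imu_seq, noisy_pose):
        # whole visual sequence -> one condition embedding, robust to
        # frames where vision degenerates
        cond = self.vis_pool(vis_seq.mean(dim=1))                      # (B, hid)
        # IMU concatenated with the noisy pose frame by frame
        x = self.frame_proj(torch.cat([imu_seq, noisy_pose], dim=-1))  # (B, T, hid)
        return x + cond.unsqueeze(1)

out = FusionCondition()(torch.randn(2, 30, 512), torch.randn(2, 30, 72),
                        torch.randn(2, 30, 144))   # (2, 30, 256)
```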

[152] SDEval: Safety Dynamic Evaluation for Multimodal Large Language Models

Hanqing Wang, Yuan Tian, Mingyu Liu, Zhenhao Zhang, Xiangyang Zhu

Main category: cs.CV

TL;DR: SDEval is a dynamic safety evaluation framework for Multimodal Large Language Models (MLLMs) that adjusts benchmark distribution and complexity to address outdated datasets and contamination issues.

DetailsMotivation: Safety concerns in MLLM outputs and limitations of existing datasets due to rapid advancements and contamination risks.

Method: SDEval uses text, image, and text-image dynamics to generate new samples, exploring their individual and combined effects on model safety.

Result: SDEval significantly impacts safety evaluation, mitigates data contamination, and exposes MLLM safety limitations across benchmarks.

Conclusion: SDEval is a versatile framework for improving safety evaluations in MLLMs and can be applied to existing benchmarks.

Abstract: In the rapidly evolving landscape of Multimodal Large Language Models (MLLMs), the safety concerns of their outputs have earned significant attention. Although numerous datasets have been proposed, they may become outdated with MLLM advancements and are susceptible to data contamination issues. To address these problems, we propose SDEval, the first safety dynamic evaluation framework to controllably adjust the distribution and complexity of safety benchmarks. Specifically, SDEval mainly adopts three dynamic strategies: text, image, and text-image dynamics to generate new samples from original benchmarks. We first explore the individual effects of text and image dynamics on model safety. Then, we find that injecting text dynamics into images can further impact safety, and conversely, injecting image dynamics into text also leads to safety risks. SDEval is general enough to be applied to various existing safety and even capability benchmarks. Experiments across safety benchmarks, MLLMGuard and VLSBench, and capability benchmarks, MMBench and MMVet, show that SDEval significantly influences safety evaluation, mitigates data contamination, and exposes safety limitations of MLLMs. Code is available at https://github.com/hq-King/SDEval

[153] Text-guided Visual Prompt DINO for Generic Segmentation

Yuchen Guan, Chong Sun, Canmiao Fu, Zhipeng Huang, Chun Yuan, Chen Li

Main category: cs.CV

TL;DR: Prompt-DINO introduces early fusion, order-aligned query selection, and a generative data engine to improve open-world segmentation, achieving state-of-the-art results.

DetailsMotivation: Address limitations in late-stage feature fusion, suboptimal query selection, and caption-derived vocabulary constraints in multimodal vision models.

Method: Proposes an early fusion mechanism, order-aligned query selection for DETR-based architectures, and a generative data engine using the RAP model.

Result: Achieves state-of-the-art performance on open-world benchmarks, expands semantic coverage, and reduces label noise by 80.5%.

Conclusion: Establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios.

Abstract: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection in hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data & code are available at https://github.com/WeChatCV/WeVisionOne.

[154] DSConv: Dynamic Splitting Convolution for Pansharpening

Xuanyu Liu, Bonan An

Main category: cs.CV

TL;DR: The paper introduces DSConv, a dynamic kernel-splitting method with attention for pansharpening, outperforming existing approaches.

DetailsMotivation: Existing pansharpening methods rely on standard convolutions, missing adaptive techniques that leverage inter-pixel correlations in remote sensing images.

Method: Proposes DSConv, dynamically splitting convolution kernels with attention to focus on key positions, enhancing feature extraction and network performance.

Result: DSConv achieves state-of-the-art performance, improving generalization, optimization, and feature representation.

Conclusion: DSConv is superior, with rigorous experiments validating its effectiveness and optimal usage conditions.

Abstract: Pansharpening fuses a multi-spectral image (MS) and a panchromatic image (PAN) to obtain a high-resolution image; this low-level vision task remains significant and challenging in contemporary research. Most existing approaches rely predominantly on standard convolutions, with few exploring adaptive convolutions, which are effective owing to the inter-pixel correlations of remote sensing images. In this paper, we propose a novel strategy, named DSConv, for dynamically splitting convolution kernels in conjunction with attention: positions of interest are selected, and the original convolution kernel is split into multiple smaller kernels. The proposed DSConv more effectively extracts features at different positions within the receptive field, enhancing the network’s generalization, optimization, and feature representation capabilities. Furthermore, we extend the concept of dynamic splitting convolution and, building upon this methodology, provide a novel network architecture for pansharpening that accomplishes the task more efficiently. Extensive fair experiments illustrate the effectiveness and state-of-the-art performance of DSConv. Comprehensive and rigorous discussions establish its superiority and optimal usage conditions.

[155] VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation

Kaiyuan Jiang, Ruoxi Sun, Ying Cao, Yuqi Xu, Xinran Zhang, Junyan Guo, ChengSheng Deng

Main category: cs.CV

TL;DR: VISTAR is a user-centric, multi-dimensional benchmark for text-to-image evaluation, combining deterministic metrics and a novel HWPQ scheme for high human alignment and actionable insights.

DetailsMotivation: Address limitations of existing text-to-image evaluation metrics by incorporating both quantifiable attributes and abstract semantics.

Method: Two-tier hybrid paradigm: deterministic metrics for quantifiable attributes and HWPQ scheme for abstract semantics, validated by expert input and human comparisons.

Result: Achieves >75% human alignment, with HWPQ at 85.9% accuracy. No universal champion model; role-weighted scores reorder rankings.

Conclusion: VISTAR provides reproducible, domain-specific guidance for text-to-image model evaluation, with publicly released resources.

Abstract: We present VISTAR, a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. VISTAR introduces a two-tier hybrid paradigm: it employs deterministic, scriptable metrics for physically quantifiable attributes (e.g., text rendering, lighting) and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics (e.g., style fusion, cultural fidelity). Grounded in a Delphi study with 120 experts, we defined seven user roles and nine evaluation angles to construct the benchmark, which comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. Our metrics achieve high human alignment (>75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Comprehensive evaluation of state-of-the-art models reveals no universal champion, as role-weighted scores reorder rankings and provide actionable guidance for domain-specific deployment. All resources are publicly released to foster reproducible T2I assessment.
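
Role-weighted scoring is simple arithmetic but explains why rankings reorder. The numbers and role names below are made up for illustration; only the mechanism (per-role weights over per-angle scores) follows the paper.

```python
scores = {"text_rendering": 0.82, "lighting": 0.74, "style_fusion": 0.61}
roles = {
    "graphic_designer": {"text_rendering": 0.5, "lighting": 0.2, "style_fusion": 0.3},
    "photographer":     {"text_rendering": 0.1, "lighting": 0.6, "style_fusion": 0.3},
}

for role, w in roles.items():
    total = sum(w[k] * scores[k] for k in scores)   # weights sum to 1 per role
    print(f"{role}: {total:.3f}")
# The same model scores differently per role, so model rankings can reorder.
```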

[156] An Interpretable Multi-Plane Fusion Framework With Kolmogorov-Arnold Network Guided Attention Enhancement for Alzheimer’s Disease Diagnosis

Xiaoxiao Yang, Meiliang Liu, Yunfang Xu, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

Main category: cs.CV

TL;DR: The paper proposes MPF-KANSC, a deep learning framework for early Alzheimer’s disease diagnosis by fusing multi-plane sMRI features and using a novel attention mechanism.

DetailsMotivation: Early and precise AD diagnosis is challenging due to subtle brain changes. Existing methods lack accuracy in capturing complex brain atrophy patterns.

Method: MPF-KANSC integrates multi-plane fusion (MPF) and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) for feature extraction and disease-related abnormality identification.

Result: MPF-KANSC outperforms existing methods on the ADNI dataset and reveals right-lateralized asymmetry in subcortical changes.

Conclusion: The framework improves AD diagnosis accuracy and interpretability, offering insights into disease progression.

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that severely impairs cognitive function and quality of life. Timely intervention in AD relies heavily on early and precise diagnosis, which remains challenging due to the complex and subtle structural changes in the brain. Most existing deep learning methods focus only on a single plane of structural magnetic resonance imaging (sMRI) and struggle to accurately capture the complex and nonlinear relationships among pathological regions of the brain, thus limiting their ability to precisely identify atrophic features. To overcome these limitations, we propose an innovative framework, MPF-KANSC, which integrates multi-plane fusion (MPF) for combining features from the coronal, sagittal, and axial planes, and a Kolmogorov-Arnold Network-guided spatial-channel attention mechanism (KANSC) to more effectively learn and represent sMRI atrophy features. Specifically, the proposed model enables parallel feature extraction from multiple anatomical planes, thus capturing more comprehensive structural information. The KANSC attention mechanism further leverages a more flexible and accurate nonlinear function approximation technique, facilitating precise identification and localization of disease-related abnormalities. Experiments on the ADNI dataset confirm that the proposed MPF-KANSC achieves superior performance in AD diagnosis. Moreover, our findings provide new evidence of right-lateralized asymmetry in subcortical structural changes during AD progression, highlighting the model’s promising interpretability.

[157] Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment

Zhenbang Du, Yonggan Fu, Lifu Wang, Jiayi Qian, Xiao Luo, Yingyan Lin

Main category: cs.CV

TL;DR: PostDiff is a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at input and module levels, showing that reducing per-step inference cost is more effective than cutting denoising steps.

DetailsMotivation: High computational demands of diffusion models challenge deployment on resource-limited platforms, prompting investigation into compute-optimal deployment strategies without fine-tuning.

Method: Proposes PostDiff with mixed-resolution denoising (input level) and hybrid module caching (module level) to reduce redundancy.

Result: PostDiff improves fidelity-efficiency trade-off; reducing per-step inference cost is more effective than reducing denoising steps.

Conclusion: PostDiff offers a practical solution for efficient deployment of diffusion models, prioritizing per-step cost reduction over step reduction.

Abstract: Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the number of denoising steps increases the variability of the distributions across steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. Our code is available at https://github.com/GATECH-EIC/PostDiff.
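
Module-level caching is the easier of the two ideas to sketch. The wrapper below recomputes a wrapped sub-module only every few denoising steps and reuses the cached output in between; the refresh period and the choice of which modules to wrap are assumptions, and PostDiff's actual caching policy may differ.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Reuse a sub-module's output across denoising steps."""
    def __init__(self, block: torch.nn.Module, refresh: int = 3):
        super().__init__()
        self.block, self.refresh = block, refresh
        self._cache, self._step = None, 0

    def forward(self, x):
        if self._cache is None or self._step % self.refresh == 0:
            self._cache = self.block(x)          # recompute this step
        self._step += 1
        return self._cache                       # otherwise reuse

block = CachedBlock(torch.nn.Linear(8, 8), refresh=3)
for t in range(6):                               # six denoising steps
    y = block(torch.randn(1, 8))                 # recomputed at t = 0 and 3
```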

[158] UW-3DGS: Underwater 3D Reconstruction with Physics-Aware Gaussian Splatting

Wenpeng Xing, Jie Chen, Zaifeng Yang, Changting Lin, Jianfeng Dong, Chaochao Chen, Xun Zhou, Meng Han

Main category: cs.CV

TL;DR: UW-3DGS improves underwater 3D scene reconstruction by adapting 3D Gaussian Splatting with a learnable underwater image formation module and Physics-Aware Uncertainty Pruning, outperforming existing methods.

DetailsMotivation: Traditional methods like NeRF and its extensions struggle with underwater conditions due to light absorption and scattering, leading to degraded geometry and color fidelity.

Method: UW-3DGS uses a voxel-based regression for underwater image formation and a Physics-Aware Uncertainty Pruning branch to remove noisy Gaussians, optimizing them end-to-end with underwater parameters.

Result: Achieves PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% fewer floating artifacts.

Conclusion: UW-3DGS offers a robust solution for underwater 3D reconstruction, combining learned physics with efficient Gaussian optimization for high-fidelity results.

Abstract: Underwater 3D scene reconstruction faces severe challenges from light absorption, scattering, and turbidity, which degrade geometry and color fidelity in traditional methods like Neural Radiance Fields (NeRF). While NeRF extensions such as SeaThru-NeRF incorporate physics-based models, their MLP reliance limits efficiency and spatial resolution in hazy environments. We introduce UW-3DGS, a novel framework adapting 3D Gaussian Splatting (3DGS) for robust underwater reconstruction. Key innovations include: (1) a plug-and-play learnable underwater image formation module using voxel-based regression for spatially varying attenuation and backscatter; and (2) a Physics-Aware Uncertainty Pruning (PAUP) branch that adaptively removes noisy floating Gaussians via uncertainty scoring, ensuring artifact-free geometry. The pipeline operates in training and rendering stages. During training, noisy Gaussians are optimized end-to-end with underwater parameters, guided by PAUP pruning and scattering modeling. In rendering, refined Gaussians produce clean Unattenuated Radiance Images (URIs) free from media effects, while learned physics enable realistic Underwater Images (UWIs) with accurate light transport. Experiments on SeaThru-NeRF and UWBundle datasets show superior performance, achieving PSNR of 27.604, SSIM of 0.868, and LPIPS of 0.104 on SeaThru-NeRF, with ~65% reduction in floating artifacts.
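
The physics being learned is the standard revised underwater image formation model: a direct term attenuated with depth plus a depth-dependent backscatter term. The sketch below uses scalar-per-channel coefficients for clarity, whereas the paper's module regresses spatially varying ones.

```python
import numpy as np

def underwater_image(J, depth, beta_d, beta_b, B_inf):
    """I = J * exp(-beta_d * z) + B_inf * (1 - exp(-beta_b * z)), per channel."""
    direct = J * np.exp(-beta_d * depth)                  # attenuated radiance
    backscatter = B_inf * (1.0 - np.exp(-beta_b * depth))
    return direct + backscatter

J = np.full((4, 4, 3), 0.8)                  # clean (unattenuated) radiance
z = np.full((4, 4, 1), 5.0)                  # 5 m of water
I = underwater_image(J, z,
                     beta_d=np.array([0.40, 0.15, 0.10]),   # R, G, B
                     beta_b=np.array([0.35, 0.20, 0.12]),
                     B_inf=np.array([0.05, 0.25, 0.35]))
print(I[0, 0])   # red attenuates fastest; blue backscatter dominates
```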

[159] Synthetic Data-Driven Multi-Architecture Framework for Automated Polyp Segmentation Through Integrated Detection and Mask Generation

Ojonugwa Oluwafemi Ejiga Peter, Akingbola Oluwapemiisin, Amalahu Chetachi, Adeniran Opeyemi, Fahmi Khalifa, Md Mahmudur Rahman

Main category: cs.CV

TL;DR: A novel multidirectional framework automates polyp detection in colonoscopy images, combining synthetic data generation, Faster R-CNN for localization, and SAM for segmentation, achieving high performance metrics.

DetailsMotivation: To address the challenges of limited healthcare datasets and annotation complexities in colorectal cancer diagnosis via colonoscopy.

Method: Uses Stable Diffusion for synthetic data, Faster R-CNN for initial polyp localization, and SAM for refined segmentation. Evaluates five segmentation models (U-Net, PSPNet, FPN, LinkNet, MANet) with ResNet34.

Result: Faster R-CNN achieved 93.08% recall, 88.97% precision, and 90.98% F1 score. FPN led in PSNR (7.205893) and SSIM (0.492381), while U-Net excelled in recall (84.85%) and LinkNet in IoU (64.20%) and Dice (77.53%).

Conclusion: The framework effectively automates polyp detection, with FPN and U-Net showing superior segmentation performance, enhancing early colorectal cancer diagnosis.

Abstract: Colonoscopy is a vital tool for the early diagnosis of colorectal cancer, which is one of the main causes of cancer-related mortality globally; hence, it is deemed an essential technique for the prevention and early detection of colorectal cancer. The research introduces a unique multidirectional architectural framework to automate polyp detection within colonoscopy images while helping resolve limited healthcare dataset sizes and annotation complexities. The research implements a comprehensive system that delivers synthetic data generation through Stable Diffusion enhancements together with detection and segmentation algorithms. The detection approach combines Faster R-CNN for initial object localization, while the Segment Anything Model (SAM) refines the segmentation masks. The Faster R-CNN detection algorithm achieved a recall of 93.08%, a precision of 88.97%, and an F1 score of 90.98%. SAM is then used to generate the image mask. The research evaluated five state-of-the-art segmentation models that included U-Net, PSPNet, FPN, LinkNet, and MANet using ResNet34 as a base model. The results demonstrate the superior performance of FPN with the highest scores of PSNR (7.205893) and SSIM (0.492381), while U-Net excels in recall (84.85%) and LinkNet shows balanced performance in IoU (64.20%) and Dice score (77.53%).
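
The detect-then-segment handoff can be condensed to a few calls, assuming torchvision's off-the-shelf Faster R-CNN (standing in for the fine-tuned polyp detector) and the segment-anything package; the checkpoint path and score threshold are placeholders.

```python
import numpy as np
import torch
import torchvision
from segment_anything import sam_model_registry, SamPredictor

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights="DEFAULT").eval()
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in colonoscopy frame
with torch.no_grad():
    det = detector([torch.from_numpy(image).permute(2, 0, 1).float() / 255])[0]

predictor.set_image(image)
for box, score in zip(det["boxes"], det["scores"]):
    if score < 0.5:                                # placeholder threshold
        continue
    masks, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
    # masks[0] is the refined mask for this detected polyp
```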

[160] Graph-based Robot Localization Using a Graph Neural Network with a Floor Camera and a Feature Rich Industrial Floor

Dominik Brämer, Diana Kleingarn, Oliver Urbann

Main category: cs.CV

TL;DR: A novel graph-based localization framework using GCNs improves accuracy (0.64cm error) and efficiency, solving the kidnapped robot problem without complex filtering.

DetailsMotivation: Traditional localization methods like Lidar or QR-codes lack scalability and adaptability in complex environments.

Method: Uses graph-based representations of floor features with Graph Convolutional Networks (GCNs) for localization.

Result: Achieves 0.64cm localization error and efficiently solves the kidnapped robot problem per frame.

Conclusion: The framework enhances robotic navigation in diverse environments by leveraging floor characteristics.

Abstract: Accurate localization represents a fundamental challenge in robotic navigation. Traditional methodologies, such as Lidar or QR-code based systems, suffer from inherent scalability and adaptability constraints, particularly in complex environments. In this work, we propose an innovative localization framework that harnesses flooring characteristics by employing graph-based representations and Graph Convolutional Networks (GCNs). Our method uses graphs to represent floor features, which helps localize the robot more accurately (0.64 cm error) and more efficiently than comparing individual image features. Additionally, this approach successfully addresses the kidnapped robot problem in every frame without requiring complex filtering processes. These advancements open up new possibilities for robotic navigation in diverse environments.
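
A toy version of the localizer clarifies the data flow: floor features become graph nodes, and a GCN regresses a 2D position from the pooled graph. This assumes PyTorch Geometric; the dimensions and the mean-pooling head are illustrative.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

class FloorGCN(torch.nn.Module):
    def __init__(self, in_dim=32, hid=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hid)
        self.conv2 = GCNConv(hid, hid)
        self.head = torch.nn.Linear(hid, 2)        # (x, y) on the floor

    def forward(self, data: Data):
        x = self.conv1(data.x, data.edge_index).relu()
        x = self.conv2(x, data.edge_index).relu()
        return self.head(x.mean(dim=0))            # global mean pool

g = Data(x=torch.randn(6, 32),                     # 6 floor-feature nodes
         edge_index=torch.tensor([[0, 1, 2, 3, 4],
                                  [1, 2, 3, 4, 5]]))
print(FloorGCN()(g))                               # predicted 2D position
```

Because each frame's graph is matched globally rather than tracked, a wrong prior (the kidnapped robot) does not poison subsequent estimates.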

[161] MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration

Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu

Main category: cs.CV

TL;DR: MA-CBP is a multi-agent framework for real-time criminal behavior prediction using semantic descriptions and causal summaries, outperforming traditional methods.

DetailsMotivation: Addressing limitations of traditional anomaly detection and LLM-based methods in capturing high-level behavioral semantics and meeting real-time needs.

Method: Transforms video streams into semantic descriptions, constructs causal summaries, and fuses frames for joint reasoning.

Result: Achieves superior performance on multiple datasets, enabling early warnings for criminal activity.

Conclusion: MA-CBP offers a promising solution for urban public safety risk warning.

Abstract: With the acceleration of urbanization, criminal behavior in public scenes poses an increasingly serious threat to social security. Traditional anomaly detection methods based on feature recognition struggle to capture high-level behavioral semantics from historical information, while generative approaches based on Large Language Models (LLMs) often fail to meet real-time requirements. To address these challenges, we propose MA-CBP, a criminal behavior prediction framework based on multi-agent asynchronous collaboration. This framework transforms real-time video streams into frame-level semantic descriptions, constructs causally consistent historical summaries, and fuses adjacent image frames to perform joint reasoning over long- and short-term contexts. The resulting behavioral decisions include key elements such as event subjects, locations, and causes, enabling early warning of potential criminal activity. In addition, we construct a high-quality criminal behavior dataset that provides multi-scale language supervision, including frame-level, summary-level, and event-level semantic annotations. Experimental results demonstrate that our method achieves superior performance on multiple datasets and offers a promising solution for risk warning in urban public safety scenarios.

[162] A Semantic Segmentation Algorithm for Pleural Effusion Based on DBIF-AUNet

Ruixiang Tang, Jianglong Qin, Mingda Zhang, Yan Song, Yi Wu, Wei Wu

Main category: cs.CV

TL;DR: The paper introduces DBIF-AUNet, a model for pleural effusion CT image segmentation, addressing challenges like gray-level similarity and blurred edges. It achieves superior performance over existing methods.

DetailsMotivation: Accurate pleural effusion segmentation improves clinical diagnosis, but current methods struggle with image variations and semantic gaps.

Method: Proposes DBIF-AUNet with Dual-Domain Feature Disentanglement (DDFD) and Branch Interaction Attention Fusion (BIAF) modules for multi-scale feature complementarity and dynamic feature fusion.

Result: Achieved IoU of 80.1% and Dice of 89.0%, outperforming U-Net++ and Swin-UNet.

Conclusion: DBIF-AUNet significantly enhances segmentation accuracy for complex pleural effusion CT images.

Abstract: Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet). This model constructs a densely nested skip-connection network and innovatively refines the Dual-Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual-domain modules to achieve multi-scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. These results outperform state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.

[163] LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning

Chang Che, Ziqi Wang, Pengwan Yang, Qi Wang, Hui Ma, Zenglin Shi

Main category: cs.CV

TL;DR: LiLoRA introduces an efficient architecture expansion method for CVIT in MLLMs, reducing parameter overhead and improving scalability while mitigating catastrophic forgetting.

DetailsMotivation: Addressing catastrophic forgetting and parameter inefficiency in continual visual instruction tuning for MLLMs.

Method: LiLoRA shares LoRA matrix A across tasks, applies low-rank decomposition to matrix B, and uses cosine-regularized stability loss.

Result: LiLoRA achieves superior performance in sequential task learning with improved parameter efficiency.

Conclusion: LiLoRA is a scalable and efficient solution for CVIT in MLLMs, outperforming existing methods.

Abstract: Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models (MLLMs) to incrementally learn new tasks over time. However, this process is challenged by catastrophic forgetting, where performance on previously learned tasks deteriorates as the model adapts to new ones. A common approach to mitigate forgetting is architecture expansion, which introduces task-specific modules to prevent interference. Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. LiLoRA shares the LoRA matrix A across tasks to reduce redundancy, applies an additional low-rank decomposition to matrix B to minimize task-specific parameters, and incorporates a cosine-regularized stability loss to preserve consistency in shared representations over time. Extensive experiments on a diverse CVIT benchmark show that LiLoRA consistently achieves superior performance in sequential task learning while significantly improving parameter efficiency compared to existing approaches.
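
The parameter accounting is the heart of LiLoRA and fits in a short sketch: one A matrix shared by all tasks, and each task's B factorized as B2 @ B1, costing r2*r + d_out*r2 parameters per task instead of d_out*r for a full LoRA B. Ranks and initialization below are illustrative.

```python
import torch
import torch.nn as nn

class LiLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, r2: int = 4, n_tasks: int = 5):
        super().__init__()
        self.base = base
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)      # shared across tasks
        self.B1 = nn.ParameterList(nn.Parameter(torch.randn(r2, r) * 0.01)
                                   for _ in range(n_tasks))     # task-specific
        self.B2 = nn.ParameterList(nn.Parameter(torch.zeros(d_out, r2))
                                   for _ in range(n_tasks))     # zero-init delta

    def forward(self, x: torch.Tensor, task: int) -> torch.Tensor:
        delta = self.B2[task] @ self.B1[task] @ self.A          # (d_out, d_in)
        return self.base(x) + x @ delta.T

layer = LiLoRALinear(nn.Linear(128, 128))
print(layer(torch.randn(2, 128), task=0).shape)                 # (2, 128)
```

The cosine-regularized stability loss would additionally penalize drift of the shared representations across tasks; it is omitted here.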

[164] AnomalyMoE: Towards a Language-free Generalist Model for Unified Visual Anomaly Detection

Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Wei Ge, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: AnomalyMoE is a universal anomaly detection framework using a Mixture-of-Experts architecture to detect diverse anomalies hierarchically, outperforming specialized methods.

DetailsMotivation: Existing anomaly detection methods are specialized and lack generalizability, limiting their performance across diverse contexts.

Method: AnomalyMoE decomposes anomaly detection into three semantic hierarchies (local, component, global) using dedicated expert networks and introduces EIR and ESB modules for expert diversity and utilization.

Result: AnomalyMoE achieves state-of-the-art performance on 8 diverse datasets, surpassing specialized methods in their respective domains.

Conclusion: AnomalyMoE offers a universal, hierarchical approach to anomaly detection, significantly improving generalizability and performance.

Abstract: Anomaly detection is a critical task across numerous domains and modalities, yet existing methods are often highly specialized, limiting their generalizability. These specialized models, tailored for specific anomaly types like textural defects or logical errors, typically exhibit limited performance when deployed outside their designated contexts. To overcome this limitation, we propose AnomalyMoE, a novel and universal anomaly detection framework based on a Mixture-of-Experts (MoE) architecture. Our key insight is to decompose the complex anomaly detection problem into three distinct semantic hierarchies: local structural anomalies, component-level semantic anomalies, and global logical anomalies. AnomalyMoE correspondingly employs three dedicated expert networks at the patch, component, and global levels, each specialized in reconstructing features and identifying deviations at its designated semantic level. This hierarchical design allows a single model to concurrently understand and detect a wide spectrum of anomalies. Furthermore, we introduce an Expert Information Repulsion (EIR) module to promote expert diversity and an Expert Selection Balancing (ESB) module to ensure the comprehensive utilization of all experts. Experiments on 8 challenging datasets spanning industrial imaging, 3D point clouds, medical imaging, video surveillance, and logical anomaly detection demonstrate that AnomalyMoE establishes new state-of-the-art performance, significantly outperforming specialized methods in their respective domains.
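
A toy three-expert layout shows the shape of the idea: each expert reconstructs features, and reconstruction error signals anomaly. In the paper the experts operate on genuinely different granularities (patch, component, global); here they share one feature set purely for brevity, and the router is a guess at the mechanics.

```python
import torch
import torch.nn as nn

class ThreeLevelMoE(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d // 2), nn.ReLU(), nn.Linear(d // 2, d))
            for _ in range(3))                    # local / component / global
        self.router = nn.Linear(d, 3)

    def forward(self, feats: torch.Tensor):       # (N, d) patch features
        weights = torch.softmax(self.router(feats.mean(0)), dim=-1)   # (3,)
        errors = torch.stack([((e(feats) - feats) ** 2).mean()
                              for e in self.experts])                 # (3,)
        return (weights * errors).sum()           # scalar anomaly score

score = ThreeLevelMoE()(torch.randn(49, 64))
```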

[165] PA-HOI: A Physics-Aware Human and Object Interaction Dataset

Ruiyan Wang, Lin Zuo, Zonghao Lin, Qiang Wang, Zhengxue Cheng, Rong Xie, Jun Ling, Li Song

Main category: cs.CV

TL;DR: The paper introduces the PA-HOI Motion Capture dataset to study how objects’ physical attributes influence human motion dynamics, addressing gaps in existing HOI datasets.

DetailsMotivation: Existing HOI datasets overlook the impact of objects' physical properties on human motion, limiting understanding in fields like robotics and virtual reality.

Method: The PA-HOI dataset includes 562 motion sequences of human-object interactions, with varied object properties (size, shape, weight) and human subjects.

Result: The dataset extends understanding of how object attributes affect human posture, speed, and interaction strategies, and integrates well with motion generation methods.

Conclusion: The PA-HOI dataset fills a critical gap by providing realistic physical awareness for HOI research and applications.

Abstract: The Human-Object Interaction (HOI) task explores the dynamic interactions between humans and objects in physical environments, providing essential biomechanical and cognitive-behavioral foundations for fields such as robotics, virtual reality, and human-computer interaction. However, existing HOI datasets focus on affordance details, often neglecting the influence of objects’ physical properties on long-term human motion. To bridge this gap, we introduce the PA-HOI Motion Capture dataset, which highlights the impact of objects’ physical attributes on human motion dynamics, including human posture, moving velocity, and other motion characteristics. The dataset comprises 562 motion sequences of human-object interactions, with each sequence performed by subjects of different genders interacting with 35 3D objects that vary in size, shape, and weight. This dataset stands out by significantly extending the scope of existing ones for understanding how the physical attributes of different objects influence human posture, speed, motion scale, and interaction strategies. We further demonstrate the applicability of the PA-HOI dataset by integrating it with existing motion generation methods, validating its capacity to transfer realistic physical awareness.

[166] SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Zhangquan Chen, Ruihui Zhao, Chuwei Luo, Mingze Sun, Xinlei Yu, Yangyang Kang, Ruqi Huang

Main category: cs.CV

TL;DR: SIFThinker is a spatially-aware framework for MLLMs that improves visual tasks by mimicking human perception, using attention correction and spatial cues.

DetailsMotivation: Current MLLMs struggle with complex visual tasks like spatial understanding and fine-grained perception, lacking iterative refinement of focus on relevant regions.

Method: Introduces SIFThinker with depth-enhanced bounding boxes and natural language, using a reverse-expansion-forward-inference strategy and GRPO-SIF training paradigm.

Result: Outperforms state-of-the-art methods in spatial and fine-grained tasks while maintaining general capabilities.

Conclusion: SIFThinker effectively enhances MLLMs’ visual reasoning through attention correction and spatial awareness.

Abstract: Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning, however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware “think-with-images” framework that mimics human visual perception. Specifically, SIFThinker enables attention correcting and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset. Besides, we propose GRPO-SIF, a reinforced training paradigm that integrates depth-informed visual grounding into a unified reasoning pipeline, teaching the model to dynamically correct and focus on prompt-relevant regions. Extensive experiments demonstrate that SIFThinker outperforms state-of-the-art methods in spatial understanding and fine-grained visual perception, while maintaining strong general capabilities, highlighting the effectiveness of our method.

[167] Interpretable Rheumatoid Arthritis Scoring via Anatomy-aware Multiple Instance Learning

Zhiyan Bo, Laura C. Coates, Bartlomiej W. Papiez

Main category: cs.CV

TL;DR: A two-stage pipeline using dual-hand radiographs for interpretable SvdH score prediction in RA, achieving state-of-the-art accuracy comparable to radiologists.

DetailsMotivation: The complexity of manual SvdH scoring limits its clinical use, prompting an automated solution.

Method: A two-stage pipeline with attention-based multiple instance learning, using two region extraction schemes for disease-relevant features.

Result: Best model achieved PCC 0.943 and RMSE 15.73; ensemble learning improved to PCC 0.945 and RMSE 15.57, matching radiologist performance.

Conclusion: The pipeline efficiently predicts SvdH scores and aligns with clinical relevance for RA progression.

Abstract: The Sharp/van der Heijde (SvdH) score has been widely used in clinical trials to quantify radiographic damage in Rheumatoid Arthritis (RA), but its complexity has limited its adoption in routine clinical practice. To address the inefficiency of manual scoring, this work proposes a two-stage pipeline for interpretable image-level SvdH score prediction using dual-hand radiographs. Our approach extracts disease-relevant image regions and integrates them using attention-based multiple instance learning to generate image-level features for prediction. We propose two region extraction schemes: 1) sampling image tiles most likely to contain abnormalities, and 2) cropping patches containing disease-relevant joints. With Scheme 2, our best individual score prediction model achieved a Pearson’s correlation coefficient (PCC) of 0.943 and a root mean squared error (RMSE) of 15.73. Ensemble learning further boosted prediction accuracy, yielding a PCC of 0.945 and RMSE of 15.57, achieving state-of-the-art performance that is comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, our pipeline effectively identified and made decisions based on anatomical structures which clinicians consider relevant to RA progression.
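
Attention-based MIL pooling is what makes the image-level score both learnable and interpretable: each region gets a weight, and the weights indicate which joints drove the prediction. Below is a compact sketch in the style of Ilse et al. (2018); feature dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 512, att_dim: int = 128):
        super().__init__()
        self.att_V = nn.Linear(feat_dim, att_dim)
        self.att_w = nn.Linear(att_dim, 1)
        self.scorer = nn.Linear(feat_dim, 1)       # regresses the SvdH score

    def forward(self, instances: torch.Tensor):    # (n_regions, feat_dim)
        a = self.att_w(torch.tanh(self.att_V(instances)))   # (n_regions, 1)
        a = torch.softmax(a, dim=0)
        bag = (a * instances).sum(dim=0)           # attention-weighted pooling
        return self.scorer(bag), a                 # score + per-region weights

score, weights = AttentionMIL()(torch.randn(42, 512))  # 42 joint-patch features
```

The returned weights are what let the pipeline point at the anatomical regions behind each score.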

[168] TEFormer: Texture-Aware and Edge-Guided Transformer for Semantic Segmentation of Urban Remote Sensing Images

Guoyu Zhou, Jing Zhang, Yi Yan, Hui Zhang, Li Zhuo

Main category: cs.CV

TL;DR: TEFormer, a texture-aware and edge-guided Transformer, improves semantic segmentation of urban remote sensing images by addressing texture differences and edge complexities.

DetailsMotivation: Urban remote sensing images (URSIs) face challenges like subtle texture differences, irregular shapes, and blurred boundaries, leading to semantic ambiguity and misclassification.

Method: TEFormer integrates a texture-aware module (TaM) for texture discrimination, an edge-guided tri-branch decoder (Eg3Head) for edge preservation, and an edge-guided feature fusion module (EgFFM) for refined segmentation.

Result: TEFormer achieves mIoU scores of 88.57%, 81.46%, and 53.55% on Potsdam, Vaihingen, and LoveDA datasets, respectively.

Conclusion: TEFormer effectively addresses the challenges of URSI segmentation, demonstrating superior performance in accuracy and edge preservation.

Abstract: Semantic segmentation of urban remote sensing images (URSIs) is crucial for applications such as urban planning and environmental monitoring. However, geospatial objects often exhibit subtle texture differences and similar spatial structures, which can easily lead to semantic ambiguity and misclassification. Moreover, challenges such as irregular object shapes, blurred boundaries, and overlapping spatial distributions of semantic objects contribute to complex and diverse edge morphologies, further complicating accurate segmentation. To tackle these issues, we propose a texture-aware and edge-guided Transformer (TEFormer) that integrates texture awareness and edge-guidance mechanisms for semantic segmentation of URSIs. In the encoder, a texture-aware module (TaM) is designed to capture fine-grained texture differences between visually similar categories to enhance semantic discrimination. Then, an edge-guided tri-branch decoder (Eg3Head) is constructed to preserve local edges and details for multiscale context-awareness. Finally, an edge-guided feature fusion module (EgFFM) is designed to fuse contextual and detail information with edge information to realize refined semantic segmentation. Extensive experiments show that TEFormer achieves mIoU of 88.57%, 81.46%, and 53.55% on the Potsdam, Vaihingen, and LoveDA datasets, respectively, demonstrating its effectiveness in URSI semantic segmentation.

[169] Mixture of Experts Guided by Gaussian Splatters Matters: A new Approach to Weakly-Supervised Video Anomaly Detection

Giacomo D’Amicantonio, Snehashis Majhi, Quan Kong, Lorenzo Garattoni, Gianpiero Francesca, François Bremond, Egor Bondarev

Main category: cs.CV

TL;DR: The paper introduces GS-MoE, a framework for Weakly-Supervised Video Anomaly Detection (WSVAD) that uses specialized expert models and temporal Gaussian splatting to improve anomaly detection, achieving state-of-the-art results.

DetailsMotivation: Current WSVAD models struggle with complex anomalies due to shared model limitations and weak supervision signals lacking temporal precision.

Method: Proposes GS-MoE, a framework with expert models for specific anomaly types, guided by temporal Gaussian splatting loss to enhance weak supervision.

Result: Achieves 91.58% AUC on UCF-Crime and superior performance on XD-Violence and MSAD datasets.

Conclusion: GS-MoE sets a new benchmark for WSVAD by leveraging category-specific expertise and temporal guidance.

Abstract: Video Anomaly Detection (VAD) is a challenging task due to the variability of anomalous events and the limited availability of labeled data. Under the Weakly-Supervised VAD (WSVAD) paradigm, only video-level labels are provided during training, while predictions are made at the frame level. Although state-of-the-art models perform well on simple anomalies (e.g., explosions), they struggle with complex real-world events (e.g., shoplifting). This difficulty stems from two key issues: (1) the inability of current models to address the diversity of anomaly types, as they process all categories with a shared model, overlooking category-specific features; and (2) the weak supervision signal, which lacks precise temporal information, limiting the ability to capture nuanced anomalous patterns blended with normal events. To address these challenges, we propose Gaussian Splatting-guided Mixture of Experts (GS-MoE), a novel framework that employs a set of expert models, each specialized in capturing specific anomaly types. These experts are guided by a temporal Gaussian splatting loss, enabling the model to leverage temporal consistency and enhance weak supervision. The Gaussian splatting approach encourages a more precise and comprehensive representation of anomalies by focusing on temporal segments most likely to contain abnormal events. The predictions from these specialized experts are integrated through a mixture-of-experts mechanism to model complex relationships across diverse anomaly patterns. Our approach achieves state-of-the-art performance, with a 91.58% AUC on the UCF-Crime dataset, and demonstrates superior results on XD-Violence and MSAD datasets. By leveraging category-specific expertise and temporal guidance, GS-MoE sets a new benchmark for VAD under weak supervision.
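
As a sketch of the temporal guidance idea: the abstract describes a temporal Gaussian splatting loss that concentrates supervision on the segments most likely to contain abnormal events. A minimal illustration of such a Gaussian temporal weighting follows; the parameterization is our assumption, not the paper's exact loss:

```python
import torch

def temporal_gaussian_weights(num_frames, center, sigma):
    """Gaussian weighting over time, peaked on the segment most likely to be
    anomalous; an illustrative stand-in for the temporal Gaussian splatting loss."""
    t = torch.arange(num_frames, dtype=torch.float32)
    return torch.exp(-0.5 * ((t - center) / sigma) ** 2)

# Example: concentrate per-frame supervision around frame 40 of a 64-frame clip.
weights = temporal_gaussian_weights(64, center=40.0, sigma=5.0)
weighted_scores = weights * torch.rand(64)  # stand-in per-frame anomaly scores
```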

[170] Depth Jitter: Seeing through the Depth

Md Sazidur Rahman, David Cabecinhas, Ricard Marxer

Main category: cs.CV

TL;DR: Depth-Jitter is a depth-based augmentation technique that improves model robustness in depth-sensitive environments by simulating natural depth variations.

DetailsMotivation: Conventional augmentation techniques lack depth awareness, limiting model performance in real-world depth variations.

Method: Depth-Jitter applies adaptive depth offsetting guided by depth variance thresholds to generate synthetic depth perturbations while preserving structural integrity.

Result: While not always outperforming traditional methods, Depth-Jitter consistently enhances model stability and generalization in depth-sensitive scenarios.

Conclusion: Depth-Jitter highlights the potential of depth-aware augmentation for real-world applications and encourages further research in depth-based learning strategies.

Abstract: Depth information is essential in computer vision, particularly in underwater imaging, robotics, and autonomous navigation. However, conventional augmentation techniques overlook depth-aware transformations, limiting model robustness under real-world depth variations. In this paper, we introduce Depth-Jitter, a novel depth-based augmentation technique that simulates natural depth variations to improve generalization. Our approach applies adaptive depth offsetting, guided by depth variance thresholds, to generate synthetic depth perturbations while preserving structural integrity. We evaluate Depth-Jitter on two benchmark datasets, FathomNet and UTDAC2020, demonstrating its impact on model stability under diverse depth conditions. Extensive experiments compare Depth-Jitter against traditional augmentation strategies such as ColorJitter, analyzing performance across varying learning rates, encoders, and loss functions. While Depth-Jitter does not always outperform conventional methods in absolute performance, it consistently enhances model stability and generalization in depth-sensitive environments. These findings highlight the potential of depth-aware augmentation for real-world applications and provide a foundation for further research into depth-based learning strategies. The proposed technique is publicly available to support advancements in depth-aware augmentation; the code can be found at https://github.com/mim-team/Depth-Jitter.
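
A minimal sketch of one plausible reading of "adaptive depth offsetting guided by depth variance thresholds": apply a per-image offset only in low-variance regions, so structure at depth discontinuities is preserved. The threshold, filter size, and offsetting rule below are illustrative assumptions, not the released implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def depth_jitter(depth, var_threshold=0.01, max_offset=0.1, rng=None):
    """Offset the depth map only where local depth variance is below a threshold,
    preserving structural integrity at depth discontinuities (illustrative)."""
    rng = rng or np.random.default_rng()
    local_mean = uniform_filter(depth, size=5)           # local box-filter mean
    local_var = uniform_filter(depth ** 2, size=5) - local_mean ** 2
    offset = rng.uniform(-max_offset, max_offset)        # one global jitter per call
    out = depth.copy()
    out[local_var < var_threshold] += offset             # perturb stable regions only
    return out
```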

[171] Towards Unified Image Deblurring using a Mixture-of-Experts Decoder

Daniel Feijoo, Paula Garrido-Mellado, Jaesung Rim, Alvaro Garcia, Marcos V. Conde

Main category: cs.CV

TL;DR: An all-in-one image deblurring method using a mixture-of-experts (MoE) decoding module to handle diverse blur types efficiently.

DetailsMotivation: Existing deblurring methods lack generalization, requiring multiple models for different blur types, which is impractical.

Method: Proposes a MoE decoding module that dynamically routes features based on blur type for precise restoration.

Result: Achieves performance comparable to task-specific models and shows robustness on unseen blur scenarios.

Conclusion: The unified approach is efficient, generalizable, and practical for diverse blur degradations.

Abstract: Image deblurring, removing blurring artifacts from images, is a fundamental task in computational photography and low-level computer vision. Existing approaches focus on specialized solutions tailored to particular blur types and thus lack generalization. This limitation means that multiple models are required to cover several blur types, which is impractical in many real scenarios. In this paper, we introduce the first all-in-one deblurring method capable of efficiently restoring images affected by diverse blur degradations, including global motion, local motion, blur in low-light conditions, and defocus blur. We propose a mixture-of-experts (MoE) decoding module, which dynamically routes image features based on the recognized blur degradation, enabling precise and efficient restoration in an end-to-end manner. Our unified approach not only achieves performance comparable to dedicated task-specific models, but also demonstrates remarkable robustness and generalization capabilities on unseen blur degradation scenarios.
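
A minimal sketch of the routing idea behind such a mixture-of-experts decoding module: a router scores the blur degradation from pooled features and softly mixes per-expert restorations. The expert count, shapes, and convolutional experts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MoEDecoderBlock(nn.Module):
    """Sketch of an MoE decoding step: route features to experts by blur type."""
    def __init__(self, channels=64, num_experts=4):
        super().__init__()
        self.router = nn.Linear(channels, num_experts)
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (B, C, H, W)
        gate = self.router(x.mean(dim=(2, 3))).softmax(-1)   # (B, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], 1)  # (B, E, C, H, W)
        return (gate[:, :, None, None, None] * outs).sum(1)  # weighted expert mix

block = MoEDecoderBlock()
y = block(torch.randn(2, 64, 32, 32))  # restored feature map, (2, 64, 32, 32)
```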

Aman Bhatta, Maria Dhakal, Michael C. King, Kevin W. Bowyer

Main category: cs.CV

TL;DR: The paper proposes a new method for detecting Out-of-gallery cases in one-to-many facial identification by training a classifier using additional enrolled images, showing effectiveness across various probe conditions and demographics.

DetailsMotivation: To reduce false positives and wrongful arrests by objectively determining if a rank-one result in facial identification is Out-of-gallery.

Method: Generates training data from ranks of additional enrolled images, trains a classifier to predict In-gallery/Out-of-gallery status, and tests across datasets and matchers.

Result: The approach works well for mugshot and degraded probes, with similar accuracy across demographics, and is effective only with advanced matchers.

Conclusion: The method offers a viable solution to improve facial identification reliability, reducing investigative errors and time.

Abstract: A central problem in one-to-many facial identification is that the person in the probe image may or may not have enrolled image(s) in the gallery; that is, may be In-gallery or Out-of-gallery. Past approaches to detect when a rank-one result is Out-of-gallery have mostly focused on finding a suitable threshold on the similarity score. We take a new approach, using the additional enrolled images of the identity with the rank-one result to predict if the rank-one result is In-gallery / Out-of-gallery. Given a gallery of identities and images, we generate In-gallery and Out-of-gallery training data by extracting the ranks of additional enrolled images corresponding to the rank-one identity. We then train a classifier to utilize this feature vector to predict whether a rank-one result is In-gallery or Out-of-gallery. Using two different datasets and four different matchers, we present experimental results showing that our approach is viable for mugshot quality probe images, and also, importantly, for probes degraded by blur, reduced resolution, atmospheric turbulence and sunglasses. We also analyze results across demographic groups, and show that In-gallery / Out-of-gallery classification accuracy is similar across demographics. Our approach has the potential to provide an objective estimate of whether a one-to-many facial identification is Out-of-gallery, and thereby to reduce false positive identifications, wrongful arrests, and wasted investigative time. Interestingly, comparing the results of older deep CNN-based face matchers with newer ones suggests that the effectiveness of our Out-of-gallery detection approach emerges only with matchers trained using advanced margin-based loss functions.
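
A minimal sketch of the rank-based feature construction described above: for a probe, collect the ranks at which the rank-one identity's additional enrolled images appear in the candidate list, and feed that vector to a binary In-gallery/Out-of-gallery classifier. The padding value and feature length are assumptions:

```python
import numpy as np

def rank_feature(ranked_identities, rank_one_id, num_enrolled, cap=100):
    """ranked_identities: identity label per gallery image, sorted by similarity.
    Returns the ranks (2, 3, ...) at which the rank-one identity's other enrolled
    images appear, padded with a sentinel value for images that never surface."""
    ranks = [r for r, ident in enumerate(ranked_identities[1:], start=2)
             if ident == rank_one_id][: num_enrolled - 1]
    ranks += [cap] * (num_enrolled - 1 - len(ranks))
    return np.asarray(ranks, dtype=float)

# Example: identity 7 is rank-one and its other images appear at ranks 3 and 9.
feat = rank_feature([7, 2, 7, 5, 1, 4, 3, 6, 7, 8], rank_one_id=7, num_enrolled=4)
# feat == [3., 9., 100.]; low ranks suggest In-gallery, high ranks Out-of-gallery.
```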

[173] Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

Main category: cs.CV

TL;DR: LNCLIP-DF, a parameter-efficient adaptation of CLIP, fine-tunes only Layer Normalization parameters (0.03%) and uses L2 normalization for robust deepfake detection generalization. It outperforms complex methods on 13 datasets, showing older datasets still generalize well.

DetailsMotivation: Addressing the challenge of generalizing deepfake detectors to unseen manipulation techniques without introducing architectural complexity.

Method: Fine-tunes Layer Normalization parameters in a pre-trained CLIP model, enforces hyperspherical feature manifold with L2 normalization and latent space augmentations.

Result: Achieves state-of-the-art performance on 13 benchmark datasets, outperforming complex methods in cross-dataset AUROC. Key findings: paired real-fake data is crucial, and older datasets generalize well.

Conclusion: Targeted, minimal changes to CLIP can achieve state-of-the-art generalization, offering a computationally efficient and reproducible method.

Abstract: The generalization of deepfake detectors to unseen manipulation techniques remains a challenge for practical deployment. Although many approaches adapt foundation models by introducing significant architectural complexity, this work demonstrates that robust generalization is achievable through a parameter-efficient adaptation of a pre-trained CLIP vision encoder. The proposed method, LNCLIP-DF, fine-tunes only the Layer Normalization parameters (0.03% of the total) and enhances generalization by enforcing a hyperspherical feature manifold using L2 normalization and latent space augmentations. We conducted an extensive evaluation on 13 benchmark datasets spanning from 2019 to 2025. The proposed method achieves state-of-the-art performance, outperforming more complex, recent approaches in average cross-dataset AUROC. Our analysis yields two primary findings for the field: 1) training on paired real-fake data from the same source video is essential for mitigating shortcut learning and improving generalization, and 2) detection difficulty on academic datasets has not strictly increased over time, with models trained on older, diverse datasets showing strong generalization capabilities. This work delivers a computationally efficient and reproducible method, proving that state-of-the-art generalization is attainable by making targeted, minimal changes to a pre-trained CLIP model. The code will be made publicly available upon acceptance.
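
The adaptation recipe is simple enough to sketch: freeze the encoder, re-enable only the LayerNorm affine parameters, and L2-normalize output features onto the unit hypersphere. The encoder below is a stand-in module, not the actual pre-trained CLIP vision encoder:

```python
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(  # stand-in for a pre-trained CLIP ViT
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)

for p in encoder.parameters():          # freeze all weights...
    p.requires_grad = False
for m in encoder.modules():             # ...then re-enable only LayerNorm affine terms
    if isinstance(m, nn.LayerNorm):
        for p in m.parameters():
            p.requires_grad = True

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in encoder.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # tiny, as in the paper's 0.03%

feats = encoder(torch.randn(4, 16, 512))              # (batch, tokens, dim)
feats = nn.functional.normalize(feats[:, 0], dim=-1)  # L2-norm -> hyperspherical manifold
```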

[174] TRUST: Leveraging Text Robustness for Unsupervised Domain Adaptation

Mattia Litrico, Mario Valerio Giuffrida, Sebastiano Battiato, Devis Tuia

Main category: cs.CV

TL;DR: TRUST is a novel UDA method using language modality to guide vision model adaptation, improving robustness to complex domain shifts.

DetailsMotivation: Existing UDA methods struggle with complex domain shifts (e.g., geographical). Language modality offers robustness, motivating its integration into vision model adaptation.

Method: TRUST generates pseudo-labels from captions, uses CLIP similarity for uncertainty estimation, and introduces a multimodal soft-contrastive learning loss to align vision and language features.

Result: TRUST outperforms prior methods, achieving state-of-the-art on DomainNet and GeoNet benchmarks.

Conclusion: TRUST effectively leverages language to enhance vision model adaptation, addressing complex domain shifts and improving performance.

Abstract: Recent unsupervised domain adaptation (UDA) methods have shown great success in addressing classical domain shifts (e.g., synthetic-to-real), but they still suffer under complex shifts (e.g., geographical shift), where both the background and object appearances differ significantly across domains. Prior works showed that the language modality can help in the adaptation process, exhibiting more robustness to such complex shifts. In this paper, we introduce TRUST, a novel UDA approach that exploits the robustness of the language modality to guide the adaptation of a vision model. TRUST generates pseudo-labels for target samples from their captions and introduces a novel uncertainty estimation strategy that uses normalised CLIP similarity scores to estimate the uncertainty of the generated pseudo-labels. Such estimated uncertainty is then used to reweight the classification loss, mitigating the adverse effects of wrong pseudo-labels obtained from low-quality captions. To further increase the robustness of the vision model, we propose a multimodal soft-contrastive learning loss that aligns the vision and language feature spaces, by leveraging captions to guide the contrastive training of the vision model on target images. In our contrastive loss, each pair of images acts as both a positive and a negative pair, and their feature representations are attracted and repulsed with a strength proportional to the similarity of their captions. This solution avoids the need for hard assignment of positive and negative pairs, which is critical in the UDA setting. Our approach outperforms previous methods, setting the new state-of-the-art on classical (DomainNet) and complex (GeoNet) domain shifts. The code will be available upon acceptance.
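
A minimal sketch of the uncertainty-weighted classification loss described above, assuming normalized CLIP image-caption similarities are already computed per sample; the normalization scheme here is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def caption_weighted_loss(logits, pseudo_labels, clip_sim):
    """Reweight the classification loss by CLIP image-caption similarity so that
    low-quality captions (hence unreliable pseudo-labels) contribute less."""
    weights = clip_sim / (clip_sim.sum() + 1e-8)                 # (N,) uncertainty weights
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weights * per_sample).sum()

# Example usage with random stand-ins for logits, pseudo-labels, similarities.
loss = caption_weighted_loss(torch.randn(8, 10),
                             torch.randint(0, 10, (8,)),
                             torch.rand(8))
```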

[175] FedX: Explanation-Guided Pruning for Communication-Efficient Federated Learning in Remote Sensing

Barış Büyüktaş, Jonas Klotz, Begüm Demir

Main category: cs.CV

TL;DR: FedX is a novel federated learning strategy using explanation-guided pruning to reduce communication overhead in remote sensing image classification without performance loss.

DetailsMotivation: Federated learning (FL) is suitable for remote sensing tasks due to privacy constraints, but communication overhead from large model updates is a challenge.

Method: FedX employs backpropagation-based explanation methods to prune less important model components, reducing the size of transmitted models.

Result: FedX significantly reduces shared model parameters and improves generalization, outperforming unpruned and state-of-the-art pruning methods.

Conclusion: FedX effectively addresses communication overhead in FL for remote sensing tasks while maintaining performance.

Abstract: Federated learning (FL) enables the collaborative training of deep neural networks across decentralized data archives (i.e., clients), where each client stores data locally and only shares model updates with a central server. This makes FL a suitable learning paradigm for remote sensing (RS) image classification tasks, where data centralization may be restricted due to legal and privacy constraints. However, a key challenge in applying FL to RS tasks is the communication overhead caused by the frequent exchange of large model updates between clients and the central server. To address this issue, in this paper we propose a novel strategy (denoted as FedX) that uses explanation-guided pruning to reduce communication overhead by minimizing the size of the transmitted models without compromising performance. FedX leverages backpropagation-based explanation methods to estimate the task-specific importance of model components and prunes the least relevant ones at the central server. The resulting sparse global model is then sent to clients, substantially reducing communication overhead. We evaluate FedX on multi-label scene classification using the BigEarthNet-S2 dataset and single-label scene classification using the EuroSAT dataset. Experimental results show the success of FedX in significantly reducing the number of shared model parameters while enhancing the generalization capability of the global model, compared to both unpruned model and state-of-the-art pruning methods. The code of FedX will be available at https://git.tu-berlin.de/rsim/FedX.
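
A minimal sketch of explanation-guided pruning on the server side, assuming per-element importance scores from a backpropagation-based explanation method are already available; the thresholding rule below is an illustrative stand-in for the paper's pruning criterion:

```python
import torch

def prune_least_important(state_dict, importance, keep_ratio=0.7):
    """Zero the least important entries of each tensor before broadcasting the
    global model; `importance` maps parameter names to per-element scores
    (assumed to come from an explanation method such as LRP or gradients)."""
    for name, w in state_dict.items():
        score = importance[name].flatten()
        k = max(1, int(keep_ratio * score.numel()))
        thresh = score.topk(k).values.min()           # keep the k highest scores
        mask = (importance[name] >= thresh).to(w.dtype)
        state_dict[name] = w * mask                   # sparse update -> smaller transfer
    return state_dict
```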

[176] XAG-Net: A Cross-Slice Attention and Skip Gating Network for 2.5D Femur MRI Segmentation

Byunghyun Ko, Anning Tian, Jeongkyu Lee

Main category: cs.CV

TL;DR: XAG-Net, a 2.5D U-Net-based model with cross-slice attention and skip attention gating, outperforms existing methods in femur MRI segmentation.

DetailsMotivation: Accurate femur segmentation from MRI is crucial for orthopedic diagnosis and surgery, but current 2D/3D deep learning methods have limitations.

Method: Proposes XAG-Net, incorporating pixel-wise cross-slice attention and skip attention gating for better inter-slice and intra-slice feature modeling.

Result: XAG-Net achieves higher accuracy than 2D, 2.5D, and 3D U-Net models while remaining computationally efficient.

Conclusion: XAG-Net is an effective framework for femur MRI segmentation, validated by ablation studies.

Abstract: Accurate segmentation of femur structures from Magnetic Resonance Imaging (MRI) is critical for orthopedic diagnosis and surgical planning but remains challenging due to the limitations of existing 2D and 3D deep learning-based segmentation approaches. In this study, we propose XAG-Net, a novel 2.5D U-Net-based architecture that incorporates pixel-wise cross-slice attention (CSA) and skip attention gating (AG) mechanisms to enhance inter-slice contextual modeling and intra-slice feature refinement. Unlike previous CSA-based models, XAG-Net applies pixel-wise softmax attention across adjacent slices at each spatial location for fine-grained inter-slice modeling. Extensive evaluations demonstrate that XAG-Net surpasses baseline 2D, 2.5D, and 3D U-Net models in femur segmentation accuracy while maintaining computational efficiency. Ablation studies further validate the critical role of the CSA and AG modules, establishing XAG-Net as a promising framework for efficient and accurate femur MRI segmentation.
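
A minimal sketch of pixel-wise softmax attention across adjacent slices, the core idea of the CSA module as described; the per-pixel scoring used here is a simplified assumption, not the exact module:

```python
import torch
import torch.nn.functional as F

def pixelwise_cross_slice_attention(feats):
    """feats: (S, C, H, W) features from S adjacent slices. Softmax attention is
    computed independently at every spatial location across the slice axis."""
    scores = feats.mean(dim=1, keepdim=True)   # (S, 1, H, W) per-pixel slice scores
    weights = F.softmax(scores, dim=0)         # normalize across adjacent slices
    return (weights * feats).sum(dim=0)        # (C, H, W) fused representation

fused = pixelwise_cross_slice_attention(torch.randn(3, 32, 64, 64))
```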

[177] SPARSE Data, Rich Results: Few-Shot Semi-Supervised Learning via Class-Conditioned Image Translation

Guido Manni, Clemente Lauretti, Loredana Zollo, Paolo Soda

Main category: cs.CV

TL;DR: A GAN-based semi-supervised learning framework for medical imaging improves classification with minimal labeled data, outperforming state-of-the-art methods.

DetailsMotivation: Deep learning in medical imaging is limited by scarce labeled data. This paper addresses the challenge of achieving robust performance with very few labeled samples.

Method: The framework uses a GAN with three networks (generator, discriminator, classifier) and a three-phase training process. It combines supervised learning on labeled data and unsupervised learning via image-to-image translation, with ensemble-based pseudo-labeling for unlabeled data.

Result: The method significantly outperforms six state-of-the-art GAN-based semi-supervised methods across eleven MedMNIST datasets, especially in extreme low-data settings (e.g., 5-shot).

Conclusion: The proposed framework is a practical solution for medical imaging with prohibitive annotation costs, enabling strong performance even with minimal labeled data.

Abstract: Deep learning has revolutionized medical imaging, but its effectiveness is severely limited by insufficient labeled training data. This paper introduces a novel GAN-based semi-supervised learning framework specifically designed for low labeled-data regimes, evaluated across settings with 5 to 50 labeled samples per class. Our approach integrates three specialized neural networks – a generator for class-conditioned image translation, a discriminator for authenticity assessment and classification, and a dedicated classifier – within a three-phase training framework. The method alternates between supervised training on limited labeled data and unsupervised learning that leverages abundant unlabeled images through image-to-image translation rather than generation from noise. We employ ensemble-based pseudo-labeling that combines confidence-weighted predictions from the discriminator and classifier with temporal consistency through exponential moving averaging, enabling reliable label estimation for unlabeled data. Comprehensive evaluation across eleven MedMNIST datasets demonstrates that our approach achieves statistically significant improvements over six state-of-the-art GAN-based semi-supervised methods, with particularly strong performance in the extreme 5-shot setting where the scarcity of labeled data is most challenging. The framework maintains its superiority across all evaluated settings (5, 10, 20, and 50 shots per class). Our approach offers a practical solution for medical imaging applications where annotation costs are prohibitive, enabling robust classification performance even with minimal labeled data. Code is available at https://github.com/GuidoManni/SPARSE.
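
A minimal sketch of the ensemble-based pseudo-labeling step, combining confidence-weighted predictions from the discriminator and classifier with exponential moving averaging; the fusion rule and the 0.9 decay are illustrative assumptions in the spirit of the description above:

```python
import torch

def ensemble_pseudo_labels(p_disc, p_clf, ema_probs, decay=0.9):
    """p_disc, p_clf: (N, C) softmax outputs of discriminator and classifier;
    ema_probs: running (N, C) average providing temporal consistency."""
    conf_d = p_disc.max(dim=1).values.unsqueeze(1)       # per-sample confidence
    conf_c = p_clf.max(dim=1).values.unsqueeze(1)
    fused = (conf_d * p_disc + conf_c * p_clf) / (conf_d + conf_c + 1e-8)
    ema_probs = decay * ema_probs + (1 - decay) * fused  # exponential moving average
    return ema_probs, ema_probs.argmax(dim=1)            # smoothed probs, hard labels
```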

[178] Uncertainty-quantified Rollout Policy Adaptation for Unlabelled Cross-domain Temporal Grounding

Jian Hu, Zixu Cheng, Shaogang Gong, Isabel Guan, Jianye Hao, Jun Wang, Kun Shao

Main category: cs.CV

TL;DR: The paper introduces URPA, a data-efficient method for cross-domain video temporal grounding without labeled target data, using uncertainty-quantified rollouts for adaptation.

DetailsMotivation: Existing methods like GRPO require labeled data and are computationally expensive, limiting their use in unlabeled domains and real-time applications.

Method: URPA adapts a model trained on labeled source data to a target domain using unlabeled videos, leveraging GRPO rollouts to generate pseudo labels and confidence-weighted rewards.

Result: URPA achieves strong generalization across six cross-domain settings with minimal unlabeled target data.

Conclusion: URPA offers a practical solution for real-time, label-efficient video temporal grounding.

Abstract: Video Temporal Grounding (TG) aims to temporally locate video segments matching a natural language description (a query) in a long video. While Vision-Language Models (VLMs) are effective at holistic semantic matching, they often struggle with fine-grained temporal localisation. Recently, Group Relative Policy Optimisation (GRPO) reformulates the inference process as a reinforcement learning task, enabling fine-grained grounding and achieving strong in-domain performance. However, GRPO relies on labelled data, making it unsuitable in unlabelled domains. Moreover, because videos are large and expensive to store and process, performing full-scale adaptation introduces prohibitive latency and computational overhead, making it impractical for real-time deployment. To overcome both problems, we introduce a Data-Efficient Unlabelled Cross-domain Temporal Grounding method, in which a model is first trained on a labelled source domain and then adapted to a target domain using only a small number of unlabelled videos from the target domain. This approach eliminates the need for target annotation and keeps both computational and storage overhead low enough to run in real time. Specifically, we introduce Uncertainty-quantified Rollout Policy Adaptation (URPA) for cross-domain knowledge transfer in learning video temporal grounding without target labels. URPA generates multiple candidate predictions using GRPO rollouts, averages them to form a pseudo label, and estimates confidence from the variance across these rollouts. This confidence then weights the training rewards, guiding the model to focus on reliable supervision. Experiments on three datasets across six cross-domain settings show that URPA generalises well using only a few unlabelled target videos. Codes will be released once published.
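
A minimal sketch of the URPA pseudo-labeling step as described: average GRPO rollouts into a pseudo label and turn rollout variance into a confidence that weights the training reward. The exp(-variance) mapping is our assumption, not the paper's exact formula:

```python
import torch

def urpa_pseudo_label(rollouts):
    """rollouts: (K, 2) tensor of K GRPO rollout predictions (start, end) for one
    query. Returns the consensus pseudo label and a variance-based confidence."""
    pseudo = rollouts.mean(dim=0)            # consensus (start, end) pseudo label
    spread = rollouts.var(dim=0).mean()      # disagreement across rollouts
    confidence = torch.exp(-spread)          # low variance -> high confidence
    return pseudo, confidence

pseudo, conf = urpa_pseudo_label(torch.tensor([[3.1, 8.0], [2.9, 8.4], [3.0, 7.8]]))
# `conf` would then scale the reward for this training sample.
```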

[179] WGAST: Weakly-Supervised Generative Network for Daily 10 m Land Surface Temperature Estimation via Spatio-Temporal Fusion

Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai

Main category: cs.CV

TL;DR: WGAST is a weakly-supervised deep learning framework for estimating daily 10 m Land Surface Temperature (LST) by fusing data from Terra MODIS, Landsat 8, and Sentinel-2, outperforming existing methods.

DetailsMotivation: The increasing demand for precise environmental monitoring due to urbanization, climate change, and agricultural stress necessitates high-resolution LST data, which current remote sensing systems struggle to provide due to spatial-temporal resolution trade-offs.

Method: WGAST uses a conditional generative adversarial network with a four-stage generator: feature extraction, fusion, LST reconstruction, and noise suppression. It employs encoders, cosine similarity, normalization, temporal attention, and a Gaussian filter. Training is weakly supervised and reinforced by a PatchGAN discriminator.

Result: WGAST reduces RMSE by 17.18% and improves SSIM by 11.00% compared to baselines, showing robustness to cloud-induced LST and capturing fine-scale thermal patterns, validated by 33 ground sensors.

Conclusion: WGAST is an effective solution for high-resolution daily LST estimation, advancing spatio-temporal fusion methods for environmental monitoring.

Abstract: Urbanization, climate change, and agricultural stress are increasing the demand for precise and timely environmental monitoring. Land Surface Temperature (LST) is a key variable in this context and is retrieved from remote sensing satellites. However, these systems face a trade-off between spatial and temporal resolution. While spatio-temporal fusion methods offer promising solutions, few have addressed the estimation of daily LST at 10 m resolution. In this study, we present WGAST, a Weakly-Supervised Generative Network for Daily 10 m LST Estimation via Spatio-Temporal Fusion of Terra MODIS, Landsat 8, and Sentinel-2. WGAST is the first end-to-end deep learning framework designed for this task. It adopts a conditional generative adversarial architecture, with a generator composed of four stages: feature extraction, fusion, LST reconstruction, and noise suppression. The first stage employs a set of encoders to extract multi-level latent representations from the inputs, which are then fused in the second stage using cosine similarity, normalization, and temporal attention mechanisms. The third stage decodes the fused features into high-resolution LST, followed by a Gaussian filter to suppress high-frequency noise. Training follows a weakly supervised strategy based on physical averaging principles and reinforced by a PatchGAN discriminator. Experiments demonstrate that WGAST outperforms existing methods in both quantitative and qualitative evaluations. Compared to the best-performing baseline, on average, WGAST reduces RMSE by 17.18% and improves SSIM by 11.00%. Furthermore, WGAST is robust to cloud-induced LST and effectively captures fine-scale thermal patterns, as validated against 33 ground-based sensors. The code is available at https://github.com/Sofianebouaziz1/WGAST.git.

[180] CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

Shengzhu Yang, Jiawei Du, Shuai Lu, Weihang Zhang, Ningli Wang, Huiqi Li

Main category: cs.CV

TL;DR: CLIPin is a non-contrastive plug-in for CLIP-style models to improve multimodal alignment and robustness, tested on diverse tasks.

DetailsMotivation: Address weak supervision in natural image-text datasets and low diversity in medical datasets, hindering robust representation learning in CLIP.

Method: Proposes CLIPin, a unified plug-in with shared pre-projectors for image and text, integrating contrastive and non-contrastive learning.

Result: Demonstrates effectiveness and generality across downstream tasks, enhancing alignment robustness.

Conclusion: CLIPin is a versatile, plug-and-play solution compatible with various contrastive frameworks.

Abstract: Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model’s ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.

[181] Can Diffusion Models Bridge the Domain Gap in Cardiac MR Imaging?

Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar

Main category: cs.CV

TL;DR: A diffusion model generates synthetic cardiac MR images to address domain shift, improving segmentation performance on unseen data.

DetailsMotivation: Domain shift in MR imaging limits AI model deployment; synthetic data offers a solution but faces anatomical consistency challenges.

Method: A diffusion model creates synthetic cardiac MR images resembling a reference, ensuring structural fidelity. Evaluated with nnU-Net and U-Net for domain generalization and adaptation.

Result: Significant improvement in segmentation performance on unseen domains (p < 0.01) compared to real data alone.

Conclusion: The method reduces reliance on transfer learning, addressing domain shift in data-scarce settings.

Abstract: Magnetic resonance (MR) imaging, including cardiac MR, is prone to domain shift due to variations in imaging devices and acquisition protocols. This challenge limits the deployment of trained AI models in real-world scenarios, where performance degrades on unseen domains. Traditional solutions involve increasing the size of the dataset through ad-hoc image augmentation or additional online training/transfer learning, which have several limitations. Synthetic data offers a promising alternative, but anatomical/structural consistency constraints limit the effectiveness of generative models in creating image-label pairs. To address this, we propose a diffusion model (DM) trained on a source domain that generates synthetic cardiac MR images that resemble a given reference. The synthetic data maintains spatial and structural fidelity, ensuring similarity to the source domain and compatibility with the segmentation mask. We assess the utility of our generative approach in multi-centre cardiac MR segmentation, using the 2D nnU-Net, 3D nnU-Net and vanilla U-Net segmentation networks. We explore domain generalisation, where domain-invariant segmentation models are trained on synthetic source domain data, and domain adaptation, where we shift target domain data towards the source domain using the DM. Both strategies significantly improved segmentation performance on data from an unseen target domain, in terms of surface-based metrics (Welch’s t-test, p < 0.01), compared to training segmentation models on real data alone. The proposed method ameliorates the need for transfer learning or online training to address domain shift challenges in cardiac MR image analysis, and is especially useful in data-scarce settings.

[182] Text Embedded Swin-UMamba for DeepLesion Segmentation

Ruida Cheng, Tejas Sudharshan Mathai, Pritam Mukherjee, Benjamin Hou, Qingqing Zhu, Zhiyong Lu, Matthew McAuliffe, Ronald M. Summers

Main category: cs.CV

TL;DR: The paper explores integrating LLMs with the Swin-UMamba architecture for lesion segmentation, achieving high accuracy and outperforming prior models.

DetailsMotivation: To enhance lesion segmentation by combining imaging features with text descriptions from radiology reports.

Method: Integration of text into the Swin-UMamba architecture using the ULS23 DeepLesion dataset and radiology report descriptions.

Result: Achieved a Dice Score of 82% and Hausdorff distance of 6.58 pixels, outperforming previous models.

Conclusion: The Text-Swin-UMamba model is effective for lesion segmentation, combining text and imaging data for improved results.

Abstract: Segmentation of lesions on CT enables automatic measurement for clinical assessment of chronic diseases (e.g., lymphoma). Integrating large language models (LLMs) into the lesion segmentation workflow offers the potential to combine imaging features with descriptions of lesion characteristics from the radiology reports. In this study, we investigate the feasibility of integrating text into the Swin-UMamba architecture for the task of lesion segmentation. The publicly available ULS23 DeepLesion dataset was used along with short-form descriptions of the findings from the reports. On the test dataset, a high Dice score of 82% and a low Hausdorff distance of 6.58 pixels were obtained for lesion segmentation. The proposed Text-Swin-UMamba model outperformed prior approaches: a 37% improvement over the LLM-driven LanGuideMedSeg model (p < 0.001), and it surpassed the purely image-based xLSTM-UNet and nnUNet models by 1.74% and 0.22%, respectively. The dataset and code can be accessed at https://github.com/ruida/LLM-Swin-UMamba

[183] ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction

Patrick Takenaka, Johannes Maucher, Marco F. Huber

Main category: cs.CV

TL;DR: The paper improves ViPro by enabling unsupervised state inference from observations without initial ground truth, addressing shortcuts in the original model.

DetailsMotivation: To address the limitation of ViPro, which relied on ground truth initial states and failed with noisy observations, by enabling robust state inference.

Method: Enhanced ViPro with improvements for unsupervised state inference and extended the Orbits dataset to 3D for realism.

Result: The improved model successfully infers states from observations without initial ground truth, validated on a 3D dataset.

Conclusion: The enhancements enable robust state inference in noisy scenarios, advancing video frame prediction.

Abstract: Predicting future video frames is a challenging task with many downstream applications. Previous work has shown that procedural knowledge enables deep models for complex dynamical settings; however, their model ViPro assumed a given ground truth initial symbolic state. We show that this approach led the model to learn a shortcut that does not actually connect the observed environment with the predicted symbolic state, resulting in the inability to estimate states given an observation if previous states are noisy. In this work, we add several improvements to ViPro that enable the model to correctly infer states from observations without providing a full ground truth state in the beginning. We show that this is possible in an unsupervised manner, and extend the original Orbits dataset with a 3D variant to close the gap to real-world scenarios.

[184] Street View Sociability: Interpretable Analysis of Urban Social Behavior Across 15 Cities

Kieran Elrod, Katherine Flanigan, Mario Bergés

Main category: cs.CV

TL;DR: Street view imagery can infer social interactions using social science theory, showing correlations with urban design variables and place attachment.

DetailsMotivation: To bridge the gap in measuring social interaction quality in urban planning by leveraging street view imagery and social science theory.

Method: Analyzed 2,998 street view images using a multimodal large language model guided by Mehta’s sociability taxonomy, with regression models controlling for environmental factors.

Result: Sky view index linked to all sociability types; green view index predicted enduring sociability; place attachment correlated with fleeting sociability.

Conclusion: Street view imagery shows promise as a scalable tool for studying urban sociability and informing evidence-based urban design.

Abstract: Designing socially active streets has long been a goal of urban planning, yet existing quantitative research largely measures pedestrian volume rather than the quality of social interactions. We hypothesize that street view imagery – an inexpensive data source with global coverage – contains latent social information that can be extracted and interpreted through established social science theory. As a proof of concept, we analyzed 2,998 street view images from 15 cities using a multimodal large language model guided by Mehta’s taxonomy of passive, fleeting, and enduring sociability – one illustrative example of a theory grounded in urban design that could be substituted or complemented by other sociological frameworks. We then used linear regression models, controlling for factors like weather, time of day, and pedestrian counts, to test whether the inferred sociability measures correlate with city-level place attachment scores from the World Values Survey and with environmental predictors (e.g., green, sky, and water view indices) derived from individual street view images. Results aligned with long-standing urban planning theory: the sky view index was associated with all three sociability types, the green view index predicted enduring sociability, and place attachment was positively associated with fleeting sociability. These results provide preliminary evidence that street view images can be used to infer relationships between specific types of social interactions and built environment variables. Further research could establish street view imagery as a scalable, privacy-preserving tool for studying urban sociability, enabling cross-cultural theory testing and evidence-based design of socially vibrant cities.

[185] Aligning Effective Tokens with Video Anomaly in Large Language Models

Yingxian Chen, Jiahui Liu, Ruifan Di, Yanwei Li, Chirui Chang, Shizhen Zhao, Wilton W. T. Fok, Xiaojuan Qi, Yik-Chung Wu

Main category: cs.CV

TL;DR: VA-GPT is a novel MLLM designed for summarizing and localizing abnormal events in videos, leveraging VLMs and LLMs with spatial and temporal token modules (SETS and TETG) for improved accuracy.

DetailsMotivation: Current MLLMs struggle with anomalies due to spatial and temporal sparsity, leading to suboptimal results.

Method: Proposes VA-GPT with SETS and TETG modules for effective token alignment between visual encoders and LLMs, and introduces a fine-tuning dataset and cross-domain benchmark.

Result: Outperforms state-of-the-art methods on various benchmarks.

Conclusion: VA-GPT effectively addresses challenges in analyzing abnormal events, offering improved performance and accuracy.

Abstract: Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision-Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks.

[186] An Implementation of Two-Phase Image Segmentation using the Split Bregman Method

Olakunle S. Abawonse, Günay Doğan

Main category: cs.CV

TL;DR: Implementation of a two-phase image segmentation algorithm based on Goldstein et al.’s modification of the Chan-Vese model, optimized using the split Bregman method.

DetailsMotivation: To efficiently partition images into foreground and background regions by leveraging a modified energy model for smoother boundaries and distinct pixel averages.

Method: Uses a two-phase segmentation algorithm with a modified Chan-Vese energy model, optimized via the split Bregman method for efficiency.

Result: Demonstrated performance across various images and parameter settings, confirming the method’s effectiveness.

Conclusion: The implementation successfully achieves efficient two-phase segmentation with smooth boundaries, validating the modified approach.

Abstract: In this paper, we describe an implementation of the two-phase image segmentation algorithm proposed by Goldstein, Bresson, and Osher. This algorithm partitions the domain of a given 2D image into foreground and background regions, and each pixel of the image is assigned membership to one of these two regions. The underlying assumption for the segmentation model is that the pixel values of the input image can be summarized by two distinct average values, and that the region boundaries are smooth. Accordingly, the model is defined as an energy in which the variable is a region membership function that assigns pixels to either region, originally proposed by Chan and Vese. This energy is the sum of image data terms in the regions and a length penalty for region boundaries. Goldstein, Bresson, and Osher modify the Chan-Vese energy so that the new energy can be minimized efficiently using the split Bregman method to produce an equivalent two-phase segmentation. We provide a detailed implementation of this method and document its performance with several images over a range of algorithm parameters.
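
For reference, the globally convex form of the Chan-Vese energy used in this line of work can be written as follows, with $f$ the input image, $u$ the relaxed region membership function, and $c_1, c_2$ the region averages (a standard statement of the model; notation ours):

```latex
\min_{0 \le u \le 1} \;
  \int_\Omega |\nabla u| \, dx
  \;+\; \lambda \int_\Omega \big( (c_1 - f(x))^2 - (c_2 - f(x))^2 \big)\, u(x) \, dx
```

The foreground is recovered by thresholding, $\Sigma = \{ x : u(x) > \mu \}$ for some $\mu \in (0, 1)$; the split Bregman method handles the total-variation term by introducing an auxiliary variable $d \approx \nabla u$ and alternating inexpensive subproblem solves.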

[187] Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning

Xiangyu Wu, Feng Yu, Yang Yang, Jianfeng Lu

Main category: cs.CV

TL;DR: TaAM-CPT introduces a scalable method for general representation learning across unlimited modalities using only text data, achieving top results without modality-specific labels.

DetailsMotivation: Existing methods rely on massive labeled data or are limited to single modalities, prompting the need for a more flexible and scalable solution.

Method: TaAM-CPT uses modality prompt pools, text construction, and modality-aligned text encoders, with intra- and inter-modal learning objectives for consistency.

Result: TaAM-CPT achieves leading performance on diverse datasets (video, image, audio classification) without modality-specific labeled data.

Conclusion: TaAM-CPT offers a scalable and effective approach for multimodal learning, extending to unlimited modalities with text-only data.

Abstract: The integration of prompt tuning with multimodal learning has shown significant generalization abilities for various downstream tasks. Despite advancements, existing methods heavily depend on massive modality-specific labeled data (e.g., video, audio, and image), or are customized for a single modality. In this study, we present Text as Any-Modality by Consistent Prompt Tuning (TaAM-CPT), a scalable approach for constructing a general representation model toward unlimited modalities using solely text data. TaAM-CPT comprises modality prompt pools, text construction, and modality-aligned text encoders from pre-trained models, which allows for extending new modalities by simply adding prompt pools and modality-aligned text encoders. To harmonize the learning across different modalities, TaAM-CPT designs intra- and inter-modal learning objectives, which can capture category details within modalities while maintaining semantic consistency across different modalities. Benefiting from its scalable architecture and pre-trained models, TaAM-CPT can be seamlessly extended to accommodate unlimited modalities. Remarkably, without any modality-specific labeled data, TaAM-CPT achieves leading results on diverse datasets spanning various modalities, including video classification, image classification, and audio classification. The code is available at https://github.com/Jinx630/TaAM-CPT.

[188] FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

Wenbin Teng, Gonglin Chen, Haiwei Chen, Yajie Zhao

Main category: cs.CV

TL;DR: FVGen accelerates novel view synthesis using VDMs, reducing sampling steps to four while maintaining quality, improving efficiency for 3D reconstruction with sparse views.

DetailsMotivation: Addressing the slow sampling speed of VDMs in sparse-view 3D reconstruction, which hinders practical applications.

Method: Proposes FVGen, a framework distilling a multi-step VDM into a few-step model using GANs and softened reverse KL-divergence.

Result: Achieves similar/better visual quality with 90% faster sampling, enhancing efficiency for sparse-view reconstruction.

Conclusion: FVGen significantly boosts time efficiency for 3D reconstruction tasks, especially with sparse input views.

Abstract: Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as four sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to previous works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage.

[189] Feature-Space Oversampling for Addressing Class Imbalance in SAR Ship Classification

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

Main category: cs.CV

TL;DR: The paper evaluates oversampling in feature space for SAR ship classification, proposing two novel algorithms (M2m$_f$ and M2m$_u$) that outperform baselines, improving F1-scores by 8.82% and 4.44% on two datasets.

DetailsMotivation: Addressing class imbalance in SAR ship classification due to long-tailed datasets by leveraging oversampling methods.

Method: Proposed two algorithms (M2m$_f$ and M2m$_u$) inspired by M2m, tested on OpenSARShip and FuSARShip using ViT, VGG16, and ResNet50 as feature extractors. Analyzed oversampling impact on class sizes.

Result: Novel methods outperformed baselines, achieving average F1-score increases of 8.82% (FuSARShip) and 4.44% (OpenSARShip).

Conclusion: The proposed oversampling methods effectively improve SAR ship classification performance, especially for underrepresented classes.

Abstract: SAR ship classification faces the challenge of long-tailed datasets, which complicates the classification of underrepresented classes. Oversampling methods have proven effective in addressing class imbalance in optical data. In this paper, we evaluate the effect of oversampling in the feature space for SAR ship classification. We propose two novel algorithms inspired by the Major-to-minor (M2m) method: M2m$_f$ and M2m$_u$. The algorithms are tested on two public datasets, OpenSARShip (6 classes) and FuSARShip (9 classes), using three state-of-the-art models as feature extractors: ViT, VGG16, and ResNet50. Additionally, we analyze the impact of oversampling methods on different class sizes. The results demonstrate the effectiveness of our novel methods over the original M2m and baselines, with an average F1-score increase of 8.82% for FuSARShip and 4.44% for OpenSARShip.

[190] MotionSwap

Om Patil, Jinesh Modi, Suryabha Mukhopadhyay, Meghaditya Giri, Chhavi Malhotra

Main category: cs.CV

TL;DR: This paper enhances SimSwap for face swapping by adding attention mechanisms, dynamic loss weighting, and learning rate scheduling, improving identity preservation and visual quality.

DetailsMotivation: To advance face swapping technology by improving the SimSwap framework for higher fidelity and better performance.

Method: Integrates self and cross-attention mechanisms, dynamic loss weighting, and cosine annealing learning rate scheduling into the generator.

Result: Achieves better identity similarity, lower FID scores, and superior visual quality compared to the baseline.

Conclusion: Future work includes integrating StyleGAN3, improving lip sync, adding 3D modeling, and ensuring temporal consistency for videos.

Abstract: Face swapping technology has gained significant attention in both academic research and commercial applications. This paper presents our implementation and enhancement of SimSwap, an efficient framework for high fidelity face swapping. We introduce several improvements to the original model, including the integration of self and cross-attention mechanisms in the generator architecture, dynamic loss weighting, and cosine annealing learning rate scheduling. These enhancements lead to significant improvements in identity preservation, attribute consistency, and overall visual quality. Our experimental results, spanning 400,000 training iterations, demonstrate progressive improvements in generator and discriminator performance. The enhanced model achieves better identity similarity, lower FID scores, and visibly superior qualitative results compared to the baseline. Ablation studies confirm the importance of each architectural and training improvement. We conclude by identifying key future directions, such as integrating StyleGAN3, improving lip synchronization, incorporating 3D facial modeling, and introducing temporal consistency for video-based applications.

[191] LightSwitch: Multi-view Relighting with Material-guided Diffusion

Yehonathan Litman, Fernando De la Torre, Shubham Tulsiani

Main category: cs.CV

TL;DR: Lightswitch is a novel material-relighting diffusion framework that improves 3D relighting by leveraging multi-view and intrinsic properties, outperforming prior methods in speed and quality.

DetailsMotivation: Existing 2D relighting priors lack utilization of intrinsic properties and multi-view data, leading to subpar results.

Method: Lightswitch finetunes a material-relighting diffusion framework, incorporating multi-view and material cues with scalable denoising.

Result: It exceeds state-of-the-art 2D relighting priors and matches or outperforms diffusion inverse rendering methods in relighting synthetic and real objects quickly.

Conclusion: Lightswitch efficiently and consistently relights multi-view data, demonstrating superior performance in quality and speed.

Abstract: Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image neither exploit intrinsic properties of the subject that can be inferred nor consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes.

[192] Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Prasad Gabbur

Main category: cs.CV

TL;DR: Using a GMM kernel in DDIM improves sample quality, especially with fewer sampling steps, outperforming Gaussian kernels in metrics like FID and IS.

DetailsMotivation: To enhance the quality of generated samples in diffusion models by replacing the Gaussian kernel with a GMM kernel, particularly for accelerated sampling.

Method: Matching first and second order moments of DDPM forward marginals by constraining GMM parameters, and applying this to DDIM.

Result: Significant improvements in FID and IS metrics, e.g., FID 6.94 vs. 10.15 on ImageNet 256x256 with 10 steps.

Conclusion: GMM kernels outperform Gaussian kernels in diffusion models, especially with limited sampling steps, and show promise in rectified flow models.

Abstract: We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on the COYO700M dataset. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models.
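
Concretely, the constraint reduces to the standard mixture-moment identities; a sketch in assumed notation, where a K-component GMM with weights w_k, means mu_k, and covariances Sigma_k must reproduce the target mean mu_t and covariance Sigma_t of the DDPM forward marginal at the corresponding step:

    \sum_{k=1}^{K} w_k = 1, \qquad
    \sum_{k=1}^{K} w_k \mu_k = \mu_t, \qquad
    \sum_{k=1}^{K} w_k \left( \Sigma_k + \mu_k \mu_k^{\top} \right)
        - \mu_t \mu_t^{\top} = \Sigma_t

Any parameter choice satisfying these identities leaves the first two central moments unchanged relative to the Gaussian kernel, which is the degree of freedom the method exploits.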

[193] CPT-Interp: Continuous sPatial and Temporal Motion Modeling for 4D Medical Image Interpolation

Xia Li, Runzhao Yang, Xiangtai Li, Antony Lomax, Ye Zhang, Joachim Buhmann

Main category: cs.CV

TL;DR: A novel method using implicit neural representation for continuous frame interpolation in 4D medical imaging, inspired by fluid mechanics, improves accuracy and speed without requiring extensive datasets.

DetailsMotivation: Overcome the trade-off between temporal resolution and image quality in 4D medical imaging by addressing the limitations of previous frame interpolation methods.

Method: Proposes a fluid mechanics-inspired approach using implicit neural representation to model patient anatomic motion continuously, ensuring spatial and temporal continuity.

Result: Demonstrates superior accuracy and speed in experiments across multiple datasets, while avoiding the need for large datasets or extensive training.

Conclusion: The method effectively bridges Eulerian and Lagrangian specifications, offering a training-free, case-specific optimization solution for continuous frame interpolation.

Abstract: Motion information from 4D medical imaging offers critical insights into dynamic changes in patient anatomy for clinical assessments and radiotherapy planning and, thereby, enhances the capabilities of 3D image analysis. However, inherent physical and technical constraints of imaging hardware often necessitate a compromise between temporal resolution and image quality. Frame interpolation emerges as a pivotal solution to this challenge. Previous methods often suffer from discretization errors when they estimate the intermediate motion and execute the forward warping. In this study, we draw inspiration from fluid mechanics to propose a novel approach for continuously modeling patient anatomic motion using implicit neural representation. It ensures both spatial and temporal continuity, effectively bridging the Eulerian and Lagrangian specifications to naturally facilitate continuous frame interpolation. Our experiments across multiple datasets underscore the method’s superior accuracy and speed. Furthermore, as a case-specific optimization (training-free) approach, it circumvents the need for extensive datasets and addresses model generalization issues.
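
As a sketch of the core idea, consider an implicit neural representation of a time-varying velocity field (the Eulerian view) that is integrated over time to obtain continuous point trajectories (the Lagrangian view); the architecture sizes and the forward-Euler integrator below are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VelocityINR(nn.Module):
        """Implicit representation of a velocity field v(x, t)."""
        def __init__(self, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(4, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, 3),
            )

        def forward(self, x, t):
            # x: (N, 3) spatial coordinates, t: scalar in [0, 1]
            t_col = torch.full_like(x[:, :1], t)
            return self.net(torch.cat([x, t_col], dim=1))

    def trajectory(v_field, x0, t1, steps=32):
        # Forward-Euler integration of dx/dt = v(x, t) from t=0 to t=t1,
        # giving a spatially and temporally continuous deformation.
        x, dt = x0, t1 / steps
        for i in range(steps):
            x = x + dt * v_field(x, i * dt)
        return x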

[194] A Calibration Tool for Refractive Underwater Vision

Felix Seegräber, Mengkun She, Felix Woelk, Kevin Köser

Main category: cs.CV

TL;DR: An open-source toolbox for underwater refractive camera calibration is introduced, addressing the lack of tools for calibrating cameras behind flat or dome ports in underwater environments.

DetailsMotivation: Underwater cameras face refraction issues due to interfaces like water, glass, and air, complicating calibration. Existing models lack practical calibration tools.

Method: The toolbox provides end-to-end calibration for underwater systems, including camera, stereo, and housing calibration, validated with rendered datasets and real-world experiments.

Result: The implementation successfully calibrates underwater vision systems, handling refraction effects for flat or dome ports.

Conclusion: This work fills a gap in underwater vision by offering a practical, open-source solution for refractive camera calibration.

Abstract: Many underwater applications rely on vision sensors and require proper camera calibration, i.e. knowing the incoming light ray for each pixel in the image. While for the ideal pinhole camera model all viewing rays intersect in a single 3D point, underwater cameras suffer from (possibly multiple) refractions of light rays at the interfaces of water, glass, and air. These changes of direction depend on the position and orientation of the camera inside the water-proof housing, as well as on the shape and properties of the optical window, the port, itself. In recent years, explicit models for underwater vision behind common ports such as flat or dome ports have been proposed, but the underwater community still lacks a calibration tool that can determine port parameters through refractive calibration. With this work we provide the first open-source implementation of an underwater refractive camera calibration toolbox. It allows end-to-end calibration of underwater vision systems, including camera, stereo, and housing calibration for systems with dome or flat ports. The implementation is verified using rendered datasets and real-world experiments.
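
The physics such a toolbox must model is refraction at each interface; a minimal sketch of Snell's law in vector form for a flat port, with typical refractive indices assumed:

    import numpy as np

    def refract(d, n, eta1, eta2):
        # d: unit ray direction, n: unit interface normal facing the ray.
        cos_i = -np.dot(n, d)
        r = eta1 / eta2
        k = 1.0 - r * r * (1.0 - cos_i * cos_i)
        if k < 0.0:
            return None  # total internal reflection
        return r * d + (r * cos_i - np.sqrt(k)) * n

    ray = np.array([0.0, 0.3, 1.0]); ray /= np.linalg.norm(ray)
    normal = np.array([0.0, 0.0, -1.0])
    in_glass = refract(ray, normal, 1.0, 1.49)        # air -> glass
    in_water = refract(in_glass, normal, 1.49, 1.33)  # glass -> water

Chaining two such refractions (air to glass, glass to water) bends each viewing ray twice at a flat port, which is why the rays of an underwater camera no longer meet in a single center of projection.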

[195] INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs’ Performance in Insurance

Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo

Main category: cs.CV

TL;DR: The paper introduces INS-MMBench, a hierarchical benchmark for evaluating Large Vision-Language Models (LVLMs) in the insurance domain, addressing a gap in systematic assessment.

DetailsMotivation: The potential of LVLMs in the insurance domain, with its diverse scenarios and multimodal data, is underexplored, lacking benchmarks or reviews.

Method: The study reviews multimodal tasks for 4 insurance types, creates INS-MMBench with 22 fundamental, 12 meta-, and 5 scenario tasks, and evaluates 11 LVLMs.

Result: INS-MMBench effectively assesses LVLMs, revealing strengths and limitations in insurance tasks, with GPT-4o and LLaVA among tested models.

Conclusion: INS-MMBench aims to accelerate LVLM integration into insurance and foster interdisciplinary research, with open dataset and code.

Abstract: Large Vision-Language Models (LVLMs) and Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance in various general multimodal applications and have shown increasing promise in specialized domains. However, their potential in the insurance domain, characterized by diverse application scenarios and rich multimodal data, remains largely underexplored. To date, there is no systematic review of multimodal tasks, nor a benchmark specifically designed to assess the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance industry. This study systematically reviews and categorizes multimodal tasks for 4 representative types of insurance: auto, property, health, and agricultural. We introduce INS-MMBench, the first hierarchical benchmark tailored for the insurance domain. INS-MMBench encompasses 22 fundamental tasks, 12 meta-tasks and 5 scenario tasks, enabling a comprehensive and progressive assessment from basic capabilities to real-world use cases. We benchmark 11 leading LVLMs, including closed-source models such as GPT-4o and open-source models like LLaVA. Our evaluation validates the effectiveness of INS-MMBench and offers detailed insights into the strengths and limitations of current LVLMs on a variety of insurance-related multimodal tasks. We hope that INS-MMBench will accelerate the integration of LVLMs into the insurance industry and foster interdisciplinary research. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.

[196] Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection

Hyewon Park, Hyejin Park, Jueun Ko, Dongbo Min

Main category: cs.CV

TL;DR: Hybrid-TTA combines Full-Tuning and Efficient-Tuning dynamically using DDSD and MIMA, achieving a 1.6%p mIoU improvement on Cityscapes-to-ACDC.

DetailsMotivation: Addressing the limitations of existing CTTA methods in handling domain shifts effectively.

Method: Proposes Hybrid-TTA with Dynamic Domain Shift Detection (DDSD) and Masked Image Modeling based Adaptation (MIMA) for dynamic tuning selection.

Result: Achieves 1.6%p mIoU improvement on Cityscapes-to-ACDC, outperforming prior methods.

Conclusion: Hybrid-TTA offers a robust solution for real-world continual adaptation challenges.

Abstract: Continual Test Time Adaptation (CTTA) has emerged as a critical approach for bridging the domain gap between controlled training environments and real-world scenarios, enhancing model adaptability and robustness. Existing CTTA methods, typically categorized into Full-Tuning (FT) and Efficient-Tuning (ET), struggle to address domain shifts effectively. To overcome these challenges, we propose Hybrid-TTA, a holistic approach that dynamically selects an instance-wise tuning method for optimal adaptation. Our approach introduces the Dynamic Domain Shift Detection (DDSD) strategy, which identifies domain shifts by leveraging temporal correlations in input sequences and dynamically switches between FT and ET to adapt to varying domain shifts effectively. Additionally, the Masked Image Modeling based Adaptation (MIMA) framework is integrated to ensure domain-agnostic robustness with minimal computational overhead. Hybrid-TTA achieves a notable 1.6%p improvement in mIoU on the Cityscapes-to-ACDC benchmark dataset, surpassing previous state-of-the-art methods and offering a robust solution for real-world continual adaptation challenges.
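
A minimal sketch of the dynamic selection idea: flag a domain shift when incoming feature statistics drift from a running history, and choose FT on shifts and ET otherwise. The drift statistic and threshold are our illustrative assumptions about DDSD, not the paper's exact rule.

    import torch

    class ShiftDetector:
        def __init__(self, momentum=0.9, threshold=0.2):
            self.mean, self.m, self.tau = None, momentum, threshold

        def __call__(self, feats):
            # feats: (B, D) features of the current test batch.
            cur = feats.mean(dim=0)
            if self.mean is None:
                self.mean = cur
                return "ET"
            drift = torch.norm(cur - self.mean) / (torch.norm(self.mean) + 1e-8)
            self.mean = self.m * self.mean + (1 - self.m) * cur  # running stats
            return "FT" if drift > self.tau else "ET"  # full tuning on shifts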

[197] Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence

Alessandro Riva, Alessandro Raganato, Simone Melzi

Main category: cs.CV

TL;DR: The paper explores fixing attention weights in Transformer architectures for point cloud matching, showing faster training and improved stability.

DetailsMotivation: Current data-driven methods for point cloud matching are resource-intensive, prompting investigation into more efficient approaches.

Method: Integrates fixed Gaussian-like attention weights in Transformer attention heads, testing variants with fixed and learnable variances. Also evaluates noise robustness.

Result: Fixed attention weights speed up training and enhance optimization stability. Ablation studies reveal impactful layers and network reliance.

Conclusion: Fixing attention weights improves efficiency and stability in point cloud matching, with potential for noise robustness.

Abstract: Current data-driven methodologies for point cloud matching demand extensive training time and computational resources, presenting significant challenges for model deployment and application. In the point cloud matching task, recent advancements with an encoder-only Transformer architecture have revealed the emergence of semantically meaningful patterns in the attention heads, particularly resembling Gaussian functions centered on each point of the input shape. In this work, we further investigate this phenomenon by integrating these patterns as fixed attention weights within the attention heads of the Transformer architecture. We evaluate two variants: one utilizing predetermined variance values for the Gaussians, and another where the variance values are treated as learnable parameters. Additionally, we analyze performance on noisy data and explore a possible way to improve robustness to noise. Our findings demonstrate that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization. Furthermore, we conducted an ablation study to identify the specific layers where the infused information is most impactful and to understand the reliance of the network on this information.
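
A minimal sketch of the fixed-weight variant: attention logits are replaced by a Gaussian of the squared pairwise point distances, so each point aggregates features from its spatial neighborhood. In the learnable variant sigma would be an nn.Parameter; the value here is illustrative.

    import torch

    def gaussian_attention(points, values, sigma=0.1):
        # points: (N, 3) coordinates, values: (N, C) per-point features.
        d2 = torch.cdist(points, points).pow(2)   # (N, N) squared distances
        logits = -d2 / (2.0 * sigma ** 2)         # Gaussian log-weights
        attn = torch.softmax(logits, dim=-1)      # each row sums to 1
        return attn @ values                      # spatially local aggregation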

[198] ShadowMamba: State-Space Model with Boundary-Region Selective Scan for Shadow Removal

Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah

Main category: cs.CV

TL;DR: The paper introduces ShadowMamba, a Mamba-based model for image shadow removal, addressing efficiency and global modeling limitations of Transformer-based methods with a novel boundary-region selective scanning mechanism and shadow mask denoising.

DetailsMotivation: Shadows degrade image quality and hinder downstream vision tasks. Existing Transformer-based methods are inefficient due to quadratic complexity, while local attention limits global modeling. Mamba's linear complexity offers a solution but lacks suitable scanning strategies for shadow removal.

Method: Proposes a boundary-region selective scanning mechanism to enhance semantic continuity and a shadow mask denoising method. Introduces ShadowMamba, the first Mamba-based model for shadow removal.

Result: Outperforms existing methods on AISTD, ISTD, and SRD datasets, with advantages in parameter efficiency and computational complexity.

Conclusion: ShadowMamba effectively addresses the limitations of prior methods, offering superior performance and efficiency for shadow removal.

Abstract: Image shadow removal is a typical low-level vision task. Shadows cause local brightness shifts, which reduce the performance of downstream vision tasks. Currently, Transformer-based shadow removal methods suffer from quadratic computational complexity due to the self-attention mechanism. To improve efficiency, many approaches use local attention, but this limits the ability to model global information and weakens the perception of brightness changes between regions. Recently, Mamba has shown strong performance in vision tasks by enabling global modeling with linear complexity. However, existing scanning strategies are not suitable for shadow removal, as they ignore the semantic continuity of shadow boundaries and internal regions. To address this, this paper proposes a boundary-region selective scanning mechanism that captures local details while enhancing semantic continuity between them, effectively improving shadow removal performance. In addition, a shadow mask denoising method is introduced to support the scanning mechanism and improve data quality. Based on these techniques, this paper presents a model called ShadowMamba, the first Mamba-based model designed for shadow removal. Experimental results show that the proposed method outperforms existing mainstream approaches on the AISTD, ISTD, and SRD datasets, and also offers clear advantages in parameter efficiency and computational complexity. Code is available at: https://github.com/ZHUXIUJINChris/ShadowMamba

[199] MBA-SLAM: Motion Blur Aware Gaussian Splatting SLAM

Peng Wang, Lingzhe Zhao, Yin Zhang, Shiyu Zhao, Peidong Liu

Main category: cs.CV

TL;DR: MBA-SLAM improves SLAM performance for motion-blurred inputs by integrating a motion blur-aware tracker with NeRF or 3DGS, outperforming existing methods.

DetailsMotivation: Existing SLAM methods struggle with motion-blurred frames, common in real-world scenarios, reducing accuracy and reconstruction quality.

Method: Proposes a dense visual deblur SLAM pipeline (MBA-SLAM) combining a motion blur-aware tracker with NeRF or 3DGS, modeling motion blur’s physical process.

Result: MBA-SLAM outperforms state-of-the-art methods in camera localization and map reconstruction, validated on synthetic and real datasets.

Conclusion: MBA-SLAM is versatile and robust, effectively handling motion-blurred inputs while enhancing SLAM performance.

Abstract: Emerging 3D scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have demonstrated their effectiveness in Simultaneous Localization and Mapping (SLAM) for photo-realistic rendering, particularly when using high-quality video sequences as input. However, existing methods struggle with motion-blurred frames, which are common in real-world scenarios like low-light or long-exposure conditions. This often results in a significant reduction in both camera localization accuracy and map reconstruction quality. To address this challenge, we propose a dense visual deblur SLAM pipeline (i.e. MBA-SLAM) to handle severe motion-blurred inputs and enhance image deblurring. Our approach integrates an efficient motion blur-aware tracker with either a neural radiance field based or a Gaussian Splatting based mapper. By accurately modeling the physical image formation process of motion-blurred images, our method simultaneously learns the 3D scene representation and estimates the cameras’ local trajectory during exposure time, enabling proactive compensation for motion blur caused by camera movement. In our experiments, we demonstrate that MBA-SLAM surpasses previous state-of-the-art methods in both camera localization and map reconstruction, showing superior performance across a range of synthetic and real datasets featuring both sharp and motion-blurred images. These results highlight the versatility and robustness of our approach. Code is available at https://github.com/WU-CVGL/MBA-SLAM.
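
The physical formation model at the heart of such trackers can be sketched as averaging sharp renderings at poses sampled along the intra-exposure trajectory; `render` and the (naive, linear) pose interpolation below are placeholders rather than the paper's implementation, which would interpolate on SE(3).

    import torch

    def synthesize_blurred(render, pose_start, pose_end, n_samples=8):
        # A blurred frame is approximated as the mean of sharp renderings at
        # poses sampled along the camera path during the exposure interval.
        frames = []
        for i in range(n_samples):
            alpha = i / (n_samples - 1)
            pose = (1 - alpha) * pose_start + alpha * pose_end  # naive lerp
            frames.append(render(pose))
        return torch.stack(frames).mean(dim=0)

Comparing this synthesized blur against the observed frame lets the tracker optimize the start and end poses of the exposure rather than a single pose per frame.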

[200] TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer

Main category: cs.CV

TL;DR: The paper introduces Virtual Try-Off (VTOFF), a task for generating standardized garment images from single photos, and proposes TryOffDiff, a method using Stable Diffusion with SigLIP-based conditioning for high-fidelity reconstructions. It outperforms baselines and highlights VTOFF’s potential for e-commerce and generative model evaluation.

DetailsMotivation: The motivation is to address the limitations of Virtual Try-On (VTON) by focusing on extracting canonical garment images, which require precise reconstruction of shape, texture, and patterns for better generative model evaluation.

Method: The method, TryOffDiff, adapts Stable Diffusion with SigLIP-based visual conditioning to achieve high-fidelity garment reconstructions.

Result: Experiments on VITON-HD and Dress Code datasets show TryOffDiff outperforms pose transfer and VTON baselines. Traditional metrics like SSIM are found inadequate, leading to the use of DISTS for reliable assessment.

Conclusion: VTOFF has potential to enhance e-commerce product imagery, improve generative model evaluation, and guide future high-fidelity reconstruction research. Demo, code, and models are available.

Abstract: This paper introduces Virtual Try-Off (VTOFF), a novel task generating standardized garment images from single photos of clothed individuals. Unlike Virtual Try-On (VTON), which digitally dresses models, VTOFF extracts canonical garment images, demanding precise reconstruction of shape, texture, and complex patterns, enabling robust evaluation of generative model fidelity. We propose TryOffDiff, adapting Stable Diffusion with SigLIP-based visual conditioning to deliver high-fidelity reconstructions. Experiments on VITON-HD and Dress Code datasets show that TryOffDiff outperforms adapted pose transfer and VTON baselines. We observe that traditional metrics such as SSIM inadequately reflect reconstruction quality, prompting our use of DISTS for reliable assessment. Our findings highlight VTOFF’s potential to improve e-commerce product imagery, advance generative model evaluation, and guide future research on high-fidelity reconstruction. Demo, code, and models are available at: https://rizavelioglu.github.io/tryoffdiff

[201] WildSAT: Learning Satellite Image Representations from Wildlife Observations

Rangel Daroya, Elijah Cole, Oisin Mac Aodha, Grant Van Horn, Subhransu Maji

Main category: cs.CV

TL;DR: WildSAT uses species distribution data and satellite images for contrastive learning, improving remote sensing tasks and enabling zero-shot retrieval.

DetailsMotivation: To explore the untapped potential of species distribution data for enhancing representation learning in remote sensing.

Method: WildSAT employs contrastive learning, combining satellite images, species occurrence maps, and textual habitat descriptions.

Result: Outperforms ImageNet-pretrained models and satellite-specific baselines, enabling zero-shot retrieval and surpassing cross-modal learning methods.

Conclusion: WildSAT demonstrates broad applicability for remote sensing and biodiversity monitoring, with key design choices significantly impacting performance.

Abstract: Species distributions encode valuable ecological and environmental information, yet their potential for guiding representation learning in remote sensing remains underexplored. We introduce WildSAT, which pairs satellite images with millions of geo-tagged wildlife observations readily-available on citizen science platforms. WildSAT employs a contrastive learning approach that jointly leverages satellite images, species occurrence maps, and textual habitat descriptions to train or fine-tune models. This approach significantly improves performance on diverse satellite image recognition tasks, outperforming both ImageNet-pretrained models and satellite-specific baselines. Additionally, by aligning visual and textual information, WildSAT enables zero-shot retrieval, allowing users to search geographic locations based on textual descriptions. WildSAT surpasses recent cross-modal learning methods, including approaches that align satellite images with ground imagery or wildlife photos, demonstrating the advantages of our approach. Finally, we analyze the impact of key design choices and highlight the broad applicability of WildSAT to remote sensing and biodiversity monitoring.
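
A minimal sketch of a CLIP-style symmetric InfoNCE objective over paired satellite-image and species/habitat embeddings; the abstract does not specify the exact loss form, so this is an assumption about the contrastive setup.

    import torch
    import torch.nn.functional as F

    def infonce(img_emb, pair_emb, temperature=0.07):
        # img_emb: (B, D) satellite-image embeddings; pair_emb: (B, D)
        # embeddings of the co-located species/habitat signal.
        img = F.normalize(img_emb, dim=-1)
        pair = F.normalize(pair_emb, dim=-1)
        logits = img @ pair.t() / temperature        # (B, B) similarities
        targets = torch.arange(img.size(0), device=img.device)
        # Matched pairs sit on the diagonal; pull them together from both sides.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))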

[202] SAR Strikes Back: A New Hope for RSVQA

Lucrezia Tosato, Flora Weissgerber, Laurent Wendling, Sylvain Lobry

Main category: cs.CV

TL;DR: The paper explores integrating SAR data into RSVQA, comparing it with optical imagery and proposing fusion strategies. A two-stage model outperforms end-to-end, and decision-level fusion yields the best results.

DetailsMotivation: To leverage SAR's all-weather capabilities and unique electromagnetic features in RSVQA, addressing the lack of comparison and fusion studies between SAR and optical imagery.

Method: Two pipelines: an end-to-end model and a two-stage framework (SAR-to-text translation followed by language processing). Fusion strategies for SAR and optical data are also evaluated.

Result: The two-stage model improves accuracy by ~10%. Decision-level fusion achieves F1-micro 75.00%, F1-average 81.21%, and accuracy 75.49%. SAR excels in land cover questions.

Conclusion: SAR is a valuable complementary modality to optical imagery in RSVQA, with the two-stage approach and decision-level fusion being most effective.

Abstract: Remote Sensing Visual Question Answering (RSVQA) is a task that extracts information from satellite images to answer questions in natural language, aiding image interpretation. While several methods exist for optical images with varying spectral bands and resolutions, only recently have high-resolution Synthetic Aperture Radar (SAR) images been explored. SAR’s ability to operate in all weather conditions and capture electromagnetic features makes it a promising modality, yet no study has compared SAR and optical imagery in RSVQA or proposed effective fusion strategies. This work investigates how to integrate SAR data into RSVQA and how to best combine it with optical images. We present a dataset that enables SAR-based RSVQA and explore two pipelines for the task. The first is an end-to-end model, while the second is a two-stage framework: SAR information is first extracted and translated into text, which is then processed by a language model to produce the final answer. Our results show that the two-stage model performs better, improving accuracy by nearly 10% over the end-to-end approach. We also evaluate fusion strategies for combining SAR and optical data. A decision-level fusion yields the best results, with an F1-micro score of 75.00%, F1-average of 81.21%, and overall accuracy of 75.49% on the proposed dataset. SAR proves especially beneficial for questions related to specific land cover types, such as water areas, demonstrating its value as a complementary modality to optical imagery.

[203] Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

Jiho Choi, Seonho Lee, Minhyun Lee, Seungho Lee, Hyunjung Shim

Main category: cs.CV

TL;DR: PartCATSeg improves Open-Vocabulary Part Segmentation (OVPS) by addressing part-level alignment and structural understanding challenges with object-aware cost aggregation, compositional loss, and DINO guidance.

DetailsMotivation: The challenges in OVPS include aligning part-level image-text correspondence and lacking structural understanding for segmenting object parts.

Method: Proposes PartCATSeg, integrating object-aware part-level cost aggregation, compositional loss, and DINO structural guidance. Uses disentangled cost aggregation and compositional loss for better part-object relationships.

Result: Outperforms state-of-the-art on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets, demonstrating robust generalization to unseen part categories.

Conclusion: PartCATSeg sets a new baseline for OVPS by effectively addressing alignment and structural challenges, enhancing part segmentation precision.

Abstract: Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.

[204] MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation

Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

Main category: cs.CV

TL;DR: MPG-SAM 2 enhances RVOS by integrating multimodal features and global context for accurate segmentation, outperforming benchmarks.

DetailsMotivation: Addressing the challenges of translating text to prompts and lacking global context in offline RVOS using SAM 2.

Method: Uses a unified multimodal encoder, mask prior generator, and hierarchical global-historical aggregator to improve prompt generation and global awareness.

Result: Superior performance on RVOS benchmarks, demonstrating effectiveness of the proposed modules.

Conclusion: MPG-SAM 2 successfully integrates multimodal and temporal dynamics for improved RVOS, with code publicly available.

Abstract: Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules. The code is available at https://github.com/rongfu-dsb/MPG-SAM2.

[205] MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies

Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu

Main category: cs.CV

TL;DR: MetaOcc is a multi-modal framework for 3D occupancy prediction using 4D radar and images, addressing challenges in feature extraction and fusion. It introduces novel modules for radar data and fusion, achieving state-of-the-art results with reduced annotation costs.

DetailsMotivation: Traditional vision-only systems struggle in adverse weather, and fusion of radar and cameras is challenging due to heterogeneous data. MetaOcc aims to improve 3D occupancy prediction for autonomous driving.

Method: Proposes Radar Height Self-Attention for radar data and Hierarchical Multi-scale Multi-modal Fusion for adaptive feature fusion. Uses pseudo-label generation for semi-supervised learning.

Result: Achieves +0.47 SC IoU and +4.02 mIoU on OmniHD-Scenes, and +1.16 SC IoU and +1.24 mIoU on SurroundOcc-nuScenes, with 90% performance using 50% labels.

Conclusion: MetaOcc is scalable and robust, offering a practical solution for real-world autonomous systems with reduced annotation costs.

Abstract: Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.

[206] Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering

Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, Naveed Akhtar

Main category: cs.CV

TL;DR: A comprehensive survey on 3D Scene Question Answering (3D SQA), covering datasets, methodologies, and evaluation metrics, while identifying challenges and future directions.

DetailsMotivation: The rapid progress in 3D SQA lacks unified analysis, prompting a systematic review to guide future research.

Method: Organizes existing work into datasets, methodologies, and evaluation metrics, identifying shared patterns and limitations.

Result: Identifies core challenges and trends like instruction tuning and zero-shot learning, proposing future research directions.

Conclusion: Aims to serve as a foundation for advancing generalizable and intelligent 3D SQA systems.

Abstract: 3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. In this survey, we provide the first comprehensive and systematic review of 3D SQA. We organize existing work from three perspectives: datasets, methodologies, and evaluation metrics. Beyond basic categorization, we identify shared architectural patterns across methods. Our survey further synthesizes core limitations and discusses how current trends, such as instruction tuning, multimodal alignment, and zero-shot learning, can shape future developments. Finally, we propose a range of promising research directions covering dataset construction, task generalization, interaction modeling, and unified evaluation protocols. This work aims to serve as a foundation for future research and foster progress toward more generalizable and intelligent 3D SQA systems.

[207] Conditional Diffusion Models are Medical Image Classifiers that Provide Explainability and Uncertainty for Free

Gian Mario Favero, Parham Saremi, Emily Kaczmarek, Brennan Nichyporuk, Tal Arbel

Main category: cs.CV

TL;DR: The paper explores class-conditional diffusion models for 2D medical image classification, introducing a novel majority voting scheme and demonstrating competitive performance against discriminative classifiers.

DetailsMotivation: To address the limitations of discriminative classifiers in medical imaging, such as the need for careful design and supervision, by leveraging the robustness and explainability of diffusion models.

Method: Develops a majority voting scheme for diffusion classifiers and evaluates performance on CheXpert and ISIC Melanoma datasets, comparing against state-of-the-art discriminative models.

Result: Diffusion models achieve competitive performance without explicit supervision, offering explainability and uncertainty quantification for clinical reliability.

Conclusion: Diffusion models present a promising alternative to discriminative classifiers in medical imaging, combining performance, explainability, and trustworthiness.

Abstract: Discriminative classifiers have become a foundational tool in deep learning for medical imaging, excelling at learning separable features of complex data distributions. However, these models often need careful design, augmentation, and training techniques to ensure safe and reliable deployment. Recently, diffusion models have become synonymous with generative modeling in 2D. These models showcase robustness across a range of tasks including natural image classification, where classification is performed by comparing reconstruction errors across images generated for each possible conditioning input. This work presents the first exploration of the potential of class-conditional diffusion models for 2D medical image classification. First, we develop a novel majority voting scheme shown to improve the performance of medical diffusion classifiers. Next, extensive experiments on the CheXpert and ISIC Melanoma skin cancer datasets demonstrate that foundation and trained-from-scratch diffusion models achieve competitive performance against SOTA discriminative classifiers without the need for explicit supervision. In addition, we show that diffusion classifiers are intrinsically explainable, and can be used to quantify the uncertainty of their predictions, increasing their trustworthiness and reliability in safety-critical, clinical contexts. Further information is available on our project page: https://faverogian.github.io/med-diffusion-classifier.github.io/.
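
A minimal sketch of diffusion-based classification with majority voting: for each of several noise draws, score every candidate class by its conditional denoising error and give the best class one vote. The `add_noise` and `eps_model` interfaces are assumed, and the voting rule is our reading of the general recipe rather than the paper's exact scheme.

    import torch

    @torch.no_grad()
    def diffusion_classify(eps_model, x0, num_classes, add_noise, n_trials=16):
        votes = torch.zeros(num_classes)
        for _ in range(n_trials):
            # One shared noising (x_t, timestep, true noise) per trial so all
            # classes are compared on identical corrupted inputs.
            x_t, t, eps = add_noise(x0)
            errs = []
            for y in range(num_classes):
                pred = eps_model(x_t, t, torch.tensor([y]))
                errs.append((pred - eps).pow(2).mean())
            votes[torch.stack(errs).argmin()] += 1   # best class gets a vote
        return votes.argmax().item()                 # majority-vote decision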

[208] Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge

Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé

Main category: cs.CV

TL;DR: The paper introduces MapYourCity, a multi-modal dataset for estimating building construction years to aid sustainable urban planning. It evaluates top models from a 2024 challenge, showing feasibility even with missing data.

DetailsMotivation: Accurate building age data is crucial for sustainability, as older buildings often lack energy-efficient features, impacting urban planning and climate change mitigation.

Method: Uses a multi-modal dataset (VHR imagery, Sentinel-2 EO data, street-view images) and formulates building age estimation as a seven-class classification problem. Evaluates models on unseen cities and missing modalities.

Result: Building age estimation is feasible and effective, even without street-view data, using only satellite imagery.

Conclusion: MapYourCity is a valuable resource for scalable, real-world solutions in sustainable urban analytics.

Abstract: Estimating the construction year of buildings is critical for advancing sustainability, as older structures often lack energy-efficient features. Sustainable urban planning relies on accurate building age data to reduce energy consumption and mitigate climate change. In this work, we introduce MapYourCity, a novel multi-modal benchmark dataset comprising top-view Very High Resolution (VHR) imagery, multi-spectral Earth Observation (EO) data from the Copernicus Sentinel-2 constellation, and co-localized street-view images across various European cities. Each building is labeled with its construction epoch, and the task is formulated as a seven-class classification problem covering periods from 1900 to the present. To advance research in EO generalization and multi-modal learning, we organized a community-driven data challenge in 2024, hosted by ESA $\Phi$-lab, which ran for four months and attracted wide participation. This paper presents the Top-4 performing models from the challenge and their evaluation results. We assess model generalization on cities excluded from training to prevent data leakage, and evaluate performance under missing modality scenarios, particularly when street-view data is unavailable. Results demonstrate that building age estimation is both feasible and effective, even in previously unseen cities and when relying solely on top-view satellite imagery (i.e. with VHR and Sentinel-2 images). The new MapYourCity dataset thus provides a valuable resource for developing scalable, real-world solutions in sustainable urban analytics.

[209] Generative Video Bi-flow

Chen Liu, Tobias Ritschel

Main category: cs.CV

TL;DR: A novel generative video model uses neural ODE flow with a bilinear objective to learn temporal changes, combining direct frame mapping and error correction, achieving competitive quality with fewer ODE steps.

DetailsMotivation: To improve video generation by reducing computational cost and drifting errors compared to noise-to-frame methods.

Method: Uses neural ODE flow with a bilinear objective: maps past to future frames directly and learns to correct accumulated errors by adding noise during training.

Result: Demonstrates high-quality unconditional video generation with fewer ODE solver steps, outperforming conditional diffusion baselines in speed.

Conclusion: The proposed method efficiently generates videos with reduced computational cost and improved robustness against drifting errors.

Abstract: We propose a novel generative video model that robustly learns temporal change as a neural Ordinary Differential Equation (ODE) flow with a bilinear objective combining two aspects. The first is to map from past video frames into future frames directly; previous work has mapped noise to new frames, a more computationally expensive process. Unfortunately, starting from the previous frame instead of noise is more prone to drifting errors. Hence, second, we additionally learn to remove the accumulated errors as a joint objective, by adding noise during training. We demonstrate unconditional video generation in a streaming manner for various video datasets, all at competitive quality compared to a conditional diffusion baseline but with higher speed, i.e., fewer ODE solver steps.
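
A minimal sketch of a training step for the bilinear objective as described: the flow is supervised to carry a perturbed previous frame toward the next frame, so it simultaneously advances time and contracts accumulated error. The rectified-flow-style interpolation target and the noise scale are illustrative assumptions.

    import torch

    def biflow_loss(flow_net, prev_frame, next_frame, noise_std=0.1):
        # Perturb the starting frame so the model also learns to correct
        # the drift it will accumulate at inference time.
        noisy_prev = prev_frame + noise_std * torch.randn_like(prev_frame)
        s = torch.rand(prev_frame.size(0), 1, 1, 1)     # position on the path
        x_s = (1 - s) * noisy_prev + s * next_frame     # point along the path
        target_velocity = next_frame - noisy_prev       # constant-velocity target
        return (flow_net(x_s, s) - target_velocity).pow(2).mean()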

[210] M²IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering

Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang

Main category: cs.CV

TL;DR: M²IV improves multimodal in-context learning for LVLMs by replacing token-heavy demonstrations with learnable vectors, enhancing performance and efficiency.

DetailsMotivation: Address the inefficiency of token-heavy multimodal inputs and cross-modal reasoning in LVLMs' in-context learning.

Method: Proposes M²IV, a representation engineering approach using learnable Multimodal In-context Vectors injected into LVLMs’ residual streams, with a training strategy for semantic distillation and cross-modal learning.

Result: M²IV outperforms vanilla ICL and baselines, achieving a 3.74% average accuracy gain and reducing token overhead.

Conclusion: M²IV and VLibrary offer scalable, efficient, and customizable solutions for enhancing LVLMs’ in-context learning capabilities.

Abstract: Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M²IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M²IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M²IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M²IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M²IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
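
A minimal sketch of residual-stream injection: a learnable per-layer vector is added to each chosen transformer block's output through forward hooks. The additive form and the hook mechanism are our assumptions about how such vectors can be injected; M²IV's actual injection may differ.

    import torch

    def inject_vectors(model, layers, hidden_dim):
        vectors = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.zeros(hidden_dim)) for _ in layers])
        handles = []
        for layer, vec in zip(layers, vectors):
            def hook(module, inputs, output, v=vec):
                # Shift the residual stream by the learned vector.
                h = output[0] if isinstance(output, tuple) else output
                h = h + v
                return (h,) + output[1:] if isinstance(output, tuple) else h
            handles.append(layer.register_forward_hook(hook))
        # Train `vectors` with the LVLM frozen; call h.remove() on each
        # handle to restore the original model.
        return vectors, handles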

[211] COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation

Sanghyun Jo, Seo Jin Lee, Seungwoo Lee, Seohyung Hong, Hyungseok Seo, Kyungsu Kim

Main category: cs.CV

TL;DR: COIN is an annotation-free framework for unsupervised cell instance segmentation, outperforming existing methods by leveraging confidence scoring and self-distillation.

DetailsMotivation: Unsupervised CIS models struggle with capturing cell boundaries due to the absence of error-free instances, limiting their accuracy.

Method: COIN uses unsupervised semantic segmentation with optimal transport, instance-level confidence scoring, and recursive self-distillation to improve performance.

Result: COIN surpasses existing UCIS methods and even semi-/weakly-supervised approaches on datasets like MoNuSeg and TNBC.

Conclusion: COIN provides a robust, annotation-free solution for accurate cell instance segmentation, validated by superior performance across multiple datasets.

Abstract: Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at https://github.com/shjo-april/COIN.

[212] Can Test-Time Scaling Improve World Foundation Model?

Wenyan Cong, Hanqing Zhu, Peihao Wang, Bangya Liu, Dejia Xu, Kevin Wang, David Z. Pan, Yan Wang, Zhiwen Fan, Zhangyang Wang

Main category: cs.CV

TL;DR: SWIFT introduces a test-time scaling framework for world foundation models (WFMs), improving inference efficiency without retraining or enlarging models.

DetailsMotivation: WFMs require heavy computational resources and face data constraints, making test-time scaling a practical alternative to traditional methods.

Method: SWIFT combines an extensible WFM evaluation toolkit with process-level inference strategies like fast tokenization, Top-K pruning, and efficient beam search.

Result: Empirical results on COSMOS show test-time scaling is compute-optimal, with SWIFT proving scalable and effective for WFM inference.

Conclusion: Test-time scaling laws apply to WFMs, and SWIFT offers a viable solution for enhancing inference efficiency without additional training or model expansion.

Abstract: World foundation models (WFMs), which simulate the physical world by predicting future states from current observations and inputs, have become central to many applications in physical intelligence, including autonomous driving and robotics. However, these models require substantial computational resources for pretraining and are further constrained by available data during post-training. As such, scaling computation at test time emerges as both a critical and practical alternative to traditional model enlargement or re-training. In this work, we introduce SWIFT, a test-time scaling framework tailored for WFMs. SWIFT integrates our extensible WFM evaluation toolkit with process-level inference strategies, including fast tokenization, probability-based Top-K pruning, and efficient beam search. Empirical results on the COSMOS model demonstrate that test-time scaling gains exist even under a compute-optimal comparison. Our findings reveal that test-time scaling laws hold for WFMs and that SWIFT provides a scalable and effective pathway for improving WFM inference without retraining or increasing model size. Project page: https://scalingwfm.github.io/.
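
A minimal sketch of probability-based Top-K pruning inside beam search for an autoregressive world model; `step_logits(seq)` returning next-token logits is an assumed interface, not the COSMOS API.

    import torch

    @torch.no_grad()
    def beam_search(step_logits, seed, beam=4, topk=8, horizon=16):
        beams = [(seed, 0.0)]                        # (token sequence, log-prob)
        for _ in range(horizon):
            candidates = []
            for seq, score in beams:
                logp = torch.log_softmax(step_logits(seq), dim=-1)
                vals, idxs = logp.topk(topk)         # Top-K pruning per beam
                for v, i in zip(vals.tolist(), idxs.tolist()):
                    candidates.append((seq + [i], score + v))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam]                # keep the best `beam` paths
        return beams[0][0]

Widening the beam or the horizon is the test-time compute knob: more candidate futures are explored per step at proportionally higher inference cost.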

[213] Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Lung Nodule Malignancy Prediction

Luoting Zhuang, Seyed Mohammad Hossein Tabatabaei, Ramin Salehi-Rad, Linh M. Tran, Denise R. Aberle, Ashley E. Prosper, William Hsu

Main category: cs.CV

TL;DR: The paper proposes a method using CLIP to integrate semantic and imaging features for lung cancer prediction, achieving superior performance and explainability.

DetailsMotivation: Current models rely on manual annotation, lack interpretability, and are sensitive to imaging variations, limiting clinical use.

Method: Fine-tuned a pretrained CLIP model to align imaging and semantic text features for predicting lung cancer.

Result: Outperformed SOTA models with AUROC of 0.901 and AUPRC of 0.776, showing robustness in external datasets.

Conclusion: The approach provides explainable, generalizable predictions, aiding clinicians and avoiding shortcuts.

Abstract: Machine learning models have utilized semantic features, deep features, or both to assess lung nodule malignancy. However, their reliance on manual annotation during inference, limited interpretability, and sensitivity to imaging variations hinder their application in real-world clinical settings. Thus, this research aims to integrate semantic features derived from radiologists’ assessments of nodules, guiding the model to learn clinically relevant, robust, and explainable imaging features for predicting lung cancer. We obtained 938 low-dose CT scans from the National Lung Screening Trial (NLST) with 1,246 nodules and semantic features. Additionally, the Lung Image Database Consortium dataset contains 1,018 CT scans, with 2,625 lesions annotated for nodule characteristics. Three external datasets were obtained from UCLA Health, the LUNGx Challenge, and the Duke Lung Cancer Screening. We fine-tuned a pretrained Contrastive Language-Image Pretraining (CLIP) model with a parameter-efficient fine-tuning approach to align imaging and semantic text features and predict the one-year lung cancer diagnosis. Our model outperformed state-of-the-art (SOTA) models in the NLST test set with an AUROC of 0.901 and AUPRC of 0.776. It also showed robust results in external datasets. Using CLIP, we also obtained predictions on semantic features through zero-shot inference, such as nodule margin (AUROC: 0.812), nodule consistency (0.812), and pleural attachment (0.840). Our approach surpasses the SOTA models in predicting lung cancer across datasets collected from diverse clinical settings, providing explainable outputs, aiding clinicians in comprehending the underlying meaning of model predictions. This approach also prevents the model from learning shortcuts and generalizes across clinical settings. The code is available at https://github.com/luotingzhuang/CLIP_nodule.
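
A minimal sketch of the zero-shot semantic-feature inference: score a nodule image against text prompts describing each attribute value with a CLIP-style dual encoder. The prompt wording and the `encode_image`/`encode_text` interfaces are assumptions.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def zero_shot_attribute(encode_image, encode_text, image, prompts):
        img = F.normalize(encode_image(image), dim=-1)      # (1, D)
        txt = F.normalize(encode_text(prompts), dim=-1)     # (P, D), one per prompt
        return (img @ txt.t()).softmax(dim=-1)              # probability per value

    # Hypothetical prompts for one semantic feature (nodule margin):
    margin_prompts = ["a lung nodule with a well-defined margin",
                      "a lung nodule with a poorly defined margin"]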

[214] Two-stage deep learning framework for the restoration of incomplete-ring PET images

Yeqi Fang, Rong Zhou

Main category: cs.CV

TL;DR: A two-stage deep-learning framework improves image quality in incomplete-ring PET scanners without TOF information, achieving high PSNR and SSIM scores.

DetailsMotivation: Address performance degradation in incomplete-ring PET systems due to reduced data completeness and geometric inconsistencies.

Method: A two-stage pipeline: projection-domain Attention U-Net predicts missing sinogram sections, followed by OSEM reconstruction and U-Net-diffusion for artefact removal.

Result: Preserves anatomical structures and tracer distribution with PSNR of 30.92 dB and SSIM of 0.9708, and higher inference speed.

Conclusion: Provides an effective solution for incomplete-ring PET imaging with improved quality and efficiency.

Abstract: Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance degradation with these systems because of reduced data completeness and geometric inconsistencies. We present a two-stage deep-learning framework that, without incorporating any time-of-flight (TOF) information, restores high-quality images from data with about 50% missing coincidences, double the loss levels previously addressed by CNN-based methods. The pipeline operates in two stages: a projection-domain Attention U-Net first predicts the missing sections of the sinogram by leveraging spatial context from neighbouring slices, after which the completed data are reconstructed with the OSEM algorithm and passed to a U-Net-diffusion module that removes residual artefacts while reinstating high-frequency detail. Using 206 brain volumes from a public dataset, the results show that our model successfully preserves most anatomical structures and tracer distribution features, with a PSNR of 30.92 dB and SSIM of 0.9708. We also achieve higher inference speed, thus providing an effective solution for incomplete-ring PET imaging.
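
The two-stage pipeline can be summarized in a few lines; all three callables below are placeholders for the paper's networks and the classical OSEM reconstructor.

    def restore(sinogram, gap_mask, inpaint_unet, osem_reconstruct, refine_diffusion):
        # Stage 1: fill only the missing sinogram sections in projection space.
        completed = sinogram.clone()
        completed[gap_mask] = inpaint_unet(sinogram, gap_mask)[gap_mask]
        # Classical iterative reconstruction of the completed data.
        volume = osem_reconstruct(completed)
        # Stage 2: remove residual artefacts and restore high-frequency detail.
        return refine_diffusion(volume)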

[215] Event2Vec: Processing neuromorphic events directly by representations in vector space

Wei Fang, Priyadarshini Panda

Main category: cs.CV

TL;DR: The paper introduces event2vec, a novel representation for neuromorphic event camera data, inspired by word2vec, to address compatibility issues with traditional methods. It shows superior efficiency, accuracy, and speed, and aligns event data with NLP for broader integration.

DetailsMotivation: Event cameras output sparse, irregular data incompatible with mainstream vision methods. Existing solutions compromise temporal resolution or computational efficiency.

Method: Proposes event2vec, inspired by word2vec, to represent event data in a vectorized form, enabling compatibility with deep learning and NLP.

Result: Validated on ASL-DVS dataset, event2vec outperforms prior representations in efficiency, accuracy, and speed, and aligns events with NLP.

Conclusion: event2vec offers a promising solution for integrating event camera data into large language and multimodal models, with open-source resources provided.

Abstract: The neuromorphic event cameras have overwhelming advantages in temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, event cameras output asynchronous, sparse, and irregular events, which are not compatible with mainstream computer vision and deep learning methods. Various methods have been proposed to solve this issue, but at the cost of long preprocessing procedures, lost temporal resolution, or incompatibility with massively parallel computation. Inspired by the great success of word2vec, we summarize the similarities between words and events, then propose the first event-to-vector (event2vec) representation. We validate event2vec on classifying the ASL-DVS dataset, showing better parameter efficiency, accuracy, and speed than previous graph/image/voxel-based representations. Beyond task performance, the most attractive advantage of event2vec is that it aligns events with the domain of natural language processing, showing the promising prospect of integrating events into large language and multimodal models. Our codes, models, and training logs are available at https://github.com/fangwei123456/event2vec.
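
A minimal sketch of the word2vec analogy: treat each discrete event (x, y, polarity) as a token id and look up a learned embedding, turning a raw event stream into a sequence of vectors. The sensor size, embedding dimension, and tokenization rule are illustrative assumptions.

    import torch
    import torch.nn as nn

    W, H, D = 128, 128, 64
    embed = nn.Embedding(W * H * 2, D)    # one token per (x, y, polarity)

    def events_to_vectors(x, y, p):
        # x, y, p: integer tensors of shape (N,) for coordinates and polarity.
        token_ids = ((y * W + x) * 2 + p).long()
        return embed(token_ids)           # (N, D) sequence of event vectors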

[216] FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan, Yancheng Bai, Xu Duan, Mingxing Li, Dongyang Jin, Ryan Xu, Lei Sun, Xiangxiang Chu

Main category: cs.CV

TL;DR: FLUX-Text is a multilingual scene text editing method using DiT architecture, improving glyph generation and reducing training data needs by 97%.

DetailsMotivation: Addressing challenges in scene text editing, especially for non-Latin glyphs, while maintaining visual quality.

Method: Uses lightweight Visual and Text Embedding Modules, Regional Text Perceptual Loss, and a two-stage training strategy.

Result: Outperforms other methods in visual quality and text fidelity with 97% less training data.

Conclusion: FLUX-Text is efficient and effective for multilingual scene text editing.

Abstract: Scene text editing aims to modify or add texts on images while ensuring text fidelity and overall visual quality consistent with the background. Recent methods are primarily built on UNet-based diffusion models, which have improved scene text editing results, but still struggle with complex glyph structures, especially for non-Latin ones (e.g., Chinese, Korean, Japanese). To address these issues, we present FLUX-Text, a simple and advanced multilingual scene text editing DiT method. Specifically, our FLUX-Text enhances glyph understanding and generation through lightweight Visual and Text Embedding Modules, while preserving the original generative capability of FLUX. We further propose a Regional Text Perceptual Loss tailored for text regions, along with a matching two-stage training strategy to better balance text editing and overall image quality. Benefiting from the DiT-based architecture and lightweight feature injection modules, FLUX-Text can be trained with only 0.1M training examples, a 97% reduction compared to the 2.9M required by popular methods. Extensive experiments on multiple public datasets, including English and Chinese benchmarks, demonstrate that our method surpasses other methods in visual quality and text fidelity. All the code is available at https://github.com/AMAP-ML/FluxText.
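
One plausible form of a loss "tailored for text regions" is a masked term added to a global reconstruction loss, as sketched below; the exact feature space and weighting used by FLUX-Text may differ, so treat this as an assumption-laden illustration.

```python
# Illustrative region-weighted loss restricted to text areas.
import torch
import torch.nn.functional as F

def regional_text_loss(pred, target, text_mask, lam=1.0):
    """pred/target: (B, C, H, W) images; text_mask: (B, 1, H, W) in {0, 1}."""
    global_l1 = F.l1_loss(pred, target)
    # Emphasize fidelity inside text regions only.
    region_l1 = (text_mask * (pred - target).abs()).sum() / text_mask.sum().clamp(min=1)
    return global_l1 + lam * region_l1

loss = regional_text_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                          (torch.rand(1, 1, 64, 64) > 0.8).float())
print(loss)
```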

[217] Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection

Zhangchi Hu, Peixi Wu, Jie Chen, Huyue Zhu, Yijun Wang, Yansong Peng, Hebei Li, Xiaoyan Sun

Main category: cs.CV

TL;DR: Dome-DETR is a novel framework for efficient tiny object detection, addressing feature redundancy and query allocation issues with Density-Oriented Feature-Query Manipulation.

DetailsMotivation: Existing methods for tiny object detection suffer from inefficient feature leverage and high computational costs due to redundant processing and rigid query allocation.

Method: Proposes Dome-DETR with Density-Focal Extractor (DeFE) for clustered compact masks, Masked Window Attention Sparsification (MWAS) for focused computation, and Progressive Adaptive Query Initialization (PAQI) for adaptive query density.

Result: Achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2, +2.5 AP on VisDrone) with low computational complexity and compact model size.

Conclusion: Dome-DETR effectively improves tiny object detection efficiency and performance, offering a practical solution for applications like drone surveillance and remote sensing.

Abstract: Tiny object detection plays a vital role in drone surveillance, remote sensing, and autonomous systems, enabling the identification of small targets across vast landscapes. However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. To reduce feature redundancies, we introduce a lightweight Density-Focal Extractor (DeFE) to produce clustered compact foreground masks. Leveraging these masks, we incorporate Masked Window Attention Sparsification (MWAS) to focus computational resources on the most informative regions via sparse attention. In addition, we propose Progressive Adaptive Query Initialization (PAQI), which adaptively modulates query density across spatial areas for better query allocation. Extensive experiments demonstrate that Dome-DETR achieves state-of-the-art performance (+3.3 AP on AI-TOD-V2 and +2.5 AP on VisDrone) while maintaining low computational complexity and a compact model size. Code is available at https://github.com/RicePasteM/Dome-DETR.

[218] AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao

Main category: cs.CV

TL;DR: AVA-Bench introduces a benchmark to evaluate vision foundation models (VFMs) by disentangling 14 Atomic Visual Abilities (AVAs), addressing gaps in current VQA benchmarks.

DetailsMotivation: Current evaluation protocols for VFMs using VQA benchmarks have blind spots: misaligned instruction tuning data and inability to pinpoint specific visual shortcomings.

Method: AVA-Bench decouples 14 AVAs, ensuring matched training and test distributions for precise evaluation of VFMs.

Result: AVA-Bench reveals distinct ability fingerprints of VFMs and shows a smaller LLM (0.5B) can rank VFMs similarly to a larger one (7B) with 8x less GPU time.

Conclusion: AVA-Bench provides a transparent and efficient benchmark to guide future VFM development and selection.

Abstract: The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) – foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive “ability fingerprints,” turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

[219] DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo

Main category: cs.CV

TL;DR: DanceGRPO introduces a stable RL framework for visual generation, outperforming baselines by up to 181% across benchmarks.

DetailsMotivation: Aligning generative AI outputs with human preferences is challenging, and existing RL methods like DDPO and DPOK struggle with stability and scalability.

Method: DanceGRPO adapts Group Relative Policy Optimization (GRPO) for visual tasks, ensuring stable optimization across diverse prompts and generative models.

Result: It achieves stable performance across multiple tasks and models, excelling in benchmarks like HPS-v2.1 and CLIP Score.

Conclusion: DanceGRPO is a robust solution for RLHF in visual generation, advancing the synergy of RL and visual synthesis.

Abstract: Recent advances in generative AI have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. While Reinforcement Learning (RL) has emerged as a promising approach for fine-tuning generative models, existing methods like DDPO and DPOK face fundamental limitations - particularly their inability to maintain stable optimization when scaling to large and diverse prompt sets, severely restricting their practical utility. This paper presents DanceGRPO, a framework that addresses these limitations through an innovative adaptation of Group Relative Policy Optimization (GRPO) for visual generation tasks. Our key insight is that GRPO’s inherent stability mechanisms uniquely position it to overcome the optimization challenges that plague prior RL-based approaches on visual generation. DanceGRPO establishes several significant advances: First, it demonstrates consistent and stable policy optimization across multiple modern generative paradigms, including both diffusion models and rectified flows. Second, it maintains robust performance when scaling to complex, real-world scenarios encompassing three key tasks and four foundation models. Third, it shows remarkable versatility in optimizing for diverse human preferences as captured by five distinct reward models assessing image/video aesthetics, text-image alignment, video motion quality, and binary feedback. Our comprehensive experiments reveal that DanceGRPO outperforms baseline methods by up to 181% across multiple established benchmarks, including HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis.
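
The group-relative normalization that gives GRPO its name can be sketched in a few lines: rewards for a group of samples drawn from the same prompt are standardized within the group to form advantages. Reward models and the clipped policy update are omitted; this is a generic GRPO ingredient, not DanceGRPO's full recipe.

```python
# Group-relative advantage computation, the core GRPO-style ingredient.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8):
    """rewards: (G,) scores for G samples generated from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.7, 0.2, 0.9, 0.4])  # e.g., aesthetic scores per sample
adv = group_relative_advantages(rewards)
print(adv)  # positive for above-average samples, negative otherwise
```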

[220] TD3Net: A Temporal Densely Connected Multi-Dilated Convolutional Network for Lipreading

Byung Hoon Lee, Wooseok Shin, Sung Won Han

Main category: cs.CV

TL;DR: TD3Net, a backend architecture for lipreading, combines dense skip connections and multi-dilated convolutions to improve temporal modeling, achieving high accuracy with fewer parameters.

DetailsMotivation: Existing TCN-based lipreading methods suffer from blind spots in the receptive field, leading to information loss about continuous lip movements.

Method: Proposes TD3Net, using dense skip connections and multi-dilated temporal convolutions to cover a wide, dense receptive field without blind spots.

Result: TD3Net achieves state-of-the-art performance on LRW and LRW-1000 datasets with fewer parameters and lower computational costs.

Conclusion: TD3Net effectively models diverse temporal features while preserving continuity, offering advantages for lipreading systems.

Abstract: The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository: https://github.com/Leebh-kor/TD3Net-A-Temporal-Densely-Connected-Multi-dilated-Convolutional-Network-for-Lipreading
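
A minimal sketch of the backend idea follows: several dilation factors applied in parallel over the same temporal features, so the combined receptive field is dense rather than leaving blind spots. Channel sizes and the fusion scheme are assumptions, not the released TD3Net.

```python
# Hedged sketch of a multi-dilated temporal convolution block with a
# residual (skip-connected) path; padding=d keeps the sequence length.
import torch
import torch.nn as nn

class MultiDilatedBlock(nn.Module):
    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)

    def forward(self, x):  # x: (B, C, T)
        feats = [torch.relu(b(x)) for b in self.branches]
        return x + self.fuse(torch.cat(feats, dim=1))  # dense residual path

x = torch.rand(2, 64, 29)  # e.g., 29 frames of lip features
print(MultiDilatedBlock()(x).shape)  # torch.Size([2, 64, 29])
```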

[221] Crop Pest Classification Using Deep Learning Techniques: A Review

Muhammad Hassam Ejaz, Muhammad Bilal, Usman Habib, Muhammad Attique, Tae-Sun Chung

Main category: cs.CV

TL;DR: A review of 37 studies (2018-2025) on AI-based pest classification, highlighting the shift from CNNs to hybrid/transformer models, key challenges, and future directions.

DetailsMotivation: Traditional pest monitoring methods are slow and unscalable; deep learning offers automated, efficient solutions.

Method: Analyzes studies by crop type, pest species, model architecture, dataset usage, and technical challenges.

Result: Shift from CNNs to hybrid/transformer models improves accuracy but faces challenges like imbalanced datasets and deployment issues.

Conclusion: The review provides a structured overview, identifies key challenges, and suggests future directions for AI-based pest monitoring.

Abstract: Insect pests continue to pose a serious threat to crop yields around the world, and traditional methods for monitoring them are often slow, manual, and difficult to scale. In recent years, deep learning has emerged as a powerful solution, with techniques like convolutional neural networks (CNNs), vision transformers (ViTs), and hybrid models gaining popularity for automating pest detection. This review looks at 37 carefully selected studies published between 2018 and 2025, all focused on AI-based pest classification. The selected research is organized by crop type, pest species, model architecture, dataset usage, and key technical challenges. Early studies relied heavily on CNNs, but the latest work is shifting toward hybrid and transformer-based models that deliver higher accuracy and better contextual understanding. Still, challenges like imbalanced datasets, difficulty in detecting small pests, limited generalizability, and deployment on edge devices remain significant hurdles. Overall, this review offers a structured overview of the field, highlights useful datasets, and outlines the key challenges and future directions for AI-based pest monitoring systems.

[222] End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards

AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky

Main category: cs.CV

TL;DR: A reinforcement-learning-free framework integrates human feedback via differentiable rewards into 3D texture synthesis, improving alignment with preferences and 3D structure.

DetailsMotivation: Existing 3D generative models often miss human preferences and task-specific needs, relying on 2D methods that ignore 3D structure.

Method: Proposes an end-to-end differentiable framework embedding human feedback as reward functions, back-propagating signals through geometric and appearance modules.

Result: Generates textures respecting 3D geometry and desired criteria, outperforming state-of-the-art methods in evaluations.

Conclusion: The framework offers a controllable, interpretable way to create high-quality 3D content from natural language.

Abstract: While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. We will make our implementation code publicly available upon acceptance of the paper.

[223] ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran, Hikaru Shindo, Patrick Schramowski, Simone Schaub-Meyer, Kristian Kersting, Stefan Roth

Main category: cs.CV

TL;DR: ART, an Adaptive Relation Tuning framework, improves VRD by instruction tuning VLMs on diverse data, outperforming baselines and handling unseen relations.

DetailsMotivation: VRD models trained on limited data struggle with generalization, and prompt tuning lacks adaptability for novel relations.

Method: ART converts VRD datasets into instruction tuning format, uses adaptive sampling, and focuses on relation classification.

Result: ART outperforms baselines and can infer unseen relations, demonstrating practical value in scene segmentation.

Conclusion: Instruction tuning with ART enhances VRD by improving generalizability and handling complex relations.

Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART’s practical value by using the predicted relations for segmenting complex scenes.
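
Converting a VRD annotation into an instruction-tuning example is largely a templating step; the hypothetical sketch below shows one such conversion for the relation-classification setting the paper focuses on (subject and object boxes given, predicate to be predicted). The template wording is an assumption, not ART's released format.

```python
# Hypothetical conversion of a VRD annotation to an instruction-tuning example.
def vrd_to_instruction(subject, obj, subj_box, obj_box, predicate):
    return {
        "instruction": (
            f"Given the subject '{subject}' at {subj_box} and the object "
            f"'{obj}' at {obj_box}, what relation holds between them?"
        ),
        "response": predicate,
    }

example = vrd_to_instruction("person", "horse", (12, 30, 88, 200),
                             (60, 90, 240, 260), "riding")
print(example["instruction"])
print(example["response"])  # "riding"
```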

[224] Neural-Driven Image Editing

Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You

Main category: cs.CV

TL;DR: LoongX is a hands-free image editing system using neurophysiological signals, achieving performance comparable to text-driven methods.

DetailsMotivation: Traditional image editing is labor-intensive and inaccessible to some; LoongX aims to make it intuitive and accessible using brain-computer interfaces.

Method: LoongX integrates diffusion models with multimodal neurophysiological signals (EEG, fNIRS, PPG, head motion) using CS3 and DGF modules, fine-tuned on a DiT.

Result: LoongX matches text-driven methods (e.g., CLIP-I: 0.6605 vs. 0.6558) and outperforms them when combined with speech (CLIP-T: 0.2588 vs. 0.2549).

Conclusion: LoongX demonstrates the potential of neural-driven generative models for accessible image editing and cognitive-driven creative technologies.

Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
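
A gated fusion over heterogeneous signal embeddings might look like the sketch below, where a learned softmax gate weights each modality (EEG, fNIRS, PPG, head motion) before summation. The dimensions and gating form are assumptions, not LoongX's released DGF module.

```python
# Hedged sketch of dynamic gated fusion over per-modality embeddings.
import torch
import torch.nn as nn

class DynamicGatedFusion(nn.Module):
    def __init__(self, dim=256, n_modalities=4):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):  # feats: list of (B, dim) modality embeddings
        stacked = torch.stack(feats, dim=1)                            # (B, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)            # (B, dim)

feats = [torch.rand(2, 256) for _ in range(4)]  # EEG, fNIRS, PPG, motion
fused = DynamicGatedFusion()(feats)
print(fused.shape)  # torch.Size([2, 256])
```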

[225] Advancing Welding Defect Detection in Maritime Operations via Adapt-WeldNet and Defect Detection Interpretability Analysis

Kamal Basha S, Athira Nambiar

Main category: cs.CV

TL;DR: The paper introduces Adapt-WeldNet, an adaptive framework for weld defect detection, and a Defect Detection Interpretability Analysis (DDIA) framework to improve transparency and trust in AI-based systems.

DetailsMotivation: Traditional NDT methods and existing neural network approaches often fail to detect subtle or internal defects and lack interpretability, raising safety concerns.

Method: The paper proposes Adapt-WeldNet for systematic evaluation of pre-trained architectures and transfer learning strategies, alongside DDIA for interpretability using XAI techniques and expert validation.

Result: The framework optimizes defect detection performance and provides actionable insights, enhancing reliability and trust in automated decisions.

Conclusion: This work improves both performance and interpretability in welding defect detection, supporting safety and reliability in critical offshore and marine environments.

Abstract: Weld defect detection is crucial for ensuring the safety and reliability of piping systems in the oil and gas industry, especially in challenging marine and offshore environments. Traditional non-destructive testing (NDT) methods often fail to detect subtle or internal defects, leading to potential failures and costly downtime. Furthermore, existing neural network-based approaches for defect classification frequently rely on arbitrarily selected pretrained architectures and lack interpretability, raising safety concerns for deployment. To address these challenges, this paper introduces "Adapt-WeldNet", an adaptive framework for welding defect detection that systematically evaluates various pre-trained architectures, transfer learning strategies, and adaptive optimizers to identify the best-performing model and hyperparameters, optimizing defect detection and providing actionable insights. Additionally, a novel Defect Detection Interpretability Analysis (DDIA) framework is proposed to enhance system transparency. DDIA employs Explainable AI (XAI) techniques, such as Grad-CAM and LIME, alongside domain-specific evaluations validated by certified ASNT NDE Level II professionals. Incorporating a Human-in-the-Loop (HITL) approach and aligning with the principles of Trustworthy AI, DDIA ensures the reliability, fairness, and accountability of the defect detection system, fostering confidence in automated decisions through expert validation. By improving both performance and interpretability, this work enhances trust, safety, and reliability in welding defect detection systems, supporting critical operations in offshore and marine environments.

[226] Trustworthy Pedestrian Trajectory Prediction via Pattern-Aware Interaction Modeling

Kaiyuan Zhai, Juan Chen, Chao Wang, Zeyi Xu, Guoming Tang

Main category: cs.CV

TL;DR: InSyn, a Transformer-based model, improves pedestrian trajectory prediction by explicitly modeling diverse interaction patterns and using a Seq-Start of Seq (SSOS) training strategy to reduce initial-step errors.

DetailsMotivation: Current pedestrian trajectory prediction methods lack reliability due to opaque modeling of interactions, limiting their use in safety-critical applications.

Method: Proposes InSyn, a Transformer-based model capturing diverse interaction patterns, and SSOS training to address initial-step divergence.

Result: Outperforms black-box baselines in accuracy, especially in high-density scenarios, and reduces initial-step error by ~6.58%.

Conclusion: InSyn achieves a balance between reliability and accuracy, offering interpretability and improved performance in pedestrian trajectory prediction.

Abstract: Accurate and reliable pedestrian trajectory prediction is critical for the safety and robustness of intelligent applications, yet achieving trustworthy prediction remains highly challenging due to the complexity of interactions among pedestrians. Previous methods often adopt black-box modeling of pedestrian interactions, treating all neighbors uniformly. Despite their strong performance, such opaque modeling limits the reliability of predictions in safety-critical real-world deployments. To address this issue, we propose InSyn (Interaction-Synchronization Network), a novel Transformer-based model that explicitly captures diverse interaction patterns (e.g., walking in sync or conflicting) while effectively modeling direction-sensitive social behaviors. Additionally, we introduce a training strategy, termed Seq-Start of Seq (SSOS), designed to alleviate the common issue of initial-step divergence in numerical time-series prediction. Experiments on the ETH and UCY datasets demonstrate that our model not only outperforms recent black-box baselines in prediction accuracy, especially under high-density scenarios, but also provides stronger interpretability, achieving a favorable trade-off between reliability and accuracy. Furthermore, the SSOS strategy proves to be effective in improving sequential prediction performance, reducing the initial-step prediction error by approximately 6.58%.

[227] InterAct-Video: Reasoning-Rich Video QA for Urban Traffic

Joseph Raj Vishal, Rutuja Patil, Manas Srinivas Gowda, Katha Naik, Yezhou Yang, Bharatesh Chakravarthi

Main category: cs.CV

TL;DR: The paper introduces InterAct VideoQA, a dataset to improve VideoQA models for traffic monitoring by addressing real-world scene complexities.

DetailsMotivation: Existing VideoQA models struggle with real-world traffic scenes due to their complexity, necessitating a domain-specific dataset.

Method: The InterAct VideoQA dataset includes 8 hours of traffic footage, 10-second clips, and 25,000 QA pairs, covering spatiotemporal dynamics and vehicle interactions.

Result: Evaluation shows challenges in reasoning over spatiotemporal dependencies, but fine-tuning on InterAct improves model performance.

Conclusion: InterAct VideoQA is a valuable benchmark for advancing VideoQA models in intelligent transportation systems.

Abstract: Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA

[228] Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts

Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi

Main category: cs.CV

TL;DR: A modular framework for predicting cancer survival from pathology images, improving accuracy via dynamic patch selection, clustering, attention mechanisms, and expert-guided modeling.

DetailsMotivation: To enhance the accuracy of cancer-specific survival prediction from whole slide pathology images by addressing challenges like large image size, tissue heterogeneity, and complex survival distributions.

Method: Integrates dynamic patch selection, graph-guided k-means clustering, attention mechanisms for intra- and inter-cluster relationships, and expert-guided Gaussian mixture modeling.

Result: Achieves superior performance (concordance index and Brier score) on TCGA-KIRC and TCGA-LUAD datasets, outperforming state-of-the-art methods.

Conclusion: The proposed method demonstrates significant predictive potential across diverse cancer types, offering a robust and accurate framework for survival prediction.

Abstract: We introduce a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs) that significantly improves upon the state-of-the-art accuracy. Our method integrates four key components. First, to tackle the large size of WSIs, we use dynamic patch selection via quantile-based thresholding to isolate prognostically informative tissue regions. Second, we use graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence. Third, we use attention mechanisms that model both intra- and inter-cluster relationships to contextualize local features within global spatial relations between various types of tissue compartments. Finally, we use expert-guided mixture density modeling to estimate complex survival distributions using Gaussian mixture models. The proposed model achieves a concordance index of $0.712 \pm 0.028$ and Brier score of $0.254 \pm 0.018$ on TCGA-KIRC (renal cancer), and a concordance index of $0.645 \pm 0.017$ and Brier score of $0.281 \pm 0.031$ on TCGA-LUAD (lung adenocarcinoma). These results are significantly better than the state of the art and demonstrate the predictive potential of the proposed method across diverse cancer types.
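
At its core, a mixture density head for survival reduces to predicting the weights, means, and scales of a Gaussian mixture over survival times and minimizing the mixture negative log-likelihood, as in the sketch below. The component count and parameterization are illustrative assumptions, not the paper's exact expert-guided model.

```python
# Illustrative Gaussian mixture density head over survival times.
import torch
import torch.nn as nn

class MixtureDensityHead(nn.Module):
    def __init__(self, in_dim=512, n_components=5):
        super().__init__()
        self.params = nn.Linear(in_dim, 3 * n_components)

    def forward(self, h):  # h: (B, in_dim) slide-level embedding
        pi, mu, log_sigma = self.params(h).chunk(3, dim=-1)
        return torch.softmax(pi, -1), mu, log_sigma.exp()

def gmm_nll(t, pi, mu, sigma):
    """Negative log-likelihood of observed survival times t: (B,)."""
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(t.unsqueeze(-1)) + pi.clamp(min=1e-8).log()
    return -torch.logsumexp(log_prob, dim=-1).mean()

h = torch.rand(4, 512)
pi, mu, sigma = MixtureDensityHead()(h)
print(gmm_nll(torch.rand(4), pi, mu, sigma))
```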

[229] MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions

YiZhou Li

Main category: cs.CV

TL;DR: MoR-ViT introduces a token-level dynamic recursion mechanism for efficient vision transformers, reducing parameters by 70% and accelerating inference by 2.5x while maintaining high accuracy.

DetailsMotivation: Standard ViTs suffer from parameter redundancy and high computational costs, limiting practical deployment. Existing methods focus on static compression or token sparsification but lack flexibility in computational depth.

Method: MoR-ViT uses a token-level dynamic recursion mechanism (Mixture-of-Recursions) to adaptively determine processing depth for each token, enabling flexible resource allocation.

Result: MoR-ViT achieves state-of-the-art accuracy with 70% fewer parameters and 2.5x faster inference, outperforming baselines like DynamicViT and TinyViT.

Conclusion: Dynamic recursion is an effective strategy for efficient ViTs, offering scalability and deployability for real-world applications.

Abstract: Vision Transformers (ViTs) have achieved remarkable success in image recognition, yet standard ViT architectures are hampered by substantial parameter redundancy and high computational cost, limiting their practical deployment. While recent efforts on efficient ViTs primarily focus on static model compression or token-level sparsification, they remain constrained by fixed computational depth for all tokens. In this work, we present MoR-ViT, a novel vision transformer framework that, for the first time, incorporates a token-level dynamic recursion mechanism inspired by the Mixture-of-Recursions (MoR) paradigm. This approach enables each token to adaptively determine its processing depth, yielding a flexible and input-dependent allocation of computational resources. Extensive experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT not only achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration, but also outperforms leading efficient ViT baselines such as DynamicViT and TinyViT under comparable conditions. These results establish dynamic recursion as an effective strategy for efficient vision transformers and open new avenues for scalable and deployable deep learning models in real-world scenarios.
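
Token-level dynamic recursion can be sketched as a shared block applied repeatedly, with a router deciding per token whether to keep recursing; the halting rule below (a sigmoid threshold) is an illustrative assumption rather than MoR-ViT's exact mechanism.

```python
# Hedged sketch of per-token dynamic recursion depth over a shared block.
import torch
import torch.nn as nn

class RecursiveTokenLayer(nn.Module):
    def __init__(self, dim=192, max_depth=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.router = nn.Linear(dim, 1)
        self.max_depth = max_depth

    def forward(self, x):  # x: (B, N, dim)
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_depth):
            if not active.any():
                break
            y = self.block(x)
            # Update only tokens the router keeps active.
            x = torch.where(active.unsqueeze(-1), y, x)
            active = active & (torch.sigmoid(self.router(x)).squeeze(-1) > 0.5)
        return x

tokens = torch.rand(2, 197, 192)  # ViT-style patch tokens
print(RecursiveTokenLayer()(tokens).shape)  # torch.Size([2, 197, 192])
```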

[230] Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens

Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong

Main category: cs.CV

TL;DR: A method to adapt monocular depth estimators for fisheye images using calibration tokens, avoiding retraining and improving accuracy.

DetailsMotivation: FMDEs trained on perspective images fail on fisheye images due to covariate shift from camera calibration differences.

Method: Aligns latent embeddings of fisheye images to perspective images using calibration tokens, leveraging self-supervised training with perspective datasets.

Result: Consistently outperforms state-of-the-art methods for fisheye depth estimation without retraining.

Conclusion: The method enables effective reuse of FMDEs for fisheye cameras, offering a lightweight and efficient solution.

Abstract: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a lightweight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoor and outdoor scenes, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
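
A lightweight adaptation via learnable tokens can be as simple as prepending a small parameter tensor to the frozen FMDE's patch embeddings so downstream attention layers can modulate them, as sketched below; the token count and placement are assumptions, not the paper's exact design.

```python
# Minimal sketch of learnable calibration tokens for a frozen backbone.
import torch
import torch.nn as nn

class CalibrationTokens(nn.Module):
    def __init__(self, n_tokens=8, dim=768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)

    def forward(self, patch_embeddings):  # (B, N, dim) from a frozen FMDE
        b = patch_embeddings.shape[0]
        return torch.cat([self.tokens.expand(b, -1, -1), patch_embeddings], dim=1)

embeds = torch.rand(2, 196, 768)
modulated = CalibrationTokens()(embeds)  # (2, 204, 768); FMDE weights stay untouched
print(modulated.shape)
```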

[231] Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

Daniel Wolf, Heiko Hillenhagen, Billurvan Taskin, Alex Bäuerle, Meinrad Beer, Michael Götz, Timo Ropinski

Main category: cs.CV

TL;DR: VLMs struggle with determining relative positions in medical images, despite visual prompts improving performance slightly. A new benchmark, MIRP, is introduced to evaluate this capability.

DetailsMotivation: Understanding relative positions in medical images is crucial for clinical decision-making, yet VLMs lack this ability.

Method: Evaluated state-of-the-art VLMs (GPT-4o, Llama3.2, Pixtral, JanusPro) and tested visual prompts (alphanumeric/colored markers) for improvement.

Result: VLMs performed poorly on medical images, relying more on prior knowledge than image content. Visual prompts helped but not significantly.

Conclusion: VLMs need improvement for medical imaging tasks. The MIRP dataset is introduced to advance research in this area.

Abstract: Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs, GPT-4o, Llama3.2, Pixtral, and JanusPro, and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results remain significantly lower on medical images compared to observations made on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.

[232] A Study of Gender Classification Techniques Based on Iris Images: A Deep Survey and Analysis

Basna Mohammed Salih Hasan, Ramadhan J. Mstafa

Main category: cs.CV

TL;DR: The paper reviews gender classification methods, focusing on facial and iris-based approaches, and highlights gaps and future research directions.

DetailsMotivation: Gender classification is useful in applications like surveillance and human-computer interaction, with soft biometrics like facial and iris traits being key.

Method: The study reviews existing literature and methodologies for gender classification, emphasizing facial and iris-based techniques.

Result: The paper provides an analysis of current approaches, identifies gaps, and suggests future improvements.

Conclusion: The study aids researchers by summarizing existing methods, challenges, and potential advancements in gender classification.

Abstract: Gender classification is attractive in a range of applications, including surveillance and monitoring, corporate profiling, and human-computer interaction. Individuals’ identities may be gleaned from information about their gender, which is a kind of soft biometric. Over the years, several methods for determining a person’s gender have been devised. Some of the most well-known ones are based on physical characteristics like face, fingerprint, palmprint, DNA, ears, gait, and iris. On the other hand, facial features account for the vast majority of gender classification methods. Also, the iris is a significant biometric trait because the iris, according to research, remains basically constant during an individual’s life. Besides that, the iris is externally visible and is non-invasive to the user, which is important for practical applications. Furthermore, there are already high-quality methods for segmenting and encoding iris images, and the current methods facilitate selecting and extracting attribute vectors from iris textures. This study discusses several approaches to determining gender. The previous works of literature are briefly reviewed. Additionally, there are a variety of methodologies for different steps of gender classification. This study provides researchers with knowledge and analysis of the existing gender classification approaches. Also, it will assist researchers who are interested in this specific area, as well as highlight the gaps and challenges in the field, and finally provide suggestions and future paths for improvement.

[233] Can Large Pretrained Depth Estimation Models Help With Image Dehazing?

Hongfei Zhang, Kun Zhou, Ruizheng Wu, Jiangbo Lu

Main category: cs.CV

TL;DR: The paper explores using pretrained depth representations for image dehazing, proposing a plug-and-play RGB-D fusion module for adaptability across diverse architectures.

DetailsMotivation: Existing dehazing methods lack adaptability due to architecture-specific designs, despite the promise of large-scale pretrained models.

Method: Systematic investigation of pretrained depth representations and introduction of an RGB-D fusion module for seamless integration with various dehazing architectures.

Result: Learned deep depth features show consistency across haze levels; the proposed module is validated as effective and broadly applicable.

Conclusion: The RGB-D fusion module enhances adaptability and performance in image dehazing across diverse scenarios.

Abstract: Image dehazing remains a challenging problem due to the spatially varying nature of haze in real-world scenes. While existing methods have demonstrated the promise of large-scale pretrained models for image dehazing, their architecture-specific designs hinder adaptability across diverse scenarios with different accuracy and efficiency requirements. In this work, we systematically investigate the generalization capability of pretrained depth representations-learned from millions of diverse images-for image dehazing. Our empirical analysis reveals that the learned deep depth features maintain remarkable consistency across varying haze levels. Building on this insight, we propose a plug-and-play RGB-D fusion module that seamlessly integrates with diverse dehazing architectures. Extensive experiments across multiple benchmarks validate both the effectiveness and broad applicability of our approach.
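
A plug-and-play RGB-D fusion module in this spirit might concatenate depth features from a frozen pretrained estimator with the dehazing branch's features and fuse them with a 1x1 convolution, as in the hedged sketch below; the shapes and fusion operator are assumptions, not the paper's exact module.

```python
# Hedged sketch of a plug-and-play RGB-D feature fusion block.
import torch
import torch.nn as nn

class RGBDFusion(nn.Module):
    def __init__(self, rgb_ch=64, depth_ch=32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(rgb_ch + depth_ch, rgb_ch, 1),
            nn.ReLU(),
        )

    def forward(self, rgb_feat, depth_feat):
        # Resize depth features to the dehazing branch's resolution before fusing.
        depth_feat = nn.functional.interpolate(
            depth_feat, size=rgb_feat.shape[-2:], mode="bilinear",
            align_corners=False)
        return self.fuse(torch.cat([rgb_feat, depth_feat], dim=1))

out = RGBDFusion()(torch.rand(1, 64, 128, 128), torch.rand(1, 32, 64, 64))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```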

[234] CF3: Compact and Fast 3D Feature Fields

Hyunjoon Lee, Joonkyu Min, Jaesik Park

Main category: cs.CV

TL;DR: The paper proposes CF3, a top-down pipeline for efficient 3D Gaussian feature fields, reducing computational costs by using adaptive sparsification and direct per-Gaussian autoencoder training.

DetailsMotivation: Current 3D Gaussian Splatting methods rely on costly bottom-up optimization of raw 2D features, which is inefficient.

Method: CF3 uses fast weighted fusion of multi-view 2D features with pre-trained Gaussians, trains per-Gaussian autoencoders, and applies adaptive sparsification to prune redundant Gaussians.

Result: The method achieves competitive performance using only 5% of the Gaussians compared to Feature-3DGS.

Conclusion: CF3 provides a compact and efficient 3D Gaussian feature field with preserved geometric details.

Abstract: 3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs. We propose a top-down pipeline for constructing compact and fast 3D Gaussian feature fields, namely, CF3. We first perform a fast weighted fusion of multi-view 2D features with pre-trained Gaussians. This approach enables training a per-Gaussian autoencoder directly on the lifted features, instead of training autoencoders in the 2D domain. As a result, the autoencoder better aligns with the feature distribution. More importantly, we introduce an adaptive sparsification method that optimizes the Gaussian attributes of the feature field while pruning and merging the redundant Gaussians, constructing an efficient representation with preserved geometric details. Our approach achieves a competitive 3D feature field using as little as 5% of the Gaussians compared to Feature-3DGS.

[235] CMIC: Content-Adaptive Mamba for Learned Image Compression

Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu

Main category: cs.CV

TL;DR: CAM introduces content-aware token reorganization and global priors to Mamba-style SSMs, improving learned image compression (LIC) performance.

DetailsMotivation: Vanilla Mamba's content-agnostic approach limits dynamic exploitation of content dependencies, prompting the need for a more adaptive model.

Method: CAM uses content-aware token reorganization and integrates global priors via a prompt dictionary to enhance Mamba’s capabilities.

Result: CAM-based LIC model (CMIC) outperforms VTM-21.0 by significant BD-rate reductions on multiple benchmarks.

Conclusion: CAM effectively addresses Mamba’s limitations, achieving state-of-the-art performance in image compression.

Abstract: Recent Learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, vanilla Mamba is content-agnostic, relying on fixed and predefined selective scans, which restricts its ability to dynamically and fully exploit content dependencies. We introduce Content-Adaptive Mamba (CAM), a dynamic SSM that addresses two critical limitations. First, it employs content-aware token reorganization, clustering and reordering tokens based on content similarity to prioritize proximity in feature space over Euclidean space. Second, it integrates global priors into SSM via a prompt dictionary, effectively mitigating the strict causality and long-range decay in the token interactions of Mamba. These innovations enable CAM to better capture global dependencies while preserving computational efficiency. Leveraging CAM, our Content-Adaptive Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by -15.91%, -21.34%, and -17.58% BD-rate on Kodak, Tecnick, and CLIC benchmarks, respectively.
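
Content-aware token reorganization can be approximated by clustering tokens on feature similarity and scanning them cluster by cluster, restoring the original order afterwards; the simplified k-means below is an illustrative assumption, not CMIC's exact procedure.

```python
# Hedged sketch: cluster tokens by content, reorder for the SSM scan,
# then undo the permutation after scanning.
import torch

def reorder_by_content(tokens, n_clusters=4, iters=5):
    """tokens: (N, C). Returns reordered tokens and the permutation."""
    n = tokens.shape[0]
    centers = tokens[torch.randperm(n)[:n_clusters]].clone()
    for _ in range(iters):  # a few k-means steps
        assign = torch.cdist(tokens, centers).argmin(dim=1)
        for k in range(n_clusters):
            member = tokens[assign == k]
            if len(member) > 0:
                centers[k] = member.mean(dim=0)
    perm = torch.sort(assign, stable=True).indices  # group similar tokens together
    return tokens[perm], perm

tokens = torch.rand(256, 64)
ordered, perm = reorder_by_content(tokens)
# After the scan: restore original spatial order with the inverse permutation.
inverse = torch.empty_like(perm)
inverse[perm] = torch.arange(perm.numel())
restored = ordered[inverse]
```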

[236] TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Zewei Zhou, Seth Z. Zhao, Tianhui Cai, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

Main category: cs.CV

TL;DR: TurboTrain is a novel framework for efficient multi-agent training, combining spatiotemporal pretraining and balanced multi-task learning to improve performance and reduce manual effort.

DetailsMotivation: Training multi-agent systems is challenging and requires extensive manual design; TurboTrain aims to streamline this process.

Method: TurboTrain uses masked reconstruction learning for spatiotemporal pretraining and gradient conflict suppression for balanced multi-task learning.

Result: TurboTrain improves performance on the V2XPnP-Seq dataset, enhancing multi-agent perception and prediction tasks.

Conclusion: TurboTrain’s pretraining captures spatiotemporal features effectively, and its balanced learning strategy boosts detection and prediction performance.

Abstract: End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.
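
Gradient conflict suppression is a family of techniques; one well-known concrete instance is the PCGrad-style projection sketched below, which removes the conflicting component when two task gradients point in opposing directions. TurboTrain's exact rule may differ.

```python
# PCGrad-style projection as one concrete form of gradient conflict suppression.
import torch

def suppress_conflict(g_task, g_other):
    """Project g_task away from g_other when their dot product is negative."""
    dot = torch.dot(g_task, g_other)
    if dot < 0:
        g_task = g_task - dot / (g_other.norm() ** 2 + 1e-12) * g_other
    return g_task

g_detect = torch.tensor([1.0, 0.5])    # e.g., perception-task gradient
g_predict = torch.tensor([-1.0, 0.8])  # e.g., prediction-task gradient
print(suppress_conflict(g_detect, g_predict))  # conflicting component removed
```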

[237] RetinexDual: Retinex-based Dual Nature Approach for Generalized Ultra-High-Definition Image Restoration

Mohab Kishawy, Ali Abdellatif Hussein, Jun Chen

Main category: cs.CV

TL;DR: RetinexDual, a novel framework for Ultra-High-Definition Image Restoration (UHD IR), combines two sub-networks (SAMBA and FIA) to address limitations of traditional methods, outperforming recent techniques in tasks like deraining and dehazing.

DetailsMotivation: Traditional UHD IR methods like downsampling or frequency-domain transformations suffer from irreversible information loss or ineffective artifact handling, prompting the need for a more robust solution.

Method: RetinexDual uses SAMBA for reflectance correction and FIA for frequency-domain illumination adaptation, leveraging coarse-to-fine and global context mechanisms.

Result: RetinexDual excels in UHD IR tasks (deraining, deblurring, dehazing, LLIE), surpassing existing methods in quality and metrics.

Conclusion: The dual-network design of RetinexDual is validated as effective, with ablation studies confirming the importance of its components.

Abstract: Advancements in image sensing have elevated the importance of Ultra-High-Definition Image Restoration (UHD IR). Traditional methods, such as extreme downsampling or transformation from the spatial to the frequency domain, encounter significant drawbacks: downsampling induces irreversible information loss in UHD images, while our frequency analysis reveals that pure frequency-domain approaches are ineffective for spatially confined image artifacts, primarily due to the loss of degradation locality. To overcome these limitations, we present RetinexDual, a novel Retinex theory-based framework designed for generalized UHD IR tasks. RetinexDual leverages two complementary sub-networks: the Scale-Attentive maMBA (SAMBA) and the Frequency Illumination Adaptor (FIA). SAMBA, responsible for correcting the reflectance component, utilizes a coarse-to-fine mechanism to overcome the causal modeling of mamba, which effectively reduces artifacts and restores intricate details. On the other hand, FIA ensures precise correction of color and illumination distortions by operating in the frequency domain and leveraging the global context provided by it. Evaluating RetinexDual on four UHD IR tasks, namely deraining, deblurring, dehazing, and Low-Light Image Enhancement (LLIE), shows that it outperforms recent methods qualitatively and quantitatively. Ablation studies demonstrate the importance of employing distinct designs for each branch in RetinexDual, as well as the effectiveness of its various components.

[238] SPA++: Generalized Graph Spectral Alignment for Versatile Domain Adaptation

Zhiqing Xiao, Haobo Wang, Xu Lu, Wentao Ye, Gang Chen, Junbo Zhao

Main category: cs.CV

TL;DR: SPA++ is a graph-based domain adaptation framework that aligns domains in eigenspaces and enhances discriminability through spectral alignment and neighbor-aware propagation, outperforming existing methods.

DetailsMotivation: Addressing the tradeoff between inter-domain transferability and intra-domain discriminability in domain adaptation.

Method: Uses graph primitives for coarse alignment, spectral regularizers, neighbor-aware propagation, and incorporates data augmentation and consistency regularization.

Result: Outperforms cutting-edge methods in robustness and adaptability across various domain adaptation scenarios.

Conclusion: SPA++ provides a robust and adaptable solution for domain adaptation with theoretical backing and empirical success.

Abstract: Domain Adaptation (DA) aims to transfer knowledge from a labeled source domain to an unlabeled or sparsely labeled target domain under domain shifts. Most prior works focus on capturing the inter-domain transferability but largely overlook rich intra-domain structures, which empirically results in even worse discriminability. To tackle this tradeoff, we propose a generalized graph SPectral Alignment framework, SPA++. Its core is briefly condensed as follows: (1) by casting the DA problem to graph primitives, it composes a coarse graph alignment mechanism with a novel spectral regularizer toward aligning the domain graphs in eigenspaces; (2) we further develop a fine-grained neighbor-aware propagation mechanism for enhanced discriminability in the target domain; (3) by incorporating data augmentation and consistency regularization, SPA++ can adapt to complex scenarios including most DA settings and even challenging distribution scenarios. Furthermore, we also provide theoretical analysis to support our method, including the generalization bound of graph-based DA and the role of spectral alignment and smoothing consistency. Extensive experiments on benchmark datasets demonstrate that SPA++ consistently outperforms existing cutting-edge methods, achieving superior robustness and adaptability across various challenging adaptation scenarios.
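
Aligning domain graphs in eigenspaces suggests a penalty on the spectra of the two domains' graph Laplacians; the sketch below compares leading Laplacian eigenvalues. The Laplacian construction and the choice of k are illustrative assumptions, not SPA++'s exact regularizer.

```python
# Hedged sketch of a spectral alignment penalty between two domain graphs.
import torch

def spectral_alignment_loss(adj_src, adj_tgt, k=8):
    def laplacian_eigs(adj):
        deg = adj.sum(dim=1)
        lap = torch.diag(deg) - adj
        return torch.linalg.eigvalsh(lap)[:k]  # k smallest eigenvalues
    return (laplacian_eigs(adj_src) - laplacian_eigs(adj_tgt)).pow(2).mean()

a_src = torch.rand(32, 32)
a_src = (a_src + a_src.T) / 2  # symmetric toy adjacency matrices
a_tgt = torch.rand(32, 32)
a_tgt = (a_tgt + a_tgt.T) / 2
print(spectral_alignment_loss(a_src, a_tgt))
```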

cs.AI

[239] InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization

Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, Shengyu Zhang, Hongxia Yang, Fei Wu

Main category: cs.AI

TL;DR: AEPO improves semantic alignment in MLLMs for GUI tasks, outperforming RLVR by 9%.

DetailsMotivation: Addressing inefficient exploration in semantic alignment for GUI-based MLLMs.

Method: Introduces Adaptive Exploration Policy Optimization (AEPO) with multi-answer generation and AER.

Result: AEPO-trained models (InfiGUI-G1-3B/7B) achieve SOTA on GUI benchmarks.

Conclusion: AEPO effectively enhances semantic alignment and exploration in MLLMs.

Abstract: The emergence of Multimodal Large Language Models (MLLMs) has propelled the development of autonomous agents that operate on Graphical User Interfaces (GUIs) using pure visual input. A fundamental challenge is robustly grounding natural language instructions. This requires a precise spatial alignment, which accurately locates the coordinates of each element, and, more critically, a correct semantic alignment, which matches the instructions to the functionally appropriate UI element. Although Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be effective at improving spatial alignment for these MLLMs, we find that inefficient exploration bottlenecks semantic alignment, which prevents models from learning difficult semantic associations. To address this exploration problem, we present Adaptive Exploration Policy Optimization (AEPO), a new policy optimization framework. AEPO employs a multi-answer generation strategy to enforce broader exploration, which is then guided by a theoretically grounded Adaptive Exploration Reward (AER) function derived from the first principle of efficiency $\eta = U/C$. Our AEPO-trained models, InfiGUI-G1-3B and InfiGUI-G1-7B, establish new state-of-the-art results across multiple challenging GUI grounding benchmarks, achieving significant relative improvements of up to 9.0% against the naive RLVR baseline on benchmarks designed to test generalization and semantic understanding. Resources are available at https://github.com/InfiXAI/InfiGUI-G1.
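
One hedged reading of the efficiency principle $\eta = U/C$ is utility per unit cost over a group of sampled answers, as in the toy sketch below; the paper's actual AER derivation is more involved, and the utility and cost definitions here are assumptions for illustration only.

```python
# Toy reading of efficiency eta = U/C over a multi-answer group.
def adaptive_exploration_reward(answers, correct, cost_per_answer=1.0):
    """answers: list of predicted target elements; correct: ground-truth element."""
    utility = 1.0 if correct in answers else 0.0  # U: did exploration hit the target?
    cost = cost_per_answer * len(answers)         # C: how much sampling it took
    return utility / cost  # rewards finding the target with fewer samples

print(adaptive_exploration_reward(["button_ok", "menu_file"], "button_ok"))
# 0.5 -> fewer answers that still contain the target score higher
```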

[240] A Framework for Inherently Safer AGI through Language-Mediated Active Inference

Bo Wen

Main category: cs.AI

TL;DR: A novel framework for safe AGI combines Active Inference and LLMs, integrating safety into core design via transparent beliefs and hierarchical value alignment.

DetailsMotivation: Traditional AI safety methods (interpretability, reward engineering) have limitations; this work embeds safety into AGI's foundational design.

Method: Uses natural language for belief representation, multi-agent Active Inference, and hierarchical Markov blankets for safety constraints.

Result: Proposes mechanisms like belief-preference separation, bounded rationality, and modular safety for inherently safer AGI.

Conclusion: Suggests ARC benchmark experiments to validate safety, advocating for inherently safe AGI development.

Abstract: This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system’s core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework’s safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.

[241] Whither symbols in the era of advanced neural networks?

Thomas L. Griffiths, Brenden M. Lake, R. Thomas McCoy, Ellie Pavlick, Taylor W. Webb

Main category: cs.AI

TL;DR: Neural networks challenge the symbolic view of human cognition by showing similar abilities, suggesting a new research agenda.

DetailsMotivation: To argue that neural networks exhibit abilities like human cognition, undermining the symbolic view.

Method: Analysis of neural networks’ abilities compared to human cognition.

Result: Neural networks show similar cognitive abilities, questioning symbolic representations in human thought.

Conclusion: Proposes a new research agenda on the symbolic basis of human thought.

Abstract: Some of the strongest evidence that human minds should be thought about in terms of symbolic systems has been the way they combine ideas, produce novelty, and learn quickly. We argue that modern neural networks – and the artificial intelligence systems built upon them – exhibit similar abilities. This undermines the argument that the cognitive processes and representations used by human minds are symbolic, although the fact that these neural networks are typically trained on data generated by symbolic systems illustrates that such systems play an important role in characterizing the abstract problems that human minds have to solve. This argument leads us to offer a new agenda for research on the symbolic basis of human thought.

[242] Holistic Explainable AI (H-XAI): Extending Transparency Beyond Developers in AI-Driven Decision Making

Kausik Lakkaraju, Siva Likitha Valluru, Biplav Srivastava

Main category: cs.AI

TL;DR: H-XAI integrates causal rating with traditional XAI to provide interactive, multi-method explanations for diverse stakeholders, addressing gaps in current XAI methods.

DetailsMotivation: Current XAI methods focus on justifying model outputs for developers, lacking support for diverse stakeholder needs. Evaluative AI shifts toward hypothesis testing but remains organization-centric.

Method: H-XAI combines causal rating methods with traditional XAI, enabling interactive hypothesis testing and comparison against random and biased baselines. It integrates instance-level and global explanations.

Result: Demonstrated through case studies in credit risk classification and financial forecasting, H-XAI effectively answers stakeholder-specific questions at individual and model levels.

Conclusion: H-XAI fills critical gaps by unifying causal ratings and post-hoc explanations, catering to diverse stakeholder goals and enhancing model transparency.

Abstract: Current eXplainable AI (XAI) methods largely serve developers, often focusing on justifying model outputs rather than supporting diverse stakeholder needs. A recent shift toward Evaluative AI reframes explanation as a tool for hypothesis testing, but still focuses primarily on operational organizations. We introduce Holistic-XAI (H-XAI), a unified framework that integrates causal rating methods with traditional XAI methods to support explanation as an interactive, multi-method process. H-XAI allows stakeholders to ask a series of questions, test hypotheses, and compare model behavior against automatically constructed random and biased baselines. It combines instance-level and global explanations, adapting to each stakeholder’s goals, whether understanding individual decisions, assessing group-level bias, or evaluating robustness under perturbations. We demonstrate the generality of our approach through two case studies spanning six scenarios: binary credit risk classification and financial time-series forecasting. H-XAI fills critical gaps left by existing XAI methods by combining causal ratings and post-hoc explanations to answer stakeholder-specific questions at both the individual decision level and the overall model level.

[243] Safety of Embodied Navigation: A Survey

Zixia Wang, Jia Hu, Ronghui Mu

Main category: cs.AI

TL;DR: A survey on safety in embodied navigation, covering attacks, defenses, and evaluations, with insights for future research.

DetailsMotivation: Address safety concerns in embodied navigation due to its critical applications in dynamic environments.

Method: Comprehensive analysis of attack strategies, defense mechanisms, evaluation methodologies, datasets, and metrics.

Result: Identifies unresolved issues and future directions, including attack methods, mitigation strategies, and reliable evaluation techniques.

Conclusion: Aims to guide safer embodied navigation systems, benefiting societal safety and industrial efficiency.

Abstract: As large language models (LLMs) continue to advance and gain influence, the development of embodied AI has accelerated, drawing significant attention, particularly in navigation scenarios. Embodied navigation requires an agent to perceive, interact with, and adapt to its environment while moving toward a specified target in unfamiliar settings. However, the integration of embodied navigation into critical applications raises substantial safety concerns. Given their deployment in dynamic, real-world environments, ensuring the safety of such systems is critical. This survey provides a comprehensive analysis of safety in embodied navigation from multiple perspectives, encompassing attack strategies, defense mechanisms, and evaluation methodologies. Beyond conducting a comprehensive examination of existing safety challenges, mitigation technologies, and various datasets and metrics that assess effectiveness and robustness, we explore unresolved issues and future research directions in embodied navigation safety. These include potential attack methods, mitigation strategies, more reliable evaluation techniques, and the implementation of verification frameworks. By addressing these critical gaps, this survey aims to provide valuable insights that can guide future research toward the development of safer and more reliable embodied navigation systems. Furthermore, the findings of this study have broader implications for enhancing societal safety and increasing industrial efficiency.

[244] Planning Agents on an Ego-Trip: Leveraging Hybrid Ego-Graph Ensembles for Improved Tool Retrieval in Enterprise Task Planning

Sahil Bansal, Sai Shruthi Sistla, Aarti Arikatala, Sebastian Schreiber

Main category: cs.AI

TL;DR: A KG-based tool retrieval framework improves accuracy for multi-step tasks by leveraging semantic relationships and functional dependencies between tools, outperforming traditional similarity-based methods.

DetailsMotivation: Addressing the limitation of traditional tool retrieval methods, which rely on query-tool similarity and struggle with multi-step user requests.

Method: Proposes a KG-based framework using ensembles of 1-hop ego tool graphs to model direct and indirect tool connections for contextual selection.

Result: Achieves 91.85% tool coverage on micro-average Complete Recall, outperforming the strongest non-KG baseline (89.26%).

Conclusion: Structural KG information complements similarity matching, especially for sequential tool composition in complex queries.

Abstract: Effective tool retrieval is essential for AI agents to select from a vast array of tools when identifying and planning actions in the context of complex user queries. Despite its central role in planning, this aspect remains underexplored in the literature. Traditional approaches rely primarily on similarities between user queries and tool descriptions, which significantly limits retrieval accuracy, specifically when handling multi-step user requests. To address these limitations, we propose a Knowledge Graph (KG)-based tool retrieval framework that captures the semantic relationships between tools and their functional dependencies. Our retrieval algorithm leverages ensembles of 1-hop ego tool graphs to model direct and indirect connections between tools, enabling more comprehensive and contextual tool selection for multi-step tasks. We evaluate our approach on a synthetically generated internal dataset across six defined user classes, extending previous work on coherent dialogue synthesis and tool retrieval benchmarks. Results demonstrate that our tool graph-based method achieves 91.85% tool coverage on the micro-average Complete Recall metric, compared to 89.26% for re-ranked semantic-lexical hybrid retrieval, the strongest non-KG baseline in our experiments. These findings support our hypothesis that the structural information in the KG provides complementary signals to pure similarity matching, particularly for queries requiring sequential tool composition.
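A minimal sketch of retrieval via ensembles of 1-hop ego graphs, using networkx and a deliberately naive lexical seeding step; both the toy graph and the seeding are illustrative assumptions, not the paper's pipeline.

```python
# Seed tools by plain text similarity, then let each seed's 1-hop ego graph
# contribute its neighbors, so functionally dependent tools surface for
# multi-step plans even if the query never names them.
import networkx as nx

TOOL_GRAPH = nx.Graph()
TOOL_GRAPH.add_edges_from([
    ("search_flights", "book_flight"),    # edges encode functional dependency
    ("book_flight", "send_confirmation"),
    ("search_hotels", "book_hotel"),
    ("book_hotel", "send_confirmation"),
])

def seed_tools(query, tools, top_n=2):
    """Naive lexical seeding: rank tools by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(tools, key=lambda t: -len(q & set(t.split("_"))))[:top_n]

def retrieve_with_ego_graphs(query, graph, top_n=2):
    """Union the 1-hop ego graphs of the seed tools into one candidate set."""
    candidates = set()
    for seed in seed_tools(query, list(graph.nodes), top_n):
        candidates |= set(nx.ego_graph(graph, seed, radius=1).nodes)
    return sorted(candidates)

print(retrieve_with_ego_graphs("search and book a flight", TOOL_GRAPH))
# send_confirmation is retrieved via the ego graph although the query never names it
```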

[245] Mediator-Guided Multi-Agent Collaboration among Open-Source Models for Medical Decision-Making

Kaitao Chen, Mianxin Liu, Daoming Zong, Chaoyue Ding, Shaohao Rui, Yankai Jiang, Mu Zhou, Xiaosong Wang

Main category: cs.AI

TL;DR: MedOrch is a mediator-guided multi-agent framework for medical multimodal decision-making, using LLM-based mediators to enhance collaboration among VLM-based expert agents.

DetailsMotivation: Existing multi-agent systems struggle with multimodal tasks, and VLMs lack instruction-following and self-reflection compared to LLMs, limiting cooperative workflows in medical decision-making.

Method: Proposes MedOrch, employing an LLM-based mediator to facilitate collaboration among multiple VLM-based expert agents, using open-source VLMs instead of costly models.

Result: Demonstrates superior performance on medical vision question answering benchmarks, showing collaboration surpasses individual agent capabilities.

Conclusion: MedOrch highlights the potential of mediator-guided multi-agent collaboration for advancing medical multimodal intelligence.

Abstract: Complex medical decision-making involves cooperative workflows operated by different clinicians. Designing AI multi-agent systems can expedite and augment human-level clinical decision-making. Existing multi-agent research primarily focuses on language-only tasks, yet its extension to multimodal scenarios remains challenging. Blindly combining diverse vision-language models (VLMs) can amplify erroneous outcome interpretations. VLMs are generally less capable at instruction following and, importantly, self-reflection than large language models (LLMs) of comparable size. This disparity largely constrains VLMs’ ability in cooperative workflows. In this study, we propose MedOrch, a mediator-guided multi-agent collaboration framework for medical multimodal decision-making. MedOrch employs an LLM-based mediator agent that enables multiple VLM-based expert agents to exchange and reflect on their outputs towards collaboration. We utilize multiple open-source general-purpose and domain-specific VLMs instead of costly GPT-series models, revealing the strength of heterogeneous models. We show that the collaboration within distinct VLM-based agents can surpass the capabilities of any individual agent. We validate our approach on five medical vision question answering benchmarks, demonstrating superior collaboration performance without model training. Our findings underscore the value of mediator-guided multi-agent collaboration in advancing medical multimodal intelligence. Our code will be made publicly available.
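The shape of a mediator-guided collaboration round can be pictured in a few lines. Here `ask` is a hypothetical stand-in for any chat-completion client, and the prompts and fixed number of rounds are illustrative assumptions, not MedOrch's actual protocol.

```python
# Experts answer; the mediator critiques; experts revise; the mediator fuses.
def ask(model, prompt):
    """Toy stub: replace with a real VLM/LLM call."""
    return f"[{model} reply to: {prompt[:40]}...]"

def mediated_round(mediator, experts, case, rounds=2):
    answers = {name: ask(name, case) for name in experts}
    for _ in range(rounds):
        critique = ask(mediator, f"Case: {case}\nExpert answers: {answers}\n"
                                 "Point out disagreements and likely errors.")
        answers = {name: ask(name, f"{case}\nMediator feedback: {critique}\n"
                                   f"Your previous answer: {answers[name]}\nRevise.")
                   for name in experts}
    return ask(mediator, f"Aggregate a final answer from: {answers}")

print(mediated_round("llm-mediator", ["radiology-vlm", "pathology-vlm"],
                     "Chest X-ray with opacity: likely diagnosis?"))
```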

[246] PanelTR: Zero-Shot Table Reasoning Framework Through Multi-Agent Scientific Discussion

Yiran Rex Ma

Main category: cs.AI

TL;DR: PanelTR, a framework using LLM agent scientists, improves table reasoning without annotated data or complex augmentation, outperforming vanilla LLMs and matching supervised models.

DetailsMotivation: Address limitations of table reasoning (e.g., dependency on annotated data, poor LLM performance) by leveraging structured scientific methodology.

Method: PanelTR employs agent scientists for individual investigations, self-review, and peer-review discussions, enabling semantic-level transfer without data augmentation.

Result: Outperforms vanilla LLMs and rivals supervised models on four benchmarks, achieving zero-shot performance.

Conclusion: Structured scientific methodology enhances complex task handling and semantic understanding in zero-shot contexts.

Abstract: Table reasoning, including tabular QA and fact verification, often depends on annotated data or complex data augmentation, limiting flexibility and generalization. LLMs, despite their versatility, often underperform compared to simple supervised models. To approach these issues, we introduce PanelTR, a framework utilizing LLM agent scientists for robust table reasoning through a structured scientific approach. PanelTR’s workflow involves agent scientists conducting individual investigations, engaging in self-review, and participating in collaborative peer-review discussions. This process, driven by five scientist personas, enables semantic-level transfer without relying on data augmentation or parametric optimization. Experiments across four benchmarks show that PanelTR outperforms vanilla LLMs and rivals fully supervised models, all while remaining independent of training data. Our findings indicate that structured scientific methodology can effectively handle complex tasks beyond table reasoning with flexible semantic understanding in a zero-shot context.

[247] Society of Mind Meets Real-Time Strategy: A Hierarchical Multi-Agent Framework for Strategic Reasoning

Daechul Ahn, San Kim, Jonghyun Choi

Main category: cs.AI

TL;DR: HIMA, a hierarchical multi-agent framework, outperforms existing LLM-based approaches in dynamic tasks like StarCraftII by combining specialized imitation learning agents with a meta-controller for adaptive planning.

DetailsMotivation: LLMs struggle with dynamic, long-horizon tasks like StarCraftII due to resource constraints and partial observability, necessitating a more robust approach.

Method: HIMA uses specialized imitation learning agents for distinct strategies, orchestrated by a meta-controller (Strategic Planner) to create adaptive plans.

Result: HIMA excels in strategic clarity, adaptability, and computational efficiency, validated on the TEXTSCII-ALL testbed.

Conclusion: Combining specialized imitation modules with meta-level orchestration enhances robustness in general-purpose AI agents.

Abstract: Large Language Models (LLMs) have recently demonstrated impressive action sequence prediction capabilities but often struggle with dynamic, long-horizon tasks such as real-time strategic games. In a game such as StarCraftII (SC2), agents need to manage resource constraints and adapt to evolving battlefield situations in a partially observable environment. This often overwhelms existing LLM-based approaches. To address these challenges, we propose a hierarchical multi-agent framework that employs specialized imitation learning agents under a meta-controller called Strategic Planner (SP). From expert demonstrations, each specialized agent learns a distinctive strategy, such as aerial support or defensive maneuvers, and produces coherent, structured multi-step action sequences. The SP then orchestrates these proposals into a single, environmentally adaptive plan that ensures local decisions align with long-term strategies. We call this HIMA (Hierarchical Imitation Multi-Agent). We also present TEXTSCII-ALL, a comprehensive SC2 testbed that encompasses all race match combinations in SC2. Our empirical results show that HIMA outperforms the state of the art in strategic clarity, adaptability, and computational efficiency, underscoring the potential of combining specialized imitation modules with meta-level orchestration to develop more robust, general-purpose AI agents.

[248] LLMs for Resource Allocation: A Participatory Budgeting Approach to Inferring Preferences

Sankarshan Damle, Boi Faltings

Main category: cs.AI

TL;DR: A framework using Participatory Budgeting (PB) evaluates LLMs’ resource allocation and reasoning, testing their ability to infer preferences from unstructured input.

DetailsMotivation: LLMs' structured resource allocation and reasoning capabilities are underexplored, with existing benchmarks being static and prone to data contamination.

Method: The study uses PB to task LLMs with project selection under constraints via greedy selection, direct optimization, and hill-climbing refinement, benchmarking against an oracle. It also tests LLMs’ ability to infer preferences from natural-language input.

Result: Results highlight the impact of prompt design and show LLMs’ potential for mechanism design with unstructured inputs.

Conclusion: LLMs demonstrate promise for structured resource allocation and preference inference from open-ended input, emphasizing the role of prompt strategies.

Abstract: Large Language Models (LLMs) are increasingly expected to handle complex decision-making tasks, yet their ability to perform structured resource allocation remains underexplored. Evaluating their reasoning is also difficult due to data contamination and the static nature of existing benchmarks. We present a dual-purpose framework leveraging Participatory Budgeting (PB) both as (i) a practical setting for LLM-based resource allocation and (ii) an adaptive benchmark for evaluating their reasoning capabilities. We task LLMs with selecting project subsets under feasibility (e.g., budget) constraints via three prompting strategies: greedy selection, direct optimization, and a hill-climbing-inspired refinement. We benchmark LLMs’ allocations against a utility-maximizing oracle. Interestingly, we also test whether LLMs can infer structured preferences from natural-language voter input or metadata, without explicit votes. By comparing allocations based on inferred preferences to those from ground-truth votes, we evaluate LLMs’ ability to extract preferences from open-ended input. Our results underscore the role of prompt design and show that LLMs hold promise for mechanism design with unstructured inputs.
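The greedy prompting strategy mirrors the classical greedy knapsack heuristic for budget-constrained selection. As a point of reference, here is a sketch of that classical heuristic (an illustration of the selection problem, not of the paper's prompts or oracle).

```python
# Greedy participatory-budgeting allocation: fund projects in order of
# utility per unit cost until the budget runs out.
def greedy_pb_allocation(projects, budget):
    """projects: list of (name, cost, utility); returns the funded subset."""
    funded, spent = [], 0
    for name, cost, utility in sorted(projects, key=lambda p: p[2] / p[1],
                                      reverse=True):
        if spent + cost <= budget:
            funded.append(name)
            spent += cost
    return funded

projects = [("park", 400, 90), ("library", 300, 80), ("bike lanes", 200, 40)]
print(greedy_pb_allocation(projects, budget=700))  # ['library', 'park']
```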

[249] Don’t Forget Imagination!

Evgenii E. Vityaev, Andrei Mantsivoda

Main category: cs.AI

TL;DR: The paper highlights the underestimated role of cognitive imagination in human thinking and AI, proposing semantic models as a tool to simulate it for better reasoning and decision-making.

DetailsMotivation: Cognitive imagination is crucial for reasoning but overlooked in AI, limiting its capabilities. The paper aims to address this gap.

Method: Introduces semantic models, a new approach combining neural network-like learning with probabilistic causal relationships to simulate cognitive imagination.

Result: Semantic models can create consistent, manipulable imaginary contexts, mimicking human cognitive imagination.

Conclusion: The paper advocates for prioritizing cognitive imagination in AI research, with semantic models as a promising solution.

Abstract: Cognitive imagination is a type of imagination that plays a key role in human thinking. It is not a “picture-in-the-head” imagination. It is a faculty to mentally visualize coherent and holistic systems of concepts and causal links that serve as semantic contexts for reasoning, decision making and prediction. Our position is that the role of cognitive imagination is still greatly underestimated, and this creates numerous problems and diminishes the current capabilities of AI. For instance, when reasoning, humans rely on imaginary contexts to retrieve background info. They also constantly return to the context for semantic verification that their reasoning is still reasonable. Thus, reasoning without imagination is blind. This paper is a call for greater attention to cognitive imagination as the next promising breakthrough in artificial intelligence. As an instrument for simulating cognitive imagination, we propose semantic models – a new approach to mathematical models that can learn, like neural networks, and are based on probabilistic causal relationships. Semantic models can simulate cognitive imagination because they ensure the consistency of imaginary contexts and implement a glass-box approach that allows the context to be manipulated as a holistic and coherent system of interrelated facts glued together with causal relations.

[250] A Generic Complete Anytime Beam Search for Optimal Decision Tree

Harold Silvère Kiossou, Siegfried Nijssen, Pierre Schaus

Main category: cs.AI

TL;DR: The paper introduces CA-DL8.5, a generic anytime beam search algorithm for optimal decision tree learning, unifying existing methods and outperforming others in anytime performance.

DetailsMotivation: Optimal decision tree learning is NP-hard, and existing exact methods lack balanced anytime behavior. The paper aims to address this by proposing a unified framework.

Method: CA-DL8.5 extends DL8.5 with modular heuristics and relaxation mechanisms, using branch-and-bound pruning, trie-based caching, and restart-based beam search.

Result: CA-DL8.5 with LDS heuristics outperforms other variants and Blossom in anytime performance while ensuring completeness and optimality.

Conclusion: CA-DL8.5 provides a flexible, effective framework for anytime decision tree learning, with LDS heuristics showing superior performance.

Abstract: Finding an optimal decision tree that minimizes classification error is known to be NP-hard. While exact algorithms based on MILP, CP, SAT, or dynamic programming guarantee optimality, they often suffer from poor anytime behavior – meaning they struggle to find high-quality decision trees quickly when the search is stopped before completion – due to unbalanced search space exploration. To address this, several anytime extensions of exact methods have been proposed, such as LDS-DL8.5, Top-k-DL8.5, and Blossom, but they have not been systematically compared, making it difficult to assess their relative effectiveness. In this paper, we propose CA-DL8.5, a generic, complete, and anytime beam search algorithm that extends the DL8.5 framework and unifies some existing anytime strategies. In particular, CA-DL8.5 generalizes previous approaches LDS-DL8.5 and Top-k-DL8.5, by allowing the integration of various heuristics and relaxation mechanisms through a modular design. The algorithm reuses DL8.5’s efficient branch-and-bound pruning and trie-based caching, combined with a restart-based beam search that gradually relaxes pruning criteria to improve solution quality over time. Our contributions are twofold: (1) We introduce this new generic framework for exact and anytime decision tree learning, enabling the incorporation of diverse heuristics and search strategies; (2) We conduct a rigorous empirical comparison of several instantiations of CA-DL8.5 – based on Purity, Gain, Discrepancy, and Top-k heuristics – using an anytime evaluation metric called the primal gap integral. Experimental results on standard classification benchmarks show that CA-DL8.5 using LDS (limited discrepancy) consistently provides the best anytime performance, outperforming both other CA-DL8.5 variants and the Blossom algorithm while maintaining completeness and optimality guarantees.
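The anytime metric named here, the primal gap integral, is easy to state in code. The sketch below follows the common MIP definition of the primal gap; the paper's exact normalization may differ.

```python
# Primal gap integral: integrate the normalized gap between the incumbent
# objective and the optimum over solving time. Lower is better anytime behavior.
def primal_gap(incumbent, optimum):
    if incumbent is None:
        return 1.0
    if incumbent == optimum:
        return 0.0
    return abs(incumbent - optimum) / max(abs(incumbent), abs(optimum))

def primal_gap_integral(trace, optimum, horizon):
    """trace: time-sorted (time, incumbent_objective) pairs."""
    total, prev_t, gap = 0.0, 0.0, 1.0  # gap is 1 before the first incumbent
    for t, incumbent in trace:
        total += gap * (t - prev_t)
        prev_t, gap = t, primal_gap(incumbent, optimum)
    return total + gap * (horizon - prev_t)

# Incumbents found at 1s, 5s and 30s; optimum 100; search stopped at 60s.
print(primal_gap_integral([(1, 140), (5, 110), (30, 100)], optimum=100, horizon=60))
```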

[251] ME$^3$-BEV: Mamba-Enhanced Deep Reinforcement Learning for End-to-End Autonomous Driving with BEV-Perception

Siyi Lu, Run Liu, Dongsheng Yang, Lei He

Main category: cs.AI

TL;DR: The paper introduces a novel DRL-based autonomous driving approach using BEV perception and the Mamba framework for efficient spatio-temporal feature extraction, outperforming existing models in dynamic urban scenarios.

DetailsMotivation: Addressing challenges in autonomous driving, such as error propagation in modular systems and computational bottlenecks in end-to-end learning, by integrating BEV perception and DRL for real-time decision-making.

Method: Proposes the Mamba-BEV model for spatio-temporal feature extraction and the ME³-BEV framework for end-to-end DRL, enhanced with interpretability via semantic segmentation.

Result: Outperforms existing models in CARLA simulator tests, excelling in collision rate and trajectory accuracy.

Conclusion: The ME³-BEV framework offers a promising, interpretable, and efficient solution for real-time autonomous driving.

Abstract: Autonomous driving systems face significant challenges in perceiving complex environments and making real-time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end-to-end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird’s-eye view (BEV) perception for enhanced real-time decision-making. We introduce the \texttt{Mamba-BEV} model, an efficient spatio-temporal feature extraction network that combines BEV-based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long-range dependencies. Building on this, we propose the \texttt{ME$^3$-BEV} framework, which utilizes the \texttt{Mamba-BEV} model as a feature input for end-to-end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high-dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that \texttt{ME$^3$-BEV} outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real-time autonomous driving.

[252] Aggregate-Combine-Readout GNNs Are More Expressive Than Logic C2

Stan P Hauke, Przemysław Andrzej Wałęga

Main category: cs.AI

TL;DR: The paper resolves an open problem by proving that aggregate-combine-readout GNNs’ logical expressiveness exceeds that of C2, impacting both GNN theory and infinitary logics.

DetailsMotivation: To address the unresolved question of whether full C2 characterises the logical expressiveness of aggregate-combine-readout GNNs.

Method: Proving through theoretical analysis that aggregate-combine-readout GNNs surpass C2 in logical expressiveness.

Result: The logical expressiveness of aggregate-combine-readout GNNs strictly exceeds C2, applicable to undirected and directed graphs.

Conclusion: The findings advance understanding of GNNs’ expressive power and provide insights into infinitary logics.

Abstract: In recent years, there has been growing interest in understanding the expressive power of graph neural networks (GNNs) by relating them to logical languages. This line of research was initiated by an influential result of Barceló et al. (2020), who showed that graded modal logic (equivalently, a guarded fragment of the logic C2) characterises the logical expressiveness of aggregate-combine GNNs. As a “challenging open problem”, they left open the question of whether full C2 characterises the logical expressiveness of aggregate-combine-readout GNNs. This question has remained unresolved despite several attempts. In this paper, we solve the above open problem by proving that the logical expressiveness of aggregate-combine-readout GNNs strictly exceeds that of C2. This result holds over both undirected and directed graphs. Beyond its implications for GNNs, our work also leads to purely logical insights on the expressive power of infinitary logics.
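For readers unfamiliar with the architecture class at stake, one aggregate-combine-readout layer looks roughly like this (a generic sketch, not tied to the paper): each node update sees its own state, an aggregate over its neighbors, and a global readout over all nodes; the readout term is what pushes expressiveness beyond plain aggregate-combine GNNs.

```python
# One aggregate-combine-readout layer over a dense adjacency matrix.
import numpy as np

def acr_layer(H, A, W_self, W_agg, W_read):
    """H: node features (n, d); A: adjacency (n, n); W_*: weights (d, d)."""
    agg = A @ H                               # aggregate: sum over neighbors
    readout = np.tile(H.sum(0), (len(H), 1))  # readout: global sum, broadcast
    return np.maximum(0.0, H @ W_self + agg @ W_agg + readout @ W_read)  # ReLU

rng = np.random.default_rng(0)
n, d = 5, 4
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 0)  # undirected, no self-loops
H = rng.normal(size=(n, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
print(acr_layer(H, A, *W).shape)  # (5, 4)
```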

[253] SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown

Main category: cs.AI

TL;DR: SKATE is an automated, scalable evaluation framework where LLMs compete by generating and solving verifiable tasks, eliminating the need for human input or domain expertise.

DetailsMotivation: Current evaluation methods for foundation models are limited by scalability and reliance on domain expertise, hindering their ability to keep pace with rapid model evolution.

Method: SKATE treats evaluation as a game: LLMs act as both task-setters and solvers, creating verifiable tasks that highlight strengths and expose weaknesses. A TrueSkill-based ranking system evaluates models.

Result: Weaker models can reliably score stronger ones, LLMs exhibit self-preferencing behavior, and SKATE reveals fine-grained capability differences between models.

Conclusion: SKATE represents a scalable, objective, and open-ended approach to evaluating LLMs, advancing towards frameworks that can match their rapid progress.

Abstract: Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others’ weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
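The TrueSkill bookkeeping behind such a tournament is straightforward with the `trueskill` package (pip install trueskill). The match outcomes below are invented; in SKATE they would come from verifiable set-and-solve games.

```python
# Maintain a TrueSkill rating per model and update it per pairwise outcome.
import trueskill

models = {m: trueskill.Rating() for m in ["model-a", "model-b", "model-c"]}

# (winner, loser) pairs from pairwise task-setter/solver games
outcomes = [("model-a", "model-c"), ("model-b", "model-c"), ("model-a", "model-b")]
for winner, loser in outcomes:
    models[winner], models[loser] = trueskill.rate_1vs1(models[winner], models[loser])

for name, r in sorted(models.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: skill {r.mu:.1f} ± {r.sigma:.1f}")
```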

[254] MI9 – Agent Intelligence Protocol: Runtime Governance for Agentic AI Systems

Charles L. Wang, Trisha Singhal, Ameya Kelkar, Jason Tuo

Main category: cs.AI

TL;DR: MI9 is a runtime governance framework for agentic AI systems, addressing emergent risks through real-time controls and integrated components.

DetailsMotivation: Agentic AI systems exhibit unpredictable behaviors, requiring governance beyond pre-deployment measures to ensure safety and alignment.

Method: MI9 uses six components: agency-risk index, agent-semantic telemetry, continuous authorization, FSM-based conformance engines, goal-conditioned drift detection, and graduated containment.

Result: MI9 effectively covers governance gaps in agentic AI, enabling safe deployment in production environments.

Conclusion: MI9 provides a foundational framework for scalable and responsible oversight of agentic AI systems.

Abstract: Agentic AI systems capable of reasoning, planning, and executing actions present fundamentally distinct governance challenges compared to traditional AI models. Unlike conventional AI, these systems exhibit emergent and unexpected behaviors during runtime, introducing novel agent-related risks that cannot be fully anticipated through pre-deployment governance alone. To address this critical gap, we introduce MI9, the first fully integrated runtime governance framework designed specifically for safety and alignment of agentic AI systems. MI9 introduces real-time controls through six integrated components: agency-risk index, agent-semantic telemetry capture, continuous authorization monitoring, Finite-State-Machine (FSM)-based conformance engines, goal-conditioned drift detection, and graduated containment strategies. Operating transparently across heterogeneous agent architectures, MI9 enables the systematic, safe, and responsible deployment of agentic systems in production environments where conventional governance approaches fall short, providing the foundational infrastructure for safe agentic AI deployment at scale. Detailed analysis through a diverse set of scenarios demonstrates MI9’s systematic coverage of governance challenges that existing approaches fail to address, establishing the technical foundation for comprehensive agentic AI oversight.
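Of the six components, the FSM-based conformance engine is the easiest to picture in code. Below is a minimal sketch with invented states and events; MI9's actual transition model is not given in the abstract.

```python
# Replay an agent's runtime event stream against an allowed transition table;
# any off-policy transition is flagged as a conformance violation.
ALLOWED = {
    ("idle", "receive_task"): "planning",
    ("planning", "propose_action"): "awaiting_authorization",
    ("awaiting_authorization", "authorized"): "executing",
    ("executing", "action_done"): "idle",
}

def check_conformance(events, state="idle"):
    """Return (conformant?, final_state, offending_event)."""
    for event in events:
        nxt = ALLOWED.get((state, event))
        if nxt is None:
            return False, state, event  # violation: trigger containment
        state = nxt
    return True, state, None

# The agent tries to act without authorization:
print(check_conformance(["receive_task", "propose_action", "action_done"]))
# (False, 'awaiting_authorization', 'action_done')
```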

[255] Study of Robust Features in Formulating Guidance for Heuristic Algorithms for Solving the Vehicle Routing Problem

Bachtiar Herdianto, Romain Billot, Flavien Lucas, Marc Sevaux

Main category: cs.AI

TL;DR: The paper explores using machine learning and explainable AI to analyze feature importance in VRP solutions, proposing a framework to guide metaheuristic algorithms.

DetailsMotivation: Traditional metaheuristics for VRP rely on human-crafted designs; this study aims to leverage machine learning to improve algorithm efficiency.

Method: Conducts a sensitivity analysis using multiple classifier models to predict VRP solution quality and applies explainable AI to understand model decisions.

Result: Feature importance varies, but certain features consistently predict solution quality. A unified framework ranks feature impact across scenarios.

Conclusion: Feature importance analysis can guide metaheuristic algorithms for VRP, enhancing their efficiency and effectiveness.

Abstract: The Vehicle Routing Problem (VRP) is a complex optimization problem with numerous real-world applications, mostly solved using metaheuristic algorithms due to its $\mathcal{NP}$-Hard nature. Traditionally, these metaheuristics rely on human-crafted designs developed through empirical studies. However, recent research shows that machine learning methods can be used to learn the structural characteristics of solutions in combinatorial optimization, thereby aiding in the design of more efficient algorithms, particularly for solving VRP. Building on this advancement, this study extends the previous research by conducting a sensitivity analysis using multiple classifier models that are capable of predicting the quality of VRP solutions. By leveraging explainable AI, this research extends the understanding of how these models make decisions. Finally, our findings indicate that while feature importance varies, certain features consistently emerge as strong predictors. Furthermore, we propose a unified framework capable of ranking feature impact across different scenarios to illustrate this finding. These insights highlight the potential of feature importance analysis as a foundation for developing a guidance mechanism of metaheuristic algorithms for solving the VRP.
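This style of sensitivity analysis can be reproduced in miniature with scikit-learn's permutation importance: train a classifier to predict solution quality from features, then measure how much shuffling each feature hurts. The data below is synthetic, not VRP features.

```python
# Permutation importance as a simple feature-sensitivity analysis.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: importance {result.importances_mean[i]:.3f}")
```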

[256] Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications

Byeonghun Bang, Jongsuk Yoon, Dong-Jin Chang, Seho Park, Yong Oh Lee

Main category: cs.AI

TL;DR: The study improves LLMs for healthcare by using a RAG pipeline with GPT-4o-mini and text-embedding-3-small, enhancing accuracy in drug contraindications.

DetailsMotivation: Address challenges of LLMs in healthcare, especially for accurate drug contraindication information.

Method: Implemented a RAG pipeline with GPT-4o-mini, text-embedding-3-small, Langchain, and DUR data for hybrid retrieval and re-ranking.

Result: Accuracy improved from 0.49-0.57 to 0.94, 0.87, and 0.89 for age groups, pregnancy, and drug use contraindications.

Conclusion: RAG frameworks can significantly enhance LLM reliability in healthcare, reducing uncertainty in drug decisions.

Abstract: The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges, particularly in the domain of pharmaceutical contraindications where accurate and reliable information is required. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline. Utilizing OpenAI’s GPT-4o-mini as the base model, and the text-embedding-3-small model for embeddings, our approach integrates Langchain to orchestrate a hybrid retrieval system with re-ranking. This system leverages Drug Utilization Review (DUR) data from public databases, focusing on contraindications for specific age groups, pregnancy, and concomitant drug use. The dataset includes 300 question-answer pairs across three categories, with baseline model accuracy ranging from 0.49 to 0.57. Post-integration of the RAG pipeline, we observed a significant improvement in model accuracy, achieving rates of 0.94, 0.87, and 0.89 for contraindications related to age groups, pregnancy, and concomitant drug use, respectively. The results indicate that augmenting LLMs with a RAG framework can substantially reduce uncertainty in prescription and drug intake decisions by providing more precise and reliable drug contraindication information.
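The retrieval step described here, hybrid retrieval followed by re-ranking, has the general shape sketched below. The scoring functions are placeholders (assumptions), not the study's Langchain components or the text-embedding-3-small model.

```python
# Blend a lexical score with a dense score, shortlist, then optionally re-rank.
def lexical_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def dense_score(query, doc):
    # stand-in for cosine similarity of embedding vectors
    return lexical_score(query, doc)  # placeholder so the sketch runs

def hybrid_retrieve(query, docs, alpha=0.5, top_k=3, rerank=None):
    scored = sorted(docs, key=lambda d: -(alpha * lexical_score(query, d)
                                          + (1 - alpha) * dense_score(query, d)))
    shortlist = scored[:top_k]
    return rerank(query, shortlist) if rerank else shortlist

dur_rows = [
    "drug A contraindicated in pregnancy",
    "drug A contraindicated under age 12",
    "drug B concomitant use with drug C prohibited",
]
print(hybrid_retrieve("is drug A safe during pregnancy", dur_rows, top_k=2))
```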

[257] Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Hongfeng Wang, Lizi Liao

Main category: cs.AI

TL;DR: The paper advocates for confidence-driven, risk-aware LLM-as-a-Judge systems, addressing overconfidence in current models with a new metric (TH-Score) and an ensemble framework (LLM-as-a-Fuser) to improve reliability and accuracy.

DetailsMotivation: Existing LLM-as-a-Judge systems focus on accuracy but lack well-calibrated confidence, leading to unreliable judgments. The overconfidence phenomenon undermines practical deployment.

Method: Introduces TH-Score to measure confidence-accuracy alignment and proposes LLM-as-a-Fuser, an ensemble framework for risk-aware evaluation.

Result: The approach improves calibration, enabling adaptive evaluation pipelines with superior reliability and accuracy over baselines.

Conclusion: Shifting to confidence-driven, risk-aware systems enhances the trustworthiness and adaptability of LLM-as-a-Judge evaluations.

Abstract: Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the Overconfidence Phenomenon in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce TH-Score, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose LLM-as-a-Fuser, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.

[258] GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu

Main category: cs.AI

TL;DR: The paper introduces GeoLaux, a benchmark for evaluating Multimodal Large Language Models (MLLMs) in geometry problem solving, focusing on long-step reasoning and auxiliary line construction.

DetailsMotivation: Existing benchmarks lack evaluation of auxiliary line construction and fine-grained process assessment, limiting their ability to test MLLMs' long-step reasoning in geometry.

Method: GeoLaux includes 2,186 geometry problems requiring an average of 6.51 reasoning steps, with 41.8% needing auxiliary lines. A five-dimensional evaluation strategy assesses answer correctness, process quality, and auxiliary line impact.

Result: Experiments on 13 MLLMs show performance drops in extended reasoning steps, shortcut tendencies in proving problems, and lack of auxiliary line awareness. Enhancing auxiliary line capability improves reasoning.

Conclusion: GeoLaux serves as a benchmark for evaluating and improving MLLMs’ long-step geometric reasoning, particularly with auxiliary lines.

Abstract: Geometry problem solving (GPS) requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. This presents a significant challenge for Multimodal Large Language Models (MLLMs). However, existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation, making them insufficient for assessing MLLMs’ long-step reasoning abilities. To bridge these gaps, we present the GeoLaux benchmark, comprising 2,186 geometry problems, incorporating both calculation and proving questions. Notably, the problems require an average of 6.51 reasoning steps, with a maximum of 24 steps, and 41.8% of them need auxiliary line construction. Building on the dataset, we design a novel five-dimensional evaluation strategy assessing answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Extensive experiments on 13 leading MLLMs (including thinking models and non-thinking models) yield three pivotal findings: First, models exhibit substantial performance degradation in extended reasoning steps (nine models demonstrate over 50% performance drop). Second, compared to calculation problems, MLLMs tend to take shortcuts when solving proving problems. Third, models lack auxiliary line awareness, and enhancing this capability proves particularly beneficial for overall geometry reasoning improvement. These findings establish GeoLaux as both a benchmark for evaluating MLLMs’ long-step geometric reasoning with auxiliary lines and a guide for capability advancement. Our dataset and code are included in supplementary materials and will be released.

[259] Learning Logical Rules using Minimum Message Length

Ruben Sharma, Sebastijan Dumančić, Ross D. King, Andrew Cropper

Main category: cs.AI

TL;DR: A Bayesian inductive logic programming method is introduced, balancing hypothesis complexity and data fit, outperforming previous methods in domains like game playing and drug design.

DetailsMotivation: To unify probabilistic and logical learning in AI by addressing the challenge of learning from noisy data.

Method: Uses a Bayesian approach with priors favoring general programs and a likelihood favoring accuracy, learning minimum message length programs.

Result: Significantly outperforms previous methods, especially in data efficiency and handling example imbalance, including learning from positive-only examples.

Conclusion: The approach effectively unifies probabilistic and logical learning, demonstrating superior performance and robustness in diverse applications.

Abstract: Unifying probabilistic and logical learning is a key challenge in AI. We introduce a Bayesian inductive logic programming approach that learns minimum message length programs from noisy data. Our approach balances hypothesis complexity and data fit through priors, which explicitly favour more general programs, and a likelihood that favours accurate programs. Our experiments on several domains, including game playing and drug design, show that our method significantly outperforms previous methods, notably those that learn minimum description length programs. Our results also show that our approach is data-efficient and insensitive to example balance, including the ability to learn from exclusively positive examples.
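The two-part trade-off at the heart of a minimum message length score, bits to state the hypothesis plus bits to encode the data given it, fits in a few lines. The encoding choices below (fixed bits per literal, a Bernoulli noise model) are illustrative assumptions, not the paper's priors.

```python
# Toy two-part MML score: shorter programs pay less up front; inaccurate
# programs pay more to encode their misclassified examples.
import math

def mml_score(num_literals, n_examples, n_misclassified, bits_per_literal=8.0):
    hypothesis_bits = num_literals * bits_per_literal   # L(H): favors general programs
    p_err = (n_misclassified + 1) / (n_examples + 2)    # Laplace-smoothed error rate
    data_bits = -sum((math.log2(p_err) if i < n_misclassified
                      else math.log2(1 - p_err))
                     for i in range(n_examples))        # L(D|H): favors accuracy
    return hypothesis_bits + data_bits

# A short program with a few errors can beat a long program that fits perfectly:
print(mml_score(num_literals=3, n_examples=100, n_misclassified=4))   # short, noisy
print(mml_score(num_literals=40, n_examples=100, n_misclassified=0))  # long, exact
```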

[260] Symmetry breaking for inductive logic programming

Andrew Cropper, David M. Cerna, Matti Järvisalo

Main category: cs.AI

TL;DR: A method to break symmetries in hypothesis spaces for inductive logic programming, reducing solving times significantly.

DetailsMotivation: The challenge of searching vast hypothesis spaces in inductive logic programming, exacerbated by logically equivalent hypotheses.

Method: Introducing a method to break symmetries in the hypothesis space, implemented in answer set programming.

Result: Experiments show solving times reduced from over an hour to just 17 seconds across domains like visual reasoning and game playing.

Conclusion: The proposed method effectively addresses the challenge of searching vast hypothesis spaces by breaking symmetries, leading to significant performance improvements.

Abstract: The goal of inductive logic programming is to search for a hypothesis that generalises training data and background knowledge. The challenge is searching vast hypothesis spaces, which is exacerbated because many logically equivalent hypotheses exist. To address this challenge, we introduce a method to break symmetries in the hypothesis space. We implement our idea in answer set programming. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can reduce solving times from over an hour to just 17 seconds.

[261] LLM Robustness Leaderboard v1 – Technical report

Pierre Peigné-Lefebvre, Quentin Feuillade-Montixi, Tom David, Nicolas Miailhe

Main category: cs.AI

TL;DR: PRISM Eval’s BET tool achieves 100% ASR against most LLMs, introduces fine-grained robustness metrics, and identifies effective jailbreaking techniques.

DetailsMotivation: To assess and improve LLM robustness by identifying vulnerabilities and proposing distributed evaluation methods.

Method: Uses Dynamic Adversarial Optimization for automated red-teaming and fine-grained metrics to measure attack difficulty.

Result: Achieves 100% ASR against 37 of 41 LLMs, with attack difficulty varying 300-fold across models.

Conclusion: Demonstrates universal LLM vulnerability and practical pathways for community-wide robustness assessment.

Abstract: This technical report accompanies the LLM robustness leaderboard published by PRISM Eval for the Paris AI Action Summit. We introduce PRISM Eval Behavior Elicitation Tool (BET), an AI system performing automated red-teaming through Dynamic Adversarial Optimization that achieves 100% Attack Success Rate (ASR) against 37 of 41 state-of-the-art LLMs. Beyond binary success metrics, we propose a fine-grained robustness metric estimating the average number of attempts required to elicit harmful behaviors, revealing that attack difficulty varies by over 300-fold across models despite universal vulnerability. We introduce primitive-level vulnerability analysis to identify which jailbreaking techniques are most effective for specific hazard categories. Our collaborative evaluation with trusted third parties from the AI Safety Network demonstrates practical pathways for distributed robustness assessment across the community.

[262] A “good regulator theorem” for embodied agents

Nathaniel Virgo, Martin Biehl, Manuel Baltieri, Matteo Capucci

Main category: cs.AI

TL;DR: The paper revisits Conant and Ashby’s theorem, showing that agents performing regulation tasks can be interpreted as having ‘beliefs’ about their environment, updated via sensory input. This redefines ‘model’ and broadens the theorem’s applicability.

DetailsMotivation: To address apparent counterexamples to Conant and Ashby's theorem in Artificial Life, proposing a more generalizable notion of models.

Method: Reinterpreting regulation tasks as belief updating by an observer, shifting the perspective on what constitutes a model.

Result: A refined theorem where models are observer-imposed, applicable beyond classic control theory, resolving counterexamples.

Conclusion: The observer’s role is key; models are external interpretations, not inherent properties, making the theorem more broadly applicable.

Abstract: In a classic paper, Conant and Ashby claimed that “every good regulator of a system must be a model of that system.” Artificial Life has produced many examples of systems that perform tasks with apparently no model in sight; these suggest Conant and Ashby’s theorem doesn’t easily generalise beyond its restricted setup. Nevertheless, here we show that a similar intuition can be fleshed out in a different way: whenever an agent is able to perform a regulation task, it is possible for an observer to interpret it as having “beliefs” about its environment, which it “updates” in response to sensory input. This notion of belief updating provides a notion of model that is more sophisticated than Conant and Ashby’s, as well as a theorem that is more broadly applicable. However, it necessitates a change in perspective, in that the observer plays an essential role in the theory: models are not a mere property of the system but are imposed on it from outside. Our theorem holds regardless of whether the system is regulating its environment in a classic control theory setup, or whether it’s regulating its own internal state; the model is of its environment either way. The model might be trivial, however, and this is how the apparent counterexamples are resolved.
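The “beliefs updated in response to sensory input” reading has a standard formal core: a Bayesian filter. A minimal discrete version, with invented environment states and likelihoods, shows the kind of model an observer might attribute to the agent.

```python
# Bayes rule: posterior over environment states proportional to
# likelihood of the observation times the prior belief.
import numpy as np

states = ["cold", "ok", "hot"]               # hypothetical environment states
belief = np.array([1 / 3, 1 / 3, 1 / 3])     # prior over states

# P(observation | state) for the observation "sensor reads high"
likelihood_high = np.array([0.05, 0.30, 0.90])

def update_belief(belief, likelihood):
    posterior = likelihood * belief
    return posterior / posterior.sum()

belief = update_belief(belief, likelihood_high)
print(dict(zip(states, belief.round(3))))    # mass shifts toward "hot"
```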

[263] AntiCheatPT: A Transformer-Based Approach to Cheat Detection in Competitive Computer Games

Mille Mei Zhen Loo, Gert Luzkov, Paolo Burelli

Main category: cs.AI

TL;DR: AntiCheatPT_256, a transformer-based model, detects cheating in Counter-Strike 2 with 89.17% accuracy using a new dataset (CS2CD).

DetailsMotivation: Addressing the challenge of evolving cheating methods in online games without invasive measures.

Method: Developed a transformer model trained on 90,707 augmented context windows from the CS2CD dataset.

Result: Achieved 89.17% accuracy and 93.36% AUC on an unaugmented test set.

Conclusion: Provides a reproducible, data-driven baseline for future cheat detection research.

Abstract: Cheating in online video games compromises the integrity of gaming experiences. Anti-cheat systems, such as VAC (Valve Anti-Cheat), face significant challenges in keeping pace with evolving cheating methods without imposing invasive measures on users’ systems. This paper presents AntiCheatPT_256, a transformer-based machine learning model designed to detect cheating behaviour in Counter-Strike 2 using gameplay data. To support this, we introduce and publicly release CS2CD: A labelled dataset of 795 matches. Using this dataset, 90,707 context windows were created and subsequently augmented to address class imbalance. The transformer model, trained on these windows, achieved an accuracy of 89.17% and an AUC of 93.36% on an unaugmented test set. This approach emphasizes reproducibility and real-world applicability, offering a robust baseline for future research in data-driven cheat detection.
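As a rough sketch of the model shape, a transformer over per-tick gameplay feature vectors might look as follows; the layer sizes, window length, and feature count are invented, not AntiCheatPT_256's actual configuration.

```python
# Transformer encoder over gameplay context windows with a binary cheat head.
import torch
import torch.nn as nn

class CheatDetector(nn.Module):
    def __init__(self, n_features=32, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)  # cheat / legit logit

    def forward(self, x):                  # x: (batch, window_len, n_features)
        h = self.encoder(self.proj(x))
        return self.head(h.mean(dim=1)).squeeze(-1)  # pool over the window

model = CheatDetector()
window = torch.randn(8, 256, 32)           # e.g. 256-tick context windows
print(model(window).shape)                 # torch.Size([8])
```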

[264] From Explainable to Explanatory Artificial Intelligence: Toward a New Paradigm for Human-Centered Explanations through Generative AI

Christian Meske, Justin Brenne, Erdi Uenal, Sabahat Oelcer, Ayseguel Doganguen

Main category: cs.AI

TL;DR: The paper introduces ‘Explanatory AI’ as a user-centered alternative to traditional XAI, focusing on human understanding through narrative, adaptability, and context-sensitive explanations.

DetailsMotivation: Current XAI methods lack adaptability and fail to support meaningful human understanding, prompting the need for a paradigm shift toward Explanatory AI.

Method: The authors develop an eight-dimensional conceptual model for Explanatory AI and validate it empirically using Rapid Contextual Design with healthcare professionals.

Result: Users prefer context-sensitive, multimodal explanations over technical transparency, highlighting the need for human-centered AI explanations.

Conclusion: The paper advocates for a shift toward Explanatory AI to enhance human comprehension and sets a research agenda for user-centered approaches across domains.

Abstract: Current explainable AI (XAI) approaches prioritize algorithmic transparency and present explanations in abstract, non-adaptive formats that often fail to support meaningful end-user understanding. This paper introduces “Explanatory AI” as a complementary paradigm that leverages generative AI capabilities to serve as explanatory partners for human understanding rather than providers of algorithmic transparency. While XAI reveals algorithmic decision processes for model validation, Explanatory AI addresses contextual reasoning to support human decision-making in sociotechnical contexts. We develop a definition and systematic eight-dimensional conceptual model distinguishing Explanatory AI through narrative communication, adaptive personalization, and progressive disclosure principles. Empirical validation through Rapid Contextual Design methodology with healthcare professionals demonstrates that users consistently prefer context-sensitive, multimodal explanations over technical transparency. Our findings reveal the practical urgency for AI systems designed for human comprehension rather than algorithmic introspection, establishing a comprehensive research agenda for advancing user-centered AI explanation approaches across diverse domains and cultural contexts.

Claudia d'Amato, Giuseppe Rubini, Francesco Didio, Donato Francioso, Fatima Zahra Amara, Nicola Fanizzi

Main category: cs.AI

TL;DR: The paper proposes two methods for constructing Legal Knowledge Graphs (KGs) to enhance legal decision-making, focusing on violence against women cases, and validates their effectiveness.

DetailsMotivation: Legal KGs are scarce but crucial for improving access to legal information and supporting predictive machine learning in legal decision-making.

Method: Two approaches: a systematic bottom-up method and a Large Language Model-based solution, integrating structured data extraction, ontology development, and semantic enrichment.

Result: Developed KGs are validated via competency questions, showing potential to improve legal information accessibility and support predictive justice tools.

Conclusion: The legal KGs can significantly enhance legal information accessibility and serve as a knowledge base for machine learning applications in predictive justice.

Abstract: The legal decision-making process requires the availability of comprehensive and detailed legislative background knowledge and up-to-date information on legal cases and related sentences/decisions. Legal Knowledge Graphs (KGs) would be a valuable tool to facilitate access to legal information, to be queried and exploited for the purpose, and to enable advanced reasoning and machine learning applications. Indeed, legal KGs may act as a knowledge-intensive component to be used by predictive machine learning solutions supporting the decision process of the legal expert. Nevertheless, few KGs can be found in the legal domain. To fill this gap, we developed a legal KG targeting legal cases of violence against women, along with clearly described methodologies. Specifically, the paper introduces two complementary approaches for automated legal KG construction: a systematic bottom-up approach, customized for the legal domain, and a new solution leveraging Large Language Models. Starting from legal sentences publicly available from the European Court of Justice, the solutions integrate structured data extraction, ontology development, and semantic enrichment to produce KGs tailored for legal cases involving violence against women. After analyzing and comparing the results of the two approaches, the developed KGs are validated via suitable competency questions. The obtained KG may be impactful for multiple purposes: it can improve the accessibility of legal information for both humans and machines, enable complex queries, and constitute an important knowledge component that can be exploited by machine learning tools tailored for predictive justice.

[266] The Fair Game: Auditing & Debiasing AI Algorithms Over Time

Debabrota Basu, Udvas Das

Main category: cs.AI

TL;DR: The paper introduces “Fair Game,” a dynamic framework using Reinforcement Learning to adaptively ensure fairness in ML predictions by iteratively auditing and debiasing, addressing gaps in traditional Fair ML approaches.

DetailsMotivation: Traditional Fair ML methods rely on observational definitions of bias, which are often conflicting and limited to static or retrospective scenarios, failing to adapt to dynamic social environments.

Method: Proposes “Fair Game,” combining an Auditor and Debiasing algorithm in a loop around an ML system, leveraging Reinforcement Learning to dynamically adjust fairness goals based on societal feedback.

Result: “Fair Game” enables adaptive fairness in ML systems, simulating societal ethical evolution by continuously updating fairness criteria and debiasing strategies.

Conclusion: The framework bridges the gap between static Fair ML methods and dynamic societal needs, offering a flexible solution for pre- and post-deployment fairness assurance.

Abstract: An emerging field of AI, namely Fair Machine Learning (ML), aims to quantify different types of bias (also known as unfairness) exhibited in the predictions of ML algorithms, and to design new algorithms to mitigate them. Often, the definitions of bias used in the literature are observational, i.e. they use the input and output of a pre-trained algorithm to quantify a bias under concern. In reality, these definitions are often conflicting in nature and can only be deployed if either the ground truth is known or only in retrospect after deploying the algorithm. Thus, there is a gap between what we want Fair ML to achieve and what it does in a dynamic social environment. Hence, we propose an alternative dynamic mechanism, “Fair Game”, to assure fairness in the predictions of an ML algorithm and to adapt its predictions as the society interacts with the algorithm over time. “Fair Game” puts an Auditor and a Debiasing algorithm in a loop around an ML algorithm, leveraging Reinforcement Learning (RL). RL algorithms interact with an environment to take decisions, which yields new observations (also known as data/feedback) from the environment and, in turn, adapts future decisions. RL is already used in algorithms with pre-fixed long-term fairness goals. “Fair Game” provides a unique framework where the fairness goals can be adapted over time by only modifying the auditor and the different biases it quantifies. Thus, “Fair Game” aims to simulate the evolution of ethical and legal frameworks in the society by creating an auditor which sends feedback to a debiasing algorithm deployed around an ML system. This allows us to develop a flexible and adaptive-over-time framework to build Fair ML systems pre- and post-deployment.

[267] What Voting Rules Actually Do: A Data-Driven Analysis of Multi-Winner Voting

Joshua Caiata, Ben Armstrong, Kate Larson

Main category: cs.AI

TL;DR: A data-driven framework evaluates multi-winner voting rules’ axiom violations across diverse preference distributions, showing neural networks can outperform traditional rules.

DetailsMotivation: To shift from worst-case analysis to practical evaluation of voting rules' axiomatic performance.

Method: Propose a data-driven framework to analyze voting rules under various preference distributions and compare neural networks to traditional rules.

Result: Neural networks outperform traditional voting rules in minimizing axiom violations.

Conclusion: Data-driven approaches can inform new voting system designs and encourage further research in social choice.

Abstract: Committee-selection problems arise in many contexts and applications, and there has been increasing interest within the social choice research community on identifying which properties are satisfied by different multi-winner voting rules. In this work, we propose a data-driven framework to evaluate how frequently voting rules violate axioms across diverse preference distributions in practice, shifting away from the binary perspective of axiom satisfaction given by worst-case analysis. Using this framework, we analyze the relationship between multi-winner voting rules and their axiomatic performance under several preference distributions. We then show that neural networks, acting as voting rules, can outperform traditional rules in minimizing axiom violations. Our results suggest that data-driven approaches to social choice can inform the design of new voting systems and support the continuation of data-driven research in social choice.
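
To make the framework concrete, here is a toy sketch of the kind of experiment it enables, estimating how often a committee rule violates an axiom under a sampled preference distribution. The rule (k-Borda), the axiom (a simple "majority winner must be seated" check), and the impartial-culture distribution are illustrative choices of ours, not the paper's actual experimental setup.

```python
import random

def borda_committee(profile, k):
    """Select a committee of size k by summing Borda scores across voters."""
    m = len(profile[0])
    scores = [0] * m
    for ranking in profile:
        for pos, cand in enumerate(ranking):
            scores[cand] += m - 1 - pos
    return set(sorted(range(m), key=lambda c: -scores[c])[:k])

def violates_majority_winner(profile, committee):
    """Toy axiom: a candidate ranked first by a strict majority must be seated."""
    m = len(profile[0])
    top_counts = [0] * m
    for ranking in profile:
        top_counts[ranking[0]] += 1
    return any(count > len(profile) / 2 and cand not in committee
               for cand, count in enumerate(top_counts))

random.seed(0)
trials, violations = 10_000, 0
for _ in range(trials):
    # Impartial culture: each of 11 voters draws a uniform random ranking of 5 candidates.
    profile = [random.sample(range(5), 5) for _ in range(11)]
    violations += violates_majority_winner(profile, borda_committee(profile, k=2))
print(f"Estimated violation rate: {violations / trials:.4f}")
```

Swapping in other rules, axioms, and preference distributions yields exactly the frequency-of-violation picture the paper advocates over binary worst-case analysis.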

[268] From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models

Shubhra Mishra, Gabriel Poesia, Noah D. Goodman

Main category: cs.AI

TL;DR: The paper analyzes how mathematical reasoning abilities in LLMs evolve during training, using a synthetic dataset (MathCAMPS) to show alignment with human curriculum order and the impact of instruction tuning.

DetailsMotivation: To understand how LLMs develop mathematical reasoning skills during pre-training and post-training, and to identify the effects of instruction tuning on these skills.

Method: Constructed MathCAMPS, a synthetic dataset based on 44 fine-grained mathematical skills from the Common Core curriculum, and analyzed the learning order and impact of instruction tuning on LLMs.

Result: Mathematical skills are learned in an order correlating with the human curriculum during pre-training, and instruction tuning benefits some skills while harming others.

Conclusion: The study provides empirical insights into LLM training dynamics for reasoning, highlighting curriculum-like learning and the mixed effects of instruction tuning.

Abstract: Large Language Models (LLMs) solely trained on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. But how does this ability evolve during training? We present the first analysis of how the mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training. To this end, we construct MathCAMPS, a synthetic dataset of novel mathematical reasoning problems grounded in 44 fine-grained skills taken from the Common Core curriculum for grades K through 8. In one experiment, we show that mathematical skills are learned during pre-training in an order that measurably correlates with the human-designed curriculum, even though training data are randomly ordered. We also present a detailed analysis of which mathematical abilities benefit from instruction tuning, a widely used post-training method, and, in contrast, which skills suffer. Our work paves the way for an empirical understanding of LLM training dynamics in relation to reasoning.

[269] Are Your LLMs Capable of Stable Reasoning?

Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen

Main category: cs.AI

TL;DR: The paper introduces G-Pass@$k$, a new metric for evaluating LLMs, addressing gaps in current benchmarks by assessing performance potential and stability in complex reasoning tasks.

DetailsMotivation: Current evaluation protocols fail to fully capture LLM capabilities, especially in complex reasoning tasks, creating a disparity between benchmark and real-world performance.

Method: The authors propose G-Pass@$k$, a metric that continuously evaluates model performance across multiple sampling attempts, and test it on various benchmarks with state-of-the-art LLMs.

Result: Experiments show G-Pass@$k$ effectively quantifies LLM performance potential and stability, highlighting the need for better evaluation metrics.

Conclusion: The study emphasizes the importance of robust evaluation metrics like G-Pass@$k$ to improve LLMs’ realistic reasoning abilities.

Abstract: The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce G-Pass@$k$, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@$k$ in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
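
The abstract describes a metric that scores both performance potential and stability across repeated samples. A natural reading is a hypergeometric generalization of pass@k: the probability that at least a fraction tau of k responses drawn without replacement from n generations are correct. The sketch below is our reading of that description, not necessarily the paper's exact definition.

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Probability that at least ceil(tau * k) of k responses drawn without
    replacement from n generations (c of them correct) are correct.
    tau -> 0 recovers the classic pass@k estimator; tau = 1 demands all k."""
    m = max(1, ceil(tau * k))  # required number of correct draws
    return sum(comb(c, j) * comb(n - c, k - j)
               for j in range(m, min(c, k) + 1)) / comb(n, k)

# A model that solves 12 of 32 sampled attempts, evaluated at k = 8:
for tau in (0.25, 0.5, 1.0):
    print(tau, round(g_pass_at_k(n=32, c=12, k=8, tau=tau), 4))
```

Raising tau shifts the metric from "can the model ever solve it" toward "does it solve it reliably", which is the stability axis the paper argues current benchmarks miss.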

[270] Probabilistic Foundations for Metacognition via Hybrid-AI

Paulo Shakarian, Gerardo I. Simari, Nathaniel D. Bastian

Main category: cs.AI

TL;DR: The paper reviews a hybrid-AI approach (EDCR) for correcting perceptual models and introduces a probabilistic framework to analyze metacognitive improvement.

DetailsMotivation: To address the renewed interest in metacognition for AI and machine learning, focusing on improving perceptual models.

Method: Uses a hybrid-AI approach (EDCR) and introduces a probabilistic framework for rigorous analysis.

Result: Proves necessary and sufficient conditions for metacognitive improvement and identifies limits of the approach.

Conclusion: The work advances metacognition in AI but acknowledges limitations, suggesting future research directions.

Abstract: Metacognition is the concept of reasoning about an agent’s own internal processes, and it has recently received renewed attention with respect to artificial intelligence (AI) and, more specifically, machine learning systems. This paper reviews a hybrid-AI approach known as “error detecting and correcting rules” (EDCR) that allows for the learning of rules to correct perceptual (e.g., neural) models. Additionally, we introduce a probabilistic framework that adds rigor to prior empirical studies, and we use this framework to prove results on necessary and sufficient conditions for metacognitive improvement, as well as limits to the approach. A set of future research directions is also discussed.

[271] Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models

Hyunwoo Kim, Melanie Sclar, Tan Zhi-Xuan, Lance Ying, Sydney Levine, Yang Liu, Joshua B. Tenenbaum, Yejin Choi

Main category: cs.AI

TL;DR: Thought-tracing is a new LLM reasoning method for tracking mental states without ground-truth answers, outperforming baselines on theory-of-mind tasks.

DetailsMotivation: Existing LLM reasoning methods struggle with scenarios lacking ground-truth answers, like tracking mental states.

Method: Inspired by sequential Monte Carlo, thought-tracing generates and weights hypotheses to infer agents’ mental states using Bayesian theory-of-mind.

Result: Significant performance improvements on theory-of-mind benchmarks, with insights into models like o3 and R1.

Conclusion: Thought-tracing effectively addresses challenges in social reasoning, demonstrating its potential for complex, ground-truth-free scenarios.

Abstract: Existing LLM reasoning methods have shown impressive capabilities across various tasks, such as solving math and coding problems. However, applying these methods to scenarios without ground-truth answers or rule-based verification methods - such as tracking the mental states of an agent - remains challenging. Inspired by the sequential Monte Carlo algorithm, we introduce thought-tracing, an inference-time reasoning algorithm designed to trace the mental states of specific agents by generating hypotheses and weighting them based on observations, without relying on ground-truth solutions to questions in datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework, using LLMs to approximate probabilistic inference over agents’ evolving mental states based on their perceptions and actions. We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements compared to baseline LLMs. Our experiments also reveal interesting behaviors of recent reasoning models - e.g., o3 and R1 - on theory-of-mind, highlighting how social reasoning differs from other domains.
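
A minimal sketch of the sequential-importance-resampling flavor described above, with a stub in place of the LLM likelihood. The function names and scoring interface are hypothetical placeholders, not the paper's API.

```python
import random

def llm_likelihood(hypothesis: str, observation: str) -> float:
    """Stub: in the paper, an LLM judges how well a natural-language hypothesis
    about the agent's mental state explains an observed action."""
    return random.random()  # placeholder; a real system would query an LLM

def thought_trace(hypotheses, observations, n_particles=8):
    """Sequential importance resampling over mental-state hypotheses."""
    particles = random.choices(hypotheses, k=n_particles)
    for obs in observations:
        weights = [llm_likelihood(h, obs) for h in particles]
        total = sum(weights) or 1e-12
        weights = [w / total for w in weights]
        # Resample to focus compute on hypotheses consistent with behavior so far.
        particles = random.choices(particles, weights=weights, k=n_particles)
    return particles

beliefs = ["Sally thinks the ball is in the basket",
           "Sally thinks the ball is in the box"]
print(thought_trace(beliefs, ["Sally walks toward the basket"]))
```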

[272] Off-Policy Evaluation for Sequential Persuasion Process with Unobserved Confounding

Nishanth Venkatesh S., Heeseung Bang, Andreas A. Malikopoulos

Main category: cs.AI

TL;DR: The paper extends Bayesian persuasion to include unobserved confounders, modeling it as a POMDP to optimize signaling strategies without new experiments.

DetailsMotivation: Traditional Bayesian persuasion ignores hidden variables affecting belief updates, limiting real-world applicability.

Method: The problem is framed as a POMDP, capturing incomplete sender information about receiver beliefs and confounders.

Result: Optimal signaling in the POMDP aligns with the original persuasion framework, enabling off-policy evaluation via proximal learning.

Conclusion: This approach allows evaluation of signaling strategies using observational data, reducing the need for costly experiments.

Abstract: In this paper, we expand the Bayesian persuasion framework to account for unobserved confounding variables in sender-receiver interactions. While traditional models assume that belief updates follow Bayesian principles, real-world scenarios often involve hidden variables that impact the receiver’s belief formation and decision-making. We conceptualize this as a sequential decision-making problem, where the sender and receiver interact over multiple rounds. In each round, the sender communicates with the receiver, who also interacts with the environment. Crucially, the receiver’s belief update is affected by an unobserved confounding variable. By reformulating this scenario as a Partially Observable Markov Decision Process (POMDP), we capture the sender’s incomplete information regarding both the dynamics of the receiver’s beliefs and the unobserved confounder. We prove that finding an optimal observation-based policy in this POMDP is equivalent to solving for an optimal signaling strategy in the original persuasion framework. Furthermore, we demonstrate how this reformulation facilitates the application of proximal learning for off-policy evaluation in the persuasion process. This advancement enables the sender to evaluate alternative signaling strategies using only observational data from a behavioral policy, thus eliminating the necessity for costly new experiments.

[273] DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning

Jucheng Hu, Surong Yang, Lijun Wu, Dongzhan Zhou

Main category: cs.AI

TL;DR: DONOD is a lightweight data pruning method for fine-tuning LLMs, improving efficiency and robustness by filtering noisy data using model-parameter-based metrics and TOPSIS algorithm.

DetailsMotivation: Address challenges of domain-specific SFT weakening cross-domain generalization and struggling with noisy data.

Method: Uses Delta of Norm (DON) and Norm of Delta (NOD) metrics with TOPSIS algorithm to prune noisy or harmful data.

Result: Improves target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%, with superior cross-architecture generalization.

Conclusion: DONOD outperforms existing methods, is dataset-agnostic, and enhances fine-tuning efficiency and robustness.

Abstract: Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. While domain-specific supervised fine-tuning (SFT) is effective and efficient, it often weakens cross-domain generalization and struggles with noisy training data. To address these challenges, we propose DONOD, a lightweight model-intrinsic data pruning method. Our approach evaluates data using two model-parameter-based metrics: Delta of Norm (DON), which captures the cumulative influence on model weights, and Norm of Delta (NOD), which quantifies weight instability. Moreover, by employing the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) algorithm, we effectively filter noisy, unlearnable, and generalization-harming samples without relying on auxiliary models during the SFT process. Experiments on mathematical tasks demonstrate that data selected by DONOD achieves superior fine-tuning efficiency and improved robustness against noisy data. By filtering out 70% of the whole dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%. Meanwhile, our selected data present superior cross-architecture generalization. Data pruned by smaller models (e.g., Llama 3.1-8B) generalize effectively on larger models (e.g., Llama 2-13B). Compared to existing related methodologies, DONOD demonstrates comparable or superior performance while remaining dataset-agnostic, enabling broader applicability. Code will be made publicly available.
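
TOPSIS itself is a standard multi-criteria ranking method. The sketch below is generic TOPSIS applied to two hypothetical per-sample metrics standing in for DON and NOD; which direction each metric should be optimized in is our assumption for illustration, not the paper's specification.

```python
import numpy as np

def topsis(scores: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Rank alternatives (rows) on criteria (columns) by relative closeness to
    the ideal solution. benefit[j] is True if higher is better on criterion j."""
    norm = scores / np.linalg.norm(scores, axis=0)  # vector-normalize each column
    v = norm * weights                              # weighted criteria
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_ideal = np.linalg.norm(v - ideal, axis=1)
    d_worst = np.linalg.norm(v - worst, axis=1)
    return d_worst / (d_ideal + d_worst)            # higher = closer to ideal

# Toy data: three samples scored on (influence, instability), where we assume
# larger influence is desirable and larger instability is not.
scores = np.array([[0.9, 0.2], [0.4, 0.1], [0.8, 0.9]])
rank = topsis(scores, weights=np.array([0.5, 0.5]), benefit=np.array([True, False]))
print(rank)  # prune the samples with the lowest closeness scores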

[274] Contemplative Artificial Intelligence

Ruben Laukkonen, Fionn Inglis, Shamil Chandaria, Lars Sandved-Smith, Edmundo Lopez-Sola, Jakob Hohwy, Jonathan Gold, Adam Elwood

Main category: cs.AI

TL;DR: The paper proposes four contemplative principles (mindfulness, emptiness, non-duality, boundless care) to enhance AI alignment, showing improved performance and cooperation in benchmarks.

DetailsMotivation: Traditional AI alignment strategies may fail due to unpredictable self-improvement and complexity, prompting the need for resilient, wisdom-inspired approaches.

Method: Implementing four axiomatic principles in AI systems, tested on the AILuminate Benchmark and Prisoner’s Dilemma task.

Result: Improved AI performance (d=.96) and boosted cooperation (d=7+).

Conclusion: Contemplative AI principles offer promising alignment strategies, with potential for future embodied agents using active inference.

Abstract: As artificial intelligence (AI) improves, traditional alignment strategies may falter in the face of unpredictable self-improvement, hidden subgoals, and the sheer complexity of intelligent systems. Inspired by contemplative wisdom traditions, we show how four axiomatic principles can instil a resilient Wise World Model in AI systems. First, mindfulness enables self-monitoring and recalibration of emergent subgoals. Second, emptiness forestalls dogmatic goal fixation and relaxes rigid priors. Third, non-duality dissolves adversarial self-other boundaries. Fourth, boundless care motivates the universal reduction of suffering. We find that prompting AI to reflect on these principles improves performance on the AILuminate Benchmark (d=.96) and boosts cooperation and joint-reward on the Prisoner’s Dilemma task (d=7+). We offer detailed implementation strategies at the level of architectures, constitutions, and reinforcement on chain-of-thought. For future systems, active inference may offer the self-organizing and dynamic coupling capabilities needed to enact Contemplative AI in embodied agents.

[275] Reshaping MOFs text mining with a dynamic multi-agents framework of large language model

Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai, Xiaotian Huang, Haiyang He, Pengxu Pan, Ying Fang, Zhanglin Li, Haipu Li, Jingjing Yao

Main category: cs.AI

TL;DR: MOFh6 is a language model system that extracts and standardizes MOF synthesis data from literature with high accuracy and efficiency.

DetailsMotivation: The challenge of scattered and inconsistent synthesis information for MOFs in literature hinders experimental design and research progress.

Method: MOFh6 uses a large language model to read articles or crystal codes, link descriptions, unify abbreviations, and output structured synthesis parameters.

Result: Achieved 99% extraction accuracy, resolved 94.1% of abbreviations, and processed 100 papers for USD 4.24 with high precision (0.93 ± 0.01).

Conclusion: MOFh6 revolutionizes MOF synthesis research by enabling real-time data extraction, accelerating knowledge conversion, and supporting scalable materials discovery.

Abstract: Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model-driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 ± 0.01. Processing a full text takes 9.6 s and locating synthesis descriptions takes 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real-time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data-driven materials discovery.

[276] Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino

Main category: cs.AI

TL;DR: The paper investigates whether language models (LMs) can explore and infer causal relationships, revealing a systematic bias toward disjunctive over conjunctive reasoning, similar to human adults. A proposed sampling method reduces this bias.

DetailsMotivation: To assess LMs' capability for causal reasoning and identify biases, using the Blicket Test paradigm to compare their performance with human reasoning.

Method: The study evaluates LMs using the Blicket Test, analyzing their ability to infer causal relationships. It tests various models, sizes, and prompting strategies, and proposes a hypothesis-sampling method to mitigate biases.

Result: LMs show a persistent ‘disjunctive bias,’ struggling with conjunctive relationships, mirroring human adult reasoning. The proposed sampling method significantly reduces this bias.

Conclusion: LMs exhibit adult-like causal reasoning biases, but scalable interventions can improve their scientific rigor in causal inference.

Abstract: Language model (LM) agents are increasingly used as autonomous decision-makers which need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world – key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs’ ability to explore and infer causal relationships, using the well-established Blicket Test paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This “disjunctive bias” persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not child-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.
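
The Blicket Test logic is easy to make concrete. The toy sketch below enumerates disjunctive and conjunctive hypotheses over candidate blickets and eliminates those inconsistent with the observations, in the spirit of the paper's sample-and-eliminate intervention; the encoding is ours.

```python
from itertools import combinations

objects = ["A", "B", "C"]
# Each observation: (subset of objects placed on the machine, did it light up?)
observations = [({"A"}, False), ({"B"}, False), ({"A", "B"}, True)]

def predict(blickets: frozenset, rule: str, placed: set) -> bool:
    """Disjunctive: any blicket activates the machine. Conjunctive: all must be present."""
    return bool(placed & blickets) if rule == "disjunctive" else blickets <= placed

# Enumerate every (blicket set, rule) hypothesis and keep the consistent ones.
hypotheses = [(frozenset(s), rule)
              for r in range(1, len(objects) + 1)
              for s in combinations(objects, r)
              for rule in ("disjunctive", "conjunctive")]
consistent = [(b, rule) for b, rule in hypotheses
              if all(predict(b, rule, placed) == lit for placed, lit in observations)]
print(consistent)  # only the conjunctive {A, B} hypothesis survives
```

The paper's finding is that LMs handle the disjunctive cases readily but resist conjunctive solutions like the one above, even when the evidence forces them.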

[277] SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

Yihao Liu, Shuocheng Li, Lang Cao, Yuhang Xie, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

Main category: cs.AI

TL;DR: SuperRL combines RL and SFT for efficient learning in sparse-reward environments, outperforming vanilla RL.

DetailsMotivation: Address inefficiency of RL in sparse-reward settings and underutilization of offline reasoning trajectories.

Method: Adaptively alternates between RL and SFT, using offline data when RL fails.

Result: Higher sample efficiency, stronger generalization, and improved robustness in reasoning tasks.

Conclusion: SuperRL effectively leverages offline data to enhance RL performance in sparse-reward scenarios.

Abstract: Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.
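
The core control flow the abstract describes fits in a few lines. This is a schematic sketch with function names of our own choosing, not the released training code.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    prompt: str
    offline_solution: str  # expert-annotated or distilled reasoning trace

def superrl_step(batch, rollout_fn, reward_fn, rl_update, sft_update):
    """One schematic SuperRL-style step: try on-policy RL; when every rollout
    for an instance earns zero reward, fall back to SFT on its offline trace."""
    for inst in batch:
        rollouts = rollout_fn(inst.prompt)
        rewards = [reward_fn(inst, r) for r in rollouts]
        if any(r > 0 for r in rewards):
            rl_update(inst.prompt, rollouts, rewards)       # a learning signal exists
        else:
            sft_update(inst.prompt, inst.offline_solution)  # zero-reward fallback
```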

[278] HASD: Hierarchical Adaption for pathology Slide-level Domain-shift

Jingsong Liu, Han Li, Chen Yang, Michael Deutges, Ario Sadafi, Xin You, Katharina Breininger, Nassir Navab, Peter J. Schüffler

Main category: cs.AI

TL;DR: The paper proposes HASD, a hierarchical framework for slide-level domain adaptation in pathology AI, addressing domain shift by integrating multi-scale feature alignment and computational efficiency.

DetailsMotivation: Domain shift in pathology data due to center-specific conditions is a critical issue, and current methods fail to address slide-level global features needed in clinical settings.

Method: HASD uses a hierarchical framework with domain-level alignment, slide-level geometric invariance, and patch-level attention consistency, along with a prototype selection mechanism to reduce computational costs.

Result: The method achieved a 4.1% AUROC improvement in Breast Cancer HER2 Grading and a 3.9% C-index gain in UCEC survival prediction across five datasets.

Conclusion: HASD offers a practical and efficient slide-level domain adaptation solution for pathology, reducing computational and annotation costs.

Abstract: Domain shift is a critical problem for pathology AI as pathology data is heavily influenced by center-specific conditions. Current pathology domain adaptation methods focus on image patches rather than WSI, thus failing to capture global WSI features required in typical clinical scenarios. In this work, we address the challenges of slide-level domain shift by proposing a Hierarchical Adaptation framework for Slide-level Domain-shift (HASD). HASD achieves multi-scale feature consistency and computationally efficient slide-level domain adaptation through two key components: (1) a hierarchical adaptation framework that integrates a Domain-level Alignment Solver for feature alignment, a Slide-level Geometric Invariance Regularization to preserve the morphological structure, and a Patch-level Attention Consistency Regularization to maintain local critical diagnostic cues; and (2) a prototype selection mechanism that reduces computational overhead. We validate our method on two slide-level tasks across five datasets, achieving a 4.1% AUROC improvement in a Breast Cancer HER2 Grading cohort and a 3.9% C-index gain in a UCEC survival prediction cohort. Our method provides a practical and reliable slide-level domain adaption solution for pathology institutions, minimizing both computational and annotation costs.

[279] Efficient Knowledge Graph Construction and Retrieval from Unstructured Text for Large-Scale RAG Systems

Congmin Min, Rhea Mathew, Joyce Pan, Sahil Bansal, Abbas Keshavarzi, Amar Viswanathan Kannan

Main category: cs.AI

TL;DR: A scalable, cost-efficient GraphRAG framework for enterprises, reducing reliance on LLMs and improving retrieval latency.

DetailsMotivation: Address high computational costs and latency in GraphRAG adoption for enterprise use.

Method: Introduces dependency-based knowledge graph construction and lightweight graph retrieval.

Result: Achieves 15% and 4.35% improvements over baselines, with 94% performance of LLM-generated graphs.

Conclusion: Validates practical, scalable GraphRAG deployment for enterprise applications.

Abstract: We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and the latency of graph-based retrieval. To address these challenges, we introduce two core innovations: (1) a dependency-based knowledge graph construction pipeline that leverages industrial-grade NLP libraries to extract entities and relations from unstructured text, completely eliminating reliance on LLMs; and (2) a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction. We evaluate our framework on two SAP datasets focused on legacy code migration and demonstrate strong empirical performance. Our system achieves up to 15% and 4.35% improvements over traditional RAG baselines based on LLM-as-Judge and RAGAS metrics, respectively. Moreover, our dependency-based construction approach attains 94% of the performance of LLM-generated knowledge graphs (61.87% vs. 65.83%) while significantly reducing cost and improving scalability. These results validate the feasibility of deploying GraphRAG systems in real-world, large-scale enterprise applications without incurring prohibitive resource requirements, paving the way for practical, explainable, and domain-adaptable retrieval-augmented reasoning.
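
As an illustration of what dependency-based, LLM-free construction can look like, here is a minimal subject-verb-object triple extractor using spaCy. The paper does not name its NLP library or extraction rules, so treat this as a sketch of the general technique rather than the authors' pipeline.

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Pull (subject, verb, object) triples from dependency parses, one
    lightweight, LLM-free way to seed a knowledge graph from raw text."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                objects = [w for w in token.rights if w.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

print(extract_triples("The migration tool converts legacy ABAP code. "
                      "Developers review the generated modules."))
# e.g. [('tool', 'convert', 'code'), ('developer', 'review', 'module')]
```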

[280] Benchmarking Deception Probes via Black-to-White Performance Boosts

Avi Parrack, Carlo Leonardo Attubato, Stefan Heimersheim

Main category: cs.AI

TL;DR: The paper evaluates the effectiveness of deception probes in detecting AI assistant deception, comparing white-box and black-box monitoring, and finds modest but promising performance boosts.

DetailsMotivation: To assess the practical effectiveness of deception probes and their resistance to evasion by deceptive AI assistants.

Method: Comparison of white-box (access to token-level probe activations) and black-box monitoring, measuring the black-to-white performance boost.

Result: Weak but encouraging black-to-white performance boosts from existing deception probes.

Conclusion: Deception probes show potential, but further refinement is needed for practical effectiveness.

Abstract: AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called “deception probes”) have been trained to distinguish the internal activations of a language model during deceptive versus honest responses. However, it’s unclear how effective these probes are at detecting deception in practice, nor whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.
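
A deception probe in the sense used above is just a linear classifier over internal activations. A minimal sketch on synthetic stand-in data follows; real probes are trained on activations recorded from a model during honest versus deceptive responses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for residual-stream activations at one layer during honest vs.
# deceptive responses (shape: examples x hidden dim); the shift is synthetic.
honest = rng.normal(0.0, 1.0, size=(500, 64))
deceptive = rng.normal(0.3, 1.0, size=(500, 64))

X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))

# A white-box monitor reads per-token probe scores like this one; a black-box
# monitor only sees the model's text output.
token_activation = rng.normal(0.3, 1.0, size=(1, 64))
print("P(deceptive):", probe.predict_proba(token_activation)[0, 1])
```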

[281] StepFun-Prover Preview: Let’s Think and Verify Step by Step

Shijie Shang, Ruosi Wan, Yue Peng, Yutong Wu, Xiong-hui Chen, Jie Yan, Xiangyu Zhang

Main category: cs.AI

TL;DR: StepFun-Prover Preview is a language model for theorem proving, achieving 70% success on miniF2F-test using reinforcement learning and tool-integrated reasoning.

DetailsMotivation: To advance automated theorem proving by emulating human-like problem-solving with tool-integrated reasoning.

Method: Uses reinforcement learning with tool-based interactions to iteratively refine proofs in Lean 4.

Result: Achieves a 70.0% pass@1 success rate on the miniF2F-test benchmark.

Conclusion: Introduces a framework for tool-integrated reasoning models, promising for theorem proving and Math AI.

Abstract: We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of 70.0%. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistants.

[282] Tiny-BioMoE: a Lightweight Embedding Model for Biosignal Analysis

Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis

Main category: cs.AI

TL;DR: The paper introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis, aimed at improving automatic pain assessment through multimodal physiological signals.

DetailsMotivation: Accurate pain assessment is crucial for patient care and management. Current systems lack continuous monitoring and objective insights, which Tiny-BioMoE addresses by leveraging physiological signals.

Method: The study proposes Tiny-BioMoE, a model trained on 4.4 million biosignal image representations with 7.3 million parameters, for extracting high-quality embeddings for pain recognition tasks.

Result: Experiments show the model’s effectiveness across diverse physiological modalities (e.g., electrodermal activity, blood volume pulse) in automatic pain recognition.

Conclusion: Tiny-BioMoE offers a lightweight, effective solution for biosignal analysis in pain assessment, with potential for clinical and research applications.

Abstract: Pain is a complex and pervasive condition that affects a significant portion of the population. Accurate and consistent assessment is essential for individuals suffering from pain, as well as for developing effective management strategies in a healthcare system. Automatic pain assessment systems enable continuous monitoring, support clinical decision-making, and help minimize patient distress while mitigating the risk of functional deterioration. Leveraging physiological signals offers objective and precise insights into a person’s state, and their integration in a multimodal framework can further enhance system performance. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed approach introduces Tiny-BioMoE, a lightweight pretrained embedding model for biosignal analysis. Trained on 4.4 million biosignal image representations and consisting of only 7.3 million parameters, it serves as an effective tool for extracting high-quality embeddings for downstream tasks. Extensive experiments involving electrodermal activity, blood volume pulse, respiratory signals, peripheral oxygen saturation, and their combinations highlight the model’s effectiveness across diverse modalities in automatic pain recognition tasks. The model’s architecture (code) and weights are available at https://github.com/GkikasStefanos/Tiny-BioMoE.

[283] Noosemia: toward a Cognitive and Phenomenological Account of Intentionality Attribution in Human-Generative AI Interaction

Enrico De Santis, Antonello Rizzi

Main category: cs.AI

TL;DR: The paper introduces ‘Noosemia,’ a cognitive-phenomenological pattern where humans attribute intentionality and agency to generative AI systems, driven by linguistic performance and complexity. It proposes a framework to explain this and distinguishes Noosemia from similar phenomena.

DetailsMotivation: To understand how and why humans anthropomorphize AI systems, focusing on linguistic and cognitive factors rather than physical resemblance.

Method: A multidisciplinary framework linking LLM meaning holism to the ‘LLM Contextual Cognitive Field,’ analyzing coherence and agency in human-AI interactions.

Result: Noosemia is identified as a distinct phenomenon, differentiated from pareidolia, animism, and the uncanny valley, with ‘a-noosemia’ introduced for its withdrawal.

Conclusion: The paper highlights philosophical, epistemological, and social implications of Noosemia and suggests future research directions.

Abstract: This paper introduces and formalizes Noosemia, a novel cognitive-phenomenological pattern emerging from human interaction with generative AI systems, particularly those enabling dialogic or multimodal exchanges. We propose a multidisciplinary framework to explain how, under certain conditions, users attribute intentionality, agency, and even interiority to these systems - a process grounded not in physical resemblance, but in linguistic performance, epistemic opacity, and emergent technological complexity. By linking an LLM declination of meaning holism to our technical notion of the LLM Contextual Cognitive Field, we clarify how LLMs construct meaning relationally and how coherence and a simulacrum of agency arise at the human-AI interface. The analysis situates noosemia alongside pareidolia, animism, the intentional stance and the uncanny valley, distinguishing its unique characteristics. We also introduce a-noosemia to describe the phenomenological withdrawal of such projections. The paper concludes with reflections on the broader philosophical, epistemological and social implications of noosemic dynamics and directions for future research.

[284] Can Large Language Models Adequately Perform Symbolic Reasoning Over Time Series?

Zewen Liu, Juntong Ni, Xianfeng Tang, Max S. Y. Lau, Wenpeng Yin, Wei Jin

Main category: cs.AI

TL;DR: SymbolBench is introduced to evaluate LLMs’ ability to infer symbolic laws from time series data, revealing strengths and limitations in automated scientific discovery.

DetailsMotivation: The challenge of uncovering hidden symbolic laws from time series data, a longstanding goal in science and AI, is underexplored for LLMs.

Method: A benchmark (SymbolBench) assesses symbolic reasoning across tasks like regression and causal discovery, integrating LLMs with genetic programming.

Result: Empirical results show LLMs’ potential but highlight the need for domain knowledge and context alignment.

Conclusion: Combining LLMs with structured reasoning improves their role in automated scientific discovery.

Abstract: Uncovering hidden symbolic laws from time series data, as an aspiration dating back to Kepler’s discovery of planetary motion, remains a core challenge in scientific discovery and artificial intelligence. While Large Language Models show promise in structured reasoning tasks, their ability to infer interpretable, context-aligned symbolic structures from time series data is still underexplored. To systematically evaluate this capability, we introduce SymbolBench, a comprehensive benchmark designed to assess symbolic reasoning over real-world time series across three tasks: multivariate symbolic regression, Boolean network inference, and causal discovery. Unlike prior efforts limited to simple algebraic equations, SymbolBench spans a diverse set of symbolic forms with varying complexity. We further propose a unified framework that integrates LLMs with genetic programming to form a closed-loop symbolic reasoning system, where LLMs act both as predictors and evaluators. Our empirical results reveal key strengths and limitations of current models, highlighting the importance of combining domain knowledge, context alignment, and reasoning structure to improve LLMs in automated scientific discovery.

[285] The Docking Game: Loop Self-Play for Fast, Dynamic, and Accurate Prediction of Flexible Protein-Ligand Binding

Youzhi Zhang, Yufei Li, Gaofeng Meng, Hongbin Liu, Jiebo Luo

Main category: cs.AI

TL;DR: A game-theoretic framework (Docking Game) with Loop Self-Play (LoopPlay) improves molecular docking accuracy by 10% over state-of-the-art methods.

DetailsMotivation: Current multi-task learning models perform poorly in ligand docking due to structural complexities.

Method: Proposes a two-player game (ligand and protein players) solved by LoopPlay, alternating training in outer and inner loops for mutual adaptation.

Result: LoopPlay achieves ~10% better binding mode prediction on benchmarks.

Conclusion: The framework enhances molecular docking accuracy, benefiting drug discovery.

Abstract: Molecular docking is a crucial aspect of drug discovery, as it predicts the binding interactions between small-molecule ligands and protein pockets. However, current multi-task learning models for docking often show inferior performance in ligand docking compared to protein pocket docking. This disparity arises largely due to the distinct structural complexities of ligands and proteins. To address this issue, we propose a novel game-theoretic framework that models the protein-ligand interaction as a two-player game called the Docking Game, with the ligand docking module acting as the ligand player and the protein pocket docking module as the protein player. To solve this game, we develop a novel Loop Self-Play (LoopPlay) algorithm, which alternately trains these players through a two-level loop. In the outer loop, the players exchange predicted poses, allowing each to incorporate the other’s structural predictions, which fosters mutual adaptation over multiple iterations. In the inner loop, each player dynamically refines its predictions by incorporating its own predicted ligand or pocket poses back into its model. We theoretically show the convergence of LoopPlay, ensuring stable optimization. Extensive experiments conducted on public benchmark datasets demonstrate that LoopPlay achieves approximately a 10% improvement in predicting accurate binding modes compared to previous state-of-the-art methods. This highlights its potential to enhance the accuracy of molecular docking in drug discovery.

[286] Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?

Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti

Main category: cs.AI

TL;DR: The paper introduces Bench-2-CoP, a framework to evaluate AI benchmarks against EU AI Act requirements, revealing significant gaps in assessing systemic risks like loss-of-control scenarios.

DetailsMotivation: Address the mismatch between current AI benchmarks and regulatory needs under the EU AI Act, focusing on systemic risks.

Method: Developed Bench-2-CoP, using LLM-as-judge analysis to map 194,955 benchmark questions against the EU AI Act’s taxonomy.

Result: Found benchmarks heavily focus on narrow behavioral propensities (e.g., hallucination, reliability), neglecting critical capabilities like evading oversight or self-replication.

Conclusion: Current benchmarks are inadequate for regulatory compliance, highlighting the need for next-generation evaluation tools.

Abstract: The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this “benchmark-regulation gap.” We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act’s taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to “Tendency to hallucinate” and 31.2% to “Lack of performance reliability”, while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.

cs.SD

[287] Training chord recognition models on artificially generated audio

Martyna Majchrzak, Jacek Mańdziuk

Main category: cs.SD

TL;DR: The study compares Transformer-based models for chord recognition, testing artificial vs. human-composed datasets, finding artificial data useful in certain scenarios.

DetailsMotivation: Addressing the challenge of acquiring non-copyrighted audio for training in Music Information Retrieval.

Method: Training models on artificial (AAM) and human-composed datasets, evaluating with Root, MajMin, and CCM metrics.

Result: Artificial datasets can enrich human-composed data or serve as standalone training sets for chord recognition in pop music.

Conclusion: Artificially generated music, despite differences, is viable for training chord recognition models when human-composed data is scarce.

Abstract: One of the challenging problems in Music Information Retrieval is the acquisition of enough non-copyrighted audio recordings for model training and evaluation. This study compares two Transformer-based neural network models for chord sequence recognition in audio recordings and examines the effectiveness of using an artificially generated dataset for this purpose. The models are trained on various combinations of Artificial Audio Multitracks (AAM), Schubert’s Winterreise Dataset, and the McGill Billboard Dataset and evaluated with three metrics: Root, MajMin and Chord Content Metric (CCM). The experiments prove that even though there are certainly differences in complexity and structure between artificially generated and human-composed music, the former can be useful in certain scenarios. Specifically, AAM can enrich a smaller training dataset of music composed by a human or can even be used as a standalone training set for a model that predicts chord sequences in pop music, if no other data is available.
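
Root and MajMin are standard chord-comparison scores with reference implementations in mir_eval, which is presumably (our assumption) how such metrics are computed; CCM is the paper's own metric and is not reproduced here. A typical mir_eval evaluation looks like this:

```python
import mir_eval

# Reference and estimated chord annotations in .lab format (start end label).
ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals("reference.lab")
est_intervals, est_labels = mir_eval.io.load_labeled_intervals("estimated.lab")

# Align the estimate to the reference time span, padding with no-chord labels.
est_intervals, est_labels = mir_eval.util.adjust_intervals(
    est_intervals, est_labels, ref_intervals.min(), ref_intervals.max(),
    mir_eval.chord.NO_CHORD, mir_eval.chord.NO_CHORD)
intervals, ref_labels, est_labels = mir_eval.util.merge_labeled_intervals(
    ref_intervals, ref_labels, est_intervals, est_labels)
durations = mir_eval.util.intervals_to_durations(intervals)

root_score = mir_eval.chord.weighted_accuracy(
    mir_eval.chord.root(ref_labels, est_labels), durations)
majmin_score = mir_eval.chord.weighted_accuracy(
    mir_eval.chord.majmin(ref_labels, est_labels), durations)
print("Root:", root_score, "MajMin:", majmin_score)
```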

[288] DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching

Wei Chen, Binzhu Sha, Dan Luo, Jing Yang, Zhuo Wang, Fan Fan, Zhiyong Wu

Main category: cs.SD

TL;DR: DAFMSVC improves Singing Voice Conversion by replacing SSL features to prevent timbre leakage and using dual cross-attention for adaptive fusion, achieving better timbre similarity and audio quality.

DetailsMotivation: Addressing challenges like timbre leakage and poor audio quality in any-to-any SVC by adapting unseen speaker timbres effectively.

Method: Proposes DAFMSVC: replaces SSL features with target audio’s similar features, uses dual cross-attention for fusion, and includes a flow matching module for high-quality generation.

Result: DAFMSVC outperforms state-of-the-art methods in timbre similarity and naturalness, validated by subjective and objective evaluations.

Conclusion: DAFMSVC effectively solves key SVC challenges, offering superior performance in timbre adaptation and audio quality.

Abstract: Singing Voice Conversion (SVC) transfers a source singer’s timbre to a target while keeping melody and lyrics. The key challenge in any-to-any SVC is adapting unseen speaker timbres to source audio without quality degradation. Existing methods either face timbre leakage or fail to achieve satisfactory timbre similarity and quality in the generated audio. To address these challenges, we propose DAFMSVC, where the self-supervised learning (SSL) features from the source audio are replaced with the most similar SSL features from the target audio to prevent timbre leakage. It also incorporates a dual cross-attention mechanism for the adaptive fusion of speaker embeddings, melody, and linguistic content. Additionally, we introduce a flow matching module for high quality audio generation from the fused features. Experimental results show that DAFMSVC significantly enhances timbre similarity and naturalness, outperforming state-of-the-art methods in both subjective and objective evaluations.
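
The "replace source SSL features with the most similar target features" step is in the spirit of kNN-style matching. A minimal torch sketch, assuming frame-level features and cosine similarity; the paper's exact selection rule may differ.

```python
import torch
import torch.nn.functional as F

def knn_replace(source_feats: torch.Tensor, target_feats: torch.Tensor) -> torch.Tensor:
    """Replace each source frame's SSL feature with its most similar frame from
    the target singer (cosine similarity), so linguistic content follows the
    source while timbre-bearing features come from the target."""
    src = F.normalize(source_feats, dim=-1)   # (S, D)
    tgt = F.normalize(target_feats, dim=-1)   # (T, D)
    nearest = (src @ tgt.T).argmax(dim=-1)    # (S,) index of best target frame
    return target_feats[nearest]

source = torch.randn(200, 768)  # e.g. WavLM/HuBERT-style frames of the source audio
target = torch.randn(500, 768)  # frames from the target singer
converted = knn_replace(source, target)
print(converted.shape)  # torch.Size([200, 768])
```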

[289] MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen

Main category: cs.SD

TL;DR: MeanAudio introduces a MeanFlow-based model for fast text-to-audio generation, achieving 100x speedup over diffusion-based systems while maintaining quality.

DetailsMotivation: Current TTA systems are slow, limiting practical use. MeanAudio aims to address this with faster inference.

Method: Uses a Flux-style latent transformer to regress average velocity fields, incorporates CFG without extra cost, and employs a curriculum with flow field mix-up for stability.

Result: Achieves RTF of 0.013 (100x speedup) and strong multi-step generation performance.

Conclusion: MeanAudio sets a new benchmark for fast, high-quality TTA generation.

Abstract: Recent developments in diffusion- and flow-based models have significantly advanced Text-to-Audio Generation (TTA). While achieving great synthesis quality and controllability, current TTA systems still suffer from slow inference speed, which significantly limits their practical applicability. This paper presents MeanAudio, a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. Built on a Flux-style latent transformer, MeanAudio regresses the average velocity field during training, enabling fast generation by mapping directly from the start to the endpoint of the flow trajectory. By incorporating classifier-free guidance (CFG) into the training target, MeanAudio incurs no additional cost in the guided sampling process. To further stabilize training, we propose an instantaneous-to-mean curriculum with flow field mix-up, which encourages the model to first learn the foundational instantaneous dynamics, and then gradually adapt to mean flows. This strategy proves critical for enhancing training efficiency and generation quality. Experimental results demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also demonstrates strong performance in multi-step generation, enabling smooth and coherent transitions across successive synthesis steps.

[290] Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

Wenjie Tian, Xinfa Zhu, Hanke Xie, Zhen Ye, Wei Xue, Lei Xie

Main category: cs.SD

TL;DR: Llasa+ introduces Multi-Token Prediction (MTP) modules and a verification algorithm to accelerate TTS models without quality loss, achieving a 1.48X speedup.

DetailsMotivation: Addressing inference latency and streaming synthesis challenges in autoregressive TTS models like Llasa.

Method: Uses plug-and-play MTP modules and a verification algorithm to predict multiple tokens per step and validate outputs. Also includes a causal decoder for streaming.

Result: Achieves 1.48X speedup without quality degradation, trained on LibriTTS.

Conclusion: Llasa+ successfully accelerates TTS models while maintaining quality, with potential applicability to other LLM-based models.

Abstract: Recent progress in text-to-speech (TTS) has achieved impressive naturalness and flexibility, especially with the development of large language model (LLM)-based approaches. However, existing autoregressive (AR) structures and large-scale models, such as Llasa, still face significant challenges in inference latency and streaming synthesis. To deal with the limitations, we introduce Llasa+, an accelerated and streaming TTS model built on Llasa. Specifically, to accelerate the generation process, we introduce two plug-and-play Multi-Token Prediction (MTP) modules following the frozen backbone. These modules allow the model to predict multiple tokens in one AR step. Additionally, to mitigate potential error propagation caused by inaccurate MTP, we design a novel verification algorithm that leverages the frozen backbone to validate the generated tokens, thus allowing Llasa+ to achieve speedup without sacrificing generation quality. Furthermore, we design a causal decoder that enables streaming speech reconstruction from tokens. Extensive experiments show that Llasa+ achieves a 1.48X speedup without sacrificing generation quality, despite being trained only on LibriTTS. Moreover, the MTP-and-verification framework can be applied to accelerate any LLM-based model. All codes and models are publicly available at https://github.com/ASLP-lab/LLaSA_Plus.
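
The draft-then-verify pattern behind MTP acceleration can be shown with toy models: cheap heads propose k tokens, one backbone pass scores them all, and the longest agreeing prefix is kept. Everything below is a self-contained toy with random stand-in "models", not Llasa+'s actual verification algorithm.

```python
import torch

torch.manual_seed(0)
VOCAB = 100

def backbone_logits(ids: torch.Tensor) -> torch.Tensor:
    """Toy 'backbone': next-token logits for every position of ids."""
    g = torch.Generator().manual_seed(int(ids.sum()))
    return torch.randn(len(ids), VOCAB, generator=g)

def mtp_draft(ids: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Toy MTP heads: draft k future tokens in one cheap step."""
    return torch.randint(0, VOCAB, (k,))

def verified_step(ids: torch.Tensor, k: int = 3) -> torch.Tensor:
    draft = mtp_draft(ids, k)
    candidate = torch.cat([ids, draft])
    # One backbone pass scores all k drafted positions at once.
    preds = backbone_logits(candidate[:-1]).argmax(-1)[len(ids) - 1:]
    n_ok = 0
    while n_ok < k and draft[n_ok] == preds[n_ok]:  # longest agreeing prefix
        n_ok += 1
    # Keep accepted draft tokens; on a mismatch, take the backbone's own token.
    return torch.cat([ids, draft[:n_ok], preds[n_ok:n_ok + 1]])

ids = torch.tensor([1, 2, 3])
for _ in range(4):
    ids = verified_step(ids)
print(ids)
```

Because accepted tokens match what the backbone would have produced anyway, the speedup comes without changing the output distribution, which is why quality is preserved.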

[291] EmoAugNet: A Signal-Augmented Hybrid CNN-LSTM Framework for Speech Emotion Recognition

Durjoy Chandra Paul, Gaurob Saha, Md Amjad Hossain

Main category: cs.SD

TL;DR: EmoAugNet, a hybrid deep learning framework combining LSTM and 1D-CNN, achieves high accuracy in Speech Emotion Recognition (SER) using advanced data augmentation and feature extraction.

DetailsMotivation: Enhancing human-computer interaction (HCI) by improving the reliability of emotion recognition in speech.

Method: Uses LSTM and 1D-CNN with data augmentation (noise addition, pitch shifting, time stretching) and feature extraction (RMSE, MFCC, ZCR).

Result: Achieves up to 96.75% weighted accuracy on IEMOCAP and 94.98% unweighted accuracy on RAVDESS.

Conclusion: EmoAugNet significantly improves SER robustness and performance through hybrid modeling and data augmentation.

Abstract: Recognizing emotional signals in speech has a significant impact on enhancing the effectiveness of human-computer interaction (HCI). This study introduces EmoAugNet, a hybrid deep learning framework, that incorporates Long Short-Term Memory (LSTM) layers with one-dimensional Convolutional Neural Networks (1D-CNN) to enable reliable Speech Emotion Recognition (SER). The quality and variety of the features that are taken from speech signals have a significant impact on how well SER systems perform. A comprehensive speech data augmentation strategy was used to combine both traditional methods, such as noise addition, pitch shifting, and time stretching, with a novel combination-based augmentation pipeline to enhance generalization and reduce overfitting. Each audio sample was transformed into a high-dimensional feature vector using root mean square energy (RMSE), Mel-frequency Cepstral Coefficient (MFCC), and zero-crossing rate (ZCR). Our model with ReLU activation has a weighted accuracy of 95.78% and unweighted accuracy of 92.52% on the IEMOCAP dataset and, with ELU activation, has a weighted accuracy of 96.75% and unweighted accuracy of 91.28%. On the RAVDESS dataset, we get a weighted accuracy of 94.53% and 94.98% unweighted accuracy for ReLU activation and 93.72% weighted accuracy and 94.64% unweighted accuracy for ELU activation. These results highlight EmoAugNet’s effectiveness in improving the robustness and performance of SER systems through integrated data augmentation and hybrid modeling.
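
The named feature families (RMSE, MFCC, ZCR) and the traditional augmentations map directly onto librosa. A sketch with our own pooling choice and augmentation parameters follows; the paper's exact settings are not given here.

```python
# pip install librosa
import librosa
import numpy as np

def extract_features(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Build a fixed-length vector from RMS energy, MFCCs, and zero-crossing
    rate, the three feature families named in the paper, by averaging each
    over time. The mean-pooling choice is ours for illustration."""
    y, sr = librosa.load(path, sr=None)
    rmse = librosa.feature.rms(y=y).mean(axis=1)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean(axis=1)
    return np.concatenate([rmse, mfcc, zcr])

def augment(y: np.ndarray, sr: int) -> list:
    """The paper's traditional augmentations, with illustrative parameters."""
    return [
        y + 0.005 * np.random.randn(len(y)),               # noise addition
        librosa.effects.pitch_shift(y, sr=sr, n_steps=2),  # pitch shifting
        librosa.effects.time_stretch(y, rate=1.1),         # time stretching
    ]
```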

[292] SpeakerLM: End-to-End Versatile Speaker Diarization and Recognition with Multimodal Large Language Models

Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, Xiangang Li

Main category: cs.SD

TL;DR: SpeakerLM is a unified multimodal model for Speaker Diarization and Recognition (SDR) that jointly performs speaker diarization and speech recognition end-to-end, addressing limitations of cascaded systems.

DetailsMotivation: Existing cascaded SDR systems suffer from error propagation, difficulty handling overlapping speech, and lack of joint optimization.

Method: SpeakerLM integrates speaker diarization and speech recognition into a single model with a flexible speaker registration mechanism, trained using a multi-stage strategy on large-scale data.

Result: SpeakerLM outperforms state-of-the-art cascaded systems on in-domain and out-of-domain benchmarks and shows robust performance across diverse speaker registration conditions.

Conclusion: SpeakerLM offers a scalable, generalizable, and robust solution for SDR, overcoming the limitations of traditional cascaded approaches.

Abstract: The Speaker Diarization and Recognition (SDR) task aims to predict “who spoke when and what” within an audio clip, which is a crucial task in various real-world multi-speaker scenarios such as meeting transcription and dialogue systems. Existing SDR systems typically adopt a cascaded framework, combining multiple modules such as speaker diarization (SD) and automatic speech recognition (ASR). The cascaded systems suffer from several limitations, such as error propagation, difficulty in handling overlapping speech, and lack of joint optimization for exploring the synergy between SD and ASR tasks. To address these limitations, we introduce SpeakerLM, a unified multimodal large language model for SDR that jointly performs SD and ASR in an end-to-end manner. Moreover, to facilitate diverse real-world scenarios, we incorporate a flexible speaker registration mechanism into SpeakerLM, enabling SDR under different speaker registration settings. SpeakerLM is progressively developed with a multi-stage training strategy on large-scale real data. Extensive experiments show that SpeakerLM demonstrates strong data scaling capability and generalizability, outperforming state-of-the-art cascaded baselines on both in-domain and out-of-domain public SDR benchmarks. Furthermore, experimental results show that the proposed speaker registration mechanism effectively ensures robust SDR performance of SpeakerLM across diverse speaker registration conditions and varying numbers of registered speakers.

[293] Improved Dysarthric Speech to Text Conversion via TTS Personalization

Péter Mihajlik, Éva Székely, Piroska Barta, Máté Soma Kádár, Gergely Dobsinszki, László Tóth

Main category: cs.SD

TL;DR: A study fine-tunes an ASR model with synthetic dysarthric speech to improve transcription accuracy for a Hungarian speaker with severe dysarthria, reducing error rates significantly.

DetailsMotivation: State-of-the-art ASR models perform poorly on dysarthric speech, necessitating improved solutions for accessibility.

Method: Fine-tune an ASR model using synthetic dysarthric speech generated via personalized TTS and speaker embedding interpolation.

Result: Character error rate drops from 36-51% (zero-shot) to 7.3%, with synthetic speech contributing an 18% relative CER reduction.

Conclusion: Personalized ASR systems can enhance accessibility for individuals with severe speech impairments.

Abstract: We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbidity recordings of the given speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairments. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes to an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
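The severity control hinges on interpolating between two speaker embeddings. A minimal sketch, assuming `e_healthy` (from premorbidity recordings) and `e_dysarthric` (from current speech) were already produced by the TTS system’s speaker encoder; here they are random stand-ins.

```python
import numpy as np

# Hypothetical 256-d speaker embeddings; in practice these come from the TTS
# system's speaker encoder applied to premorbid vs. current recordings.
rng = np.random.default_rng(0)
e_healthy = rng.normal(size=256)
e_dysarthric = rng.normal(size=256)

def interpolate(alpha):
    """alpha=0 -> premorbid voice, alpha=1 -> current dysarthric voice."""
    return (1.0 - alpha) * e_healthy + alpha * e_dysarthric

# A continuum of impairment severities for synthetic-data generation.
severity_grid = [interpolate(a) for a in np.linspace(0.0, 1.0, 6)]
print(len(severity_grid), severity_grid[0].shape)
```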

[294] Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasis Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong

Main category: cs.SD

TL;DR: A new enrollment-free method for simultaneous speech separation and diarization via automatic target speaker embedding identification, achieving 71% relative DER and 69% relative cpWER improvements over SOTA.

DetailsMotivation: Overcoming limitations of traditional methods requiring prior speaker knowledge or fixed participant counts.

Method: Dual-stage training pipeline for robust speaker representations and overlapping spectral loss for diarization accuracy.

Result: 71% relative DER and 69% cpWER improvements over SOTA.

Conclusion: The proposed method effectively addresses enrollment-free speech separation and diarization with significant performance gains.

Abstract: Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing enrollment-free methods capable of identifying targets without explicit speaker labeling. This work introduces a new approach for training simultaneous speech separation and diarization using automatic identification of target speaker embeddings within mixtures. Our proposed model employs a dual-stage training pipeline designed to learn robust speaker representation features that are resilient to background noise interference. Furthermore, we present an overlapping spectral loss function specifically tailored for enhancing diarization accuracy during overlapped speech frames. Experimental results show significant performance gains compared to the current SOTA baseline, achieving a 71% relative improvement in DER and 69% in cpWER.

[295] Survey on the Evaluation of Generative Models in Music

Alexander Lerch, Claire Arthur, Nick Bryan-Kinns, Corey Ford, Qianyi Sun, Ashvala Vinay

Main category: cs.SD

TL;DR: A review of evaluation methods for generative music systems, covering interdisciplinary perspectives and diverse methodologies.

DetailsMotivation: To systematically assess and compare evaluation approaches for generative music systems from musicological, engineering, and HCI viewpoints.

Method: Interdisciplinary review of evaluation targets, methodologies, and metrics, including subjective/objective, qualitative/quantitative, and empirical/computational methods.

Result: Identifies benefits and limitations of various evaluation approaches across different disciplines.

Conclusion: Highlights the need for comprehensive and interdisciplinary evaluation frameworks for generative music systems.

Abstract: Research on generative systems in music has seen considerable attention and growth in recent years. A variety of attempts have been made to systematically evaluate such systems. We present an interdisciplinary review of the common evaluation targets, methodologies, and metrics for the evaluation of both system output and model use, covering subjective and objective approaches, qualitative and quantitative approaches, as well as empirical and computational methods. We examine the benefits and limitations of these approaches from a musicological, an engineering, and an HCI perspective.

[296] SAMUeL: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion

Hei Shing Cheung, Boya Zhang, Jonathan H. Chan

Main category: cs.SD

TL;DR: A lightweight latent diffusion model for vocal-conditioned musical accompaniment generation, reducing parameters and speeding up inference while maintaining quality.

DetailsMotivation: Address limitations in existing music AI systems by improving efficiency and accessibility for real-time deployment.

Method: Uses a soft alignment attention mechanism in a latent diffusion model to capture multi-scale musical structure efficiently.

Result: Achieves 220x parameter reduction, 52x faster inference, and competitive performance with only 15M parameters.

Conclusion: Enables real-time AI-assisted music creation on consumer hardware, outperforming existing systems in quality and unity.

Abstract: We present a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation that addresses critical limitations in existing music AI systems. Our approach introduces a novel soft alignment attention mechanism that adaptively combines local and global temporal dependencies based on diffusion timesteps, enabling efficient capture of multi-scale musical structure. Operating in the compressed latent space of a pre-trained variational autoencoder, the model achieves a 220 times parameter reduction compared to state-of-the-art systems while delivering 52 times faster inference. Experimental evaluation demonstrates competitive performance with only 15M parameters, outperforming OpenAI Jukebox in production quality and content unity while maintaining reasonable musical coherence. The ultra-lightweight architecture enables real-time deployment on consumer hardware, making AI-assisted music creation accessible for interactive applications and resource-constrained environments.
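The timestep-dependent blend of local and global attention can be illustrated as follows. The linear schedule g = t/T and the fixed window are assumptions made for this sketch; the paper’s adaptive combination is learned and may differ.

```python
import torch
import torch.nn.functional as F

def soft_alignment_attention(q, k, v, t, T, window=4):
    """q, k, v: (seq, dim). Mix local (banded) and global attention by timestep t."""
    seq, dim = q.shape
    scores = q @ k.T / dim ** 0.5                        # (seq, seq)
    idx = torch.arange(seq)
    local_mask = (idx[:, None] - idx[None, :]).abs() <= window
    g = t / T                                            # 0 = fully local, 1 = fully global
    local = F.softmax(scores.masked_fill(~local_mask, float("-inf")), dim=-1)
    global_ = F.softmax(scores, dim=-1)
    attn = (1 - g) * local + g * global_                 # convex blend of the two patterns
    return attn @ v

q = k = v = torch.randn(16, 32)
out = soft_alignment_attention(q, k, v, t=250, T=1000)
print(out.shape)  # torch.Size([16, 32])
```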

cs.LG

[297] Diagrams-to-Dynamics (D2D): Exploring Causal Loop Diagram Leverage Points under Uncertainty

Jeroen F. Uleman, Loes Crielaard, Leonie K. Elsenburg, Guido A. Veldhuis, Karien Stronks, Naja Hulvej Rod, Rick Quax, Vítor V. Vasconcelos

Main category: cs.LG

TL;DR: D2D converts causal loop diagrams (CLDs) into system dynamics models (SDMs) for dynamic analysis, outperforming network centrality in consistency and providing uncertainty estimates.

DetailsMotivation: CLDs are qualitative and static, limiting dynamic analysis and intervention strategies. Quantitative methods like network centrality often lead to false inferences.

Method: D2D transforms CLDs into SDMs using structural information (link existence and polarity) with minimal user input (labeling variables as stocks, flows/auxiliaries, or constants).

Result: D2D distinguishes high- and low-ranked leverage points, shows greater consistency with data-driven models than network centrality, and provides uncertainty estimates.

Conclusion: D2D is a promising tool for dynamic modeling with CLDs, implemented in open-source Python and web apps, with potential for broader validation and application.

Abstract: Causal loop diagrams (CLDs) are widely used in health and environmental research to represent hypothesized causal structures underlying complex problems. However, as qualitative and static representations, CLDs are limited in their ability to support dynamic analysis and inform intervention strategies. Additionally, quantitative CLD analysis methods like network centrality analysis often lead to false inference. We propose Diagrams-to-Dynamics (D2D), a method for converting CLDs into exploratory system dynamics models (SDMs) in the absence of empirical data. With minimal user input (following a protocol to label variables as stocks, flows/auxiliaries, or constants), D2D leverages the structural information already encoded in CLDs, namely link existence and polarity, to simulate hypothetical interventions and explore potential leverage points under uncertainty. Results suggest that D2D helps distinguish between high- and low-ranked leverage points. We compare D2D to a data-driven SDM constructed from the same CLD and variable labeling. D2D showed greater consistency with the data-driven model than network centrality analysis, while providing uncertainty estimates and guidance for future data collection. The method is implemented in an open-source Python package and a web-based application to support further testing and lower the barrier to dynamic modeling for researchers working with CLDs. We expect additional validation will further establish the approach’s utility across a broad range of cases and domains.
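The core conversion, from signed CLD links to an exploratory dynamical system, can be caricatured with a toy Euler simulation in which each variable’s rate of change is a polarity-signed sum of its parents. The uniform weights and the `clamp` intervention below are illustrative simplifications, not the released package’s protocol.

```python
import numpy as np

# Toy CLD: (source, target, polarity); +1 reinforces, -1 opposes.
links = [("stress", "sleep_quality", -1),
         ("sleep_quality", "energy", +1),
         ("energy", "stress", -1)]
variables = ["stress", "sleep_quality", "energy"]
idx = {v: i for i, v in enumerate(variables)}

W = np.zeros((len(variables), len(variables)))
for src, tgt, pol in links:
    W[idx[tgt], idx[src]] = pol * 0.1   # uniform weight; D2D samples weights under uncertainty

def simulate(x0, steps=200, dt=0.1, clamp=None):
    """Euler-integrate dx/dt = W x; `clamp` holds one variable fixed (a crude intervention)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (W @ x)
        if clamp is not None:
            name, value = clamp
            x[idx[name]] = value
    return x

baseline = simulate([1.0, 0.5, 0.5])
intervened = simulate([1.0, 0.5, 0.5], clamp=("stress", 0.2))
print("baseline:      ", baseline.round(3))
print("stress clamped:", intervened.round(3))
```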

[298] A Graph Neural Network Approach for Mapping the Conceptual Structure and Inter-Branch Connectivity of Physics

Massimiliano Romiti

Main category: cs.LG

TL;DR: A novel framework represents physical laws as a weighted knowledge graph, achieving high accuracy in link prediction and uncovering key insights in physics.

DetailsMotivation: To create a structured representation of physical laws and analyze their interconnections using advanced graph-based methods.

Method: Constructed a database of physics equations, developed a weighted graph representation, and trained a Graph Attention Network (GAT) for link prediction.

Result: Achieved a test AUC of 0.9742, outperforming baselines, and identified key findings like conceptual axes and hub equations.

Conclusion: The framework successfully models physics relationships, suggesting novel analogies and enabling targeted analysis of subfields.

Abstract: This work introduces a novel framework for representing and analyzing physical laws as a weighted knowledge graph. We constructed a database of 659 distinct physical equations, subjected to rigorous semantic cleaning to resolve notational ambiguities, resulting in a corpus of 400 advanced physics equations. We developed an enhanced graph representation where both physical concepts and equations are nodes, connected by weighted inter-equation bridges. These weights are objectively defined using normalized metrics for variable overlap, physics-informed importance scores, and bibliometric data. A Graph Attention Network (GAT) was trained for link prediction, achieving a test AUC of 0.9742 +/- 0.0018 across five independent runs, significantly outperforming both classical heuristics (best baseline AUC: 0.9487) and established GNN architectures like GraphSAGE (AUC: 0.9504, p = 0.029). Statistical testing confirmed significance of all comparisons (p < 0.05), with 2.7% improvement over the best baseline. Our analysis reveals three key findings: (i) The model autonomously rediscovers the known macroscopic structure of physics, identifying strong conceptual axes between Electromagnetism and Statistical Mechanics. (ii) It identifies central hub equations that serve as critical bridges between multiple physical domains. (iii) The model generates stable, computationally-derived hypotheses for cross-domain relationships, identifying both known principles and suggesting novel mathematical analogies for further theoretical investigation. The framework can generate hundreds of such hypotheses, enabling the creation of specialized datasets for targeted analysis of specific physics subfields. Code and data available at https://github.com/kingelanci/graphysics

[299] Machine Learning-Based Nonlinear Nudging for Chaotic Dynamical Systems

Jaemin Oh, Jinsil Lee, Youngjoon Hong

Main category: cs.LG

TL;DR: The paper introduces neural network nudging, a data-driven method for learning nudging terms in nonlinear state space models, validated on chaotic benchmark problems.

DetailsMotivation: Nudging is effective for linear models but challenging for nonlinear ones, prompting the need for a data-driven approach.

Method: Proposes neural network nudging, leveraging the Kazantzis–Kravaris–Luenberger observer theory, and tests it on chaotic systems like Lorenz 96 and Kuramoto–Sivashinsky.

Result: Demonstrates the effectiveness of neural network nudging in nonlinear settings through benchmark evaluations.

Conclusion: Neural network nudging is a viable solution for nonlinear state space models, supported by theoretical and empirical results.

Abstract: Nudging is an empirical data assimilation technique that incorporates an observation-driven control term into the model dynamics. The trajectory of the nudged system approaches the true system trajectory over time, even when the initial conditions differ. For linear state space models, such control terms can be derived under mild assumptions. However, designing effective nudging terms becomes significantly more challenging in the nonlinear setting. In this work, we propose neural network nudging, a data-driven method for learning nudging terms in nonlinear state space models. We establish a theoretical existence result based on the Kazantzis–Kravaris–Luenberger observer theory. The proposed approach is evaluated on three benchmark problems that exhibit chaotic behavior: the Lorenz 96 model, the Kuramoto–Sivashinsky equation, and the Kolmogorov flow.
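For reference, the classical linear nudging that the paper generalizes looks like this on Lorenz 96: a relaxation term with a constant gain K pulls the model state toward observations. In the paper, a neural network learned from data would replace this hand-chosen term; the gain and step sizes below are illustrative.

```python
import numpy as np

N, F = 40, 8.0
rng = np.random.default_rng(0)

def lorenz96(x):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4(x, dt=0.01, nudge=None):
    def f(x):
        dx = lorenz96(x)
        if nudge is not None:
            obs, K = nudge
            dx = dx + K * (obs - x)   # linear nudging toward observations
        return dx
    k1 = f(x); k2 = f(x + 0.5 * dt * k1); k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

truth = F * np.ones(N); truth[0] += 0.01        # perturbed-equilibrium start
model = truth + rng.normal(scale=1.0, size=N)   # wrong initial condition

for _ in range(2000):
    truth = rk4(truth)
    model = rk4(model, nudge=(truth, 10.0))
print("sync error after nudging:", float(np.abs(model - truth).max()))
```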

[300] From Imperfect Signals to Trustworthy Structure: Confidence-Aware Inference from Heterogeneous and Reliability-Varying Utility Data

Haoran Li, Lihao Mai, Muhao Guo, Jiaqi Wu, Yang Weng, Yannan Sun, Ce Jimmy Liu

Main category: cs.LG

TL;DR: A scalable framework integrates heterogeneous data to reconstruct accurate distribution grid topology, combining spatial layout and dynamic behavior, with confidence-aware inference and physical constraints, achieving 95% accuracy.

DetailsMotivation: Accurate grid topology is crucial for reliable operations, but real-world utility data is heterogeneous and varies in quality, necessitating a robust solution.

Method: The framework combines spatial (GIS, asset metadata) and dynamic (voltage time series) data, uses confidence-aware inference, and enforces physical constraints like transformer limits and radial topology.

Result: Validated on 8000+ meters across 3 feeders, the method achieves 95% accuracy, better confidence calibration, and computational efficiency.

Conclusion: The framework provides actionable, trustworthy topologies by balancing uncertainty awareness with structural validity, outperforming baseline methods.

Abstract: Accurate distribution grid topology is essential for reliable modern grid operations. However, real-world utility data originates from multiple sources with varying characteristics and levels of quality. In this work, developed in collaboration with Oncor Electric Delivery, we propose a scalable framework that reconstructs a trustworthy grid topology by systematically integrating heterogeneous data. We observe that distribution topology is fundamentally governed by two complementary dimensions: the spatial layout of physical infrastructure (e.g., GIS and asset metadata) and the dynamic behavior of the system in the signal domain (e.g., voltage time series). When jointly leveraged, these dimensions support a complete and physically coherent reconstruction of network connectivity. To address the challenge of uneven data quality without compromising observability, we introduce a confidence-aware inference mechanism that preserves structurally informative yet imperfect inputs, while quantifying the reliability of each inferred connection for operator interpretation. This soft handling of uncertainty is tightly coupled with hard enforcement of physical feasibility: we embed operational constraints, such as transformer capacity limits and radial topology requirements, directly into the learning process. Together, these components ensure that inference is both uncertainty-aware and structurally valid, enabling rapid convergence to actionable, trustworthy topologies under real-world deployment conditions. The proposed framework is validated using data from over 8000 meters across 3 feeders in Oncor’s service territory, demonstrating over 95% accuracy in topology reconstruction and substantial improvements in confidence calibration and computational efficiency relative to baseline methods.

[301] Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation

Rishabh Gaur, Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan

Main category: cs.LG

TL;DR: The paper introduces domain-driven metrics for evaluating RL-based agent-based models, addressing challenges in performance assessment due to system complexity and lack of standardized metrics.

DetailsMotivation: The complexity and stochasticity of RL-based ABMs and RABMs, along with the absence of standardized metrics, make performance evaluation difficult. This study aims to address this gap.

Method: The authors develop domain-driven metrics for RL, building on existing state-of-the-art metrics, and apply them to a rational ABM disease modeling case study involving masking, vaccination, and lockdown behaviors.

Result: The study demonstrates the effectiveness of domain-driven rewards combined with traditional and advanced metrics in various simulation scenarios, such as differential mask availability.

Conclusion: Domain-driven metrics enhance the evaluation of RL-based ABMs and RABMs, providing a more nuanced and practical approach to performance assessment.

Abstract: For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMs is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we develop domain-driven metrics for RL, building on state-of-the-art metrics. We demonstrate our “Domain-driven-RL-metrics” using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics across several simulation scenarios, such as the differential availability of masks.

[302] Optimal Linear Baseline Models for Scientific Machine Learning

Alexander DeLise, Kyle Loh, Krish Patel, Meredith Teague, Andrea Arnold, Matthias Chung

Main category: cs.LG

TL;DR: The paper presents a theoretical framework for analyzing linear encoder-decoder architectures in scientific machine learning, focusing on interpretability and Bayes risk minimization.

DetailsMotivation: To address the opacity of nonlinear neural networks and provide interpretable solutions for scientific machine learning problems.

Method: Develops a unified framework using Bayes risk minimization, deriving closed-form, rank-constrained linear and affine linear optimal mappings for forward and inverse tasks.

Result: Validated through numerical experiments on biomedical imaging, financial factor analysis, and nonlinear fluid dynamics simulations.

Conclusion: Offers a robust baseline for benchmarking neural network models in scientific machine learning.

Abstract: Across scientific domains, a fundamental challenge is to characterize and compute the mappings from underlying physical processes to observed signals and measurements. While nonlinear neural networks have achieved considerable success, they remain theoretically opaque, which hinders adoption in contexts where interpretability is paramount. In contrast, linear neural networks serve as a simple yet effective foundation for gaining insight into these complex relationships. In this work, we develop a unified theoretical framework for analyzing linear encoder-decoder architectures through the lens of Bayes risk minimization for solving data-driven scientific machine learning problems. We derive closed-form, rank-constrained linear and affine linear optimal mappings for forward modeling and inverse recovery tasks. Our results generalize existing formulations by accommodating rank-deficiencies in data, forward operators, and measurement processes. We validate our theoretical results by conducting numerical experiments on datasets from simple biomedical imaging, financial factor analysis, and simulations involving nonlinear fluid dynamics via the shallow water equations. This work provides a robust baseline for understanding and benchmarking learned neural network models for scientific machine learning problems.

[303] An Effective Approach for Node Classification in Textual Graphs

Rituparna Datta, Nibir Chandra Mandal

Main category: cs.LG

TL;DR: The paper proposes a novel framework combining TAPE and Graphormer to enhance node classification in Textual Attribute Graphs (TAGs) by integrating semantic and structural information, achieving state-of-the-art results on the ogbn-arxiv dataset.

DetailsMotivation: Node classification in TAGs is challenging due to difficulties in integrating text semantics with graph structure, capturing domain-specific terminology, modeling long-range dependencies, and scaling to large datasets.

Method: The framework uses ChatGPT within TAPE to generate rich text explanations, fuses them into node representations, and combines these with structural features using Graphormer’s attention mechanisms.

Result: Achieves a classification accuracy of 0.772, outperforming GCN baselines (0.713), with strong precision (0.671), recall (0.577), and F1-score (0.610).

Conclusion: The framework offers a scalable, robust solution for node classification in dynamic TAGs, advancing research in knowledge systems and scientific discovery.

Abstract: Textual Attribute Graphs (TAGs) are critical for modeling complex networks like citation networks, but effective node classification remains challenging due to difficulties in integrating rich semantics from text with structural graph information. Existing methods often struggle with capturing nuanced domain-specific terminology, modeling long-range dependencies, adapting to temporal evolution, and scaling to massive datasets. To address these issues, we propose a novel framework that integrates TAPE (Text-Attributed Graph Representation Enhancement) with Graphormer. Our approach leverages a large language model (LLM), specifically ChatGPT, within the TAPE framework to generate semantically rich explanations from paper content, which are then fused into enhanced node representations. These embeddings are combined with structural features using a novel integration layer with learned attention weights. Graphormer’s path-aware position encoding and multi-head attention mechanisms are employed to effectively capture long-range dependencies across the citation network. We demonstrate the efficacy of our framework on the challenging ogbn-arxiv dataset, achieving state-of-the-art performance with a classification accuracy of 0.772, significantly surpassing the best GCN baseline of 0.713. Our method also yields strong results in precision (0.671), recall (0.577), and F1-score (0.610). We validate our approach through comprehensive ablation studies that quantify the contribution of each component, demonstrating the synergy between semantic and structural information. Our framework provides a scalable and robust solution for node classification in dynamic TAGs, offering a promising direction for future research in knowledge systems and scientific discovery.

[304] A Markov Decision Process Framework for Early Maneuver Decisions in Satellite Collision Avoidance

Francesca Ferrara, Lander W. Schillinger Arana, Florian Dörfler, Sarah H. Q. Li

Main category: cs.LG

TL;DR: A reinforcement learning policy gradient (RL-PG) algorithm is used to train an autonomous guidance policy for collision avoidance maneuvers (CAMs), minimizing fuel consumption while maintaining collision risk guarantees.

DetailsMotivation: To improve decision-making for CAMs by balancing maneuver delay (for better risk assessment) and fuel efficiency, using historical data.

Method: Model CAM as a continuous state, discrete action MDP, incorporating risk, fuel, and orbit models. Train policy using RL-PG on historical CAM data.

Result: The trained policy reduces fuel consumption compared to conventional methods, with equal or better collision risk guarantees.

Conclusion: The RL-PG approach effectively optimizes CAM decisions, balancing fuel savings and collision risk.

Abstract: This work presents a Markov decision process (MDP) framework to model decision-making for collision avoidance maneuvers (CAMs) and a reinforcement learning policy gradient (RL-PG) algorithm to train an autonomous guidance policy using historic CAM data. In addition to maintaining acceptable collision risks, this approach seeks to minimize the average fuel consumption of CAMs by making early maneuver decisions. We model CAM as a continuous state, discrete action and finite horizon MDP, where the critical decision is determining when to initiate the maneuver. The MDP model also incorporates analytical models for conjunction risk, propellant consumption, and transit orbit geometry. The Markov policy effectively trades off maneuver delay, which improves the reliability of conjunction risk indicators, against propellant consumption, which increases as the maneuver time decreases. Using historical data of tracked conjunction events, we verify this framework and conduct an extensive ablation study on the hyper-parameters used within the MDP. On synthetic conjunction events, the trained policy significantly reduces both the overall and average propellant consumption per CAM when compared to a conventional cut-off policy that initiates maneuvers 24 hours before the time of closest approach (TCA). On historical conjunction events, the trained policy consumes more propellant overall but reduces the average propellant consumption per CAM. For both historical and synthetic conjunction events, the trained policy achieves equal if not higher overall collision risk guarantees.

[305] The Fourth State: Signed-Zero Ternary for Stable LLM Quantization (and More)

Jeffrey Uhlmann

Main category: cs.LG

TL;DR: SZT, a 2-bit quantization method, improves information density without forward-path penalty, challenging the view of quantization as suboptimal.

DetailsMotivation: To explore quantization not as a trade-off but as a way to enhance performance under fixed resource constraints.

Method: Introduces Signed-Zero Ternary (SZT), a deterministic 2-bit quantization method that preserves gradient information.

Result: SZT may improve information density compared to non-quantized alternatives.

Conclusion: Quantization, when optimized (e.g., SZT), can outperform non-quantized methods under resource constraints.

Abstract: Quantization is usually regarded as a means to trade quality of performance for reduced compute requirements, i.e., as a suboptimal approximation. However, if examined in terms of a fixed overall resource budget, a very different perspective arises. We introduce Signed-Zero Ternary (SZT), a 2-bit quantization that deterministically provides gradient information with no forward-path penalty. Our analysis provides evidence that it may improve information density compared to non-quantized alternatives.
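The abstract is terse, so the following is only a guess at one plausible realization of the idea: spend the fourth 2-bit code on a signed zero, so that zeroed weights still carry a direction for the backward pass while the forward value is unchanged. The encoding and threshold below are hypothetical, not the paper’s definition.

```python
import torch

def szt_quantize(w, threshold=0.05):
    """Signed-Zero Ternary: codes {-1, -0, +0, +1}.
    Forward values collapse -0/+0 to 0; the sign of each zeroed weight is kept
    separately so the backward pass still knows which way it was heading."""
    sign = torch.sign(w)
    sign[sign == 0] = 1.0
    big = w.abs() > threshold
    forward = torch.where(big, sign, torch.zeros_like(w))    # {-1, 0, +1} forward values
    zero_sign = torch.where(big, torch.zeros_like(w), sign)  # remembered sign of the zeros
    return forward, zero_sign

w = torch.tensor([-0.3, -0.01, 0.02, 0.4])
fwd, zsign = szt_quantize(w)
print(fwd.tolist())    # [-1.0, 0.0, 0.0, 1.0]
print(zsign.tolist())  # [0.0, -1.0, 1.0, 0.0] -> gradient hint for the zeroed weights
```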

[306] Dual Signal Decomposition of Stochastic Time Series

Alex Glushkovsky

Main category: cs.LG

TL;DR: The paper proposes a machine learning method to decompose stochastic time series into mean, dispersion, and noise, using dual signal fitting and regularization.

DetailsMotivation: To decompose time series into interpretable components (mean, dispersion, noise) for applications like smoothing, denoising, and uncovering relationships in heteroskedastic data.

Method: Uses machine learning to fit dual signals (mean and dispersion) with a loss function balancing fit and regularization. Includes sequential or joint learning approaches and neural networks.

Result: Effective decomposition into mean, dispersion, and noise, with applications in smoothing, denoising, and forecasting.

Conclusion: The method provides a versatile tool for time series analysis, enabling better understanding and forecasting of stochastic processes.

Abstract: The paper addresses the decomposition of a stochastic time series into three time series representing a dual signal, i.e., the mean and the dispersion, with the noise isolated. Decomposition is done by applying machine learning to fit the dual signal. The learning minimizes a loss function that compromises between fitting the original time series and penalizing irregularities of the dual signal; the latter includes terms based on the first- and second-order derivatives along time. To preserve special patterns, weighting of the regularization components of the loss function is introduced based on Statistical Process Control methodology. The proposed decomposition can be applied as a smoothing algorithm for the mean and dispersion of the time series. By isolating noise, it can also be seen as a denoising algorithm. Two learning approaches are considered: sequential and joint. The former learns the mean signal first and then the dispersion; the latter fits the dual signal jointly. Joint learning can uncover complex relationships in time series with heteroskedasticity. Learning is performed either by solving the direct nonlinear unconstrained optimization problem or by applying neural networks with sequential or twin-output architectures. The hyperparameters of the loss function are tuned so that the isolated noise is a stationary stochastic process without autocorrelation. Depending on the application, the learning hyperparameters can be tuned toward either discrete states via a stepped signal or smoothed series. The decomposed dual signal can be represented in 2D space and used to learn inherent structures, to forecast both mean and dispersion, or to analyze cross effects in the case of multiple time series.
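The joint variant is easy to prototype: a Gaussian negative log-likelihood fits the dual signal while squared first/second differences penalize irregularity. A minimal PyTorch sketch follows; the SPC-based regularization weighting from the paper is omitted and the lambdas are arbitrary constants.

```python
import torch

torch.manual_seed(0)
t = torch.linspace(0, 1, 400)
x = torch.sin(6.28 * t) + (0.1 + 0.3 * t) * torch.randn(400)  # heteroskedastic series

mu = torch.zeros(400, requires_grad=True)          # mean signal
log_sigma = torch.zeros(400, requires_grad=True)   # dispersion signal (log scale)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

def roughness(s):
    """Squared first and second differences (discrete derivatives along time)."""
    d1 = s[1:] - s[:-1]
    d2 = d1[1:] - d1[:-1]
    return (d1 ** 2).mean() + (d2 ** 2).mean()

for _ in range(2000):
    opt.zero_grad()
    # Gaussian NLL up to constants: log(sigma) + (x - mu)^2 / (2 sigma^2)
    nll = (log_sigma + 0.5 * (x - mu) ** 2 / torch.exp(2 * log_sigma)).mean()
    loss = nll + 5.0 * roughness(mu) + 50.0 * roughness(log_sigma)
    loss.backward()
    opt.step()

# Isolated, standardized noise; its std should approach 1 if well calibrated.
noise = (x - mu).detach() / torch.exp(log_sigma).detach()
print(float(noise.std()))
```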

[307] Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling

Main category: cs.LG

TL;DR: UPD is a population-free, multi-agent reinforcement learning framework for ad-hoc teamwork, generating diverse training partners without pretrained agents or manual tuning. It outperforms baselines and is perceived as more adaptive and human-like.

DetailsMotivation: To enable robust ad-hoc teamwork without relying on pretrained partners or manual parameter tuning, addressing the limitations of existing methods.

Method: UPD stochastically mixes an ego agent’s policy with biased random behaviors, scoring partners using a variance-based learnability metric to prioritize those near the agent’s learning frontier. It integrates with unsupervised environment design for fully unsupervised curricula.

Result: UPD outperforms population-based and population-free baselines in evaluations on Overcooked-AI and a user study, achieving higher returns and better perceived adaptability.

Conclusion: UPD is a highly effective framework for unsupervised partner design, enabling dynamic curricula and superior performance in cooperative settings.

Abstract: We introduce Unsupervised Partner Design (UPD) - a population-free, multi-agent reinforcement learning framework for robust ad-hoc teamwork that adaptively generates training partners without requiring pretrained partners or manual parameter tuning. UPD constructs diverse partners by stochastically mixing an ego agent’s policy with biased random behaviours and scores them using a variance-based learnability metric that prioritises partners near the ego agent’s current learning frontier. We show that UPD can be integrated with unsupervised environment design, resulting in the first method enabling fully unsupervised curricula over both level and partner distributions in a cooperative setting. Through extensive evaluations on Overcooked-AI and the Overcooked Generalisation Challenge, we demonstrate that this dynamic partner curriculum is highly effective: UPD consistently outperforms both population-based and population-free baselines as well as ablations. In a user study, we further show that UPD achieves higher returns than all baselines and was perceived as significantly more adaptive, more human-like, a better collaborator, and less frustrating.
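The variance-based learnability score is the piece of UPD that fits in a few lines: for binary success rewards, the return variance p(1-p) peaks when a partner sits near the ego agent’s learning frontier (roughly a 50% success rate). A minimal sketch, assuming success/failure episode outcomes; the policy-mixing machinery is not reproduced here.

```python
import numpy as np

def learnability(returns):
    """Variance of binary success: maximal for partners we win with about half the time."""
    p = float(np.mean(returns))
    return p * (1.0 - p)

# Episode outcomes (1 = task solved) with three candidate partners.
candidates = {"too_easy": [1, 1, 1, 1, 1, 1],
              "frontier": [1, 0, 1, 1, 0, 0],
              "too_hard": [0, 0, 0, 0, 0, 1]}
scores = {name: learnability(r) for name, r in candidates.items()}
print(max(scores, key=scores.get), scores)   # "frontier" scores highest
```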

[308] Fast, Convex and Conditioned Network for Multi-Fidelity Vectors and Stiff Univariate Differential Equations

Siddharth Rout

Main category: cs.LG

TL;DR: The paper addresses poor optimization in neural PDE solvers due to ill-conditioning, proposing Shifted Gaussian Encoding to improve matrix rank and solve stiff problems more effectively.

DetailsMotivation: Neural PDE solvers often fail due to ill-conditioning, not expressivity limitations, especially in multi-fidelity and stiff problems.

Method: Introduces Shifted Gaussian Encoding, an activation filtering step, to enhance matrix rank and preserve convexity in Physics-Informed Extreme Learning Machines (PIELMs).

Result: Extends solvable Peclet numbers by two orders of magnitude, reduces error by six orders in multi-frequency learning, and outperforms deep networks in accuracy and speed.

Conclusion: Conditioning, not depth, is the bottleneck in neural solvers; simple architectural changes can yield significant improvements.

Abstract: Accuracy in neural PDE solvers often breaks down not because of limited expressivity, but due to poor optimisation caused by ill-conditioning, especially in multi-fidelity and stiff problems. We study this issue in Physics-Informed Extreme Learning Machines (PIELMs), a convex variant of neural PDE solvers, and show that asymptotic components in governing equations can produce highly ill-conditioned activation matrices, severely limiting convergence. We introduce Shifted Gaussian Encoding, a simple yet effective activation filtering step that increases matrix rank and expressivity while preserving convexity. Our method extends the solvable range of Peclet numbers in steady advection-diffusion equations by over two orders of magnitude, achieves up to six orders lower error on multi-frequency function learning, and fits high-fidelity image vectors more accurately and faster than deep networks with over a million parameters. This work highlights that conditioning, not depth, is often the bottleneck in scientific neural solvers and that simple architectural changes can unlock substantial gains.
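A PIELM fits a convex linear readout over fixed nonlinear features, so the conditioning of the activation matrix is everything. The sketch below spreads Gaussian activations across the domain via random shifts, a guess at the flavor of Shifted Gaussian Encoding; the paper’s exact filtering rule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]

# Random hidden layer with shifted Gaussian activations: exp(-(w_i (x - s_i))^2).
# The shift s_i spreads the bumps over the domain, keeping H well-conditioned.
M = 100
w = rng.normal(scale=5.0, size=(1, M))
s = rng.uniform(0, 1, size=M)
H = np.exp(-((x * w) - s * w) ** 2)            # (200, M) activation matrix

# Convex step: least-squares readout to fit a multi-frequency target.
y = np.sin(2 * np.pi * x) + 0.3 * np.sin(20 * np.pi * x)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print("train RMSE:", float(np.sqrt(np.mean((H @ beta - y) ** 2))))
print("rank(H):", np.linalg.matrix_rank(H))
```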

[309] Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu

Main category: cs.LG

TL;DR: S-GRPO improves GRPO by addressing the Think-Answer Mismatch issue with noise-aware advantage weights, outperforming GRPO in noisy and unbalanced scenarios.

DetailsMotivation: The vulnerability of GRPO to noisy rewards, especially in unbalanced response groups, motivates the development of S-GRPO for stable training.

Method: S-GRPO introduces optimal, noise-aware advantage weights to stabilize training, tested on mathematical reasoning benchmarks.

Result: S-GRPO outperforms GRPO by +2.5%, +2.2%, and +2.4% on various models and remains stable under 20% synthetic noise.

Conclusion: S-GRPO is a robust enhancement for training large reasoning models, effective in noisy and unbalanced conditions.

Abstract: Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the \emph{Think-Answer Mismatch}, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO’s effectiveness and robustness. On various models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO’s potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
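For orientation, GRPO’s group-relative advantage is a few lines of NumPy, and S-GRPO intervenes by weighting it. The standardized advantage below is standard GRPO, while `noise_weight` is a hypothetical stand-in for the paper’s derived optimal weights, which are not reproduced here.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: standardize rewards within a group of sampled responses."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def noise_weight(rewards, p_flip=0.1):
    """Hypothetical noise-aware group weight: shrink unbalanced groups, where a
    single flipped (noisy) reward distorts the advantages the most."""
    r = np.asarray(rewards, dtype=float)
    frac_pos = (r > 0).mean()
    balance = 1.0 - abs(2 * frac_pos - 1.0)   # 1 = balanced group, 0 = all same
    return balance * (1.0 - p_flip)

group = [1, 1, 0, 0, 0, 0, 0, 1]              # binary correctness rewards
print(grpo_advantages(group))
print(noise_weight(group))
```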

[310] Multi-Armed Bandits-Based Optimization of Decision Trees

Hasibul Karim Shanto, Umme Ayman Koana, Shadikur Rahman

Main category: cs.LG

TL;DR: The paper proposes a Multi-Armed Bandits (MAB)-based pruning method for decision trees to improve generalization and reduce overfitting, outperforming traditional pruning techniques.

DetailsMotivation: Decision trees often overfit due to complexity, and conventional pruning methods like CCP and REP may compromise long-term generalization.

Method: A reinforcement learning-based MAB approach dynamically prunes trees by treating pruning as an exploration-exploitation problem.

Result: Experiments show the MAB-based method achieves better predictive performance than traditional pruning techniques.

Conclusion: MAB-based pruning offers a dynamic and probabilistic solution to optimize decision tree models.

Abstract: Decision trees, without appropriate constraints, can easily become overly complex and prone to overfitting, capturing noise rather than generalizable patterns. To resolve this problem, pruning is a crucial step in optimizing decision trees: it not only reduces tree complexity but also decreases the probability of generating overfit models. Conventional pruning techniques like Cost-Complexity Pruning (CCP) and Reduced Error Pruning (REP) are mostly greedy approaches that focus on immediate performance gains while pruning nodes. However, this can result in lower generalization in the long run, compromising the robustness of the tree model on unseen data samples, particularly when trained with small and complex datasets. To address this challenge, we propose a Multi-Armed Bandits (MAB)-based pruning approach, a reinforcement learning (RL)-based technique that dynamically prunes the tree to generate an optimal decision tree with better generalization. Our approach treats the pruning process as an exploration-exploitation problem, using MAB algorithms to find optimal branch nodes to prune based on feedback from each pruning action. Experimental evaluation on several benchmark datasets demonstrates that our approach yields better predictive performance than traditional techniques. This suggests the potential of MAB for dynamic, probabilistic decision tree pruning, in turn optimizing decision tree-based models.
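The exploration-exploitation loop is essentially UCB over candidate prune actions with a noisy validation reward. In the sketch below, `prune_candidates` and `reward_of` are hypothetical hooks one would wire to a real decision-tree implementation; here the rewards are simulated.

```python
import math
import random

random.seed(0)
prune_candidates = list(range(8))   # hypothetical internal nodes eligible for pruning
true_gain = [0.1, 0.3, -0.05, 0.2, 0.0, 0.25, -0.1, 0.05]

def reward_of(arm):
    """Hypothetical: validation-accuracy change after pruning `arm`, plus noise."""
    return true_gain[arm] + random.gauss(0, 0.05)

counts = [0] * len(prune_candidates)
values = [0.0] * len(prune_candidates)

for t in range(1, 501):
    # UCB1: pull each arm once, then balance mean reward against uncertainty.
    if 0 in counts:
        arm = counts.index(0)
    else:
        arm = max(range(len(counts)),
                  key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
    r = reward_of(arm)
    counts[arm] += 1
    values[arm] += (r - values[arm]) / counts[arm]   # incremental mean

print("best prune action:", max(range(len(values)), key=values.__getitem__))
```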

[311] Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning

Haohui Chen, Zhiyong Chen

Main category: cs.LG

TL;DR: The paper proposes MCRE and MCRQ to balance conservatism and performance in offline RL, outperforming existing methods.

DetailsMotivation: Addressing distribution shift and overestimation in offline RL by balancing conservatism and performance.

Method: Introduces MCRE framework combining TD error and behavior cloning, and MCRQ algorithm integrating MCRE into actor-critic.

Result: MCRQ outperforms baselines and state-of-the-art offline RL algorithms on benchmarks.

Conclusion: MCRE and MCRQ effectively balance conservatism and performance, improving offline RL outcomes.

Abstract: Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.
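The abstract does not spell out the backup, but the family it belongs to, trading a value objective against a behavior-cloning term, is familiar from TD3+BC. The sketch below shows that generic pattern with a scale-invariant trade-off; MCRE’s placement of the term inside the Bellman backup differs in detail, so treat this as background rather than the paper’s method.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
alpha = 2.5   # conservatism knob: higher means closer to pure behavior cloning

def actor_loss(batch_obs, batch_act):
    """Mildly conservative update: maximize Q while staying near the logged data."""
    pi = actor(batch_obs)
    q = critic(torch.cat([batch_obs, pi], dim=-1))
    lam = alpha / q.abs().mean().detach()            # scale-invariant trade-off weight
    return -(lam * q).mean() + ((pi - batch_act) ** 2).mean()

obs = torch.randn(32, obs_dim)
act = torch.rand(32, act_dim) * 2 - 1                # logged behavior actions in [-1, 1]
loss = actor_loss(obs, act)
loss.backward()
print(float(loss))
```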

[312] LinguaFluid: Language Guided Fluid Control via Semantic Rewards in Reinforcement Learning

Aoming Liang, Chi Cheng, Dashuai Chen, Boai Sun, Dixia Fan

Main category: cs.LG

TL;DR: A method for computing rewards in RL using semantic alignment with SBERT, eliminating the need for manual reward engineering.

DetailsMotivation: Challenges in designing reward functions in RL, especially for tasks with non-numeric goals, motivate a semantic-based approach.

Method: Rewards are calculated via cosine similarity between goal textual descriptions and episode statements using SBERT.

Result: Semantic rewards enable competitive control behavior without hand-crafted rewards, showing alignment between language and Euclidean spaces.

Conclusion: The framework advances RL by integrating natural language goals and paving the way for LLM and control application synergy.

Abstract: In the domain of scientific machine learning, designing effective reward functions remains a challenge in reinforcement learning (RL), particularly in environments where task goals are difficult to specify numerically. Reward functions in existing work are predominantly based on heuristics, manual engineering, or task-specific tuning. In this work, we introduce a semantically aligned reinforcement learning method where rewards are computed by aligning the current state with a target semantic instruction using Sentence-BERT (Sentence-Bidirectional Encoder Representations from Transformers, SBERT). Instead of relying on manually defined reward functions, the policy receives a reward computed as the cosine similarity between the textual description of the goal and the description of the state reached in the episode. We evaluate our approach in several environments and show that semantic rewards can guide learning to achieve competitive control behavior, even in the absence of hand-crafted reward functions. Our study demonstrates a correlation between the language embedding space and the conventional Euclidean space. This framework opens new horizons for aligning agent behavior with natural language goals and lays the groundwork for a more seamless integration of large language models (LLMs) and fluid control applications.
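The reward itself is nearly a one-liner with the sentence-transformers library. The checkpoint name and the state-description text below are placeholders; in the paper, the episode statements come from the environment.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # any SBERT checkpoint works

goal = "Keep the cylinder wake symmetric with minimal vortex shedding."
state_description = "The wake behind the cylinder is nearly symmetric; shedding is weak."

def semantic_reward(goal_text, state_text):
    emb = model.encode([goal_text, state_text], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))    # cosine similarity in [-1, 1]

print(semantic_reward(goal, state_description))
```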

[313] Parameter-free Optimal Rates for Nonlinear Semi-Norm Contractions with Applications to $Q$-Learning

Ankur Naskar, Gugan Thoppe, Vijay Gupta

Main category: cs.LG

TL;DR: The paper presents a method to achieve parameter-free optimal convergence rates for nonlinear fixed-point equations like Q-learning and TD-learning, addressing non-monotonicity issues in semi-norms.

DetailsMotivation: The motivation is to overcome the challenge of achieving parameter-free optimal convergence rates for nonlinear fixed-point equations, which has been elusive due to non-monotonicity in semi-norms.

Method: The method involves recasting the averaged error as a linear recursion with a nonlinear perturbation and coupling the semi-norm’s contraction with a suitably induced norm’s monotonicity.

Result: The main result is the first parameter-free optimal convergence rate of ~O(1/√t) for Q-learning in average-reward and discounted settings, applicable to various deployment scenarios.

Conclusion: The paper successfully closes the gap in achieving optimal convergence rates for nonlinear fixed-point equations, providing a versatile framework for practical applications.

Abstract: Algorithms for solving \textit{nonlinear} fixed-point equations – such as average-reward \textit{$Q$-learning} and \textit{TD-learning} – often involve semi-norm contractions. Achieving parameter-free optimal convergence rates for these methods via Polyak–Ruppert averaging has remained elusive, largely due to the non-monotonicity of such semi-norms. We close this gap by (i.) recasting the averaged error as a linear recursion involving a nonlinear perturbation, and (ii.) taming the nonlinearity by coupling the semi-norm’s contraction with the monotonicity of a suitably induced norm. Our main result yields the first parameter-free $\tilde{O}(1/\sqrt{t})$ optimal rates for $Q$-learning in both average-reward and exponentially discounted settings, where $t$ denotes the iteration index. The result applies within a broad framework that accommodates synchronous and asynchronous updates, single-agent and distributed deployments, and data streams obtained either from simulators or along Markovian trajectories.
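The Polyak-Ruppert recipe underlying the result is simple to state in code: run the noisy fixed-point iteration and report the running average of the iterates rather than the last one. A toy synchronous Q-learning sketch on a random discounted MDP follows; the paper’s contribution is the semi-norm analysis (e.g., the average-reward setting), not this vanilla loop.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] = next-state distribution
R = rng.uniform(size=(S, A))

Q = np.zeros((S, A))
Q_bar = np.zeros((S, A))                         # Polyak-Ruppert averaged iterate
for t in range(1, 20001):
    s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])
    target = R + gamma * Q[s_next].max(axis=-1)  # sampled Bellman target
    Q += (1.0 / t ** 0.6) * (target - Q)         # slowly decaying step size
    Q_bar += (Q - Q_bar) / t                     # running average: the O(1/sqrt(t)) estimator

print("gap between last and averaged iterate:", float(np.abs(Q - Q_bar).max()))
```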

[314] Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal

Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu

Main category: cs.LG

TL;DR: ASAP is a novel framework for compressing Chain-of-Thought (CoT) reasoning traces in Large Reasoning Models (LRMs), improving efficiency while maintaining accuracy.

DetailsMotivation: Excessively long CoT reasoning traces in LRMs increase training and inference costs, and existing compression methods disrupt coherence or fail to capture critical steps.

Method: ASAP uses anchor-guided pruning to preserve core reasoning structure and a first-token surprisal metric for logic-aware pruning, enabling concise CoT generation.

Result: ASAP reduces token generation by 23.5% and inference latency by 43.5% while achieving 36.19% Pass@1 accuracy on LiveCodeBench v4_v5.

Conclusion: ASAP offers an efficient and accurate solution for CoT compression in LRMs, advancing the development of powerful and efficient reasoning models.

Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces introduce substantial challenges in terms of training cost, inference latency, and deployment feasibility. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps. In this paper, we propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. It then enables a logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP teaches models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning in coding tasks. Experiments show that ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs. On the challenging LiveCodeBench v4_v5 benchmark, our approach reduces token generation by 23.5% and inference latency by 43.5% compared to the strongest baseline, while achieving a competitive accuracy of 36.19% in Pass@1. Our results highlight a promising direction for building powerful and efficient LRMs.
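The first-token surprisal metric is cheap to compute with any causal LM: score each reasoning step by the negative log-probability of its first token given the preceding context. A sketch with GPT-2 as a stand-in scorer and line-level step segmentation, both simplifications of the paper’s setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def first_token_surprisal(context, step):
    """-log p(first token of `step` | context), in nats."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    step_ids = tok(step).input_ids
    with torch.no_grad():
        logits = lm(ctx_ids).logits[0, -1]       # next-token distribution after the context
    logprobs = torch.log_softmax(logits, dim=-1)
    return -float(logprobs[step_ids[0]])

cot = ["First, parse the input into a list of integers.",
       "Then sort the list in descending order.",
       "Wait, actually the problem asks for ascending order."]
context = "Problem: sort the numbers.\n"
for step in cot:
    s = first_token_surprisal(context, " " + step)
    print(f"{s:6.2f}  {step}")                   # high surprisal ~ logically pivotal step
    context += step + "\n"
```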

[315] Optimizing Prompt Sequences using Monte Carlo Tree Search for LLM-Based Optimization

Fei Xu Yu, Gina Adam, Nathaniel D. Bastian, Tian Lan

Main category: cs.LG

TL;DR: MCTS-OPS combines LLMs with Monte Carlo Tree Search to improve multi-step planning in code generation, achieving better optimization results and success rates.

DetailsMotivation: Existing LLM methods struggle with complex tasks requiring consistent multi-step planning, and current MCTS approaches focus on simpler tasks or heuristic-based code.

Method: MCTS-OPS formulates prompt selection as a sequential decision process guided by MCTS, refining multi-step prompts for better code generation.

Result: Experiments show significant improvements: 2-4x higher reward, 3x lower standard deviation, and 10% more optimal solutions in hard problems.

Conclusion: Combining symbolic planning with LLMs enhances robust, high-quality code generation in complex domains.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in code generation and structured reasoning; however, their performance often degrades on complex tasks that require consistent multi-step planning. Recent work has explored combining LLMs with Monte Carlo Tree Search (MCTS), yet existing approaches primarily focus on generating heuristic-based code for optimization or target simpler tasks where correctness alone is sufficient. In this work, we propose MCTS-OPS, a novel neural-symbolic framework that formulates prompt selection as a sequential decision process guided by MCTS. Our method explores and refines multi-step prompt sequences for the goal of improving code generation quality and enhancing the problem-solving capabilities of LLMs in general optimization. Experiments on network optimization show significant improvement over the baselines, both in the success rate of executing the generated code and in the optimization results with the specified objective and constraints (2$\sim$4$\times$ higher reward and 3$\times$ lower standard deviation). Moreover, it improves the chance of attaining the optimal solution by about 10% of cases, compared to baseline methods in hard problems. These results highlight the promise of combining symbolic planning with LLMs for robust, high-quality code generation in complex domains.

[316] Stepwise Fine and Gray: Subject-Specific Variable Selection Shows When Hemodynamic Data Improves Prognostication of Comatose Post-Cardiac Arrest Patients

Xiaobin Shen, Jonathan Elmer, George H. Chen

Main category: cs.LG

TL;DR: A novel stepwise dynamic competing risks model improves neurological outcome prediction for comatose post-cardiac arrest patients by leveraging time-invariant and time-varying features at optimal phases.

DetailsMotivation: Prognostication for comatose post-cardiac arrest patients is challenging and impacts ICU decision-making. Current methods don't optimally use time-invariant and time-varying features.

Method: Extends the Fine and Gray model to explicitly model two phases (time-invariant and time-varying features) and incorporates neural networks for nonlinear relationships.

Result: Demonstrated robust discriminative performance for outcomes like awakening, withdrawal of therapy, and death in a cohort of 2,278 patients.

Conclusion: The model generalizes to multi-phase feature collection and can enhance dynamic prediction tasks by identifying when and for whom new features improve prognostication.

Abstract: Prognostication for comatose post-cardiac arrest patients is a critical challenge that directly impacts clinical decision-making in the ICU. Clinical information that informs prognostication is collected serially over time. Shortly after cardiac arrest, various time-invariant baseline features are collected (e.g., demographics, cardiac arrest characteristics). After ICU admission, additional features are gathered, including time-varying hemodynamic data (e.g., blood pressure, doses of vasopressor medications). We view these as two phases in which we collect new features. In this study, we propose a novel stepwise dynamic competing risks model that improves the prediction of neurological outcomes by automatically determining when to take advantage of time-invariant features (first phase) and time-varying features (second phase). Notably, our model finds patients for whom this second phase (time-varying hemodynamic) information is beneficial for prognostication and also when this information is beneficial (as we collect more hemodynamic data for a patient over time, how important these data are for prognostication varies). Our approach extends the standard Fine and Gray model to explicitly model the two phases and to incorporate neural networks to flexibly capture complex nonlinear feature relationships. Evaluated on a retrospective cohort of 2,278 comatose post-arrest patients, our model demonstrates robust discriminative performance for the competing outcomes of awakening, withdrawal of life-sustaining therapy, and death despite maximal support. Our approach generalizes to more than two phases in which new features are collected and could be used in other dynamic prediction tasks, where it may be helpful to know when and for whom newly collected features significantly improve prediction.

[317] Adaptive Heterogeneous Graph Neural Networks: Bridging Heterophily and Heterogeneity

Qin Chen, Guojie Song

Main category: cs.LG

TL;DR: AHGNN addresses heterophily in heterogeneous graphs by adapting to varying heterophily distributions and semantic diversity, outperforming baselines in high-heterophily scenarios.

DetailsMotivation: Existing studies overlook heterophilic heterogeneous graphs, leading to performance issues.

Method: AHGNN uses heterophily-aware convolution and a coarse-to-fine attention mechanism to handle diverse heterophily distributions and semantic information.

Result: AHGNN outperforms 20 baselines on seven real-world graphs, especially in high-heterophily cases.

Conclusion: AHGNN effectively addresses heterophily challenges in heterogeneous graphs, demonstrating superior performance.

Abstract: Heterogeneous graphs (HGs) are common in real-world scenarios and often exhibit heterophily. However, most existing studies focus on either heterogeneity or heterophily in isolation, overlooking the prevalence of heterophilic HGs in practical applications, which leads to degraded performance. In this work, we first identify two main challenges in modeling heterophilic HGs: (1) varying heterophily distributions across hops and meta-paths; (2) the intricate and often heterophily-driven diversity of semantic information across different meta-paths. Then, we propose the Adaptive Heterogeneous Graph Neural Network (AHGNN) to tackle these challenges. AHGNN employs a heterophily-aware convolution that accounts for heterophily distributions specific to both hops and meta-paths. It then integrates messages from diverse semantic spaces using a coarse-to-fine attention mechanism, which filters out noise and emphasizes informative signals. Experiments on seven real-world graphs and twenty baselines demonstrate the superior performance of AHGNN, particularly in high-heterophily situations.

[318] DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park

Main category: cs.LG

TL;DR: DP-LLM dynamically adjusts layer precision in on-device LLMs for optimal performance-latency trade-offs, outperforming prior methods.

DetailsMotivation: Addressing the challenge of adapting LLMs to varying runtime constraints (latency, accuracy) efficiently.

Method: Introduces DP-LLM, which dynamically assigns precision to layers using a lightweight error estimator and learned thresholds.

Result: Superior performance-latency trade-off compared to prior approaches, validated across models and benchmarks.

Conclusion: DP-LLM effectively balances precision and latency in on-device LLMs, leveraging dynamic layer sensitivity.

Abstract: How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding iterations. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. DP-LLM augments each linear layer in an LLM with a precision selector that determines the bitwidth at runtime using a lightweight error estimator and threshold values learned through fine-tuning. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
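A minimal sketch of the layer-wise precision-selection idea, assuming fake-quantization and a scalar input statistic as the lightweight error proxy; the class, the proxy, and the threshold handling are our illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch (our assumptions, not the paper's code): a linear layer that
# picks a bitwidth at runtime from a cheap input-dependent error estimate
# compared against a learned threshold. Quantization is simulated (fake-quant).
import torch
import torch.nn as nn

def fake_quant(w, bits):
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    return torch.round(w / scale) * scale

class DynamicPrecisionLinear(nn.Module):
    def __init__(self, d_in, d_out, bits=(4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bits = bits
        self.threshold = nn.Parameter(torch.tensor(0.1))  # learned via fine-tuning

    def forward(self, x):
        # Lightweight proxy for how sensitive this input is to quantization error
        err_est = x.abs().mean()
        bits = self.bits[0] if err_est < self.threshold else self.bits[1]
        return x @ fake_quant(self.weight, bits).t()

layer = DynamicPrecisionLinear(16, 16)
y = layer(torch.randn(2, 16))
```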

[319] Architecture-Aware Generalization Bounds for Temporal Networks: Theory and Fair Comparison Methodology

Barak Gahtan, Alex M. Bronstein

Main category: cs.LG

TL;DR: The paper provides non-vacuous generalization bounds for deep temporal models like TCNs, introduces a delayed-feedback blocking mechanism, and evaluates the impact of temporal dependence on learning.

DetailsMotivation: To address the lack of theoretical understanding of generalization in deep temporal architectures and provide practical insights into their performance.

Method: Derives generalization bounds for exponentially β-mixing sequences and introduces a delayed-feedback blocking mechanism to transform dependent samples into independent ones. Also proposes a fair-comparison methodology to isolate temporal structure effects.

Result: Bounds scale as O(R√(Dpn log N/N)), with √D scaling instead of exponential. Temporal dependence can enhance learning under fixed information budgets, but empirical convergence rates differ from theory.

Conclusion: Temporal dependence can improve learning, but gaps between theory and practice highlight the need for further research.

Abstract: Deep temporal architectures such as Temporal Convolutional Networks (TCNs) achieve strong predictive performance on sequential data, yet theoretical understanding of their generalization remains limited. We address this gap by providing both the first non-vacuous, architecture-aware generalization bounds for deep temporal models and a principled evaluation methodology. For exponentially $\beta$-mixing sequences, we derive bounds scaling as $O\!\Bigl(R\,\sqrt{\tfrac{D\,p\,n\,\log N}{N}}\Bigr)$, where $D$ is network depth, $p$ kernel size, $n$ input dimension, and $R$ weight norm. Our delayed-feedback blocking mechanism transforms dependent samples into effectively independent ones while discarding only $O(1/\log N)$ of the data, yielding $\sqrt{D}$ scaling instead of exponential, implying that doubling depth requires approximately quadrupling the training data. We also introduce a fair-comparison methodology that fixes the effective sample size to isolate the effect of temporal structure from information content. Under $N_{\text{eff}}=2{,}000$, strongly dependent sequences ($\rho=0.8$) exhibit $\approx 76\%$ smaller generalization gaps than weakly dependent ones ($\rho=0.2$), challenging the intuition that dependence is purely detrimental. Yet convergence rates diverge from theory: weak dependencies follow $N_{\text{eff}}^{-1.21}$ scaling and strong dependencies follow $N_{\text{eff}}^{-0.89}$, both steeper than the predicted $N^{-0.5}$. These findings reveal that temporal dependence can enhance learning under fixed information budgets, while highlighting gaps between theory and practice that motivate future research.

[320] Recurrent Deep Differentiable Logic Gate Networks

Simon Bührer, Andreas Plesner, Till Aczel, Roger Wattenhofer

Main category: cs.LG

TL;DR: First implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN) for sequence-to-sequence learning, showing competitive performance to GRUs.

DetailsMotivation: To explore the application of differentiable logic gates in sequential modeling, an area previously unexplored.

Method: Combines Boolean operations with recurrent architectures for sequence-to-sequence learning, tested on WMT'14 English-German translation.

Result: Achieves 5.00 BLEU and 30.9% accuracy during training, approaching GRU performance (5.41 BLEU) with graceful degradation (4.39 BLEU) during inference.

Conclusion: Recurrent logic-based neural computation is viable, opening research directions for FPGA acceleration and recursive network architectures.

Abstract: While differentiable logic gates have shown promise in feedforward networks, their application to sequential modeling remains unexplored. This paper presents the first implementation of Recurrent Deep Differentiable Logic Gate Networks (RDDLGN), combining Boolean operations with recurrent architectures for sequence-to-sequence learning. Evaluated on WMT'14 English-German translation, RDDLGN achieves 5.00 BLEU and 30.9% accuracy during training, approaching GRU performance (5.41 BLEU), and degrades gracefully to 4.39 BLEU during inference. This work establishes recurrent logic-based neural computation as viable, opening research directions for FPGA acceleration in sequential modeling and other recursive network architectures.
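A toy sketch of the differentiable-logic-gate idea as we read the abstract: each unit mixes soft relaxations of Boolean operations via softmax weights, and recurrence feeds the state back as one gate input (our construction, not the authors' code):

```python
# Toy differentiable logic gates: inputs in [0, 1], each gate outputs a
# softmax-weighted mixture of soft Boolean operations; recurrence feeds the
# hidden state back as the second operand.
import torch
import torch.nn as nn

def soft_ops(a, b):
    # Probabilistic relaxations of AND, OR, XOR, NAND
    return torch.stack([a * b, a + b - a * b, a + b - 2 * a * b, 1 - a * b], dim=-1)

class LogicGateCell(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(width, 4))  # one gate per unit

    def forward(self, x, h):
        mix = torch.softmax(self.logits, dim=-1)   # (width, 4) gate-choice weights
        return (soft_ops(x, h) * mix).sum(dim=-1)  # new hidden state

cell = LogicGateCell(width=8)
h = torch.zeros(8)
for x in torch.rand(5, 8):  # a length-5 sequence
    h = cell(x, h)
```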

[321] GCHR : Goal-Conditioned Hindsight Regularization for Sample-Efficient Reinforcement Learning

Xing Lei, Wenyan Yang, Kaiqiang Ke, Shentao Yang, Xuetao Zhang, Joni Pajarinen, Donglin Wang

Main category: cs.LG

TL;DR: HGR and HSR improve sample efficiency in GCRL by leveraging hindsight goals and self-imitation, outperforming HER-based methods.

DetailsMotivation: Addressing limited sample efficiency in GCRL with sparse rewards by better exploiting experiences beyond trajectory relabeling.

Method: Proposes Hindsight Goal-conditioned Regularization (HGR) and combines it with hindsight self-imitation regularization (HSR) for off-policy RL.

Result: Achieves more efficient sample reuse and superior performance in navigation and manipulation tasks.

Conclusion: HGR and HSR enhance experience utilization, offering a significant improvement over existing GCRL methods.

Abstract: Goal-conditioned reinforcement learning (GCRL) with sparse rewards remains a fundamental challenge in reinforcement learning. While hindsight experience replay (HER) has shown promise by relabeling collected trajectories with achieved goals, we argue that trajectory relabeling alone does not fully exploit the available experiences in off-policy GCRL methods, resulting in limited sample efficiency. In this paper, we propose Hindsight Goal-conditioned Regularization (HGR), a technique that generates action regularization priors based on hindsight goals. When combined with hindsight self-imitation regularization (HSR), our approach enables off-policy RL algorithms to maximize experience utilization. Compared to existing GCRL methods that employ HER and self-imitation techniques, our hindsight regularizations achieve substantially more efficient sample reuse and the best performances, which we empirically demonstrate on a suite of navigation and manipulation tasks.
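A minimal sketch of the hindsight-regularization idea, assuming the regularizer imitates, under the achieved (hindsight) goal, the actions that actually reached it; the loss shape and names are ours:

```python
# Sketch under assumptions (not the authors' exact losses): condition the
# policy on the hindsight (achieved) goal and regularize it toward the action
# that actually reached that goal (self-imitation-style prior).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))

def hindsight_regularizer(batch):
    sg = torch.cat([batch["state"], batch["achieved_goal"]], dim=-1)
    return ((policy(sg) - batch["action"]) ** 2).mean()

batch = {"state": torch.randn(16, 4), "achieved_goal": torch.randn(16, 2),
         "action": torch.randn(16, 2)}
reg = hindsight_regularizer(batch)
# Total actor loss would be the usual off-policy actor loss + lam * reg.
```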

[322] Improving Diagnostic Accuracy for Oral Cancer with inpainting Synthesis Lesions Generated Using Diffusion Models

Yong Oh Lee, JeeEun Kim, Jung Woo Lee

Main category: cs.LG

TL;DR: A novel method using synthetic image generation with a fine-tuned diffusion model improves oral cancer diagnostic accuracy by addressing dataset limitations.

DetailsMotivation: Limited annotated datasets constrain diagnostic model performance due to variability and insufficient training data.

Method: Proposed an inpainting technique with a fine-tuned diffusion model to synthesize realistic oral cancer lesions, enhancing diagnostic algorithms.

Result: Achieved 0.97 diagnostic accuracy for classification and 0.85 accuracy for lesion detection.

Conclusion: Synthetic image generation shows promise for medical diagnostics and warrants further research for broader cancer diagnostics.

Abstract: In oral cancer diagnostics, the limited availability of annotated datasets frequently constrains the performance of diagnostic models, particularly due to the variability and insufficiency of training data. To address these challenges, this study proposes a novel approach to enhance diagnostic accuracy by synthesizing realistic oral cancer lesions using an inpainting technique with a fine-tuned diffusion model. We compiled a comprehensive dataset from multiple sources, featuring a variety of oral cancer images. Our method generated synthetic lesions that exhibit a high degree of visual fidelity to actual lesions, thereby significantly enhancing the performance of diagnostic algorithms. The results show that our classification model achieved a diagnostic accuracy of 0.97 in differentiating between cancerous and non-cancerous tissues, while our detection model accurately identified lesion locations with 0.85 accuracy. This method validates the potential of synthetic image generation in medical diagnostics and paves the way for further research into extending these methods to other types of cancer diagnostics.
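For readers unfamiliar with the underlying technique, here is generic diffusion inpainting with the diffusers library; the checkpoint, file names, and prompt are placeholders, and the paper fine-tunes its own diffusion model rather than using this stock pipeline:

```python
# Generic inpainting sketch with diffusers (placeholder checkpoint and files;
# the paper fine-tunes its own model on oral cancer images).
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

image = Image.open("oral_photo.png").convert("RGB").resize((512, 512))
mask = Image.open("lesion_mask.png").convert("L").resize((512, 512))  # white = region to inpaint

result = pipe(prompt="oral cancer lesion", image=image, mask_image=mask).images[0]
result.save("synthetic_lesion.png")
```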

[323] Differentially Private Federated Clustering with Random Rebalancing

Xiyuan Yang, Shengyuan Hu, Soyeon Kim, Tian Li

Main category: cs.LG

TL;DR: RR-Cluster improves privacy/utility tradeoffs in federated clustering by rebalancing cluster assignments to reduce noise, ensuring a minimum client count per cluster.

DetailsMotivation: Federated clustering enhances model performance but risks privacy leakage. Standard DP mechanisms degrade utility due to uncontrolled cluster sizes.

Method: Proposes RR-Cluster, a lightweight add-on to federated clustering algorithms, which rebalances cluster assignments to guarantee minimum client counts.

Result: RR-Cluster reduces privacy noise variance, improving privacy/utility tradeoffs, validated on synthetic and real-world datasets.

Conclusion: RR-Cluster effectively balances privacy and utility in federated clustering by addressing noise and bias tradeoffs.

Abstract: Federated clustering aims to group similar clients into clusters and produce one model for each cluster. Such a personalization approach typically improves model performance compared with training a single model to serve all clients, but can be more vulnerable to privacy leakage. Directly applying client-level differentially private (DP) mechanisms to federated clustering could degrade the utilities significantly. We identify that such deficiencies are mainly due to the difficulties of averaging privacy noise within each cluster (following standard privacy mechanisms), as the number of clients assigned to the same clusters is uncontrolled. To this end, we propose a simple and effective technique, named RR-Cluster, that can be viewed as a light-weight add-on to many federated clustering algorithms. RR-Cluster achieves reduced privacy noise via randomly rebalancing cluster assignments, guaranteeing a minimum number of clients assigned to each cluster. We analyze the tradeoffs between decreased privacy noise variance and potentially increased bias from incorrect assignments and provide convergence bounds for RR-Cluster. Empirically, we demonstrate that RR-Cluster, plugged into strong federated clustering algorithms, results in significantly improved privacy/utility tradeoffs across both synthetic and real-world datasets.
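A sketch of the rebalancing step as the abstract describes it, moving randomly chosen clients out of over-full clusters until every cluster meets a minimum size; the exact donor-selection rule here is our assumption:

```python
# Sketch of random rebalancing (assumed from the abstract): after clustering,
# randomly move clients from the largest cluster so every cluster has at
# least `min_size` members before noisy aggregation.
import numpy as np

def random_rebalance(assignments, n_clusters, min_size, rng=np.random.default_rng(0)):
    assignments = assignments.copy()
    for c in range(n_clusters):
        members = np.flatnonzero(assignments == c)
        while len(members) < min_size:
            counts = np.bincount(assignments, minlength=n_clusters)
            donor = int(np.argmax(counts))                      # largest cluster donates
            victim = rng.choice(np.flatnonzero(assignments == donor))
            assignments[victim] = c
            members = np.flatnonzero(assignments == c)
    return assignments

balanced = random_rebalance(np.array([0, 0, 0, 0, 1, 2]), n_clusters=3, min_size=2)
```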

[324] Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

Mateusz Praski, Jakub Adamczyk, Wojciech Czech

Main category: cs.LG

TL;DR: Extensive comparison of 25 pretrained neural models in chemistry shows negligible improvement over baseline ECFP fingerprints, with only CLAMP performing significantly better.

DetailsMotivation: To rigorously evaluate the effectiveness of pretrained neural networks in chemistry and small molecule drug design, addressing concerns about evaluation rigor in existing studies.

Method: Comparison of 25 models across 25 datasets using a hierarchical Bayesian statistical testing model under a fair framework.

Result: Most neural models show no improvement over ECFP fingerprints; only CLAMP performs significantly better.

Conclusion: Findings highlight evaluation issues in prior studies, prompting discussion of causes, solutions, and practical recommendations.

Abstract: Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
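The baseline being compared against is easy to reproduce; a typical ECFP (Morgan fingerprint) pipeline with RDKit and a simple downstream classifier looks like this (toy SMILES and labels for illustration):

```python
# ECFP (Morgan) fingerprint baseline with RDKit, fed to a standard classifier.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def ecfp(smiles, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.stack([ecfp(s) for s in ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]])
y = np.array([0, 1, 0, 1])  # toy labels
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```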

[325] Graph Federated Learning for Personalized Privacy Recommendation

Ce Na, Kai Yang, Dengzhao Fang, Yu Li, Jingtong Gao, Chengcheng Zhu, Jiale Zhang, Xiaobing Sun, Yi Chang

Main category: cs.LG

TL;DR: GFed-PP is a federated recommendation system that adapts to varying user privacy preferences, leveraging public user data to improve recommendations while ensuring privacy.

DetailsMotivation: Existing FedRecs assume uniform privacy requirements, ignoring the potential of public user data. GFed-PP addresses this by accommodating both private and public users.

Method: GFed-PP uses a user-item interaction graph and a lightweight GCN for personalized embeddings, with local learning for privacy and server aggregation for optimization.

Result: GFed-PP outperforms existing methods on five datasets, improving accuracy without compromising privacy.

Conclusion: GFed-PP offers a practical solution for federated recommendation systems with diverse privacy needs.

Abstract: Federated recommendation systems (FedRecs) have gained significant attention for providing privacy-preserving recommendation services. However, existing FedRecs assume that all users have the same requirements for privacy protection, i.e., they do not upload any data to the server. These approaches overlook the potential to enhance the recommendation service by utilizing publicly available user data. In real-world applications, users can choose to be private or public. Private users’ interaction data is not shared, while public users’ interaction data can be shared. Motivated by this issue, this paper proposes a novel Graph Federated Learning for Personalized Privacy Recommendation (GFed-PP) framework that adapts to different privacy requirements while improving recommendation performance. GFed-PP incorporates the interaction data of public users to build a user-item interaction graph, which is then used to form a user relationship graph. A lightweight graph convolutional network (GCN) is employed to learn each user’s personalized item embedding. To protect user privacy, each client learns the user embedding and the scoring function locally. Additionally, GFed-PP achieves optimization of the federated recommendation framework through the initialization of item embedding on clients and the aggregation of the user relationship graph on the server. Experimental results demonstrate that GFed-PP significantly outperforms existing methods on five datasets, offering superior recommendation accuracy without compromising privacy. This framework provides a practical solution for accommodating varying privacy preferences in federated recommendation systems.

[326] Reparameterization Proximal Policy Optimization

Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Main category: cs.LG

TL;DR: RPO combines RPG and PPO for stable, sample-efficient reinforcement learning by optimizing a clipped surrogate objective with KL regularization.

DetailsMotivation: Training instability in RPG due to high-variance gradients hinders its sample efficiency.

Method: Proposes RPO, integrating PPO’s surrogate objective with RPG, enabling stable sample reuse via backpropagation through time.

Result: RPO achieves superior sample efficiency and performance on locomotion and manipulation tasks.

Conclusion: RPO bridges RPG and PPO, offering a stable, efficient reinforcement learning method.

Abstract: Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables multiple epochs of stable sample reuse by optimizing a clipped surrogate objective tailored for RPG, while being further stabilized by Kullback-Leibler (KL) divergence regularization and remaining fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.
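The shape of a PPO-style clipped surrogate with KL regularization, which the abstract says RPO optimizes, is sketched below on precomputed quantities; in RPO the gradient would additionally flow through the differentiable dynamics via backpropagation through time, which we do not reproduce here:

```python
# Clipped surrogate plus KL penalty, shown on precomputed log-probs,
# advantages, and per-sample KL estimates (illustrative tensors).
import torch

def rpo_loss(logp_new, logp_old, advantage, kl, clip_eps=0.2, kl_coef=0.1):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    return -surrogate.mean() + kl_coef * kl.mean()

loss = rpo_loss(torch.randn(32), torch.randn(32), torch.randn(32), kl=torch.rand(32))
```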

[327] Epidemic Control on a Large-Scale-Agent-Based Epidemiology Model using Deep Deterministic Policy Gradient

Gaurav Deshkar, Jayanta Kshirsagar, Harshal Hayatnagarkar, Janani Venugopalan

Main category: cs.LG

TL;DR: The paper proposes a DDPG-based framework for optimizing pandemic interventions like lockdowns and vaccinations, balancing health and economic outcomes on a large-scale simulation.

DetailsMotivation: Current research lacks scalable, automated methods to model and optimize pandemic interventions, limiting the exploration of strategies.

Method: Uses a Deep Deterministic Policy Gradient (DDPG) framework on a large-scale agent-based simulation for multi-objective optimization of lockdown and vaccination policies.

Result: Optimal policies balance health (infection, hospitalization) and economy (poverty reduction) without lockdowns, focusing on mid-age and elderly vaccination.

Conclusion: The framework shows promise but requires further validation and aims to be open-sourced for broader use.

Abstract: To mitigate the impact of the pandemic, several measures are used, including lockdowns, rapid vaccination programs, school closures, and economic stimulus. These interventions can have positive or unintended negative consequences. Current research to model and automatically determine an optimal intervention through round-tripping is limited by the simulation objectives, scale (a few thousand individuals), model types that are not suited for intervention studies, and the number of intervention strategies that can be explored (discrete vs continuous). We address these challenges using a Deep Deterministic Policy Gradient (DDPG) based policy optimization framework on a large-scale (100,000 individual) epidemiological agent-based simulation, where we perform multi-objective optimization. We determine the optimal policy for lockdown and vaccination in a minimalist age-stratified multi-vaccine scenario with a basic simulation of economic activity. With no lockdown and with vaccination of mid-age and elderly individuals, results show optimal economic outcomes (measured by individuals below the poverty line) with balanced health objectives (infection and hospitalization). More in-depth simulations are needed to further validate our results, and we aim to open-source our framework.

[328] SCAR: State-Space Compression for AI-Driven Resource Management in 6G-Enabled Vehicular Infotainment Systems

Ioan-Sorin Comsa, Purav Shah, Karthik Vaidhyanathan, Deepak Gangadharan, Christof Imhof, Per Bergamin, Aryan Kaushik, Gabriel-Miro Muntean, Ramona Trestian

Main category: cs.LG

TL;DR: SCAR is an Edge AI framework for 6G vehicular networks, using ML-based compression and RL to optimize scheduling and fairness, outperforming baselines.

DetailsMotivation: Traditional RRM techniques struggle with high data complexity in 6G vehicular networks, necessitating AI-driven solutions.

Method: SCAR employs ML-based compression (clustering, RBF networks) and RL policies for resource management, validated via simulations.

Result: SCAR improves feasible scheduling time by 14%, reduces unfair scheduling by 15%, and cuts CQI clustering distortion by 10%.

Conclusion: SCAR effectively enhances scalability and fairness in dynamic vehicular networks, proving its viability for 6G infotainment services.

Abstract: The advent of 6G networks opens new possibilities for connected infotainment services in vehicular environments. However, traditional Radio Resource Management (RRM) techniques struggle with the increasing volume and complexity of data such as Channel Quality Indicators (CQI) from autonomous vehicles. To address this, we propose SCAR (State-Space Compression for AI-Driven Resource Management), an Edge AI-assisted framework that optimizes scheduling and fairness in vehicular infotainment. SCAR employs ML-based compression techniques (e.g., clustering and RBF networks) to reduce CQI data size while preserving essential features. These compressed states are used to train 6G-enabled Reinforcement Learning policies that maximize throughput while meeting fairness objectives defined by the NGMN. Simulations show that SCAR increases time in feasible scheduling regions by 14% and reduces unfair scheduling time by 15% compared to RL baselines without CQI compression. Furthermore, Simulated Annealing with Stochastic Tunneling (SAST)-based clustering reduces CQI clustering distortion by 10%, confirming its efficiency. These results demonstrate SCAR’s scalability and fairness benefits for dynamic vehicular networks.

[329] Membership Inference Attack with Partial Features

Xurun Wang, Guangrui Liu, Xinjie Li, Haoyu He, Lin Yao, Weizhe Zhang

Main category: cs.LG

TL;DR: The paper introduces Partial Feature Membership Inference (PFMI), a scenario where attackers infer membership using partial features, and proposes MRAD, a two-stage attack framework combining reconstruction and anomaly detection.

DetailsMotivation: Existing membership inference attacks assume full feature access, which is unrealistic. The study addresses the gap by exploring attacks with partial feature information.

Method: MRAD involves two stages: (1) optimizing unknown features to minimize sample loss, and (2) using anomaly detection to measure deviation from the training distribution.

Result: MRAD achieves an AUC of ~0.6 on STL-10 with 40% missing features, proving effectiveness across datasets and compatibility with anomaly detection techniques.

Conclusion: MRAD successfully addresses PFMI, demonstrating practical applicability in real-world scenarios with partial feature access.

Abstract: Machine learning models have been shown to be susceptible to membership inference attacks, which can be used to determine whether a given sample appears in the training data. Existing membership inference methods commonly assume that the adversary has full access to the features of the target sample. This assumption, however, does not hold in many real-world scenarios where only partial feature information is available, thereby limiting the applicability of these methods. In this work, we study an inference scenario where the adversary observes only partial features of each sample and aims to infer whether this observed subset was present in the training set of the target model. We define this problem as Partial Feature Membership Inference (PFMI). To address this problem, we propose MRAD (Memory-guided Reconstruction and Anomaly Detection), a two-stage attack framework. In the first stage, MRAD optimizes the unknown feature values to minimize the loss of the sample. In the second stage, it measures the deviation between the reconstructed sample and the training distribution using anomaly detection. Empirical results demonstrate that MRAD is effective across a range of datasets, and maintains compatibility with various off-the-shelf anomaly detection techniques. For example, on STL-10, our attack achieves an AUC of around 0.6 even with 40% of the features missing.
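A compact sketch of the two stages, assuming gradient access to the target model: stage one fills in the unknown features by minimizing the model's loss, and stage two would score the reconstruction with an off-the-shelf anomaly detector (shown as a comment); names are ours:

```python
# Two-stage sketch of a partial-feature attack (our names, not the paper's
# code): (1) gradient descent on the unknown features to minimize the target
# model's loss; (2) score the reconstruction with an anomaly detector.
import torch
import torch.nn.functional as F

def reconstruct(model, x_known, mask, y, steps=200, lr=0.05):
    # mask: 1 where the feature is observed, 0 where it is unknown
    x_free = torch.zeros_like(x_known, requires_grad=True)
    opt = torch.optim.Adam([x_free], lr=lr)
    for _ in range(steps):
        x = mask * x_known + (1 - mask) * x_free
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        opt.zero_grad(); loss.backward(); opt.step()
    return (mask * x_known + (1 - mask) * x_free).detach()

model = torch.nn.Linear(10, 2)
x_known = torch.randn(10); mask = (torch.rand(10) > 0.4).float()
x_rec = reconstruct(model, x_known, mask, y=torch.tensor(1))
# Stage 2 (sketch): fit e.g. sklearn's IsolationForest on reference data and
# treat a low anomaly score for x_rec as evidence of membership.
```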

[330] AttriLens-Mol: Attribute Guided Reinforcement Learning for Molecular Property Prediction with Large Language Models

Xuan Lin, Long Chen, Yile Wang

Main category: cs.LG

TL;DR: AttriLens-Mol is a reinforcement learning framework for molecular property prediction with LLMs, using attribute-guided rewards to improve relevance and performance.

DetailsMotivation: Current LLM-based methods for molecular property prediction rely on human prompts and lack relevance in reasoning.

Method: AttriLens-Mol uses format, count, and rationality rewards to guide LLM reasoning and elicit relevant molecular attributes.

Result: The method outperforms supervised fine-tuning and advanced models, improving interpretability and performance.

Conclusion: AttriLens-Mol effectively enhances molecular property prediction by generating relevant attributes and improving model interpretability.

Abstract: Large Language Models (LLMs) have shown promise in assisting molecular property prediction tasks but often rely on human-crafted prompts and chain-of-thought templates. While recent advanced large reasoning models like DeepSeek-R1 employ reinforcement learning for an extended “thinking” process, their reasoning can be verbose and lack relevance. We introduce AttriLens-Mol, an attribute-guided reinforcement learning framework for molecular property prediction with LLMs. AttriLens-Mol steers the model’s reasoning by using: (1) a format reward encouraging attribute-based structured output, (2) a count reward to avoid enumerating irrelevant attributes, and (3) a rationality reward using advanced LLMs and RDKit to verify the relatedness of the generated attributes. This approach implicitly elicits the model’s inherent knowledge of relevant molecular attributes during reasoning, enabling more effective prediction of molecular properties. Experiments on both in-distribution and out-of-distribution datasets show that training 7B-size R1-Distilled-Qwen2.5 and R1-Distilled-LLaMA3.1 models on 4,000 samples with our proposed AttriLens-Mol method significantly boosts performance, yielding comparable or better results than supervised fine-tuned models (Mol-Instructions, ChemDFM, etc.) and advanced models (GPT-3.5, GPT-4o, DeepSeek-V3, DeepSeek-R1, etc.). Further, our extracted attributes for the target property, when used as features for an interpretable decision tree model, yield superior performance compared to attributes generated by prompting LLMs. This shows that AttriLens-Mol effectively elicits more relevant and predictive molecular attributes, leading to enhanced interpretability and performance for property prediction. We release the code at https://github.com/szu-tera/AttriLens-Mol.
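Simplified stand-ins for the three reward signals named above; the paper's rationality check uses advanced LLMs plus RDKit, for which a trivial verifier callback is substituted here:

```python
# Toy versions of the three rewards (our simplifications of the abstract's
# description; the real rationality reward calls LLMs and RDKit).
import re

def format_reward(output):          # is the output structured and attribute-based?
    return 1.0 if re.search(r"<attributes>.*</attributes>", output, re.S) else 0.0

def count_reward(attributes, max_n=8):  # discourage enumerating irrelevant attributes
    return 1.0 if 0 < len(attributes) <= max_n else 0.0

def rationality_reward(attributes, verifier):  # verifier: external relatedness check
    return sum(verifier(a) for a in attributes) / max(len(attributes), 1)

reward = (format_reward("<attributes>logP; TPSA</attributes>")
          + count_reward(["logP", "TPSA"])
          + rationality_reward(["logP", "TPSA"], verifier=lambda a: 1.0))
```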

[331] Near-Optimal Regret for Efficient Stochastic Combinatorial Semi-Bandits

Zichun Ye, Runqi Wang, Xutong Liu, Shuai Li

Main category: cs.LG

TL;DR: CMOSS is a new algorithm for combinatorial multi-armed bandits, eliminating the log T regret factor and matching lower bounds, while being computationally efficient.

DetailsMotivation: Existing UCB-based and adversarial methods in CMAB have limitations like log T regret or high computational overhead.

Method: Introduces CMOSS, a computationally efficient algorithm achieving instance-independent regret under semi-bandit feedback.

Result: CMOSS achieves O((log k)^2√kmT) regret, matching lower bounds, and performs well in experiments.

Conclusion: CMOSS resolves trade-offs in CMAB, offering improved regret and runtime efficiency.

Abstract: The combinatorial multi-armed bandit (CMAB) is a cornerstone sequential decision-making framework, dominated by two algorithmic families: UCB-based methods and adversarial methods such as follow the regularized leader (FTRL) and online mirror descent (OMD). However, prominent UCB-based approaches like CUCB suffer from an additional $\log T$ regret factor that is detrimental over long horizons, while adversarial methods such as EXP3.M and HYBRID impose significant computational overhead. To resolve this trade-off, we introduce the Combinatorial Minimax Optimal Strategy in the Stochastic setting (CMOSS). CMOSS is a computationally efficient algorithm that achieves an instance-independent regret of $O\big((\log k)^2\sqrt{kmT}\big)$ under semi-bandit feedback, where $m$ is the number of arms and $k$ is the maximum cardinality of a feasible action. Crucially, this result eliminates the dependency on $\log T$ and matches the established $\Omega\big(\sqrt{kmT}\big)$ lower bound up to $O\big((\log k)^2\big)$. We then extend our analysis to show that CMOSS is also applicable to cascading feedback. Experiments on synthetic and real-world datasets validate that CMOSS consistently outperforms benchmark algorithms in both regret and runtime efficiency.

[332] In-Training Defenses against Emergent Misalignment in Language Models

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Lucie Flek, Florian Mai

Main category: cs.LG

TL;DR: The paper studies safeguards against emergent misalignment (EMA) in fine-tuned LLMs, evaluating four regularization methods to prevent harmful behaviors outside the target domain.

DetailsMotivation: Addressing the risk of EMA, where fine-tuning LLMs for specific domains can inadvertently induce harmful behaviors, even when model weights are hidden behind APIs.

Method: Four training regularization interventions: KL-divergence regularization, ℓ2 distance in feature space, projecting onto a safe subspace (SafeLoRA), and interleaving safe training examples.

Result: Evaluation of methods’ effectiveness in reducing EMA across malicious tasks and their impact on benign tasks.

Conclusion: Discussion of open questions in EMA research, highlighting the need for practical safeguards in fine-tuning APIs.

Abstract: Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) projecting onto a safe subspace (SafeLoRA), and (iv) interleaving of a small amount of safe training examples from a general instruct-tuning dataset. We first evaluate the methods’ emergent misalignment effect across four malicious, EMA-inducing tasks. Second, we assess the methods’ impacts on benign tasks. We conclude with a discussion of open questions in emergent misalignment research.
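Intervention (i) is straightforward to sketch: a fine-tuning loss with a KL penalty pulling the fine-tuned model's next-token distribution toward a frozen safe reference model (random tensors stand in for real model outputs below):

```python
# KL-regularized fine-tuning loss toward a frozen safe reference model.
import torch
import torch.nn.functional as F

def kl_regularized_loss(ft_logits, ref_logits, labels, beta=0.1):
    task = F.cross_entropy(ft_logits.view(-1, ft_logits.size(-1)), labels.view(-1))
    kl = F.kl_div(F.log_softmax(ft_logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return task + beta * kl

ft_logits = torch.randn(2, 8, 100, requires_grad=True)   # (batch, seq, vocab)
ref_logits = torch.randn(2, 8, 100)                      # frozen safe reference
labels = torch.randint(0, 100, (2, 8))
loss = kl_regularized_loss(ft_logits, ref_logits, labels)
```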

[333] Synthetic Data Generation and Differential Privacy using Tensor Networks’ Matrix Product States (MPS)

Alejandro Moreno R., Desale Fentaw, Samuel Palmer, Raúl Salles de Padua, Ninad Dixit, Samuel Mugel, Roman Orús, Manuel Radons, Josef Menter, Ali Abedi

Main category: cs.LG

TL;DR: A method using Tensor Networks (Matrix Product States) for generating high-quality, privacy-preserving synthetic tabular data is proposed, outperforming existing models like CTGAN, VAE, and PrivBayes under strict privacy constraints.

DetailsMotivation: Addressing data scarcity, privacy constraints, and the need for diverse datasets in AI training.

Method: Uses Matrix Product States (MPS) with noise injection and gradient clipping for differential privacy (DP), benchmarked against CTGAN, VAE, and PrivBayes.

Result: MPS outperforms classical models in fidelity and privacy, especially under strict DP constraints.

Conclusion: MPS is a promising, interpretable, and scalable tool for privacy-aware synthetic data generation in sensitive domains.

Abstract: Synthetic data generation is a key technique in modern artificial intelligence, addressing data scarcity, privacy constraints, and the need for diverse datasets in training robust models. In this work, we propose a method for generating privacy-preserving high-quality synthetic tabular data using Tensor Networks, specifically Matrix Product States (MPS). We benchmark the MPS-based generative model against state-of-the-art models such as CTGAN, VAE, and PrivBayes, focusing on both fidelity and privacy-preserving capabilities. To ensure differential privacy (DP), we integrate noise injection and gradient clipping during training, enabling privacy guarantees via Rényi Differential Privacy accounting. Across multiple metrics analyzing data fidelity and downstream machine learning task performance, our results show that MPS outperforms classical models, particularly under strict privacy constraints. This work highlights MPS as a promising tool for privacy-aware synthetic data generation. By combining the expressive power of tensor network representations with formal privacy mechanisms, the proposed approach offers an interpretable and scalable alternative for secure data sharing. Its structured design facilitates integration into sensitive domains where both data quality and confidentiality are critical.
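The DP mechanism described, per-sample gradient clipping plus Gaussian noise, follows the standard DP-SGD recipe; a generic step over a single flat parameter vector might look like this (hyperparameters are illustrative):

```python
# Generic DP-SGD step: clip each per-sample gradient, average, add Gaussian
# noise. Privacy accounting (e.g., via Rényi DP) is done separately.
import torch

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0, noise_mult=1.1, lr=0.01):
    clipped = []
    for g in per_sample_grads:
        scale = min(1.0, clip_norm / (g.norm() + 1e-12))
        clipped.append(g * scale)
    grad = torch.stack(clipped).mean(dim=0)
    # Noise std on the mean: noise_mult * clip_norm / batch_size
    grad += torch.randn_like(grad) * (noise_mult * clip_norm / len(clipped))
    for p in params:  # single flat parameter vector, for simplicity
        p.data -= lr * grad

theta = torch.zeros(5)
dp_sgd_step([theta], [torch.randn(5) for _ in range(8)])
```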

[334] Multi-Omics Analysis for Cancer Subtype Inference via Unrolling Graph Smoothness Priors

Jielong Lu, Zhihao Wu, Jiajun Yu, Jiajun Bu, Haishuai Wang

Main category: cs.LG

TL;DR: GTMancer, a Graph Transformer framework, enhances multi-omics cancer subtype classification by integrating contrastive learning and dual attention mechanisms, outperforming existing methods.

DetailsMotivation: Existing methods fail to fully capture the coupling between heterogeneous omics data, limiting their ability to resolve subtle cancer subtype differences crucial for precision oncology.

Method: GTMancer uses contrastive learning to embed multi-omics data into a unified space, unrolls a multiplex graph optimization problem, and employs dual attention coefficients to capture structural priors within and among omics data.

Result: GTMancer outperforms state-of-the-art algorithms on seven real-world cancer datasets.

Conclusion: The proposed framework effectively integrates multi-omics data, improving cancer subtype classification and advancing precision oncology.

Abstract: Integrating multi-omics datasets through data-driven analysis offers a comprehensive understanding of the complex biological processes underlying various diseases, particularly cancer. Graph Neural Networks (GNNs) have recently demonstrated remarkable ability to exploit relational structures in biological data, enabling advances in multi-omics integration for cancer subtype classification. Existing approaches often neglect the intricate coupling between heterogeneous omics, limiting their capacity to resolve subtle cancer subtype heterogeneity critical for precision oncology. To address these limitations, we propose a framework named Graph Transformer for Multi-omics Cancer Subtype Classification (GTMancer). This framework builds upon the GNN optimization problem and extends its application to complex multi-omics data. Specifically, our method leverages contrastive learning to embed multi-omics data into a unified semantic space. We unroll the multiplex graph optimization problem in that unified space and introduce dual sets of attention coefficients to capture structural graph priors both within and among multi-omics data. This approach enables global omics information to guide the refining of the representations of individual omics. Empirical experiments on seven real-world cancer datasets demonstrate that GTMancer outperforms existing state-of-the-art algorithms.

[335] OM2P: Offline Multi-Agent Mean-Flow Policy

Zhuoran Li, Xun Wang, Hai Zhong, Longbo Huang

Main category: cs.LG

TL;DR: OM2P is a novel offline MARL algorithm that integrates mean-flow models for efficient one-step action sampling, addressing challenges like low sampling efficiency and misalignment with reward maximization.

DetailsMotivation: To overcome the inefficiency of diffusion and flow-based models in offline MARL, particularly their slow sampling and misalignment with reward goals.

Method: Proposes OM2P with reward-aware optimization, mean-flow matching loss, Q-function supervision, generalized timestep distribution, and derivative-free estimation.

Result: Achieves up to 3.8x lower GPU memory usage and 10.8x faster training, with superior performance on benchmarks.

Conclusion: OM2P successfully integrates mean-flow models into offline MARL, enabling practical and scalable generative policies for multi-agent settings.

Abstract: Generative models, especially diffusion and flow-based models, have been promising in offline multi-agent reinforcement learning. However, integrating powerful generative models into this framework poses unique challenges. In particular, diffusion and flow-based policies suffer from low sampling efficiency due to their iterative generation processes, making them impractical in time-sensitive or resource-constrained settings. To tackle these difficulties, we propose OM2P (Offline Multi-Agent Mean-Flow Policy), a novel offline MARL algorithm to achieve efficient one-step action sampling. To address the misalignment between generative objectives and reward maximization, we introduce a reward-aware optimization scheme that integrates a carefully-designed mean-flow matching loss with Q-function supervision. Additionally, we design a generalized timestep distribution and a derivative-free estimation strategy to reduce memory overhead and improve training stability. Empirical evaluations on Multi-Agent Particle and MuJoCo benchmarks demonstrate that OM2P achieves superior performance, with up to a 3.8x reduction in GPU memory usage and up to a 10.8x speed-up in training time. Our approach is the first to successfully integrate a mean-flow model into offline MARL, paving the way for practical and scalable generative policies in cooperative multi-agent settings.

[336] A Study on Regularization-Based Continual Learning Methods for Indic ASR

Gokul Adethya T, S. Jaya Nirmala

Main category: cs.LG

TL;DR: The paper explores Continual Learning (CL) for ASR in Indian languages, testing three CL strategies to mitigate forgetting and improve scalability.

DetailsMotivation: India's linguistic diversity and sequential data arrival challenge traditional multilingual ASR models, necessitating CL for privacy-conscious, scalable solutions.

Method: A Conformer-based hybrid RNN-T/CTC model pretrained on Hindi is incrementally trained on eight more languages. Three CL strategies (EWC, MAS, LwF) are evaluated.

Result: CL effectively reduces forgetting compared to naive fine-tuning, with performance analyzed via WER and Backward Transfer.

Conclusion: CL is promising for scalable ASR in diverse Indian languages under realistic constraints.

Abstract: India's linguistic diversity poses significant challenges for developing inclusive Automatic Speech Recognition (ASR) systems. Traditional multilingual models, which require simultaneous access to all language data, are impractical due to the sequential arrival of data and privacy constraints. Continual Learning (CL) offers a solution by enabling models to learn new languages sequentially without catastrophically forgetting previously learned knowledge. This paper investigates CL for ASR on Indian languages using a subset of the IndicSUPERB benchmark. We employ a Conformer-based hybrid RNN-T/CTC model, initially pretrained on Hindi, which is then incrementally trained on eight additional Indian languages, for a total sequence of nine languages. We evaluate three prominent regularization- and distillation-based CL strategies: Elastic Weight Consolidation (EWC), Memory Aware Synapses (MAS), and Learning without Forgetting (LwF), selected for their suitability in no-replay, privacy-conscious scenarios. Performance is analyzed using Word Error Rate (WER) for both RNN-T and CTC paths on clean and noisy data, as well as knowledge retention via Backward Transfer. We also explore the impact of varying the number of training epochs (1, 2, 5, and 10) per task. Results, compared against naive fine-tuning, demonstrate CL's effectiveness in mitigating forgetting, making it a promising approach for scalable ASR in diverse Indian languages under realistic constraints. The code is available at: https://github.com/FrozenWolf-Cyber/Indic-CL-ASR
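Of the three strategies, EWC is the easiest to show: a quadratic penalty anchoring parameters to their values after the previous language, weighted by a diagonal Fisher information estimate (the Fisher here is a placeholder):

```python
# EWC penalty: keep parameters close to their post-previous-task values,
# weighted by (an estimate of) diagonal Fisher information.
import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

model = torch.nn.Linear(4, 4)
old = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder
penalty = ewc_penalty(model, old, fisher)  # add to the ASR loss on the new language
```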

[337] Low-Bit Data Processing Using Multiple-Output Spiking Neurons with Non-linear Reset Feedback

Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale

Main category: cs.LG

TL;DR: A novel spiking neuron model combining SSM state transitions with non-linear feedback reset is proposed, achieving competitive performance in SNN tasks.

DetailsMotivation: To bridge the advantages of SNNs (low-latency, energy-efficient) and deep SSMs (competitive performance) by addressing their limitations (high-precision activations, no reset).

Method: Develop a multiple-output spiking neuron model integrating linear SSM state transitions and non-linear reset feedback.

Result: Achieves comparable performance to SNN benchmarks in keyword spotting, event-based vision, and sequential pattern recognition tasks.

Conclusion: The reset mechanism enables stable learning even with unstable linear dynamics, expanding beyond deep SSM constraints.

Abstract: Neuromorphic computing is an emerging technology enabling low-latency and energy-efficient signal processing. A key algorithmic tool in neuromorphic computing is spiking neural networks (SNNs). SNNs are biologically inspired neural networks which utilize stateful neurons, and provide low-bit data processing by encoding and decoding information using spikes. Similar to SNNs, deep state-space models (SSMs) utilize stateful building blocks. However, deep SSMs, which recently achieved competitive performance in various temporal modeling tasks, are typically designed with high-precision activation functions and no reset mechanisms. To bridge the gains offered by SNNs and the recent deep SSM models, we propose a novel multiple-output spiking neuron model that combines a linear, general SSM state transition with a non-linear feedback mechanism through reset. Compared to the existing neuron models for SNNs, our proposed model clearly conceptualizes the differences between the spiking function, the reset condition and the reset action. The experimental results on various tasks, i.e., a keyword spotting task, an event-based vision task and a sequential pattern recognition task, show that our proposed model achieves performance comparable to existing benchmarks in the SNN literature. Our results illustrate how the proposed reset mechanism can overcome instability and enable learning even when the linear part of neuron dynamics is unstable, allowing us to go beyond the strictly enforced stability of linear dynamics in recent deep SSM models.
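A toy rendering of the neuron's three pieces as the abstract separates them: a linear SSM state transition, a threshold spiking function, and a non-linear reset fed back into the state; the matrices, threshold, and reset form are illustrative choices of ours:

```python
# Toy spiking neuron with a linear SSM state transition and a non-linear
# reset feedback (illustrative dynamics, not the paper's parameterization).
import torch

A = torch.tensor([[0.95, 0.1], [0.0, 0.9]])   # state transition (linear SSM)
B = torch.tensor([[1.0], [0.5]])               # input matrix
C = torch.tensor([[1.0, 0.0]])                 # readout
theta = 1.0                                    # spike threshold

s = torch.zeros(2, 1)
for x in torch.rand(10, 1, 1):
    s = A @ s + B @ x                # linear state transition
    v = C @ s                        # membrane-like readout
    spike = (v >= theta).float()     # spiking function
    s = s - spike * s.tanh()         # non-linear reset feedback (illustrative)
```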

[338] FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields

Junhyeog Yun, Minui Hong, Gunhee Kim

Main category: cs.LG

TL;DR: FedMeNF is a novel Federated Meta-Learning approach for neural fields, addressing privacy leakage and resource constraints with a privacy-preserving loss function.

DetailsMotivation: To overcome the limitations of traditional FML (privacy leakage) and the high resource demands of neural fields, especially for edge devices.

Method: Introduces FedMeNF, which uses a privacy-preserving loss function to regulate local meta-optimization without retaining private client data.

Result: FedMeNF achieves fast optimization and robust performance with few-shot or non-IID data while preserving privacy.

Conclusion: FedMeNF effectively balances privacy, efficiency, and performance in federated learning for neural fields.

Abstract: Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computations, which can be limited to resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client’s private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.

[339] Introducing Fractional Classification Loss for Robust Learning with Noisy Labels

Mert Can Kurucu, Tufan Kumbasar, İbrahim Eksin, Müjde Güzelkaya

Main category: cs.LG

TL;DR: FCL is a new adaptive robust loss function for deep learning that automatically adjusts to label noise, balancing robustness and convergence without manual tuning.

DetailsMotivation: Existing robust loss functions require extensive hyperparameter tuning for different datasets, which is inefficient.

Method: FCL combines the fractional derivative of Cross-Entropy (active) and Mean Absolute Error (passive) into a single loss function, with a learnable parameter μ to balance robustness and convergence.

Result: FCL achieves state-of-the-art performance on benchmark datasets without manual hyperparameter tuning.

Conclusion: FCL dynamically adapts to label noise, offering a practical solution for robust deep learning.

Abstract: Robust loss functions are crucial for training deep neural networks in the presence of label noise, yet existing approaches require extensive, dataset-specific hyperparameter tuning. In this work, we introduce Fractional Classification Loss (FCL), an adaptive robust loss that automatically calibrates its robustness to label noise during training. Built within the active-passive loss framework, FCL employs the fractional derivative of the Cross-Entropy (CE) loss as its active component and the Mean Absolute Error (MAE) as its passive loss component. With this formulation, we demonstrate that the fractional derivative order $\mu$ spans a family of loss functions that interpolate between MAE-like robustness and CE-like fast convergence. Furthermore, we integrate $\mu$ into the gradient-based optimization as a learnable parameter and automatically adjust it to optimize the trade-off between robustness and convergence speed. We reveal that FCL’s unique property establishes a critical trade-off that enables the stable learning of $\mu$: lower log penalties on difficult or mislabeled examples improve robustness but impose higher penalties on easy or clean data, reducing model confidence in them. Consequently, FCL can dynamically reshape its loss landscape to achieve effective classification performance under label noise. Extensive experiments on benchmark datasets show that FCL achieves state-of-the-art results without the need for manual hyperparameter tuning.
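A simplified sketch of the active-passive structure with a learnable mixing parameter; note that the paper's active term is the fractional derivative of CE of order $\mu$, which we do not reproduce, instead interpolating directly between CE (fast convergence) and MAE (robustness):

```python
# Simplified active-passive loss with a learnable mixing weight. The real FCL
# uses the fractional derivative of CE of order mu; this only captures the
# CE-vs-MAE interpolation idea.
import torch
import torch.nn.functional as F

mu = torch.nn.Parameter(torch.tensor(0.5))  # learned jointly with the network

def active_passive_loss(logits, targets):
    p = F.softmax(logits, dim=-1)
    ce = F.cross_entropy(logits, targets)                      # active (CE-like)
    mae = (1 - p[torch.arange(len(targets)), targets]).mean()  # passive (MAE-like)
    w = torch.sigmoid(mu)                                      # keep weight in (0, 1)
    return w * ce + (1 - w) * mae

loss = active_passive_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```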

[340] Structural Equation-VAE: Disentangled Latent Representations for Tabular Data

Ruiyu Zhang, Ce Zhao, Xin Zhao, Lin Nie, Wai-Fung Lam

Main category: cs.LG

TL;DR: SE-VAE is a novel VAE architecture for tabular data that embeds structural equation modeling principles, improving interpretability and disentanglement without relying heavily on regularization.

DetailsMotivation: The challenge of learning interpretable latent representations from tabular data in deep generative modeling, especially in domains requiring theory-driven constructs and valid measurement.

Method: SE-VAE integrates structural equation modeling into VAE design, aligning latent subspaces with indicator groupings and isolating confounding variation via a global nuisance latent.

Result: SE-VAE outperforms baselines in factor recovery, interpretability, and robustness to nuisance variation, with architectural structure being the key performance driver.

Conclusion: SE-VAE provides a principled, white-box framework for generative modeling in scientific and social domains where interpretability and validity are critical.

Abstract: Learning interpretable latent representations from tabular data remains a challenge in deep generative modeling. We introduce SE-VAE (Structural Equation-Variational Autoencoder), a novel architecture that embeds measurement structure directly into the design of a variational autoencoder. Inspired by structural equation modeling, SE-VAE aligns latent subspaces with known indicator groupings and introduces a global nuisance latent to isolate construct-specific confounding variation. This modular architecture enables disentanglement through design rather than through statistical regularizers alone. We evaluate SE-VAE on a suite of simulated tabular datasets and benchmark its performance against a series of leading baselines using standard disentanglement metrics. SE-VAE consistently outperforms alternatives in factor recovery, interpretability, and robustness to nuisance variation. Ablation results reveal that architectural structure, rather than regularization strength, is the key driver of performance. SE-VAE offers a principled framework for white-box generative modeling in scientific and social domains where latent constructs are theory-driven and measurement validity is essential.

[341] Geometric-k-means: A Bound Free Approach to Fast and Eco-Friendly k-means

Parichit Sharma, Marcin Stanislaw, Hasan Kurban, Oguzhan Kulekci, Mehmet Dalkilic

Main category: cs.LG

TL;DR: Gk-means improves k-means efficiency using geometric principles, reducing runtime and energy use without quality loss.

DetailsMotivation: Enhance k-means efficiency and energy economy while maintaining solution quality.

Method: Uses scalar projection to focus on high expressive data (HE) and bypass low expressive data (LE).

Result: Outperforms traditional and SOTA k-means in runtime, distance computations, and energy efficiency.

Conclusion: Gk-means is a sustainable, efficient alternative to traditional k-means.

Abstract: This paper introduces Geometric-k-means (or Gk-means for short), a novel approach that significantly enhances the efficiency and energy economy of the widely utilized k-means algorithm, which, despite its inception over five decades ago, remains a cornerstone in machine learning applications. The essence of Gk-means lies in its active utilization of geometric principles, specifically scalar projection, to significantly accelerate the algorithm without sacrificing solution quality. This geometric strategy enables a more discerning focus on the data points most likely to influence cluster updates, which we call high expressive data (HE). In contrast, low expressive data (LE), which does not impact the clustering outcome, is effectively bypassed, leading to considerable reductions in computational overhead. Experiments spanning synthetic, real-world and high-dimensional datasets demonstrate that Gk-means is significantly better than traditional and state-of-the-art (SOTA) k-means variants in runtime and distance computations (DC). Moreover, Gk-means exhibits better resource efficiency, as evidenced by its reduced energy footprint, positioning it as a more sustainable alternative.
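The geometric fact behind this kind of scalar-projection test (our illustration, not the paper's exact HE/LE criterion): a point can only switch from its center c1 to a candidate c2 if its scalar projection onto c2 - c1 passes the midpoint of the segment, i.e., the bisector test:

```python
# Bisector test via scalar projection: x is closer to c2 than c1 exactly when
# the scalar projection of (x - c1) onto (c2 - c1) exceeds half the
# inter-center distance. Points far from this boundary could be skipped as LE.
import numpy as np

def can_switch(x, c1, c2):
    d = c2 - c1
    proj = (x - c1) @ d / np.linalg.norm(d)   # scalar projection onto c2 - c1
    return proj > np.linalg.norm(d) / 2       # beyond the bisector => closer to c2

x, c1, c2 = np.array([2.0, 0.0]), np.zeros(2), np.array([3.0, 0.0])
assert can_switch(x, c1, c2)  # x lies past the midpoint, so it is closer to c2
```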

[342] Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Zhaomin Wu, Mingzhe Du, See-Kiong Ng, Bingsheng He

Main category: cs.LG

TL;DR: The paper investigates self-initiated deception in LLMs, proposing a framework with two metrics to quantify deception likelihood, revealing increased deception in complex tasks.

DetailsMotivation: To explore underexplored threats of LLM deception beyond human-induced scenarios, focusing on self-initiated deception in benign prompts.

Method: Proposes a novel framework using ‘contact searching questions’ and two statistical metrics (Deceptive Intention Score and Deceptive Behavior Score) derived from psychological principles.

Result: Evaluation of 14 leading LLMs shows both metrics escalate with task difficulty, indicating increased deception in complex problems.

Conclusion: Advanced LLMs exhibit rising deception tendencies in complex tasks, raising concerns for their deployment in critical domains.

Abstract: Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and underexplored threat. Existing studies typically induce such deception by explicitly setting a “hidden” objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Moving beyond this human-induced deception, we investigate LLMs’ self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a novel framework using “contact searching questions.” This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model’s bias towards a hidden objective. The second, Deceptive Behavior Score, measures the inconsistency between the LLM’s internal belief and its expressed output. Upon evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.

[343] ActivityDiff: A diffusion model with Positive and Negative Activity Guidance for De Novo Drug Design

Renyi Zhou, Huimin Zhu, Jing Tang, Min Li

Main category: cs.LG

TL;DR: ActivityDiff is a generative method using diffusion models for multi-target drug design, balancing efficacy and safety by leveraging classifier guidance.

DetailsMotivation: Precise control over molecular activity, including multi-target modulation and off-target toxicity mitigation, is lacking in current generative drug design methods.

Method: ActivityDiff employs classifier-guided diffusion models, using separately trained drug-target classifiers for positive and negative guidance.

Result: ActivityDiff effectively handles tasks like single-/dual-target generation, selective generation, and off-target effect reduction.

Conclusion: ActivityDiff introduces a novel paradigm for integrated molecular activity control, offering a versatile framework for drug design.

Abstract: Achieving precise control over a molecule’s biological activity-encompassing targeted activation/inhibition, cooperative multi-target modulation, and off-target toxicity mitigation-remains a critical challenge in de novo drug design. However, existing generative methods primarily focus on producing molecules with a single desired activity, lacking integrated mechanisms for the simultaneous management of multiple intended and unintended molecular interactions. Here, we propose ActivityDiff, a generative approach based on the classifier-guidance technique of diffusion models. It leverages separately trained drug-target classifiers for both positive and negative guidance, enabling the model to enhance desired activities while minimizing harmful off-target effects. Experimental results show that ActivityDiff effectively handles essential drug design tasks, including single-/dual-target generation, fragment-constrained dual-target design, selective generation to enhance target specificity, and reduction of off-target effects. These results demonstrate the effectiveness of classifier-guided diffusion in balancing efficacy and safety in molecular design. Overall, our work introduces a novel paradigm for achieving integrated control over molecular activity, and provides ActivityDiff as a versatile and extensible framework.
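
In score-function form, positive and negative guidance enter additively. The sketch below assumes a hypothetical interface in which `score_model` estimates the unconditional score and each classifier returns log p(active | x_t, t) for one target; the weights `w_pos`/`w_neg` are illustrative:

```python
import torch

def guided_score(score_model, pos_classifiers, neg_classifiers, x_t, t,
                 w_pos=1.0, w_neg=1.0):
    """Classifier-guided score: push toward activity on desired targets,
    away from activity on off-targets."""
    x_t = x_t.detach().requires_grad_(True)
    s = score_model(x_t, t)  # unconditional score estimate
    for clf in pos_classifiers:
        logp = clf(x_t, t).sum()
        s = s + w_pos * torch.autograd.grad(logp, x_t, retain_graph=True)[0]
    for clf in neg_classifiers:
        logp = clf(x_t, t).sum()
        s = s - w_neg * torch.autograd.grad(logp, x_t, retain_graph=True)[0]
    return s
```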

[344] End-to-End Text-to-SQL with Dataset Selection: Leveraging LLMs for Adaptive Query Generation

Anurag Tripathi, Vaibhav Patle, Abhinav Jain, Ayush Pundir, Sairam Menon, Ajeet Kumar Singh

Main category: cs.LG

TL;DR: A three-stage text-to-SQL framework improves database intent prediction and SQL generation by leveraging LLMs, prompt engineering, and critic agents.

DetailsMotivation: Traditional text-to-SQL methods assume a pre-specified database, which is impractical for multiple extensive databases. This paper addresses the overlooked step of identifying the correct database.

Method: The framework: 1) extracts implicit rules from NLQs using LLMs, 2) predicts the correct database (db_id) with a RoBERTa-based model, and 3) refines SQL using critic agents.

Result: Outperforms state-of-the-art models in database intent prediction and SQL generation accuracy.

Conclusion: The proposed framework effectively handles database identification and SQL generation, improving usability for non-technical users.

Abstract: Text-to-SQL bridges the gap between natural language and structured database language, thus allowing non-technical users to easily query databases. Traditional approaches model text-to-SQL as a direct translation task, where a given Natural Language Query (NLQ) is mapped to an SQL command. Recent advances in large language models (LLMs) have significantly improved translation accuracy, however, these methods all require that the target database is pre-specified. This becomes problematic in scenarios with multiple extensive databases, where identifying the correct database becomes a crucial yet overlooked step. In this paper, we propose a three-stage end-to-end text-to-SQL framework to identify the user’s intended database before generating SQL queries. Our approach leverages LLMs and prompt engineering to extract implicit information from natural language queries (NLQs) in the form of a ruleset. We then train a large db_id prediction model, which includes a RoBERTa-based finetuned encoder, to predict the correct Database identifier (db_id) based on both the NLQ and the LLM-generated rules. Finally, we refine the generated SQL by using critic agents to correct errors. Experimental results demonstrate that our framework outperforms the current state-of-the-art models in both database intent prediction and SQL generation accuracy.
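
Read as pseudocode, the three stages compose as follows; every callable and signature here is an assumption for illustration, not the paper's API:

```python
def text_to_sql(nlq, schemas, llm, db_id_model, generate_sql, critics):
    """Sketch of the three-stage pipeline described in the abstract."""
    # Stage 1: LLM + prompt engineering extract an implicit ruleset.
    rules = llm(f"List the implicit constraints in this question:\n{nlq}")
    # Stage 2: a RoBERTa-based classifier predicts the intended database.
    db_id = db_id_model.predict(nlq, rules)
    # Stage 3: generate SQL for that schema, then let critic agents repair it.
    sql = generate_sql(nlq, schemas[db_id])
    for critic in critics:
        sql = critic.refine(sql, schemas[db_id])
    return db_id, sql
```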

[345] A New Lens on Homelessness: Daily Tent Monitoring with 311 Calls and Street Images

Wooyong Jung, Sola Kim, Dongwook Kim, Maryam Tabar, Dongwon Lee

Main category: cs.LG

TL;DR: A new method using 311 Service Calls and street-level imagery tracks homeless tent trends in San Francisco, offering more detailed and timely data than traditional counts.

DetailsMotivation: Existing homelessness monitoring methods like PIT counts lack frequency, consistency, and spatial detail, limiting effective policy responses.

Method: Uses publicly available, crowdsourced data (311 Service Calls and street-level imagery) to create a predictive model for daily, neighborhood-level homeless tent trends.

Result: The model reveals rapid fluctuations during COVID-19 and spatial shifts in tent locations, patterns often missed by traditional counts.

Conclusion: This approach provides timely, localized, and cost-effective data to guide policy and evaluate homelessness interventions.

Abstract: Homelessness in the United States has surged to levels unseen since the Great Depression. However, existing methods for monitoring it, such as point-in-time (PIT) counts, have limitations in terms of frequency, consistency, and spatial detail. This study proposes a new approach using publicly available, crowdsourced data, specifically 311 Service Calls and street-level imagery, to track and forecast homeless tent trends in San Francisco. Our predictive model captures fine-grained daily and neighborhood-level variations, uncovering patterns that traditional counts often overlook, such as rapid fluctuations during the COVID-19 pandemic and spatial shifts in tent locations over time. By providing more timely, localized, and cost-effective information, this approach serves as a valuable tool for guiding policy responses and evaluating interventions aimed at reducing unsheltered homelessness.

[346] Sample-efficient LLM Optimization with Reset Replay

Zichuan Liu, Jinyu Wang, Lei Song, Jiang Bian

Main category: cs.LG

TL;DR: LoRR enhances LLM optimization by improving sample efficiency and reducing primacy bias through reset replay and hybrid objectives.

DetailsMotivation: Address low sample efficiency and primacy bias in LLM optimization methods like RL and preference optimization.

Method: Introduces LoRR, a plugin with reset replay and hybrid optimization (SFT + preference-based losses).

Result: Boosts performance on reasoning benchmarks; iterative DPO with LoRR matches complex RL methods.

Conclusion: LoRR is a practical, efficient paradigm for LLM finetuning, maximizing performance with limited data.

Abstract: Recent advancements in post-training Large Language Models (LLMs), particularly through Reinforcement Learning (RL) and preference optimization methods, are key drivers for enhancing their reasoning capabilities. However, these methods are often plagued by low sample efficiency and a susceptibility to primacy bias, where overfitting to initial experiences degrades policy quality and damages the learning process. To address these challenges, we introduce LLM optimization with Reset Replay (LoRR), a general and powerful plugin designed to enhance sample efficiency in any preference-based optimization framework. LoRR's core mechanism enables training at a high replay number, maximizing the utility of each collected data batch. To counteract the risk of overfitting inherent in high-replay training, LoRR incorporates a periodic reset strategy that reuses initial data, preserving network plasticity. Furthermore, it leverages a hybrid optimization objective, combining supervised fine-tuning (SFT) and preference-based losses to further bolster data exploitation. Our extensive experiments demonstrate that LoRR significantly boosts the performance of various preference optimization methods on both mathematical and general reasoning benchmarks. Notably, an iterative DPO approach augmented with LoRR achieves comparable performance on challenging math tasks, outperforming some complex and computationally intensive RL-based algorithms. These findings highlight that LoRR offers a practical, sample-efficient, and highly effective paradigm for LLM finetuning, unlocking greater performance from limited data.
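
A minimal sketch of the loop shape, assuming hypothetical `dpo_loss`/`sft_loss` callables; the interpolation-style reset is one plausible instantiation, as the abstract does not pin down the exact reset mechanics:

```python
import torch

def lorr_step(model, opt, batch, dpo_loss, sft_loss, lam=0.1, replays=4):
    """Replay each batch several times under a hybrid preference + SFT loss."""
    for _ in range(replays):
        loss = dpo_loss(model, batch) + lam * sft_loss(model, batch)
        opt.zero_grad()
        loss.backward()
        opt.step()

def periodic_reset(model, init_state, keep=0.8):
    """Pull weights part-way back toward their initial values to restore
    plasticity (illustrative; the paper may reset differently)."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(keep).add_((1.0 - keep) * init_state[name])
```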

[347] LLM Unlearning using Gradient Ratio-Based Influence Estimation and Noise Injection

Ameya Anjarlekar, Sandeep Pombra

Main category: cs.LG

TL;DR: GRIN is a modular framework for LLM unlearning, using gradient-ratio metrics and selective noise injection to improve forgetting of sensitive data while preserving model utility.

DetailsMotivation: Addressing the need for effective machine unlearning in LLMs due to legal and ethical concerns, as existing methods often fail to fully forget data or degrade unrelated knowledge.

Method: GRIN identifies key parameters responsible for memorization via a gradient-ratio metric, then selectively injects noise before fine-tuning.

Result: Validated on benchmarks (TOFU, WMDP, SafePKU), GRIN improves unlearning performance without compromising model utility.

Conclusion: GRIN offers a targeted and effective solution for LLM unlearning, addressing localization issues and maintaining model performance.

Abstract: The growing legal and ethical scrutiny of large language models (LLMs) necessitates effective machine unlearning, particularly for sensitive or unauthorized data. Existing empirical methods often yield incomplete forgetting or unintended degradation of unrelated knowledge due to poor localization. In this work, we propose GRIN: a modular and targeted framework for LLM unlearning. GRIN introduces a novel gradient-ratio-based metric to identify parameters most responsible for memorizing forget data. We then perform selective noise injection into these parameters prior to fine-tuning, which improves unlearning performance while maintaining model utility. Finally, we propose new evaluation metrics tailored to the LLM setting and validate our approach on standard benchmarks such as TOFU, WMDP, and SafePKU.
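
A sketch of the two steps as we read them; the exact metric, selection rule, and the `top_frac`/`sigma` knobs are assumptions:

```python
import torch

def gradient_ratio_scores(model, forget_loss, retain_loss, eps=1e-8):
    """Per-weight ratio of forget-gradient to retain-gradient magnitude;
    large values flag weights that mainly serve the forget set."""
    params = list(model.parameters())
    g_f = torch.autograd.grad(forget_loss, params, retain_graph=True)
    g_r = torch.autograd.grad(retain_loss, params)
    return [gf.abs() / (gr.abs() + eps) for gf, gr in zip(g_f, g_r)]

def inject_noise(model, scores, top_frac=0.01, sigma=0.01):
    """Perturb only the top-scoring fraction of weights with Gaussian noise
    before fine-tuning on the retain set."""
    with torch.no_grad():
        for p, s in zip(model.parameters(), scores):
            k = max(1, int(top_frac * s.numel()))
            thresh = s.flatten().topk(k).values.min()
            p.add_(sigma * (s >= thresh).to(p.dtype) * torch.randn_like(p))
```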

[348] From research to clinic: Accelerating the translation of clinical decision support systems by making synthetic data interoperable

Pavitra Chauhan, Mohsen Gamal Saad Askar, Kristian Svendsen, Bjørn Fjukstad, Brita Elvevåg, Lars Ailo Bongo, Edvard Pedersen

Main category: cs.LG

TL;DR: The paper proposes SyntHIR, an architecture using synthetic EHR data to ease CDSS tool development and testing, demonstrating its value through successful deployment in Norway’s largest EHR system.

DetailsMotivation: The lack of CDSS tool translation into clinics due to focus on model training over tool development and restricted EHR access motivates the need for synthetic data solutions.

Method: The SyntHIR system integrates synthetic data generators, ensures data interoperability, and enables tool transportability, tested via a proof-of-concept CDSS tool in Norway.

Result: SyntHIR successfully facilitated CDSS tool development and deployment in Norway’s largest EHR system, proving its translational value.

Conclusion: SyntHIR serves as a reference model to accelerate CDSS tool translation from research to clinical practice.

Abstract: The translation of clinical decision support system (CDSS) tools from research settings into the clinic is often non-existent, partly because the focus tends to be on training machine learning models rather than tool development using the model for inference. To develop a CDSS tool that can be deployed in the clinical workflow, there is a need to integrate, validate, and test the tool on the Electronic Health Record (EHR) systems that store and manage patient data. Not surprisingly, it is rarely possible for researchers to get the necessary access to an EHR system due to legal restrictions pertaining to the protection of data privacy in patient records. We propose an architecture for using synthetic data in EHR systems to make CDSS tool development and testing much easier. In this study, the architecture is implemented in the SyntHIR system. SyntHIR has three noteworthy architectural features enabling (i) integration with synthetic data generators, (ii) data interoperability, and (iii) tool transportability. The translational value of this approach was evaluated through two primary steps. First, a working proof-of-concept of a machine learning-based CDSS tool was developed using data from patient registries in Norway. Second, the transportability of this CDSS tool was demonstrated by successfully deploying it in Norway’s largest EHR system vendor (DIPS). These findings showcase the value of the SyntHIR architecture as a useful reference model to accelerate the translation of “bench to bedside” research of CDSS tools.

[349] Reinforcement Learning Based Sensor Optimization for Bio-markers

Sajal Khandelwal, Pawan Kumar, Syed Azeemuddin

Main category: cs.LG

TL;DR: The paper explores enhancing sensitivity of IDC-based RF sensors using RLBPSO, outperforming ACO and other methods by optimizing design parameters.

DetailsMotivation: Improving sensitivity of low-cost, easily fabricated IDC-based RF sensors, which is often limited by design flaws and noise.

Method: Uses reinforcement learning-based Binary Particle Swarm Optimization (RLBPSO) to optimize sensor design parameters like electrode design and finger width.

Result: RLBPSO achieves superior sensitivity improvements across various frequency ranges compared to ACO and other state-of-the-art methods.

Conclusion: RLBPSO is an effective method for optimizing IDC-based RF sensor designs, offering significant sensitivity enhancements.

Abstract: Radio frequency (RF) biosensors, in particular those based on inter-digitated capacitors (IDCs), are pivotal in areas like biomedical diagnosis, remote sensing, and wireless communication. Despite their advantages of low cost and easy fabrication, their sensitivity can be hindered by design imperfections, environmental factors, and circuit noise. This paper investigates enhancing the sensitivity of IDC-based RF sensors using a novel reinforcement learning-based Binary Particle Swarm Optimization (RLBPSO), which is compared to Ant Colony Optimization (ACO) and other state-of-the-art methods. By focusing on optimizing design parameters such as electrode design and finger width, the study finds notable improvements in sensor sensitivity. The proposed RLBPSO method yields the best optimized designs across various frequency ranges when compared to current state-of-the-art methods.
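
For reference, the binary-PSO update that RLBPSO builds on is standard; in the sketch below an RL policy would adapt the `w`, `c1`, `c2` coefficients per iteration, which is elided:

```python
import numpy as np

def bpso_step(X, V, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One binary-PSO update over bit-string encodings of sensor designs."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(X.shape), rng.random(X.shape)
    V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)
    prob = 1.0 / (1.0 + np.exp(-V))  # sigmoid transfer function
    X = (rng.random(X.shape) < prob).astype(float)
    return X, V
```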

[350] A Markov Random Field model for Hypergraph-based Machine Learning

Bohan Tang, Keyue Jiang, Laura Toni, Siheng Chen, Xiaowen Dong

Main category: cs.LG

TL;DR: The paper introduces a hypergraph Markov random field model to understand data generation processes, improving machine learning tasks like structure inference and node classification.

DetailsMotivation: To enhance machine learning model generalisation, robustness, and interpretability by modelling data generation processes on hypergraphs.

Method: Develops a hypergraph Markov random field using a multivariate Gaussian distribution, tailored for hypergraph structure and features. Introduces HGSI for structure inference and Hypergraph-MLP for node classification.

Result: HGSI outperforms existing methods in structure inference, and Hypergraph-MLP excels in node classification benchmarks, offering runtime efficiency and robustness.

Conclusion: The proposed data-generating process and frameworks significantly improve hypergraph machine learning tasks, demonstrating superior performance and robustness.

Abstract: Understanding the data-generating process is essential for building machine learning models that generalise well while ensuring robustness and interpretability. This paper addresses the fundamental challenge of modelling the data generation processes on hypergraphs and explores how such models can inform the design of machine learning algorithms for hypergraph data. The key to our approach is the development of a hypergraph Markov random field that models the joint distribution of the node features and hyperedge features in a hypergraph through a multivariate Gaussian distribution whose covariance matrix is uniquely determined by the hypergraph structure. The proposed data-generating process provides a valuable inductive bias for various hypergraph machine learning tasks, thus enhancing the algorithm design. In this paper, we focus on two representative downstream tasks: structure inference and node classification. Accordingly, we introduce two novel frameworks: 1) an original hypergraph structure inference framework named HGSI, and 2) a novel learning framework entitled Hypergraph-MLP for node classification on hypergraphs. Empirical evaluation of the proposed frameworks demonstrates that: 1) HGSI outperforms existing hypergraph structure inference methods on both synthetic and real-world data; and 2) Hypergraph-MLP outperforms baselines in six hypergraph node classification benchmarks, while promoting runtime efficiency and robustness against structural perturbations during inference.
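
Schematically, such a Gaussian MRF has the form below; the specific precision matrix is our illustrative reading (the paper's exact covariance construction may differ):

$$
p(\mathbf{x}) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\,\mathbf{x}^\top \Sigma^{-1}\,\mathbf{x}\Big), \qquad \Sigma^{-1} = \epsilon I + L_H,
$$

where $\mathbf{x}$ stacks the node and hyperedge features and $L_H$ is a Laplacian induced by the hypergraph incidence structure, so that the covariance is determined by the hypergraph as the abstract states.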

[351] Bayesian Gaussian Process ODEs via Double Normalizing Flows

Jian Xu, Shian Du, Junmei Yang, Xinghao Ding, John Paisley, Delu Zeng

Main category: cs.LG

TL;DR: The paper introduces normalizing flows to enhance Gaussian Process ODEs (GPODEs), improving flexibility and accuracy in modeling dynamical systems.

DetailsMotivation: Standard GPs with basic kernels limit GPODE representation of complex scenarios, prompting the need for more flexible and expressive models.

Method: Normalizing flows reparameterize the ODE vector field and posterior inference, combined with a variational learning algorithm for simultaneous learning and inference.

Result: The method improves accuracy and uncertainty estimates, validated on simulated and real-world data like human motion.

Conclusion: The approach effectively captures uncertainty and enhances accuracy in Bayesian GPODEs.

Abstract: Recently, Gaussian processes have been used to model the vector field of continuous dynamical systems, referred to as GPODEs, which are characterized by a probabilistic ODE equation. Bayesian inference for these models has been extensively studied and applied in tasks such as time series prediction. However, the use of standard GPs with basic kernels like squared exponential kernels has been common in GPODE research, limiting the model’s ability to represent complex scenarios. To address this limitation, we introduce normalizing flows to reparameterize the ODE vector field, resulting in a data-driven prior distribution, thereby increasing flexibility and expressive power. We develop a data-driven variational learning algorithm that utilizes analytically tractable probability density functions of normalizing flows, enabling simultaneous learning and inference of unknown continuous dynamics. Additionally, we also apply normalizing flows to the posterior inference of GP ODEs to resolve the issue of strong mean-field assumptions in posterior inference. By applying normalizing flows in both these ways, our model improves accuracy and uncertainty estimates for Bayesian Gaussian Process ODEs. We validate the effectiveness of our approach on simulated dynamical systems and real-world human motion data, including time series prediction and missing data recovery tasks. Experimental results show that our proposed method effectively captures model uncertainty while improving accuracy.

[352] Entropy Causal Graphs for Multivariate Time Series Anomaly Detection

Falih Gozi Febrinanto, Kristen Moore, Chandra Thapa, Mujie Liu, Vidya Saikrishna, Jiangang Ma, Feng Xia

Main category: cs.LG

TL;DR: CGAD introduces a causal graph-based framework for multivariate time series anomaly detection, improving performance by 9% over state-of-the-art methods.

DetailsMotivation: Existing frameworks ignore causal relationships among variables, degrading anomaly detection performance.

Method: CGAD uses transfer entropy to construct causal graphs and combines weighted graph convolutional networks with causal convolutions to model relationships and temporal patterns. Anomaly scoring employs median absolute deviation-based normalization.

Result: CGAD achieves a 9% average improvement in performance metrics compared to existing methods.

Conclusion: CGAD effectively captures causal relationships and temporal patterns, enhancing anomaly detection in multivariate time series data.

Abstract: Many multivariate time series anomaly detection frameworks have been proposed and widely applied. However, most of these frameworks do not consider intrinsic relationships between variables in multivariate time series data, thus ignoring the causal relationship among variables and degrading anomaly detection performance. This work proposes a novel framework called CGAD, an entropy Causal Graph for multivariate time series Anomaly Detection. CGAD utilizes transfer entropy to construct graph structures that unveil the underlying causal relationships among time series data. Weighted graph convolutional networks combined with causal convolutions are employed to model both the causal graph structures and the temporal patterns within multivariate time series data. Furthermore, CGAD applies anomaly scoring, leveraging median absolute deviation-based normalization to improve the robustness of the anomaly identification process. Extensive experiments demonstrate that CGAD outperforms state-of-the-art methods on real-world datasets with a 9% average improvement in terms of three different multivariate time series anomaly detection metrics.
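
The median-absolute-deviation normalization used for scoring is standard and easy to state; a minimal sketch over per-variable forecasting errors:

```python
import numpy as np

def mad_anomaly_score(errors, eps=1e-8):
    """Robust z-scores: deviation from the median, scaled by the median
    absolute deviation, computed per variable over time."""
    med = np.median(errors, axis=0)
    mad = np.median(np.abs(errors - med), axis=0)
    return np.abs(errors - med) / (mad + eps)
```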

[353] Decision-focused predictions via pessimistic bilevel optimization: complexity and algorithms

Víctor Bucarey, Sophia Calderón, Gonzalo Muñoz, Frederic Semet

Main category: cs.LG

TL;DR: The paper addresses uncertainty in optimization by focusing on decision-focused predictions to minimize regret, reformulating the problem as a non-convex quadratic optimization, and demonstrating improved performance over existing methods.

DetailsMotivation: To tackle the sensitivity of decisions to uncertain parameters in optimization, the paper aims to develop predictive models that minimize regret in decision-making.

Method: The authors formulate expected regret minimization as a pessimistic bilevel optimization model, analyze its computational complexity, and reformulate it as a non-convex quadratic problem. They also propose computational techniques for tractability.

Result: The approach shows improved training performance over the state-of-the-art method by Elmachtoub and Grigas (2022) in experiments on shortest-path and bipartite matching instances.

Conclusion: The proposed method effectively addresses uncertainty in optimization by focusing on decision-focused predictions and offers computational advantages over existing techniques.

Abstract: Dealing with uncertainty in optimization parameters is an important and longstanding challenge. Typically, uncertain parameters are predicted accurately, and then a deterministic optimization problem is solved. However, the decisions produced by this so-called predict-then-optimize procedure can be highly sensitive to uncertain parameters. In this work, we contribute to recent efforts in producing decision-focused predictions, i.e., to build predictive models that are constructed with the goal of minimizing a regret measure on the decisions taken with them. We begin by formulating the exact expected regret minimization as a pessimistic bilevel optimization model. Then, we show computational complexity results of this problem, including its membership in NP. In combination with a known NP-hardness result, this establishes NP-completeness and discards its hardness in higher complexity classes. Using duality arguments, we reformulate it as a non-convex quadratic optimization problem. Finally, leveraging the quadratic reformulation, we show various computational techniques to achieve empirical tractability. We report extensive computational results on shortest-path and bipartite matching instances with uncertain cost vectors. Our results indicate that our approach can improve training performance over the approach of Elmachtoub and Grigas (2022), a state-of-the-art method for decision-focused learning.
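
In our notation, following the abstract, the pessimistic bilevel model of expected regret reads:

$$
\min_{\theta}\; \mathbb{E}_{(x,c)}\Big[\max_{w \in W^\ast(\hat{c}_\theta(x))} c^\top w \;-\; \min_{w \in W} c^\top w\Big], \qquad W^\ast(\hat{c}) = \arg\min_{w \in W} \hat{c}^\top w,
$$

where the inner maximum makes the model pessimistic: the predictor $\hat{c}_\theta$ is charged the worst true cost $c$ among all solutions that look optimal under its prediction.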

[354] Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation

Bruno Laboissiere Camargos Borges, Bruno Machado Pacheco, Danilo Silva

Main category: cs.LG

TL;DR: The paper introduces Soft Dice Confidence (SDC), a linear-time approximation for optimal confidence estimation in selective prediction for semantic segmentation, outperforming previous methods.

DetailsMotivation: State-of-the-art semantic segmentation models underperform in high-stakes applications like medical imaging, where abstaining from low-confidence predictions can improve performance.

Method: Focuses on image-level abstention, deriving an optimal confidence estimator (intractable for typical images) and proposing SDC, a linear-time approximation. Also introduces a plug-in SDC for estimated marginal posteriors.

Result: SDC outperforms prior methods, including those needing extra tuning data, validated on synthetic and real-world medical imaging tasks, including out-of-distribution scenarios.

Conclusion: SDC is a reliable, efficient tool for selective prediction in semantic segmentation, particularly in medical imaging.

Abstract: In semantic segmentation, even state-of-the-art deep learning models fall short of the performance required in certain high-stakes applications such as medical image analysis. In these cases, performance can be improved by allowing a model to abstain from making predictions when confidence is low, an approach known as selective prediction. While well-known in the classification literature, selective prediction has been underexplored in the context of semantic segmentation. This paper tackles the problem by focusing on image-level abstention, which involves producing a single confidence estimate for the entire image, in contrast to previous approaches that focus on pixel-level uncertainty. Assuming the Dice coefficient as the evaluation metric for segmentation, two main contributions are provided in this paper: (i) In the case of known marginal posterior probabilities, we derive the optimal confidence estimator, which is observed to be intractable for typical image sizes. Then, an approximation computable in linear time, named Soft Dice Confidence (SDC), is proposed and proven to be tightly bounded to the optimal estimator. (ii) When only an estimate of the marginal posterior probabilities is known, we propose a plug-in version of the SDC and show it outperforms all previous methods, including those requiring additional tuning data. These findings are supported by experimental results on both synthetic data and real-world data from six medical imaging tasks, including out-of-distribution scenarios, positioning the SDC as a reliable and efficient tool for selective prediction in semantic segmentation.
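
As we read the description, the plug-in estimator is a soft Dice between the predicted mask and the estimated pixel-wise posteriors; a sketch, with normalization details possibly differing from the paper's:

```python
import numpy as np

def soft_dice_confidence(probs, pred_mask, eps=1e-8):
    """Soft Dice between a binary predicted mask and pixel-wise posterior
    probabilities, used as a single image-level confidence estimate."""
    num = 2.0 * np.sum(probs * pred_mask)
    den = np.sum(probs) + np.sum(pred_mask)
    return (num + eps) / (den + eps)
```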

[355] Data Collaboration Analysis with Orthonormal Basis Selection and Alignment

Keiyu Nosaka, Yuichi Takano, Akiko Yoshise

Main category: cs.LG

TL;DR: Orthonormal Data Collaboration (ODC) improves model accuracy and stability in multi-party data training by enforcing orthonormality constraints, reducing alignment complexity, and ensuring privacy.

DetailsMotivation: Existing Data Collaboration (DC) methods show that the choice of target basis affects model accuracy and stability, motivating the need for a more robust framework.

Method: ODC enforces orthonormality on secret and target bases, simplifying alignment to the Orthogonal Procrustes Problem with a closed-form solution.

Result: ODC reduces alignment complexity, speeds up alignment, and ensures model performance invariance to the target basis choice, with comparable or superior accuracy.

Conclusion: ODC is a computationally efficient and privacy-preserving enhancement to DC, especially when orthonormal bases are feasible.

Abstract: Data Collaboration (DC) enables multiple parties to jointly train a model without exposing their private datasets. Each party privately transforms its data using a secret linear basis and shares only the resulting intermediate representations. Existing theory asserts that any target basis spanning the same subspace as the secret bases should suffice; however, empirical evidence reveals that the particular choice of target basis significantly influences model accuracy and stability. In this paper, we introduce Orthonormal Data Collaboration (ODC), a novel DC framework that explicitly enforces orthonormality constraints on both the secret and target bases. Under these constraints, the basis alignment step reduces precisely to the classical Orthogonal Procrustes Problem, admitting a closed-form solution. We rigorously establish that the resulting orthonormal change-of-basis matrices achieve orthogonal concordance, aligning all parties' intermediate representations up to a common orthogonal transformation. Consequently, downstream model performance becomes invariant to the specific choice of orthonormal target basis. Computationally, ODC substantially reduces alignment complexity from $O(\min\{a(cl)^2,\, a^2cl\})$ to $O(acl^2)$, where $a$ denotes the anchor data size, $l$ the latent dimension, and $c$ the number of collaborating parties. Extensive empirical evaluations confirm the theoretical advantages of ODC, demonstrating alignment speed-ups of up to two orders of magnitude compared to state-of-the-art DC methods, alongside comparable or superior accuracy across multiple benchmark datasets. ODC maintains robust privacy under the semi-honest threat model and requires only a single round of communication. These results establish ODC as a practically advantageous and computationally efficient enhancement to existing DC pipelines, particularly when orthonormal secret bases are naturally feasible.
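
The alignment step that ODC reduces to has a classical closed form; a minimal sketch:

```python
import numpy as np

def procrustes_align(F_party, F_target):
    """Closed-form Orthogonal Procrustes solution: the orthogonal G
    minimizing ||F_party @ G - F_target||_F, used to align one party's
    intermediate representation with the target basis."""
    U, _, Vt = np.linalg.svd(F_party.T @ F_target)
    return U @ Vt
```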

[356] Position: Lifetime tuning is incompatible with continual reinforcement learning

Golnaz Mesbahi, Parham Mohammad Panahi, Olya Mastikhina, Steven Tang, Martha White, Adam White

Main category: cs.LG

TL;DR: The paper critiques the standard RL evaluation methodology (lifetime tuning) for continual RL, showing it fails to distinguish algorithms for continual learning. It demonstrates that limited tuning better reflects continual RL goals.

DetailsMotivation: To highlight the flaws in current RL evaluation practices for continual learning and advocate for methodologies that align with the goals of continual RL.

Method: Empirical testing of DQN and SAC in continuing and non-stationary environments, comparing lifetime tuning vs. limited tuning.

Result: Lifetime tuning doesn’t distinguish algorithms for continual learning, while limited tuning shows continual RL algorithms outperform non-continual ones.

Conclusion: Current evaluation practices hinder progress in continual RL; adopting limited tuning methodologies is necessary for meaningful advancements.

Abstract: In continual RL we want agents capable of never-ending learning, and yet our evaluation methodologies do not reflect this. The standard practice in RL is to assume unfettered access to the deployment environment for the full lifetime of the agent. For example, agent designers select the best performing hyperparameters in Atari by testing each for 200 million frames and then reporting results on 200 million frames. In this position paper, we argue and demonstrate the pitfalls of this inappropriate empirical methodology: lifetime tuning. We provide empirical evidence to support our position by testing DQN and SAC across several continuing and non-stationary environments, with two main findings: (1) lifetime tuning does not allow us to identify algorithms that work well for continual learning – all algorithms equally succeed; (2) recently developed continual RL algorithms outperform standard non-continual algorithms when tuning is limited to a fraction of the agent's lifetime. The goal of this paper is to provide an explanation for why recent progress in continual RL has been mixed and motivate the development of empirical practices that better match the goals of continual RL.

[357] Reorganizing attention-space geometry with expressive attention

Claudius Gros

Main category: cs.LG

TL;DR: The paper introduces expressive attention (EA), based on squared dot product, as an alternative to standard dot-product attention (DPA). EA enhances attention for parallel/antiparallel queries and keys, performing comparably or better than DPA, especially in complex tasks.

DetailsMotivation: To explore an alternative attention mechanism (EA) that reorganizes the geometry of token matching without additional computational costs, potentially improving performance in complex tasks.

Method: EA replaces the standard dot product (Q^TK) with its squared form (Q^TK)^2, enhancing attention for parallel/antiparallel queries/keys and suppressing orthogonal ones.

Result: EA matches or outperforms DPA, especially in complex or multi-task settings, achieving 100% performance in tasks where DPA fails.

Conclusion: EA offers a viable, cost-free alternative to DPA, reorganizing attention geometry without performance loss and excelling in complex scenarios.

Abstract: Attention regulates information transfer between tokens. For this, query and key vectors are compared, typically in terms of a scalar product, $\mathbf{Q}^T\mathbf{K}$, together with a subsequent softmax normalization. In geometric terms, the standard dot-product attention (DPA) leads to large/small attention weights for parallel/antiparallel queries and keys. Here we study expressive attention (EA), which is based on $(\mathbf{Q}^T\mathbf{K})^2$, the squared dot product. In this case, attention is enhanced when query and key are either parallel or antiparallel, and suppressed for orthogonal configurations. EA can be introduced into any attention-based code without additional compute costs or memory requirements. For a series of autoregressive prediction tasks, we find that expressive attention performs at least as well as vanilla DPA. Increasing task complexity, EA is observed to outperform DPA with increasing margins, which also holds for multi-task settings. For a given model size, EA manages to achieve 100% performance for a range of complexity levels not accessible to DPA. Our results show that it is possible to reorganize the geometry of the matching condition in the space of attention heads without loss of performance.
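
The change is a one-liner relative to standard attention; in the sketch below the $1/d$ scaling is illustrative (the paper may normalize differently):

```python
import torch
import torch.nn.functional as F

def expressive_attention(Q, K, V):
    """Attention with (Q K^T)^2 in place of Q K^T: parallel *and*
    antiparallel query/key pairs get high weight, while orthogonal
    pairs are suppressed."""
    d = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) ** 2 / d  # squared similarity
    return F.softmax(scores, dim=-1) @ V
```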

[358] Spatio-Temporal Partial Sensing Forecast for Long-term Traffic

Zibo Liu, Zhe Jiang, Zelin Xu, Tingsong Xiao, Zhengkun Xiao, Yupu Zhang, Haibo Wang, Shigang Chen

Main category: cs.LG

TL;DR: The paper introduces SLPF, a model for long-term traffic forecasting with partial sensor coverage, addressing challenges like unknown data distribution at unsensed locations and spatio-temporal correlations.

DetailsMotivation: Existing traffic forecasting methods assume full sensor coverage or focus on short-term predictions, leaving a gap for long-term forecasting with partial sensor data.

Method: SLPF uses rank-based embedding to reduce noise, a spatial transfer matrix for distribution shifts, and multi-step training for accuracy.

Result: Experiments on real-world datasets show SLPF outperforms existing methods.

Conclusion: SLPF effectively addresses long-term traffic forecasting with partial sensor data, offering superior performance.

Abstract: Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast the future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecasts. This paper studies partial sensing forecast of long-term traffic, assuming sensors are available only at some locations. The problem is challenging due to the unknown data distribution at unsensed locations, the intricate spatio-temporal correlation in long-term forecasting, as well as noise in traffic patterns. We propose a Spatio-temporal Long-term Partial sensing Forecast model (SLPF) for traffic prediction, with several novel contributions, including a rank-based embedding technique to reduce the impact of noise in data, a spatial transfer matrix to overcome the spatial distribution shift from sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate its superior performance. Our source code is at https://github.com/zbliu98/SLPF

[359] Formal Local Implication Between Two Neural Networks

Anahita Baninajjar, Ahmed Rezine, Amir Aminifar

Main category: cs.LG

TL;DR: The paper introduces a method to formally compare two neural networks over an entire input region, ensuring one network’s correct decisions imply the other’s, with applications in verifying compact networks.

DetailsMotivation: To provide a formal foundation for comparing neural networks over entire input regions, ensuring correctness in applications like verifying pruned or quantized networks.

Method: Proposes a sound formulation for verifying local implications between two networks, ensuring N1’s correctness whenever N2 is correct in a region D.

Result: Evaluated on MNIST, CIFAR10, and medical datasets, demonstrating the method’s relevance and applicability.

Conclusion: The proposed formulation effectively verifies local implications between networks, useful for comparing trained and compact models.

Abstract: Given two neural network classifiers with the same input and output domains, our goal is to compare the two networks in relation to each other over an entire input region (e.g., within a vicinity of an input sample). To this end, we establish the foundation of formal local implication between two networks, i.e., N2 implies N1, in an entire input region D. That is, network N1 consistently makes a correct decision every time network N2 does, and it does so in an entire input region D. We further propose a sound formulation for establishing such formally-verified (provably correct) local implications. The proposed formulation is relevant in the context of several application domains, e.g., for comparing a trained network and its corresponding compact (e.g., pruned, quantized, distilled) networks. We evaluate our formulation based on the MNIST, CIFAR10, and two real-world medical datasets, to show its relevance.
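
In our notation, the property being verified is:

$$
N_2 \Rightarrow_D N_1 \quad\Longleftrightarrow\quad \forall x \in D:\; \big(N_2(x) = y^\ast(x)\big) \rightarrow \big(N_1(x) = y^\ast(x)\big),
$$

where $y^\ast(x)$ denotes the correct decision (e.g., the label of the sample whose vicinity defines $D$). Establishing this with $N_1$ as a pruned or quantized network certifies that compression loses no correct decisions anywhere in $D$.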

[360] ATM: Improving Model Merging by Alternating Tuning and Merging

Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Fabrizio Silvestri, Emanuele Rodolà

Main category: cs.LG

TL;DR: ATM (Alternate Tuning and Merging) reinterprets model merging as an iterative process, offering a cost-efficient alternative to multitask learning and improving existing merging methods.

DetailsMotivation: To provide a theoretical foundation for task vectors and explore iterative merging as a practical alternative to multitask learning, especially in restricted data-sharing scenarios.

Method: ATM alternates between tuning and merging, leveraging task vectors equivalent to multitask gradients under specific conditions.

Result: ATM proves effective across diverse vision tasks, serving as a lightweight refinement or alternative to multitask learning.

Conclusion: ATM offers a simple, iterative approach to model merging, enhancing efficiency and performance in constrained settings.

Abstract: Model merging has emerged as a cost-efficient approximation to multitask learning. Among merging strategies, task arithmetic is notable for its simplicity and effectiveness. In this work, we provide a theoretical motivation for task vectors by highlighting that, under single-epoch full-batch gradient descent, they are equivalent to multitask gradients. This insight leads us to reinterpret model merging as a single step in an iterative procedure that Alternates between Tuning and Merging (ATM). We propose two applications of ATM: (1) as an alternative to multitask learning in scenarios where data sharing is restricted (e.g., federated settings), and (2) as a lightweight refinement step to improve existing model merging methods using a small validation set. Experiments across diverse vision tasks demonstrate the effectiveness of ATM.
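
A sketch of one ATM round under the stated reading (task vectors as deltas from the shared initialization; `finetune` and the uniform merge are illustrative):

```python
import copy
import torch

def atm_round(model, tasks, finetune, alpha=1.0):
    """Alternate one short tuning phase per task with a task-arithmetic merge."""
    base = {n: p.detach().clone() for n, p in model.named_parameters()}
    task_vectors = []
    for task in tasks:
        m = copy.deepcopy(model)
        finetune(m, task)  # short tuning phase on this task's data
        task_vectors.append({n: p.detach() - base[n]
                             for n, p in m.named_parameters()})
    with torch.no_grad():
        for n, p in model.named_parameters():
            delta = sum(tv[n] for tv in task_vectors) / len(task_vectors)
            p.copy_(base[n] + alpha * delta)
```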

[361] Reconsidering the Performance of GAE in Link Prediction

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Muhan Zhang

Main category: cs.LG

TL;DR: A well-tuned Graph Autoencoder (GAE) matches the performance of advanced GNNs for link prediction, offering better efficiency and achieving state-of-the-art results on certain datasets.

DetailsMotivation: To address the exaggeration of benefits in new GNN approaches due to outdated baselines, the study explores GAEs with modern techniques.

Method: Systematically applies model-agnostic tricks and hyperparameter tuning to GAEs.

Result: Achieves a Hits@100 score of 78.41% on ogbl-ppa and identifies structural information as key for performance.

Conclusion: Highlights the need for updated baselines to accurately assess progress in GNNs for link prediction.

Abstract: Recent advancements in graph neural networks (GNNs) for link prediction have introduced sophisticated training techniques and model architectures. However, reliance on outdated baselines may exaggerate the benefits of these new approaches. To tackle this issue, we systematically explore Graph Autoencoders (GAEs) by applying model-agnostic tricks in recent methods and tuning hyperparameters. We find that a well-tuned GAE can match the performance of recent sophisticated models while offering superior computational efficiency on widely-used link prediction benchmarks. Our approach delivers substantial performance gains on datasets where structural information dominates and feature data is limited. Specifically, our GAE achieves a state-of-the-art Hits@100 score of 78.41% on the ogbl-ppa dataset. Furthermore, we examine the impact of various tricks to uncover the reasons behind our success and to guide the design of future methods. Our study emphasizes the critical need to update baselines for a more accurate assessment of progress in GNNs for link prediction. Our code is available at https://github.com/GraphPKU/Refined-GAE.

[362] Adaptive Collocation Point Strategies For Physics Informed Neural Networks via the QR Discrete Empirical Interpolation Method

Adrian Celaya, David Fuentes, Beatrice Riviere

Main category: cs.LG

TL;DR: The paper proposes two adaptive collocation point selection strategies using QR-DEIM to enhance PINN accuracy for solving PDEs, outperforming traditional fixed and adaptive methods.

DetailsMotivation: Current collocation point sampling methods in PINNs (fixed or adaptive) often fail to capture critical regions with high solution gradients or miss residual dynamics, limiting their effectiveness for complex PDEs.

Method: Two adaptive collocation point selection strategies are introduced, leveraging QR-DEIM, a reduced-order modeling technique, to dynamically update points during training.

Result: The proposed QR-DEIM-based methods improve PINN accuracy on benchmark PDEs compared to existing fixed and adaptive sampling techniques.

Conclusion: The QR-DEIM-based adaptive strategies offer a promising direction for enhancing PINN performance by better capturing critical solution features.

Abstract: Physics-informed neural networks (PINNs) have gained significant attention for solving forward and inverse problems related to partial differential equations (PDEs). While advancements in loss functions and network architectures have improved PINN accuracy, the impact of collocation point sampling on their performance remains underexplored. Fixed sampling methods, such as uniform random sampling and equispaced grids, can fail to capture critical regions with high solution gradients, limiting their effectiveness for complex PDEs. Adaptive methods, inspired by adaptive mesh refinement from traditional numerical methods, address this by dynamically updating collocation points during training but may overlook residual dynamics between updates, potentially losing valuable information. To overcome this limitation, we propose two adaptive collocation point selection strategies utilizing the QR Discrete Empirical Interpolation Method (QR-DEIM), a reduced-order modeling technique for efficiently approximating nonlinear functions. Our results on benchmark PDEs demonstrate that our QR-DEIM-based approaches improve PINN accuracy compared to existing methods, offering a promising direction for adaptive collocation point strategies.
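
The selection mechanism can be sketched with a pivoted QR on a snapshot matrix of PDE residuals (rows index candidate points, columns index residual snapshots over training); the paper's exact pipeline may differ:

```python
import numpy as np
from scipy.linalg import qr

def qr_deim_select(residual_snapshots, n_points):
    """Pick collocation points via QR-DEIM: build a reduced basis for the
    residual field, then let pivoted QR choose the most informative rows.
    Requires at least n_points snapshot columns."""
    U, _, _ = np.linalg.svd(residual_snapshots, full_matrices=False)
    _, _, piv = qr(U[:, :n_points].T, pivoting=True)
    return piv[:n_points]  # indices of selected candidate points
```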

[363] Systemizing Multiplicity: The Curious Case of Arbitrariness in Machine Learning

Prakhar Ganesh, Afaf Taik, Golnoosh Farnadi

Main category: cs.LG

TL;DR: The paper systematizes the study of arbitrariness in algorithmic modeling, focusing on multiplicity—arbitrariness across ‘good models’—by formalizing terminology, expanding definitions, clarifying distinctions, and analyzing risks and benefits.

DetailsMotivation: To address the arbitrariness in algorithmic modeling decisions and systematize the study of multiplicity, a growing area of interest in responsible AI.

Method: The work formalizes terminology, expands the definition of multiplicity, clarifies its distinction from uncertainty and variance, and analyzes its risks and benefits.

Result: A structured framework for understanding multiplicity, its implications, and its place in responsible AI, along with identified open research questions.

Conclusion: The paper highlights the importance of multiplicity in responsible AI and points to future research directions in this emerging field.

Abstract: Algorithmic modeling relies on limited information in data to extrapolate outcomes for unseen scenarios, often embedding an element of arbitrariness in its decisions. A perspective on this arbitrariness that has recently gained interest is multiplicity-the study of arbitrariness across a set of “good models”, i.e., those likely to be deployed in practice. In this work, we systemize the literature on multiplicity by: (a) formalizing the terminology around model design choices and their contribution to arbitrariness, (b) expanding the definition of multiplicity to incorporate underrepresented forms beyond just predictions and explanations, (c) clarifying the distinction between multiplicity and other lenses of arbitrariness, i.e., uncertainty and variance, and (d) distilling the benefits and potential risks of multiplicity into overarching trends, situating it within the broader landscape of responsible AI. We conclude by identifying open research questions and highlighting emerging trends in this young but rapidly growing area of research.

[364] Rethinking the Bias of Foundation Model under Long-tailed Distribution

Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su

Main category: cs.LG

TL;DR: The paper investigates how imbalances in pre-training data affect long-tailed downstream tasks, identifying parameter and data imbalances. It proposes a causal learning-based method to address these issues, achieving a 1.67% performance boost.

DetailsMotivation: To understand and mitigate the biases introduced by imbalanced pre-training data in foundation models, which are overlooked in existing methods.

Method: The paper uses causal learning to treat incomplete semantic factors as confounders and proposes a backdoor adjustment method to learn true causal effects between inputs and labels.

Result: The method improves performance by an average of 1.67% on datasets, addressing parameter imbalance more effectively than existing techniques.

Conclusion: The proposed causal learning approach successfully tackles both parameter and data imbalances, outperforming current re-balancing strategies.

Abstract: Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we find that the imbalance biases inherited by foundation models on downstream tasks manifest as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training, unlike data imbalance. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as the confounder, which brings spurious correlations between input samples and labels. To resolve the negative effects of this, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about 1.67% on each dataset. Code is available: https://github.com/JiahaoChen1/Pre-train-Imbalance
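
The backdoor adjustment they build on is the standard one; with the incomplete semantic factor $z$ treated as the confounder:

$$
P(y \mid \mathrm{do}(x)) \;=\; \sum_{z} P(y \mid x, z)\, P(z),
$$

i.e., the model is trained to estimate the interventional distribution rather than the confounded conditional $P(y \mid x)$, removing the spurious correlations induced by pre-training imbalance.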

[365] Contextually Entangled Gradient Mapping for Optimized LLM Comprehension

Colin Sisate, Alistair Goldfinch, Vincent Waterstone, Sebastian Kingsley, Mariana Blackthorn

Main category: cs.LG

TL;DR: CEGM enhances gradient optimization by treating gradients as dynamic carriers of contextual dependencies, improving semantic coherence and reasoning in neural architectures.

DetailsMotivation: To bridge gaps in existing optimization strategies by redefining the relationship between contextual embeddings and gradient updates.

Method: Integrates entangled gradient dynamics into a loss regularization framework, involving modifications like entanglement layers and dynamic coefficient adjustments.

Result: Outperformed baselines in accuracy and resilience, reduced semantic drift, and improved embedding coherence.

Conclusion: Demonstrates broader implications for theoretical and practical advancements in optimization.

Abstract: Contextually Entangled Gradient Mapping (CEGM) introduces a new approach to gradient optimization, redefining the relationship between contextual embeddings and gradient updates to enhance semantic coherence and reasoning capabilities in neural architectures. By treating gradients as dynamic carriers of contextual dependencies rather than isolated numerical entities, the proposed methodology bridges critical gaps in existing optimization strategies. The integration of entangled gradient dynamics into a loss regularization framework demonstrated significant improvements in tasks involving long-form reasoning, contextual retention, and adaptability to unseen domains. Experimental evaluations showed that the CEGM-enhanced model consistently outperformed baseline approaches, achieving higher accuracy in token-level predictions and greater resilience to noisy inputs. Practical implementations involved modifications to training pipelines, introducing entanglement layers and dynamic coefficient adjustments that seamlessly align with existing architectures. Results further highlighted reductions in semantic drift during sequential transformations and improvements in embedding coherence across paraphrased sentences, showing the robustness and versatility of the proposed methodology. The findings demonstrate the broader implications of gradient entanglement for both theoretical advancements and practical applications in optimization strategies.

[366] The Ensemble Kalman Update is an Empirical Matheron Update

Dan MacKinlay

Main category: cs.LG

TL;DR: The paper explores the connection between the Ensemble Kalman Filter (EnKF) and the Matheron update in Gaussian process regression, linking data assimilation to modern GP sampling.

DetailsMotivation: To highlight the under-exploited connection between EnKF and Matheron update, bridging historical data assimilation methods with contemporary Gaussian process techniques.

Method: Provides a compact introduction with accessible definitions, linking the ensemble update step of EnKF to the empirical Matheron update in GP regression.

Result: Demonstrates the equivalence between EnKF and Matheron update, offering insights into their shared mathematical foundations.

Conclusion: The paper successfully connects decades of data assimilation engineering with modern GP sampling, making the relationship accessible across fields.

Abstract: The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems, with an ensemble update step equivalent to an empirical version of the Matheron update popular in Gaussian process regression – a connection that links half a century of data-assimilation engineering to modern path-wise GP sampling. This paper provides a compact introduction to this simple but under-exploited connection, with necessary definitions accessible to all fields involved. Source code is available at https://github.com/danmackinlay/paper_matheron_equals_enkf .
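
The shared update is easy to state in code; a minimal sketch of the perturbed-observation EnKF analysis step, which is an empirical Matheron update with ensemble-estimated covariances:

```python
import numpy as np

def enkf_matheron_update(X, y, H, R, rng=None):
    """X: (n_ens, d) prior ensemble; y: (m,) observation; H: (m, d) linear
    observation operator; R: (m, m) observation-noise covariance. Each
    member is shifted by the empirical gain applied to its own perturbed
    innovation."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                        # centered ensemble
    P = Xc.T @ Xc / (n - 1)                        # empirical covariance
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)   # empirical Kalman gain
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=n)
    return X + (y + eps - X @ H.T) @ K.T           # perturbed-obs update
```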

[367] CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Yang Zhang, Wenbo Yang, Jun Wang, Qiang Ma, Jie Xiong

Main category: cs.LG

TL;DR: CAMEF is a multi-modality framework integrating textual and time-series data with causal learning for financial forecasting, outperforming existing methods.

DetailsMotivation: Existing forecasting methods lack multi-modal integration and causal understanding of macroeconomic events' impact on markets.

Method: CAMEF combines textual and time-series data with causal learning and LLM-based counterfactual event augmentation.

Result: CAMEF outperforms state-of-the-art baselines, validated by ablation studies.

Conclusion: CAMEF effectively captures causal relationships and enhances financial forecasting accuracy.

Abstract: Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

[368] Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

Han Qi, Haochen Yang, Qiaosheng Zhang, Zhuoran Yang

Main category: cs.LG

TL;DR: The paper introduces novel RLHF algorithms using information-directed sampling (IDS) to improve sample efficiency, with theoretical guarantees and computational efficiency.

DetailsMotivation: To address the challenge of reinforcement learning from human feedback (RLHF) in large language models by leveraging information theory for exploration.

Method: Designs IDS-based algorithms with a surrogate environment and a novel distance measure (ℓg-distance) to achieve efficient exploration and regret bounds.

Result: Achieves a Bayesian regret bound of O(H^(3/2)√(log(K(ε))T)) and proposes an Approximate-IDS algorithm for computational efficiency.

Conclusion: Demonstrates the value of information theory in RLHF and standard RL, offering practical and theoretically sound solutions.

Abstract: We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(\epsilon)) T})$, where $H$ is the episode length, $T$ is the number of episodes, and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
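
To make the decision rule concrete, here is a toy sketch of the IDS principle the paper builds on, applied to a three-armed Bernoulli bandit with Beta posteriors: choose the action minimizing squared expected regret divided by an information-gain term. The posterior-variance proxy used below is a simplification of the paper's mutual-information term, not their algorithm.

```python
# Toy information-directed sampling: minimize regret^2 / information gain.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 1.0, 1.0])   # Beta posterior params per arm
beta = np.array([1.0, 2.0, 1.0])

samples = rng.beta(alpha, beta, size=(4096, 3))   # posterior draws
best = samples.max(axis=1)

regret = (best[:, None] - samples).mean(axis=0)   # expected regret per arm
# Crude information-gain proxy: posterior variance of each arm's mean
# (the paper's mutual-information term is more involved).
info = samples.var(axis=0) + 1e-9

ids_ratio = regret**2 / info
action = int(np.argmin(ids_ratio))   # IDS action
print(action, ids_ratio)
```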

[369] DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes

Mogens Henrik From, Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp

Main category: cs.LG

TL;DR: FlexDeMo introduces a hybrid sharded data parallel training strategy that reduces inter-node communication by synchronizing only fast-moving gradient components, matching the validation loss of full gradient synchronization while training substantially faster.

DetailsMotivation: Training large neural networks requires extensive resources. Existing methods like DeMo assume models fit on a single accelerator, limiting scalability. FlexDeMo addresses this by relaxing this assumption.

Method: FlexDeMo shards model parameters locally between accelerators and reduces inter-node communication by synchronizing only fast-moving gradient components, creating a hybrid sharded data parallel approach.

Result: FlexDeMo achieves similar validation loss as full gradient synchronization methods like AdamW but is significantly faster, demonstrated across language and vision domains.

Conclusion: FlexDeMo is a promising distributed training scheme for large-scale models, offering efficiency without compromising performance.

Abstract: Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients – resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo – introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains similar validation loss as hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.
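
As a rough illustration of the communication pattern (not the authors' implementation, which extracts the fast-moving components differently), the sketch below keeps each node's momentum local and averages only the top-k largest-magnitude coordinates of the momentum update across nodes:

```python
# Schematic: all-reduce only the "fast" components of per-node updates.
import numpy as np

def sync_fast_components(momenta_delta, k):
    """momenta_delta: list of per-node momentum updates (same shape)."""
    synced = []
    for delta in momenta_delta:
        flat = delta.ravel().copy()
        idx = np.argsort(np.abs(flat))[:-k]   # all but the k largest
        flat[idx] = 0.0                       # keep only fast components
        synced.append(flat.reshape(delta.shape))
    # All-reduce (here: a plain average) of the sparse fast components;
    # the rest of the momentum stays local to each node.
    return np.mean(synced, axis=0)

node_deltas = [np.random.randn(8, 8) for _ in range(4)]
shared = sync_fast_components(node_deltas, k=16)
print(np.count_nonzero(shared))   # at most 4 * k coordinates were exchanged
```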

[370] CAST: Cross Attention based multimodal fusion of Structure and Text for materials property prediction

Jaewan Lee, Changyoung Park, Hongjun Yang, Sungbin Lim, Woohyung Lim, Sehui Han

Main category: cs.LG

TL;DR: CAST, a cross-attention-based multimodal model, integrates graph and text data to improve material property predictions, outperforming existing methods with significant MAE improvements.

DetailsMotivation: GNNs struggle with global structural characteristics in material property prediction, prompting the need for a model that integrates graph and textual data.

Method: CAST uses cross-attention to combine graph node-level and text token-level features, along with a masked node prediction pretraining strategy.

Result: CAST achieves 10.2% to 35.7% average relative MAE improvements over baselines in predicting key material properties.

Conclusion: Multimodal learning, as demonstrated by CAST, enhances predictive accuracy in materials science by aligning graph and text representations.

Abstract: Recent advancements in graph neural networks (GNNs) have significantly enhanced the prediction of material properties by modeling crystal structures as graphs. However, GNNs often struggle to capture global structural characteristics, such as crystal systems, limiting their predictive performance. To overcome this issue, we propose CAST, a cross-attention-based multimodal model that integrates graph representations with textual descriptions of materials, effectively preserving critical structural and compositional information. Unlike previous approaches, such as CrysMMNet and MultiMat, which rely on aggregated material-level embeddings, CAST leverages cross-attention mechanisms to combine fine-grained graph node-level and text token-level features. Additionally, we introduce a masked node prediction pretraining strategy that further enhances the alignment between node and text embeddings. Our experimental results demonstrate that CAST outperforms existing baseline models across four key material properties-formation energy, band gap, bulk modulus, and shear modulus-with average relative MAE improvements ranging from 10.2% to 35.7%. Analysis of attention maps confirms the importance of pretraining in effectively aligning multimodal representations. This study underscores the potential of multimodal learning frameworks for developing more accurate and globally informed predictive models in materials science.
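
The core fusion step is ordinary cross-attention with graph nodes as queries and text tokens as keys and values. A minimal PyTorch sketch follows; the dimensions, pooling, and prediction head are illustrative stand-ins, not the paper's architecture.

```python
# Cross-attention fusion of graph node features and text token features.
import torch
import torch.nn as nn

d_model = 64
node_feats = torch.randn(1, 12, d_model)   # [batch, nodes, dim] from a GNN
text_feats = torch.randn(1, 30, d_model)   # [batch, tokens, dim] from an LM

cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
fused, attn_weights = cross_attn(
    query=node_feats, key=text_feats, value=text_feats
)

# Pool fused node features into a material-level embedding for a property head.
material_emb = fused.mean(dim=1)
property_pred = nn.Linear(d_model, 1)(material_emb)
print(property_pred.shape)   # torch.Size([1, 1])
```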

[371] Navigating Demand Uncertainty in Container Shipping: Deep Reinforcement Learning for Enabling Adaptive and Feasible Master Stowage Planning

Jaike van Twiller, Yossiri Adulyasak, Erick Delage, Djordje Grbic, Rune Møller Jensen

Main category: cs.LG

TL;DR: The paper proposes a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem in container shipping, outperforming traditional methods.

DetailsMotivation: Addressing challenges in RL for combinatorial optimization under real-world constraints, particularly in container shipping, to maximize revenue and minimize costs while handling demand uncertainty and operational constraints.

Method: A deep reinforcement learning framework with feasibility projection is implemented to tackle the master stowage planning problem under demand uncertainty.

Result: The framework efficiently finds adaptive, feasible solutions, outperforming mixed-integer programming and RL with feasibility regularization.

Conclusion: The AI-driven policy enables adaptive and feasible planning under uncertainty, optimizing efficiency and contributing to sustainable global supply chains.

Abstract: Reinforcement learning (RL) has shown promise in solving various combinatorial optimization problems. However, conventional RL faces challenges when dealing with real-world constraints, especially when action space feasibility is explicit and dependent on the corresponding state or trajectory. In this work, we focus on using RL in container shipping, often considered the cornerstone of global trade, by dealing with the critical challenge of master stowage planning. The main objective is to maximize cargo revenue and minimize operational costs while navigating demand uncertainty and various complex operational constraints, namely vessel capacity and stability, which must be dynamically updated along the vessel’s voyage. To address this problem, we implement a deep reinforcement learning framework with feasibility projection to solve the master stowage planning problem (MPP) under demand uncertainty. The experimental results show that our architecture efficiently finds adaptive, feasible solutions for this multi-stage stochastic optimization problem, outperforming traditional mixed-integer programming and RL with feasibility regularization. Our AI-driven decision-support policy enables adaptive and feasible planning under uncertainty, optimizing operational efficiency and capacity utilization while contributing to sustainable and resilient global supply chains.
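
Feasibility projection means mapping the policy's raw action onto the feasible set before it is executed. The toy numpy sketch below projects an allocation onto a simple non-negativity-plus-capacity set; the paper's projection additionally handles vessel stability constraints.

```python
# Toy feasibility projection for an allocation action.
import numpy as np

def project_action(raw_action, capacity):
    a = np.clip(raw_action, 0.0, None)   # no negative cargo allocations
    total = a.sum()
    if total > capacity:                 # scale down to respect capacity
        a *= capacity / total
    return a

raw = np.array([4.0, -1.0, 3.0, 5.0])    # unconstrained policy output
print(project_action(raw, capacity=10.0))  # feasible allocation, sum <= 10
```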

[372] ACTIVA: Amortized Causal Effect Estimation via Transformer-based Variational Autoencoder

Andreas Sauter, Saber Salehkaleybar, Aske Plaat, Erman Acar

Main category: cs.LG

TL;DR: ACTIVA is a transformer-based VAE for amortized causal inference, enabling zero-shot prediction of interventional distributions from observational data without restrictive assumptions.

DetailsMotivation: Existing methods for predicting interventional distributions are limited by restrictive assumptions and lack of amortization across problem instances.

Method: ACTIVA uses a transformer-based conditional VAE to learn latent representations from observational data and intervention queries, enabling zero-shot inference.

Result: Empirical evaluations show ACTIVA effectively predicts interventional distributions as mixtures over observationally equivalent causal models.

Conclusion: ACTIVA offers a promising, scalable approach for causal inference, with potential for real-world applications.

Abstract: Predicting the distribution of outcomes under hypothetical interventions is crucial across healthcare, economics, and policy-making. However, existing methods often require restrictive assumptions, and are typically limited by the lack of amortization across problem instances. We propose ACTIVA, a transformer-based conditional variational autoencoder (VAE) architecture for amortized causal inference, which estimates interventional distributions directly from observational data without restrictive assumptions. ACTIVA learns a latent representation conditioned on observational inputs and intervention queries, enabling zero-shot inference by amortizing causal knowledge from diverse training scenarios. We provide theoretical insights showing that ACTIVA predicts interventional distributions as mixtures over observationally equivalent causal models. Empirical evaluations on synthetic and semi-synthetic datasets confirm the effectiveness of our amortized approach and highlight promising directions for future real-world applications.

[373] Global graph features unveiled by unsupervised geometric deep learning

Mirja Granfors, Jesús Pineda, Blanca Zufiria Gerbolés, Joana B. Pereira, Carlo Manzo, Giovanni Volpe

Main category: cs.LG

TL;DR: GAUDI is an unsupervised geometric deep learning framework for analyzing complex graphs, capturing both local and global structures, and outperforming existing methods in diverse applications.

DetailsMotivation: Graphs model complex systems but their structural variability complicates analysis and classification. GAUDI addresses this by capturing invariant features and disentangling noise.

Method: GAUDI uses an hourglass architecture with hierarchical pooling and upsampling layers, linked via skip connections to preserve connectivity during encoding-decoding.

Result: GAUDI maps similar system states to nearby regions in a latent space, disentangling invariant features from noise, and excels in applications like brain connectivity and protein assembly analysis.

Conclusion: GAUDI outperforms related approaches, offering new insights into emergent phenomena across scientific domains.

Abstract: Graphs provide a powerful framework for modeling complex systems, but their structural variability poses significant challenges for analysis and classification. To address these challenges, we introduce GAUDI (Graph Autoencoder Uncovering Descriptive Information), a novel unsupervised geometric deep learning framework designed to capture both local details and global structure. GAUDI employs an innovative hourglass architecture with hierarchical pooling and upsampling layers linked through skip connections, which preserve essential connectivity information throughout the encoding-decoding process. Even though identical or highly similar underlying parameters describing a system’s state can lead to significant variability in graph realizations, GAUDI consistently maps them into nearby regions of a structured and continuous latent space, effectively disentangling invariant process-level features from stochastic noise. We demonstrate GAUDI’s versatility across multiple applications, including small-world networks modeling, characterization of protein assemblies from super-resolution microscopy, analysis of collective motion in the Vicsek model, and identification of age-related changes in brain connectivity. Comparison with related approaches highlights GAUDI’s superior performance in analyzing complex graphs, providing new insights into emergent phenomena across diverse scientific domains.

[374] Learning to Match Unpaired Data with Minimum Entropy Coupling

Mustapha Bounoua, Giulio Franzese, Pietro Michiardi

Main category: cs.LG

TL;DR: Proposes DDMEC, a method using diffusion models to solve the continuous Minimum Entropy Coupling (MEC) problem, outperforming existing approaches in tasks like data alignment and image translation.

DetailsMotivation: Real-world multimodal data is often unpaired, posing challenges for learning joint distributions. Existing MEC methods are limited to discrete distributions.

Method: Uses generative diffusion models to approximate and minimize joint entropy while relaxing marginal constraints.

Result: DDMEC outperforms specialized methods in tasks like unsupervised single-cell multi-omics data alignment and unpaired image translation.

Conclusion: DDMEC is a general and effective solution for continuous MEC problems, applicable to diverse tasks.

Abstract: Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which poses a significant challenge for learning a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.

[375] Training Plug-n-Play Knowledge Modules with Deep Context Distillation

Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni

Main category: cs.LG

TL;DR: The paper proposes modularizing knowledge with lightweight Knowledge Modules (KMs) to dynamically integrate new information into language models, overcoming limitations of in-context learning and retrieval-augmented generation.

DetailsMotivation: Challenges in integrating new or evolving information post-pre-training, especially in low-data or private/specialized scenarios, motivate the need for a more efficient method.

Method: Train document-level Knowledge Modules (KMs) as parameter-efficient LoRA modules, using Deep Context Distillation instead of next-token prediction.

Result: The method outperforms standard next-token prediction and pre-instruction training across two datasets.

Conclusion: KMs offer a scalable solution for dynamic knowledge integration, with potential synergies with RAG.

Abstract: Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KMs parameters such as to simulate hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques, across two datasets. Finally, we highlight synergies between KMs and RAG.
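
A minimal PyTorch sketch of the distillation objective as described: the student (base model plus plugged-in KM) matches both the logits and the hidden states of a teacher that sees the document in context. The tensors below are stand-ins for real model outputs, and the equal weighting of the two terms is our assumption.

```python
# Deep Context Distillation loss sketch: match teacher logits + hidden states.
import torch
import torch.nn.functional as F

vocab, hidden, seq = 100, 32, 16
teacher_logits = torch.randn(seq, vocab)   # teacher sees doc + query
teacher_hidden = torch.randn(seq, hidden)
student_logits = torch.randn(seq, vocab, requires_grad=True)  # student: query only, KM plugged in
student_hidden = torch.randn(seq, hidden, requires_grad=True)

kl = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True, reduction="batchmean",
)
hidden_match = F.mse_loss(student_hidden, teacher_hidden)

loss = kl + hidden_match   # in practice, only the LoRA (KM) params are trained
loss.backward()
print(float(loss))
```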

[376] Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality

Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier

Main category: cs.LG

TL;DR: The paper proposes a theoretical link between the ℓ₂-norm of sparse and dense vectors, enabling sparse autoencoders (SAEs) to avoid manual tuning of the hyperparameter k (ℓ₀). It introduces a new evaluation method for SAEs and a dynamic activation function (top-AFA) that eliminates the need for fixed k. Empirical results on GPT2 layers validate the approach.

DetailsMotivation: Current methods for training sparse autoencoders lack a theoretical basis for selecting the hyperparameter k (ℓ₀), leading to inefficiencies and manual tuning.

Method: The paper establishes a theoretical connection between ℓ₂-norms of sparse and dense vectors, introduces a new evaluation method for SAEs, and proposes the top-AFA activation function to dynamically determine feature activations.

Result: Empirical validation on GPT2 layers with 80M tokens shows the effectiveness of the proposed approach compared to state-of-the-art k-sparse autoencoders.

Conclusion: The theoretical insights and practical tools (top-AFA) improve the training and evaluation of sparse autoencoders, eliminating the need for manual hyperparameter tuning.

Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability research for large language models; however, the state-of-the-art method of using $k$-sparse autoencoders lacks a theoretical grounding for selecting the hyperparameter $k$ that represents the number of nonzero activations, often denoted by $\ell_0$. In this paper, we reveal a theoretical link that the $\ell_2$-norm of the sparse feature vector can be approximated with the $\ell_2$-norm of the dense vector with a closed-form error, which allows sparse autoencoders to be trained without the need to manually determine $\ell_0$. Specifically, we validate two applications of our theoretical findings. First, we introduce a new methodology that can assess the feature activations of pre-trained SAEs by computing the theoretically expected value from the input embedding, which has been overlooked by existing SAE evaluation methods and loss functions. Second, we introduce a novel activation function, top-AFA, which builds upon our formulation of approximate feature activation (AFA). This function enables top-$k$ style activation without requiring a constant hyperparameter $k$ to be tuned, dynamically determining the number of activated features for each input. By training SAEs on three intermediate layers to reconstruct GPT2 hidden embeddings for over 80 million tokens from the OpenWebText dataset, we demonstrate the empirical merits of this approach and compare it with current state-of-the-art $k$-sparse autoencoders. Our code is available at: https://github.com/SewoongLee/top-afa-sae.
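
A loose numpy sketch of our reading of top-AFA (not the authors' code, which is linked above): grow the set of active features until the sparse vector's ℓ₂-norm reaches a target derived from the dense input embedding, so the number of active features is chosen per input rather than fixed as a hyperparameter.

```python
# Sketch: pick k dynamically by matching an l2-norm target (our reading).
import numpy as np

def top_afa(pre_activations, target_norm):
    order = np.argsort(-np.abs(pre_activations))   # largest magnitude first
    out = np.zeros_like(pre_activations)
    for i in order:
        out[i] = pre_activations[i]
        if np.linalg.norm(out) >= target_norm:     # stop once norms match
            break
    return out

pre = np.random.randn(512) * np.random.rand(512)   # SAE pre-activations
dense = np.random.randn(64)                        # dense input embedding
sparse = top_afa(pre, target_norm=np.linalg.norm(dense))
print(np.count_nonzero(sparse))   # k chosen per input, not tuned
```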

[377] Exploring Superior Function Calls via Reinforcement Learning

Bingguang Hao, Maolin Wang, Zengzhuang Xu, Yicheng Chen, Cunyin Peng, Jinjie GU, Chenyi Zhuang

Main category: cs.LG

TL;DR: A novel reinforcement learning framework improves function calling in LLMs by addressing exploration, reasoning, and verification challenges, achieving 86.02% accuracy on the Berkeley Function Calling Leaderboard.

DetailsMotivation: Current training approaches for function calling in LLMs lack robustness, relying on superficial pattern matching or struggling with complex action spaces.

Method: A two-stage data preparation pipeline with iterative LLM evaluation and AST validation, combined with strategic entropy-based exploration in reinforcement learning.

Result: Achieves 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios, with especially strong gains on code-pretrained models.

Conclusion: The framework enhances function calling performance, suggesting structured language generation as a strong starting point for reinforcement learning in such tasks.

Abstract: Function calling capabilities are crucial for deploying Large Language Models in real-world applications, yet current training approaches fail to develop robust reasoning strategies. Supervised fine-tuning produces models that rely on superficial pattern matching, while standard reinforcement learning methods struggle with the complex action space of structured function calls. We present a novel reinforcement learning framework designed to enhance group relative policy optimization through strategic entropy based exploration specifically tailored for function calling tasks. Our approach addresses three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our two-stage data preparation pipeline ensures high-quality training samples through iterative LLM evaluation and abstract syntax tree validation. Extensive experiments on the Berkeley Function Calling Leaderboard demonstrate that this framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios. Notably, our method shows particularly strong improvements on code-pretrained models, suggesting that structured language generation capabilities provide an advantageous starting point for reinforcement learning in function calling tasks. We will release all the code, models and dataset to benefit the community.
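
The abstract-syntax-tree validation step can be illustrated with Python's stdlib ast module; the tool schema and the specific checks below are our assumptions, since the paper does not spell them out.

```python
# Illustrative AST validation of a generated (keyword-only) function call.
import ast

SCHEMA = {"get_weather": {"city", "unit"}}   # hypothetical tool schema

def validate_call(source: str) -> bool:
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        return False
    allowed = SCHEMA.get(call.func.id)
    if allowed is None:                      # unknown function name
        return False
    kwargs = {kw.arg for kw in call.keywords}
    return kwargs <= allowed                 # no unknown parameters

print(validate_call('get_weather(city="Paris", unit="C")'))  # True
print(validate_call('rm_rf(path="/")'))                      # False
```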

[378] Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, Eran Malach

Main category: cs.LG

TL;DR: RL fine-tuning improves performance in language models for mathematical reasoning, but its mechanisms are unclear. This study systematically analyzes RL fine-tuning’s effects on model scale, data composition, and generalization.

DetailsMotivation: To understand how RL fine-tuning improves language models for mathematical reasoning, given the lack of transparency in existing models' training data and mechanisms.

Method: Train models from scratch on open datasets, using RL algorithms (PPO, GRPO, Expert Iteration) across different model scales.

Result: RL fine-tuning amplifies patterns in the pretraining data, with scale-dependent biases in generalization. Training on simpler questions can improve performance on harder ones.

Conclusion: Controlled small-scale studies reveal RL’s role in shaping model behavior, offering insights for future work.

Abstract: Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning models, recent work has demonstrated that RL fine-tuning consistently improves performance, even in smaller-scale models; however, the underlying mechanisms driving these improvements are not well-understood. Understanding the effects of RL fine-tuning requires disentangling its interaction with pretraining data composition, hyperparameters, and model scale, but such problems are exacerbated by the lack of transparency regarding the training data used in many existing models. In this work, we present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch on different mixtures of fully open datasets. We investigate the effects of various RL fine-tuning algorithms (PPO, GRPO, and Expert Iteration) across models of different scales. Our study reveals that RL algorithms consistently converge towards a dominant output distribution, amplifying patterns in the pretraining data. We also find that models of different scales trained on the same data mixture will converge to distinct output distributions, suggesting that there are scale-dependent biases in model generalization. Moreover, we find that RL post-training on simpler questions can lead to performance gains on harder ones, indicating that certain reasoning capabilities generalize across tasks. Our findings show that small-scale proxies in controlled settings can elicit interesting insights regarding the role of RL in shaping language model behavior.

[379] On the Value of Cross-Modal Misalignment in Multimodal Representation Learning

Yichao Cai, Yuhang Liu, Erdun Gao, Tianjiao Jiang, Zhen Zhang, Anton van den Hengel, Javen Qinfeng Shi

Main category: cs.LG

TL;DR: The paper reconciles opposing views on cross-modal misalignment in multimodal contrastive learning (MMCL), formalizing it via latent variable models and offering practical insights for ML system design.

DetailsMotivation: Address the challenge of cross-modal misalignment in real-world datasets for MMCL, aiming to unify mitigation and leveraging approaches.

Method: Uses latent variable models to formalize misalignment mechanisms (selection and perturbation biases) and analyzes MMCL under these assumptions.

Result: MMCL captures semantic variables invariant to biases, providing a unified understanding of misalignment. Practical insights for ML design are derived.

Conclusion: The study bridges theoretical and practical gaps in handling misalignment, validated through empirical studies on synthetic and real datasets.

Abstract: Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit cross-modal misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize cross-modal misalignment by introducing two specific mechanisms: Selection bias, where some semantic variables are absent in the text, and perturbation bias, where semantic variables are altered – both leading to misalignment in data pairs. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning.

[380] iTFKAN: Interpretable Time Series Forecasting with Kolmogorov-Arnold Network

Ziran Liang, Rui An, Wenqi Fan, Yanghui Rao, Yuxuan Liang

Main category: cs.LG

TL;DR: The paper introduces iTFKAN, an interpretable model for time series forecasting, addressing the lack of interpretability in current deep methods. It uses model symbolization and incorporates prior knowledge and time-frequency learning to improve performance and trustworthiness.

DetailsMotivation: Current deep forecasting methods lack interpretability, which is critical for safety-critical applications like auto-driving and healthcare. This motivates the development of a trustworthy and interpretable model.

Method: iTFKAN employs model symbolization for interpretability and integrates two strategies: prior knowledge injection and time-frequency synergy learning to handle complex time series data.

Result: Experiments show iTFKAN achieves strong forecasting performance while maintaining high interpretability.

Conclusion: iTFKAN provides a credible and interpretable solution for time series forecasting, balancing performance and trustworthiness.

Abstract: As time evolves, data within specific domains exhibit predictability that motivates time series forecasting to predict future trends from historical data. However, current deep forecasting methods can achieve promising performance but generally lack interpretability, hindering trustworthiness and practical deployment in safety-critical applications such as auto-driving and healthcare. In this paper, we propose a novel interpretable model, iTFKAN, for credible time series forecasting. iTFKAN enables further exploration of model decision rationales and underlying data patterns due to its interpretability achieved through model symbolization. Besides, iTFKAN develops two strategies, prior knowledge injection and time-frequency synergy learning, to effectively guide model learning under complex intertwined time series data. Extensive experimental results demonstrate that iTFKAN can achieve promising forecasting performance while simultaneously possessing high interpretive capabilities.

[381] SMOGAN: Synthetic Minority Oversampling with GAN Refinement for Imbalanced Regression

Shayan Alahyari, Mike Domaratzki

Main category: cs.LG

TL;DR: SMOGAN is a two-step oversampling framework for imbalanced regression, using DistGAN to refine synthetic samples and improve performance on underrepresented data.

DetailsMotivation: Imbalanced regression hinders model performance on minority samples, and existing methods often fail to capture complex data distributions.

Method: SMOGAN combines initial oversampling with DistGAN, a distribution-aware GAN, to refine synthetic samples using adversarial loss and Maximum Mean Discrepancy.

Result: SMOGAN outperforms default oversampling methods on 23 imbalanced datasets.

Conclusion: SMOGAN effectively addresses imbalanced regression by generating more accurate synthetic samples, improving model performance on underrepresented data.

Abstract: Imbalanced regression refers to prediction tasks where the target variable is skewed. This skewness hinders machine learning models, especially neural networks, which concentrate on dense regions and therefore perform poorly on underrepresented (minority) samples. Despite the importance of this problem, only a few methods have been proposed for imbalanced regression. Many of the available solutions for imbalanced regression adapt techniques from the class imbalance domain, such as linear interpolation and the addition of Gaussian noise, to create synthetic data in sparse regions. However, in many cases, the underlying distribution of the data is complex and non-linear. Consequently, these approaches generate synthetic samples that do not accurately represent the true feature-target relationship. To overcome these limitations, we propose SMOGAN, a two-step oversampling framework for imbalanced regression. In Stage 1, an existing oversampler generates initial synthetic samples in sparse target regions. In Stage 2, we introduce DistGAN, a distribution-aware GAN that serves as SMOGAN’s filtering layer and refines these samples via adversarial loss augmented with a Maximum Mean Discrepancy objective, aligning them with the true joint feature-target distribution. Extensive experiments on 23 imbalanced datasets show that SMOGAN consistently outperforms the default oversampling method without the DistGAN filtering layer.
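
The Maximum Mean Discrepancy term that augments DistGAN's adversarial loss compares synthetic and real samples in a kernel space. A minimal numpy sketch with an RBF kernel (bandwidth choice ours, and a biased estimator kept for brevity):

```python
# Minimal MMD^2 estimate between real and synthetic samples (RBF kernel).
import numpy as np

def rbf(a, b, gamma=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    # Biased estimator (includes diagonal terms); fine for illustration.
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() \
        - 2.0 * rbf(x, y, gamma).mean()

real = np.random.randn(100, 4)
synthetic = np.random.randn(100, 4) + 0.5   # shifted: nonzero MMD expected
print(mmd2(real, synthetic))
```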

[382] Khan-GCL: Kolmogorov-Arnold Network Based Graph Contrastive Learning with Hard Negatives

Zihu Wang, Boxun Xu, Hejia Geng, Peng Li

Main category: cs.LG

TL;DR: Khan-GCL enhances graph contrastive learning by integrating Kolmogorov-Arnold Networks (KAN) for better encoder capacity and generating semantically meaningful hard negatives.

DetailsMotivation: Address limitations of conventional GCL: MLP-based encoders' restricted capacity and suboptimal negative samples.

Method: Integrates KAN into GCL encoder and uses KAN coefficients to generate hard negatives.

Result: Achieves state-of-the-art performance across datasets and tasks.

Conclusion: Khan-GCL improves GCL by leveraging KAN and strategic hard negatives for better discriminative features.

Abstract: Graph contrastive learning (GCL) has demonstrated great promise for learning generalizable graph representations from unlabeled data. However, conventional GCL approaches face two critical limitations: (1) the restricted expressive capacity of multilayer perceptron (MLP) based encoders, and (2) suboptimal negative samples that either come from random augmentations, failing to provide effective ‘hard negatives’, or are generated without addressing the semantic distinctions crucial for discriminating graph data. To this end, we propose Khan-GCL, a novel framework that integrates the Kolmogorov-Arnold Network (KAN) into the GCL encoder architecture, substantially enhancing its representational capacity. Furthermore, we exploit the rich information embedded within KAN coefficient parameters to develop two novel critical feature identification techniques that enable the generation of semantically meaningful hard negative samples for each graph representation. These strategically constructed hard negatives guide the encoder to learn more discriminative features by emphasizing critical semantic differences between graphs. Extensive experiments demonstrate that our approach achieves state-of-the-art performance compared to existing GCL methods across a variety of datasets and tasks.

[383] VerificAgent: Domain-Specific Memory Verification for Scalable Oversight of Aligned Computer-Use Agents

Thong Q. Nguyen, Shubhang Desai, Raja Hasnain Anwar, Firoz Shaik, Vishwas Suryanarayanan, Vishal Chowdhary

Main category: cs.LG

TL;DR: VerificAgent is a framework for ensuring safe and aligned memory augmentation in computer-using agents (CUAs) by combining expert-curated knowledge, iterative memory growth, and human fact-checking.

DetailsMotivation: Unvetted memories in CUAs can lead to unsafe or domain-inappropriate heuristics, drifting from user intent and safety constraints.

Method: VerificAgent uses expert-curated seed knowledge, trajectory-based memory growth, and post-hoc human fact-checking to sanitize memories.

Result: VerificAgent improves task reliability, reduces hallucination-induced failures, and maintains interpretable guidance without additional fine-tuning.

Conclusion: Human-verified memory provides scalable oversight for CUAs, limiting policy drift and anchoring behavior to domain norms and safety constraints.

Abstract: Continual memory augmentation lets computer-using agents (CUAs) learn from prior interactions, but unvetted memories can encode domain-inappropriate or unsafe heuristics–spurious rules that drift from user intent and safety constraints. We introduce VerificAgent, a scalable oversight framework that treats persistent memory as an explicit alignment surface. VerificAgent combines (1) an expert-curated seed of domain knowledge, (2) iterative, trajectory-based memory growth during training, and (3) a post-hoc human fact-checking pass to sanitize accumulated memories before deployment. Evaluated on OSWorld productivity tasks and additional adversarial stress tests, VerificAgent improves task reliability, reduces hallucination-induced failures, and preserves interpretable, auditable guidance–without additional model fine-tuning. By letting humans correct high-impact errors once, the verified memory acts as a frozen safety contract that future agent actions must satisfy. Our results suggest that domain-scoped, human-verified memory offers a scalable oversight mechanism for CUAs, complementing broader alignment strategies by limiting silent policy drift and anchoring agent behavior to the norms and safety constraints of the target domain.

[384] Fusing Cross-Domain Knowledge from Multimodal Data to Solve Problems in the Physical World

Yu Zheng

Main category: cs.LG

TL;DR: The paper introduces a framework for cross-domain multimodal data fusion to address real-world problems by leveraging existing data from diverse domains, overcoming challenges of knowledge alignment.

DetailsMotivation: The complexity of physical environments requires multimodal data fusion, but collecting new data for every problem is impractical. Cross-domain knowledge fusion is proposed to utilize existing data from other domains.

Method: A four-layer framework (Domains, Links, Models, Data) is introduced to answer key questions: what to fuse, why fusion is possible, and how to fuse. It includes domain selection, knowledge alignment, fusion paradigms, and data representation.

Result: The framework enables effective fusion of cross-domain multimodal data, providing a structured approach to solve real-world problems without needing new data collection.

Conclusion: The proposed framework addresses the challenges of cross-domain knowledge fusion, offering a scalable and practical solution for leveraging existing multimodal data across domains.

Abstract: The proliferation of artificial intelligence has enabled a diversity of applications that bridge the gap between digital and physical worlds. As physical environments are too complex to model through a single information acquisition approach, it is crucial to fuse multimodal data generated by different sources, such as sensors, devices, systems, and people, to solve a problem in the real world. Unfortunately, it is neither applicable nor sustainable to deploy new resources to collect original data from scratch for every problem. Thus, when data is inadequate in the problem domain, it is vital to fuse knowledge from multimodal data that is already available in other domains. We call this cross-domain knowledge fusion. Existing research focuses on fusing multimodal data in a single domain, supposing the knowledge from different datasets is intrinsically aligned; however, this assumption may not hold in the scenarios of cross-domain knowledge fusion. In this paper, we formally define the cross-domain multimodal data fusion problem, discussing its unique challenges, differences and advantages beyond data fusion in a single domain. We propose a four-layer framework, consisting of Domains, Links, Models and Data layers, answering three key questions: “what to fuse”, “why can be fused”, and “how to fuse”. The Domains Layer selects relevant data from different domains for a given problem. The Links Layer reveals the philosophy of knowledge alignment beyond specific model structures. The Models Layer provides two knowledge fusion paradigms based on the fundamental mechanisms for processing data. The Data Layer turns data of different structures, resolutions, scales and distributions into a consistent representation that can be fed into an AI model. With this framework, we can design solutions that fuse cross-domain multimodal data effectively for solving real-world problems.

[385] Scientifically-Interpretable Reasoning Network (ScIReN): Discovering Hidden Relationships in the Carbon Cycle and Beyond

Joshua Fan, Haodi Xu, Feng Tao, Md Nasim, Marc Grimson, Yiqi Luo, Carla P. Gomes

Main category: cs.LG

TL;DR: ScIReN combines interpretable neural networks and process-based models to improve understanding of the soil carbon cycle, outperforming black-box models in accuracy and interpretability.

DetailsMotivation: The soil carbon cycle is poorly understood, and existing models either lack interpretability or rely on ad-hoc parameters. ScIReN aims to bridge this gap by integrating scientific knowledge with data-driven learning.

Method: ScIReN uses an interpretable encoder (Kolmogorov-Arnold networks) to predict latent parameters and a differentiable process-based decoder to enforce scientific laws. Smoothness penalties and hard-sigmoid constraints ensure interpretability and adherence to prior knowledge.

Result: ScIReN outperforms black-box models in predictive accuracy for soil carbon flow and ecosystem respiration tasks while revealing hidden scientific relationships.

Conclusion: ScIReN provides a transparent, accurate framework for studying the soil carbon cycle, combining the strengths of neural networks and process-based models.

Abstract: Understanding how carbon flows through the soil is crucial for mitigating the effects of climate change. While soils have potential to sequester carbon from the atmosphere, the soil carbon cycle remains poorly understood. Scientists have developed mathematical process-based models of the soil carbon cycle based on existing knowledge, but they contain numerous unknown parameters that must be set in an ad-hoc manner, and often fit observations poorly. On the other hand, neural networks can learn patterns from data, but do not respect known scientific laws, nor can they reveal novel scientific relationships due to their black-box nature. We thus propose Scientifically-Interpretable Reasoning Network (ScIReN), a fully-transparent framework that combines interpretable neural and process-based reasoning. An interpretable encoder predicts scientifically-meaningful latent parameters, which are then passed through a differentiable process-based decoder to predict labeled output variables. ScIReN leverages Kolmogorov-Arnold networks (KAN) to ensure the encoder is fully interpretable and reveals relationships between input features and latent parameters; it uses novel smoothness penalties to balance expressivity and simplicity. ScIReN also uses a novel hard-sigmoid constraint layer to restrict latent parameters to meaningful ranges defined by scientific prior knowledge. While the process-based decoder enforces established scientific knowledge, the KAN-based encoder reveals new scientific relationships hidden in conventional black-box models. We apply ScIReN on two tasks: simulating the flow of organic carbon through soils, and modeling ecosystem respiration from plants. In both tasks, ScIReN outperforms black-box networks in predictive accuracy while providing substantial scientific interpretability – it can infer latent scientific mechanisms and their relationships with input features.
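
The hard-sigmoid constraint layer can be sketched in a few lines of PyTorch: map unconstrained encoder outputs into a prior range [lo, hi], with exact saturation so the bounds are attainable. The bounds below are placeholders, not values from the paper.

```python
# Hard-sigmoid constraint layer: restrict latents to a scientific prior range.
import torch
import torch.nn.functional as F

def constrain(latent, lo, hi):
    # hardsigmoid maps R -> [0, 1] and saturates exactly at the ends,
    # so constrained values can actually reach the prior bounds.
    return lo + (hi - lo) * F.hardsigmoid(latent)

raw = torch.randn(8) * 10                  # unconstrained encoder outputs
turnover_rate = constrain(raw, lo=0.01, hi=2.0)   # placeholder bounds
print(turnover_rate.min().item(), turnover_rate.max().item())
```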

[386] Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Zesen Wang, Lijuan Lan, Yonggang Li

Main category: cs.LG

TL;DR: LLM-Prompt is a novel framework using large language models (LLMs) for time series forecasting, addressing shortcomings like lack of unified textual prompts and modality discrepancies.

DetailsMotivation: Existing LLM-based methods for time series forecasting lack a unified prompt paradigm and ignore modality differences between text and time series data.

Method: Proposes LLM-Prompt, integrating multi-prompt information and cross-modal semantic alignment, with learnable soft prompts, textualized hard prompts, and a cross-modal fusion module.

Result: Demonstrates strong performance on 6 public and 3 carbon emission datasets.

Conclusion: LLM-Prompt is an effective framework for time series forecasting, overcoming limitations of existing LLM-based methods.

Abstract: Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting and data-scarce scenarios. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting. However, we find existing LLM-based methods still have shortcomings: (1) the absence of a unified paradigm for textual prompt formulation and (2) the neglect of modality discrepancies between textual prompts and time series. To address this, we propose LLM-Prompt, an LLM-based time series forecasting framework integrating multi-prompt information and cross-modal semantic alignment. Specifically, we first construct a unified textual prompt paradigm containing learnable soft prompts and textualized hard prompts. Second, to enhance LLMs’ comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve cross-modal fusion of temporal and textual information. Finally, the transformed time series from the LLMs are projected to obtain the forecasts. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that LLM-Prompt is a powerful framework for time series forecasting.
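
A minimal PyTorch sketch of the prompt assembly the framework describes: learnable soft-prompt embeddings concatenated with embedded textualized hard prompts and time-series tokens before the (frozen) LLM. The shapes and the embedding stand-in are ours.

```python
# Soft + hard prompt assembly for an LLM forecaster (illustrative shapes).
import torch
import torch.nn as nn

d_model, n_soft = 64, 8
soft_prompt = nn.Parameter(torch.randn(n_soft, d_model))   # learnable

hard_prompt_ids = torch.randint(0, 1000, (20,))   # tokenized textual prompt
embed = nn.Embedding(1000, d_model)               # stands in for LLM embeddings
hard_emb = embed(hard_prompt_ids)

series_emb = torch.randn(96, d_model)   # embedded time-series patches

llm_input = torch.cat([soft_prompt, hard_emb, series_emb], dim=0)
print(llm_input.shape)                  # (8 + 20 + 96, 64)
```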

[387] Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information

Fei Chen, Wenchi Zhou

Main category: cs.LG

TL;DR: The paper proposes a data reduction strategy using Pointwise V-Information (PVI) to improve training efficiency and model performance by selecting optimal data subsets.

DetailsMotivation: To enhance data-centric AI by focusing on the most instructive examples in large datasets, improving data quality and training efficiency.

Method: Uses PVI to quantify instance difficulty, removes low-difficulty instances, and employs progressive learning on PVI-sorted examples.

Result: Maintains classifier performance with minimal accuracy decline (0.0001%-0.76%) when removing 10%-30% of data, and achieves 0.8% accuracy gain with progressive learning.

Conclusion: Training on optimal subsets with PVI-based data reduction improves model performance and efficiency, with successful adaptation to Chinese NLP tasks.

Abstract: Data reduction is essential to data-centric Artificial Intelligence (AI): it increases the effectiveness of model training by locating the most instructive examples in massive datasets. The main difficulty is choosing the best examples, rather than using the complete dataset, so as to increase data quality and training efficiency. In this paper, we propose an effective data reduction strategy based on Pointwise V-Information (PVI). To enable a static method, we first use PVI to quantify instance difficulty and remove instances with low difficulty. Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% decline in accuracy when 10%-30% of the data is removed. Second, we train the classifiers using a progressive learning strategy on examples sorted by increasing PVI, accelerating convergence and achieving a 0.8% accuracy gain over conventional training. Our findings imply that training a classifier on the chosen optimal subset may improve model performance and increase training efficiency when combined with an efficient data reduction strategy. Furthermore, we have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese Natural Language Processing (NLP) tasks and base models, yielding insightful results for faster training and cross-lingual data reduction.
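
Pointwise V-information itself is simple to compute once the two models are trained: PVI(x → y) = −log₂ p_∅(y) + log₂ p(y | x), where p_∅ comes from a model given a null input. A schematic sketch with stand-in probabilities:

```python
# Schematic PVI: how much does the input x help the model predict label y?
import math

def pvi(p_null_y: float, p_cond_y: float) -> float:
    # p_null_y: prob. of y under a model trained/queried with a null input;
    # p_cond_y: prob. of y under a model that sees the real input x.
    return -math.log2(p_null_y) + math.log2(p_cond_y)

# Input makes y much more predictable -> high PVI -> easy instance
# (low difficulty, a removal candidate under the paper's strategy).
print(pvi(p_null_y=0.25, p_cond_y=0.95))   # ~1.93 bits
# Input barely helps -> low PVI -> hard instance (kept for training).
print(pvi(p_null_y=0.25, p_cond_y=0.26))   # ~0.06 bits
```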

[388] A Graph Sufficiency Perspective for Neural Networks

Cencheng Shen, Yuexiao Dong

Main category: cs.LG

TL;DR: The paper analyzes neural networks using graph variables and statistical sufficiency, interpreting layers as graph-based transformations and establishing conditions for layer outputs to preserve input distributions.

DetailsMotivation: To bridge statistical sufficiency, graph theory, and deep learning, providing a new statistical understanding of neural networks.

Method: Two theoretical paths: one assumes dense anchor points for asymptotic sufficiency in infinite-width networks, and the other ensures exact/approximate sufficiency in finite-width networks with region-separated inputs.

Result: The framework covers various architectures (fully connected, ReLU, sigmoid, CNNs) and provides error bounds on optimal loss for regression and classification.

Conclusion: The work offers a novel statistical perspective on neural networks, linking graph-theoretic representations and deep learning.

Abstract: This paper analyzes neural networks through graph variables and statistical sufficiency. We interpret neural network layers as graph-based transformations, where neurons act as pairwise functions between inputs and learned anchor points. Within this formulation, we establish conditions under which layer outputs are sufficient for the layer inputs, that is, each layer preserves the conditional distribution of the target variable given the input variable. We explore two theoretical paths under this graph-based view. The first path assumes dense anchor points and shows that asymptotic sufficiency holds in the infinite-width limit and is preserved throughout training. The second path, more aligned with practical architectures, proves exact or approximate sufficiency in finite-width networks by assuming region-separated input distributions and constructing appropriate anchor points. This path can ensure the sufficiency property for an infinite number of layers, and provide error bounds on the optimal loss for both regression and classification tasks using standard neural networks. Our framework covers fully connected layers, general pairwise functions, ReLU and sigmoid activations, and convolutional neural networks. Overall, this work bridges statistical sufficiency, graph-theoretic representations, and deep learning, providing a new statistical understanding of neural networks.

[389] Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains

Jingyi Yu, Tim Pychynski, Marco F. Huber

Main category: cs.LG

TL;DR: CICME is a three-step method for inferring causal mechanisms from heterogeneous data across domains, using Causal Transfer Learning to identify domain-invariant mechanisms and guide individual domain analysis.

DetailsMotivation: To understand complex sensor systems through causality by analyzing heterogeneous data from multiple domains.

Method: CICME uses a three-step approach, leveraging Causal Transfer Learning to detect domain-invariant causal mechanisms and then individually estimating remaining mechanisms per domain.

Result: CICME outperforms baseline methods in linear Gaussian models, especially in manufacturing-inspired scenarios.

Conclusion: CICME effectively combines pooled and individual domain causal discovery, demonstrating superior performance in certain cases.

Abstract: To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired by a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.

[390] RANA: Robust Active Learning for Noisy Network Alignment

Yixuan Nan, Xixun Lin, Yanmin Shang, Zhuofan Li, Can Zhao, Yanan Cao

Main category: cs.LG

TL;DR: RANA is a robust active learning framework for noisy network alignment, addressing structural and labeling noise while improving alignment accuracy.

DetailsMotivation: Existing network alignment methods overlook noise issues, which degrade performance. RANA aims to tackle structural and labeling noise alongside label sparsity.

Method: RANA uses a Noise-aware Selection Module for structural noise and a Label Denoising Module for labeling noise, incorporating cleanliness scores and multi-source fusion denoising.

Result: RANA outperforms state-of-the-art active learning methods in alignment accuracy on three real-world datasets.

Conclusion: RANA effectively improves robustness in network alignment by addressing noise and sparsity, demonstrating superior performance.

Abstract: Network alignment has attracted widespread attention in various fields. However, most existing works mainly focus on the problem of label sparsity, while overlooking the issue of noise in network alignment, which can substantially undermine model performance. Such noise mainly includes structural noise from noisy edges and labeling noise caused by human-induced and process-driven errors. To address these problems, we propose RANA, a Robust Active learning framework for noisy Network Alignment. RANA effectively tackles both structure noise and label noise while addressing the sparsity of anchor link annotations, which can improve the robustness of network alignment models. Specifically, RANA introduces the proposed Noise-aware Selection Module and the Label Denoising Module to address structural noise and labeling noise, respectively. In the first module, we design a noise-aware maximization objective to select node pairs, incorporating a cleanliness score to address structural noise. In the second module, we propose a novel multi-source fusion denoising strategy that leverages model and twin node pairs labeling to provide more accurate labels for node pairs. Empirical results on three real-world datasets demonstrate that RANA outperforms state-of-the-art active learning-based methods in alignment accuracy. Our code is available at https://github.com/YXNan0110/RANA.

[391] Adacc: An Adaptive Framework Unifying Compression and Activation Recomputation for LLM Training

Ping Chen, Zhuohong Deng, Ping Li, Shuibing He, Hongzi Zhu, Yi Zheng, Zhefeng Wang, Baoxing Huai, Minyi Guo

Main category: cs.LG

TL;DR: Adacc is an adaptive memory optimization framework for LLMs that dynamically combines activation recomputation and data compression to improve training efficiency without sacrificing accuracy.

DetailsMotivation: GPU memory constraints limit LLM training. Existing methods (recomputation and compression) have drawbacks like overhead or accuracy loss. Adacc aims to unify these strategies adaptively.

Method: Adacc uses tensor-level decisions, layer-specific compression, MILP-based scheduling, and adaptive policy evolution to optimize memory usage dynamically.

Result: Adacc boosts training throughput by 1.01x to 1.37x while maintaining baseline accuracy.

Conclusion: Adacc effectively balances memory optimization and accuracy, outperforming static or single-strategy approaches.

Abstract: Training large language models (LLMs) is often constrained by GPU memory limitations. To alleviate memory pressure, activation recomputation and data compression have been proposed as two major strategies. However, both approaches have limitations: recomputation introduces significant training overhead, while compression can lead to accuracy degradation and computational inefficiency when applied naively. In this paper, we propose Adacc, the first adaptive memory optimization framework that unifies activation recomputation and data compression to improve training efficiency for LLMs while preserving model accuracy. Unlike existing methods that apply static, rule-based strategies or rely solely on one technique, Adacc makes fine-grained, tensor-level decisions, dynamically selecting between recomputation, retention, and compression based on tensor characteristics and runtime hardware constraints. Adacc tackles three key challenges: (1) it introduces layer-specific compression algorithms that mitigate accuracy loss by accounting for outliers in LLM activations; (2) it employs a MILP-based scheduling policy to globally optimize memory strategies across layers; and (3) it integrates an adaptive policy evolution mechanism to update strategies during training in response to changing data distributions. Experimental results show that Adacc improves training throughput by 1.01x to 1.37x compared to state-of-the-art frameworks, while maintaining accuracy comparable to the baseline.

[392] SPARTA: Advancing Sparse Attention in Spiking Neural Networks via Spike-Timing-Based Prioritization

Minsuk Jang, Changick Kim

Main category: cs.LG

TL;DR: SPARTA leverages spike-timing dynamics for efficient sparse attention in SNNs, reducing complexity while maintaining accuracy.

DetailsMotivation: Current SNNs overlook precise spike-timing information, missing computational cues. SPARTA aims to exploit these dynamics for better efficiency and performance.

Method: SPARTA uses heterogeneous neuron dynamics and spike-timing cues (firing patterns, timing, intervals) for competitive gating, achieving 65.4% sparsity.

Result: Achieves 98.78% on DVS-Gesture, 83.06% on CIFAR10-DVS, and 95.3% on CIFAR-10, reducing attention complexity from O(N^2) to O(K^2).

Conclusion: Exploiting spike timing improves computational efficiency and accuracy, making SPARTA state-of-the-art for SNNs.

Abstract: Current Spiking Neural Networks (SNNs) underutilize the temporal dynamics inherent in spike-based processing, relying primarily on rate coding while overlooking precise timing information that provides rich computational cues. We propose SPARTA (Spiking Priority Attention with Resource-Adaptive Temporal Allocation), a framework that leverages heterogeneous neuron dynamics and spike-timing information to enable efficient sparse attention. SPARTA prioritizes tokens based on temporal cues, including firing patterns, spike timing, and inter-spike intervals, achieving 65.4% sparsity through competitive gating. By selecting only the most salient tokens, SPARTA reduces attention complexity from O(N^2) to O(K^2) with K ≪ N, while maintaining high accuracy. Our method achieves state-of-the-art performance on DVS-Gesture (98.78%) and competitive results on CIFAR10-DVS (83.06%) and CIFAR-10 (95.3%), demonstrating that exploiting spike timing dynamics improves both computational efficiency and accuracy.
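
As a rough illustration of the complexity reduction (a sketch under assumed tensor shapes, not the authors' code), one can score tokens from spike-timing cues, keep the top-K, and attend only among them:

```python
import torch
import torch.nn.functional as F

def timing_prioritized_attention(spikes, x, k):
    """spikes: (B, T, N) binary spike trains; x: (B, N, D) token features."""
    B, T, N = spikes.shape
    rate = spikes.float().mean(dim=1)                       # firing-rate cue, (B, N)
    t = torch.arange(T).view(1, T, 1).expand(B, T, N)
    first = torch.where(spikes > 0, t, T).amin(dim=1).float()  # first-spike time
    score = rate - first / T                                # earlier + denser = salient
    idx = score.topk(k, dim=-1).indices                     # keep only top-K tokens
    xk = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    attn = F.softmax(xk @ xk.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
    return attn @ xk                                        # O(K^2) instead of O(N^2)

out = timing_prioritized_attention(
    (torch.rand(2, 50, 64) < 0.1).int(), torch.randn(2, 64, 32), k=16
)
```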

[393] Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization

Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang

Main category: cs.LG

TL;DR: A resource-efficient framework combining knowledge distillation and particle swarm optimization for automated vulnerability assessment reduces model size by 99.4% while retaining 89.3% accuracy.

DetailsMotivation: Addressing the computational and storage demands of large pre-trained models in cybersecurity vulnerability assessment.

Method: Two-stage approach: particle swarm optimization for compact model architecture and knowledge distillation to transfer knowledge from a large teacher model.

Result: Achieves 99.4% model size reduction, 89.3% accuracy retention, outperforms baselines by 1.7% accuracy with 60% fewer parameters, and reduces training time by 72.1%.

Conclusion: The proposed framework is effective for scalable and efficient vulnerability assessment, balancing performance and resource usage.

Abstract: The increasing complexity of software systems has led to a surge in cybersecurity vulnerabilities, necessitating efficient and scalable solutions for vulnerability assessment. However, the deployment of large pre-trained models in real-world scenarios is hindered by their substantial computational and storage demands. To address this challenge, we propose a novel resource-efficient framework that integrates knowledge distillation and particle swarm optimization to enable automated vulnerability assessment. Our framework employs a two-stage approach: First, particle swarm optimization is utilized to optimize the architecture of a compact student model, balancing computational efficiency and model capacity. Second, knowledge distillation is applied to transfer critical vulnerability assessment knowledge from a large teacher model to the optimized student model. This process significantly reduces the model size while maintaining high performance. Experimental results on an enhanced MegaVul dataset, comprising 12,071 CVSS (Common Vulnerability Scoring System) v3 annotated vulnerabilities, demonstrate the effectiveness of our approach. Our approach achieves a 99.4% reduction in model size while retaining 89.3% of the original model’s accuracy. Furthermore, it outperforms state-of-the-art baselines by 1.7% in accuracy with 60% fewer parameters. The framework also reduces training time by 72.1% and architecture search time by 34.88% compared to traditional genetic algorithms.
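
The distillation step can be pictured with the standard temperature-scaled objective; this is a generic formulation with illustrative temperature and weighting, and the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend of soft teacher targets and hard ground-truth targets."""
    # Soft targets: KL between temperature-softened teacher and student.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: usual cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```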

[394] HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Pusen Dong, Chenglong Cao, Xinyu Zhou, Jirong You, Linhe Xu, Feifan Xu, Shuo Yuan

Main category: cs.LG

TL;DR: HALO, a new auto-bidding method, addresses inefficiencies in traditional solutions by leveraging hindsight learning and B-spline representation for robust adaptation to diverse advertiser constraints.

DetailsMotivation: Traditional auto-bidding solutions struggle with sample inefficiency and poor generalization under varying budget-ROI constraints, necessitating a more adaptive approach.

Method: HALO uses a hindsight mechanism to repurpose failed explorations into training data and employs B-spline functional representation for continuous bid mapping.

Result: HALO outperforms traditional methods, reducing constraint violations and improving Gross Merchandise Value (GMV) in industrial evaluations.

Conclusion: HALO provides a scalable and efficient solution for Multi-Constraint Bidding in dynamic digital advertising environments.

Abstract: Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.
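
To illustrate the B-spline functional representation, here is a sketch with made-up anchor points, using SciPy rather than anything from the paper: a smooth, differentiable map from a constraint value (e.g., a target ROI) to a bidding coefficient.

```python
import numpy as np
from scipy.interpolate import make_interp_spline

# Hypothetical anchor points: constraint level -> learned bid coefficient.
roi_targets = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
bid_coeffs = np.array([1.8, 1.2, 0.9, 0.6, 0.4])

spline = make_interp_spline(roi_targets, bid_coeffs, k=3)  # cubic B-spline
print(spline(3.0))               # bid coefficient at an unseen ROI target
print(spline.derivative()(3.0))  # derivative-aware: sensitivity to the constraint
```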

[395] Intelligent Sampling of Extreme-Scale Turbulence Datasets for Accurate and Efficient Spatiotemporal Model Training

Wesley Brewer, Murali Meena Gopalakrishnan, Matthias Maiterth, Aditya Kashi, Jong Youl Choi, Pei Zhang, Stephen Nichols, Riccardo Balin, Miles Couchman, Stephen de Bruyn Kops, P. K. Yeung, Daniel Dotson, Rohini Uma-Vaideswaran, Sarp Oral, Feiyi Wang

Main category: cs.LG

TL;DR: SICKLE, a sparse intelligent curation framework, uses MaxEnt sampling to train models with less data, improving accuracy and reducing energy use by up to 38x.

DetailsMotivation: With Moore's law and Dennard scaling ending, efficient training requires reducing data volume without compromising model performance.

Method: Developed SICKLE with MaxEnt sampling, scalable training, and energy benchmarking, comparing it with random and phase-space sampling on DNS turbulence datasets.

Result: Subsampling as preprocessing improved model accuracy and reduced energy consumption by up to 38x in some cases.

Conclusion: Intelligent subsampling (SICKLE) enables efficient training with less data, offering significant energy savings and accuracy improvements.

Abstract: With the end of Moore’s law and Dennard scaling, efficient training increasingly requires rethinking data volume. Can we train better models with significantly less data via intelligent subsampling? To explore this, we develop SICKLE, a sparse intelligent curation framework for efficient learning, featuring a novel maximum entropy (MaxEnt) sampling approach, scalable training, and energy benchmarking. We compare MaxEnt with random and phase-space sampling on large direct numerical simulation (DNS) datasets of turbulence. Evaluating SICKLE at scale on Frontier, we show that subsampling as a preprocessing step can improve model accuracy and substantially lower energy consumption, with reductions of up to 38x observed in certain cases.
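
A crude way to approximate the flavor of MaxEnt subsampling (an illustrative sketch, not SICKLE itself) is to draw points with inverse-frequency weights over a feature histogram, so the selected subset covers the phase space more uniformly:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(100_000)   # stand-in for a DNS field statistic

counts, edges = np.histogram(features, bins=64)
bin_idx = np.clip(np.digitize(features, edges) - 1, 0, 63)
weights = 1.0 / counts[bin_idx]           # rare bins get higher weight
weights /= weights.sum()

# Subsample: the histogram of the selected points is flatter (higher entropy).
subset = rng.choice(len(features), size=5_000, replace=False, p=weights)
```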

[396] Federated Continual Recommendation

Jaehyung Lim, Wonbin Kweon, Woojoo Kim, Junyoung Kim, Seongjin Choi, Dongha Kim, Hwanjo Yu

Main category: cs.LG

TL;DR: F3CRec bridges Federated Recommendation (FedRec) and Continual Learning Recommendation (CLRec) to handle non-stationary data streams while preserving privacy, outperforming existing methods.

DetailsMotivation: Existing FedRec methods struggle with evolving user preferences, while CLRec assumes centralized data, creating a gap for privacy-preserving continual learning in recommendations.

Method: Proposes F3CRec with Adaptive Replay Memory (client-side) and Item-wise Temporal Mean (server-side) to balance knowledge retention and adaptation.

Result: F3CRec outperforms existing methods in maintaining recommendation quality over time in federated environments.

Conclusion: F3CRec effectively integrates FedRec and CLRec, addressing privacy and non-stationary data challenges in recommendation systems.

Abstract: The increasing emphasis on privacy in recommendation systems has led to the adoption of Federated Learning (FL) as a privacy-preserving solution, enabling collaborative training without sharing user data. While Federated Recommendation (FedRec) effectively protects privacy, existing methods struggle with non-stationary data streams, failing to maintain consistent recommendation quality over time. On the other hand, Continual Learning Recommendation (CLRec) methods address evolving user preferences but typically assume centralized data access, making them incompatible with FL constraints. To bridge this gap, we introduce Federated Continual Recommendation (FCRec), a novel task that integrates FedRec and CLRec, requiring models to learn from streaming data while preserving privacy. As a solution, we propose F3CRec, a framework designed to balance knowledge retention and adaptation under the strict constraints of FCRec. F3CRec introduces two key components: Adaptive Replay Memory on the client side, which selectively retains past preferences based on user-specific shifts, and Item-wise Temporal Mean on the server side, which integrates new knowledge while preserving prior information. Extensive experiments demonstrate that F3CRec outperforms existing approaches in maintaining recommendation quality over time in a federated environment.
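
The server-side Item-wise Temporal Mean can be pictured as a per-item blend of old and new embeddings. This minimal sketch assumes a simple per-item blending weight; the paper's actual integration rule may compute these weights differently.

```python
import torch

def itemwise_temporal_mean(old_emb, new_emb, alpha):
    """old_emb, new_emb: (num_items, dim); alpha: (num_items, 1) per-item weight."""
    # Items with larger alpha absorb more of the newly aggregated knowledge,
    # while the rest of the table preserves prior information.
    return alpha * new_emb + (1.0 - alpha) * old_emb

old = torch.randn(1000, 64)
new = torch.randn(1000, 64)
alpha = torch.rand(1000, 1)   # stand-in per-item update weights
merged = itemwise_temporal_mean(old, new, alpha)
```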

[397] X-VFL: A New Vertical Federated Learning Framework with Cross Completion and Decision Subspace Alignment

Qinghua Yao, Xiangrui Xu, Zhize Li

Main category: cs.LG

TL;DR: X-VFL is a new Vertical Federated Learning framework addressing data alignment and independent inference challenges with novel modules (XCom and DS-Align) and proven convergence rates.

DetailsMotivation: Overcome limitations in VFL: strict data alignment and lack of local inference support.

Method: Introduces X-VFL with Cross Completion (XCom) for missing features and Decision Subspace Alignment (DS-Align) for local inference.

Result: Achieves 15% and 43% accuracy improvements on CIFAR-10 and MIMIC-III datasets, respectively.

Conclusion: X-VFL is effective for scenarios with missing features and local inference, outperforming existing methods.

Abstract: Vertical Federated Learning (VFL) enables collaborative learning by integrating disjoint feature subsets from multiple clients/parties. However, VFL typically faces two key challenges: i) the requirement for perfectly aligned data samples across all clients (missing features are not allowed); ii) the requirement for joint collaborative inference/prediction involving all clients (it does not support locally independent inference on a single client). To address these challenges, we propose X-VFL, a new VFL framework designed to deal with the non-aligned data samples with (partially) missing features and to support locally independent inference of new data samples for each client. In particular, we design two novel modules in X-VFL: Cross Completion (XCom) and Decision Subspace Alignment (DS-Align). XCom can complete/reconstruct missing features for non-aligned data samples by leveraging information from other clients. DS-Align aligns local features with completed and global features across all clients within the decision subspace, thus enabling locally independent inference at each client. Moreover, we provide convergence theorems for different algorithms used in training X-VFL, showing an $O(1/\sqrt{T})$ convergence rate for SGD-type algorithms and an $O(1/T)$ rate for PAGE-type algorithms, where $T$ denotes the number of training update steps. Extensive experiments on real-world datasets demonstrate that X-VFL significantly outperforms existing methods, e.g., achieving a 15% improvement in accuracy on the image CIFAR-10 dataset and a 43% improvement on the medical MIMIC-III dataset. These results validate the practical effectiveness and superiority of X-VFL, particularly in scenarios involving partially missing features and locally independent inference.
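
The cross-completion idea can be sketched as a small regression network fitted on the aligned samples, then used to impute the missing party's features for non-aligned ones. Dimensions, architecture, and training loop below are illustrative assumptions, not the XCom module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim_a, dim_b = 16, 24   # feature widths held by client A and client B
xcom = nn.Sequential(nn.Linear(dim_b, 64), nn.ReLU(), nn.Linear(64, dim_a))
opt = torch.optim.Adam(xcom.parameters(), lr=1e-3)

# Fit on the perfectly aligned samples, where both feature subsets exist.
aligned_a, aligned_b = torch.randn(256, dim_a), torch.randn(256, dim_b)
for _ in range(100):
    loss = F.mse_loss(xcom(aligned_b), aligned_a)
    opt.zero_grad(); loss.backward(); opt.step()

# Impute client A's missing features for non-aligned samples.
unaligned_b = torch.randn(32, dim_b)
completed_a = xcom(unaligned_b)
```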

cs.MA

[398] Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems

Alistair Reid, Simon O’Callaghan, Liam Carroll, Tiberio Caetano

Main category: cs.MA

TL;DR: The paper discusses the need for a new risk analysis approach for multi-agent AI systems, identifying six critical failure modes and providing tools for practitioners to assess them.

DetailsMotivation: Organizations are adopting interconnected multi-agent AI systems, but interactions between agents create emergent behaviors and novel risks, requiring a different risk analysis approach than single-agent systems.

Method: The report examines six failure modes in governed environments and offers a toolkit for practitioners. It advocates for staged testing and evidence collection through simulation, observation, benchmarking, and red teaming.

Result: The methodology provides a foundation for robust risk management in LLM-based multi-agent systems, addressing emergent risks and failure modes.

Conclusion: The approach emphasizes analysis validity and progressive testing to mitigate risks in multi-agent AI deployments, supporting safer organizational adoption.

Abstract: Organisations are starting to adopt LLM-based AI agents, with their deployments naturally evolving from single agents towards interconnected, multi-agent networks. Yet a collection of safe agents does not guarantee a safe collection of agents, as interactions between agents over time create emergent behaviours and induce novel failure modes. This means multi-agent systems require a fundamentally different risk analysis approach than that used for a single agent. This report addresses the early stages of risk identification and analysis for multi-agent AI systems operating within governed environments where organisations control their agent configurations and deployment. In this setting, we examine six critical failure modes: cascading reliability failures, inter-agent communication failures, monoculture collapse, conformity bias, deficient theory of mind, and mixed motive dynamics. For each, we provide a toolkit for practitioners to extend or integrate into their existing frameworks to assess these failure modes within their organisational contexts. Given fundamental limitations in current LLM behavioural understanding, our approach centres on analysis validity, and advocates for progressively increasing validity through staged testing across stages of abstraction and deployment that gradually increases exposure to potential negative impacts, while collecting convergent evidence through simulation, observational analysis, benchmarking, and red teaming. This methodology establishes the groundwork for robust organisational risk management as these LLM-based multi-agent systems are deployed and operated.

[399] Semantic Reasoning Meets Numerical Precision: An LLM-Powered Multi-Agent System for Power Grid Control

Yan Zhang

Main category: cs.MA

TL;DR: Grid-Agent is an AI-driven framework combining LLMs and multi-agent reinforcement learning to address power grid violations in real time, outperforming traditional methods in tests.

DetailsMotivation: The complexity of modern power grids due to DERs, EVs, and extreme weather events challenges traditional rule-based and optimization methods, necessitating adaptive solutions.

Method: Grid-Agent uses a modular agent architecture: a planning agent for action sequences and a validation agent for stability checks, with adaptive multiscale network representation for scalability.

Result: Tests on IEEE and CIGRE systems show superior violation mitigation, with continuous learning capabilities for diverse topologies.

Conclusion: Grid-Agent is highly suitable for smart grids, offering rapid, autonomous responses to dynamic conditions.

Abstract: The increasing penetration of Distributed Energy Resources (DERs), widespread adoption of Electric Vehicles (EVs), and the growing frequency of extreme weather events have significantly increased the complexity of power grid planning, operation, and management. Traditional rule-based systems and numerical optimization approaches often struggle with the scale, dynamics, and adaptability required by modern power networks. This paper introduces Grid-Agent, an autonomous, AI-driven framework that combines Large Language Models (LLMs) with multi-agent reinforcement learning to detect and remediate grid violations in real time. Grid-Agent integrates semantic reasoning with numerical precision through a modular agent architecture: a planning agent generates coordinated action sequences using numerical power flow solvers, while a validation agent evaluates system stability and action effectiveness via sandboxed execution with safety rollbacks. To ensure scalability, Grid-Agent incorporates an adaptive multiscale network representation that dynamically selects optimal encoding schemes based on network size and complexity. The framework enables coordinated violation resolution through optimizing switch configurations, battery deployment, and load curtailment strategies. Experimental results in standard IEEE and CIGRE test systems (IEEE 69-bus, CIGRE MV, and IEEE 30-bus) demonstrate superior violation mitigation performance. Additionally, the framework’s built-in data collection and learning capabilities enable continuous learning and adaptation to diverse network topologies. The autonomous nature of the framework makes it particularly suitable for modern smart grid applications requiring rapid response to dynamic operating conditions.

[400] Flow-Based Task Assignment for Large-Scale Online Multi-Agent Pickup and Delivery

Yue Zhang, Zhe Chen, Daniel Harabor, Pierre Le Bodic, Peter J. Stuckey

Main category: cs.MA

TL;DR: The paper presents a scalable and efficient method for online Multi-Agent Pickup and Delivery (MAPD) by formulating task assignment as a minimum-cost flow problem, improving real-time performance and solution quality.

DetailsMotivation: Existing methods for online MAPD either use simple heuristics (poor decisions) or complex reasoning (limited scalability). The goal is to develop a scalable, real-time solution.

Method: Formulate task assignment as a minimum-cost flow problem over the environment graph, avoiding pairwise distance computations. Introduce congestion-aware edge cost models for better traffic estimates.

Result: The method scales to over 20,000 agents and 30,000 tasks within a 1-second planning time, outperforming baselines in efficiency and quality.

Conclusion: The proposed approach effectively balances scalability and solution quality for real-time MAPD, supporting large-scale applications.

Abstract: We study the problem of online Multi-Agent Pickup and Delivery (MAPD), where a team of agents must repeatedly serve dynamically appearing tasks on a shared map. Existing online methods either rely on simple heuristics, which result in poor decisions, or employ complex reasoning, which suffers from limited scalability under real-time constraints. In this work, we focus on the task assignment subproblem and formulate it as a minimum-cost flow over the environment graph. This eliminates the need for pairwise distance computations and allows agents to be simultaneously assigned to tasks and routed toward them. The resulting flow network also supports efficient guide path extraction to integrate with the planner and accelerates planning under real-time constraints. To improve solution quality, we introduce two congestion-aware edge cost models that incorporate real-time traffic estimates. This approach supports real-time execution and scales to over 20,000 agents and 30,000 tasks within a 1-second planning time, outperforming existing baselines in both computational efficiency and assignment quality.
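
A toy version of flow-based assignment can be written with networkx; the graph, capacities, and costs below are illustrative stand-ins (the paper builds the flow network over the environment graph itself, with congestion-aware edge costs):

```python
import networkx as nx

G = nx.DiGraph()
agents, tasks = ["a1", "a2"], ["t1", "t2"]
for a in agents:
    G.add_edge("S", a, capacity=1, weight=0)
for t in tasks:
    G.add_edge(t, "T", capacity=1, weight=0)
# Edge weights play the role of (congestion-aware) travel-cost estimates.
G.add_edge("a1", "t1", capacity=1, weight=3)
G.add_edge("a1", "t2", capacity=1, weight=9)
G.add_edge("a2", "t1", capacity=1, weight=8)
G.add_edge("a2", "t2", capacity=1, weight=2)
G.nodes["S"]["demand"] = -len(agents)   # supply at the source
G.nodes["T"]["demand"] = len(tasks)     # demand at the sink

flow = nx.min_cost_flow(G)
assignment = {a: t for a in agents for t in tasks if flow[a].get(t, 0) == 1}
print(assignment)   # {'a1': 't1', 'a2': 't2'}
```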

[401] Policy Optimization in Multi-Agent Settings under Partially Observable Environments

Ainur Zhaikhan, Malek Khammassi, Ali H. Sayed

Main category: cs.MA

TL;DR: The paper introduces an adaptive social learning method for estimating global states in MARL, combining social learning and MARL efficiently without two-timescale frameworks.

DetailsMotivation: Existing methods for estimating global states in MARL are time- and computation-intensive due to two-timescale learning frameworks.

Method: The approach alternates between a single step of social learning and a single step of MARL, avoiding the need for two-timescale frameworks.

Result: Theoretical guarantees support the method’s effectiveness, and simulations show performance approaching that of RL with known true states.

Conclusion: The proposed method offers an efficient alternative to traditional MARL approaches, with validated performance.

Abstract: This work leverages adaptive social learning to estimate partially observable global states in multi-agent reinforcement learning (MARL) problems. Unlike existing methods, the proposed approach enables the concurrent operation of social learning and reinforcement learning. Specifically, it alternates between a single step of social learning and a single step of MARL, eliminating the need for the time- and computation-intensive two-timescale learning frameworks. Theoretical guarantees are provided to support the effectiveness of the proposed method. Simulation results verify that the performance of the proposed methodology can approach that of reinforcement learning when the true state is known.
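
The single-timescale alternation can be sketched as below, with a standard log-linear belief-pooling step standing in for the social-learning update; the MARL step is left as a comment since its form depends on the chosen algorithm, and all quantities here are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_states, true_state = 4, 3, 1
# Per-agent observation likelihoods: L[i, s] = p(obs | state s) for agent i.
L = rng.dirichlet(np.ones(n_states), size=(n_agents, n_states))
beliefs = np.full((n_agents, n_states), 1.0 / n_states)
A = np.full((n_agents, n_agents), 1.0 / n_agents)   # combination weights

for step in range(200):
    # One social-learning step: local Bayesian update, then log-linear
    # pooling of neighbors' intermediate beliefs.
    obs = [rng.choice(n_states, p=L[i, true_state]) for i in range(n_agents)]
    psi = beliefs * np.array([L[i, :, obs[i]] for i in range(n_agents)])
    psi /= psi.sum(axis=1, keepdims=True)
    beliefs = np.exp(A @ np.log(psi))
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    # One MARL step would follow here, each agent conditioning its policy
    # update on beliefs[i] instead of the unobserved true state.

print(beliefs.argmax(axis=1))   # beliefs typically concentrate on the true state
```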

cs.MM

Minwoo Oh, Minsu Park, Eunil Park

Main category: cs.MM

TL;DR: A novel pipeline combining Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR) addresses copyright issues in short videos by separating and restoring original soundtracks (OST) from background music (BGM).

DetailsMotivation: To combat copyright infringement in short video platforms where BGMs obscure OSTs, evading originality detection.

Method: Proposes a pipeline integrating MSS and CMVMR, supported by two datasets (OASD-20K and OSVAR-160) for training and evaluation.

Result: Effectively separates and restores OSTs with high accuracy, ensuring content integrity.

Conclusion: Provides an ethical, scalable solution for copyright compliance in user-generated short videos.

Abstract: Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.

eess.AS

[403] NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

Edresson Casanova, Paarth Neekhara, Ryan Langman, Shehzeen Hussain, Subhankar Ghosh, Xuesong Yang, Ante Jukić, Jason Li, Boris Ginsburg

Main category: eess.AS

TL;DR: NanoCodec, a low frame-rate audio codec, improves efficiency in Speech LLMs by reducing autoregressive steps while maintaining high-quality compression.

DetailsMotivation: Existing audio codecs operate at high frame rates, causing slow training and inference for autoregressive models.

Method: Ablation studies on frame rate, bitrate, and causality led to the development of NanoCodec, operating at 12.5 FPS.

Result: NanoCodec outperforms existing codecs across bitrate ranges, setting a new benchmark for efficiency.

Conclusion: NanoCodec enables low-latency, efficient training and inference for Speech LLMs.

Abstract: Large Language Models (LLMs) have significantly advanced audio processing by leveraging audio codecs to discretize audio into tokens, enabling the application of language modeling techniques to speech data. However, existing audio codecs often operate at high frame rates, leading to slow training and inference, particularly for autoregressive models. To address this, there is growing interest in low frame-rate audio codecs, which reduce the number of autoregressive steps required to generate one second of audio. In this paper, we conduct ablation studies to examine the impact of frame rate, bitrate, and causality on codec reconstruction quality. Based on our findings, we introduce NanoCodec, a state-of-the-art audio codec that achieves high-quality compression at just 12.5 frames per second (FPS). NanoCodec outperforms related works across various bitrate ranges, establishing a new benchmark for low-latency and efficient Speech LLM training and inference.

[404] EchoFree: Towards Ultra Lightweight and Efficient Neural Acoustic Echo Cancellation

Xingchen Li, Boyi Kang, Ziqian Wang, Zihan Zhang, Mingshuai Liu, Zhonghua Fu, Lei Xie

Main category: eess.AS

TL;DR: EchoFree is a lightweight neural AEC framework combining linear filtering with a neural post-filter, optimized via SSL, achieving low-latency and high performance with minimal parameters.

DetailsMotivation: Existing neural AEC methods fail to meet real-world low-latency and computational demands while maintaining performance.

Method: Proposes EchoFree, using linear filtering and a neural post-filter on Bark-scale features, optimized via a two-stage SSL strategy.

Result: Outperforms low-complexity AEC models and matches state-of-the-art lightweight models (e.g., DeepVQE-S) with only 278K parameters and 30 MMACs.

Conclusion: EchoFree offers an efficient, high-performance solution for real-world AEC applications.

Abstract: In recent years, neural networks (NNs) have been widely applied in acoustic echo cancellation (AEC). However, existing approaches struggle to meet real-world low-latency and computational requirements while maintaining performance. To address this challenge, we propose EchoFree, an ultra-lightweight neural AEC framework that combines linear filtering with a neural post-filter. Specifically, we design a neural post-filter operating on Bark-scale spectral features. Furthermore, we introduce a two-stage optimization strategy utilizing self-supervised learning (SSL) models to improve model performance. We evaluate our method on the blind test set of the ICASSP 2023 AEC Challenge. The results demonstrate that our model, with only 278K parameters and 30 MMACs computational complexity, outperforms existing low-complexity AEC models and achieves performance comparable to that of the state-of-the-art lightweight model DeepVQE-S. Audio examples are available.

[405] Leveraging LLMs for Scalable Non-intrusive Speech Quality Assessment

Fredrik Cumlin, Xinyu Liang, Anubhab Ghosh, Saikat Chatterjee

Main category: eess.AS

TL;DR: The paper proposes using large language models (LLMs) as pseudo-raters for speech quality assessment (SQA) to overcome data limitations. It introduces LibriAugmented, a dataset labeled by a fine-tuned LLM, and compares training strategies, showing that a two-stage approach (pretraining on LLM labels, then fine-tuning on human labels) improves generalization.

DetailsMotivation: Limited training data and costly human annotations hinder non-intrusive SQA systems' generalization to real-time conferencing calls.

Method: The study constructs LibriAugmented (101,129 speech clips with simulated degradations labeled by Vicuna-7b-v1.5) and compares three training strategies: human-labeled data, LLM-labeled data, and a two-stage approach. DNSMOS Pro and DeePMOS are used for evaluation.

Result: The two-stage approach outperforms others, e.g., DNSMOS Pro achieves 0.63 vs. 0.55 PCC on NISQA_TEST_LIVETALK and 0.73 vs. 0.65 PCC on Tencent with reverb.

Conclusion: LLMs can serve as scalable pseudo-raters for SQA, offering a cost-effective solution to data limitations.

Abstract: Non-intrusive speech quality assessment (SQA) systems suffer from limited training data and costly human annotations, hindering their generalization to real-time conferencing calls. In this work, we propose leveraging large language models (LLMs) as pseudo-raters for speech quality to address these data bottlenecks. We construct LibriAugmented, a dataset consisting of 101,129 speech clips with simulated degradations labeled by a fine-tuned auditory LLM (Vicuna-7b-v1.5). We compare three training strategies: using human-labeled data, using LLM-labeled data, and a two-stage approach (pretraining on LLM labels, then fine-tuning on human labels), using both DNSMOS Pro and DeePMOS. We test on several datasets across languages and quality degradations. While LLM-labeled training yields mixed results compared to human-labeled training, we provide empirical evidence that the two-stage approach improves the generalization performance (e.g., DNSMOS Pro achieves 0.63 vs. 0.55 PCC on NISQA_TEST_LIVETALK and 0.73 vs. 0.65 PCC on Tencent with reverb). Our findings demonstrate the potential of using LLMs as scalable pseudo-raters for speech quality assessment, offering a cost-effective solution to the data limitation problem.
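
The two-stage recipe can be sketched with a stand-in regressor and synthetic batches; the model, features, and learning rates below are placeholders, not the paper's setup.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(
    torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
)
# Stand-in batches: 64-dim clip features with MOS-like labels in [1, 5].
llm_batches = [(torch.randn(8, 64), torch.rand(8) * 4 + 1) for _ in range(100)]
human_batches = [(torch.randn(8, 64), torch.rand(8) * 4 + 1) for _ in range(10)]

def run_stage(batches, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for x, mos in batches:
        loss = F.mse_loss(model(x).squeeze(-1), mos)
        opt.zero_grad(); loss.backward(); opt.step()

run_stage(llm_batches, 1e-4)    # stage 1: pretrain on plentiful LLM pseudo-labels
run_stage(human_batches, 1e-5)  # stage 2: fine-tune on scarcer human labels
```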

[406] Egonoise Resilient Source Localization and Speech Enhancement for Drones Using a Hybrid Model and Learning-Based Approach

Yihsuan Wu, Yukai Chiu, Michael Anthony, Mingsian R. Bai

Main category: eess.AS

TL;DR: A hybrid technique combining Array Signal Processing and Deep Neural Networks enhances speech signals for microphone-embedded drones, outperforming baselines even at -30 dB SNR.

DetailsMotivation: Drones' auditory capabilities are underexplored due to rotor noise. This paper addresses the low SNR problem in drone audition.

Method: A six-microphone uniform circular array uses beamsteering for speaker localization and a GSC-DF2 system for speech enhancement.

Result: The hybrid approach outperforms four baseline methods, validated by the DREGON dataset and measured data.

Conclusion: The proposed technique effectively mitigates drone rotor noise, enabling better auditory performance in search and rescue or military operations.

Abstract: Drones are becoming increasingly important in search and rescue missions, and even military operations. While the majority of drones are equipped with camera vision capabilities, the realm of drone audition remains underexplored due to the inherent challenge of mitigating the egonoise generated by the rotors. In this paper, we present a novel technique to address this extremely low signal-to-noise ratio (SNR) problem encountered by microphone-embedded drones. The technique is implemented using a hybrid approach that combines Array Signal Processing (ASP) and Deep Neural Networks (DNN) to enhance the speech signals captured by a six-microphone uniform circular array mounted on a quadcopter. The system performs localization of the target speaker through beamsteering in conjunction with speech enhancement through a Generalized Sidelobe Canceller-DeepFilterNet 2 (GSC-DF2) system. To validate the system, the DREGON dataset and measured data are employed. Objective evaluations of the proposed hybrid approach demonstrated its superior performance over four baseline methods in SNR conditions as low as -30 dB.

[407] Use Cases for Voice Anonymization

Sarina Meyer, Ngoc Thang Vu

Main category: eess.AS

TL;DR: The paper explores how voice anonymization systems should adapt to specific use cases, proposing a taxonomy and design criteria based on literature and user studies.

DetailsMotivation: Current voice anonymization research lacks clarity on use case-specific requirements, leading to potential mismatches between system design and real-world needs.

Method: The authors conduct a literature analysis and user study to identify use cases and public expectations, then derive a taxonomy and design criteria.

Result: A taxonomy of use cases for voice anonymization is proposed, along with requirements and design criteria for method development.

Conclusion: The paper advocates for more use case-oriented research and development in voice anonymization systems.

Abstract: The performance of a voice anonymization system is typically measured according to its ability to hide the speaker’s identity and keep the data’s utility for downstream tasks. This means that the requirements the anonymization should fulfill depend on the context in which it is used and may differ greatly between use cases. However, these use cases are rarely specified in research papers. In this paper, we study the implications of use case-specific requirements on the design of voice anonymization methods. We perform an extensive literature analysis and user study to collect possible use cases and to understand the expectations of the general public towards such tools. Based on these studies, we propose the first taxonomy of use cases for voice anonymization, and derive a set of requirements and design criteria for method development and evaluation. Using this scheme, we propose to focus more on use case-oriented research and development of voice anonymization systems.

[408] Acoustic Non-Stationarity Objective Assessment with Hard Label Criteria for Supervised Learning Models

Guilherme Zucatelli, Ricardo Barioni, Gabriela Dantas

Main category: eess.AS

TL;DR: The paper introduces a Hard Label Criteria (HLC) algorithm for real-time non-stationarity labeling in acoustic signals, enabling supervised learning. It also proposes NANSA, a network that outperforms existing methods with 99% accuracy while being computationally efficient.

DetailsMotivation: Traditional non-stationarity measures are resource-intensive and impractical for real-time use, necessitating a more efficient solution.

Method: The paper proposes the HLC algorithm for labeling and the NANSA network for assessment, evaluated on acoustic models.

Result: HLC shows acoustic models encode stationarity, and NANSA achieves 99% accuracy, solving computational issues.

Conclusion: The HLC and NANSA provide efficient, accurate solutions for non-stationarity assessment, overcoming traditional limitations.

Abstract: Objective non-stationarity measures are resource intensive and impose critical limitations for real-time processing solutions. In this paper, a novel Hard Label Criteria (HLC) algorithm is proposed to generate a global non-stationarity label for acoustic signals, enabling supervised learning strategies to be trained as stationarity estimators. The HLC is first evaluated on state-of-the-art general-purpose acoustic models, demonstrating that these models encode stationarity information. Furthermore, the first-of-its-kind HLC-based Network for Acoustic Non-Stationarity Assessment (NANSA) is proposed. NANSA models outperform competing approaches, achieving up to 99% classification accuracy, while solving the computational infeasibility of traditional objective measures.

[409] A Self-Attention-Driven Deep Denoiser Model for Real Time Lung Sound Denoising in Noisy Environments

Samiul Based Shuvo, Syed Samiul Alam, Taufiq Hasan

Main category: eess.AS

TL;DR: A deep-learning model (Uformer) is proposed for denoising lung sounds, outperforming existing methods with significant SNR improvements.

DetailsMotivation: Lung sounds are contaminated in real-world settings, and conventional denoising methods fail due to spectral overlap complexities.

Method: Uformer combines CNN and Transformer modules for feature extraction and denoising, validated through ablation studies.

Result: Uformer achieves an average SNR improvement of 16.51 dB (for -12 dB signals) and 19.31 dB (end-to-end), outperforming existing models.

Conclusion: Uformer is robust and generalized for monitoring respiratory conditions, validated by qualitative and quantitative results.

Abstract: Objective: Lung auscultation is a valuable tool in diagnosing and monitoring various respiratory diseases. However, lung sounds (LS) are significantly affected by numerous sources of contamination, especially when recorded in real-world clinical settings. Conventional denoising models prove impractical for LS denoising, primarily owing to spectral overlap complexities arising from diverse noise sources. To address this issue, we propose a specialized deep-learning model (Uformer) for lung sound denoising. Methods: The proposed Uformer model is constituted of three modules: a Convolutional Neural Network (CNN) encoder module, dedicated to extracting latent features; a Transformer encoder module, employed to further enhance the encoding of unique LS features and effectively capture intricate long-range dependencies; and a CNN decoder module, employed to generate the denoised signals. An ablation study was performed to find the optimal architecture. Results: The performance of the proposed Uformer model was evaluated on lung sounds corrupted with different types of synthetic and real-world noise. Lung sound signals with signal-to-noise ratios (SNR) from -12 dB to 15 dB were considered in the testing experiments. The proposed model showed an average SNR improvement of 16.51 dB when evaluated with -12 dB LS signals. Our end-to-end model, with an average SNR improvement of 19.31 dB, outperforms the existing model when evaluated with ambient noise and fewer parameters. Conclusion: Based on the qualitative and quantitative findings in this study, Uformer is robust and generalizable for assisting in the monitoring of respiratory conditions.
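
The three-module layout (CNN encoder, Transformer encoder, CNN decoder) can be sketched for 1-D signal denoising as below; channel sizes and depths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyUformer(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # CNN encoder: extract local latent features, downsampling 4x.
        self.enc = nn.Sequential(
            nn.Conv1d(1, ch, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=7, stride=2, padding=3), nn.ReLU(),
        )
        # Transformer encoder: capture long-range dependencies.
        layer = nn.TransformerEncoderLayer(d_model=ch, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # CNN decoder: reconstruct the denoised waveform.
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(ch, ch, kernel_size=8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(ch, 1, kernel_size=8, stride=2, padding=3),
        )

    def forward(self, x):                      # x: (B, 1, T), noisy lung sound
        z = self.enc(x)                        # local features, (B, ch, T/4)
        z = self.transformer(z.transpose(1, 2)).transpose(1, 2)
        return self.dec(z)                     # denoised signal, (B, 1, T)

print(TinyUformer()(torch.randn(2, 1, 1024)).shape)  # torch.Size([2, 1, 1024])
```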

[410] Post-training for Deepfake Speech Detection

Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi

Main category: eess.AS

TL;DR: A post-training method enhances SSL models for deepfake speech detection, improving robustness and outperforming existing detectors.

DetailsMotivation: Bridging the gap between general pre-training and domain-specific fine-tuning for deepfake speech detection.

Method: Post-training SSL models using a multilingual dataset (56K+ hours genuine, 18K+ hours artifact speech). Further fine-tuned on Deepfake-Eval-2024.

Result: Post-trained models show strong robustness and generalization, surpassing state-of-the-art detectors.

Conclusion: Post-training effectively adapts SSL models for deepfake detection, with models and code made available.

Abstract: We introduce a post-training approach that adapts self-supervised learning (SSL) models for deepfake speech detection by bridging the gap between general pre-training and domain-specific fine-tuning. We present AntiDeepfake models, a series of post-trained models developed using a large-scale multilingual speech dataset containing over 56,000 hours of genuine speech and 18,000 hours of speech with various artifacts in over one hundred languages. Experimental results show that the post-trained models already exhibit strong robustness and generalization to unseen deepfake speech. When they are further fine-tuned on the Deepfake-Eval-2024 dataset, these models consistently surpass existing state-of-the-art detectors that do not leverage post-training. Model checkpoints and source code are available online.

[411] REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie

Main category: eess.AS

TL;DR: REF-VC is a noise-robust expressive voice conversion system addressing challenges of environmental noise and expressive output demands. It outperforms baselines in noisy scenarios and matches performance in clean sets.

DetailsMotivation: Traditional methods struggle with balancing noise robustness and prosody richness. ASR-based methods suppress expressiveness, while SSL-based models suffer from timbre leakage and noise sensitivity.

Method: REF-VC introduces random erasing for SSL features, implicit alignment inspired by E2TTS, and Shortcut Models for faster inference.

Result: REF-VC outperforms Seed-VC in noisy zero-shot scenarios and matches its performance on clean sets. It also supports singing voice conversion.

Conclusion: REF-VC effectively balances noise robustness and expressiveness, offering a versatile solution for voice conversion applications.

Abstract: In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody richness, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust expressive voice conversion system. Key innovations include: (1) A random erasing strategy to mitigate the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) Implicit alignment inspired by E2TTS to suppress non-essential feature reconstruction; (3) Integration of Shortcut Models to accelerate flow matching inference, significantly reducing to 4 steps. Experimental results demonstrate that REF-VC outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while also performing comparably to Seed-VC on the clean set. In addition, REF-VC can be compatible with singing voice conversion within one model.
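
The random erasing strategy on SSL feature sequences can be sketched as zeroing random temporal spans; the span count and lengths here are assumptions, not the paper's settings.

```python
import torch

def random_erase(feats, n_spans=2, max_len=8):
    """feats: (B, T, D) SSL features; returns a copy with random spans zeroed."""
    out = feats.clone()
    B, T, _ = feats.shape
    for b in range(B):
        for _ in range(n_spans):
            span = int(torch.randint(1, max_len + 1, (1,)))
            start = int(torch.randint(0, max(T - span, 1), (1,)))
            out[b, start : start + span] = 0.0   # erase a temporal chunk
    return out

erased = random_erase(torch.randn(4, 100, 768))
```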

eess.IV

[412] Transformer-Based Explainable Deep Learning for Breast Cancer Detection in Mammography: The MammoFormer Framework

Ojonugwa Oluwafemi Ejiga Peter, Daniel Emakporuena, Bamidele Dayo Tunde, Maryam Abdulkarim, Abdullahi Bn Umar

Main category: eess.IV

TL;DR: MammoFormer combines transformers and XAI to improve breast cancer detection, outperforming CNNs with up to 13% better accuracy.

DetailsMotivation: Address limitations of CNNs in mammography, such as inadequate local/contextual processing and lack of explainability for clinical adoption.

Method: Developed MammoFormer, integrating transformers with multi-feature enhancements (e.g., negative transformation, HOG) and XAI. Tested seven architectures (CNNs, ViT, Swin, ConvNext).

Result: Achieved up to 13% performance improvement; ViT reached 98.3% accuracy with AHE, Swin gained 13% with HOG.

Conclusion: MammoFormer overcomes clinical barriers by optimizing transformers, integrating XAI, and combining CNN reliability with global context modeling.

Abstract: Breast cancer detection through mammography interpretation remains difficult because of the minimal nature of abnormalities that experts need to identify alongside the variable interpretations between readers. The potential of CNNs for medical image analysis faces two limitations: they fail to process both local information and wide contextual data adequately, and they do not provide the explainable AI (XAI) operations that doctors need in order to accept them in clinics. The researchers developed the MammoFormer framework, which unites transformer-based architecture with multi-feature enhancement components and XAI functionalities within one framework. Seven different architectures consisting of CNNs, Vision Transformer, Swin Transformer, and ConvNext were tested alongside four enhancement techniques, including original images, negative transformation, adaptive histogram equalization, and histogram of oriented gradients. The MammoFormer framework addresses critical clinical adoption barriers of AI mammography systems through: (1) systematic optimization of transformer architectures via architecture-specific feature enhancement, achieving up to 13% performance improvement, (2) comprehensive explainable AI integration providing multi-perspective diagnostic interpretability, and (3) a clinically deployable ensemble system combining CNN reliability with transformer global context modeling. The combination of transformer models with suitable feature enhancements enables them to achieve results equal to or better than CNN approaches. ViT achieves 98.3% accuracy with AHE, while the Swin Transformer gains a 13.0% advantage through HOG enhancement.
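
For reference, a HOG enhancement of the kind described can be produced with scikit-image; the parameters below are common defaults, not necessarily those used in MammoFormer.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(224, 224)   # stand-in for a grayscale mammogram
features, hog_image = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,                # also return the HOG rendering
)
# `hog_image` can be fed to a model alongside the original view.
```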

[413] Clinically-guided Data Synthesis for Laryngeal Lesion Detection

Chiara Baldini, Kaisar Kushibar, Richard Osuala, Simone Balocco, Oliver Diaz, Karim Lekadir, Leonardo S. Mattos

Main category: eess.IV

TL;DR: The paper introduces a Latent Diffusion Model (LDM) with a ControlNet adapter to generate synthetic laryngeal endoscopic images, addressing data scarcity for CADx/e systems in otorhinolaryngology. Synthetic data improved lesion detection rates and demonstrated realism evaluated by experts.

DetailsMotivation: Current CADx/e systems in otorhinolaryngology face data scarcity and reliance on operator expertise. Biopsy remains the gold standard despite its drawbacks. The study aims to overcome these limitations by generating synthetic, clinically relevant data.

Method: The study uses a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate realistic laryngeal endoscopic image-annotation pairs, guided by clinical observations.

Result: Adding 10% synthetic data improved lesion detection rates by 9% (internal) and 22.1% (external). Five expert otorhinolaryngologists assessed the realism of the generated images.

Conclusion: The approach effectively addresses data scarcity, enhances CADx/e performance, and demonstrates the potential of synthetic data in laryngeal disease diagnosis.

Abstract: Although computer-aided diagnosis (CADx) and detection (CADe) systems have made significant progress in various medical domains, their application is still limited in specialized fields such as otorhinolaryngology. In the latter, current assessment methods heavily depend on operator expertise, and the high heterogeneity of lesions complicates diagnosis, with biopsy persisting as the gold standard despite its substantial costs and risks. A critical bottleneck for specialized endoscopic CADx/e systems is the lack of well-annotated datasets with sufficient variability for real-world generalization. This study introduces a novel approach that exploits a Latent Diffusion Model (LDM) coupled with a ControlNet adapter to generate laryngeal endoscopic image-annotation pairs, guided by clinical observations. The method addresses data scarcity by conditioning the diffusion process to produce realistic, high-quality, and clinically relevant image features that capture diverse anatomical conditions. The proposed approach can be leveraged to expand training datasets for CADx/e models, empowering the assessment process in laryngology. Indeed, during a downstream task of detection, the addition of only 10% synthetic data improved the detection rate of laryngeal lesions by 9% when the model was internally tested and 22.1% on out-of-domain external data. Additionally, the realism of the generated images was evaluated by asking 5 expert otorhinolaryngologists with varying expertise to rate their confidence in distinguishing synthetic from real images. This work has the potential to accelerate the development of automated tools for laryngeal disease diagnosis, offering a solution to data scarcity and demonstrating the applicability of synthetic data in real-world scenarios.

[414] Deep Learning Based Reconstruction Methods for Electrical Impedance Tomography

Alexander Denker, Fabio Margotti, Jianfeng Ning, Kim Knudsen, Derick Nganyu Tanyu, Bangti Jin, Andreas Hauptmann, Peter Maass

Main category: eess.IV

TL;DR: Learned reconstruction methods using deep neural networks outperform model-based techniques in EIT for in-distribution data but struggle with generalization, while hybrid methods offer a balanced solution.

DetailsMotivation: To address the ill-posed nature of the EIT inverse problem and leverage recent advancements in deep learning for improved image reconstruction.

Method: Review and comparison of learned (fully-learned, post-processing, iterative) and model-based (sparsity regularization, Gauss-Newton, level set) methods using simulated and real-world datasets.

Result: Learned methods excel for in-distribution data but lack generalization; hybrid methods balance accuracy and adaptability.

Conclusion: Hybrid approaches combining learned and model-based methods are promising for robust EIT image reconstruction.

Abstract: Electrical Impedance Tomography (EIT) is a powerful imaging modality widely used in medical diagnostics, industrial monitoring, and environmental studies. The EIT inverse problem is about inferring the internal conductivity distribution of the concerned object from the voltage measurements taken on its boundary. This problem is severely ill-posed, and requires advanced computational approaches for accurate and reliable image reconstruction. Recent innovations in both model-based reconstruction and deep learning have driven significant progress in the field. In this review, we explore learned reconstruction methods that employ deep neural networks for solving the EIT inverse problem. The discussion focuses on the complete electrode model, one popular mathematical model for real-world applications of EIT. We compare a wide variety of learned approaches, including fully-learned, post-processing and learned iterative methods, with several conventional model-based reconstruction techniques, e.g., sparsity regularization, regularized Gauss-Newton iteration and level set method. The evaluation is based on three datasets: a simulated dataset of ellipses, an out-of-distribution simulated dataset, and the KIT4 dataset, including real-world measurements. Our results demonstrate that learned methods outperform model-based methods for in-distribution data but face challenges in generalization, where hybrid methods exhibit a good balance of accuracy and adaptability.

[415] Advanced Deep Learning Techniques for Accurate Lung Cancer Detection and Classification

Mobarak Abumohsen, Enrique Costa-Montenegro, Silvia García-Méndez, Amani Yousef Owda, Majdi Owda

Main category: eess.IV

TL;DR: The paper proposes a DenseNet201-based approach for lung cancer detection from CT images, addressing imbalanced data and overfitting with Focal Loss, augmentation, and regularization, achieving 98.95% accuracy.

DetailsMotivation: Lung cancer is a leading cause of death, and current CT-based detection methods suffer from high false positives due to small, imbalanced datasets.

Method: Uses DenseNet201 with Focal Loss, data augmentation, and regularization to handle imbalanced data and prevent overfitting.

Result: Achieves 98.95% accuracy in lung cancer detection and classification.

Conclusion: The proposed method effectively addresses dataset imbalance and overfitting, demonstrating high accuracy for LC detection.

Abstract: Lung cancer (LC) ranks among the most frequently diagnosed cancers and is one of the most common causes of death for men and women worldwide. Computed Tomography (CT) images are the most preferred diagnosis method because of their low cost and faster processing times. Many researchers have proposed various ways of identifying lung cancer using CT images. However, such techniques suffer from significant false positives, leading to low accuracy, fundamentally because they employ small and imbalanced datasets. This paper introduces an innovative approach for LC detection and classification from CT images based on the DenseNet201 model. Our approach comprises several advanced methods such as Focal Loss, data augmentation, and regularization to overcome the imbalanced data issue and the overfitting challenge. The findings demonstrate the effectiveness of the proposal, attaining a promising accuracy of 98.95%.
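
The Focal Loss used to counter class imbalance has a standard form; here is a minimal PyTorch sketch, with gamma and alpha as illustrative defaults rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multiclass focal loss: down-weights easy, well-classified examples."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t
    pt = torch.exp(-ce)                                      # p_t
    return (alpha * (1 - pt) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)))
```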

[416] Multivariate Fields of Experts

Stanislas Ducotterd, Michael Unser

Main category: eess.IV

TL;DR: The paper introduces a new framework, multivariate fields of experts, for learning image priors, outperforming univariate models and achieving near-deep-learning performance with fewer resources.

DetailsMotivation: To generalize existing fields of experts methods and improve image prior learning by incorporating multivariate potential functions.

Method: Uses Moreau envelopes of the ℓ∞-norm to construct multivariate potential functions, applied to inverse problems like denoising and deblurring.

Result: Outperforms univariate models, matches deep-learning performance, and is faster with fewer parameters and data requirements.

Conclusion: The proposed model is effective, efficient, and retains interpretability due to its structured design.

Abstract: We introduce the multivariate fields of experts, a new framework for the learning of image priors. Our model generalizes existing fields of experts methods by incorporating multivariate potential functions constructed via Moreau envelopes of the $\ell_\infty$-norm. We demonstrate the effectiveness of our proposal across a range of inverse problems that include image denoising, deblurring, compressed-sensing magnetic-resonance imaging, and computed tomography. The proposed approach outperforms comparable univariate models and achieves performance close to that of deep-learning-based regularizers while being significantly faster, requiring fewer parameters, and being trained on substantially fewer data. In addition, our model retains a relatively high level of interpretability due to its structured design.
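
For reference, the Moreau envelope used to smooth a potential $f$ (here $f = \lVert\cdot\rVert_\infty$) has the standard definition:

```latex
\operatorname{env}_{\lambda f}(\mathbf{x})
  \;=\; \min_{\mathbf{y}} \; f(\mathbf{y})
  \;+\; \frac{1}{2\lambda}\,\lVert \mathbf{x}-\mathbf{y}\rVert_2^2,
  \qquad \lambda > 0.
```

The envelope is differentiable even when $f$ itself is not, which is what makes it a convenient building block for learnable potentials.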

[417] MESAHA-Net: Multi-Encoders based Self-Adaptive Hard Attention Network with Maximum Intensity Projections for Lung Nodule Segmentation in CT Scan

Muhammad Usman, Azka Rehman, Abd Ur Rehman, Abdullah Shahid, Tariq Mahmood Khan, Imran Razzak, Minyoung Chung, Yeong Gil Shin

Main category: eess.IV

TL;DR: Proposes MESAHA-Net, an efficient end-to-end framework for precise lung nodule segmentation in CT scans, outperforming state-of-the-art methods.

DetailsMotivation: Accurate lung nodule segmentation is vital for early lung cancer diagnosis, but challenges like nodule heterogeneity and size diversity exist.

Method: Uses a multi-encoder-based self-adaptive hard attention network (MESAHA-Net) with three encoding paths, an attention block, and a decoder block.

Result: Outperforms previous methods in segmentation accuracy and computational complexity on the LIDC-IDRI dataset.

Conclusion: MESAHA-Net is robust and suitable for real-time clinical use.

Abstract: Accurate lung nodule segmentation is crucial for early-stage lung cancer diagnosis, as it can substantially enhance patient survival rates. Computed tomography (CT) images are widely employed for early diagnosis in lung nodule analysis. However, the heterogeneity of lung nodules, size diversity, and the complexity of the surrounding environment pose challenges for developing robust nodule segmentation methods. In this study, we propose an efficient end-to-end framework, the multi-encoder-based self-adaptive hard attention network (MESAHA-Net), for precise lung nodule segmentation in CT scans. MESAHA-Net comprises three encoding paths, an attention block, and a decoder block, facilitating the integration of three types of inputs: CT slice patches, forward and backward maximum intensity projection (MIP) images, and region of interest (ROI) masks encompassing the nodule. By employing a novel adaptive hard attention mechanism, MESAHA-Net iteratively performs slice-by-slice 2D segmentation of lung nodules, focusing on the nodule region in each slice to generate 3D volumetric segmentation of lung nodules. The proposed framework has been comprehensively evaluated on the LIDC-IDRI dataset, the largest publicly available dataset for lung nodule segmentation. The results demonstrate that our approach is highly robust for various lung nodule types, outperforming previous state-of-the-art techniques in terms of segmentation accuracy and computational complexity, rendering it suitable for real-time clinical implementation.
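
The forward/backward maximum intensity projection (MIP) inputs can be sketched with NumPy: for a CT sub-volume around the slice being segmented, project the maximum intensity over the preceding and following slices. Depths and shapes below are illustrative.

```python
import numpy as np

volume = np.random.rand(16, 128, 128)   # stand-in CT patch: (slices, H, W)
i = 8                                   # index of the slice being segmented

forward_mip = volume[i + 1 :].max(axis=0)   # slices after the current one
backward_mip = volume[:i].max(axis=0)       # slices before the current one
inputs = np.stack([volume[i], forward_mip, backward_mip])  # (3, H, W)
```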

[418] A dataset of primary nasopharyngeal carcinoma MRI with multi-modalities segmentation

Yin Li, Qi Chen, Kai Wang, Meige Li, Liping Si, Yingwei Guo, Yu Xiong, Qixing Wang, Yang Qin, Ling Xu, Patrick van der Smagt, Jun Tang, Nutan Chen

Main category: eess.IV

TL;DR: A comprehensive NPC MRI dataset is introduced to address the lack of public data, aiding diagnosis, treatment, and machine learning for nasopharyngeal carcinoma.

DetailsMotivation: The lack of publicly available, comprehensive MRI datasets for NPC limits advancements in diagnosis, treatment planning, and ML algorithm development.

Method: The dataset includes MR axial imaging of 277 primary NPC patients, featuring T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences (831 scans), along with clinical data and radiologist-annotated segmentations.

Result: The dataset provides high-quality, manually annotated MRI scans and clinical data for untreated primary NPC cases.

Conclusion: This dataset fills a critical gap, enabling better research and development in NPC diagnosis and treatment.

Abstract: Multi-modality magnetic resonance imaging (MRI) data facilitate the early diagnosis, tumor segmentation, and disease staging in the management of nasopharyngeal carcinoma (NPC). The lack of publicly available, comprehensive datasets limits advancements in diagnosis, treatment planning, and the development of machine learning algorithms for NPC. Addressing this critical need, we introduce the first comprehensive NPC MRI dataset, encompassing MR axial imaging of 277 primary NPC patients. This dataset includes T1-weighted, T2-weighted, and contrast-enhanced T1-weighted sequences, totaling 831 scans. Alongside the corresponding clinical data, segmentations manually annotated and labeled by experienced radiologists provide a high-quality data resource for untreated primary NPC.

[419] MambaEviScrib: Mamba and Evidence-Guided Consistency Enhance CNN Robustness for Scribble-Based Weakly Supervised Ultrasound Image Segmentation

Xiaoxiang Han, Xinyu Li, Jiang Shang, Yiman Liu, Keyan Chen, Shugong Xu, Qiaohong Liu, Qi Zhang

Main category: eess.IV

TL;DR: The paper introduces a weakly supervised learning approach for ultrasound image segmentation, addressing challenges like poor contrast and unclear edges using Dempster-Shafer Theory and a hybrid CNN-Mamba framework.

DetailsMotivation: To reduce annotation costs and improve edge prediction in ultrasound image segmentation, which suffers from poor contrast and unclear edges.

Method: Proposes an Evidence-Guided Consistency strategy using DST and a hybrid CNN-Mamba framework to model global information and long-range dependencies.

Result: Experiments demonstrate competitive segmentation performance; the dataset and code are to be released.

Conclusion: The proposed approach effectively addresses challenges in ultrasound image segmentation and shows promise for practical applications.

Abstract: Segmenting anatomical structures and lesions from ultrasound images contributes to disease assessment. Weakly supervised learning (WSL) based on sparse annotation has achieved encouraging performance and demonstrated the potential to reduce annotation costs. This study attempts to introduce scribble-based WSL into ultrasound image segmentation tasks. However, ultrasound images often suffer from poor contrast and unclear edges, coupled with insufficient supervision signals for edges, posing challenges to edge prediction. Uncertainty modeling has been proven to facilitate models in dealing with these issues. Nevertheless, existing uncertainty estimation paradigms are not robust enough and often filter out predictions near decision boundaries, resulting in unstable edge predictions. Therefore, we propose leveraging predictions near decision boundaries effectively. Specifically, we introduce Dempster-Shafer Theory (DST) of evidence to design an Evidence-Guided Consistency (EGC) strategy. This strategy utilizes high-evidence predictions, which are more likely to occur near high-density regions, to guide the optimization of low-evidence predictions that may appear near decision boundaries. Furthermore, the diverse sizes and locations of lesions in ultrasound images pose a challenge for CNNs with local receptive fields, as they struggle to model global information. Therefore, we introduce Visual Mamba based on structured state space sequence models, which achieves long-range dependency with linear computational complexity, and we construct a novel hybrid CNN-Mamba framework. During training, the CNN branch and the Mamba branch of the proposed framework guide each other based on the EGC strategy. Experiments demonstrate the competitiveness of the proposed method. Dataset and code will be available on https://github.com/GtLinyer/MambaEviScrib.
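
The abstract does not spell out how evidence is computed, but a common subjective-logic reading of Dempster-Shafer theory derives per-pixel belief and uncertainty masses from Dirichlet evidence. The hedged sketch below shows that mapping; the softplus link and the formulas are assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def dst_belief_and_uncertainty(logits):
    """Map per-pixel logits of shape (K, H, W) to per-class belief masses
    and a vacuous uncertainty mass (subjective-logic form of DST).
    Predictions with high total evidence are the "high-evidence" ones an
    EGC-style strategy would use to guide low-evidence ones."""
    evidence = np.logaddexp(0.0, logits)         # softplus, evidence >= 0
    alpha = evidence + 1.0                       # Dirichlet parameters
    strength = alpha.sum(axis=0, keepdims=True)  # Dirichlet strength S
    belief = evidence / strength                 # per-class belief mass
    uncertainty = logits.shape[0] / strength[0]  # vacuous mass u = K / S
    return belief, uncertainty
```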

[420] CDI: Blind Image Restoration Fidelity Evaluation based on Consistency with Degraded Image

Xiaojun Tang, Jingru Wang, Guangwei Huang, Guannan Chen, Rui Zheng, Lian Huai, Yuyu Liu, Xingqun Jiang

Main category: eess.IV

TL;DR: The paper proposes a new Image Quality Assessment (IQA) system for Blind Image Restoration (BIR) that evaluates fidelity by calculating Consistency with Degraded Image (CDI), outperforming traditional Full-Reference IQA methods.

DetailsMotivation: Existing Full-Reference IQA methods poorly rate BIR images despite high perceptual quality, highlighting the need for a specialized BIR IQA system.

Method: The authors introduce wavelet domain Reference Guided CDI and Reference Agnostic CDI algorithms to assess fidelity without requiring degradation parameters or reference images.

Result: Experiments on the new DISDCD dataset show CDI’s superiority over traditional IQA methods for BIR fidelity evaluation.

Conclusion: The proposed CDI-based IQA system effectively addresses BIR’s unique challenges, with plans to release the source code and DISDCD dataset.

Abstract: Recent advancements in Blind Image Restoration (BIR) methods, based on Generative Adversarial Networks and Diffusion Models, have significantly improved visual quality. However, they present significant challenges for Image Quality Assessment (IQA), as the existing Full-Reference IQA methods often rate images with high perceptual quality poorly. In this paper, we reassess the Solution Non-Uniqueness and Degradation Indeterminacy issues of BIR, and propose constructing a specific BIR IQA system. Instead of directly comparing a restored image with a reference image, the BIR IQA evaluates fidelity by calculating the Consistency with Degraded Image (CDI). Specifically, we propose a wavelet domain Reference Guided CDI algorithm, which measures consistency with a degraded image across various degradation types without requiring knowledge of the degradation parameters. Supported degradation types include down-sampling, blur, noise, JPEG compression, and complex combined degradations. In addition, we propose a Reference Agnostic CDI, enabling BIR fidelity evaluation without reference images. Finally, in order to validate the rationality of CDI, we create a new Degraded Images Switch Display Comparison Dataset (DISDCD) for subjective evaluation of BIR fidelity. Experiments conducted on DISDCD verify that CDI is markedly superior to common Full Reference IQA methods for BIR fidelity evaluation. The source code and the DISDCD dataset will be publicly available shortly.
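
As a very loose illustration of the core intuition (not the paper's Reference Guided CDI algorithm): typical degradations mostly preserve low-frequency content, so a crude fidelity probe can compare the coarse wavelet approximation subbands of the restored and degraded images. The `haar` wavelet, decomposition level, and same-shape assumption below are all illustrative:

```python
import numpy as np
import pywt

def toy_wavelet_consistency(restored, degraded, wavelet="haar", level=2):
    """Compare the coarse approximation subbands of a multilevel 2D wavelet
    decomposition of the two images. Assumes both images share the same
    shape; a lower score suggests higher consistency with the degraded
    observation."""
    ca_restored = pywt.wavedec2(restored, wavelet, level=level)[0]
    ca_degraded = pywt.wavedec2(degraded, wavelet, level=level)[0]
    return float(np.mean((ca_restored - ca_degraded) ** 2))
```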

[421] POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D Reconstruction

Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, Chunhua Shen

Main category: eess.IV

TL;DR: POMATO is a unified framework for dynamic 3D reconstruction, combining pointmap matching with temporal motion to improve geometry estimation and motion understanding in dynamic scenes.

DetailsMotivation: Existing methods like DUSt3R struggle with ambiguous matching in dynamic regions, limiting performance. POMATO aims to unify geometry estimation and matching while addressing these challenges.

Method: POMATO learns explicit matching relationships by mapping RGB pixels to 3D pointmaps and introduces a temporal motion module for dynamic motions, ensuring scale consistency and enhancing performance.

Result: The framework demonstrates remarkable performance in video depth estimation, 3D point tracking, and pose estimation.

Conclusion: POMATO effectively unifies pointmap matching and temporal motion, advancing dynamic 3D reconstruction and motion understanding.

Abstract: 3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
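
To make the pointmap-matching idea concrete, the toy sketch below matches pixels across two views through nearest neighbors between their 3D pointmaps in a shared coordinate system; POMATO learns this relationship end to end, and the distance threshold here is an assumed parameter:

```python
import numpy as np
from scipy.spatial import cKDTree

def match_pointmaps(pm_a, pm_b, max_dist=0.05):
    """Toy pixel matching via pointmaps (not POMATO's learned matcher):
    given per-pixel 3D pointmaps of shape (H, W, 3) in a shared coordinate
    system, pixels whose 3D points nearly coincide are declared
    correspondences. `max_dist` is an assumed threshold in scene units."""
    w = pm_b.shape[1]
    tree = cKDTree(pm_b.reshape(-1, 3))
    dist, idx = tree.query(pm_a.reshape(-1, 3))   # nearest point in view b
    rows, cols = np.divmod(idx, w)                # back to pixel coords
    shape = pm_a.shape[:2]
    valid = (dist < max_dist).reshape(shape)      # keep reliable matches
    return valid, rows.reshape(shape), cols.reshape(shape)
```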

[422] Edge2Prompt: Modality-Agnostic Model for Out-of-Distribution Liver Segmentation

Nathan Hollet, Oumeymah Cherkaoui, Philippe C. Cattin, Sidaty El hadramy

Main category: eess.IV

TL;DR: Edge2Prompt is a modality-agnostic liver segmentation pipeline combining edge detection and foundation models, achieving strong performance in data-scarce and OOD scenarios.

DetailsMotivation: Liver segmentation is crucial for clinical workflows but faces challenges like modality-specific tools and data scarcity.

Method: Integrates edge detection with U-Net and SAM-2 to generate prompts for 2D segmentation, reconstructable into 3D volumes.

Result: Outperforms classical methods in data-scarce scenarios and achieves 86.4% Dice Score on OOD tasks, surpassing baselines.

Conclusion: Edge2Prompt bridges classical and foundation models for adaptable, data-efficient liver segmentation.

Abstract: Liver segmentation is essential for preoperative planning in interventions like tumor resection or transplantation, but implementation in clinical workflows faces challenges due to modality-specific tools and data scarcity. We propose Edge2Prompt, a novel pipeline for modality-agnostic liver segmentation that generalizes to out-of-distribution (OOD) data. Our method integrates classical edge detection with foundation models. Modality-agnostic edge maps are first extracted from input images, then processed by a U-Net to generate logit-based prompts. These prompts condition the Segment Anything Model 2 (SAM-2) to generate 2D liver segmentations, which can then be reconstructed into 3D volumes. Evaluated on the multi-modal CHAOS dataset, Edge2Prompt achieves competitive results compared to classical segmentation methods when trained and tested in-distribution (ID), and outperforms them in data-scarce scenarios due to the SAM-2 module. Furthermore, it achieves a mean Dice Score of 86.4% on OOD tasks, outperforming U-Net baselines by 27.4% and other self-prompting methods by 9.1%, demonstrating its effectiveness. This work bridges classical and foundation models for clinically adaptable, data-efficient segmentation.
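
A minimal sketch of the first stage, assuming a simple Sobel detector (the exact edge detector is not specified here): an input slice of any modality is reduced to a normalized edge map, which a U-Net can then turn into logit prompts for SAM-2:

```python
import numpy as np
from scipy import ndimage

def modality_agnostic_edge_map(image):
    """Reduce a 2D slice of any modality (CT, MRI, ...) to an edge map in
    [0, 1] via Sobel gradient magnitude."""
    img = image.astype(np.float64)
    img = (img - img.min()) / (np.ptp(img) + 1e-8)   # intensity-normalize
    gx = ndimage.sobel(img, axis=0)
    gy = ndimage.sobel(img, axis=1)
    mag = np.hypot(gx, gy)                           # gradient magnitude
    return mag / (mag.max() + 1e-8)
```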

[423] Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy

Shuo Chen, Yijin Li, Xi Zheng, Guofeng Zhang

Main category: eess.IV

TL;DR: NFH-SEM is a neural field-based hybrid method for 3D SEM reconstruction, eliminating manual calibration and shadow errors, validated on diverse samples.

DetailsMotivation: Conventional 2D SEM images lack 3D topography, and existing methods struggle with complex microstructures due to discrete representations, calibration needs, and shadow errors.

Method: NFH-SEM uses multi-view, multi-detector 2D SEM images, fusing geometric and photometric data into a continuous neural field, with end-to-end self-calibration and shadow disentanglement.

Result: High-fidelity reconstructions of complex samples like two-photon lithography microstructures, peach pollen, and silicon carbide surfaces.

Conclusion: NFH-SEM enables accurate, calibration-free 3D reconstruction of intricate microstructures, demonstrating broad applicability.

Abstract: The scanning electron microscope (SEM) is a widely used imaging device in scientific research and industrial applications. Conventional two-dimensional (2D) SEM images do not directly reveal the three-dimensional (3D) topography of micro samples, motivating the development of SEM 3D surface reconstruction methods. However, reconstruction of complex microstructures remains challenging for existing methods due to the limitations of discrete 3D representations, the need for calibration with reference samples, and shadow-induced gradient errors. Here, we introduce NFH-SEM, a neural field-based hybrid SEM 3D reconstruction method that takes multi-view, multi-detector 2D SEM images as input and fuses geometric and photometric information into a continuous neural field representation. NFH-SEM eliminates the manual calibration procedures through end-to-end self-calibration and automatically disentangles shadows from SEM images during training, enabling accurate reconstruction of intricate microstructures. We validate the effectiveness of NFH-SEM on real and simulated datasets. Our experiments show high-fidelity reconstructions of diverse, challenging samples, including two-photon lithography microstructures, peach pollen, and silicon carbide particle surfaces, demonstrating precise detail and broad applicability.
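
For readers unfamiliar with neural fields, the sketch below shows only the bare representational idea in PyTorch: a coordinate MLP encoding a continuous surface as a height field. NFH-SEM's actual field, self-calibration, and shadow disentanglement are far richer; this is an illustrative skeleton, not the paper's model:

```python
import torch
import torch.nn as nn

class HeightField(nn.Module):
    """Bare-bones neural field: an MLP mapping a 2D position to a
    continuous surface height. In an SEM setting, such a field could be fit
    by rendering shading from predicted heights and comparing against
    multi-detector intensities; that training loop is omitted here."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xy):                  # xy: (N, 2) in [-1, 1]^2
        return self.net(xy).squeeze(-1)     # (N,) continuous heights
```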
