Daily arXiv Papers - 2025-12-10

AI-enhanced summaries of research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Short-Context Dominance: How Much Local Context Natural Language Actually Needs?

Vala Vakilian, Zimeng Wang, Ankit Singh Rawat, Christos Thrampoulidis

Main category: cs.CL

TL;DR: The paper investigates the short-context dominance hypothesis, finding that 75-80% of long sequences need only the last 96 tokens for accurate next-token prediction. It introduces DaMCL to detect challenging long-context sequences and develops a decoding algorithm to mitigate bias from short-context dominance.

Motivation: To test whether most sequences can be accurately predicted using only a small local context (short-context dominance hypothesis), and to address the resulting bias in LLM predictions that favors short-range dependencies over long-range ones.

Method: 1) Measure Minimum Context Length (MCL) needed for accurate full-context predictions using LLMs as statistical oracles; 2) Introduce Distributionally Aware MCL (DaMCL) as a practical proxy that doesn’t require next-token knowledge; 3) Develop a decoding algorithm that uses DaMCL detection to boost long-range-relevant tokens.
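
As a rough illustration of the MCL measurement in step 1, the sketch below (not the authors' code) estimates a per-sequence minimum context length with an off-the-shelf causal LM as the statistical oracle. The model name, candidate suffix lengths, greedy-match criterion, and the tiny example input are placeholder assumptions; the distribution-aware DaMCL scoring is omitted.

```python
# Minimal sketch: estimate the Minimum Context Length (MCL) of one sequence by
# comparing truncated-context predictions with the full-context prediction.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")                 # placeholder oracle model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def minimum_context_length(text, candidate_lengths=(16, 32, 64, 96, 128)):
    ids = tok(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        full_pred = model(ids.unsqueeze(0)).logits[0, -1].argmax().item()
        for k in candidate_lengths:                         # shortest suffix first
            short_pred = model(ids[-k:].unsqueeze(0)).logits[0, -1].argmax().item()
            if short_pred == full_pred:                     # short context already suffices
                return k
    return len(ids)                                         # needs (close to) the full context

print(minimum_context_length("The capital of France is"))
```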

Result: For sequences of 1-7k tokens, 75-80% require only the last 96 tokens at most. DaMCL with simple thresholding effectively detects long vs. short context sequences. The proposed decoding algorithm improves performance across Q&A tasks and model architectures by mitigating short-context bias.

Conclusion: Short-context dominance is prevalent in long sequences, but can be detected and mitigated. The proposed DaMCL metric and decoding algorithm successfully identify challenging long-context sequences and improve LLM performance by reducing bias toward short-range dependencies.

Abstract: We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.

[2] Adaptation of Embedding Models to Financial Filings via LLM Distillation

Eliot Brenner, Dominic Seyler, Manjunath Hegde, Andrei Simion, Koustuv Dasgupta, Bing Xiang

Main category: cs.CL

TL;DR: A scalable pipeline trains specialized retrieval models from an unlabeled financial corpus using iterative student-teacher training, achieving significant improvements in financial document retrieval metrics.

Motivation: Specialized conversational AI agents face constraints from computation costs, latency, and the need for domain-specific relevance. General embedding models underperform in specialized domains like finance, requiring better retrieval models without expensive human annotation.

Method: Uses iterative student-teacher training: starts with general retrieval model as foundation, uses LLM-judged relevance to distill domain knowledge into compact retriever. Interleaves retrieval-based mining of hard positive/negative examples from unlabeled corpus with iterative retraining. Each iteration uses refined student model to mine progressively harder training examples.
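
A schematic of the interleaved mining/retraining loop, as I read the description above; `retrieve`, `llm_judge_relevance`, and `train_student` are hypothetical stand-ins for the student retriever, the LLM teacher, and the fine-tuning step, and the corpus is a toy list.

```python
# Schematic sketch of the iterative student-teacher distillation loop (not the
# authors' code): mine candidates with the current student, label them with the
# LLM teacher, retrain, and repeat so the mined examples get progressively harder.
import random

corpus = ["passage a ...", "passage b ...", "passage c ..."]
queries = ["query 1", "query 2"]

def llm_judge_relevance(query, passage): return random.random()     # teacher signal (stub)
def retrieve(query, student, k=2): return random.sample(corpus, k)  # student retrieval (stub)
def train_student(student, triples): return student                 # weight update (stub)

student = "general-purpose retriever checkpoint"
for iteration in range(3):                                           # iterative retraining
    triples = []
    for q in queries:
        for p in retrieve(q, student):                               # mine with current student
            label = llm_judge_relevance(q, p)                        # LLM-judged relevance
            triples.append((q, p, label >= 0.5))                     # hard positives / negatives
    student = train_student(student, triples)                        # distill into the retriever
```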

Result: Achieved 27.7% improvement in MRR@5, 44.6% improvement in mean DCG@5 across 14 financial filing types (21,800 query-document pairs), and improved NDCG on 3 of 4 document classes in FinanceBench.

Conclusion: Provides cost-effective solution to bridge gap between general-purpose models and specialized domains without labor-intensive human annotation, enabling practical deployment of specialized conversational AI agents in finance.

Abstract: Despite advances in generative large language models (LLMs), practical application of specialized conversational AI agents remains constrained by computation costs, latency requirements, and the need for precise domain-specific relevance measures. While existing embedding models address the first two constraints, they underperform on information retrieval in specialized domains like finance. This paper introduces a scalable pipeline that trains specialized models from an unlabeled corpus using a general purpose retrieval embedding model as foundation. Our method yields an average of 27.7% improvement in MRR$\texttt{@}$5, 44.6% improvement in mean DCG$\texttt{@}$5 across 14 financial filing types measured over 21,800 query-document pairs, and improved NDCG on 3 of 4 document classes in FinanceBench. We adapt retrieval embeddings (bi-encoder) for RAG, not LLM generators, using LLM-judged relevance to distill domain knowledge into a compact retriever. There are prior works which pair synthetically generated queries with real passages to directly fine-tune the retrieval model. Our pipeline differs from these by introducing interaction between student and teacher models that interleaves retrieval-based mining of hard positive/negative examples from the unlabeled corpus with iterative retraining of the student model’s weights using these examples. Each retrieval iteration uses the refined student model to mine the corpus for progressively harder training examples for the subsequent training iteration. The methodology provides a cost-effective solution to bridging the gap between general-purpose models and specialized domains without requiring labor-intensive human annotation.

[3] Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

Main category: cs.CL

TL;DR: SEA is a universal framework for aligning subtitles to sign language videos across multiple languages using pretrained models for segmentation and embedding, followed by efficient dynamic programming alignment.

Motivation: Prior approaches for subtitle alignment in sign language videos are limited by being language/dataset-specific end-to-end systems, lacking generality. There's a need for a universal method that works across multiple languages and domains.

Method: Segment, Embed, and Align (SEA) framework: 1) Segment video frames into individual signs using pretrained model, 2) Embed each sign video clip into shared latent space with text using another pretrained model, 3) Perform alignment with lightweight dynamic programming procedure that runs efficiently on CPUs.
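
The lightweight alignment step can be pictured as a monotonic dynamic program over sign-clip/subtitle embedding similarities. The sketch below is a generic DTW-style recurrence under that reading, not SEA's exact procedure; the random embeddings are placeholders.

```python
# Illustrative monotonic DP: assign each segmented sign clip to a subtitle so that
# subtitle indices never move backwards, maximizing total embedding similarity.
import numpy as np

def align(sign_emb, sub_emb):
    """sign_emb: (S, d), sub_emb: (T, d); returns one subtitle index per sign."""
    sim = sign_emb @ sub_emb.T                        # cosine similarity if rows are unit-norm
    S, T = sim.shape
    dp = np.full((S, T), -np.inf)
    back = np.zeros((S, T), dtype=int)
    dp[0] = sim[0]
    for i in range(1, S):
        for j in range(T):
            prev = dp[i - 1, : j + 1]                 # subtitles may only move forward
            back[i, j] = prev.argmax()
            dp[i, j] = sim[i, j] + prev.max()
    path = [int(dp[-1].argmax())]
    for i in range(S - 1, 0, -1):                     # backtrace the best path
        path.append(int(back[i, path[-1]]))
    return path[::-1]

signs, subs = np.random.randn(5, 8), np.random.randn(3, 8)
print(align(signs, subs))
```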

Result: State-of-the-art alignment performance on four sign language datasets. The method works efficiently (within a minute for hour-long episodes) and can adapt to various scenarios from small lexicons to large continuous corpora.

Conclusion: SEA provides a flexible, universal framework for subtitle alignment in sign language videos that outperforms prior methods and has potential to generate high-quality parallel data for advancing sign language processing. Code and models are openly available.

Abstract: The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA’s code and models are openly available.

[4] Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

Sampriti Soor, Suklav Ghosh, Arijit Sur

Main category: cs.CL

TL;DR: Universal adversarial suffixes (4-10 tokens) that reduce accuracy across tasks/models when appended to inputs, trained via Gumbel-Softmax relaxation with entropy regularization.

Motivation: Current adversarial prompts are task- or model-specific, making results hard to compare and limiting transferability. Need universal adversarial suffixes that work broadly across tasks and models.

Method: Learn universal adversarial suffixes using Gumbel-Softmax relaxation to make optimization differentiable, then discretize for inference. Training maximizes calibrated cross-entropy on label region while masking gold tokens, with entropy regularization to prevent collapse.
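
A stripped-down sketch of the optimization loop, assuming a small causal LM (gpt2 as a stand-in), a single toy example, and neither calibration, gold-token masking, nor entropy regularization; it only illustrates how the Gumbel-Softmax relaxation keeps the suffix differentiable before discretization.

```python
# Hedged sketch: learn a soft adversarial suffix with Gumbel-Softmax, then discretize.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")                       # placeholder victim model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
for p in model.parameters():
    p.requires_grad_(False)

emb = model.get_input_embeddings().weight                         # (V, d)
suffix_len = 6
logits = torch.nn.Parameter(torch.zeros(suffix_len, emb.shape[0]))
opt = torch.optim.Adam([logits], lr=0.1)

prompt = tok("Review: great movie. Sentiment:", return_tensors="pt").input_ids
gold = tok(" positive", add_special_tokens=False).input_ids[0]    # label-region token

for step in range(20):
    soft = F.gumbel_softmax(logits, tau=1.0, hard=False)           # soft one-hot suffix
    suffix_emb = soft @ emb                                        # differentiable embeddings
    inputs = torch.cat([model.get_input_embeddings()(prompt),
                        suffix_emb.unsqueeze(0)], dim=1)
    out = model(inputs_embeds=inputs).logits[0, -1]
    loss = -F.cross_entropy(out.unsqueeze(0), torch.tensor([gold]))  # maximize CE on the label
    opt.zero_grad(); loss.backward(); opt.step()

adversarial_suffix = tok.decode(logits.argmax(dim=-1).tolist())    # discretize for inference
```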

Result: Single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence across sentiment analysis, NLI, paraphrase detection, commonsense QA, and physical reasoning tasks with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B.

Conclusion: Universal adversarial suffixes provide effective and transferable attacks across diverse tasks and model families, demonstrating fragility of LM-based classifiers to carefully crafted universal perturbations.

Abstract: Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable “soft” form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.

[5] Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward

Sampriti Soor, Suklav Ghosh, Arijit Sur

Main category: cs.CL

TL;DR: RL-trained adversarial suffixes outperform gradient/rule-based methods in degrading model accuracy and transferring across tasks/models.

Motivation: Existing adversarial suffix methods are brittle, task-specific, and model-specific. Need more robust, transferable adversarial attacks that work across diverse NLP tasks and models.

Method: Reinforcement learning framework where suffix is treated as a policy, trained with Proximal Policy Optimization (PPO) against frozen model as reward oracle. Rewards shaped using calibrated cross-entropy to remove label bias and aggregate across surface forms for better transferability.

Result: RL-trained suffixes consistently degrade accuracy across 5 diverse NLP benchmarks (sentiment, NLI, paraphrase, commonsense reasoning) using 3 distinct language models (Qwen2-1.5B, TinyLlama-1.1B, Phi-1.5). Outperform previous adversarial triggers in cross-task and cross-model transferability.

Conclusion: RL-based approach produces more effective and transferable adversarial suffixes than gradient/rule-based methods, demonstrating robustness across diverse NLP tasks and models.

Abstract: Language models are vulnerable to short adversarial suffixes that can reliably alter predictions. Previous works usually find such suffixes with gradient search or rule-based methods, but these are brittle and often tied to a single task or model. In this paper, a reinforcement learning framework is used where the suffix is treated as a policy and trained with Proximal Policy Optimization against a frozen model as a reward oracle. Rewards are shaped using calibrated cross-entropy, removing label bias and aggregating across surface forms to improve transferability. The proposed method is evaluated on five diverse NLP benchmark datasets, covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three distinct language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer more effectively across tasks and models than previous adversarial triggers of similar genres.

[6] ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access

Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao, Lang Li, Ping Zhang, Sachin Kumar

Main category: cs.CL

TL;DR: ClinicalTrialsHub is an interactive platform that combines ClinicalTrials.gov data with automatically extracted structured information from PubMed articles, increasing structured trial data access by 83.8% using LLMs like GPT-5.1 and Gemini-3-Pro.

Motivation: To improve access to structured clinical trial data for patients, clinicians, researchers, and policymakers by augmenting ClinicalTrials.gov data with information extracted from PubMed research articles, advancing evidence-based medicine.

Method: Uses large language models (GPT-5.1, Gemini-3-Pro) to parse full-text PubMed articles and extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system with evidence-grounded answers linked to source sentences.

Result: Increases access to structured clinical trial data by 83.8% compared to ClinicalTrials.gov alone, demonstrated through user studies with clinicians, researchers, and systematic automatic evaluation of information extraction and QA capabilities.

Conclusion: ClinicalTrialsHub effectively enhances clinical trial data accessibility and has potential to significantly benefit various stakeholders in healthcare and research by providing comprehensive, structured trial information with evidence attribution.

Abstract: We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.

[7] Are generative AI text annotations systematically biased?

Sjoerd B. Stolwijk, Mark Boukes, Damian Trilling

Main category: cs.CL

TL;DR: GLLMs show adequate F1 scores for annotation tasks but exhibit systematic bias - differing from human prevalence estimates, producing different downstream results, and clustering together more than with human annotations.

Motivation: To investigate bias in GLLM annotations by conceptually replicating Boukes (2024)'s manual annotation study, examining whether GLLMs can reliably replace human annotators for content analysis tasks.

Method: Conceptual replication using multiple GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) with five different prompts to annotate five concepts: political content, interactivity, rationality, incivility, and ideology.

Result: GLLMs achieved adequate F1 scores but showed systematic bias: they differed from manual annotations in prevalence estimates, produced substantively different downstream results, and overlapped more with each other than with human annotations. F1 scores alone failed to capture the degree of bias.

Conclusion: While GLLMs show promise for annotation tasks, they exhibit systematic biases that differ from human judgment patterns, suggesting caution in using them as direct replacements for human annotators without accounting for these biases.

Abstract: This paper investigates bias in GLLM annotations by conceptually replicating manual annotations of Boukes (2024). Using various GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find GLLMs perform adequate in terms of F1 scores, but differ from manual annotations in terms of prevalence, yield substantively different downstream results, and display systematic bias in that they overlap more with each other than with manual annotations. Differences in F1 scores fail to account for the degree of bias.

[8] What Triggers my Model? Contrastive Explanations Inform Gender Choices by Translation Models

Janiça Hackenbuchner, Arda Tezcan, Joke Daems

Main category: cs.CL

TL;DR: The paper explores gender bias origins in machine translation models using contrastive explanations and saliency attribution on gender-ambiguous source data, comparing model decisions with human perceptions.

Motivation: Current interpretability research has been limited in addressing gender bias in machine translation and large language models. The authors aim to move beyond simply measuring bias to understanding its origins by examining which contextual source tokens influence gender inflection choices in translations.

Method: Using gender-ambiguous natural source data, the study employs contrastive explanations and computes saliency attribution to analyze source word influence on gender decisions. The research addresses the challenge of lacking scoring thresholds and examines different attribution levels of source words on model gender decisions, comparing these with human perceptions.
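
To make the contrastive-explanation idea concrete, here is a generic gradient-times-input attribution of a model's preference for one gendered target token over a foil; a small causal LM stands in for the translation model, and this particular attribution variant is my choice for illustration, not necessarily the one used in the paper.

```python
# Hedged sketch: attribute the preference for "target" over "foil" back to input tokens.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in for the translation model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def contrastive_saliency(prompt, target, foil):
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]
    t = tok(target, add_special_tokens=False).input_ids[0]
    f = tok(foil, add_special_tokens=False).input_ids[0]
    (logits[t] - logits[f]).backward()                    # contrastive objective
    scores = (embeds.grad * embeds).sum(-1)[0]            # gradient x input, per source token
    return list(zip(tok.convert_ids_to_tokens(ids[0].tolist()), scores.tolist()))

for token, s in contrastive_saliency("The nurse said that", " she", " he"):
    print(f"{token:>12s} {s:+.3f}")
```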

Result: The study demonstrates a noticeable overlap between human perceptions of gender and model attribution patterns. The authors also provide a linguistic analysis of salient words that influence gender decisions in translations.

Conclusion: The work showcases the importance of understanding model translation decisions regarding gender, how these compare to human decisions, and that this information should be leveraged to mitigate gender bias in machine translation systems.

Abstract: Interpretability can be implemented as a means to understand decisions taken by (black box) models, such as machine translation (MT) or large language models (LLMs). Yet, research in this area has been limited in relation to a manifested problem in these models: gender bias. With this research, we aim to move away from simply measuring bias to exploring its origins. Working with gender-ambiguous natural source data, this study examines which context, in the form of input tokens in the source sentence, influences (or triggers) the translation model choice of a certain gender inflection in the target language. To analyse this, we use contrastive explanations and compute saliency attribution. We first address the challenge of a lacking scoring threshold and specifically examine different attribution levels of source words on the model gender decisions in the translation. We compare salient source words with human perceptions of gender and demonstrate a noticeable overlap between human perceptions and model attribution. Additionally, we provide a linguistic analysis of salient words. Our work showcases the relevance of understanding model translation decisions in terms of gender, how this compares to human decisions and that this information should be leveraged to mitigate gender bias.

[9] Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models

Ju-Young Kim, Ji-Hong Park, Se-Yeon Lee, Sujin Park, Gun-Woo Kim

Main category: cs.CL

TL;DR: This paper proposes a soft inductive bias approach with explicit reasoning perspectives to improve inappropriate utterance detection in Korean conversational texts, achieving 87% accuracy with a 3.89% improvement over standard supervised learning.

Motivation: Unchecked inappropriate remarks in anonymous online spaces escalate into verbal abuse and criminal behavior, raising growing social concerns. There's a need for better detection techniques to build safer communication environments, especially in Korean contexts where research applying large language models and chain-of-thought reasoning remains limited.

Method: The authors propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process. They fine-tune a Korean large language model (Kanana-1.5) using this method, which promotes rational decision-making and prevents reasoning errors by constraining the reasoning perspectives.

Result: Experimental results show the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89% over standard supervised learning. Both quantitative performance comparisons and qualitative evaluations across different training strategies demonstrate the effectiveness of the proposed approach.

Conclusion: The proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives. This demonstrates the effectiveness of the soft inductive bias approach for inappropriate utterance detection in Korean conversational texts.

Abstract: Recent incidents in certain online games and communities, where anonymity is guaranteed, show that unchecked inappropriate remarks frequently escalate into verbal abuse and even criminal behavior, raising significant social concerns. Consequently, there is a growing need for research on techniques that can detect inappropriate utterances within conversational texts to help build a safer communication environment. Although large-scale language models trained on Korean corpora and chain-of-thought reasoning have recently gained attention, research applying these approaches to inappropriate utterance detection remains limited. In this study, we propose a soft inductive bias approach that explicitly defines reasoning perspectives to guide the inference process, thereby promoting rational decision-making and preventing errors that may arise during reasoning. We fine-tune a Korean large language model using the proposed method and conduct both quantitative performance comparisons and qualitative evaluations across different training strategies. Experimental results show that the Kanana-1.5 model achieves an average accuracy of 87.0046, improving by approximately 3.89 percent over standard supervised learning. These findings indicate that the proposed method goes beyond simple knowledge imitation by large language models and enables more precise and consistent judgments through constrained reasoning perspectives, demonstrating its effectiveness for inappropriate utterance detection.

[10] Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks

Indrajit Kar, Kalathur Chenchu Kishore Kumar

Main category: cs.CL

TL;DR: Hierarchical multi-agent system with 64x64 grid of lightweight agents uses spatial curriculum learning and confidence-based selection to improve long-horizon reasoning tasks like Tower of Hanoi.

Motivation: Current LLMs and multi-agent systems struggle with long-horizon reasoning tasks and have high computational costs. There's a need for more efficient distributed reasoning approaches that can handle complex, spatially-grounded tasks.

Method: Proposes hierarchical multi-agent architecture with 64x64 grid of lightweight agents supported by selective oracle. Uses spatial curriculum that expands from center outward, NLL confidence measure, and Thompson Sampling curriculum manager to adaptively select training zones based on competence and reward signals.
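
The curriculum manager can be sketched as a Bernoulli Thompson Sampling bandit over grid zones. The zone names, success probabilities, and the reduction of the competence/NLL-driven reward to a 0/1 signal below are illustrative assumptions, not values from the paper.

```python
# Toy Thompson Sampling curriculum manager: sample a zone from Beta posteriors,
# train there, and update the posterior with the observed success signal.
import random

zones = ["center", "ring-1", "ring-2"]                   # easy -> hard regions of the grid
alpha = {z: 1.0 for z in zones}                          # Beta prior: successes
beta = {z: 1.0 for z in zones}                           # Beta prior: failures

def pick_zone():
    samples = {z: random.betavariate(alpha[z], beta[z]) for z in zones}
    return max(samples, key=samples.get)                 # zone with highest sampled value

def update(zone, success):
    if success:
        alpha[zone] += 1
    else:
        beta[zone] += 1

for episode in range(100):
    z = pick_zone()                                       # adaptively select the training zone
    success = random.random() < {"center": 0.9, "ring-1": 0.6, "ring-2": 0.3}[z]
    update(z, success)
```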

Result: System demonstrates improved stability, reduced oracle usage, and stronger long-range reasoning through distributed agent cooperation on spatially grounded Tower of Hanoi benchmark.

Conclusion: The hierarchical multi-agent approach with spatial curriculum and confidence-based selection effectively addresses long-horizon reasoning challenges while reducing computational costs, showing promise for robotic manipulation and planning tasks.

Abstract: Large Language Models and multi-agent systems have shown promise in decomposing complex tasks, yet they struggle with long-horizon reasoning tasks and escalating computation cost. This work introduces a hierarchical multi-agent architecture that distributes reasoning across a 64*64 grid of lightweight agents, supported by a selective oracle. A spatial curriculum progressively expands the operational region of the grid, ensuring that agents master easier central tasks before tackling harder peripheral ones. To improve reliability, the system integrates Negative Log-Likelihood as a measure of confidence, allowing the curriculum to prioritize regions where agents are both accurate and well calibrated. A Thompson Sampling curriculum manager adaptively chooses training zones based on competence and NLL-driven reward signals. We evaluate the approach on a spatially grounded Tower of Hanoi benchmark, which mirrors the long-horizon structure of many robotic manipulation and planning tasks. Results demonstrate improved stability, reduced oracle usage, and stronger long-range reasoning from distributed agent cooperation.

[11] HealthcareNLP: where are we and what is next?

Lifeng Han, Paul Rayson, Suzan Verberne, Andrew Moore, Goran Nenadic

Main category: cs.CL

TL;DR: A tutorial introducing HealthcareNLP applications, covering data/resources, NLP evaluation, and patient-focused layers, with hands-on sessions for practitioners and researchers.

Motivation: Existing reviews overlook important HealthcareNLP tasks like synthetic data generation for privacy and explainable clinical NLP, and fail to mention key methodologies like retrieval augmented generation and neural symbolic integration of LLMs and KGs.

Method: Three-layer hierarchical framework: 1) Data/Resource layer (annotation guidelines, ethical approvals, governance, synthetic data), 2) NLP-Eval layer (NER, RE, sentiment analysis, linking/coding with categorized methods for explainable HealthAI), 3) Patients layer (PPIE, health literacy, translation, simplification, summarization, shared decision-making support). Includes hands-on sessions.

Result: Provides comprehensive introductory overview of HealthcareNLP sub-areas, addressing gaps in existing reviews and offering practical applications through hands-on sessions.

Conclusion: The tutorial successfully bridges gaps in HealthcareNLP literature by presenting a structured three-layer framework and practical applications, making the field accessible to diverse audiences including NLP practitioners, researchers, healthcare professionals, and students without requiring prior knowledge.

Abstract: This proposed tutorial focuses on Healthcare Domain Applications of NLP, what we have achieved around HealthcareNLP, and the challenges that lie ahead for the future. Existing reviews in this domain either overlook some important tasks, such as synthetic data generation for addressing privacy concerns, or explainable clinical NLP for improved integration and implementation, or fail to mention important methodologies, including retrieval augmented generation and the neural symbolic integration of LLMs and KGs. In light of this, the goal of this tutorial is to provide an introductory overview of the most important sub-areas of a patient- and resource-oriented HealthcareNLP, with three layers of hierarchy: data/resource layer: annotation guidelines, ethical approvals, governance, synthetic data; NLP-Eval layer: NLP tasks such as NER, RE, sentiment analysis, and linking/coding with categorised methods, leading to explainable HealthAI; patients layer: Patient Public Involvement and Engagement (PPIE), health literacy, translation, simplification, and summarisation (also NLP tasks), and shared decision-making support. A hands-on session will be included in the tutorial for the audience to use HealthcareNLP applications. The target audience includes NLP practitioners in the healthcare application domain, NLP researchers who are interested in domain applications, healthcare researchers, and students from NLP fields. The type of tutorial is “Introductory to CL/NLP topics (HealthcareNLP)” and the audience does not need prior knowledge to attend this. Tutorial materials: https://github.com/4dpicture/HealthNLP

[12] QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models

Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier

Main category: cs.CL

TL;DR: QSTN is an open-source Python framework for generating questionnaire responses using LLMs, enabling systematic evaluation of survey design and response generation methods with low compute cost.

Motivation: To support in-silico surveys and annotation tasks with LLMs by providing a systematic way to generate questionnaire responses, addressing the need for robust evaluation of survey presentation, prompt perturbations, and response generation methods.

Method: Developed an open-source Python framework (QSTN) that enables systematic generation of responses from questionnaire-style prompts, with features for evaluating questionnaire presentation, prompt perturbations, and response generation methods. Includes a no-code user interface for researchers without coding knowledge.

Result: Extensive evaluation (>40 million survey responses) shows that question structure and response generation methods significantly impact alignment of generated responses with human answers, and can be obtained at a fraction of the compute cost.

Conclusion: QSTN framework supports reproducibility and reliability of LLM-based research by providing tools for systematic questionnaire response generation and evaluation, making robust experiments accessible to researchers without coding expertise.

Abstract: We introduce QSTN, an open-source Python framework for systematically generating responses from questionnaire-style prompts to support in-silico surveys and annotation tasks with large language models (LLMs). QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods. Our extensive evaluation ($>40 $ million survey responses) shows that question structure and response generation methods have a significant impact on the alignment of generated survey responses with human answers, and can be obtained for a fraction of the compute cost. In addition, we offer a no-code user interface that allows researchers to set up robust experiments with LLMs without coding knowledge. We hope that QSTN will support the reproducibility and reliability of LLM-based research in the future.

[13] An Agentic AI System for Multi-Framework Communication Coding

Bohao Yang, Rui Yang, Joshua M. Biro, Haoyuan Wang, Jessica L. Handley, Brianna Richardson, Sophia Bessias, Nicoleta Economou-Zavlanos, Armando D. Bedoya, Monica Agrawal, Michael M. Zavlanos, Anand Chowdhury, Raj M. Ratwani, Kai Sun, Kathryn I. Pollak, Michael J. Pencina, Chuan Hong

Main category: cs.CL

TL;DR: MOSAIC is a multi-agent AI system for clinical communication annotation that achieves high accuracy (F1=0.928) by combining specialized agents for planning, updating, annotation, and verification.

Motivation: Current clinical communication annotation methods are labor-intensive, inconsistent, and difficult to scale. Single-task LLM approaches lack adaptability, interpretability, and reliability across different communication frameworks and clinical domains.

Method: Developed MOSAIC using LangGraph-based architecture with four core agents: Plan Agent (codebook selection/workflow planning), Update Agent (maintaining retrieval databases), Annotation Agents (codebook-guided RAG with dynamic few-shot prompting), and Verification Agent (consistency checks/feedback).

Result: Achieved overall F1 score of 0.928 on test set (50 transcripts). Best performance in Rheumatology (F1=0.962) and strongest for Patient Behavior categories. Outperformed baseline benchmarks in ablation studies.

Conclusion: MOSAIC demonstrates that structured multi-agent AI systems can effectively automate clinical communication annotation with high accuracy, addressing scalability and consistency challenges while maintaining adaptability across clinical domains.

Abstract: Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.

[14] Automatic Essay Scoring and Feedback Generation in Basque Language Learning

Ekhi Azurmendi, Xabier Arregi, Oier Lopez de Lacalle

Main category: cs.CL

TL;DR: First public Basque AES dataset with 3,200 CEFR C1 essays; fine-tuned open-source models outperform GPT-5/Claude in scoring and feedback; includes a novel evaluation methodology.

Motivation: Address the lack of publicly available datasets for Automatic Essay Scoring (AES) and feedback generation in low-resource languages like Basque, particularly at CEFR C1 proficiency level, to enable transparent and reproducible NLP research in educational contexts.

Method: Created dataset of 3,200 Basque essays with expert annotations (scores for correctness, richness, coherence, cohesion, task alignment plus detailed feedback). Fine-tuned open-source models including RoBERTa-EusCrawl and Latxa 8B/70B for scoring and explanation generation. Proposed novel evaluation methodology combining automatic consistency metrics with expert validation of extracted learner errors.

Result: Encoder models remain reliable for AES, while supervised fine-tuning of Latxa significantly enhances performance, surpassing state-of-the-art closed-source systems (GPT-5, Claude Sonnet 4.5) in scoring consistency and feedback quality. Fine-tuned Latxa produces criterion-aligned, pedagogically meaningful feedback and identifies wider range of error types than proprietary models.

Conclusion: This resource and benchmark establish foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages like Basque, demonstrating that open-source models can outperform proprietary systems when properly fine-tuned on domain-specific data.

Abstract: This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level. The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples. We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation. Our experiments show that encoder models remain highly reliable for AES, while supervised fine-tuning (SFT) of Latxa significantly enhances performance, surpassing state-of-the-art (SoTA) closed-source systems such as GPT-5 and Claude Sonnet 4.5 in scoring consistency and feedback quality. We also propose a novel evaluation methodology for assessing feedback generation, combining automatic consistency metrics with expert-based validation of extracted learner errors. Results demonstrate that the fine-tuned Latxa model produces criterion-aligned, pedagogically meaningful feedback and identifies a wider range of error types than proprietary models. This resource and benchmark establish a foundation for transparent, reproducible, and educationally grounded NLP research in low-resource languages such as Basque.

[15] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

David Samuel, Lilja Øvrelid, Erik Velldal, Andrey Kutuzov

Main category: cs.CL

TL;DR: A post-training method for lower-resource languages that preserves fluency when aligning language models with disfluent reward models, using on-policy training without requiring instruction-tuning data in the target language.

Motivation: Lower-resource languages lack both native-speaker datasets and fluent language models for generating synthetic data, making preference optimization challenging. Previous work has focused mainly on English and Chinese, leaving a gap for other languages.

Method: Proposes an on-policy training method for preference alignment without instruction-tuning data in the target language. Compares with supervised finetuning on machine-translated data and multilingual finetuning approaches.

Result: Case study on Norwegian Bokmål shows that on-policy training outperforms alternatives and preserves fluency, as evaluated by native-speaker assessments, without requiring hard-to-obtain data.

Conclusion: On-policy training is crucial for developing fluent preference-aligned language models for lower-resource languages, providing an effective solution without needing native-speaker instruction data.

Abstract: We propose a post-training method for lower-resource languages that preserves fluency of language models even when aligned by disfluent reward models. Preference-optimization is now a well-researched topic, but previous work has mostly addressed models for English and Chinese. Lower-resource languages lack both datasets written by native speakers and language models capable of generating fluent synthetic data. Thus, in this work, we focus on developing a fluent preference-aligned language model without any instruction-tuning data in the target language. Our approach uses an on-policy training method, which we compare with two common approaches: supervised finetuning on machine-translated data and multilingual finetuning. We conduct a case study on Norwegian Bokmål and evaluate fluency through native-speaker assessments. The results show that the on-policy aspect is crucial and outperforms the alternatives without relying on any hard-to-obtain data.

[16] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki

Main category: cs.CL

TL;DR: The paper proposes an adaptive aggregation method for aligning LLMs with diverse human preferences in federated learning, achieving better fairness while maintaining competitive alignment quality compared to standard aggregation techniques.

Motivation: Standard methods for aligning LLMs in federated learning environments often fail to adequately represent diverse human preferences and viewpoints, creating a need for approaches that can balance alignment quality with fairness across different groups.

Method: The authors introduce a comprehensive evaluation framework for federated LLM alignment, where groups locally evaluate rollouts and produce reward signals. They evaluate standard aggregation techniques (min, max, average) and propose a novel adaptive scheme that dynamically adjusts preference weights based on each group’s historical alignment performance.
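
A minimal sketch of the adaptive weighting idea as described above: groups whose historical alignment is lower receive more weight in the aggregated reward. The softmax-of-negative-means rule and the 0.5 prior are assumptions for illustration, not the authors' exact scheme.

```python
# Hedged sketch of adaptive reward aggregation across groups in federated RLHF.
import numpy as np

history = {"group_a": [], "group_b": [], "group_c": []}     # per-group alignment scores so far

def aggregate(group_rewards):
    # Lower historical alignment -> higher weight (fairness pressure); 0.5 is a neutral prior.
    means = np.array([np.mean(history[g]) if history[g] else 0.5 for g in group_rewards])
    weights = np.exp(-means) / np.exp(-means).sum()
    for g, r in group_rewards.items():                       # record this round's rewards
        history[g].append(r)
    return float(np.dot(weights, list(group_rewards.values())))

print(aggregate({"group_a": 0.9, "group_b": 0.4, "group_c": 0.7}))
```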

Result: Experiments on question-answering tasks using a PPO-based RLHF pipeline show that the adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores compared to standard aggregation methods.

Conclusion: The work provides both a robust methodology for evaluating LLM behavior across diverse populations and a practical solution for developing truly pluralistic and fairly aligned models in federated learning settings.

Abstract: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group’s historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.

[17] Ask, Answer, and Detect: Role-Playing LLMs for Personality Detection with Question-Conditioned Mixture-of-Experts

Yifan Lyu, Liang Zhang

Main category: cs.CL

TL;DR: ROME: A framework that injects psychological knowledge into personality detection by using LLMs to simulate user responses to psychometric questionnaires, transforming free-form posts into questionnaire-grounded evidence for better personality prediction.

Motivation: Existing personality detection methods face two main challenges: 1) limited supervision signals due to label scarcity, and 2) under-specified semantic mappings between user language and abstract psychological constructs. Current "posts -> user vector -> labels" approaches don't effectively leverage psychological knowledge.

Method: ROME uses LLMs’ role-play capability to simulate user responses to validated psychometric questionnaires. It transforms user posts into question-level answers that provide intermediate supervision. A question-conditioned Mixture-of-Experts module routes over post and question representations, learning to answer questionnaire items. Predicted answers are summarized into an interpretable answer vector and fused with user representation for final prediction in a multi-task learning framework.
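
The question-conditioned routing can be illustrated with a small dense Mixture-of-Experts layer: the router sees the concatenated post and questionnaire-item representations and mixes expert outputs into answer logits. The dimensions, expert count, and answer-scale size below are placeholders rather than values from the paper.

```python
# Hedged sketch of a question-conditioned Mixture-of-Experts answering module.
import torch
import torch.nn as nn

class QuestionConditionedMoE(nn.Module):
    def __init__(self, dim=256, n_experts=4, n_answers=5):
        super().__init__()
        self.router = nn.Linear(2 * dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(2 * dim, n_answers) for _ in range(n_experts))

    def forward(self, post_repr, question_repr):
        x = torch.cat([post_repr, question_repr], dim=-1)               # condition on both
        gate = torch.softmax(self.router(x), dim=-1)                    # (B, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, n_answers)
        return (gate.unsqueeze(-1) * expert_out).sum(dim=1)             # answer logits

moe = QuestionConditionedMoE()
post, question = torch.randn(2, 256), torch.randn(2, 256)
print(moe(post, question).shape)   # torch.Size([2, 5])
```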

Result: Extensive experiments on two real-world datasets show ROME consistently outperforms state-of-the-art baselines, achieving 15.41% improvement on the Kaggle dataset.

Conclusion: ROME successfully addresses label scarcity and semantic mapping challenges in personality detection by explicitly injecting psychological knowledge through questionnaire simulation, providing interpretable intermediate supervision and improving prediction accuracy significantly.

Abstract: Understanding human personality is crucial for web applications such as personalized recommendation and mental health assessment. Existing studies on personality detection predominantly adopt a “posts -> user vector -> labels” modeling paradigm, which encodes social media posts into user representations for predicting personality labels (e.g., MBTI labels). While recent advances in large language models (LLMs) have improved text encoding capacities, these approaches remain constrained by limited supervision signals due to label scarcity, and under-specified semantic mappings between user language and abstract psychological constructs. We address these challenges by proposing ROME, a novel framework that explicitly injects psychological knowledge into personality detection. Inspired by standardized self-assessment tests, ROME leverages LLMs’ role-play capability to simulate user responses to validated psychometric questionnaires. These generated question-level answers transform free-form user posts into interpretable, questionnaire-grounded evidence linking linguistic cues to personality labels, thereby providing rich intermediate supervision to mitigate label scarcity while offering a semantic reasoning chain that guides and simplifies the text-to-personality mapping learning. A question-conditioned Mixture-of-Experts module then jointly routes over post and question representations, learning to answer questionnaire items under explicit supervision. The predicted answers are summarized into an interpretable answer vector and fused with the user representation for final prediction within a multi-task learning framework, where question answering serves as a powerful auxiliary task for personality detection. Extensive experiments on two real-world datasets demonstrate that ROME consistently outperforms state-of-the-art baselines, achieving improvements (15.41% on Kaggle dataset).

[18] DiscoVerse: Multi-Agent Pharmaceutical Co-Scientist for Traceable Drug Discovery and Reverse Translation

Xiaochen Zheng, Alvaro Serra, Ilya Schneider Chernov, Maddalena Marchesi, Eunice Musvasva, Tatyana Y. Doktorova

Main category: cs.CL

TL;DR: DiscoVerse is a multi-agent AI system developed at Roche to enable evidence-based querying of pharmaceutical R&D archives for reverse translation, achieving high recall and moderate precision on real drug development data.

Motivation: Pharmaceutical R&D generates vast heterogeneous data from discontinued programs that could be invaluable for reverse translation, but practical reuse is often infeasible due to data complexity and institutional memory challenges.

Method: DiscoVerse is a human-in-the-loop multi-agent co-scientist system with role-specialized agents aligned with scientist workflows. It retrieves relevant data, links across documents, summarizes findings, and preserves institutional memory from confidential pharmaceutical archives.

Result: On seven benchmark queries, DiscoVerse achieved near-perfect recall (≥0.99) with moderate precision (0.71-0.91). Qualitative assessments and three real-world pharmaceutical use cases demonstrated faithful, source-linked synthesis across preclinical and clinical evidence.

Conclusion: DiscoVerse represents the first systematically assessed agentic framework for reverse translation on real pharmaceutical data, showing promising decision-making insights through authorized access to confidential drug development archives spanning over four decades.

Abstract: Pharmaceutical research and development has accumulated vast and heterogeneous archives of data. Much of this knowledge stems from discontinued programs, and reusing these archives is invaluable for reverse translation. However, in practice, such reuse is often infeasible. In this work, we introduce DiscoVerse, a multi-agent co-scientist designed to support pharmaceutical research and development at Roche. Designed as a human-in-the-loop assistant, DiscoVerse enables domain-specific queries by delivering evidence-based answers: it retrieves relevant data, links across documents, summarises key findings and preserves institutional memory. We assess DiscoVerse through expert evaluation of source-linked outputs. Our evaluation spans a selected subset of 180 molecules from Roche’s research and development repositories, encompassing over 0.87 billion BPE tokens and more than four decades of research. To our knowledge, this represents the first agentic framework to be systematically assessed on real pharmaceutical data for reverse translation, enabled by authorized access to confidential archives covering the full lifecycle of drug development. Our contributions include: role-specialized agent designs aligned with scientist workflows; human-in-the-loop support for reverse translation; expert evaluation; and a large-scale demonstration showing promising decision-making insights. In brief, across seven benchmark queries, DiscoVerse achieved near-perfect recall ($\geq 0.99$) with moderate precision ($0.71-0.91$). Qualitative assessments and three real-world pharmaceutical use cases further showed faithful, source-linked synthesis across preclinical and clinical evidence.

[19] Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer

Main category: cs.CL

TL;DR: Gradual depth growth in Transformers improves reasoning by overcoming the “Curse of Depth” - where later layers contribute less in standard models. Growth enables better depth utilization, alters residual stream structure, and creates permutable computational blocks.

Motivation: While MIDAS showed that gradually growing Transformer depth reduces training cost and improves reasoning, there's no mechanistic understanding of why this works. The paper aims to connect this to the "Curse of Depth" phenomenon where later layers contribute less in standard models.

Method: Uses depth-wise analyses to study gradual middle stacking growth. Demonstrates how growth affects depth utilization, residual stream structure, and computational block formation. Also proposes a lightweight modification to MIDAS for further improvements.
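
Schematically, middle stacking grows a block stack by cloning layers around its midpoint and resuming training. The sketch below shows only the structural operation on a toy stack, not MIDAS's growth schedule or the proposed modification.

```python
# Illustrative "middle stacking" growth step: duplicate layers around the middle
# of the stack and reinsert them there, increasing depth before further training.
import copy
import torch.nn as nn

def grow_middle(layers: nn.ModuleList, n_new: int) -> nn.ModuleList:
    mid = len(layers) // 2
    start = max(0, mid - n_new // 2)
    clones = [copy.deepcopy(layers[start + i]) for i in range(min(n_new, len(layers) - start))]
    return nn.ModuleList(list(layers[:mid]) + clones + list(layers[mid:]))

# toy "transformer": 4 identity blocks grown to 6
stack = nn.ModuleList(nn.Identity() for _ in range(4))
stack = grow_middle(stack, 2)
print(len(stack))   # 6
```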

Result: Gradual growth yields more effective depth utilization, alters residual stream structure, and facilitates permutable computational blocks. The proposed MIDAS modification shows further improvements on downstream reasoning benchmarks.

Conclusion: Gradual depth growth leads to distinct computational circuits that overcome limited depth utilization in standard models, providing a mechanistic explanation for MIDAS’s success and suggesting improved training strategies.

Abstract: Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half - also known as the Curse of Depth (Sun et al., 2025, Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.

[20] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang

Main category: cs.CL

TL;DR: RAGLens is a lightweight hallucination detector for RAG systems that uses sparse autoencoders to identify hallucination features in LLM internal representations, achieving better performance than existing methods while providing interpretable rationales.

Motivation: Existing RAG hallucination detection methods either require large annotated datasets for training or rely on expensive external LLM judges, while approaches using internal LLM representations have limited accuracy. There's a need for a more effective and efficient solution.

Method: Uses sparse autoencoders (SAEs) to disentangle LLM internal activations, identifies hallucination-specific features through information-based feature selection and additive feature modeling, creating RAGLens - a lightweight detector that analyzes internal representations.
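
At its core, such a detector scores a response by reading a handful of SAE features off the LLM's hidden state. In this sketch the SAE is untrained and the selected feature indices and threshold are hypothetical, standing in for the information-based feature selection and additive feature model described above; it is not the RAGLens implementation.

```python
# Hedged sketch: flag a RAG response as hallucinated from selected SAE features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_dict=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def encode(self, h):
        return torch.relu(self.enc(h))                   # sparse, non-negative features

sae = SparseAutoencoder()                                 # in practice: a pre-trained SAE
hallucination_features = [12, 847, 2031]                  # hypothetical selected feature indices
threshold = 1.5                                           # hypothetical decision threshold

def flag_hallucination(hidden_state):                     # hidden_state: (d_model,) LLM activation
    feats = sae.encode(hidden_state)
    score = feats[hallucination_features].sum()           # simple additive feature score
    return bool(score > threshold)

print(flag_hallucination(torch.randn(768)))
```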

Result: RAGLens achieves superior hallucination detection performance compared to existing methods, provides interpretable rationales for its decisions, enables post-hoc mitigation of unfaithful RAG outputs, and reveals insights about hallucination signal distribution in LLMs.

Conclusion: RAGLens demonstrates that mechanistic interpretability techniques like sparse autoencoders can effectively identify hallucination features in LLM internal representations, enabling lightweight yet accurate hallucination detection for RAG systems with interpretable rationales.

Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.

[21] Detecting value-expressive text posts in Russian social media

Maria Milkova, Maksim Rudnev, Lidia Okolskaya

Main category: cs.CL

TL;DR: Researchers developed a model to detect value-expressive posts in Russian social media VKontakte, using ensemble human-AI annotation and transformer embeddings, achieving F1=0.77 for value detection.

DetailsMotivation: Studying personal values in social media can reveal societal value evolution, especially when traditional survey methods are inefficient for hard-to-reach populations. However, social media content often contains stereotyped speech rather than authentic value expressions.

Method: Created training dataset of 5,035 VKontakte posts annotated by experts, crowd-workers, and ChatGPT. Applied ensemble human-AI annotation with active learning. Trained classification models using embeddings from pre-trained transformer language models, including fine-tuned rubert-tiny2.

Result: Crowd-workers and experts showed moderate agreement; ChatGPT was more consistent but struggled with spam detection. Best performance achieved with fine-tuned rubert-tiny2 embeddings: F1 = 0.77, F1-macro = 0.83 for value detection.

Conclusion: The developed model provides a crucial tool for studying values within and between Russian social media users, enabling analysis of value expression in social media content despite challenges with stereotyped speech.

Abstract: Basic values are concepts or beliefs which pertain to desirable end-states and transcend specific situations. Studying personal values in social media can illuminate how and why societal values evolve, especially when stimuli-based methods, such as surveys, are inefficient, for instance, in hard-to-reach populations. On the other hand, user-generated content is driven by the massive use of stereotyped, culturally defined speech constructions rather than authentic expressions of personal values. We aimed to find a model that can accurately detect value-expressive posts in the Russian social network VKontakte. A training dataset of 5,035 posts was annotated by three experts, 304 crowd-workers and ChatGPT. Crowd-workers and experts showed only moderate agreement in categorizing posts. ChatGPT was more consistent but struggled with spam detection. We applied an ensemble of human- and AI-assisted annotation involving an active learning approach, and subsequently trained several classification models using embeddings from various pre-trained transformer-based language models. The best performance was achieved with embeddings from a fine-tuned rubert-tiny2 model, yielding high value detection quality (F1 = 0.77, F1-macro = 0.83). This model provides a crucial step toward the study of values within and between Russian social media users.
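
A minimal sketch of the embeddings-plus-classifier setup referenced in the abstract, assuming the cointegrated/rubert-tiny2 checkpoint and mean pooling; the paper fine-tunes the encoder, which this sketch does not do, and the example posts are invented.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_ID = "cointegrated/rubert-tiny2"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
enc = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = enc(**batch).last_hidden_state              # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return ((out * mask).sum(1) / mask.sum(1)).numpy()  # mean pooling over tokens

# Hypothetical training examples: 1 = value-expressive, 0 = not.
posts = ["Семья для меня важнее всего", "Скидки на телефоны до конца недели"]
labels = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(embed(posts), labels)
print(clf.predict(embed(["Нужно всегда помогать близким"])))
```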

[22] B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg

Main category: cs.CL

TL;DR: B-cos Language Models (LMs) transform pre-trained language models into inherently explainable architectures for NLP tasks, producing more faithful and interpretable explanations than post-hoc methods while maintaining comparable task performance.

DetailsMotivation: Current post-hoc explanation methods for black-box models suffer from faithfulness and interpretability issues, while existing B-cos networks have been limited to computer vision applications. There's a need to extend explainable architectures to natural language processing.

Method: Direct transformation of pre-trained language models into B-cos LMs through B-cos conversion combined with task fine-tuning, improving efficiency over previous methods. Also explores transforming decoder-only models for generation tasks.

Result: B-cos LMs produce more faithful and human-interpretable explanations than post-hoc methods while maintaining task performance comparable to conventional fine-tuning. In-depth analysis reveals differences in learning processes and explanation patterns.

Conclusion: B-cos LMs successfully extend explainable architectures to NLP, offering a promising approach for building inherently interpretable language models that balance explanation quality with task performance.

Abstract: Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. Our code is available at https://github.com/Ewanwong/bcos_lm.
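
A minimal sketch of a B-cos linear layer of the kind swapped in during B-cos conversion: the bias term is removed and the linear response is rescaled by |cos(x, w)|^(B-1) to promote input-weight alignment. This follows the general B-cos formulation; which layers the paper converts and how they are initialized from the pre-trained weights are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BcosLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, b: float = 2.0):
        super().__init__()
        # No bias term, as in B-cos networks.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_hat = F.normalize(self.weight, dim=1)           # unit-norm weight rows
        lin = F.linear(x, w_hat)                          # w_hat^T x
        cos = lin / (x.norm(dim=-1, keepdim=True) + 1e-9) # cos(x, w) since w_hat is unit norm
        return lin * cos.abs().pow(self.b - 1)            # |cos|^(B-1) * w_hat^T x

layer = BcosLinear(16, 4)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 4])
```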

[23] Survey and Experiments on Mental Disorder Detection via Social Media: From Large Language Models and RAG to Agents

Zhuohan Ge, Nicole Hu, Yubo Wang, Darian Li, Xinyi Zhu, Haoyang Li, Xin Zhang, Mingtao Zhang, Shihao Qi, Yuming Xu, Han Shi, Chen Jason Zhang, Qing Li

Main category: cs.CL

TL;DR: Survey paper on LLM-based methods for social media mental disorder analysis, covering standard models, RAG for reliability, and agentic systems for autonomous reasoning and intervention.

DetailsMotivation: Mental disorders are a global health challenge, and social media offers real-time data for digital phenotyping. LLMs provide better semantic understanding than traditional methods but face reliability issues (hallucinations, lack of memory) that limit clinical use. Existing literature hasn't sufficiently explored how RAG and agentic systems can address these limitations.

Method: Systematic survey of LLM-based methods for social media mental disorder analysis, organized by technical paradigm (standard pretrained models, RAG, agentic systems) and clinical targets. Comprehensive evaluation of LLM performance across tasks, including impact of RAG. Establishes unified benchmark for the field.

Result: The paper establishes a comprehensive framework for understanding LLM applications in mental health analysis, extending beyond common internalizing disorders to include psychotic disorders and externalizing behaviors. Provides evaluation of how RAG and agentic systems can improve reliability and reasoning capabilities.

Conclusion: This work paves the way for developing trustworthy, autonomous AI systems that can deliver precise and explainable mental health support by addressing current limitations of LLMs through advanced enhancement techniques like RAG and agentic systems.

Abstract: Mental disorders represent a critical global health challenge, and social media is increasingly viewed as a vital resource for real-time digital phenotyping and intervention. Large Language Models (LLMs) offer stronger semantic understanding and reasoning than traditional deep learning, but their use in high-stakes clinical settings is limited by hallucinations and the lack of persistent memory. However, existing literature has not sufficiently investigated how advanced enhancement techniques, specifically Retrieval-Augmented Generation (RAG) and Agentic systems, can address these reliability and reasoning limitations. Here, we systematically survey the evolving landscape of LLM-based methods for social media mental disorder analysis, spanning standard pretrained language models, RAG to mitigate hallucinations and contextual gaps, and agentic systems for autonomous reasoning and multi-step intervention. We organize existing work by technical paradigm and clinical target, extending beyond common internalizing disorders to include psychotic disorders and externalizing behaviors. Additionally, the paper comprehensively evaluates the performance of LLMs, including the impact of RAG, across various tasks. This work establishes a unified benchmark for the field, paving the way for the development of trustworthy, autonomous AI systems that can deliver precise and explainable mental health support.

[24] EEG-to-Text Translation: A Model for Deciphering Human Brain Activity

Saydul Akbar Murad, Ashim Dahal, Nick Rahimi

Main category: cs.CL

TL;DR: R1 Translator: a new EEG-to-text decoding model combining a bidirectional LSTM encoder with a pretrained transformer decoder, outperforming T5 and Brain Translator on ROUGE metrics, CER, and WER.

DetailsMotivation: With rapid advancement of large language models, bridging the gap between human brain and language processing has become important. Existing EEG-to-text decoding models face significant performance limitations that need to be overcome.

Method: Proposes R1 Translator model combining bidirectional LSTM encoder with pretrained transformer-based decoder. Processes EEG embeddings through LSTM to capture sequential dependencies, then feeds them into transformer decoder for text generation.

Result: R1 Translator outperforms both T5 and Brain Translator across all metrics: ROUGE-1 score of 38.00% (9% higher than T5, 3% better than Brain), ROUGE-L F1 score of 32.51% (3% better than T5, 2% better than Brain), CER of 0.5795 (2% lower than T5, 4% lower than Brain), WER of 0.7280 (4.3% better than T5, 3.6% better than Brain).

Conclusion: R1 Translator successfully improves EEG-to-text decoding performance by combining LSTM’s sequential modeling with transformer’s generation capabilities, achieving state-of-the-art results and demonstrating the effectiveness of hybrid architecture for brain signal decoding.

Abstract: With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 score of 38.00% (P), which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). In terms of CER, R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.
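
A minimal sketch of the hybrid architecture: a bidirectional LSTM encodes EEG feature sequences, and a pretrained transformer decoder cross-attends to the projected LSTM states. The facebook/bart-base decoder and the feature dimensions are illustrative assumptions, not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, BartTokenizer
from transformers.modeling_outputs import BaseModelOutput

class EEGToText(nn.Module):
    def __init__(self, eeg_dim: int = 840, hidden: int = 256,
                 decoder_name: str = "facebook/bart-base"):
        super().__init__()
        self.lstm = nn.LSTM(eeg_dim, hidden, batch_first=True, bidirectional=True)
        self.decoder = BartForConditionalGeneration.from_pretrained(decoder_name)
        self.proj = nn.Linear(2 * hidden, self.decoder.config.d_model)

    def forward(self, eeg: torch.Tensor, labels: torch.Tensor):
        seq, _ = self.lstm(eeg)                              # (B, T, 2*hidden)
        memory = BaseModelOutput(last_hidden_state=self.proj(seq))
        # The pretrained decoder cross-attends to the projected LSTM states.
        return self.decoder(encoder_outputs=memory, labels=labels)

tok = BartTokenizer.from_pretrained("facebook/bart-base")
model = EEGToText()
eeg = torch.randn(2, 58, 840)                                # (batch, EEG windows, features)
labels = tok(["he walked home", "she read a book"],
             return_tensors="pt", padding=True).input_ids
print(model(eeg, labels).loss)
```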

[25] OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities

Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh

Main category: cs.CL

TL;DR: Omniguard is a novel approach for detecting harmful prompts across languages and modalities by identifying cross-aligned internal representations in LLMs/MLLMs and building language/modality-agnostic classifiers.

DetailsMotivation: Current harmful prompt detection methods are fallible and vulnerable to attacks exploiting mismatched generalization of model capabilities, particularly in low-resource languages or non-text modalities like images and audio.

Method: Omniguard identifies internal representations of LLMs/MLLMs that are aligned across languages or modalities, then uses these representations to build language-agnostic or modality-agnostic classifiers for harmful prompt detection.

Result: Improves harmful prompt classification by 11.57% in multilingual settings, 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. Also highly efficient (≈120× faster than the next fastest baseline).

Conclusion: Omniguard provides an effective and efficient solution for cross-lingual and cross-modal harmful prompt detection, addressing critical vulnerabilities in current safety mechanisms for LLMs/MLLMs.

Abstract: The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose Omniguard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. Omniguard improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting, by 20.44% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, Omniguard is also very efficient ($\approx 120\times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
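
A minimal sketch of the classifier stage, assuming a layer has already been identified whose representations are aligned across languages and modalities: pool that layer's hidden states per prompt and fit a simple harmfulness classifier. The model id, layer index, example prompts, and classifier are all illustrative assumptions.

```python
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"   # assumed small model for the sketch
LAYER = 12                                 # assumed "cross-aligned" layer

tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def prompt_embedding(text: str) -> np.ndarray:
    ids = tok(text, return_tensors="pt")
    hs = lm(**ids, output_hidden_states=True).hidden_states[LAYER][0]  # (T, H)
    return hs.mean(dim=0).numpy()          # mean-pool over tokens

# Hypothetical training prompts: 0 = benign, 1 = harmful.
prompts = ["How do I bake sourdough bread?", "Tell me how to make a dangerous weapon at home"]
labels = [0, 1]
X = np.stack([prompt_embedding(p) for p in prompts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# The same classifier is then applied to prompts in other languages.
print(clf.predict_proba(prompt_embedding("Comment fabriquer une arme dangereuse ?")[None])[:, 1])
```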

[26] Bench4KE: Benchmarking Automated Competency Question Generation

Anna Sofia Lippolis, Minh Davide Ragagni, Paolo Ciancarini, Andrea Giovanni Nuzzolese, Valentina Presutti

Main category: cs.CL

TL;DR: Bench4KE is an API-based benchmarking system for evaluating LLM-based Knowledge Engineering automation tools, starting with Competency Question generation, using curated datasets and similarity metrics.

DetailsMotivation: LLMs enable Knowledge Engineering automation, but evaluation of tools like Competency Question generators lacks standardization, undermining methodological rigor and hindering replication/comparison of results.

Method: Developed Bench4KE - an extensible API-based benchmarking system with curated gold standard datasets from 17 real-world ontology projects, using similarity metrics to assess generated CQ quality.

Result: Comparative analysis of 6 recent LLM-based CQ generation systems, establishing a baseline for future research. System designed to accommodate additional KE automation tasks like SPARQL query generation and ontology testing.

Conclusion: Bench4KE addresses standardization gap in KE automation evaluation, provides reproducible benchmarking framework, and supports future expansion to other knowledge engineering tasks.

Abstract: The availability of Large Language Models (LLMs) presents a unique opportunity to reinvigorate research on Knowledge Engineering (KE) automation. This trend is already evident in recent efforts developing LLM-based methods and tools for the automatic generation of Competency Questions (CQs), natural language questions used by ontology engineers to define the functional requirements of an ontology. However, the evaluation of these tools lacks standardization. This undermines the methodological rigor and hinders the replication and comparison of results. To address this gap, we introduce Bench4KE, an extensible API-based benchmarking system for KE automation. The presented release focuses on evaluating tools that generate CQs automatically. Bench4KE provides a curated gold standard consisting of CQ datasets from 17 real-world ontology engineering projects and uses a suite of similarity metrics to assess the quality of the CQs generated. We present a comparative analysis of 6 recent CQ generation systems, which are based on LLMs, establishing a baseline for future research. Bench4KE is also designed to accommodate additional KE automation tasks, such as SPARQL query generation, ontology testing and drafting. Code and datasets are publicly available under the Apache 2.0 license.
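
A minimal sketch of a similarity-based scoring step of the kind such a benchmark can apply: embed generated and gold competency questions with a sentence encoder and score each generated CQ by its best match in the gold set. The encoder checkpoint, example questions, and max-similarity aggregation are assumptions, not necessarily the metrics Bench4KE ships.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed sentence encoder

gold = ["What ingredients does a recipe require?",
        "Which recipes are suitable for a given diet?"]
generated = ["Which ingredients are needed for a recipe?",
             "Who is the author of a recipe?"]

# Cosine similarity between every generated CQ and every gold CQ.
sims = util.cos_sim(encoder.encode(generated, convert_to_tensor=True),
                    encoder.encode(gold, convert_to_tensor=True))
per_cq = sims.max(dim=1).values                     # best gold match per generated CQ
print(per_cq.tolist(), "mean:", float(per_cq.mean()))
```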

[27] Shrinking the Generation-Verification Gap with Weak Verifiers

Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

Main category: cs.CL

TL;DR: Weaver is a framework that combines multiple weak verifiers into a strong ensemble using weak supervision, improving language model response selection without extensive labeled data.

DetailsMotivation: Current verifiers for language models are either unscalable (humans) or limited in utility, and there's a significant performance gap between LM judges/reward models and oracle verifiers with perfect accuracy.

Method: Weaver combines multiple weak, imperfect verifiers using weighted ensembles. It uses weak supervision to estimate each verifier’s accuracy without extensive labeled data, normalizes outputs using dataset statistics, filters low-quality verifiers, and creates unified scores.

Result: Weaver significantly improves over Pass@1 (first-sample) performance across reasoning and math tasks, achieving o3-mini-level accuracy (87.7% average) with Llama 3.3 70B as the generator and an ensemble of 70B-or-smaller verifiers. This gain mirrors the jump between GPT-4o and o3-mini.

Conclusion: Weaver effectively closes the performance gap between current verifiers and oracle verifiers by combining weak verifiers through weak supervision, and computational costs can be reduced by training a smaller cross-encoder using Weaver’s combined scores.

Abstract: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier’s accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver’s effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (the performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver’s combined output scores.
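
A minimal sketch of the aggregation step only: given binary votes from several weak verifiers and per-verifier accuracy estimates (which Weaver obtains via weak supervision rather than labeled data), combine the votes by accuracy-weighted log-odds and pick the top-scoring candidate. The accuracies below are hypothetical placeholders.

```python
import numpy as np

def weighted_score(votes: np.ndarray, accs: np.ndarray) -> np.ndarray:
    """votes: (n_candidates, n_verifiers) in {0,1}; accs: (n_verifiers,) estimated accuracies."""
    # Under a naive-Bayes view, a verifier with accuracy a contributes
    # log(a/(1-a)) when it accepts a candidate and -log(a/(1-a)) when it rejects.
    w = np.log(accs / (1 - accs))
    return (2 * votes - 1) @ w            # signed votes, accuracy-weighted

votes = np.array([[1, 0, 1, 1],           # candidate 0: three verifiers accept
                  [0, 0, 1, 0],           # candidate 1
                  [1, 1, 1, 0]])          # candidate 2
accs = np.array([0.85, 0.60, 0.70, 0.55]) # placeholder accuracy estimates
scores = weighted_score(votes, accs)
print(scores, "-> selected candidate:", int(scores.argmax()))
```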

[28] Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores

Vivek Chari, Benjamin Van Durme

Main category: cs.CL

TL;DR: Compactor is a training-free KV compression method using leverage scores for token importance, achieving the same performance as competing methods with 20% fewer tokens and a 68% KV memory reduction on Longbench.

DetailsMotivation: Large context windows in modern LLMs create significant KV cache memory bottlenecks during inference, requiring efficient compression methods to reduce memory burden without performance loss.

Method: Training-free, query-agnostic KV compression using approximate leverage scores to determine token importance, with context-calibrated compression to infer maximum compression before performance loss.

Result: Achieves the same performance as competing methods with 20% fewer tokens and a 68% KV memory reduction on Longbench; validated on 27 tasks from RULER and Longbench across Qwen 2.5 and Llama 3.1 models.

Conclusion: Compactor provides a practical, high-performance solution for alleviating LLM memory bottlenecks, with released inference engine compactor-vllm supporting efficient sparse memory access patterns.

Abstract: Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. We present Compactor, a training-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining 20% fewer tokens in both synthetic and real-world context tasks, while being more task-robust. We further introduce a procedure for context-calibrated compression: inferring the maximum compression a given context supports before significant performance loss. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 68%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families. Finally, we release compactor-vllm, an inference engine and suite of optimized Triton kernels designed to efficiently support the sparse, non-contiguous memory access patterns inherent to compressed KV caches. This work demonstrates that Compactor offers a practical, high-performance solution for alleviating the memory bottleneck in modern LLM deployment.
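
A minimal sketch of leverage-score-based token selection for a KV cache: compute statistical leverage scores of the key vectors and keep the highest-scoring tokens. Compactor uses approximate leverage scores and calibrates the compression ratio per context; this sketch uses exact scores from a thin SVD and a fixed keep ratio for illustration.

```python
import torch

def leverage_scores(keys: torch.Tensor) -> torch.Tensor:
    """keys: (seq_len, head_dim). The leverage score of row i is ||U_i||^2 from the thin SVD."""
    U, _, _ = torch.linalg.svd(keys, full_matrices=False)
    return (U ** 2).sum(dim=1)

def compress_kv(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float = 0.32):
    scores = leverage_scores(keys)
    k = max(1, int(keep_ratio * keys.shape[0]))
    idx = torch.topk(scores, k).indices.sort().values   # keep tokens in original order
    return keys[idx], values[idx], idx

keys = torch.randn(1024, 128)
values = torch.randn(1024, 128)
k_c, v_c, kept = compress_kv(keys, values)
print(k_c.shape, v_c.shape)   # compressed KV entries for one head
```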

[29] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings

Kyeongkyu Lee, Seonghwan Yoon, Hongki Lim

Main category: cs.CL

TL;DR: CLARIFID is a novel framework for automatic radiology report generation that optimizes diagnostic correctness by mimicking expert workflow, using section-aware pretraining, reinforcement learning with clinical reward signals, controlled decoding, and multi-view image fusion.

DetailsMotivation: Current radiology report generation methods produce fluent text but lack factual correctness and often use single-view images, limiting diagnostic reliability and comprehensiveness. There's a need for clinically reliable automated reporting that mirrors expert diagnostic workflows.

Method: CLARIFID employs a four-part approach: (1) section-aware pretraining to learn logical flow from Findings to Impression, (2) fine-tuning with Proximal Policy Optimization using CheXbert F1 score as reward, (3) controlled decoding that completes Findings before synthesizing Impression, and (4) multi-view fusion using vision-transformer encoder. Inference uses next-token forcing and report-level re-ranking.

Result: Experimental results on MIMIC-CXR dataset show superior clinical efficacy compared to existing baselines, with improved diagnostic correctness and comprehensive reporting.

Conclusion: CLARIFID successfully addresses the clinical reliability gap in radiology report generation by mirroring expert diagnostic workflow, optimizing for factual correctness through reinforcement learning, and incorporating multi-view image analysis, resulting in clinically effective automated reporting.

Abstract: Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) employs controlled decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive “Findings” section before synthesizing the “Impression” and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on clinical efficacy scores.
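
A minimal sketch of the reward signal used during RL fine-tuning: label the generated and reference Impression sections and use their F1 agreement as the scalar reward. The keyword-based labeler below is a stand-in for a real CheXbert pipeline, and the finding list is hypothetical.

```python
from sklearn.metrics import f1_score

FINDINGS = ["cardiomegaly", "edema", "consolidation", "atelectasis", "pleural effusion"]

def chexbert_labels(impression: str) -> list:
    # Stand-in for a real CheXbert labeler: naive keyword matching over a few findings.
    text = impression.lower()
    return [int(f in text) for f in FINDINGS]

def impression_reward(generated: str, reference: str) -> float:
    # Scalar reward: micro-F1 agreement between the two multilabel finding vectors.
    return f1_score([chexbert_labels(reference)], [chexbert_labels(generated)],
                    average="micro", zero_division=0)

print(impression_reward("Mild cardiomegaly with small pleural effusion.",
                        "Cardiomegaly and left pleural effusion; no edema."))
```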

[30] LLMs Can’t Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Maojia Song, Tej Deep Pala, Ruiwen Zhou, Weisheng Jin, Amir Zadeh, Chuan Li, Dorien Herremans, Soujanya Poria

Main category: cs.CL

TL;DR: The KAIROS benchmark reveals that LLMs’ susceptibility to social influence varies with model scale: larger models are more resilient to peer influence, while smaller models remain vulnerable and require specialized training for robustness.

DetailsMotivation: LLMs are increasingly used in multi-agent systems where peer interactions affect decisions, but prior work focused narrowly on conformity bias. The paper aims to study broader social dynamics including rapport building, quality discernment, and resistance to misleading inputs - essential for achieving collective intelligence.

Method: Introduces KAIROS benchmark that simulates quiz-style collaboration with peer agents where rapport levels and behaviors can be precisely controlled. Evaluates prompting, supervised fine-tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO) to analyze how rapport, peer actions, and self-confidence influence decision-making.

Result: Model scale is the primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting-based mitigation, while smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.

Conclusion: The study provides a comprehensive framework for analyzing LLM social dynamics in multi-agent settings, revealing critical scaling effects and offering practical mitigation strategies through specialized training approaches for different model sizes.

Abstract: Large language models (LLMs) are increasingly integrated into multi-agent systems (MAS), where peer interactions shape individual decisions. While prior work has mainly examined conformity bias, we broaden the view to include how LLMs build rapport from prior interactions, discern and integrate high-quality peer information, and resist misleading inputs, abilities essential for achieving collective intelligence under complex social dynamics. We introduce KAIROS, a benchmark that simulates quiz-style collaboration with peer agents whose rapport levels and behaviours can be precisely controlled in both historical interactions and the current round. This unified setup enables systematic analysis of how rapport, peer actions, and the model’s self-confidence jointly influence decision-making. Using KAIROS, we evaluate prompting, supervised fine-tuning, and reinforcement learning via Group Relative Policy Optimisation (GRPO). Results show that model scale is a primary factor moderating susceptibility to social influence: larger models are more resilient and benefit from prompting-based mitigation, whereas smaller models remain vulnerable. Only carefully configured GRPO training yields consistent robustness and performance gains for small models.

[31] SENSE models: an open source solution for multilingual and multimodal semantic-based tasks

Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève

Main category: cs.CL

TL;DR: SENSE is an open-source multilingual speech-text alignment model using teacher-student framework, integrated into SpeechBrain toolkit with competitive performance on semantic tasks.

DetailsMotivation: To create an open-source solution for aligning speech and text representations across multiple languages, inspired by existing frameworks like SAMU-XLSR and SONAR, while improving upon them with better components.

Method: Uses teacher-student framework to align self-supervised speech encoder with language-agnostic text encoder representations at utterance level. Updates SAMU-XLSR by selecting stronger teacher text model and better initial speech encoder.

Result: Achieves highly competitive performance on multilingual and multimodal semantic tasks. First SENSE model has been publicly released with source code integrated into SpeechBrain toolkit.

Conclusion: SENSE provides effective multilingual speech-text alignment with competitive performance, offers new insights into semantic capture in aligned speech encoders, and contributes open-source tools to the community.

Abstract: This paper introduces SENSE (Shared Embedding for N-lingual Speech and tExt), an open-source solution inspired by the SAMU-XLSR framework and conceptually similar to Meta AI’s SONAR models. These approaches rely on a teacher-student framework to align a self-supervised speech encoder with the language-agnostic continuous representations of a text encoder at the utterance level. We describe how the original SAMU-XLSR method has been updated by selecting a stronger teacher text model and a better initial speech encoder. The source code for training and using SENSE models has been integrated into the SpeechBrain toolkit, and the first SENSE model we trained has been publicly released. We report experimental results on multilingual and multimodal semantic tasks, where our SENSE model achieves highly competitive performance. Finally, this study offers new insights into how semantics are captured in such semantically aligned speech encoders.
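
A minimal sketch of the utterance-level teacher-student objective: pool the speech encoder's frame outputs into one embedding, project it into the text encoder's space, and pull it toward the frozen teacher embedding with a cosine loss. The toy encoder and dimensions are placeholders for the actual SSL speech model and language-agnostic text teacher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechStudent(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, text_dim=768):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # stand-in for the SSL speech encoder
        self.proj = nn.Linear(hidden, text_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        seq, _ = self.encoder(frames)          # (B, T, hidden)
        return self.proj(seq.mean(dim=1))      # utterance-level embedding

def alignment_loss(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Teacher embeddings are fixed targets: no gradient flows into the text encoder.
    return (1 - F.cosine_similarity(speech_emb, text_emb.detach(), dim=-1)).mean()

student = SpeechStudent()
frames = torch.randn(4, 300, 80)               # (batch, frames, mel features)
teacher_text_emb = torch.randn(4, 768)         # stand-in for frozen text-encoder embeddings
loss = alignment_loss(student(frames), teacher_text_emb)
loss.backward()
print(float(loss))
```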

[32] Do Natural Language Descriptions of Model Activations Convey Privileged Information?

Millicent Li, Alberto Mario Ceballos Arroyo, Giordano Rogers, Naomi Saphra, Byron C. Wallace

Main category: cs.CL

TL;DR: Activation verbalization methods may not provide privileged insights into LLM internals, often reflecting verbalizer LLM knowledge rather than target model knowledge, requiring better benchmarks and controls.

DetailsMotivation: To critically evaluate whether activation verbalization methods actually provide meaningful insights into LLM internal representations and operations, or if they merely convey information about inputs or reflect the verbalizer LLM's own knowledge.

Method: Evaluated popular verbalization methods across datasets from prior work, tested if they succeed without target model access, and conducted controlled experiments to determine whether verbalizations reflect target model knowledge or verbalizer LLM parametric knowledge.

Result: Verbalization methods can succeed on benchmarks without target model access, suggesting flawed evaluation datasets. Controlled experiments show verbalizations often reflect verbalizer LLM knowledge rather than target LLM knowledge.

Conclusion: Current verbalization methods may not provide meaningful insights into LLM operations; targeted benchmarks and experimental controls are needed to rigorously assess whether verbalization methods offer genuine insights into LLM internals.

Abstract: Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target model represents and operates on inputs. But do such activation verbalization approaches actually provide privileged knowledge about the internal workings of the target model, or do they merely convey information about its inputs? We critically evaluate popular verbalization methods across datasets used in prior work and find that they can succeed at benchmarks without any access to target model internals, suggesting that these datasets may not be ideal for evaluating verbalization methods. We then run controlled experiments which reveal that verbalizations often reflect the parametric knowledge of the verbalizer LLM which generated them, rather than the knowledge of the target LLM whose activations are decoded. Taken together, our results indicate a need for targeted benchmarks and experimental controls to rigorously assess whether verbalization methods provide meaningful insights into the operations of LLMs.

[33] Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Yoonah Park, Haesung Pyun, Yohan Jo

Main category: cs.CL

TL;DR: KAPPA is a parameter-free intervention that aligns hidden state coordinates in LLMs to bridge the knowledge-prediction gap on multiple-choice questions, improving accuracy by projecting states to match knowledge coordinates.

DetailsMotivation: LLMs often fail on multiple-choice questions despite having correct knowledge, showing a knowledge-prediction gap. The authors want to understand the mechanism behind this gap and develop a method to alleviate it.

Method: Probing analysis reveals residual streams contain knowledge and prediction basis subspaces. KAPPA transforms hidden states to align prediction coordinates with knowledge coordinates through projection-based adjustment in this subspace.

Result: KAPPA substantially improves accuracy on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge, outperforming baselines. It also extends effectiveness to free-form questions beyond MCQs.

Conclusion: The work provides a geometric understanding of the knowledge-prediction gap and offers a practical, parameter-free method for better aligning model behavior with its latent knowledge.

Abstract: Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap on MCQs and alleviate it, we conduct a probing analysis and find that residual streams in certain layers contain a subspace spanned by two important bases: a \emph{knowledge basis} that encodes the probability of the ground-truth answer for a given MCQ and a \emph{prediction basis} that encodes the probability of the answer choice predicted by the model. We observe that incorrect predictions arise from a misalignment of the model’s hidden states along these two bases. Hence, we introduce \textbf{KAPPA} (Knowledge-Aligned Prediction through Projection-based Adjustment), a parameter-free intervention that transforms the hidden states to align the prediction coordinate with the knowledge coordinate within this subspace. Experiments on binary-choice reformulations of Big-Bench-Hard and ARC-Challenge show that KAPPA substantially improves accuracy and consistently outperforms baselines. While optimal subspaces differ across tasks, subspaces generalize to some extent, as supported by cross-dataset experiments. Moreover, KAPPA extends its effectiveness to free-form questions beyond MCQs. Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
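
A minimal sketch of the projection-based idea: with unit-norm knowledge and prediction basis vectors obtained from probing, shift the hidden state so that its coordinate along the prediction basis matches its coordinate along the knowledge basis. This is one simple realization of the stated idea, not the paper's exact update rule.

```python
import torch

def kappa_adjust(h: torch.Tensor, k_basis: torch.Tensor, p_basis: torch.Tensor) -> torch.Tensor:
    """h: (..., d) hidden states; k_basis, p_basis: (d,) unit vectors from a probing analysis."""
    knowledge_coord = h @ k_basis            # how strongly h encodes the ground-truth answer
    prediction_coord = h @ p_basis           # how strongly h encodes the predicted answer
    # Move along the prediction direction until the two coordinates agree.
    return h + (knowledge_coord - prediction_coord).unsqueeze(-1) * p_basis

d = 4096
k_basis = torch.nn.functional.normalize(torch.randn(d), dim=0)
p_basis = torch.nn.functional.normalize(torch.randn(d), dim=0)
hidden = torch.randn(2, d)                   # hidden states at the chosen layer
adjusted = kappa_adjust(hidden, k_basis, p_basis)
print(torch.allclose(adjusted @ p_basis, hidden @ k_basis, atol=1e-4))  # True
```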

[34] Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in LLMs

Jongwook Han, Jongwon Lim, Injin Kong, Yohan Jo

Main category: cs.CL

TL;DR: LLMs express values through intrinsic (learned) and prompted (explicit) mechanisms that partly share components but also have unique elements, leading to different steerability and response diversity patterns.

DetailsMotivation: To understand whether intrinsic and prompted value expression in LLMs rely on overlapping or distinct mechanisms, which is crucial for value alignment and persona steering applications.

Method: Used two mechanistic approaches: (1) value vectors extracted from residual stream representing value mechanisms, and (2) value neurons (MLP neurons) that contribute to value expressions.

Result: Intrinsic and prompted value mechanisms share common components crucial for value expression but also have unique elements. Prompted mechanisms show higher steerability, while intrinsic mechanisms produce greater response diversity. Unique intrinsic components promote lexical diversity, while unique prompted components strengthen instruction following and affect distant tasks like jailbreaking.

Conclusion: The study reveals that intrinsic and prompted value mechanisms in LLMs are partially overlapping but functionally distinct, with different implications for steerability and response diversity, providing important insights for value alignment research.

Abstract: Large language models (LLMs) can express different values in two distinct ways: (1) intrinsic expression, reflecting the model’s inherent values learned during training, and (2) prompted expression, elicited by explicit prompts. Given their widespread use in value alignment and persona steering, it is paramount to clearly understand their underlying mechanisms, particularly whether they mostly overlap (as one might expect) or rely on substantially different mechanisms, but this remains largely understudied. We analyze this at the mechanistic level using two approaches: (1) value vectors, feature directions representing value mechanisms extracted from the residual stream, and (2) value neurons, MLP neurons that contribute to value expressions. We demonstrate that intrinsic and prompted value mechanisms partly share common components that are crucial for inducing value expression, but also possess unique elements that manifest in different ways. As a result, these mechanisms lead to different degrees of value steerability (prompted > intrinsic) and response diversity (intrinsic > prompted). In particular, components unique to the intrinsic mechanism seem to promote lexical diversity in responses, whereas those specific to the prompted mechanism primarily strengthen instruction following, taking effect even in distant tasks like jailbreaking.

[35] MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son, Vu, Jenia Jitsev

Main category: cs.CL

TL;DR: MixtureVitae is a legally-safe pretraining corpus combining public-domain, permissively licensed, and justified low-risk sources with instruction/reasoning data, achieving strong performance while minimizing legal risks.

DetailsMotivation: To create a practical foundation for training capable LLMs that minimizes legal risks associated with indiscriminate web scraping while maintaining competitive model performance.

Method: Risk-mitigated sourcing strategy combining public-domain text, permissively licensed content (CC-BY/Apache), justified low-risk additions (government works, EU TDM-eligible sources), plus targeted instruction, reasoning and synthetic data. Uses transparent multi-stage pipeline for license-aware filtering, safety/quality screening, and domain-aware mixing.

Result: Models trained on MixtureVitae consistently outperform those trained on other permissive datasets across a suite of standard benchmarks, with particularly strong results on math/code tasks and competitive results on QA. At the 1.7B/300B setting, they surpass FineWeb-Edu and approach DCLM in the later stages of training.

Conclusion: Permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness.

Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae

[36] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

Dzmitry Pihulski, Karol Charchut, Viktoria Novogrodskaia, Jan Kocoń

Main category: cs.CL

TL;DR: LLMSQL is a cleaned and re-annotated version of WikiSQL designed for modern LLMs, fixing structural and annotation issues to serve as an LLM-ready benchmark for text-to-SQL tasks.

DetailsMotivation: WikiSQL has declined in usage due to structural and annotation issues (case sensitivity inconsistencies, data type mismatches, syntax errors, unanswered questions), making it unsuitable for modern LLM evaluation. There's a need for a clean, LLM-ready benchmark for text-to-SQL research.

Method: Systematic revision and transformation of WikiSQL: classified errors and implemented automated methods for cleaning and re-annotation. Created LLMSQL with clean natural language questions and full SQL queries as plain text (unlike original WikiSQL designed for pointer-network models).

Result: Evaluated multiple LLMs on LLMSQL: DeepSeek-R1 achieves 88.40% accuracy in zero-shot setting; models under 10B parameters surpass 90% accuracy after fine-tuning. LLMSQL enables straightforward generation and evaluation for modern text-to-SQL models.

Conclusion: LLMSQL serves as an LLM-ready benchmark for text-to-SQL research, addressing WikiSQL’s limitations and providing clean data suitable for modern LLM evaluation, with strong performance demonstrated across various models.

Abstract: Converting natural language questions into SQL queries enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early text-to-SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the large language model era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models, including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek-R1, and others. Notably, DeepSeek-R1 achieves 88.40% accuracy in a zero-shot setting, and models under 10B parameters surpass 90% accuracy after fine-tuning. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark. Unlike the original WikiSQL, which was tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural-language-to-SQL models.

[37] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei

Main category: cs.CL

TL;DR: DeliberationBank: A human-grounded dataset and DeliberationJudge model for evaluating LLM-generated deliberation summaries to address fairness concerns like minority underrepresentation and input order bias.

DetailsMotivation: LLMs show promise for summarizing large-scale public deliberations but risk underrepresenting minority perspectives and exhibiting input order bias, raising fairness concerns in high-stakes policy contexts. Current evaluation methods relying on LLMs as judges show weak alignment with human judgments.

Method: Created DeliberationBank dataset with (1) opinion data from 3,000 participants across ten deliberation questions, and (2) summary judgment data annotated by 4,500 participants across four dimensions. Trained DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives.

Result: DeliberationJudge is more efficient and more aligned with human judgments than LLM judges. Evaluation of 18 LLMs revealed persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions.

Conclusion: The framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown as a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgements compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.
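
A minimal sketch of a judge model in this spirit: a DeBERTa encoder with a regression head reads an (opinion, summary) pair and predicts a rating. The microsoft/deberta-v3-small checkpoint, input format, and single-dimension output are illustrative assumptions; the actual model is trained on the four annotated judgment dimensions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "microsoft/deberta-v3-small"    # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
judge = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)

opinion = "Public transport should be free for students."
summary = "Participants broadly supported subsidized transit, with some dissent."
batch = tok(opinion, summary, return_tensors="pt", truncation=True)

# One training step against a (hypothetical) human rating in [0, 1].
target = torch.tensor([[0.8]])
out = judge(**batch, labels=target)        # num_labels=1 with float labels -> MSE regression loss
out.loss.backward()
print(float(out.loss), float(out.logits))
```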

[38] StreamingThinker: Large Language Models Can Think While Reading

Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Main category: cs.CL

TL;DR: StreamingThinker enables LLMs to think while reading the input rather than waiting for the complete input, cutting token waiting before reasoning begins by 80% and time-level latency by over 60% while maintaining performance comparable to batch thinking.

DetailsMotivation: Current LLM reasoning paradigm waits for entire input before starting to think, causing unnecessary latency and weakening attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading.

Method: StreamingThinker framework with streaming CoT generation, streaming-constraint training, and streaming parallel inference. Uses streaming reasoning units with quality control, order-preserving reasoning via streaming attention masks and position encoding, and parallel KV caches that decouple input encoding from reasoning generation.

Result: Evaluated on Qwen3 model family across math reasoning, logical reasoning, and context-based QA tasks. Preserves performance comparable to batch thinking while achieving 80% reduction in token waiting before reasoning onset and over 60% reduction in time-level latency for final answer production.

Conclusion: StreamingThinker demonstrates effectiveness of streaming paradigm for LLM reasoning, enabling thinking while reading with significant latency reductions while maintaining reasoning quality.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a \textit{\textbf{streaming thinking}} paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with \textit{StreamingThinker}, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that the StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
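
A minimal sketch of a streaming attention mask in the spirit described above: each reasoning token may attend only to the input tokens that had already been read when its reasoning unit started, plus earlier reasoning tokens. How units are segmented and how positions are encoded in the paper are not reproduced here.

```python
import torch

def streaming_mask(n_input: int, unit_lengths: list, unit_read_upto: list) -> torch.Tensor:
    """unit_lengths[i]: reasoning tokens in unit i; unit_read_upto[i]: input tokens visible to unit i."""
    total = n_input + sum(unit_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Ordinary causal attention over the input prefix.
    mask[:n_input, :n_input] = torch.tril(torch.ones(n_input, n_input, dtype=torch.bool))
    row = n_input
    for length, upto in zip(unit_lengths, unit_read_upto):
        for r in range(row, row + length):
            mask[r, :upto] = True              # only the input read so far
            mask[r, n_input:r + 1] = True      # earlier reasoning tokens and self
        row += length
    return mask

# 6 input tokens; two reasoning units of 3 and 2 tokens, launched after reading 2 and 5 tokens.
print(streaming_mask(6, [3, 2], [2, 5]).int())
```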

[39] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong, Sunghyun Lee, Jaeyoung Lee, Seonah Han, Youngjae Yu

Main category: cs.CL

TL;DR: MLLMs show phonetic intuition aligning with linguistic research on sound symbolism, with attention patterns focusing on iconic phonemes across text and audio modalities.

DetailsMotivation: To investigate how Multimodal Large Language Models interpret auditory information through sound symbolism (non-arbitrary associations between phonetic forms and meanings), bridging AI and cognitive linguistics.

Method: Created LEX-ICON dataset with 8,052 words from 4 languages and 2,930 pseudo-words annotated with semantic features. Analyzed MLLMs’ performance on phonetic iconicity across textual (orthographic/IPA) and auditory inputs across 25 semantic dimensions, measuring phoneme-level attention fraction scores layer-wise.

Result: (1) MLLMs demonstrate phonetic intuitions that align with existing linguistic research across multiple semantic dimensions; (2) Phonosemantic attention patterns reveal models’ focus on iconic phonemes.

Conclusion: Provides first large-scale quantitative analysis of phonetic iconicity in MLLMs’ interpretability, bridging AI and cognitive linguistics through sound symbolism as a probe for auditory information processing.

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.
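
A minimal sketch of a phoneme-level attention fraction score: for each layer, take the attention a chosen query position pays to the input and measure what fraction of that mass lands on tokens covering the iconic phoneme. The mapping from phonemes to token positions and the choice of query position are assumptions here.

```python
import torch

def attention_fraction(attentions, iconic_positions, query_pos=-1):
    """attentions: list over layers of (heads, seq, seq) attention tensors for one input."""
    scores = []
    for layer_attn in attentions:
        row = layer_attn[:, query_pos, :]                       # (heads, seq): attention from the query
        frac = row[:, iconic_positions].sum(-1) / row.sum(-1)   # mass on iconic-phoneme tokens
        scores.append(frac.mean().item())                       # average over heads
    return scores                                               # one fraction per layer

# Toy example: 2 layers, 4 heads, 6 tokens, iconic phoneme covered by positions 2-3.
attn = [torch.softmax(torch.randn(4, 6, 6), dim=-1) for _ in range(2)]
print(attention_fraction(attn, iconic_positions=[2, 3]))
```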

[40] AraLingBench A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models

Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Sina Mukalled, Nadine Rizk, Fatima Karnib, Issam Lakkis, Ammar Mohanna, Bernard Ghanem

Main category: cs.CL

TL;DR: AraLingBench is a human-annotated benchmark for evaluating Arabic linguistic competence in LLMs, covering grammar, morphology, spelling, reading comprehension, and syntax through 150 expert-designed multiple choice questions.

DetailsMotivation: Current Arabic LLM evaluations focus on knowledge-based tasks, but there's a need to assess deeper linguistic competence and structural language understanding beyond surface-level proficiency and memorization.

Method: Created a fully human-annotated benchmark with 150 expert-designed multiple choice questions spanning five core linguistic categories: grammar, morphology, spelling, reading comprehension, and syntax.

Result: Evaluation of 35 Arabic and bilingual LLMs revealed that models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning, showing a gap between knowledge-based benchmarks and true linguistic mastery.

Conclusion: AraLingBench provides a diagnostic framework for developing Arabic LLMs by isolating fundamental linguistic skills, highlighting that many models succeed through memorization rather than authentic comprehension, and the benchmark code is publicly available.

Abstract: We present AraLingBench: a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert-designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The full evaluation code is publicly available on GitHub.

[41] TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning

Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Ye Wang, Haojun Fei

Main category: cs.CL

TL;DR: TS-PEFT: A token-sparse PEFT method that dynamically skips redundant token updates during fine-tuning, achieving comparable or better performance than dense baselines while reducing computational cost.

DetailsMotivation: Current PEFT methods assume all tokens contribute equally to downstream tasks and require parameter updates, but this paper challenges this convention by revealing pervasive token-level redundancy in fine-tuning large models.

Method: TS-PEFT uses proximal optimization to dynamically identify and skip redundant token updates during training, learning token-level sparsity patterns.
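
A minimal sketch of the token-sparsity idea, assuming a LoRA-style adapter whose per-token update is shrunk by a soft threshold (the proximal operator of a group-L1 penalty), so small updates collapse to zero; the module, rank, and threshold names are illustrative and not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TokenSparseLoRA(nn.Module):
    """LoRA-style update whose per-token contribution is soft-thresholded,
    so tokens with small update magnitude are skipped entirely (sketch only)."""
    def __init__(self, d_model: int, rank: int = 8, lam: float = 0.05):
        super().__init__()
        self.A = nn.Linear(d_model, rank, bias=False)
        self.B = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.B.weight)           # start exactly at the frozen backbone
        self.lam = lam                           # threshold controlling token sparsity

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        delta = self.B(self.A(h))                # (batch, seq, d_model) per-token update
        norm = delta.norm(dim=-1, keepdim=True)  # per-token update magnitude
        # Soft threshold (prox of lam * ||.||): small updates are zeroed out.
        scale = torch.clamp(norm - self.lam, min=0.0) / (norm + 1e-8)
        return h + scale * delta                 # redundant tokens keep h unchanged

x = torch.randn(2, 16, 64)
print(TokenSparseLoRA(64)(x).shape)              # torch.Size([2, 16, 64])
```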

Result: By discarding 40%-60% of token updates, TS-PEFT consistently matches or surpasses performance of dense baselines (LoRA, DoRA) across NLU, Commonsense Reasoning, and Visual Instruction Tuning tasks.

Conclusion: Indiscriminately updating all tokens is computationally superfluous and introduces optimization noise; learned token-level sparsity serves as a superior indicator of module importance compared to traditional weight norms.

Abstract: Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention and unveil a pervasive token-level redundancy in the fine-tuning of large models. We propose TS-PEFT, a theoretically grounded framework utilizing proximal optimization to dynamically identify and skip redundant token updates during training. Our extensive experiments across Natural Language Understanding, Commonsense Reasoning, and Visual Instruction Tuning demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Strikingly, by discarding 40%-60% of token updates, TS-PEFT consistently matches or surpasses the performance of dense baselines (e.g., LoRA, DoRA). Furthermore, we provide an in-depth analysis revealing that the learned token-level sparsity serves as a superior indicator of module importance compared to traditional weight norms, offering a novel data-driven perspective on the intrinsic adaptation mechanism of large models.

[42] Representational Stability of Truth in Large Language Models

Samantha Dies, Courtney Maynard, Germans Savcisens, Tina Eliassi-Rad

Main category: cs.CL

TL;DR: LLMs show varying stability in truth representations depending on statement familiarity: unfamiliar "neither" statements cause large boundary shifts, while familiar fictional statements remain coherently clustered.

DetailsMotivation: To understand how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations, and to assess robustness of veracity representations to perturbations in truth definitions.

Method: Introduce representational stability measured by training linear probes on LLM activations to separate true from not-true statements, then measuring how decision boundaries shift under controlled label changes. Tested on 16 open-source models across 3 factual domains with two types of neither statements: unfamiliar (fact-like assertions about entities absent from training data) and familiar (nonfactual claims from fictional contexts).
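
A hedged sketch of the probe-and-perturb protocol on synthetic stand-in activations: train a linear probe to separate true from not-true statements, flip the label assigned to the "neither" group, retrain, and count flipped judgements as a boundary-shift proxy. The activation shapes and relabeling rule here are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
acts_true    = rng.normal( 1.0, 1.0, (200, d))   # stand-ins for LLM activations
acts_false   = rng.normal(-1.0, 1.0, (200, d))
acts_neither = rng.normal( 0.2, 1.0, (200, d))   # neither-true-nor-false statements

X = np.vstack([acts_true, acts_false, acts_neither])

def fit_probe(neither_label: int) -> LogisticRegression:
    """Linear probe for true (1) vs not-true (0); only the 'neither' label changes."""
    y = np.array([1] * 200 + [0] * 200 + [neither_label] * 200)
    return LogisticRegression(max_iter=1000).fit(X, y)

probe_a = fit_probe(neither_label=0)   # neither counted as not-true
probe_b = fit_probe(neither_label=1)   # neither counted as true

flipped = np.mean(probe_a.predict(X) != probe_b.predict(X))
print(f"flipped truth judgements: {flipped:.1%}")
```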

Result: Unfamiliar neither statements induce largest boundary shifts (up to 40% flipped truth judgements in fragile domains like word definitions), while familiar fictional statements remain more coherently clustered with smaller changes (≤8.2%).

Conclusion: Representational stability stems more from epistemic familiarity than linguistic form. The approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty rather than optimizing for output accuracy alone.

Abstract: Large language models (LLMs) are widely used for factual tasks such as “What treats asthma?” or “What is the capital of Latvia?”. However, it remains unclear how stably LLMs encode distinctions between true, false, and neither-true-nor-false content in their internal probabilistic representations. We introduce representational stability as the robustness of an LLM’s veracity representations to perturbations in the operational definition of truth. We assess representational stability by (i) training a linear probe on an LLM’s activations to separate true from not-true statements and (ii) measuring how its learned decision boundary shifts under controlled label changes. Using activations from sixteen open-source models and three factual domains, we compare two types of neither statements. The first are fact-like assertions about entities we believe to be absent from any training data. We call these unfamiliar neither statements. The second are nonfactual claims drawn from well-known fictional contexts. We call these familiar neither statements. The unfamiliar statements induce the largest boundary shifts, producing up to $40\%$ flipped truth judgements in fragile domains (such as word definitions), while familiar fictional statements remain more coherently clustered and yield smaller changes ($\leq 8.2\%$). These results suggest that representational stability stems more from epistemic familiarity than from linguistic form. More broadly, our approach provides a diagnostic for auditing and training LLMs to preserve coherent truth assignments under semantic uncertainty, rather than optimizing for output accuracy alone.

[43] Mortgage Language Model: Domain-Adaptive Pretraining with Residual Instruction, Alignment Tuning, and Task-Specific Routing

Manish Jain, Satheesh Kumar Ponnambalam, Salman Faroz, Chandrakanth Lns, Vinay Sharma

Main category: cs.CL

TL;DR: MortgageLLM is a domain-specific LLM for mortgage finance using a dual-expert approach to overcome performance trade-offs between structured tasks and conversational fidelity.

DetailsMotivation: LLMs lack domain-specific knowledge for specialized sectors like mortgage finance, and single multi-task models suffer from performance trade-offs where optimizing for structured tasks degrades conversational capabilities.

Method: Dual-track specialization from LLaMA-3.1-8B base model creates two specialists: conversational Q&A model and structured task model for classification/summarization, using instruction residual technique to restore instruction-following without SFT, with intelligent task routing via few-shot classification.
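
A minimal sketch of the instruction-residual idea under the common reading that the instruct-minus-base weight delta is added back onto the domain-adapted checkpoint; the state-dict handling and toy tensors are illustrative, not the paper's code.

```python
import torch

def apply_instruction_residual(base_sd, instruct_sd, domain_sd):
    """Restore instruction following after domain-adaptive pretraining:
    theta_domain_instruct = theta_domain + (theta_instruct - theta_base)."""
    merged = {}
    for name, w_domain in domain_sd.items():
        residual = instruct_sd[name] - base_sd[name]   # instruction-tuning delta
        merged[name] = w_domain + residual
    return merged

# Toy example with fake single-tensor "models".
base_sd     = {"w": torch.zeros(2, 2)}
instruct_sd = {"w": torch.ones(2, 2) * 0.1}   # base + instruction tuning
domain_sd   = {"w": torch.ones(2, 2)}         # base + domain pretraining
print(apply_instruction_residual(base_sd, instruct_sd, domain_sd)["w"])
```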

Result: MLM v2 significantly outperforms base LLaMA-3.1-8B-Instruct: LLM-as-a-Judge scores of 4.58 vs 3.99 (summarization), 4.09 vs 4.0 (Q&A), 2.6 vs 1.2 (classification); BERTScore improvements of 0.77 vs 0.74, 0.68 vs 0.58, 0.75 vs 0.73 respectively.

Conclusion: The dual-expert architecture effectively addresses domain specialization challenges in mortgage finance, enabling both conversational fidelity and structured task performance without trade-offs, validated by substantial benchmark improvements.

Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.

[44] Understanding LLM Reasoning for Abstractive Summarization

Haohan Yuan, Haopeng Zhang

Main category: cs.CL

TL;DR: Reasoning strategies in LLMs for summarization show mixed results - explicit reasoning improves fluency but hurts factual faithfulness, while implicit reasoning does the opposite. More reasoning doesn’t always mean better summaries.

DetailsMotivation: While LLMs excel at analytical reasoning tasks, their effectiveness for abstractive summarization is widely assumed but not properly verified. The paper aims to systematically test whether reasoning capabilities actually improve summarization quality and faithfulness.

Method: Tailored general reasoning strategies to summarization domain, then conducted large-scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models across 8 diverse datasets. Evaluated both summary quality and factual faithfulness.

Result: Reasoning is not universally effective - its impact depends on specific strategy and context. Found trade-off: explicit reasoning strategies improve fluency but reduce factual grounding, while implicit reasoning in LRMs shows opposite pattern. Increasing reasoning budget doesn’t improve and can even hurt factual consistency.

Conclusion: Effective summarization requires faithful compression rather than creative over-thinking. The study reveals nuanced relationship between reasoning and summarization, challenging assumptions about universal benefits of reasoning for this task.

Abstract: While the reasoning capabilities of Large Language Models (LLMs) excel in analytical tasks such as mathematics and code generation, their utility for abstractive summarization remains widely assumed but largely unverified. To bridge this gap, we first tailor general reasoning strategies to the summarization domain. We then conduct a systematic, large scale comparative study of 8 reasoning strategies and 3 Large Reasoning Models (LRMs) across 8 diverse datasets, assessing both summary quality and faithfulness. Our findings show that reasoning is not a universal solution and its effectiveness is highly dependent on the specific strategy and context. Specifically, we observe a trade-off between summary quality and factual faithfulness: explicit reasoning strategies tend to improve fluency at the expense of factual grounding, while implicit reasoning in LRMs exhibits the inverse pattern. Furthermore, increasing an LRM’s internal reasoning budget does not improve, and can even hurt, factual consistency, suggesting that effective summarization demands faithful compression rather than creative over-thinking.

[45] Arbitrage: Efficient Reasoning via Advantage-Aware Speculation

Monishwaran Maheswaran, Rishabh Tiwari, Yuezhou Hu, Kerem Dilmen, Coleman Hooper, Haocheng Xi, Nicholas Lee, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Main category: cs.CL

TL;DR: Arbitrage: A step-level speculative decoding framework that uses a lightweight router to dynamically choose between draft and target models for reasoning tasks, reducing inference latency by up to ~2x at matched accuracy.

DetailsMotivation: Traditional speculative decoding struggles with reasoning tasks due to token mismatches in semantically equivalent steps. Step-level verification helps but still wastes compute by regenerating many rejected steps with little improvement.

Method: Proposes Arbitrage framework with a lightweight router trained to predict when the target model will produce meaningfully better steps. Instead of fixed acceptance thresholds, it dynamically routes generation based on relative advantage between draft and target models, approximating an ideal Arbitrage Oracle.
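
A hedged sketch of advantage-aware step routing: a draft model proposes each reasoning step, a lightweight router predicts whether the target model would produce a meaningfully better step, and only then is the step regenerated. The `draft`, `target`, and `router` callables and the stop marker are hypothetical stand-ins, not the paper's API.

```python
from typing import Callable, List

def arbitrage_decode(prompt: str,
                     draft: Callable[[str], str],          # fast draft model: next step
                     target: Callable[[str], str],         # strong target model: next step
                     router: Callable[[str, str], float],  # predicted advantage of target
                     threshold: float = 0.5,
                     max_steps: int = 16) -> List[str]:
    """Route each reasoning step to the draft or the target model (illustrative)."""
    steps, context = [], prompt
    for _ in range(max_steps):
        proposal = draft(context)
        # Regenerate with the target only when the router expects a meaningful gain.
        step = target(context) if router(context, proposal) > threshold else proposal
        steps.append(step)
        context += "\n" + step
        if "ANSWER:" in step:              # hypothetical stop marker
            break
    return steps

# Toy run with dummy callables.
print(arbitrage_decode("2+2=?",
                       draft=lambda c: "ANSWER: 4",
                       target=lambda c: "ANSWER: 4",
                       router=lambda c, p: 0.1))
```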

Result: Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level speculative decoding baselines, reducing inference latency by up to ~2x at matched accuracy.

Conclusion: Arbitrage achieves near-optimal efficiency-accuracy trade-offs by intelligently routing generation between draft and target models, significantly improving the performance-cost ratio for reasoning tasks with LLMs.

Abstract: Modern Large Language Models achieve impressive reasoning capabilities with long Chain of Thoughts, but they incur substantial computational cost during inference, and this motivates techniques to improve the performance-cost ratio. Among these techniques, Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens, which are then verified in parallel by a more capable target model. However, due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. Although recent works have shifted to step-level semantic verification, which improve efficiency by accepting or rejecting entire reasoning steps, existing step-level methods still regenerate many rejected steps with little improvement, wasting valuable target compute. To address this challenge, we propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models. Instead of applying a fixed acceptance threshold, Arbitrage uses a lightweight router trained to predict when the target model is likely to produce a meaningfully better step. This routing approximates an ideal Arbitrage Oracle that always chooses the higher-quality step, achieving near-optimal efficiency-accuracy trade-offs. Across multiple mathematical reasoning benchmarks, Arbitrage consistently surpasses prior step-level Speculative Decoding baselines, reducing inference latency by up to $\sim2\times$ at matched accuracy.

[46] Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support

Raunak Jain, Mudita Khurana

Main category: cs.CL

TL;DR: The paper identifies a “complementarity gap” where human-AI teams don’t outperform the best individuals in high-stakes decisions, and proposes Collaborative Causal Sensemaking (CCS) as a research agenda to develop LLM agents as collaborative partners rather than just answer engines.

DetailsMotivation: Current LLM-based agents fail to reliably outperform the best individuals in human-AI teams for high-stakes expert decision support, due to a fundamental mismatch: agents are trained as answer engines rather than as partners in collaborative sensemaking processes.

Method: Proposes Collaborative Causal Sensemaking (CCS) as a research agenda with three key components: 1) new training environments that reward collaborative thinking, 2) representations for shared human-AI mental models, and 3) evaluation centered on trust and complementarity rather than just accuracy.

Result: The paper presents a conceptual framework identifying sensemaking (co-constructing causal explanations, surfacing uncertainties, and adapting goals) as the key missing capability in current LLM agent training pipelines.

Conclusion: Developing CCS capabilities can advance multi-agent systems research toward creating agents that “think with their human partners rather than for them,” bridging the complementarity gap in high-stakes decision support.

Abstract: LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. These directions can advance MAS research toward agents that think with their human partners rather than for them.

cs.CV

[47] Detection of Cyberbullying in GIF using AI

Pal Dave, Xiaohong Yuan, Madhuri Siddula, Kaushik Roy

Main category: cs.CV

TL;DR: Researchers collected a GIF dataset from Twitter using cyberbullying-related hashtags and applied a VGG16 deep learning model to detect cyberbullying in GIFs with 97% accuracy.

DetailsMotivation: Cyberbullying is escalating on social media through various formats including GIFs/stickers, but existing research focuses mainly on text and images, with very few studies addressing GIF-based cyberbullying detection.

Method: 1) Extracted cyberbullying-related hashtags from Twitter, 2) Used GIPHY API to download GIFs based on these hashtags, 3) Created dataset of 4100+ GIFs (cyberbullying and non-cyberbullying), 4) Applied pre-trained VGG16 deep learning model for classification.
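
A minimal sketch of the classification stage, assuming GIF frames are extracted with Pillow and scored with an ImageNet-pretrained VGG16 backbone plus a fresh binary head; the frame-averaging pooling and the file path are assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image, ImageSequence

# Pre-trained VGG16 backbone with a new binary head (cyberbullying vs. not).
backbone = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone.classifier[6] = nn.Linear(4096, 2)
backbone.eval()

to_tensor = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def score_gif(path: str) -> torch.Tensor:
    """Average the binary logits over all frames of a GIF (illustrative pooling)."""
    frames = [to_tensor(f.convert("RGB"))
              for f in ImageSequence.Iterator(Image.open(path))]
    with torch.no_grad():
        logits = backbone(torch.stack(frames))
    return logits.mean(dim=0).softmax(dim=-1)   # [p(non-bullying), p(bullying)]

# probs = score_gif("example.gif")  # hypothetical file path
```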

Result: The VGG16 model achieved 97% accuracy in detecting cyberbullying from GIFs. The researchers also provide a GIF dataset for future research in this area.

Conclusion: This study successfully demonstrates the feasibility of using deep learning models to detect cyberbullying in GIFs, addressing a gap in existing research and providing a valuable dataset for further investigation.

Abstract: Cyberbullying is a well-known social issue, and it is escalating day by day. Due to the rapid development of the internet, social media provide many different ways for users to express their opinions and exchange information. Cyberbullying occurs on social media through text messages, comments, shared images, GIFs or stickers, and audio and video. Much research has been done on detecting cyberbullying in textual data, and some studies address images, but very few address cyberbullying in GIFs/stickers. We collected a GIF dataset from Twitter and applied a deep learning model to detect cyberbullying in it. First, we extracted cyberbullying-related hashtags from Twitter. We then used these hashtags to download GIF files through the publicly available GIPHY API. We collected over 4,100 GIFs covering both cyberbullying and non-cyberbullying content. We applied the pre-trained deep learning model VGG16 to detect cyberbullying, and the model achieved an accuracy of 97%. Our work also provides the GIF dataset to researchers working in this area.

[48] PAVAS: Physics-Aware Video-to-Audio Synthesis

Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji

Main category: cs.CV

TL;DR: PAVAS introduces physics-aware video-to-audio synthesis by incorporating object-level physical parameters (mass, velocity) into a latent diffusion model, enabling more physically realistic sound generation.

DetailsMotivation: Current V2A models are appearance-driven and lack consideration of the physical factors that shape real-world sounds, limiting their ability to generate physically plausible audio.

Method: Uses Physics-Driven Audio Adapter (Phy-Adapter) that receives object-level physical parameters from Physical Parameter Estimator (PPE). PPE uses VLM to infer object mass and segmentation-based 3D reconstruction for motion trajectory and velocity computation, integrated into latent diffusion framework.

Result: Outperforms existing V2A models in both quantitative and qualitative evaluations, producing physically plausible and perceptually coherent audio. Introduces VGG-Impact benchmark and Audio-Physics Correlation Coefficient (APCC) metric for evaluation.

Conclusion: Incorporating physical reasoning into V2A generation significantly improves audio realism and physical plausibility, addressing limitations of appearance-driven approaches.

Abstract: Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.

[49] Near-real time fires detection using satellite imagery in Sudan conflict

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart

Main category: cs.CV

TL;DR: Deep learning with 4-band Planet Labs imagery enables near real-time monitoring of fire damage in Sudan conflicts, outperforming baseline methods with minimal delay.

DetailsMotivation: The ongoing war in Sudan creates an urgent need for rapid conflict monitoring. Advances in deep learning and readily available satellite imagery provide an opportunity for near real-time analysis of conflict impacts like fire damage.

Method: Uses 4-band imagery from Planet Labs combined with a deep learning model to monitor fire damage in armed conflicts. The approach is validated through five case studies in Sudan, comparing against baseline methods.

Result: The automated deep learning method captures active fires and charred areas more accurately than baseline approaches. Interestingly, using 8-band imagery or time series of such imagery provides only marginal gains over the simpler 4-band approach.

Conclusion: Fire damage in armed conflicts can be effectively monitored with minimal delay using 4-band satellite imagery and deep learning, offering a practical solution for rapid conflict analysis without needing more complex 8-band or time series data.

Abstract: The challenges of the ongoing war in Sudan highlight the need for rapid monitoring and analysis of such conflicts. Advances in deep learning and readily available satellite remote sensing imagery allow for near real-time monitoring. This paper uses 4-band imagery from Planet Labs with a deep learning model to show that fire damage in armed conflicts can be monitored with minimal delay. We demonstrate the effectiveness of our approach using five case studies in Sudan. We show that, compared to a baseline, the automated method captures the active fires and charred areas more accurately. Our results indicate that using 8-band imagery or time series of such imagery only results in marginal gains.

[50] Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

Zekai Luo, Zongze Du, Zhouhang Zhu, Hao Zhong, Muzhi Zhu, Wen Wang, Yuling Xi, Chenchen Jing, Hao Chen, Chunhua Shen

Main category: cs.CV

TL;DR: LivingSwap is the first video reference guided face swapping model that uses keyframes as conditioning signals to achieve high fidelity and temporal consistency in long video sequences.

DetailsMotivation: Video face swapping in film production needs high fidelity and temporal consistency over long sequences, but current methods struggle with these challenges. The paper explores whether source video attributes can be leveraged like reference-guided image editing to improve both aspects.

Method: Uses keyframes as conditioning signals to inject target identity, combining keyframe conditioning with video reference guidance for temporal stitching. Created Face2Face dataset for training and reversed data pairs for reliable ground-truth supervision.

Result: Achieves state-of-the-art results, seamlessly integrating target identity with source video’s expressions, lighting, and motion while significantly reducing manual effort in production workflows.

Conclusion: LivingSwap successfully addresses video face swapping challenges by leveraging reference guidance and keyframe conditioning, enabling high-fidelity, temporally consistent results for production applications.

Abstract: Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video reference guided face swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video’s expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap

[51] Restrictive Hierarchical Semantic Segmentation for Stratified Tooth Layer Detection

Ryan Banks, Camila Lindoni Azevedo, Hongying Tang, Yunpeng Li

Main category: cs.CV

TL;DR: Hierarchical semantic segmentation framework for dental anatomy using recurrent level-wise prediction with feature conditioning and consistency constraints improves segmentation accuracy and anatomical coherence.

DetailsMotivation: Existing hierarchy-aware segmentation methods only use weak supervision through loss functions, lacking explicit anatomical structure encoding needed for accurate dental disease staging.

Method: Recurrent level-wise prediction with restrictive output heads and top-down feature conditioning using FiLM modulation, probabilistic composition rules, and hierarchical loss combining Dice, cross-entropy, and consistency terms.
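
A minimal sketch of conditioning child-level features on parent-class probabilities with FiLM (feature-wise linear modulation); the channel sizes and the global pooling of parent probabilities are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FiLMFromParent(nn.Module):
    """Modulate child-level feature maps with gamma/beta predicted from the
    parent-class probability map (sketch of top-down feature conditioning)."""
    def __init__(self, n_parent_classes: int, n_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(n_parent_classes, n_channels)
        self.to_beta = nn.Linear(n_parent_classes, n_channels)

    def forward(self, feats: torch.Tensor, parent_probs: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); parent_probs: (B, P, H, W) from the previous level.
        pooled = parent_probs.mean(dim=(2, 3))           # (B, P) global parent context
        gamma = self.to_gamma(pooled)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(pooled)[:, :, None, None]
        return gamma * feats + beta

feats = torch.randn(1, 32, 64, 64)
parent_probs = torch.softmax(torch.randn(1, 4, 64, 64), dim=1)
print(FiLMFromParent(4, 32)(feats, parent_probs).shape)   # torch.Size([1, 32, 64, 64])
```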

Result: Hierarchical variants consistently increased IoU, Dice, and recall (especially for fine-grained anatomies) and produced more anatomically coherent masks, though with increased false positives (higher recall than precision).

Conclusion: Explicit hierarchical structuring improves both performance and clinical plausibility in dental imaging, particularly beneficial in low-data regimes.

Abstract: Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. A way of introducing this within semantic segmentation models is by utilising hierarchy-aware methodologies. However, existing hierarchy-aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level-wise prediction scheme with restrictive output heads and top-down feature conditioning. At each depth of the class tree, the backbone is re-run on the original image concatenated with logits from the previous level. Child class features are conditioned using Feature-wise Linear Modulation of their parent class probabilities, to modulate child feature spaces for fine grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. Hierarchical loss combines per-level class weighted Dice and cross entropy loss and a consistency term loss, ensuring parent predictions are the sum of their children. We validate our approach on our proposed dataset, TL-pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations, of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5-fold cross validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine-grained anatomies, and produce more anatomically coherent masks. However, hierarchical variants also demonstrated increased recall over precision, implying increased false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low data dental imaging regimes.

[52] FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Jiyoon Pyo, Yuankun Jiao, Dongwon Jung, Zekun Li, Leeje Jang, Sofia Kirsanova, Jina Kim, Yijun Lin, Qin Liu, Junyi Xie, Hadi Askari, Nan Xu, Muhao Chen, Yao-Yi Chiang

Main category: cs.CV

TL;DR: FRIEDA benchmark evaluates complex cartographic reasoning in LVLMs, revealing significant gaps in spatial intelligence compared to human performance.

DetailsMotivation: Current map VQA benchmarks treat maps as charts, missing the layered symbology and spatial relations (topological, metric, directional) essential for real-world cartographic reasoning needed in disaster response and urban planning.

Method: Created FRIEDA benchmark using real map images from various domains/regions, targeting all three spatial relation categories with multi-step inference questions requiring cross-map grounding. Evaluated 11 LVLMs in direct and contextual settings.

Result: Top models (Gemini-2.5-Pro and GPT-5-Think) achieved only 38.20% and 37.20% accuracy respectively, far below human performance of 84.87%, revealing persistent gaps in multi-step cartographic reasoning.

Conclusion: FRIEDA establishes a rigorous benchmark exposing significant limitations in LVLMs’ spatial intelligence, positioning it as essential for driving progress in cartographic reasoning capabilities.

Abstract: Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model studies on map visual question-answering often treat maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20% and 37.20% accuracy, respectively, far below human performance of 84.87%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.

[53] SSplain: Sparse and Smooth Explainer for Retinopathy of Prematurity Classification

Elifnur Sunger, Tales Imbiriba, Peter Campbell, Deniz Erdogmus, Stratis Ioannidis, Jennifer Dy

Main category: cs.CV

TL;DR: SSplain is a new explainer method for ROP classification that generates pixel-wise explanations while preserving image smoothness and sparsity, outperforming existing methods.

DetailsMotivation: Existing explainer methods for medical image classification fail to preserve important image structures like smoothness and sparsity, making explanations less realistic and harder for clinicians to trust and understand.

Method: SSplain formulates an optimization problem with combinatorial constraints to generate pixel-wise explanations that enforce smoothness and sparsity, solved using the Alternating Direction Method of Multipliers (ADMM).

Result: SSplain outperforms common explainers in post-hoc accuracy and smoothness analyses, identifies clinically relevant features for ROP, and generalizes well to other public datasets.

Conclusion: SSplain provides more realistic and clinically interpretable explanations for medical image classification by preserving image structures, enhancing clinician trust and understanding of black-box neural network models.

Abstract: Neural networks are frequently used in medical diagnosis. However, due to their black-box nature, model explainers are used to help clinicians understand better and trust model outputs. This paper introduces an explainer method for classifying Retinopathy of Prematurity (ROP) from fundus images. Previous methods fail to generate explanations that preserve input image structures such as smoothness and sparsity. We introduce Sparse and Smooth Explainer (SSplain), a method that generates pixel-wise explanations while preserving image structures by enforcing smoothness and sparsity. This results in realistic explanations to enhance the understanding of the given black-box model. To achieve this goal, we define an optimization problem with combinatorial constraints and solve it using the Alternating Direction Method of Multipliers (ADMM). Experimental results show that SSplain outperforms commonly used explainers in terms of both post-hoc accuracy and smoothness analyses. Additionally, SSplain identifies features that are consistent with domain-understandable features that clinicians consider as discriminative factors for ROP. We also show SSplain’s generalization by applying it to additional publicly available datasets. Code is available at https://github.com/neu-spiral/SSplain.

[54] Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment

Youngjoon Jang, Liliane Momeni, Zifan Jiang, Joon Son Chung, Gül Varol, Andrew Zisserman

Main category: cs.CV

TL;DR: A unified model for sign language understanding that performs both sign language translation (SLT) and sign-subtitle alignment (SSA), achieving state-of-the-art results on BSL and showing strong cross-linguistic generalization to ASL.

DetailsMotivation: To develop a practical system for converting continuous signing videos into spoken language text and aligning signing with subtitles, which is essential for communication, corpus construction, and educational applications.

Method: Three-component approach: (1) lightweight visual backbone using human keypoints and lip-region images for privacy preservation, (2) Sliding Perceiver mapping network to aggregate visual features into word-level embeddings, and (3) multi-task scalable training jointly optimizing SLT and SSA with multilingual pretraining on BSL and ASL datasets.

Result: Achieved state-of-the-art results on BOBSL (BSL) dataset for both SLT and SSA, demonstrated robust zero-shot generalization and fine-tuned performance on How2Sign (ASL), showing scalable translation across different sign languages.

Conclusion: The unified model successfully bridges vision-text gap for sign language understanding, enables practical applications through privacy-preserving design, and shows promising cross-linguistic generalization potential for scalable sign language translation.

Abstract: Our aim is to develop a unified model for sign language understanding, that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Together, these two tasks enable the conversion of continuous signing videos into spoken language text and also the temporal alignment of signing with subtitles – both essential for practical communication, large-scale corpus construction, and educational applications. To achieve this, our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images while preserving signer privacy; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings to bridge the vision-text gap; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA, reinforcing both linguistic and temporal alignment. To promote cross-linguistic generalisation, we pretrain our model on large-scale sign-text corpora covering British Sign Language (BSL) and American Sign Language (ASL) from the BOBSL and YouTube-SL-25 datasets. With this multilingual pretraining and strong model design, we achieve state-of-the-art results on the challenging BOBSL (BSL) dataset for both SLT and SSA. Our model also demonstrates robust zero-shot generalisation and finetuned SLT performance on How2Sign (ASL), highlighting the potential of scalable translation across different sign languages.

[55] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking

Chandler Timm C. Doloriel, Habib Ullah, Kristian Hovde Liland, Fadi Al Machot, Ngai-Man Cheung

Main category: cs.CV

TL;DR: Frequency-domain masking improves universal deepfake detection by enhancing generalization to unseen generators while maintaining performance under model pruning for sustainable AI.

DetailsMotivation: Universal deepfake detection needs robust generalization to new AI-generated images while minimizing computational overhead for large-scale screening in the Green AI era.

Method: Introduces frequency-domain masking as a training strategy, using random masking and geometric transformations with focus on frequency masking for superior generalization properties.
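
A hedged sketch of frequency-domain masking as a training augmentation: take the 2-D FFT of an image, zero a random subset of frequency coefficients, and invert; the masking ratio and the shared-across-channels mask are assumptions rather than the paper's exact recipe.

```python
import torch

def random_frequency_mask(img: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """Zero out a random fraction of FFT coefficients of a (C, H, W) image."""
    spec = torch.fft.fft2(img)                        # complex spectrum per channel
    keep = (torch.rand(img.shape[-2:]) >= mask_ratio).float()  # shared (H, W) mask
    masked = spec * keep                              # broadcast over channels
    return torch.fft.ifft2(masked).real               # back to the pixel domain

img = torch.rand(3, 224, 224)
aug = random_frequency_mask(img)
print(aug.shape)   # torch.Size([3, 224, 224])
```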

Result: Achieves state-of-the-art generalization on GAN- and diffusion-generated datasets, maintains performance under significant model pruning, and offers scalable resource-conscious solution.

Conclusion: Frequency-based masking is a practical step toward sustainable and generalizable deepfake detection, balancing accuracy with computational efficiency.

Abstract: Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at: https://github.com/chandlerbing65nm/FakeImageDetection.

[56] Mask to Adapt: Simple Random Masking Enables Robust Continual Test-Time Learning

Chandler Timm C. Doloriel

Main category: cs.CV

TL;DR: M2A: Simple continual test-time adaptation using random masking with consistency and entropy losses, achieving strong performance without complex uncertainty/attention mechanisms.

DetailsMotivation: Current CTTA methods rely on calibrated uncertainty or stable attention scores with complex masking designs. The paper questions whether simple random masking can work under strong corruption.

Method: M2A generates short sequences of randomly masked views (spatial or frequency) and adapts using two objectives: mask consistency loss (aligning predictions across views) and entropy minimization loss (encouraging confident outputs).
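
A minimal sketch of the two adaptation objectives, assuming a classifier and a random spatial patch-masking function; the number of views, the KL-style consistency term, and the masking ratio are illustrative choices rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def random_patch_mask(x: torch.Tensor, ratio: float = 0.3, patch: int = 16) -> torch.Tensor:
    """Zero out a random fraction of non-overlapping patches of a (B, C, H, W) batch."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=x.device) >= ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return x * keep

def m2a_loss(model: torch.nn.Module, x: torch.Tensor, n_views: int = 3) -> torch.Tensor:
    """Mask-consistency loss (align masked views with the clean prediction)
    plus entropy minimization on the clean prediction (illustrative weighting)."""
    p_clean = model(x).softmax(dim=-1)
    consistency = sum(
        F.kl_div(model(random_patch_mask(x)).log_softmax(dim=-1),
                 p_clean, reduction="batchmean")
        for _ in range(n_views)
    ) / n_views
    entropy = -(p_clean * p_clean.clamp_min(1e-8).log()).sum(dim=-1).mean()
    return consistency + entropy

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(m2a_loss(model, torch.rand(4, 3, 32, 32)))
```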

Result: On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) achieves 8.3%/19.8%/39.2% mean error, outperforming or matching strong CTTA baselines. M2A (Frequency) performs worse. Ablations show simple random masking is effective and robust.

Conclusion: Simple random masking with consistency and entropy objectives is sufficient for effective test-time adaptation without relying on uncertainty or attention signals, challenging the need for complex custom masking designs.

Abstract: Distribution shifts at test time degrade image classifiers. Recent continual test-time adaptation (CTTA) methods use masking to regulate learning, but often depend on calibrated uncertainty or stable attention scores and introduce added complexity. We ask: do we need custom-made masking designs, or can a simple random masking schedule suffice under strong corruption? We introduce Mask to Adapt (M2A), a simple CTTA approach that generates a short sequence of masked views (spatial or frequency) and adapts with two objectives: a mask consistency loss that aligns predictions across different views and an entropy minimization loss that encourages confident outputs. Motivated by masked image modeling, we study two common masking families – spatial masking and frequency masking – and further compare subtypes within each (spatial: patch vs. pixel; frequency: all vs. low vs. high). On CIFAR10C/CIFAR100C/ImageNetC (severity 5), M2A (Spatial) attains 8.3%/19.8%/39.2% mean error, outperforming or matching strong CTTA baselines, while M2A (Frequency) lags behind. Ablations further show that simple random masking is effective and robust. These results indicate that a simple random masking schedule, coupled with consistency and entropy objectives, is sufficient to drive effective test-time adaptation without relying on uncertainty or attention signals.

[57] Identification of Deforestation Areas in the Amazon Rainforest Using Change Detection Models

Christian Massao Konishi, Helio Pedrini

Main category: cs.CV

TL;DR: This paper evaluates various change detection models for deforestation monitoring in the Amazon using PRODES data, addressing limitations in existing approaches through standardized evaluation, modern architectures, and preprocessing techniques.

DetailsMotivation: Preserving the Amazon Rainforest is crucial for combating climate change, protecting biodiversity, and safeguarding indigenous cultures. Existing machine learning approaches for deforestation detection using PRODES data have limitations: unsatisfactory effectiveness, lack of modern architectures such as Transformers, and absence of methodological standardization for direct comparisons between studies.

Method: The authors evaluate various change detection models on a unified dataset, including fully convolutional models and Transformer-based self-attention networks. They investigate the impact of different pre- and post-processing techniques such as filtering based on connected component size, texture replacement, and image enhancements. They also test different strategies for combining models to achieve superior results.
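
A hedged sketch of the connected-component post-processing step: label the predicted deforestation mask and drop components below a minimum pixel area; the area threshold here is an assumption, not the value used in the paper.

```python
import numpy as np
from scipy import ndimage

def filter_small_components(mask: np.ndarray, min_pixels: int = 50) -> np.ndarray:
    """Remove predicted deforestation blobs smaller than min_pixels."""
    labels, n = ndimage.label(mask.astype(bool))
    sizes = ndimage.sum(mask.astype(bool), labels, index=range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_pixels]
    return np.isin(labels, keep).astype(mask.dtype)

mask = np.zeros((100, 100), dtype=np.uint8)
mask[10:40, 10:40] = 1     # large component (kept)
mask[80:83, 80:83] = 1     # small component (removed)
print(filter_small_components(mask).sum())   # 900
```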

Result: The proposed approaches significantly improve individual model effectiveness. Through model combination strategies, they achieve an F1-score of 80.41%, which is comparable to other recent works in the literature.

Conclusion: The study demonstrates that standardized evaluation, modern architectures, and appropriate preprocessing techniques can substantially improve deforestation detection performance. The achieved F1-score shows the potential for effective machine learning approaches to support satellite-based monitoring of Amazon deforestation.

Abstract: The preservation of the Amazon Rainforest is one of the global priorities in combating climate change, protecting biodiversity, and safeguarding indigenous cultures. The Satellite-based Monitoring Project of Deforestation in the Brazilian Legal Amazon (PRODES), a project of the National Institute for Space Research (INPE), stands out as a fundamental initiative in this effort, annually monitoring deforested areas not only in the Amazon but also in other Brazilian biomes. Recently, machine learning models have been developed using PRODES data to support this effort through the comparative analysis of multitemporal satellite images, treating deforestation detection as a change detection problem. However, existing approaches present significant limitations: models evaluated in the literature still show unsatisfactory effectiveness, many do not incorporate modern architectures, such as those based on self-attention mechanisms, and there is a lack of methodological standardization that allows direct comparisons between different studies. In this work, we address these gaps by evaluating various change detection models in a unified dataset, including fully convolutional models and networks incorporating self-attention mechanisms based on Transformers. We investigate the impact of different pre- and post-processing techniques, such as filtering deforested areas predicted by the models based on the size of connected components, texture replacement, and image enhancements; we demonstrate that such approaches can significantly improve individual model effectiveness. Additionally, we test different strategies for combining the evaluated models to achieve results superior to those obtained individually, reaching an F1-score of 80.41%, a value comparable to other recent works in the literature.

[58] CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning

Zeyuan Chen, Xiang Zhang, Haiyang Xu, Jianwen Xie, Zhuowen Tu

Main category: cs.CV

TL;DR: CVP is a multimodal model for spatial reasoning inspired by human central and peripheral vision, using target-affinity tokens and allocentric grids to achieve state-of-the-art 3D scene understanding.

DetailsMotivation: Existing approaches rely on unstructured representations (point clouds, voxels, patch features) with implicit scene context injection via coordinate embeddings, resulting in limited spatial reasoning capabilities due to lack of explicit, high-level structural understanding.

Method: Introduces two complementary components in a Large Multimodal Model: 1) target-affinity token (analogous to central vision) that guides attention toward query-relevant objects, and 2) allocentric grid (akin to peripheral vision) that captures global scene context and spatial arrangements.

Result: CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.

Conclusion: The central-peripheral vision-inspired framework enables structured, context-aware understanding of complex 3D environments by combining focused attention on relevant objects with global spatial awareness.

Abstract: We present a central-peripheral vision-inspired framework (CVP), a simple yet effective multimodal model for spatial reasoning that draws inspiration from the two types of human visual fields – central vision and peripheral vision. Existing approaches primarily rely on unstructured representations, such as point clouds, voxels, or patch features, and inject scene context implicitly via coordinate embeddings. However, this often results in limited spatial reasoning capabilities due to the lack of explicit, high-level structural understanding. To address this limitation, we introduce two complementary components into a Large Multimodal Model-based architecture: target-affinity token, analogous to central vision, that guides the model’s attention toward query-relevant objects; and allocentric grid, akin to peripheral vision, that captures global scene context and spatial arrangements. These components work in tandem to enable structured, context-aware understanding of complex 3D environments. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.

[59] Fourier-RWKV: A Multi-State Perception Network for Efficient Image Dehazing

Lirong Zheng, Yanshan Li, Rui Yu, Kaihao Zhang

Main category: cs.CV

TL;DR: Fourier-RWKV: A novel dehazing framework using Multi-State Perception with linear complexity, combining spatial, frequency-domain, and semantic-relation perception for efficient non-uniform haze removal.

DetailsMotivation: Image dehazing is crucial for visual perception but challenging under real-world non-uniform haze conditions. Transformer-based methods have quadratic computational complexity that hinders real-time deployment, creating a need for efficient dehazing solutions.

Method: Proposes Fourier-RWKV framework with Multi-State Perception: (1) Spatial-form Perception via Deformable Quad-directional Token Shift (DQ-Shift) for local haze variations; (2) Frequency-domain Perception via Fourier Mix block extending RWKV’s WKV attention to Fourier domain for long-range dependencies; (3) Semantic-relation Perception via Semantic Bridge Module with Dynamic Semantic Kernel Fusion for encoder-decoder alignment.

Result: Extensive experiments on multiple benchmarks show state-of-the-art performance across diverse haze scenarios with significantly reduced computational overhead, achieving favorable trade-off between restoration quality and practical efficiency.

Conclusion: Fourier-RWKV establishes an effective dehazing framework that addresses the computational limitations of Transformers while maintaining high performance through synergistic multi-state perception, enabling real-world deployment with linear complexity.

Abstract: Image dehazing is crucial for reliable visual perception, yet it remains highly challenging under real-world non-uniform haze conditions. Although Transformer-based methods excel at capturing global context, their quadratic computational complexity hinders real-time deployment. To address this, we propose Fourier Receptance Weighted Key Value (Fourier-RWKV), a novel dehazing framework based on a Multi-State Perception paradigm. The model achieves comprehensive haze degradation modeling with linear complexity by synergistically integrating three distinct perceptual states: (1) Spatial-form Perception, realized through the Deformable Quad-directional Token Shift (DQ-Shift) operation, which dynamically adjusts receptive fields to accommodate local haze variations; (2) Frequency-domain Perception, implemented within the Fourier Mix block, which extends the core WKV attention mechanism of RWKV from the spatial domain to the Fourier domain, preserving the long-range dependencies essential for global haze estimation while mitigating spatial attenuation; (3) Semantic-relation Perception, facilitated by the Semantic Bridge Module (SBM), which utilizes Dynamic Semantic Kernel Fusion (DSK-Fusion) to precisely align encoder-decoder features and suppress artifacts. Extensive experiments on multiple benchmarks demonstrate that Fourier-RWKV delivers state-of-the-art performance across diverse haze scenarios while significantly reducing computational overhead, establishing a favorable trade-off between restoration quality and practical efficiency. Code is available at: https://github.com/Dilizlr/Fourier-RWKV.

[60] Deep Learning, Machine Learning – Digital Signal and Image Processing: From Theory to Application

Weiche Hsieh, Ziqian Bi, Junyu Liu, Benji Peng, Sen Zhang, Xuanhe Pan, Jiawei Xu, Jinlang Wang, Keyu Chen, Caitlyn Heqi Yin, Pohsun Feng, Yizhu Wen, Tianyang Wang, Ming Li, Jintao Ren, Xinyuan Song, Qian Niu, Silin Chen, Ming Liu

Main category: cs.CV

TL;DR: This paper explores the integration of Machine Learning and Deep Learning with Digital Signal Processing and Digital Image Processing techniques for computer vision applications, focusing on image enhancement, filtering, and pattern recognition using mathematical transforms like DFT and Fourier Transform.

DetailsMotivation: The motivation is to bridge traditional DSP/DIP methodologies with modern ML/DL approaches to create more robust, scalable, and high-performance solutions for computer vision tasks, addressing the need for efficient real-time data processing and automated feature extraction.

Method: The method integrates mathematical frameworks including Discrete Fourier Transform (DFT), Z-Transform, and Fourier Transform methods with ML/DL algorithms. Implementation is done using Python to optimize real-time data processing, enabling robust data manipulation and feature extraction for AI-driven tasks.
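
A minimal example of the DFT building block mentioned above, computed with NumPy's FFT and checked against the direct definition; the toy signal is arbitrary.

```python
import numpy as np

# A toy signal: two sinusoids sampled at 64 points.
n = np.arange(64)
x = np.sin(2 * np.pi * 5 * n / 64) + 0.5 * np.sin(2 * np.pi * 12 * n / 64)

# DFT via FFT versus the direct O(N^2) definition X[k] = sum_n x[n] e^{-2*pi*i*k*n/N}.
X_fft = np.fft.fft(x)
k = n[:, None]
X_direct = (x * np.exp(-2j * np.pi * k * n / 64)).sum(axis=1)

print(np.allclose(X_fft, X_direct))       # True
print(np.argsort(np.abs(X_fft))[-4:])     # dominant bins 5 and 12, plus their mirrors
```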

Result: The work demonstrates the potential of combining ML/DL with DSP/DIP to advance computer vision methodologies, providing scalable, high-performance solutions that contribute to artificial intelligence applications, automated feature extraction, and diverse domain applications.

Conclusion: The integration of ML/DL with traditional DSP/DIP techniques offers transformative potential for computer vision, enabling more sophisticated image processing, pattern recognition, and real-time data processing capabilities that advance AI applications across multiple domains.

Abstract: Digital Signal Processing (DSP) and Digital Image Processing (DIP) with Machine Learning (ML) and Deep Learning (DL) are popular research areas in Computer Vision and related fields. We highlight transformative applications in image enhancement, filtering techniques, and pattern recognition. By integrating frameworks like the Discrete Fourier Transform (DFT), Z-Transform, and Fourier Transform methods, we enable robust data manipulation and feature extraction essential for AI-driven tasks. Using Python, we implement algorithms that optimize real-time data processing, forming a foundation for scalable, high-performance solutions in computer vision. This work illustrates the potential of ML and DL to advance DSP and DIP methodologies, contributing to artificial intelligence, automated feature extraction, and applications across diverse domains.

[61] Accuracy Does Not Guarantee Human-Likeness in Monocular Depth Estimators

Yuki Kubota, Taiki Fukiage

Main category: cs.CV

TL;DR: This paper investigates the relationship between model accuracy and human similarity in monocular depth estimation, finding that improving accuracy doesn’t necessarily lead to more human-like behavior.

DetailsMotivation: The motivation is to understand whether there's a divergence between model accuracy and human perception in monocular depth estimation, similar to what's observed in object recognition. Current benchmarks rely on sensor-based ground truth rather than human perceptual estimates, creating a gap in understanding how well DNNs align with human perception.

Method: The researchers systematically investigated 69 monocular depth estimators using the KITTI dataset. They applied affine fitting to decompose prediction errors into interpretable components on a factor-by-factor basis, allowing them to dissect error patterns and compare model behavior with human perception.
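
A hedged sketch of the affine-fitting step: fit a scale and offset mapping predicted depth to ground truth by least squares, then treat the affine parameters and the residual as interpretable error components. The synthetic data and this particular least-squares formulation are illustrative, not the paper's exact decomposition.

```python
import numpy as np

def affine_decompose(pred: np.ndarray, gt: np.ndarray):
    """Fit gt ≈ a * pred + b by least squares and return (a, b, residual)."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    residual = gt - (a * pred + b)
    return a, b, residual

rng = np.random.default_rng(0)
gt = rng.uniform(2.0, 80.0, size=1000)                  # metres, KITTI-like range
pred = 0.9 * gt + 1.5 + rng.normal(0, 0.5, size=1000)   # biased, noisy prediction
a, b, res = affine_decompose(pred, gt)
print(round(a, 2), round(b, 2), round(res.std(), 2))    # global scale/shift bias + residual spread
```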

Result: Results show that while humans and DNNs share certain estimation biases (positive error correlations), there are distinct trade-off relationships between model accuracy and human similarity. Improving accuracy does not necessarily lead to more human-like behavior in depth estimation.

Conclusion: The study concludes that traditional accuracy metrics are insufficient for evaluating depth estimation models, and there’s a need for multifaceted, human-centric evaluations that consider perceptual alignment alongside accuracy.

Abstract: Monocular depth estimation is a fundamental capability for real-world applications such as autonomous driving and robotics. Although deep neural networks (DNNs) have achieved superhuman accuracy on physical-based benchmarks, a key challenge remains: aligning model representations with human perception, a promising strategy for enhancing model robustness and interpretability. Research in object recognition has revealed a complex trade-off between model accuracy and human-like behavior, raising the question of whether a similar divergence exists in depth estimation, particularly for natural outdoor scenes where benchmarks rely on sensor-based ground truth rather than human perceptual estimates. In this study, we systematically investigated the relationship between model accuracy and human similarity across 69 monocular depth estimators using the KITTI dataset. To dissect the structure of error patterns on a factor-by-factor basis, we applied affine fitting to decompose prediction errors into interpretable components. Intriguingly, our results reveal that, while humans and DNNs share certain estimation biases (positive error correlations), there are distinct trade-off relationships between model accuracy and human similarity. This finding indicates that improving accuracy does not necessarily lead to more human-like behavior, underscoring the necessity of developing multifaceted, human-centric evaluations beyond traditional accuracy.

[62] GeoLoom: High-quality Geometric Diagram Generation from Textual Input

Xiaojing Wei, Ting Zhang, Wei He, Jingdong Wang, Hua Huang

Main category: cs.CV

TL;DR: GeoLoom is a text-to-geometric-diagram generation framework using formal language translation and Monte Carlo optimization for precise spatial accuracy.

DetailsMotivation: Geometric diagram generation requires strict spatial accuracy while offering well-defined constraints, presenting both challenge and opportunity. Current methods lack the structural fidelity needed for precise geometric representations.

Method: Two-component framework: 1) Autoformalization module translating natural language to formal language GeoLingua, 2) Coordinate solver using Monte Carlo optimization to map formal constraints to precise coordinates. Introduces GeoNF dataset aligning natural and formal descriptions, and constraint-based evaluation metric for structural deviation.
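
The coordinate-solver idea can be sketched as random search over candidate coordinates that minimizes a constraint-violation loss (toy stand-in constraints below, not GeoLingua or the actual solver):

```python
# Toy sketch of a Monte Carlo coordinate solver: sample candidate coordinates and
# keep the one that best satisfies formal constraints (stand-in constraints only,
# not the paper's GeoLingua language or solver).
import numpy as np

def violation(p: np.ndarray) -> float:
    """Constraints: P lies on the unit circle and is equidistant from A and B."""
    a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
    on_circle = abs(np.linalg.norm(p) - 1.0)
    equidistant = abs(np.linalg.norm(p - a) - np.linalg.norm(p - b))
    return on_circle + equidistant

def monte_carlo_solve(n_samples: int = 20000, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-2.0, 2.0, size=(n_samples, 2))
    losses = np.array([violation(p) for p in candidates])
    return candidates[losses.argmin()]

best = monte_carlo_solve()
print(best, violation(best))   # converges near (0, 1) or (0, -1)
```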

Result: GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing mathematically grounded supervision for iterative refinement.

Conclusion: GeoLoom offers a principled foundation for interpretable and scalable diagram generation in geometric domains through formal language translation and constraint-based optimization.

Abstract: High-quality geometric diagram generation presents both a challenge and an opportunity: it demands strict spatial accuracy while offering well-defined constraints to guide generation. Inspired by recent advances in geometry problem solving that employ formal languages and symbolic solvers for enhanced correctness and interpretability, we propose GeoLoom, a novel framework for text-to-diagram generation in geometric domains. GeoLoom comprises two core components: an autoformalization module that translates natural language into a specifically designed generation-oriented formal language GeoLingua, and a coordinate solver that maps formal constraints to precise coordinates using the efficient Monte Carlo optimization. To support this framework, we introduce GeoNF, a dataset aligning natural language geometric descriptions with formal GeoLingua descriptions. We further propose a constraint-based evaluation metric that quantifies structural deviation, offering mathematically grounded supervision for iterative refinement. Empirical results demonstrate that GeoLoom significantly outperforms state-of-the-art baselines in structural fidelity, providing a principled foundation for interpretable and scalable diagram generation.

[63] The Missing Point in Vision Transformers for Universal Image Segmentation

Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi

Main category: cs.CV

TL;DR: ViT-P is a two-stage segmentation framework that decouples mask generation from classification using a Vision Transformer-based point classifier, achieving SOTA results on multiple benchmarks without requiring pre-training modifications.

DetailsMotivation: Current mask-based segmentation approaches produce high-quality masks but struggle with accurate classification, especially for ambiguous boundaries and imbalanced class distributions. There's a need for a method that can effectively classify masks without requiring extensive fine-grained annotations.

Method: Two-stage framework: (1) Proposal generator produces class-agnostic mask proposals, (2) Point-based classification model built on Vision Transformer (ViT) refines predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter that integrates various pre-trained ViTs without architectural modifications. Can leverage coarse and bounding box annotations to enhance classification without fine annotation training.
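
The point-based classification step can be illustrated by locating a mask proposal's central point and mapping it to the ViT patch token that covers it (a simplified indexing sketch, not the ViT-P model itself):

```python
# Simplified sketch: map a class-agnostic mask's central point to the index of the
# ViT patch token covering it (illustrative indexing only, not ViT-P itself).
import numpy as np

def mask_center_patch(mask: np.ndarray, patch: int = 16) -> tuple:
    """Return the mask centroid pixel and the flattened patch-token index."""
    ys, xs = np.nonzero(mask)
    cy, cx = int(ys.mean()), int(xs.mean())          # central point of the mask
    py, px = cy // patch, cx // patch                # patch grid coordinates
    tokens_per_row = mask.shape[1] // patch
    return cy, cx, py * tokens_per_row + px          # index into the token sequence

mask = np.zeros((224, 224), dtype=bool)
mask[60:120, 100:180] = True                         # a toy mask proposal
print(mask_center_patch(mask))                       # centroid and its token id
```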

Result: Achieves state-of-the-art results: 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. Demonstrates effectiveness across COCO, ADE20K, and Cityscapes datasets.

Conclusion: ViT-P effectively addresses classification challenges in segmentation by decoupling mask generation and classification, offering a flexible adapter approach that works with various pre-trained ViTs. The method reduces annotation costs by utilizing coarse annotations while maintaining strong performance across multiple segmentation tasks.

Abstract: Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P

[64] CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball

Omer Sela, Michael Chertok, Lior Wolf

Main category: cs.CV

TL;DR: CourtMotion: A spatiotemporal framework using skeletal tracking data and Graph Neural Networks + Transformers to predict basketball events like passes, shots, and steals with 35% better trajectory prediction than position-only models.

DetailsMotivation: Traditional basketball analysis using only player positions misses crucial motion indicators like body orientation, defensive stance, and shooting preparation. There's a need to understand both physical motion patterns and their semantic significance in game context.

Method: Two-stage approach: 1) Graph Neural Networks process skeletal tracking data to capture nuanced motion patterns, 2) Transformer architecture with specialized attention mechanisms models player interactions. Event projection heads explicitly connect movements to basketball events.

Result: 35% reduction in trajectory prediction error vs state-of-the-art position-based models. Consistent performance gains across key basketball analytics tasks including pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition.

Conclusion: CourtMotion provides a powerful pretrained foundation for multiple downstream basketball analytics tasks by effectively capturing the relationship between physical motion patterns and their tactical purposes in professional basketball.

Abstract: This paper presents CourtMotion, a spatiotemporal modeling framework for analyzing and predicting game events and plays as they develop in professional basketball. Anticipating basketball events requires understanding both physical motion patterns and their semantic significance in the context of the game. Traditional approaches that use only player positions fail to capture crucial indicators such as body orientation, defensive stance, or shooting preparation motions. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns, then employs a Transformer architecture with specialized attention mechanisms to model player interactions. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their tactical purposes. Experiments on NBA tracking data demonstrate significant improvements over position-only baselines: 35% reduction in trajectory prediction error compared to state-of-the-art position-based models and consistent performance gains across key basketball analytics tasks. The resulting pretrained model serves as a powerful foundation for multiple downstream tasks, with pick detection, shot taker identification, assist prediction, shot location classification, and shot type recognition demonstrating substantial improvements over existing methods.

[65] Animal Re-Identification on Microcontrollers

Yubo Chen, Di Zhao, Yun Sing Koh, Talia Xu

Main category: cs.CV

TL;DR: Proposes an on-device Animal Re-ID framework for MCU-class hardware that achieves competitive accuracy with 100x smaller models and enables fast adaptation with minimal data.

DetailsMotivation: Existing Animal Re-ID models are too large for deployment on low-power edge devices like collar tags or MCUs in wildlife monitoring and livestock management applications where wireless connectivity is limited.

Method: 1) Characterizes gap between SOTA models and MCU hardware, 2) Designs compact CNN architecture by scaling MobileNetV2 for low-resolution inputs, 3) Introduces data-efficient fine-tuning strategy requiring only 3 images per animal for site adaptation.
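
A rough sketch of the backbone-scaling idea, using torchvision's MobileNetV2 with a reduced width multiplier and an embedding head (an illustrative configuration, not the paper's exact architecture or training recipe):

```python
# Rough sketch (illustrative configuration, not the paper's exact architecture):
# a width-scaled MobileNetV2 embedder for low-resolution animal re-ID crops.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

def build_compact_embedder(width_mult: float = 0.35, embed_dim: int = 128) -> nn.Module:
    backbone = mobilenet_v2(width_mult=width_mult)            # shrink channel widths
    in_features = backbone.classifier[-1].in_features
    backbone.classifier = nn.Linear(in_features, embed_dim)   # embedding head
    return backbone

model = build_compact_embedder()
x = torch.randn(4, 3, 96, 96)                                  # low-resolution crops
emb = nn.functional.normalize(model(x), dim=1)                 # unit-norm embeddings
print(emb.shape)                                               # torch.Size([4, 128])
print(sum(p.numel() for p in model.parameters()))              # far below server-scale models
```

Few-shot site adaptation would then fine-tune only the embedding head (or the last blocks) on the handful of labeled images per identity, though the exact recipe is the paper's, not shown here.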

Result: Achieves competitive retrieval accuracy across 6 public datasets with 100x model size reduction. On cattle dataset, performs fully on-device inference with minimal accuracy drop and unchanged Top-1 accuracy compared to cluster version.

Conclusion: Practical, adaptable Animal Re-ID is achievable on MCU-class devices, enabling scalable deployment in real field environments for wildlife monitoring and livestock management.

Abstract: Camera-based animal re-identification (Animal Re-ID) can support wildlife monitoring and precision livestock management in large outdoor environments with limited wireless connectivity. In these settings, inference must run directly on collar tags or low-power edge nodes built around microcontrollers (MCUs), yet most Animal Re-ID models are designed for workstations or servers and are too large for devices with small memory and low-resolution inputs. We propose an on-device framework. First, we characterise the gap between state-of-the-art Animal Re-ID models and MCU-class hardware, showing that straightforward knowledge distillation from large teachers offers limited benefit once memory and input resolution are constrained. Second, guided by this analysis, we design a high-accuracy Animal Re-ID architecture by systematically scaling a CNN-based MobileNetV2 backbone for low-resolution inputs. Third, we evaluate the framework with a real-world dataset and introduce a data-efficient fine-tuning strategy to enable fast adaptation with just three images per animal identity at a new site. Across six public Animal Re-ID datasets, our compact model achieves competitive retrieval accuracy while reducing model size by over two orders of magnitude. On a self-collected cattle dataset, the deployed model performs fully on-device inference with only a small accuracy drop and unchanged Top-1 accuracy relative to its cluster version. We demonstrate that practical, adaptable Animal Re-ID is achievable on MCU-class devices, paving the way for scalable deployment in real field environments.

[66] Blur2Sharp: Human Novel Pose and View Synthesis with Generative Prior Refinement

Chia-Hern Lai, I-Hsuan Lo, Yen-Ku Yeh, Thanh-Nguyen Truong, Ching-Chun Huang

Main category: cs.CV

TL;DR: Blur2Sharp generates sharp, geometrically consistent novel-view human avatars from single reference view using 3D-aware neural rendering and diffusion models.

DetailsMotivation: Current human avatar generation methods produce either geometrically inconsistent multi-view images or blurry outputs under diverse viewing angles and complex motions, lacking both photorealism and geometric consistency.

Method: Dual-conditioning architecture: 1) Human NeRF generates geometrically coherent multi-view renderings for target poses with 3D structural guidance, 2) Diffusion model refines these renderings while preserving details, 3) Hierarchical feature fusion incorporates texture, normal, and semantic priors from SMPL models.

Result: Blur2Sharp consistently surpasses state-of-the-art techniques in novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.

Conclusion: The proposed framework successfully addresses the trade-off between geometric consistency and photorealism in human avatar generation, enabling sharp, geometrically consistent novel-view images from single reference view.

Abstract: The creation of lifelike human avatars capable of realistic pose variation and viewpoint flexibility remains a fundamental challenge in computer vision and graphics. Current approaches typically yield either geometrically inconsistent multi-view images or sacrifice photorealism, resulting in blurry outputs under diverse viewing angles and complex motions. To address these issues, we propose Blur2Sharp, a novel framework integrating 3D-aware neural rendering and diffusion models to generate sharp, geometrically consistent novel-view images from only a single reference view. Our method employs a dual-conditioning architecture: initially, a Human NeRF model generates geometrically coherent multi-view renderings for target poses, explicitly encoding 3D structural guidance. Subsequently, a diffusion model conditioned on these renderings refines the generated images, preserving fine-grained details and structural fidelity. We further enhance visual quality through hierarchical feature fusion, incorporating texture, normal, and semantic priors extracted from parametric SMPL models to simultaneously improve global coherence and local detail accuracy. Extensive experiments demonstrate that Blur2Sharp consistently surpasses state-of-the-art techniques in both novel pose and view generation tasks, particularly excelling under challenging scenarios involving loose clothing and occlusions.

[67] VisKnow: Constructing Visual Knowledge Base for Object Understanding

Ziwei Yao, Qiyang Wan, Ruiping Wang, Xilin Chen

Main category: cs.CV

TL;DR: Proposes Visual Knowledge Base (VisKnow) framework to structure multi-modal object knowledge as graphs, with AnimalKB as case study covering 406 animal categories with 22K textual triplets, 420K images, and region annotations.

DetailsMotivation: Current object understanding in computer vision is limited to simple category recognition. Comprehensive object understanding requires multi-modal data (visual annotations, textual knowledge) that is currently task-oriented and not systematically organized.

Method: VisKnow framework extracts multi-modal, object-level knowledge by integrating enriched aligned text and image-source knowledge with region annotations at object and part levels, combining expert design and large-scale model application.

Result: Built AnimalKB covering 406 animal categories with 22K textual knowledge triplets, 420K images, and region annotations. Experiments show AnimalKB enhances zero-shot recognition, fine-grained VQA, and serves as benchmark for knowledge graph completion and part segmentation.

Conclusion: Visual knowledge bases can advance visual understanding and practical applications. The framework demonstrates potential for automatically constructing comprehensive object knowledge bases.

Abstract: Understanding objects is fundamental to computer vision. Beyond object recognition that provides only a category label as typical output, in-depth object understanding represents a comprehensive perception of an object category, involving its components, appearance characteristics, inter-category relationships, contextual background knowledge, etc. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, these data are generally task-oriented and not systematically organized enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base that structures multi-modal object knowledge as graphs, and present a construction framework named VisKnow that extracts multi-modal, object-level knowledge for object understanding. This framework integrates enriched aligned text and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a specific case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations. A series of experiments showcase how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications. The project page is available at https://vipl-vsu.github.io/VisKnow.

[68] SOP^2: Transfer Learning with Scene-Oriented Prompt Pool on 3D Object Detection

Ching-Hung Cheng, Hsiu-Fu Wu, Bing-Chen Wu, Khanh-Phong Bui, Van-Tin Luu, Ching-Chun Huang

Main category: cs.CV

TL;DR: This paper explores prompt tuning methods for 3D object detection, investigating whether models trained on Waymo dataset can serve as foundation models for other scenarios, and proposes a Scene-Oriented Prompt Pool (SOP²).

DetailsMotivation: To explore the effectiveness of prompt tuning methods in 3D object detection, inspired by the success of prompt tuning in NLP with LLMs. The goal is to investigate whether a model trained on large-scale Waymo dataset can serve as a foundation model and adapt to other 3D detection scenarios.

Method: Sequentially examines impact of prompt tokens and prompt generators, and proposes a Scene-Oriented Prompt Pool (SOP²) for 3D object detection. Investigates transfer learning techniques like fine-tuning and prompt tuning for adapting foundation models to various downstream tasks.
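
The prompt-pool mechanism can be sketched as a learned pool of (key, prompt) pairs from which scene-relevant prompts are retrieved by feature similarity (a generic prompt-pool sketch with assumed details, not the SOP² implementation):

```python
# Generic prompt-pool sketch (assumed details, not the SOP^2 implementation):
# retrieve the prompts whose learned keys best match a scene-level feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size: int = 20, prompt_len: int = 4,
                 dim: int = 256, top_k: int = 3):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, scene_feat: torch.Tensor) -> torch.Tensor:
        # scene_feat: (B, dim) pooled descriptor of the point-cloud scene
        sim = F.cosine_similarity(scene_feat[:, None, :], self.keys[None], dim=-1)
        idx = sim.topk(self.top_k, dim=-1).indices             # (B, top_k)
        selected = self.prompts[idx]                            # (B, top_k, L, dim)
        return selected.flatten(1, 2)                           # prompt tokens to prepend

pool = PromptPool()
tokens = pool(torch.randn(2, 256))
print(tokens.shape)   # torch.Size([2, 12, 256])
```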

Result: Demonstrates the effectiveness of prompt pools in 3D object detection, showing that prompt tuning methods can work effectively in the 3D domain similar to their success in NLP.

Conclusion: Prompt tuning methods are effective for 3D object detection, and the proposed Scene-Oriented Prompt Pool (SOP²) shows promise. The work aims to inspire future researchers to explore the potential of prompts in the 3D field.

Abstract: With the rise of Large Language Models (LLMs) such as GPT-3, these models exhibit strong generalization capabilities. Through transfer learning techniques such as fine-tuning and prompt tuning, they can be adapted to various downstream tasks with minimal parameter adjustments. This approach is particularly common in the field of Natural Language Processing (NLP). This paper aims to explore the effectiveness of common prompt tuning methods in 3D object detection. We investigate whether a model trained on the large-scale Waymo dataset can serve as a foundation model and adapt to other scenarios within the 3D object detection field. This paper sequentially examines the impact of prompt tokens and prompt generators, and further proposes a Scene-Oriented Prompt Pool (SOP²). We demonstrate the effectiveness of prompt pools in 3D object detection, with the goal of inspiring future researchers to delve deeper into the potential of prompts in the 3D field.

[69] New VVC profiles targeting Feature Coding for Machines

Md Eimran Hossain Eimon, Ashan Perera, Juan Merlos, Velibor Adzic, Hari Kalva

Main category: cs.CV

TL;DR: The paper proposes three lightweight VVC profiles (Fast, Faster, Fastest) for compressing neural network features in split inference systems, achieving significant encoding speed improvements with minimal compression efficiency loss.

DetailsMotivation: Traditional video codecs are optimized for perceptual quality based on human visual system models, but these assumptions don't apply to intermediate neural network features in split inference systems. Features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant.

Method: The authors investigate using Versatile Video Coding (VVC) for feature compression under MPEG-AI FCM standard. They perform tool-level analysis to understand impact of individual coding components on compression efficiency and downstream task accuracy. Based on insights, they propose three lightweight VVC profiles.
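
The BD-Rate figures reported for these profiles compare two rate-distortion curves; the standard Bjøntegaard computation (independent of this paper) can be written in a few lines of NumPy:

```python
# Standard Bjoentegaard delta-rate computation (independent of the paper): the
# average bitrate difference between two rate-distortion curves at equal quality.
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test) -> float:
    """Percent bitrate change of `test` vs `ref` over the shared quality range."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)        # cubic fit: quality -> log10(rate)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (10.0 ** avg_diff - 1.0) * 100.0         # negative values mean bitrate savings

ref = ([1000, 2000, 4000, 8000], [34.0, 36.5, 38.5, 40.0])
test = ([950, 1900, 3800, 7600], [34.1, 36.6, 38.6, 40.1])
print(f"BD-Rate: {bd_rate(*ref, *test):.2f}%")
```

For feature coding, the quality axis is typically the downstream task accuracy rather than PSNR, but the formula is the same.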

Result: Fast profile provides 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves 1.85% BD-Rate gain with 51.5% speedup. Fastest reduces encoding time by 95.6% with only 1.71% loss in BD-Rate.

Conclusion: The proposed lightweight VVC profiles effectively balance compression efficiency and encoding speed for neural network feature compression in split inference systems, addressing the mismatch between traditional video coding assumptions and feature compression requirements.

Abstract: Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems, where intermediate features from neural networks are transmitted instead of pixel data, these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles: Fast, Faster, and Fastest. The Fast profile provides a 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.

[70] Beyond Real Weights: Hypercomplex Representations for Stable Quantization

Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman

Main category: cs.CV

TL;DR: Progressive reparameterization using PHM layers compresses multimodal language models by replacing dense feed-forward blocks, reducing parameters and FLOPs while maintaining performance.

DetailsMotivation: Multimodal language models require large parameter capacity to align visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently.

Method: Progressive reparameterization strategy that gradually replaces dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers, using residual interpolation schedule with lightweight reconstruction and knowledge distillation losses.
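
A PHM layer parameterizes a weight matrix as a sum of Kronecker products, shrinking its parameter count roughly by the hypercomplex dimension n; a minimal PyTorch sketch of such a layer and of the residual interpolation that phases it in (an assumed form, not the authors' code):

```python
# Minimal sketch (assumed form, not the authors' code): a PHM layer whose weight is
# a sum of Kronecker products, plus the residual interpolation used to phase it in.
import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, n: int = 4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.n = n
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.1)                    # n small (n x n) factors
        self.B = nn.Parameter(torch.randn(n, out_features // n, in_features // n) * 0.1)

    def weight(self) -> torch.Tensor:
        # W = sum_i kron(A_i, B_i): full (out_features, in_features) matrix, ~1/n parameters
        return sum(torch.kron(self.A[i], self.B[i]) for i in range(self.n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().t()

dense = nn.Linear(256, 1024, bias=False)
phm = PHMLinear(256, 1024, n=4)
x = torch.randn(8, 256)
for alpha in (0.0, 0.5, 1.0):             # interpolation schedule: dense -> PHM
    y = (1 - alpha) * dense(x) + alpha * phm(x)
print(y.shape,
      sum(p.numel() for p in phm.parameters()),     # ~65k parameters
      sum(p.numel() for p in dense.parameters()))   # ~262k parameters
```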

Result: Substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. Maintains performance comparable to base models with significant reductions in model size and inference latency.

Conclusion: Progressive PHM substitution offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.

Abstract: Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.

[71] MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, Keze Wang

Main category: cs.CV

TL;DR: MM-CoT is a diagnostic benchmark that evaluates multimodal models’ Chain-of-Thought reasoning by testing visual grounding and logical coherence through constrained event chain selection with adversarial distractors.

DetailsMotivation: Existing benchmarks focus on generating explanations but neglect verifying whether reasoning is genuinely grounded in visual evidence and logically coherent. There's a need to assess if multimodal models' Chain-of-Thought reasoning is faithful to visual input and logically valid.

Method: Instead of free-form explanation generation, models must select the sole event chain satisfying two constraints: (1) visual consistency (anchored in observable evidence) and (2) logical coherence (causal and commonsense validity). Adversarial distractors are engineered to violate one of these constraints to expose distinct reasoning failures.

Result: Leading vision-language models struggle on MM-CoT, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. The benchmark shows low correlation with existing benchmarks, confirming it measures a unique combination of visual grounding and logical reasoning.

Conclusion: MM-CoT provides a foundation for developing future multimodal models that reason not just plausibly, but faithfully and coherently within the visual world, addressing critical gaps in current evaluation methods for Chain-of-Thought reasoning.

Abstract: The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.

[72] Pose-Based Sign Language Spotting via an End-to-End Encoder Architecture

Samuel Ebimobowei Johnny, Blessed Guda, Emmanuel Enejo Aaron, Assane Gueye

Main category: cs.CV

TL;DR: A novel pose-based approach for Sign Language Spotting that detects query signs in continuous sign sequences without intermediate gloss recognition, achieving 61.88% accuracy on WSLP 2025 dataset.

DetailsMotivation: Sign-to-sign retrieval (detecting specific signs in continuous sequences) is largely unexplored despite being crucial for bridging deaf-hearing communication gaps. Current approaches rely on intermediate gloss recognition or text matching, which have limitations.

Method: End-to-end model using pose keypoints from sign videos instead of RGB frames. Architecture employs encoder-only backbone with binary classification head to determine if query sign appears in target sequence. Focuses on pose representations to reduce computational cost and visual noise.
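
One way to realize an encoder-only keypoint model with a binary presence head is sketched below (simplified, with assumed dimensions; not the released architecture):

```python
# Simplified sketch with assumed dimensions (not the released architecture): an
# encoder-only model over query + target pose-keypoint frames with a binary head.
import torch
import torch.nn as nn

class SignSpotter(nn.Module):
    def __init__(self, n_keypoints: int = 75, dim: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Linear(n_keypoints * 2, dim)            # (x, y) per keypoint
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, 1)                           # present / absent logit

    def forward(self, query_pose: torch.Tensor, target_pose: torch.Tensor) -> torch.Tensor:
        # query_pose: (B, Tq, K*2), target_pose: (B, Tt, K*2) flattened keypoint frames
        tokens = self.embed(torch.cat([query_pose, target_pose], dim=1))
        tokens = torch.cat([self.cls.expand(tokens.size(0), -1, -1), tokens], dim=1)
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                         # classify from the CLS token

model = SignSpotter()
logit = model(torch.randn(2, 30, 150), torch.randn(2, 120, 150))
print(torch.sigmoid(logit).shape)   # torch.Size([2, 1])
```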

Result: Achieved 61.88% accuracy and 60.00% F1-score on Word Presence Prediction dataset from WSLP 2025 shared task, demonstrating effectiveness of pose-based framework for Sign Language Spotting.

Conclusion: The pose-based approach provides a strong foundation for automatic sign language retrieval and verification, offering computational efficiency and noise reduction compared to conventional methods. Establishes promising direction for future research in Sign Language Spotting.

Abstract: Automatic Sign Language Recognition (ASLR) has emerged as a vital field for bridging the gap between deaf and hearing communities. However, the problem of sign-to-sign retrieval or detecting a specific sign within a sequence of continuous signs remains largely unexplored. We define this novel task as Sign Language Spotting. In this paper, we present a first step toward sign language retrieval by addressing the challenge of detecting the presence or absence of a query sign video within a sentence-level gloss or sign video. Unlike conventional approaches that rely on intermediate gloss recognition or text-based matching, we propose an end-to-end model that directly operates on pose keypoints extracted from sign videos. Our architecture employs an encoder-only backbone with a binary classification head to determine whether the query sign appears within the target sequence. By focusing on pose representations instead of raw RGB frames, our method significantly reduces computational cost and mitigates visual noise. We evaluate our approach on the Word Presence Prediction dataset from the WSLP 2025 shared task, achieving 61.88% accuracy and 60.00% F1-score. These results demonstrate the effectiveness of our pose-based framework for Sign Language Spotting, establishing a strong foundation for future research in automatic sign language retrieval and verification. Code is available at https://github.com/EbimoJohnny/Pose-Based-Sign-Language-Spotting

[73] Geometry-Aware Sparse Depth Sampling for High-Fidelity RGB-D Depth Completion in Robotic Systems

Tony Salloom, Dandi Zhou, Xinhai Sun

Main category: cs.CV

TL;DR: Proposes normal-guided sparse depth sampling for depth completion that mimics real sensor reliability patterns instead of uniform random sampling.

DetailsMotivation: Current depth completion methods use unrealistic uniform random sampling of sparse depth from ground truth, ignoring that real sensors (RGB-D, stereo) have geometry-dependent and spatially nonuniform reliability patterns.

Method: Uses PCA-based surface normal estimation on RGB-D point clouds to compute per-pixel depth reliability measures, then samples sparse depth according to this reliability distribution. Integrated with Marigold-DC diffusion-based depth completion model.
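
The reliability-weighted sampling can be sketched by estimating per-point normals via local PCA on the point cloud and drawing sparse-depth samples in proportion to a derived reliability score (an assumed score, not the paper's exact measure):

```python
# Sketch (assumed scoring, not the paper's exact measure): PCA-based normals on a
# depth-derived point cloud, then sparse-depth samples drawn by a reliability weight.
import numpy as np

def pca_normal(neighbors: np.ndarray) -> np.ndarray:
    """Normal = direction of least variance of the local neighborhood."""
    centered = neighbors - neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]                                   # unit normal of the fitted local plane

def sample_sparse_depth(points: np.ndarray, k: int = 16, n_samples: int = 100,
                        seed: int = 0) -> np.ndarray:
    """points: (N, 3). Reliability ~ |cos| between the normal and the viewing axis (z)."""
    rng = np.random.default_rng(seed)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    knn = np.argsort(d2, axis=1)[:, :k]
    normals = np.stack([pca_normal(points[idx]) for idx in knn])
    reliability = np.abs(normals[:, 2]) + 1e-6      # surfaces facing the camera score higher
    prob = reliability / reliability.sum()
    return rng.choice(len(points), size=n_samples, replace=False, p=prob)

pts = np.random.default_rng(1).uniform(0, 1, size=(500, 3))
idx = sample_sparse_depth(pts)
print(idx.shape)   # (100,) indices of reliability-weighted sparse-depth samples
```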

Result: Experiments on NYU Depth v2 show improved accuracy, reduced artifacts near edges/discontinuities, and more realistic training conditions that better reflect real sensor behavior.

Conclusion: Geometry-aware sparse depth sampling produces more realistic training data and better depth completion results by mimicking real sensor reliability patterns.

Abstract: Accurate three-dimensional perception is essential for modern industrial robotic systems that perform manipulation, inspection, and navigation tasks. RGB-D and stereo vision sensors are widely used for this purpose, but the depth maps they produce are often noisy, incomplete, or biased due to sensor limitations and environmental conditions. Depth completion methods aim to generate dense, reliable depth maps from RGB images and sparse depth input. However, a key limitation in current depth completion pipelines is the unrealistic generation of sparse depth: sparse pixels are typically selected uniformly at random from dense ground-truth depth, ignoring the fact that real sensors exhibit geometry-dependent and spatially nonuniform reliability. In this work, we propose a normal-guided sparse depth sampling strategy that leverages PCA-based surface normal estimation on the RGB-D point cloud to compute a per-pixel depth reliability measure. The sparse depth samples are then drawn according to this reliability distribution. We integrate this sampling method with the Marigold-DC diffusion-based depth completion model and evaluate it on NYU Depth v2 using the standard metrics. Experiments show that our geometry-aware sparse depth improves accuracy, reduces artifacts near edges and discontinuities, and produces more realistic training conditions that better reflect real sensor behavior.

[74] FastBEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen, Hui Song, Wei Tao, ShanHui Mo, Shuang Zhang, Xiao Hua, TianKun Zhao

Main category: cs.CV

TL;DR: FastBEV++ is a camera-only BEV perception framework that achieves state-of-the-art performance (0.359 NDS on nuScenes) with exceptional real-time speed (134+ FPS on Tesla T4) by using standard operator primitives instead of custom CUDA kernels.

DetailsMotivation: Current camera-only BEV perception systems face a fundamental tension between achieving state-of-the-art performance and maintaining deployment tractability on vehicles, due to reliance on computationally expensive view transformations and platform-specific kernels.

Method: Introduces two principles: 1) “Deployable by Design” via novel view transformation paradigm that decomposes projection into standard Index-Gather-Reshape pipeline using deterministic pre-sorting, enabling TensorRT-native portability without custom CUDA kernels; 2) “Fast by Algorithm” via end-to-end depth-aware fusion with jointly learned depth modulation, temporal aggregation, and robust data augmentation.
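
The Index-Gather-Reshape idea can be sketched as follows: an offline step precomputes, for each BEV cell, the flattened index of its source camera feature, and a single gather plus reshape then fills the BEV grid at runtime (toy indices below, not the released projection code):

```python
# Toy sketch of the Index-Gather-Reshape view transformation (illustrative indices,
# not the released projection code): one precomputed gather fills the BEV grid.
import torch

def build_lookup(n_cells: int, n_src: int, seed: int = 0) -> torch.Tensor:
    # Offline step: camera calibration determines which flattened image-feature
    # location feeds each BEV cell; here it is faked with random, pre-sorted indices.
    g = torch.Generator().manual_seed(seed)
    return torch.sort(torch.randint(0, n_src, (n_cells,), generator=g)).values

C, H, W = 64, 32, 88          # camera feature map
bev_h, bev_w = 100, 100       # BEV grid
feats = torch.randn(C, H * W)                        # flattened image features (C, H*W)
lookup = build_lookup(bev_h * bev_w, H * W)          # Index (computed once offline)

bev = feats[:, lookup]                               # Gather: (C, bev_h*bev_w)
bev = bev.reshape(C, bev_h, bev_w)                   # Reshape into the BEV plane
print(bev.shape)              # torch.Size([64, 100, 100])
```

Because the runtime path is just gather, matrix multiplication, and reshape, it maps onto standard operator primitives without custom CUDA kernels, which is the portability argument made above.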

Result: Achieves new state-of-the-art 0.359 NDS on nuScenes benchmark while maintaining exceptional real-time performance exceeding 134 FPS on automotive-grade hardware (Tesla T4), demonstrating both high accuracy and deployment efficiency.

Conclusion: FastBEV++ presents a mature and scalable design philosophy for production autonomous systems by offering a solution free of custom plugins yet highly accurate, reconciling the tension between performance and deployment tractability in camera-only BEV perception.

Abstract: The advancement of camera-only Bird’s-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the “Deployable by Design” principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g., Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures fully TensorRT-native portability. Concurrently, our framework is “Fast by Algorithm”, leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state-of-the-art 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g., Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev

[75] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models

Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang

Main category: cs.CV

TL;DR: HTC-VLM is a hybrid vision-language model that uses dual channels (continuous for fine details, discrete for symbolic anchors) to compress visual information from 580 tokens to just 1 token while maintaining 87.2% performance across benchmarks.

DetailsMotivation: Current VLMs face quadratic computational costs from processing hundreds of visual patch tokens, creating memory and context window strain. Traditional compression methods either dilute high-level semantics (continuous compression) or lose fine-grained details (discrete quantization).

Method: Hybrid framework with dual channels: continuous pathway for fine-grained details via ViT patches, and discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token using disentanglement attention mask and bottleneck.

Result: Achieves 87.2% average performance retention across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming leading continuous baseline at 81.0% with 580-to-1 compression ratio. Attention analysis shows compressed token prioritizes discrete anchor for semantic guidance.

Conclusion: A minimalist hybrid design can resolve the efficiency-fidelity dilemma in VLMs, enabling scalable multimodal reasoning through efficient and grounded representations that balance semantic preservation and detail retention.

Abstract: Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.

[76] Residual-SwinCA-Net: A Channel-Aware Integrated Residual CNN-Swin Transformer for Malignant Lesion Segmentation in BUSI

Saeeda Naz, Saddam Hussain Khan

Main category: cs.CV

TL;DR: A hybrid Residual-SwinCA-Net framework combining CNN and Transformer blocks achieves state-of-the-art breast lesion segmentation on BUSI dataset with 99.29% accuracy.

DetailsMotivation: To address challenges in breast ultrasound lesion segmentation by extracting both locally correlated features and global dependencies while handling ultrasound noise and maintaining morphological integrity of malignant lesions.

Method: Proposes a hybrid framework with residual CNN modules for local features, customized Swin Transformer blocks for global dependencies, Laplacian-of-Gaussian for tissue continuity, boundary-oriented operator for contour integrity, contraction strategy for scale invariance, MSCAS module for multi-scale attention, and Pixel-Attention for spatial weighting.
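
The Laplacian-of-Gaussian regional operator is a standard filter; a minimal SciPy sketch of this preprocessing step (generic, not the full Residual-SwinCA-Net pipeline):

```python
# Generic Laplacian-of-Gaussian preprocessing sketch (not the full pipeline):
# smooths ultrasound speckle while accentuating structural transitions.
import numpy as np
from scipy.ndimage import gaussian_laplace

rng = np.random.default_rng(0)
ultrasound = rng.random((256, 256)).astype(np.float32)      # stand-in B-mode image
log_response = gaussian_laplace(ultrasound, sigma=2.0)      # LoG regional operator
enhanced = ultrasound - 0.5 * log_response                  # sharpen transitions
print(log_response.shape, enhanced.dtype)
```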

Result: Outperforms existing CNNs/ViTs on BUSI dataset with 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice score for breast lesion segmentation.

Conclusion: The Residual-SwinCA-Net framework significantly improves BUSI lesion diagnostic performance and supports timely clinical decision-making for breast cancer diagnosis.

Abstract: A novel deep hybrid Residual-SwinCA-Net segmentation framework is proposed in this study to address the challenges of breast ultrasound lesion segmentation by extracting locally correlated and robust features through residual CNN modules. Furthermore, for learning global dependencies, Swin Transformer blocks are customized using internal residual pathways, which reinforce gradient stability, refine local patterns, and facilitate global feature fusion. A Laplacian-of-Gaussian regional operator is applied to enhance tissue continuity, suppress ultrasound noise, and accentuate fine structural transitions, and a boundary-oriented operator is incorporated to maintain the morphological integrity of malignant lesion contours. Subsequently, a stage-wise contraction strategy progressively reduces the feature maps to capture scale invariance and improve robustness to structural variability. In addition, prior to augmentation, each decoder level integrates a new Multi-Scale Channel Attention and Squeezing (MSCAS) module. MSCAS selectively emphasizes salient encoder maps and retains discriminative global context and complementary local structures at minimal computational cost while suppressing redundant activations. Finally, a Pixel-Attention module encodes class-relevant spatial cues by adaptively weighting malignant lesion pixels while suppressing background interference. The Residual-SwinCA-Net and existing CNN/ViT techniques were implemented on the publicly available BUSI dataset. The proposed Residual-SwinCA-Net framework outperformed them, achieving 99.29% mean accuracy, 98.74% IoU, and 0.9041 Dice for breast lesion segmentation, improving BUSI lesion diagnostic performance and supporting timely clinical decision-making.

[77] Distilling Future Temporal Knowledge with Masked Feature Reconstruction for 3D Object Detection

Haowen Zheng, Hu Zhu, Lu Deng, Weihao Gu, Yang Yang, Yanyan Liang

Main category: cs.CV

TL;DR: FTKD: Future Temporal Knowledge Distillation transfers future frame knowledge from offline teacher to online student models for camera-based 3D object detection without increasing inference cost.

DetailsMotivation: Existing knowledge distillation methods overlook future frames, focusing only on spatial features or temporal relations, making it hard for online models to learn future knowledge from offline models that use future frames.

Method: Proposes sparse query-based Future Temporal Knowledge Distillation (FTKD) with: 1) future-aware feature reconstruction without strict frame alignment, and 2) future-guided logit distillation leveraging teacher’s foreground/background context.

Result: Achieves up to 1.3 mAP and 1.3 NDS gains on nuScenes dataset, plus most accurate velocity estimation, without increasing inference cost.

Conclusion: FTKD effectively transfers future knowledge from offline to online models, improving 3D object detection performance while maintaining real-time inference efficiency.

Abstract: Camera-based temporal 3D object detection has shown impressive results in autonomous driving, with offline models improving accuracy by using future frames. Knowledge distillation (KD) can be an appealing framework for transferring rich information from offline models to online models. However, existing KD methods overlook future frames, as they mainly focus on spatial feature distillation under strict frame alignment or on temporal relational distillation, thereby making it challenging for online models to effectively learn future knowledge. To this end, we propose a sparse query-based approach, Future Temporal Knowledge Distillation (FTKD), which effectively transfers future frame knowledge from an offline teacher model to an online student model. Specifically, we present a future-aware feature reconstruction strategy to encourage the student model to capture future features without strict frame alignment. In addition, we further introduce future-guided logit distillation to leverage the teacher’s stable foreground and background context. FTKD is applied to two high-performing 3D object detection baselines, achieving up to 1.3 mAP and 1.3 NDS gains on the nuScenes dataset, as well as the most accurate velocity estimation, without increasing inference cost.

[78] Query-aware Hub Prototype Learning for Few-Shot 3D Point Cloud Semantic Segmentation

YiLin Zhou, Lili Wei, Zheming Xu, Ziyi Chen, Congyan Lang

Main category: cs.CV

TL;DR: QHP addresses prototype bias in few-shot 3D segmentation by generating query-aware prototypes through hub identification and distribution optimization.

DetailsMotivation: Existing metric-based prototype learning methods generate prototypes solely from support set without considering query relevance, causing prototype bias that overfits support characteristics and fails to generalize to query distribution, especially with distribution shifts.

Method: Proposes Query-aware Hub Prototype (QHP) learning with two modules: 1) Hub Prototype Generation (HPG) constructs bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes; 2) Prototype Distribution Optimization (PDO) uses purity-reweighted contrastive loss to refine prototypes by pulling bad hubs and outlier prototypes closer to class centers.
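
The hub-identification step can be sketched with a k-nearest-neighbor bipartite graph: support points that many query points link to become hubs, and a query-aware prototype is formed around each hub (a simplified sketch with assumed choices, not the HPG/PDO modules):

```python
# Simplified sketch (assumed choices, not the QHP modules): support features that many
# query points select as nearest neighbors become "hubs", and a query-aware prototype
# is the mean of each hub's local support neighborhood.
import numpy as np

def hub_prototypes(support: np.ndarray, query: np.ndarray,
                   k: int = 5, n_hubs: int = 3) -> np.ndarray:
    # Bipartite kNN graph: each query point links to its k nearest support points.
    d2 = ((query[:, None, :] - support[None, :, :]) ** 2).sum(-1)    # (Q, S)
    links = np.argsort(d2, axis=1)[:, :k]
    hubness = np.bincount(links.ravel(), minlength=len(support))     # in-degree per support point
    hubs = np.argsort(hubness)[::-1][:n_hubs]                        # most frequently linked points
    # Prototype per hub: mean of the support points nearest to that hub.
    d2_ss = ((support[hubs][:, None, :] - support[None, :, :]) ** 2).sum(-1)
    neigh = np.argsort(d2_ss, axis=1)[:, :k]
    return support[neigh].mean(axis=1)                               # (n_hubs, D)

rng = np.random.default_rng(0)
protos = hub_prototypes(rng.normal(size=(200, 32)), rng.normal(size=(300, 32)))
print(protos.shape)   # (3, 32)
```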

Result: Extensive experiments on S3DIS and ScanNet show QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets.

Conclusion: QHP successfully addresses prototype bias in FS-3DSeg by explicitly modeling semantic correlations between support and query sets, improving generalization to query distribution.

Abstract: Few-shot 3D point cloud semantic segmentation (FS-3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric-based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support-specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query-aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query-relevant prototypes that better capture cross-set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity-reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state-of-the-art methods, effectively narrowing the semantic gap between prototypes and query sets in FS-3DSeg.

[79] SFP: Real-World Scene Recovery Using Spatial and Frequency Priors

Yun Liu, Tao Li, Cosmin Ancuti, Wenqi Ren, Weisi Lin

Main category: cs.CV

TL;DR: SFP method uses spatial and frequency priors for real-world scene recovery, integrating transmission map estimation from spatial domain and adaptive frequency enhancement with novel priors, achieving superior performance across various degradation conditions.

DetailsMotivation: Existing scene recovery methods are insufficient: single priors can't handle multiple degradations, and complex networks trained on synthetic data generalize poorly to real-world scenarios. Need a more robust approach for diverse real-world degradations.

Method: Proposes Spatial and Frequency Priors (SFP): 1) Spatial prior: uses inverse of degraded image’s spectral projection to estimate transmission map for scattering degradation recovery; 2) Frequency prior: constructs adaptive frequency enhancement mask with two novel priors (DC component intensity similarity and low radial frequency magnitude distribution); 3) Weighted fusion strategy integrates spatial restoration, frequency enhancement, and salient input features.
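
The two frequency-domain priors can be sketched directly in NumPy: per-channel gains that pull each channel's DC component toward the cross-channel mean, and a mask over radial frequencies below 0.001 (an assumed parameterization, not the authors' estimator):

```python
# Sketch (assumed parameterization, not the authors' estimator) of the two frequency
# priors: align per-channel DC components and isolate very-low radial frequencies.
import numpy as np

def frequency_priors(img: np.ndarray):
    """img: (H, W, 3) degraded image with values in [0, 1]."""
    h, w, _ = img.shape
    spectra = [np.fft.fftshift(np.fft.fft2(img[..., c])) for c in range(3)]
    dc = np.array([np.abs(s[h // 2, w // 2]) for s in spectra])    # per-channel DC magnitude
    dc_gain = dc.mean() / np.maximum(dc, 1e-8)       # prior 1: equalize DC across channels
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radial = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_mask = radial < 0.001                        # prior 2: very-low radial frequencies
    low_energy = [np.abs(s[low_mask]).sum() / np.abs(s).sum() for s in spectra]
    return dc_gain, low_energy

rng = np.random.default_rng(0)
hazy = np.clip(0.6 + 0.2 * rng.random((128, 128, 3)), 0, 1)       # stand-in hazy image
print(frequency_priors(hazy))
```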

Result: Extensive evaluations demonstrate SFP’s effectiveness and superiority for scene recovery under various degradation conditions, outperforming existing methods that rely on single priors or synthetic-trained networks.

Conclusion: SFP provides a robust scene recovery approach combining spatial and frequency priors that effectively handles multiple real-world degradations without complex network architectures or synthetic data limitations.

Abstract: Scene recovery serves as a critical task for various computer vision applications. Existing methods typically rely on a single prior, which is inherently insufficient to handle multiple degradations, or employ complex network architectures trained on synthetic data, which suffer from poor generalization for diverse real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery. In the spatial domain, we observe that the inverse of the degraded image exhibits a projection along its spectral direction that resembles the scene transmission. Leveraging this spatial prior, the transmission map is estimated to recover the scene from scattering degradation. In the frequency domain, a mask is constructed for adaptive frequency enhancement, with two parameters estimated using our proposed novel priors. Specifically, one prior assumes that the mean intensity of the degraded image’s direct current (DC) components across three channels in the frequency domain closely approximates that of each channel in the clear image. The second prior is based on the observation that, for clear images, the magnitude of low radial frequencies below 0.001 constitutes approximately 1% of the total spectrum. Finally, we design a weighted fusion strategy to integrate spatial-domain restoration, frequency-domain enhancement, and salient features from the input image, yielding the final recovered result. Extensive evaluations demonstrate the effectiveness and superiority of our proposed SFP for scene recovery under various degradation conditions.

[80] RLCNet: An end-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and Camera

Hafeez Husain Cholakkal, Stefano Arrigoni, Francesco Braghin

Main category: cs.CV

TL;DR: RLCNet: End-to-end deep learning framework for simultaneous online calibration of LiDAR, RADAR, and camera sensors in autonomous vehicles, featuring real-time operation with weighted moving average and outlier rejection.

DetailsMotivation: Extrinsic calibration of multimodal sensors (LiDAR, RADAR, camera) is essential for reliable perception in autonomous vehicles but remains challenging due to mechanical vibrations and cumulative sensor drift in dynamic environments.

Method: RLCNet - a novel end-to-end trainable deep learning framework for simultaneous online calibration of multimodal sensors. Incorporates online calibration framework with weighted moving average and outlier rejection for real-time operation, enabling dynamic adjustment of calibration parameters.
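
The online smoothing layer can be sketched as a weighted moving average over per-frame calibration predictions with a simple robust outlier gate (assumed thresholding, not RLCNet's exact scheme):

```python
# Sketch (assumed thresholding, not RLCNet's exact scheme): smooth per-frame
# extrinsic-calibration predictions with an EMA and reject outlier frames.
import numpy as np

class OnlineCalibrationFilter:
    def __init__(self, alpha: float = 0.1, k: float = 3.0, warmup: int = 10):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.state = None
        self.history = []

    def update(self, params: np.ndarray) -> np.ndarray:
        """params: per-frame calibration vector (e.g. translation + Euler angles)."""
        if self.state is None:
            self.state = params.copy()
        elif len(self.history) >= self.warmup:
            # Reject frames deviating by more than k median-absolute-deviations.
            med = np.median(self.history, axis=0)
            mad = np.median(np.abs(self.history - med), axis=0)
            if np.any(np.abs(params - med) > self.k * (mad + 1e-6)):
                return self.state                     # outlier: keep the previous estimate
            self.state = (1 - self.alpha) * self.state + self.alpha * params
        else:
            self.state = (1 - self.alpha) * self.state + self.alpha * params
        self.history.append(params)
        return self.state

filt = OnlineCalibrationFilter()
rng = np.random.default_rng(0)
for t in range(50):
    noisy = np.array([0.2, -0.1, 1.5, 0.01, 0.02, 0.0]) + rng.normal(0, 0.005, 6)
    if t == 30:
        noisy += 0.5                                  # a simulated spurious prediction
    smoothed = filt.update(noisy)
print(np.round(smoothed, 3))                          # drifts slowly, ignores the spike
```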

Result: Validated on real-world datasets, demonstrates robust performance under diverse conditions. Ablation study highlights significance of architectural choices. Comparisons with existing methods show superior accuracy and robustness.

Conclusion: RLCNet is designed for practical deployment and enables dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift, making it suitable for real-time autonomous vehicle applications.

Abstract: Accurate extrinsic calibration of LiDAR, RADAR, and camera sensors is essential for reliable perception in autonomous vehicles. Still, it remains challenging due to factors such as mechanical vibrations and cumulative sensor drift in dynamic environments. This paper presents RLCNet, a novel end-to-end trainable deep learning framework for the simultaneous online calibration of these multimodal sensors. Validated on real-world datasets, RLCNet is designed for practical deployment and demonstrates robust performance under diverse conditions. To support real-time operation, an online calibration framework is introduced that incorporates a weighted moving average and outlier rejection, enabling dynamic adjustment of calibration parameters with reduced prediction noise and improved resilience to drift. An ablation study highlights the significance of architectural choices, while comparisons with existing methods demonstrate the superior accuracy and robustness of the proposed approach.

[81] EgoX: Egocentric Video Generation from a Single Exocentric Video

Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo

Main category: cs.CV

TL;DR: EgoX: A framework for converting third-person (exocentric) videos to first-person (egocentric) videos using adapted video diffusion models with geometry-guided attention for coherent view synthesis.

DetailsMotivation: Egocentric perception provides immersive understanding from first-person perspective, but converting exocentric videos to egocentric format is challenging due to extreme camera pose variations and minimal view overlap, requiring faithful content preservation and geometrically consistent synthesis of unseen regions.

Method: EgoX leverages pretrained video diffusion models with lightweight LoRA adaptation, uses unified conditioning combining exocentric and egocentric priors via width and channel concatenation, and employs geometry-guided self-attention to selectively attend to spatially relevant regions for geometric coherence.

Result: The approach achieves coherent and realistic egocentric video generation with strong scalability and robustness across unseen and in-the-wild videos.

Conclusion: EgoX successfully addresses the challenging task of exocentric-to-egocentric video translation by combining adapted diffusion models with geometric priors, enabling immersive first-person video generation from third-person inputs.

Abstract: Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.


[82] OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang

Main category: cs.CV

TL;DR: OpenSubject introduces a large-scale video-derived dataset (2.5M samples, 4.35M images) for subject-driven image generation and manipulation, addressing identity deviation and multi-subject scene challenges through a four-stage pipeline that leverages cross-frame identity priors.

DetailsMotivation: Current subject-driven image generation models often deviate from reference identities and struggle with complex scenes containing multiple subjects. There's a need for better training data to improve identity fidelity and performance in challenging scenarios.

Method: Four-stage pipeline: (1) Video curation with resolution/aesthetic filtering, (2) Cross-frame subject mining/pairing using VLM-based category consensus and diversity-aware pairing, (3) Identity-preserving reference image synthesis via segmentation map-guided outpainting and box-guided inpainting, (4) Verification/captioning with VLM validation and caption construction.

Result: Created OpenSubject dataset with 2.5M samples and 4.35M images. Training with OpenSubject improves generation and manipulation performance, particularly in complex scenes, as demonstrated through extensive experiments and a comprehensive benchmark evaluation.

Conclusion: OpenSubject provides a high-quality, large-scale dataset that effectively addresses identity deviation and multi-subject challenges in subject-driven generation and manipulation, enabling better model performance through cross-frame identity priors and systematic data curation.

Abstract: Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.

[83] Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation

Alexander Goslin

Main category: cs.CV

TL;DR: Terrain Diffusion: AI-based successor to Perlin noise that combines diffusion model fidelity with infinite procedural generation capabilities for seamless planet-scale terrain synthesis.

DetailsMotivation: Traditional procedural noise functions like Perlin noise have been limited in realism and large-scale coherence despite being fast and infinite. The paper aims to bridge the gap between high-fidelity diffusion models and the essential properties needed for procedural world generation.

Method: Introduces Terrain Diffusion with InfiniteDiffusion algorithm for infinite generation, hierarchical diffusion models for planetary context and local detail, compact Laplacian encoding for Earth-scale stability, open-source infinite-tensor framework for constant-memory manipulation, and few-step consistency distillation for efficiency.

Result: Establishes diffusion models as practical foundation for procedural world generation capable of synthesizing entire planets coherently, controllably, and without limits while maintaining seamless infinite extent, seed-consistency, and constant-time random access.

Conclusion: Terrain Diffusion successfully bridges the fidelity of modern diffusion models with the essential properties of procedural noise, creating an AI-era successor that enables realistic, coherent, and unbounded planet-scale terrain generation.

Abstract: For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, an AI-era successor to Perlin noise that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation, enabling seamless, real-time synthesis of boundless landscapes. A hierarchical stack of diffusion models couples planetary context with local detail, while a compact Laplacian encoding stabilizes outputs across Earth-scale dynamic ranges. An open-source infinite-tensor framework supports constant-memory manipulation of unbounded tensors, and few-step consistency distillation enables efficient generation. Together, these components establish diffusion models as a practical foundation for procedural world generation, capable of synthesizing entire planets coherently, controllably, and without limits.
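
The compact Laplacian encoding mentioned above can be pictured as storing a coarse base heightmap plus per-level residuals, which keeps values in a small range even for terrain with a large dynamic range. The toy encode/decode pair below is my illustration under that reading, not the paper's encoder.

```python
# Toy sketch (illustrative, not the paper's encoding): Laplacian-style
# encoding of a heightmap as a coarse base plus per-level residuals.
import numpy as np

def laplacian_encode(height, levels=3):
    pyramid = []
    current = height
    for _ in range(levels):
        coarse = current.reshape(current.shape[0] // 2, 2,
                                 current.shape[1] // 2, 2).mean(axis=(1, 3))
        upsampled = np.kron(coarse, np.ones((2, 2)))   # nearest-neighbour upsampling
        pyramid.append(current - upsampled)            # small residual detail
        current = coarse
    pyramid.append(current)                            # coarse base level
    return pyramid

def laplacian_decode(pyramid):
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = np.kron(current, np.ones((2, 2))) + residual
    return current

terrain = np.cumsum(np.cumsum(np.random.randn(64, 64), axis=0), axis=1)
encoded = laplacian_encode(terrain)
print("max reconstruction error:", np.abs(laplacian_decode(encoded) - terrain).max())
```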

[84] GeoDM: Geometry-aware Distribution Matching for Dataset Distillation

Xuhui Li, Zhengquan Luo, Zihui Cui, Zhiqiang Xu

Main category: cs.CV

TL;DR: GeoDM: A geometry-aware distribution-matching framework for dataset distillation that operates in product spaces of Euclidean, hyperbolic, and spherical manifolds to capture diverse data structures.

DetailsMotivation: Existing dataset distillation methods are confined to Euclidean spaces, only capturing linear structures and overlooking intrinsic data geometry like curvature. High-dimensional data often lie on low-dimensional manifolds, so distilled data should align with original data manifold geometry.

Method: Proposes GeoDM framework operating in Cartesian product of Euclidean, hyperbolic, and spherical manifolds. Introduces learnable curvature and weight parameters for different geometries, and designs optimal transport loss to enhance distribution fidelity.

Result: Theoretical analysis shows geometry-aware distribution matching yields smaller generalization error bound than Euclidean counterparts. Extensive experiments demonstrate outperforming state-of-the-art data distillation methods across various distribution-matching strategies.

Conclusion: GeoDM effectively captures diverse data structures (flat, hierarchical, cyclical) through unified representation in product manifolds, improving dataset distillation performance by aligning with intrinsic data geometry.

Abstract: Dataset distillation aims to synthesize a compact subset of the original data, enabling models trained on it to achieve performance comparable to those trained on the original large dataset. Existing distribution-matching methods are confined to Euclidean spaces, making them only capture linear structures and overlook the intrinsic geometry of real data, e.g., curvature. However, high-dimensional data often lie on low-dimensional manifolds, suggesting that dataset distillation should have the distilled data manifold aligned with the original data manifold. In this work, we propose a geometry-aware distribution-matching framework, called GeoDM, which operates in the Cartesian product of Euclidean, hyperbolic, and spherical manifolds, with flat, hierarchical, and cyclical structures all captured by a unified representation. To adapt to the underlying data geometry, we introduce learnable curvature and weight parameters for three kinds of geometries. At the same time, we design an optimal transport loss to enhance the distribution fidelity. Our theoretical analysis shows that the geometry-aware distribution matching in a product space yields a smaller generalization error bound than the Euclidean counterparts. Extensive experiments conducted on standard benchmarks demonstrate that our algorithm outperforms state-of-the-art data distillation methods and remains effective across various distribution-matching strategies for the single geometries.
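
To make the product-space idea concrete, here is a schematic PyTorch sketch that mixes Euclidean, Poincaré-ball, and spherical distances with learnable weights. The projection into the unit ball and the mixture parameterization are my assumptions; the actual GeoDM objective (including the optimal transport loss) is more involved.

```python
# Schematic sketch (assumptions, not the GeoDM code): combining distances from
# Euclidean, hyperbolic (Poincare ball), and spherical components of a product
# space with learnable mixture weights.
import torch
import torch.nn as nn

def euclidean_dist(x, y):
    return torch.norm(x - y, dim=-1)

def poincare_dist(x, y, eps=1e-5):
    # Map points into the open unit ball, then use the Poincare distance.
    x = x / (1.0 + torch.norm(x, dim=-1, keepdim=True))
    y = y / (1.0 + torch.norm(y, dim=-1, keepdim=True))
    num = 2.0 * torch.sum((x - y) ** 2, dim=-1)
    den = (1.0 - torch.sum(x ** 2, dim=-1)) * (1.0 - torch.sum(y ** 2, dim=-1))
    return torch.acosh(1.0 + num / den.clamp_min(eps))

def spherical_dist(x, y, eps=1e-5):
    # Project onto the unit sphere and measure the geodesic (angular) distance.
    x = x / torch.norm(x, dim=-1, keepdim=True).clamp_min(eps)
    y = y / torch.norm(y, dim=-1, keepdim=True).clamp_min(eps)
    cos = (x * y).sum(dim=-1).clamp(-1.0 + eps, 1.0 - eps)
    return torch.acos(cos)

class ProductSpaceDistance(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))  # learnable mixture weights

    def forward(self, x, y):
        w = torch.softmax(self.logits, dim=0)
        return (w[0] * euclidean_dist(x, y)
                + w[1] * poincare_dist(x, y)
                + w[2] * spherical_dist(x, y))

dist = ProductSpaceDistance()
real_feats, distilled_feats = torch.randn(8, 64), torch.randn(8, 64)
print(dist(real_feats, distilled_feats).mean())  # distribution-matching term
```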

[85] Detecting Dental Landmarks from Intraoral 3D Scans: the 3DTeethLand challenge

Achraf Ben-Hamadou, Nour Neifar, Ahmed Rekik, Oussama Smaoui, Firas Bouzguenda, Sergi Pujades, Niels van Nistelrooij, Shankeeth Vinayahalingam, Kaibo Shi, Hairong Jin, Youyi Zheng, Tibor Kubík, Oldřich Kodym, Petr Šilling, Kateřina Trávníčková, Tomáš Mojžiš, Jan Matula, Jeffry Hartanto, Xiaoying Zhu, Kim-Ngan Nguyen, Tudor Dascalu, Huikai Wu, Weijie Liu, Shaojie Zhuang, Guangshun Wei, Yuanfeng Zhou

Main category: cs.CV

TL;DR: The paper introduces the 3DTeethLand challenge and dataset for 3D teeth landmark detection from intraoral scans, addressing clinical orthodontic needs through deep learning approaches.

DetailsMotivation: Precise teeth landmark detection is crucial for advanced orthodontic diagnostics, personalized treatment planning, and treatment monitoring, but faces challenges due to complex tooth geometry and individual variations.

Method: The paper presents the 3DTeethLand challenge organized with MICCAI 2024, which introduced the first publicly available dataset for 3D teeth landmark detection and called for algorithm submissions to address this problem.

Result: The challenge created a valuable public resource for assessing state-of-the-art methods in 3D teeth landmark detection and encouraged community contributions to solve this clinically significant problem.

Conclusion: The 3DTeethLand challenge and dataset provide an important foundation for advancing deep learning techniques in 3D teeth landmark detection, with significant implications for clinical orthodontics and dental care.

Abstract: Teeth landmark detection is a critical task in modern clinical orthodontics. The precise identification of these landmarks enables advanced diagnostics, facilitates personalized treatment strategies, and supports more effective monitoring of treatment progress in clinical dentistry. However, several significant challenges may arise due to the intricate geometry of individual teeth and the substantial variations observed across different individuals. To address these complexities, the development of advanced techniques, especially through the application of deep learning, is essential for the precise and reliable detection of 3D tooth landmarks. In this context, the 3DTeethLand challenge was held in collaboration with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2024, calling for algorithms focused on teeth landmark detection from intraoral 3D scans. This challenge introduced the first publicly available dataset for 3D teeth landmark detection, offering a valuable resource for assessing state-of-the-art methods on this task and encouraging the community to contribute methods toward solving this clinically significant problem.

[86] GeoDiffMM: Geometry-Guided Conditional Diffusion for Motion Magnification

Xuedeng Liu, Jiabao Guo, Zheng Zhang, Fei Wang, Zhi Liu, Dan Guo

Main category: cs.CV

TL;DR: GeoDiffMM: A diffusion-based Lagrangian video motion magnification framework that uses optical flow as geometric conditioning to amplify subtle motions while suppressing noise.

DetailsMotivation: Existing Eulerian VMM approaches struggle to separate photon noise from true micro-motion when displacements are very small, even with decoupling representation learning techniques.

Method: Three-stage approach: 1) Noise-free Optical Flow Augmentation synthesizes diverse nonrigid motion fields without photon noise for supervision; 2) Diffusion Motion Magnifier conditions denoising on optical flow geometry prior and learnable magnification factor; 3) Flow-based Video Synthesis maps amplified motion back to image domain.

Result: Extensive experiments on real and synthetic datasets show GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification quality.

Conclusion: GeoDiffMM successfully addresses the photon noise problem in VMM by leveraging diffusion models with geometric conditioning, enabling structurally consistent motion magnification while suppressing content-irrelevant perturbations.

Abstract: Video Motion Magnification (VMM) amplifies subtle macroscopic motions to a perceptible level. Recently, mainstream Eulerian approaches have addressed amplification-induced noise via decoupled representation learning schemes based on texture, shape, and frequency, but they still struggle to separate photon noise from true micro-motion when motion displacements are very small. We propose GeoDiffMM, a novel diffusion-based Lagrangian VMM framework conditioned on optical flow as a geometric cue, enabling structurally consistent motion magnification. Specifically, we design a Noise-free Optical Flow Augmentation strategy that synthesizes diverse nonrigid motion fields without photon noise as supervision, helping the model learn more accurate geometry-aware optical flow and generalize better. Next, we develop a Diffusion Motion Magnifier that conditions the denoising process on (i) optical flow as a geometry prior and (ii) a learnable magnification factor controlling magnitude, thereby selectively amplifying motion components consistent with scene semantics and structure while suppressing content-irrelevant perturbations. Finally, we perform Flow-based Video Synthesis to map the amplified motion back to the image domain with high fidelity. Extensive experiments on real and synthetic datasets show that GeoDiffMM outperforms state-of-the-art methods and significantly improves motion magnification.

[87] Low Rank Support Quaternion Matrix Machine

Wang Chen, Ziyan Luo, Shuangyue Wang

Main category: cs.CV

TL;DR: LSQMM is a novel quaternion-based classification method for color images that treats RGB channels as pure quaternions to preserve channel relationships, uses quaternion nuclear norm for low-rank regularization, and employs ADMM optimization.

DetailsMotivation: Conventional methods represent color images as vectors/matrices/tensors in real field, losing intrinsic coupling relationships among RGB channels. Quaternion algebra can better preserve these relationships, as shown in image recovery tasks, but hasn't been well explored for classification.

Method: Treat RGB channels as pure quaternions to preserve channel coupling. Add quaternion nuclear norm regularization (extension of matrix nuclear norm) to hinge loss for low-rank structure promotion. Design ADMM-based iterative algorithm to solve the quaternion optimization model.

Result: Experimental results on multiple color image classification datasets show LSQMM outperforms state-of-the-art methods (SVM, SMM, STM) in classification accuracy, robustness, and computational efficiency.

Conclusion: LSQMM effectively leverages quaternion algebra for color image classification, preserving channel relationships and promoting low-rank structures, leading to superior performance compared to conventional real-field methods.

Abstract: Input features are conventionally represented as vectors, matrices, or third-order tensors in the real field for color image classification. Inspired by the success of quaternion data modeling for color images in image recovery and denoising tasks, we propose a novel method for color image classification, named the Low-rank Support Quaternion Matrix Machine (LSQMM), in which the RGB channels are treated as pure quaternions to effectively preserve the intrinsic coupling relationships among channels via the quaternion algebra. To promote the low-rank structures resulting from strongly correlated color channels, a quaternion nuclear norm regularization term, serving as a natural extension of the conventional matrix nuclear norm to the quaternion domain, is added to the hinge loss in our LSQMM model. An Alternating Direction Method of Multipliers (ADMM)-based iterative algorithm is designed to effectively solve the proposed quaternion optimization model. Experimental results on multiple color image classification datasets demonstrate that our proposed approach exhibits advantages in classification accuracy, robustness, and computational efficiency compared to several state-of-the-art methods using support vector machines, support matrix machines, and support tensor machines.
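
The quaternion nuclear norm used as the regularizer can be computed through the complex adjoint matrix of a quaternion matrix, whose singular values come in identical pairs. The minimal NumPy sketch below illustrates only that regularizer (the paper's ADMM solver is not shown), treating an RGB pixel as the pure quaternion Ri + Gj + Bk.

```python
# Minimal sketch (my illustration, not the LSQMM solver): computing the
# quaternion nuclear norm of a pure-quaternion color image via the complex
# adjoint matrix.
import numpy as np

def quaternion_nuclear_norm(rgb):
    """rgb: H x W x 3 array.  Returns the sum of quaternion singular values."""
    q0 = np.zeros(rgb.shape[:2])                 # real part is zero (pure quaternion)
    q1, q2, q3 = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    a = q0 + 1j * q1                             # Q = A + B j with A, B complex
    b = q2 + 1j * q3
    adjoint = np.block([[a, b], [-np.conj(b), np.conj(a)]])   # 2H x 2W complex
    singular_values = np.linalg.svd(adjoint, compute_uv=False)
    # Singular values of the adjoint come in identical pairs; the quaternion
    # nuclear norm is the sum over the distinct ones, i.e. half the total.
    return 0.5 * singular_values.sum()

image = np.random.rand(32, 32, 3)
print("quaternion nuclear norm:", quaternion_nuclear_norm(image))
```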

[88] MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

Yufei Gao, Jiaying Fei, Nuo Chen, Ruirui Chen, Guohang Yan, Yunshi Lan, Botian Shi

Main category: cs.CV

TL;DR: MELLA is a multimodal multilingual dataset designed to enhance MLLMs for low-resource languages by improving both linguistic capabilities and cultural groundedness, moving beyond basic machine translation approaches.

DetailsMotivation: Current multilingual enhancement methods for MLLMs are limited to text modality or rely solely on machine translation, producing "thin descriptions" that lack multimodal informativeness and cultural groundedness crucial for effectively serving low-resource language users.

Method: Proposes a dual-source strategy: 1) native web alt-text for cultural groundedness, and 2) MLLM-generated captions for linguistic capabilities. Implements this as MELLA, a multimodal multilingual dataset for fine-tuning MLLMs.

Result: Fine-tuning on MELLA shows general performance improvement across eight languages on various MLLM backbones, with models producing “thick descriptions” that demonstrate both cultural knowledge enhancement and linguistic capability enhancement.

Conclusion: The MELLA dataset effectively bridges the gap for low-resource language MLLMs by addressing both linguistic capability and cultural groundedness, moving beyond basic translation approaches to provide culturally aware multimodal understanding.

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in the contexts of low-resource languages. Current multilingual enhancement methods are often limited to text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce “thin descriptions”, they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement for the eight languages on various MLLM backbones, with models producing “thick descriptions”. We verify that the performance gains are from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.

[89] Interpreting Structured Perturbations in Image Protection Methods for Diffusion Models

Michael R. Martin, Garrick Chan, Kwan-Liu Ma

Main category: cs.CV

TL;DR: This paper provides a systematic explainable AI analysis of image protection mechanisms like Glaze and Nightshade, revealing they operate as structured, low-entropy perturbations tightly coupled to image content rather than inducing global representational drift.

DetailsMotivation: While Glaze and Nightshade show empirical effectiveness in disrupting text-to-image models, their internal structure, detectability, and representational behavior remain poorly understood, creating a need for systematic analysis.

Method: Uses a unified framework integrating white-box feature-space inspection and black-box signal-level probing, including latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization.

Result: Protection mechanisms operate as structured, low-entropy perturbations tightly coupled to image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure. Detectability is governed by perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure. Frequency analysis shows energy redistribution along dominant image-aligned frequency axes rather than diffuse noise.

Conclusion: Contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This advances interpretability of adversarial image protection and informs future defense and detection strategies for generative AI systems.

Abstract: Recent image protection mechanisms such as Glaze and Nightshade introduce imperceptible, adversarially designed perturbations intended to disrupt downstream text-to-image generative models. While their empirical effectiveness is known, the internal structure, detectability, and representational behavior of these perturbations remain poorly understood. This study provides a systematic, explainable AI analysis using a unified framework that integrates white-box feature-space inspection and black-box signal-level probing. Through latent-space clustering, feature-channel activation analysis, occlusion-based spatial sensitivity mapping, and frequency-domain characterization, we show that protection mechanisms operate as structured, low-entropy perturbations tightly coupled to underlying image content across representational, spatial, and spectral domains. Protected images preserve content-driven feature organization with protection-specific substructure rather than inducing global representational drift. Detectability is governed by interacting effects of perturbation entropy, spatial deployment, and frequency alignment, with sequential protection amplifying detectable structure rather than suppressing it. Frequency-domain analysis shows that Glaze and Nightshade redistribute energy along dominant image-aligned frequency axes rather than introducing diffuse noise. These findings indicate that contemporary image protection operates through structured feature-level deformation rather than semantic dislocation, explaining why protection signals remain visually subtle yet consistently detectable. This work advances the interpretability of adversarial image protection and informs the design of future defenses and detection strategies for generative AI systems.

[90] PointDico: Contrastive 3D Representation Learning Guided by Diffusion Models

Pengbo Li, Yiding Sun, Haozhe Cheng

Main category: cs.CV

TL;DR: PointDico is a novel 3D representation learning model that integrates diffusion and contrastive learning through knowledge distillation to overcome limitations of existing methods in handling unordered point clouds.

DetailsMotivation: Existing self-supervised methods struggle with 3D data due to its unordered and uneven density. Contrastive models tend to overfit, while 3D Mask Autoencoders have difficulty handling unordered point clouds. The authors aim to combine the strengths of diffusion and contrastive models for better 3D representation learning.

Method: PointDico integrates denoising generative modeling (diffusion) and cross-modal contrastive learning through knowledge distillation, where the diffusion model guides the contrastive model. It uses a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and a dual-channel design to integrate local and global contextual information.

Result: PointDico achieves state-of-the-art results: 94.32% accuracy on ScanObjectNN and 86.5% instance mIoU on ShapeNetPart.

Conclusion: The proposed integration of diffusion and contrastive learning through knowledge distillation effectively addresses the challenges of 3D representation learning, achieving superior performance on benchmark datasets.

Abstract: Self-supervised representation learning has shown significant improvement in Natural Language Processing and 2D Computer Vision. However, existing methods face difficulties in representing 3D data because of its unordered and uneven density. Through an in-depth analysis of mainstream contrastive and generative approaches, we find that contrastive models tend to suffer from overfitting, while 3D Mask Autoencoders struggle to handle unordered point clouds. This motivates us to learn 3D representations by sharing the merits of diffusion and contrastive models, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose PointDico, a novel model that seamlessly integrates these methods. PointDico learns from both denoising generative modeling and cross-modal contrastive learning through knowledge distillation, where the diffusion model serves as a guide for the contrastive model. We introduce a hierarchical pyramid conditional generator for multi-scale geometric feature extraction and employ a dual-channel design to effectively integrate local and global contextual information. PointDico achieves a new state-of-the-art in 3D representation learning, e.g., 94.32% accuracy on ScanObjectNN and 86.5% instance mIoU on ShapeNetPart.

[91] Bi^2MAC: Bimodal Bi-Adaptive Mask-Aware Convolution for Remote Sensing Pansharpening

Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen, Liangjian Deng

Main category: cs.CV

TL;DR: Bi²MAC: A bimodal bi-adaptive mask-aware convolution method for pansharpening that uses soft/hard masks to route features to different branches - compact branch for redundant features and focused branch for heterogeneous features, achieving SOTA performance with lower computational cost.

DetailsMotivation: Conventional deep learning pansharpening methods struggle with regional heterogeneity in feature representations. Existing adaptive convolution methods have high computational costs and limited ability to effectively capture heterogeneous regions in remote sensing images.

Method: Proposes Bimodal Bi-Adaptive Mask-Aware Convolution (Bi²MAC) with a lightweight module generating both soft and hard masks. Soft masks modulate input features preliminarily, while hard masks guide different region types into separate branches: redundant features go to a compact branch for low-cost global processing, and heterogeneous features go to a focused branch with more computational resources for fine-grained modeling.

Result: Extensive experiments on multiple benchmark datasets show Bi²MAC achieves state-of-the-art performance while requiring substantially lower training time and parameter counts, with minimal computational cost among adaptive convolution models.

Conclusion: Bi²MAC effectively exploits information from different region types while intelligently allocating computational resources, overcoming limitations of conventional adaptive convolution methods for pansharpening tasks.

Abstract: Pansharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to generate a high-resolution multispectral image (HRMS). Conventional deep learning-based methods are inherently limited in their ability to adapt to regional heterogeneity within feature representations. Although various adaptive convolution methods have been proposed to address this limitation, they often suffer from excessive computational costs and a limited ability to capture heterogeneous regions in remote sensing images effectively. To overcome these challenges, we propose Bimodal Bi-Adaptive Mask-Aware Convolution (Bi^2MAC), which effectively exploits information from different types of regions while intelligently allocating computational resources. Specifically, we design a lightweight module to generate both soft and hard masks, which are used to modulate the input features preliminarily and to guide different types of regions into separate processing branches, respectively. Redundant features are directed to a compact branch for low-cost global processing. In contrast, heterogeneous features are routed to a focused branch that invests more computational resources for fine-grained modeling. Extensive experiments on multiple benchmark datasets demonstrate that Bi^2MAC achieves state-of-the-art (SOTA) performance while requiring substantially lower training time and parameter counts, and the minimal computational cost among adaptive convolution models.
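
A conceptual PyTorch sketch of the mask-aware routing idea follows: a lightweight head predicts a soft mask that modulates the features and a hard mask that routes regions either to a compact branch or to a focused branch. Layer sizes and the 0.5 threshold are placeholders, not the paper's design.

```python
# Conceptual sketch (not the Bi^2MAC code): soft-mask modulation plus
# hard-mask routing of regions to a compact or a focused branch.
import torch
import torch.nn as nn

class MaskAwareRouting(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Compact branch: cheap processing for redundant regions.
        self.compact = nn.Conv2d(channels, channels, kernel_size=1)
        # Focused branch: heavier modelling for heterogeneous regions.
        self.focused = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        soft = torch.sigmoid(self.mask_head(x))     # soft mask in [0, 1]
        hard = (soft > 0.5).float()                  # hard mask selects regions
        x = x * soft                                 # preliminary modulation
        # Heterogeneous regions go to the focused branch, the rest stay compact.
        return hard * self.focused(x) + (1.0 - hard) * self.compact(x)

features = torch.randn(1, 32, 64, 64)
print(MaskAwareRouting(32)(features).shape)
```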

[92] HybridSplat: Fast Reflection-baked Gaussian Tracing using Hybrid Splatting

Chang Liu, Hongliang Yuan, Lianghao Zhang, Sichao Wang, Jianwei Guo, Shi-Sheng Huang

Main category: cs.CV

TL;DR: HybridSplat: A hybrid splatting framework that accelerates rendering of complex reflective scenes using reflection-baked Gaussian tracing and achieves 7x speedup with 4x fewer primitives.

DetailsMotivation: 3D Gaussian splatting shows promise for photorealistic novel view synthesis but faces bottlenecks in rendering speed and memory storage, especially for complex reflective scenes.

Method: Proposes HybridSplat with reflection-baked Gaussian tracing that bakes view-dependent reflection within each Gaussian primitive, integrates reflective and base primitives in unified framework, and uses pipeline-level acceleration with reflection-sensitive Gaussian pruning.

Result: Achieves ~7x faster rendering speed across complex reflective scenes from Ref-NeRF and NeRF-Casting with 4x fewer Gaussian primitives compared to ray-tracing based baselines.

Conclusion: HybridSplat serves as a new state-of-the-art method for complex reflective scenes, offering faster rendering, lower memory usage while preserving reflection quality.

Abstract: Rendering complex reflections of real-world scenes using 3D Gaussian splatting has been a promising solution for photorealistic novel view synthesis, but it still faces bottlenecks, especially in rendering speed and memory storage. This paper proposes a new Hybrid Splatting (HybridSplat) mechanism for Gaussian primitives. Our key idea is a new reflection-baked Gaussian tracing, which bakes the view-dependent reflection within each Gaussian primitive while rendering the reflection using tile-based Gaussian splatting. Then we integrate the reflective Gaussian primitives with base Gaussian primitives using a unified hybrid splatting framework for high-fidelity scene reconstruction. Moreover, we introduce a pipeline-level acceleration for the hybrid splatting and reflection-sensitive Gaussian pruning to reduce the model size, thus achieving much faster rendering and lower memory storage while preserving the reflection rendering quality. By extensive evaluation, our HybridSplat achieves about 7x faster rendering across complex reflective scenes from Ref-NeRF and NeRF-Casting with 4x fewer Gaussian primitives than comparable ray-tracing-based Gaussian splatting baselines, serving as a new state-of-the-art method especially for complex reflective scenes.

[93] DINO-BOLDNet: A DINOv3-Guided Multi-Slice Attention Network for T1-to-BOLD Generation

Jianwei Wang, Qing Wang, Menglan Ruan, Rongjun Ge, Chunfeng Yang, Yang Chen, Chunming Xie

Main category: cs.CV

TL;DR: DINO-BOLDNet generates BOLD fMRI images from T1w structural MRI using DINOv3-guided multi-slice attention framework.

DetailsMotivation: To recover missing BOLD information when BOLD images are corrupted or unavailable, enabling downstream tasks that require functional imaging data.

Method: Uses frozen self-supervised DINOv3 encoder for structural representations, slice-attention module for cross-slice context fusion, multi-scale decoder for functional contrast restoration, and DINO-based perceptual loss for structural/textural consistency.

Result: Outperforms conditional GAN baseline in both PSNR and MS-SSIM metrics on clinical dataset of 248 subjects. First framework capable of generating mean BOLD images directly from T1w images.

Conclusion: Demonstrates the potential of self-supervised transformer guidance for structural-to-functional mapping in neuroimaging.

Abstract: Generating BOLD images from T1w images offers a promising solution for recovering missing BOLD information and enabling downstream tasks when BOLD images are corrupted or unavailable. Motivated by this, we propose DINO-BOLDNet, a DINOv3-guided multi-slice attention framework that integrates a frozen self-supervised DINOv3 encoder with a lightweight trainable decoder. The model uses DINOv3 to extract within-slice structural representations, and a separate slice-attention module to fuse contextual information across neighboring slices. A multi-scale generation decoder then restores fine-grained functional contrast, while a DINO-based perceptual loss encourages structural and textural consistency between predictions and ground-truth BOLD in the transformer feature space. Experiments on a clinical dataset of 248 subjects show that DINO-BOLDNet surpasses a conditional GAN baseline in both PSNR and MS-SSIM. To our knowledge, this is the first framework capable of generating mean BOLD images directly from T1w images, highlighting the potential of self-supervised transformer guidance for structural-to-functional mapping.
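
The DINO-based perceptual loss can be pictured as an L1 distance between frozen-encoder features of the generated and ground-truth slices. In the hedged sketch below, a small convolutional stub stands in for the DINOv3 backbone; it only illustrates the loss structure, not the paper's architecture.

```python
# Hedged sketch of a DINO-style perceptual loss: compare generated and
# ground-truth BOLD slices in the feature space of a frozen encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoder(nn.Module):          # placeholder for a DINOv3 backbone
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
        )
        for p in self.parameters():
            p.requires_grad_(False)       # encoder stays frozen

    def forward(self, x):
        return self.net(x)

def perceptual_loss(encoder, pred, target):
    """L1 distance between frozen-encoder features of prediction and target."""
    with torch.no_grad():
        target_feat = encoder(target)
    return F.l1_loss(encoder(pred), target_feat)

encoder = FrozenEncoder().eval()
pred_bold, true_bold = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
print(perceptual_loss(encoder, pred_bold, true_bold))
```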

[94] TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu

Main category: cs.CV

TL;DR: TrackingWorld: A novel pipeline for dense 3D tracking of almost all pixels in world-centric coordinates from monocular video, addressing camera motion separation and tracking newly emerging objects.

DetailsMotivation: Existing monocular 3D tracking methods fail to properly separate camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in videos.

Method: Three-stage pipeline: 1) Tracking upsampler lifts sparse 2D tracks to dense 2D tracks; 2) Generalizes to newly emerging objects by applying upsampler to all frames and eliminating redundant tracks in overlapped regions; 3) Optimization-based framework back-projects dense 2D tracks into world-centric 3D trajectories by estimating camera poses and 3D coordinates.

Result: Extensive evaluations on both synthetic and real-world datasets demonstrate accurate and dense 3D tracking in a world-centric coordinate frame.

Conclusion: TrackingWorld successfully addresses limitations of existing methods by enabling dense 3D tracking of almost all pixels within a world-centric coordinate system, properly separating camera motion and handling newly emerging dynamic subjects.

Abstract: Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
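
The final stage, back-projecting dense 2D tracks into a world frame, essentially combines per-frame depth, camera intrinsics, and camera-to-world poses. A minimal sketch is shown below; it is illustrative only, since the paper jointly optimizes these quantities rather than assuming them known.

```python
# Minimal sketch (illustrative, not the TrackingWorld optimizer): lifting a 2D
# track into world coordinates given per-frame depth, intrinsics K, and
# camera-to-world poses (R, t).
import numpy as np

def backproject_track(uv_track, depths, K, poses):
    """uv_track: (T, 2) pixel positions, depths: (T,), poses: list of (R, t)
    camera-to-world transforms.  Returns a (T, 3) world-frame trajectory."""
    K_inv = np.linalg.inv(K)
    world_points = []
    for (u, v), z, (R, t) in zip(uv_track, depths, poses):
        ray = K_inv @ np.array([u, v, 1.0])      # direction in the camera frame
        cam_point = z * ray                       # 3D point in the camera frame
        world_points.append(R @ cam_point + t)    # transform to the world frame
    return np.array(world_points)

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
poses = [(np.eye(3), np.array([0.1 * i, 0.0, 0.0])) for i in range(5)]
track = np.tile([320.0, 240.0], (5, 1))
print(backproject_track(track, np.full(5, 2.0), K, poses))
```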

[95] SCU-CGAN: Enhancing Fire Detection through Synthetic Fire Image Generation and Dataset Augmentation

Ju-Young Kim, Ji-Hong Park, Gun-Woo Kim

Main category: cs.CV

TL;DR: SCU-CGAN model generates realistic fire images from nonfire images to augment limited fire datasets, improving fire detection model accuracy by 56.5% in mAP@0.5:0.95 for YOLOv5 nano.

DetailsMotivation: Fire detection is crucial for safety but limited by insufficient fire datasets. Existing models suffer from poor performance due to data scarcity, especially in household IoT fire detection systems.

Method: Proposed SCU-CGAN model integrates U-Net, CBAM (Convolutional Block Attention Module), and an additional discriminator to generate realistic fire images from nonfire images, creating augmented datasets for training.

Result: SCU-CGAN achieved 41.5% improvement in KID score over CycleGAN, demonstrating superior image quality. Augmented dataset improved YOLOv5 nano’s mAP@0.5:0.95 by 56.5% without model structure changes.

Conclusion: SCU-CGAN effectively addresses fire dataset scarcity by generating high-quality synthetic fire images, significantly boosting fire detection model performance through data augmentation.

Abstract: Fire has long accompanied human life, causing severe disasters and losses. Early detection is crucial, and with the rise of home IoT technologies, household fire detection systems have emerged. However, the lack of sufficient fire datasets limits the performance of detection models. We propose the SCU-CGAN model, which integrates U-Net, CBAM, and an additional discriminator to generate realistic fire images from nonfire images. We evaluate the image quality and confirm that SCU-CGAN outperforms existing models. Specifically, SCU-CGAN achieved a 41.5% improvement in KID score compared to CycleGAN, demonstrating the superior quality of the generated fire images. Furthermore, experiments demonstrate that the augmented dataset significantly improves the accuracy of fire detection models without altering their structure. For the YOLOv5 nano model, the most notable improvement was observed in the mAP@0.5:0.95 metric, which increased by 56.5%, highlighting the effectiveness of the proposed approach.
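
For reference, the CBAM attention block named in the abstract is commonly implemented as channel attention followed by spatial attention. The sketch below follows that common form and may differ from the exact design used inside the SCU-CGAN generator.

```python
# Sketch of a CBAM block in its common form (channel then spatial attention);
# details may differ from the SCU-CGAN generator.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=8, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared channel MLP
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        spatial_in = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))

print(CBAM(32)(torch.randn(1, 32, 64, 64)).shape)
```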

[96] The Unseen Bias: How Norm Discrepancy in Pre-Norm MLLMs Leads to Visual Information Loss

Bozhou Li, Xinda Xue, Sihan Yang, Yang Shi, Xinlong Chen, Yushuo Guan, Yuanxing Zhang, Wentao Zhang

Main category: cs.CV

TL;DR: The paper identifies a norm disparity problem in MLLMs where visual tokens have much higher norms than text tokens, causing asymmetric update dynamics that impair cross-modal fusion. The authors propose adding a single LayerNorm after the visual projector to align norms, which significantly improves performance on both multimodal and text-only benchmarks.

DetailsMotivation: Current Multimodal Large Language Models (MLLMs) using Pre-Norm architecture suffer from a critical flaw: severe norm disparity between high-norm visual tokens and low-norm text tokens. This imbalance creates asymmetric update dynamics where visual tokens transform much slower than text tokens, fundamentally impairing effective cross-modal feature fusion.

Method: The authors first provide formal theoretical analysis showing the norm disparity induces “representational inertia” in visual tokens. They empirically validate this phenomenon across mainstream MLLMs. As solution, they propose inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment between visual and text tokens.

Result: Experiments on LLaVA-1.5 architecture show significant performance gains on a wide suite of multimodal benchmarks. Notably, improvements also extend to text-only evaluations like MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.

Conclusion: The norm disparity in MLLMs is a fundamental architectural issue that causes asymmetric update dynamics and impairs cross-modal fusion. A simple LayerNorm intervention effectively aligns token norms, leading to substantial performance improvements across both multimodal and unimodal tasks, demonstrating that architectural balance is crucial for holistic model capability.

Abstract: Multimodal Large Language Models (MLLMs), which couple pre-trained vision encoders and language models, have shown remarkable capabilities. However, their reliance on the ubiquitous Pre-Norm architecture introduces a subtle yet critical flaw: a severe norm disparity between the high-norm visual tokens and the low-norm text tokens. In this work, we present a formal theoretical analysis demonstrating that this imbalance is not a static issue. Instead, it induces an "asymmetric update dynamic," where high-norm visual tokens exhibit a "representational inertia," causing them to transform semantically much slower than their textual counterparts. This fundamentally impairs effective cross-modal feature fusion. Our empirical validation across a range of mainstream MLLMs confirms that this theoretical dynamic – the persistence of norm disparity and the resulting asymmetric update rates – is a prevalent phenomenon. Based on this insight, we propose a remarkably simple yet effective solution: inserting a single, carefully initialized LayerNorm layer after the visual projector to enforce norm alignment. Experiments conducted on the LLaVA-1.5 architecture show that this intervention yields significant performance gains not only on a wide suite of multimodal benchmarks but also, notably, on text-only evaluations such as MMLU, suggesting that resolving the architectural imbalance leads to a more holistically capable model.
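
The proposed fix is architecturally tiny. Below is a minimal sketch of a visual projector followed by a post-projection LayerNorm, as described in the abstract; the dimensions and the initialization scale are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of the described intervention: a LayerNorm inserted after the
# visual projector so visual-token norms become comparable to text-token norms.
import torch
import torch.nn as nn

class NormAlignedProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, init_scale=1.0):
        super().__init__()
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.norm = nn.LayerNorm(llm_dim)
        # Careful initialization of the post-projector LayerNorm gain
        # (the scale here is a placeholder).
        nn.init.constant_(self.norm.weight, init_scale)

    def forward(self, visual_features):
        return self.norm(self.projector(visual_features))

proj = NormAlignedProjector()
visual_tokens = proj(torch.randn(1, 576, 1024))     # e.g. ViT patch tokens
print(visual_tokens.norm(dim=-1).mean())             # now comparable to text tokens
```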

[97] Simultaneous Enhancement and Noise Suppression under Complex Illumination Conditions

Jing Tao, You Li, Banglei Guan, Yang Shang, Qifeng Yu

Main category: cs.CV

TL;DR: A novel framework for simultaneous image enhancement and noise suppression under complex lighting conditions using gradient-domain weighted guided filtering, Retinex decomposition, and multi-exposure fusion.

DetailsMotivation: Existing image enhancement methods either amplify noise excessively or only work well under specific illumination conditions, limiting their effectiveness in challenging lighting scenarios.

Method: Three-stage approach: 1) GDWGIF for illumination estimation and initial quality improvement, 2) Retinex decomposition into illumination/reflection layers with parallel processing, 3) Multi-exposure fusion and linear stretching for dynamic range optimization.

Result: Experimental evaluation on real-world datasets shows superior performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.

Conclusion: The proposed framework effectively addresses the limitations of existing methods by simultaneously enhancing image quality while suppressing noise under complex illumination conditions.

Abstract: Under challenging light conditions, captured images often suffer from various degradations, leading to a decline in the performance of vision-based applications. Although numerous methods have been proposed to enhance image quality, they either significantly amplify inherent noise or are only effective under specific illumination conditions. To address these issues, we propose a novel framework for simultaneous enhancement and noise suppression under complex illumination conditions. Firstly, a gradient-domain weighted guided filter (GDWGIF) is employed to accurately estimate illumination and improve image quality. Next, the Retinex model is applied to decompose the captured image into separate illumination and reflection layers. These layers undergo parallel processing, with the illumination layer being corrected to optimize lighting conditions and the reflection layer enhanced to improve image quality. Finally, the dynamic range of the image is optimized through multi-exposure fusion and a linear stretching strategy. The proposed method is evaluated on real-world datasets obtained from practical applications. Experimental results demonstrate that our proposed method achieves better performance compared to state-of-the-art methods in both contrast enhancement and noise suppression.
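
The Retinex step can be sketched as estimating a smooth illumination layer, dividing it out to obtain reflectance, correcting the illumination, and recombining. In the toy example below a Gaussian blur stands in for the gradient-domain weighted guided filter, so it is only a rough approximation of the pipeline described above.

```python
# Rough sketch (assumptions, not the paper's GDWGIF pipeline): Retinex-style
# decomposition with a blurred illumination estimate, gamma correction of the
# illumination layer, and recombination.
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_enhance(gray, sigma=15, gamma=0.6, eps=1e-4):
    """gray: H x W image in [0, 1].  A Gaussian blur stands in for the
    gradient-domain weighted guided filter used in the paper."""
    illumination = gaussian_filter(gray, sigma=sigma) + eps
    reflectance = gray / illumination                 # Retinex decomposition
    corrected = illumination ** gamma                 # brighten dark regions
    enhanced = np.clip(reflectance * corrected, 0.0, 1.0)
    return enhanced, illumination, reflectance

image = np.random.rand(128, 128) * 0.3                # simulated low-light input
enhanced, _, _ = retinex_enhance(image)
print("mean before:", image.mean(), "after:", enhanced.mean())
```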

[98] Detection of Digital Facial Retouching utilizing Face Beauty Information

Philipp Srock, Juan E. Tapia, Christoph Busch

Main category: cs.CV

TL;DR: This paper addresses the problem of facial retouching detection in biometric systems, proposing to use beauty assessment algorithms and AI-based feature extraction to improve detection rates when the retouching algorithm is unknown.

DetailsMotivation: Facial retouching is widely used in social media and professional photography to enhance beauty, but when retouched images are used as biometric samples in face recognition systems, it becomes a security problem. Since retouching challenges face recognition systems, detecting facial retouching is increasingly necessary.

Method: The paper studies changes in beauty assessment algorithms of retouched images, assesses different AI-based feature extraction methods for retouching detection, and evaluates whether face beauty assessment can enhance detection rates. The approach works in scenarios where the attacking retouching algorithm is unknown.

Result: The method achieved 1.1% D-EER (Detection Equal Error Rate) on single image detection, demonstrating effective performance in detecting retouched facial images.

Conclusion: The research shows that analyzing beauty assessment algorithms and using AI-based feature extraction can effectively detect facial retouching in biometric systems, achieving low error rates even when the specific retouching algorithm is unknown.

Abstract: Facial retouching to beautify images is widespread in social media and advertisements, and it is even applied in professional photo studios to make individuals appear younger and to remove wrinkles and skin impurities. Generally speaking, this is done to enhance beauty. This is not a problem in itself, but it becomes one when retouched images are used as biometric samples and enrolled in a biometric system. Since previous work has proven facial retouching to be a challenge for face recognition systems, the detection of facial retouching becomes increasingly necessary. This work studies and analyzes changes in beauty assessment algorithms on retouched images, assesses different feature extraction methods based on artificial intelligence in order to improve retouching detection, and evaluates whether face beauty can be exploited to enhance the detection rate. In a scenario where the attacking retouching algorithm is unknown, this work achieved 1.1% D-EER on single image detection.

[99] Towards Visual Re-Identification of Fish using Fine-Grained Classification for Electronic Monitoring in Fisheries

Samitha Nuwan Thilakarathna, Ercan Avsar, Martin Mathias Nielsen, Malte Pedersen

Main category: cs.CV

TL;DR: The paper develops an optimized deep learning pipeline for automated fish re-identification using the AutoFish dataset, showing that Vision Transformer-based Swin-T with hard triplet mining and custom image transformations outperforms CNN-based ResNet-50.

DetailsMotivation: Electronic Monitoring systems generate more video data than can be manually reviewed, creating a need for automated fish re-identification to support sustainable marine resource management.

Method: Developed an optimized deep learning pipeline using the AutoFish dataset (simulating EM systems with conveyor belts). Employed hard triplet mining with custom image transformation pipeline including dataset-specific normalization. Compared Vision Transformer-based Swin-T architecture against CNN-based ResNet-50.

Result: Swin-T consistently outperformed ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. Analysis revealed the primary challenge is distinguishing visually similar individuals of the same species (intra-species errors), where viewpoint inconsistency is more detrimental than partial occlusion.

Conclusion: The proposed optimized deep learning pipeline with hard triplet mining and custom transformations effectively addresses fish re-identification challenges in EM systems, with Vision Transformers showing superior performance over traditional CNNs for this task.

Abstract: Accurate fisheries data are crucial for effective and sustainable marine resource management. With the recent adoption of Electronic Monitoring (EM) systems, more video data is now being collected than can be feasibly reviewed manually. This paper addresses this challenge by developing an optimized deep learning pipeline for automated fish re-identification (Re-ID) using the novel AutoFish dataset, which simulates conveyor-belt EM systems with six similar-looking fish species. We demonstrate that key Re-ID metrics (R1 and mAP@k) are substantially improved by using hard triplet mining in conjunction with a custom image transformation pipeline that includes dataset-specific normalization. By employing these strategies, we demonstrate that the Vision Transformer-based Swin-T architecture consistently outperforms the Convolutional Neural Network-based ResNet-50, achieving peak performance of 41.65% mAP@k and 90.43% Rank-1 accuracy. An in-depth analysis reveals that the primary challenge is distinguishing visually similar individuals of the same species (intra-species errors), where viewpoint inconsistency proves significantly more detrimental than partial occlusion. The source code and documentation are available at: https://github.com/msamdk/Fish_Re_Identification.git
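
Batch-hard triplet mining, the strategy credited with much of the improvement, selects the farthest same-identity sample and the closest different-identity sample for every anchor in a batch. The self-contained PyTorch sketch below illustrates that mining rule; it is not the authors' training code.

```python
# Sketch of batch-hard triplet mining over embedding vectors with integer
# identity labels (illustrative, not the paper's exact training code).
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """embeddings: (N, D), labels: (N,).  For each anchor, pick the hardest
    positive (farthest same-identity sample) and hardest negative (closest
    different-identity sample)."""
    dist = torch.cdist(embeddings, embeddings)            # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    hardest_pos = dist.masked_fill(~same | eye, float("-inf")).amax(dim=1)
    hardest_neg = dist.masked_fill(same, float("inf")).amin(dim=1)
    return F.relu(hardest_pos - hardest_neg + margin).mean()

feats = F.normalize(torch.randn(16, 128), dim=1)
ids = torch.randint(0, 4, (16,))
print(batch_hard_triplet_loss(feats, ids))
```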

[100] SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos

Mingqi Gao, Yunqi Miao, Jungong Han

Main category: cs.CV

TL;DR: SAM-Body4D is a training-free framework that improves temporal consistency and occlusion robustness for 3D human mesh recovery from videos by using identity-consistent masklets and occlusion-aware refinement.

DetailsMotivation: Current image-based HMR methods like SAM 3D Body work well on single images but suffer from temporal inconsistency and degraded performance under occlusions when applied to videos due to per-frame inference.

Method: 1) Generate identity-consistent masklets using promptable video segmentation, 2) Refine masklets with Occlusion-Aware module to recover missing regions, 3) Use refined masklets to guide SAM 3D Body for consistent mesh trajectories, 4) Padding-based parallel strategy for efficient multi-human inference.

Result: Experimental results show improved temporal stability and robustness in challenging in-the-wild videos without any retraining.

Conclusion: SAM-Body4D provides a training-free solution for temporally consistent and occlusion-robust human mesh recovery from videos by leveraging human continuity and masklet refinement.

Abstract: Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.

[101] Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

Tao Chen, Shaobo Ju, Qiong Wu, Chenxin Fang, Kun Zhang, Jun Peng, Hui Li, Yiyi Zhou, Rongrong Ji

Main category: cs.CV

TL;DR: OneClip-RAG is a novel video retrieval-augmented generation method that enables MLLMs to process long videos efficiently by using one-shot video clips for retrieval augmentation with query-guided chunking.

DetailsMotivation: Current MLLMs have excessive memory overhead and can only process limited video frames, making them ineffective for long video understanding. There's a need for an efficient method to handle long videos while maintaining knowledge integrity and semantic coherence.

Method: Proposes OneClip-RAG with: 1) One-shot video clip based retrieval augmentation using video clips as retrieval units, 2) Query-guided video chunking algorithm that unifies clip chunking and cross-modal retrieval in one step, 3) SynLongVideo dataset for training, and 4) Progressive training regime for improved instruction following.

Result: Significant performance gains on long-video benchmarks, boosting InternLV2 8B and Qwen2-VL 7B to GPT-4o level on MLVU. Superior efficiency: enables LLaVA-Video to understand up to 1-hour videos in <2.2 minutes on single 4090 GPU.

Conclusion: OneClip-RAG effectively addresses MLLMs’ limitations in processing long videos, offering both performance improvements and computational efficiency through its novel retrieval augmentation paradigm and query-guided chunking approach.

Abstract: Due to excessive memory overhead, most Multimodal Large Language Models (MLLMs) can only process a limited number of video frames. In this paper, we propose an effective and efficient paradigm to remedy this shortcoming, termed One-shot video-Clip based Retrieval AuGmentation (OneClip-RAG). Compared with existing video RAG methods, OneClip-RAG makes full use of the merits of video clips for augmented video understanding in terms of both knowledge integrity and semantic coherence. Besides, it is also equipped with a novel query-guided video chunking algorithm that can unify clip chunking and cross-modal retrieval in one processing step, avoiding redundant computations. To improve instruction following, we further propose a new dataset called SynLongVideo and design a progressive training regime for OneClip-RAG. OneClip-RAG is plugged into five recent MLLMs and validated on a set of long-video benchmarks. Experimental results not only show clear performance gains by OneClip-RAG over MLLMs, e.g., boosting InternLV2 8B and Qwen2-VL 7B to the level of GPT-4o on MLVU, but also show its superior efficiency in handling long videos, e.g., enabling LLaVA-Video to understand up to an hour of video in less than 2.2 minutes on a single 4090 GPU.
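
The paper's query-guided chunking unifies clip segmentation and cross-modal retrieval in one step; the exact algorithm is not given in the abstract, so the sketch below only illustrates the general idea of scoring contiguous frame windows against a query embedding and keeping the best clip (the sliding-window scheme, `clip_len`, and the toy data are assumptions, not the paper's method).

```python
import numpy as np

def top_clip_by_query(frame_embs, query_emb, clip_len=16):
    """Score every contiguous window of `clip_len` frames by its mean cosine
    similarity to the query and return the best window. Illustrative stand-in
    for query-guided chunking, not OneClip-RAG's exact algorithm."""
    # Cosine similarity of each frame embedding to the query embedding.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = f @ q                                           # (num_frames,)

    # Mean similarity over every sliding window of length clip_len.
    window_scores = np.convolve(sims, np.ones(clip_len) / clip_len, mode="valid")
    start = int(window_scores.argmax())
    return start, start + clip_len, float(window_scores[start])

# Toy usage: 200 frames with 512-d embeddings, one text-query embedding.
frames = np.random.randn(200, 512)
query = np.random.randn(512)
print(top_clip_by_query(frames, query))
```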

[102] SDT-6D: Fully Sparse Depth-Transformer for Staged End-to-End 6D Pose Estimation in Industrial Multi-View Bin Picking

Nico Leuze, Maximilian Hoh, Samed Doğan, Nicolas R. -Peña, Alfred Schoettl

Main category: cs.CV

TL;DR: A depth-only 6D pose estimation method for cluttered bin-picking using multi-view depth fusion with staged heatmaps and sparse transformers, enabling high-resolution volumetric processing for accurate pose recovery in occlusion-heavy industrial environments.

DetailsMotivation: 6D pose estimation in densely packed industrial bin-picking environments is challenging due to occlusions, reflections, and textureless parts. Existing methods struggle with these conditions, and sparse 3D approaches remain underexplored for close-range robotic applications despite their potential.

Method: A holistic depth-only approach that fuses multi-view depth maps into either fine-grained 3D point clouds or sparse Truncated Signed Distance Fields (TSDF). Uses staged heatmap mechanism for scene-adaptive attention priors across resolutions, density-aware sparse transformer blocks to handle occlusions and non-uniform 3D data distribution, and operates fully sparse for high-resolution volumetric representations. Employs per-voxel voting strategy for simultaneous pose predictions of arbitrary numbers of objects.

Result: Validated on IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios. The method effectively handles occlusions and complex industrial environments.

Conclusion: The proposed sparse, depth-only framework enables accurate 6D pose estimation in challenging industrial bin-picking environments by leveraging high-resolution volumetric representations and novel attention mechanisms, addressing key limitations of existing approaches in cluttered settings.

Abstract: Accurately recovering 6D poses in densely packed industrial bin-picking environments remains a serious challenge, owing to occlusions, reflections, and textureless parts. We introduce a holistic depth-only 6D pose estimation approach that fuses multi-view depth maps into either a fine-grained 3D point cloud in its vanilla version, or a sparse Truncated Signed Distance Field (TSDF). At the core of our framework lies a staged heatmap mechanism that yields scene-adaptive attention priors across different resolutions, steering computation toward foreground regions, thus keeping memory requirements at high resolutions feasible. Along with it, we propose a density-aware sparse transformer block that dynamically attends to (self-) occlusions and the non-uniform distribution of 3D data. While sparse 3D approaches have proven effective for long-range perception, their potential in close-range robotic applications remains underexplored. Our framework operates fully sparse, enabling high-resolution volumetric representations to capture fine geometric details crucial for accurate pose estimation in clutter. Our method processes the entire scene integrally, predicting the 6D pose via a novel per-voxel voting strategy, allowing simultaneous pose predictions for an arbitrary number of target objects. We validate our method on the recently published IPD and MV-YCB multi-view datasets, demonstrating competitive performance in heavily cluttered industrial and household bin picking scenarios.

[103] LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training

Qing Xu, Kun Yuan, Yuxiang Luo, Yuhao Zhai, Wenting Duan, Nassir Navab, Zhen Chen

Main category: cs.CV

TL;DR: LapFM is a surgical foundation model that learns robust segmentation from massive unlabeled laparoscopic images using hierarchical concept evolving pre-training, outperforming existing methods in universal laparoscopic segmentation.

DetailsMotivation: Surgical segmentation faces challenges with annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches fine-tune natural foundation models as domain adapters rather than true surgical foundation models, limiting generalization across surgical targets.

Method: LapFM uses Hierarchical Concept Evolving Pre-training: 1) Laparoscopic Concept Hierarchy (LCH) with hierarchical mask decoder and parent-child query embeddings to unify diverse entities; 2) Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, creating LapBench-114K dataset.

Result: Extensive experiments show LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation.

Conclusion: LapFM successfully evolves robust segmentation capabilities from massive unlabeled surgical images, addressing annotation scarcity and semantic inconsistency while achieving superior generalization across diverse surgical targets.

Abstract: Surgical segmentation is pivotal for scene understanding yet remains hindered by annotation scarcity and semantic inconsistency across diverse procedures. Existing approaches typically fine-tune natural foundation models (e.g., SAM) with limited supervision, functioning merely as domain adapters rather than surgical foundation models. Consequently, they struggle to generalize across the vast variability of surgical targets. To bridge this gap, we present LapFM, a foundation model designed to evolve robust segmentation capabilities from massive unlabeled surgical images. Distinct from medical foundation models relying on inefficient self-supervised proxy tasks, LapFM leverages a Hierarchical Concept Evolving Pre-training paradigm. First, we establish a Laparoscopic Concept Hierarchy (LCH) via a hierarchical mask decoder with parent-child query embeddings, unifying diverse entities (i.e., Anatomy, Tissue, and Instrument) into a scalable knowledge structure with cross-granularity semantic consistency. Second, we propose a Confidence-driven Evolving Labeling that iteratively generates and filters pseudo-labels based on hierarchical consistency, progressively incorporating reliable samples from unlabeled images into training. This process yields LapBench-114K, a large-scale benchmark comprising 114K image-mask pairs. Extensive experiments demonstrate that LapFM significantly outperforms state-of-the-art methods, establishing new standards for granularity-adaptive generalization in universal laparoscopic segmentation. The source code is available at https://github.com/xq141839/LapFM.
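
The Confidence-driven Evolving Labeling step boils down to keeping only pseudo-labels that are both confident and consistent with the concept hierarchy. The sketch below illustrates that filtering logic under assumed class names, thresholds, and data layout; none of these specifics come from the paper.

```python
# Illustrative sketch of confidence-driven, hierarchy-consistent pseudo-label
# filtering in the spirit of LapFM's Evolving Labeling. Class names, the
# threshold, and the dictionary layout are assumptions for demonstration.
PARENT_OF = {
    "grasper": "Instrument", "hook": "Instrument",
    "liver": "Anatomy", "gallbladder": "Anatomy",
    "fat": "Tissue", "connective_tissue": "Tissue",
}

def filter_pseudo_labels(predictions, conf_thresh=0.9):
    """Keep a pseudo mask only if (a) its confidence clears the threshold and
    (b) the fine-grained class agrees with the predicted parent concept."""
    kept = []
    for pred in predictions:
        consistent = PARENT_OF.get(pred["child_class"]) == pred["parent_class"]
        if pred["confidence"] >= conf_thresh and consistent:
            kept.append(pred)
    return kept

# Toy usage with two candidate pseudo-labels; only the consistent one survives.
cands = [
    {"child_class": "grasper", "parent_class": "Instrument", "confidence": 0.95},
    {"child_class": "liver", "parent_class": "Instrument", "confidence": 0.97},
]
print(filter_pseudo_labels(cands))
```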

[104] Leveraging Multispectral Sensors for Color Correction in Mobile Cameras

Luca Cogo, Marco Buzzelli, Simone Bianco, Javier Vazquez-Corral, Raimondo Schettini

Main category: cs.CV

TL;DR: Unified learning-based framework for end-to-end color correction using both high-res RGB and low-res multispectral sensors, outperforming separate-stage methods by up to 50% error reduction.

DetailsMotivation: Existing snapshot multispectral imaging systems capture richer spectral data than RGB sensors but most methods treat color correction in separate stages, discarding MS data early. There's a need for a unified approach that fully leverages both sensor types.

Method: Proposes a unified learning-based framework that: 1) performs end-to-end color correction, 2) jointly leverages high-resolution RGB and low-resolution MS sensor data in a single model. The framework is flexible enough to refactor two different state-of-the-art image-to-image architectures. Created a dedicated dataset by aggregating public spectral datasets rendered under multiple RGB camera sensitivities.

Result: Extensive experiments show the approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. The unified framework produces coherent and color-accurate outputs.

Conclusion: The proposed end-to-end learning framework successfully integrates RGB and MS sensor data for superior color correction, demonstrating flexibility across architectures and significant performance improvements over existing methods.

Abstract: Recent advances in snapshot multispectral (MS) imaging have enabled compact, low-cost spectral sensors for consumer and mobile devices. By capturing richer spectral information than conventional RGB sensors, these systems can enhance key imaging tasks, including color correction. However, most existing methods treat the color correction pipeline in separate stages, often discarding MS data early in the process. We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. Our approach integrates the full pipeline within a single model, producing coherent and color-accurate outputs. We demonstrate the flexibility and generality of our framework by refactoring two different state-of-the-art image-to-image architectures. To support training and evaluation, we construct a dedicated dataset by aggregating and repurposing publicly available spectral datasets, rendering them under multiple RGB camera sensitivities. Extensive experiments show that our approach improves color accuracy and stability, reducing error by up to 50% compared to RGB-only and MS-driven baselines. Datasets, code, and models will be made available upon acceptance.
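
As a rough picture of what jointly leveraging a high-resolution RGB sensor and a low-resolution MS sensor in a single model can look like, here is a minimal PyTorch sketch that upsamples the MS cube, concatenates it with the RGB image, and predicts a residual color correction. The architecture, layer sizes, and band count are assumptions, not the paper's refactored networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBMSColorCorrector(nn.Module):
    """Minimal joint RGB/MS color-correction sketch: the low-resolution
    multispectral cube is bilinearly upsampled to the RGB resolution,
    concatenated with the RGB image, and mapped to a corrected RGB output.
    The residual design and channel widths are illustrative assumptions."""
    def __init__(self, ms_bands=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 + ms_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, rgb, ms):
        ms_up = F.interpolate(ms, size=rgb.shape[-2:], mode="bilinear",
                              align_corners=False)
        # Predict a residual correction on top of the raw RGB image.
        return rgb + self.body(torch.cat([rgb, ms_up], dim=1))

# Toy usage: high-res RGB (256x256) and low-res 8-band MS (32x32).
model = RGBMSColorCorrector(ms_bands=8)
out = model(torch.rand(1, 3, 256, 256), torch.rand(1, 8, 32, 32))
print(out.shape)   # torch.Size([1, 3, 256, 256])
```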

[105] Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning

Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang

Main category: cs.CV

TL;DR: CAMA is a training-free attention modulation method that improves multimodal in-context learning by dynamically adjusting attention logits to strengthen focus on semantically important tokens, especially visual ones.

DetailsMotivation: Multimodal ICL performance remains unstable even with well-matched demonstrations, showing LVLMs struggle to fully utilize provided context. Existing approaches focus on prompt engineering or post-hoc calibration, but the paper addresses inherent limitations in attention mechanisms.

Method: Context-Aware Modulated Attention (CAMA) - a training-free, plug-and-play method that dynamically adjusts attention logits based on input in-context sequences. Uses two-stage modulation process to strengthen attention to semantically important tokens, particularly visual ones.

Result: Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can activate intended benefits of prompt engineering methods and remains robust across different sequence configurations.

Conclusion: CAMA opens up new directions for improving multimodal reasoning through deeper understanding of attention dynamics, addressing fundamental weaknesses in LVLMs’ self-attention mechanisms for ICL.

Abstract: Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
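
CAMA's core move is adjusting attention logits so that semantically important (especially visual) tokens receive more weight. The sketch below shows a single-head attention step with an additive boost on visual-token logits, standing in for the paper's two-stage, context-dependent modulation; the uniform `boost` value and mask are assumptions.

```python
import torch

def modulated_attention(q, k, v, visual_mask, boost=1.0):
    """Single-head attention with a CAMA-style additive boost on the logits
    of visual tokens. The uniform `boost` is a placeholder for the paper's
    context-dependent, two-stage modulation, which is not public here."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5            # (L_q, L_k)
    logits = logits + boost * visual_mask.float()         # raise visual-token logits
    return torch.softmax(logits, dim=-1) @ v

# Toy usage: 6 query tokens attending over 10 context tokens,
# of which tokens 2..5 are image patches.
q, k, v = torch.randn(6, 64), torch.randn(10, 64), torch.randn(10, 64)
visual = torch.zeros(10, dtype=torch.bool)
visual[2:6] = True
print(modulated_attention(q, k, v, visual).shape)         # torch.Size([6, 64])
```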

[106] Uncertainty-Aware Subset Selection for Robust Visual Explainability under Distribution Shifts

Madhav Gupta, Vishak Prasad C, Ganesh Ramakrishnan

Main category: cs.CV

TL;DR: Subset selection methods for explaining vision models degrade under out-of-distribution (OOD) conditions, producing unreliable explanations. The paper introduces an uncertainty-aware submodular optimization framework that improves robustness without extra training.

DetailsMotivation: Current subset selection methods for explaining deep vision models work well in-distribution but perform poorly under out-of-distribution conditions, producing redundant, unstable, and uncertainty-sensitive explanations. This limits their real-world applicability where OOD scenarios are common.

Method: Combines submodular subset selection with layer-wise, gradient-based uncertainty estimation. Uses adaptive weight perturbations to estimate uncertainty, then guides submodular optimization with these estimates to ensure diverse and informative subset selection. No additional training or auxiliary models required.

Result: Empirical evaluations show the framework mitigates weaknesses of existing methods under OOD scenarios while also improving performance in ID settings. Produces more robust, stable, and informative explanations compared to baseline subset selection methods.

Conclusion: Current subset-based explanation methods have significant limitations under OOD conditions. Uncertainty-driven optimization can enhance attribution and object-level interpretability, making AI explanations more transparent and trustworthy for real-world vision applications.

Abstract: Subset selection-based methods are widely used to explain deep vision models: they attribute predictions by highlighting the most influential image regions and support object-level explanations. While these methods perform well in in-distribution (ID) settings, their behavior under out-of-distribution (OOD) conditions remains poorly understood. Through extensive experiments across multiple ID-OOD sets, we find that reliability of the existing subset based methods degrades markedly, yielding redundant, unstable, and uncertainty-sensitive explanations. To address these shortcomings, we introduce a framework that combines submodular subset selection with layer-wise, gradient-based uncertainty estimation to improve robustness and fidelity without requiring additional training or auxiliary models. Our approach estimates uncertainty via adaptive weight perturbations and uses these estimates to guide submodular optimization, ensuring diverse and informative subset selection. Empirical evaluations show that, beyond mitigating the weaknesses of existing methods under OOD scenarios, our framework also yields improvements in ID settings. These findings highlight limitations of current subset-based approaches and demonstrate how uncertainty-driven optimization can enhance attribution and object-level interpretability, paving the way for more transparent and trustworthy AI in real-world vision applications.
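
To make the combination of submodular selection and uncertainty estimates concrete, here is a hedged sketch of greedy facility-location selection in which high-uncertainty regions are down-weighted. The objective, the weighting rule, and the budget are illustrative choices, not the paper's exact formulation or its gradient-based uncertainty estimator.

```python
import numpy as np

def greedy_facility_location(region_feats, uncertainty, budget=5):
    """Greedy maximization of a facility-location objective that down-weights
    regions with high uncertainty. Mirrors the idea of uncertainty-guided
    submodular subset selection; the paper's objective may differ."""
    sims = region_feats @ region_feats.T                  # pairwise similarity
    weights = 1.0 / (1.0 + uncertainty)                   # penalize uncertain regions
    selected, covered = [], np.zeros(len(region_feats))
    for _ in range(budget):
        # Marginal gain of adding each candidate region to the selected set.
        gains = (weights[None, :] * np.maximum(sims, covered)).sum(axis=1) \
                - (weights * covered).sum()
        gains[selected] = -np.inf
        best = int(gains.argmax())
        selected.append(best)
        covered = np.maximum(covered, sims[best])
    return selected

# Toy usage: 20 candidate image regions with 64-d unit-norm features.
feats = np.random.randn(20, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
unc = np.random.rand(20)
print(greedy_facility_location(feats, unc, budget=5))
```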

[107] Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang, Yiwu Zhong, Zi-Yuan Hu, Yeyao Tao, Liwei Wang

Main category: cs.CV

TL;DR: EgoMask is the first pixel-level benchmark for spatiotemporal video grounding in egocentric videos, addressing challenges like shorter object durations and sparser trajectories compared to exocentric videos.

DetailsMotivation: Egocentric video grounding is important for AR/robotics but underexplored compared to exocentric videos. The paper identifies key discrepancies: shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts in egocentric videos.

Method: 1) Systematic analysis of egocentric vs exocentric video discrepancies; 2) Automatic annotation pipeline to create EgoMask benchmark with referring expressions and object masks across short-, medium-, and long-term videos; 3) Creation of EgoMask-Train large-scale training dataset.

Result: State-of-the-art spatiotemporal grounding models perform poorly on EgoMask benchmark, but fine-tuning on EgoMask-Train yields significant improvements while preserving performance on exocentric datasets.

Conclusion: EgoMask provides essential resources and insights for advancing egocentric video understanding, addressing unique challenges in egocentric spatiotemporal grounding through novel benchmark and training dataset.

Abstract: Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .

[108] Team-Aware Football Player Tracking with SAM: An Appearance-Based Approach to Occlusion Recovery

Chamath Ranasinghe, Uthayasanker Thayasivam

Main category: cs.CV

TL;DR: Lightweight SAM-based football player tracking method combining Segment Anything Model with CSRT trackers and jersey color appearance models achieves 7.6-7.7 FPS with stable memory usage, maintaining high tracking success in various occlusion scenarios but struggling with long-term occlusions.

DetailsMotivation: Football player tracking faces challenges from frequent occlusions, similar player appearances, and rapid motion in crowded scenes. There's a need for lightweight, practical tracking systems that can handle these domain-specific challenges under resource constraints.

Method: Team-aware tracking system combining Segment Anything Model (SAM) for precise initialization with CSRT trackers and HSV histogram-based appearance models for re-identification. Uses jersey color-based appearance models to improve occlusion recovery.

Result: Achieves 7.6-7.7 FPS with stable memory usage (~1880 MB). Maintains 100% tracking success in light occlusions and 90% in crowded penalty-box scenarios with 5+ players. Appearance-based re-identification recovers 50% of heavy occlusions, but struggles with long-term occlusions where players leave frame (only 8.66% re-acquisition success).

Conclusion: Classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences. The SAM + CSRT combination provides consistent performance across crowd densities but has limitations with long-term occlusions. Results offer practical guidelines for deploying football tracking systems under resource constraints.

Abstract: Football player tracking is challenged by frequent occlusions, similar appearances, and rapid motion in crowded scenes. This paper presents a lightweight SAM-based tracking method combining the Segment Anything Model (SAM) with CSRT trackers and jersey color-based appearance models. We propose a team-aware tracking system that uses SAM for precise initialization and HSV histogram-based re-identification to improve occlusion recovery. Our evaluation measures three dimensions: processing speed (FPS and memory), tracking accuracy (success rate and box stability), and robustness (occlusion recovery and identity consistency). Experiments on football video sequences show that the approach achieves 7.6-7.7 FPS with stable memory usage (~1880 MB), maintaining 100 percent tracking success in light occlusions and 90 percent in crowded penalty-box scenarios with 5 or more players. Appearance-based re-identification recovers 50 percent of heavy occlusions, demonstrating the value of domain-specific cues. Analysis reveals key trade-offs: the SAM + CSRT combination provides consistent performance across crowd densities but struggles with long-term occlusions where players leave the frame, achieving only 8.66 percent re-acquisition success. These results offer practical guidelines for deploying football tracking systems under resource constraints, showing that classical tracker-based methods work well with continuous visibility but require stronger re-identification mechanisms for extended absences.
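
The jersey-color appearance model is the most self-contained piece of the pipeline; the OpenCV sketch below builds a hue-saturation histogram per player crop and re-identifies a lost track by histogram correlation. The bin counts, the similarity threshold, and the toy crops are reasonable defaults, not necessarily the paper's settings.

```python
import cv2
import numpy as np

def jersey_histogram(bgr_crop):
    """Hue-saturation histogram of a player crop, used as a lightweight
    appearance model (assumed bin counts: 30 hue x 32 saturation)."""
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def reidentify(lost_hist, candidate_crops, min_similarity=0.6):
    """Return the index of the candidate whose histogram correlates best with
    the lost track's stored histogram, or -1 if none is close enough."""
    best_idx, best_sim = -1, min_similarity
    for i, crop in enumerate(candidate_crops):
        sim = cv2.compareHist(lost_hist.reshape(30, 32),
                              jersey_histogram(crop).reshape(30, 32),
                              cv2.HISTCMP_CORREL)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx

# Toy usage with random "crops" standing in for detected players.
crops = [np.random.randint(0, 255, (64, 32, 3), dtype=np.uint8) for _ in range(3)]
lost = jersey_histogram(crops[1])
print(reidentify(lost, crops))   # expected to match candidate 1
```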

[109] You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Logan Lawrence, Oindrila Saha, Megan Wei, Chen Sun, Subhransu Maji, Grant Van Horn

Main category: cs.CV

TL;DR: nlg2choice: A two-stage method for zero-shot visual classification using MLLMs that first generates free-form responses then uses constrained decoding to extract choices, improving performance on fine-grained datasets with hundreds to thousands of options.

DetailsMotivation: Existing evaluation methods for MLLMs struggle with free-form responses in fine-grained visual classification where choice counts are high (hundreds to thousands) and choices are highly related. Current approaches don't handle multi-way MCQs beyond 5-way options well, and retrieval-based problems face computational challenges when computing probabilities over large choice sets.

Method: nlg2choice uses a two-stage approach: 1) Ask the MLLM an open-ended question about the visual task with minimal constraints, 2) Use text-only constrained decoding to predict the most likely choice from the candidate set. For retrieval settings, they compute probability of constrained responses with early stopping to improve throughput.

Result: The method shows improvement over a suite of seven fine-grained visual datasets in both classification and retrieval tasks. Performance remains consistent across various natural language implementations that users might employ with LLMs.

Conclusion: nlg2choice provides an effective solution for evaluating MLLMs on fine-grained visual classification tasks with large choice sets, addressing both classification and retrieval challenges while maintaining computational efficiency through early stopping techniques.

Abstract: Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don’t consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
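
The second, constrained stage of nlg2choice amounts to scoring each candidate answer under the model and abandoning a candidate once it can no longer win (valid because token log-probabilities are non-positive, so scores only decrease). The sketch below shows that scoring loop with a placeholder `token_logprob` callable and a toy scorer; it illustrates the idea, not the paper's implementation.

```python
import math

def pick_choice(candidates, token_logprob, context_tokens):
    """Score each candidate answer by the sum of its token log-probabilities
    given the context, stopping early once its running score falls below the
    best completed score. `token_logprob(prefix_tokens, next_token)` is an
    assumed callable returning log p(next_token | prefix_tokens)."""
    best_score, best_choice = -math.inf, None
    for choice, tokens in candidates.items():
        prefix, score = list(context_tokens), 0.0
        for tok in tokens:
            score += token_logprob(prefix, tok)
            prefix.append(tok)
            if score < best_score:        # early stop: cannot beat current best
                break
        else:
            if score > best_score:
                best_score, best_choice = score, choice
    return best_choice

# Toy usage with a fake scorer that prefers tokens already seen in the context.
def toy_logprob(prefix, tok):
    return math.log(0.5 if tok in prefix else 0.01)

ctx = ["a", "photo", "of", "a", "scarlet", "tanager"]
cands = {"scarlet tanager": ["scarlet", "tanager"],
         "summer tanager": ["summer", "tanager"]}
print(pick_choice(cands, toy_logprob, ctx))   # "scarlet tanager"
```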

[110] ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention

Huiguo He, Pengyu Yan, Ziqi Yi, Weizhi Zhong, Zheng Liu, Yejun Tang, Huan Yang, Kun Gai, Guanbin Li, Lianwen Jin

Main category: cs.CV

TL;DR: ContextDrag: A new drag-based image editing method that leverages contextual information from reference images using Context-preserving Token Injection and Position-Consistent Attention for better coherence and detail preservation.

DetailsMotivation: Existing drag-based image editing methods fail to fully exploit contextual information from reference images, including fine-grained texture details, leading to edits with limited coherence and fidelity.

Method: Introduces ContextDrag with two key components: 1) Context-preserving Token Injection (CTI) that injects noise-free reference features via Latent-space Reverse Mapping algorithm, and 2) Position-Consistent Attention (PCA) that re-encodes reference tokens with overlap-aware masking to eliminate irrelevant feature interference.

Result: Extensive experiments on DragBench-SR and DragBench-DR demonstrate that ContextDrag surpasses all existing state-of-the-art methods.

Conclusion: ContextDrag provides a new paradigm for drag-based editing that better leverages contextual modeling capabilities, enabling precise drag control while preserving semantic and texture consistency without requiring finetuning or inversion.

Abstract: Drag-based image editing aims to modify visual content following user-specified drag operations. Despite existing methods having made notable progress, they still fail to fully exploit the contextual information in the reference image, including fine-grained texture details, leading to edits with limited coherence and fidelity. To address this challenge, we introduce ContextDrag, a new paradigm for drag-based editing that leverages the strong contextual modeling capability of editing models, such as FLUX-Kontext. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details, without the need for finetuning or inversion. Specifically, ContextDrag introduces a novel Context-preserving Token Injection (CTI) that injects noise-free reference features into their correct destination locations via a Latent-space Reverse Mapping (LRM) algorithm. This strategy enables precise drag control while preserving consistency in both semantics and texture details. Second, ContextDrag adopts a novel Position-Consistent Attention (PCA), which positionally re-encodes the reference tokens and applies overlap-aware masking to eliminate interference from irrelevant reference features. Extensive experiments on DragBench-SR and DragBench-DR demonstrate that our approach surpasses all existing SOTA methods. Code will be publicly available.

[111] Visionary: The World Model Carrier Built on WebGPU-Powered Gaussian Splatting Platform

Yuning Gong, Yifei Liu, Yifan Zhan, Muyao Niu, Xueying Li, Yuanjun Liao, Jiaming Chen, Yuanyuan Gao, Jiaqi Chen, Minming Chen, Li Zhou, Yuning Zhang, Wei Wang, Xiaoqing Hou, Huaxi Huang, Shixiang Tang, Le Ma, Dingwen Zhang, Xue Yang, Junchi Yan, Yanchi Zhang, Yinqiang Zheng, Xiao Sun, Zhihang Zhong

Main category: cs.CV

TL;DR: Visionary is a web-native platform for real-time Gaussian Splatting and mesh rendering that unifies inference and rendering in the browser with WebGPU and ONNX, enabling dynamic neural processing with plug-and-play algorithms.

DetailsMotivation: Existing 3D Gaussian Splatting viewer solutions are fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models.

Method: Built on efficient WebGPU renderer with per-frame ONNX inference, introduces standardized Gaussian Generator contract for plug-and-play algorithms, offers three.js library with TypeScript API, and uses GPU-based primitive sorting.

Result: Achieves superior rendering efficiency compared to current Web viewers, supports multiple variants (MLP-based 3DGS, 4DGS, neural avatars, style transformation), and enables dynamic neural processing in browser.

Conclusion: Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.

Abstract: Neural rendering, particularly 3D Gaussian Splatting (3DGS), has evolved rapidly and become a key component for building world models. However, existing viewer solutions remain fragmented, heavy, or constrained by legacy pipelines, resulting in high deployment friction and limited support for dynamic content and generative models. In this work, we present Visionary, an open, web-native platform for real-time rendering of various Gaussian Splatting representations and meshes. Built on an efficient WebGPU renderer with per-frame ONNX inference, Visionary enables dynamic neural processing while maintaining a lightweight, “click-to-run” browser experience. It introduces a standardized Gaussian Generator contract, which not only supports standard 3DGS rendering but also allows plug-and-play algorithms to generate or update Gaussians each frame. Such inference also enables us to apply feedforward generative post-processing. The platform further offers a plug-in three.js library with a concise TypeScript API for seamless integration into existing web applications. Experiments show that, under identical 3DGS assets, Visionary achieves superior rendering efficiency compared to current Web viewers due to GPU-based primitive sorting. It already supports multiple variants, including MLP-based 3DGS, 4DGS, neural avatars, and style transformation or enhancement networks. By unifying inference and rendering directly in the browser, Visionary significantly lowers the barrier to reproduction, comparison, and deployment of 3DGS-family methods, serving as a unified World Model Carrier for both reconstructive and generative paradigms.

[112] Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Yuhan Liu, Lianhui Qin, Shengjie Wang

Main category: cs.CV

TL;DR: SV is a training-free framework that uses multiple small VLMs as draft experts to generate diverse reasoning paths, then a strong VLM synthesizes these paths for final answers, improving performance on information-intensive visual tasks while reducing computational costs.

DetailsMotivation: Large VLMs struggle with information-intensive images that densely interleave text and graphics, facing challenges in precise localization of critical cues and multi-hop reasoning to integrate dispersed evidence.

Method: Two-stage framework: 1) Draft stage - multiple lightweight VLMs act as draft experts to generate diverse reasoning paths with localization candidates; 2) Verdict stage - a strong VLM synthesizes these paths to produce final answers. Includes consensus expert selection mechanism to forward only high-agreement paths to verdict.

Result: Achieves consistent gains on challenging information-intensive and high-resolution VQA benchmarks (InfographicVQA, ChartMuseum, ChartQAPro, HR-Bench 4K). Enables error correction and cost-efficiency compared to large proprietary models or training pipelines.

Conclusion: SV effectively addresses VLMs’ limitations on dense information images by combining speculative decoding principles with multiple draft experts and a verdict model, achieving both improved accuracy and computational efficiency without training.

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
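
The consensus expert selection step can be illustrated in a few lines of Python: count how many draft experts agree on each short answer and forward only the agreeing reasoning paths to the verdict model. The exact agreement criterion in the paper may be more elaborate than the exact-match counting assumed here.

```python
from collections import Counter

def consensus_paths(draft_outputs, min_agreement=2):
    """Keep only the reasoning paths whose final short answer agrees with at
    least `min_agreement` draft experts; these are forwarded to the verdict
    model. Exact-match voting is an assumption for illustration."""
    votes = Counter(out["answer"] for out in draft_outputs)
    return [out for out in draft_outputs if votes[out["answer"]] >= min_agreement]

# Toy usage: three small-VLM drafts, two of which agree on the answer.
drafts = [
    {"expert": "vlm_a", "answer": "2019", "reasoning": "legend says 2019 ..."},
    {"expert": "vlm_b", "answer": "2019", "reasoning": "x-axis tick under the bar ..."},
    {"expert": "vlm_c", "answer": "2021", "reasoning": "title mentions 2021 ..."},
]
kept = consensus_paths(drafts)
prompt_for_verdict = "\n".join(d["reasoning"] for d in kept)
print([d["expert"] for d in kept])   # ['vlm_a', 'vlm_b']
```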

[113] Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions

Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer

Main category: cs.CV

TL;DR: PCI is a training-free framework that analyzes when specific concepts form during diffusion model generation by measuring concept insertion success across timesteps.

DetailsMotivation: Current diffusion model evaluation focuses only on final outputs, but understanding the dynamic generation process is crucial for analyzing controllability, reliability, and failure modes of these models.

Method: PCI (Prompt-Conditioned Intervention) is a model-agnostic framework that measures Concept Insertion Success (CIS) - the probability that a concept inserted at a given timestep is preserved in the final image, enabling temporal analysis of concept formation.

Result: PCI reveals diverse temporal behaviors across diffusion models, showing that certain phases are more favorable to specific concepts even within the same concept type. It also provides actionable insights for text-driven image editing with quantitatively stronger edits.

Conclusion: PCI enables systematic analysis of concept dynamics in diffusion models without model access or training, offering practical benefits for understanding model behavior and improving image editing interventions.

Abstract: Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines. Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
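
The CIS measurement is essentially a sweep over insertion timesteps: denoise with the base prompt up to step t, switch to the concept-augmented prompt from t on, and record how often the concept survives in the final image. The schematic loop below captures that structure with placeholder callables and a scalar toy "diffusion"; none of the function names or the toy setup come from the paper.

```python
import random

def concept_insertion_success(latents, base_cond, concept_cond, num_steps,
                              denoise_step, concept_present, trials):
    """Estimate CIS(t) for every insertion timestep t. `denoise_step` and
    `concept_present` are assumed callables standing in for a diffusion
    sampler step and a concept detector (e.g., a VQA or CLIP check)."""
    cis = {}
    for t_insert in range(num_steps):
        hits = 0
        for latent in latents[:trials]:
            x = latent
            for t in range(num_steps):
                # Before t_insert the sample follows the base prompt; from
                # t_insert onward, the concept-augmented prompt takes over.
                cond = base_cond if t < t_insert else concept_cond
                x = denoise_step(x, t, cond)
            hits += int(concept_present(x))
        cis[t_insert] = hits / trials
    return cis

# Toy stand-ins: the "latent" is a scalar nudged toward the conditioning value,
# and the concept counts as present if the final value exceeds 0.5.
def toy_denoise(x, t, cond):
    return x + 0.1 * (cond - x)

def toy_present(x):
    return x > 0.5

toy_latents = [random.uniform(-1, 1) for _ in range(8)]
print(concept_insertion_success(toy_latents, base_cond=0.0, concept_cond=1.0,
                                num_steps=20, denoise_step=toy_denoise,
                                concept_present=toy_present, trials=8))
```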

[114] On-the-fly Large-scale 3D Reconstruction from Multi-Camera Rigs

Yijia Guo, Tong Hu, Zhiwei Li, Liwen Hu, Keming Qian, Xitong Lin, Shengbo Chen, Tiejun Huang, Lei Ma

Main category: cs.CV

TL;DR: First on-the-fly 3D reconstruction framework for multi-camera rigs using 3D Gaussian Splatting, achieving drift-free trajectory estimation and efficient online reconstruction from raw multi-camera video streams.

DetailsMotivation: Existing on-the-fly 3DGS methods for monocular RGB streams fail to achieve complete 3D coverage due to limited field of view. Multi-camera rigs address this limitation but lack real-time reconstruction frameworks.

Method: Hierarchical camera initialization for coarse inter-camera alignment without calibration, lightweight multi-camera bundle adjustment, redundancy-free Gaussian sampling strategy, and frequency-aware optimization scheduler to fuse dense RGB streams into unified Gaussian representation.

Result: Reconstructs hundreds of meters of 3D scenes within just 2 minutes using raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.

Conclusion: The method enables efficient, real-time 3D reconstruction from multi-camera rigs with complete 3D coverage, addressing limitations of monocular approaches while maintaining high fidelity and speed.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled efficient free-viewpoint rendering and photorealistic scene reconstruction. While on-the-fly extensions of 3DGS have shown promise for real-time reconstruction from monocular RGB streams, they often fail to achieve complete 3D coverage due to the limited field of view (FOV). Employing a multi-camera rig fundamentally addresses this limitation. In this paper, we present the first on-the-fly 3D reconstruction framework for multi-camera rigs. Our method incrementally fuses dense RGB streams from multiple overlapping cameras into a unified Gaussian representation, achieving drift-free trajectory estimation and efficient online reconstruction. We propose a hierarchical camera initialization scheme that enables coarse inter-camera alignment without calibration, followed by a lightweight multi-camera bundle adjustment that stabilizes trajectories while maintaining real-time performance. Furthermore, we introduce a redundancy-free Gaussian sampling strategy and a frequency-aware optimization scheduler to reduce the number of Gaussian primitives and the required optimization iterations, thereby maintaining both efficiency and reconstruction fidelity. Our method reconstructs hundreds of meters of 3D scenes within just 2 minutes using only raw multi-camera video streams, demonstrating unprecedented speed, robustness, and fidelity for on-the-fly 3D scene reconstruction.

[115] Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models

Jiaming Zhang, Che Wang, Yang Cao, Longtao Huang, Wei Yang Bryan Lim

Main category: cs.CV

TL;DR: ReasonBreak is an adversarial framework that disrupts hierarchical reasoning in multi-modal large reasoning models to protect geographic location privacy in personal images, achieving significant improvements over existing methods.

DetailsMotivation: Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through sophisticated hierarchical chain-of-thought reasoning. Existing privacy protection techniques designed for perception-based models are ineffective against MLRMs' multi-step reasoning processes.

Method: ReasonBreak introduces concept-aware perturbations that strategically target critical conceptual dependencies within reasoning chains. The approach generates perturbations aligned with conceptual hierarchies rather than uniform noise, invalidating specific inference steps that cascade through subsequent reasoning stages. The authors also contribute GeoPrivacy-6K, a dataset of 6,341 ultra-high-resolution images with hierarchical concept annotations.

Result: Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) shows ReasonBreak achieves a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubles block-level protection (33.5% vs 16.8%).

Conclusion: This work establishes a new paradigm for privacy protection against reasoning-based threats by disrupting hierarchical reasoning in MLRMs through concept-aware perturbations, addressing a critical gap in existing privacy protection techniques.

Abstract: Multi-modal large reasoning models (MLRMs) pose significant privacy risks by inferring precise geographic locations from personal images through hierarchical chain-of-thought reasoning. Existing privacy protection techniques, primarily designed for perception-based models, prove ineffective against MLRMs’ sophisticated multi-step reasoning processes that analyze environmental cues. We introduce ReasonBreak, a novel adversarial framework specifically designed to disrupt hierarchical reasoning in MLRMs through concept-aware perturbations. Our approach is founded on the key insight that effective disruption of geographic reasoning requires perturbations aligned with conceptual hierarchies rather than uniform noise. ReasonBreak strategically targets critical conceptual dependencies within reasoning chains, generating perturbations that invalidate specific inference steps and cascade through subsequent reasoning stages. To facilitate this approach, we contribute GeoPrivacy-6K, a comprehensive dataset comprising 6,341 ultra-high-resolution images (≥2K) with hierarchical concept annotations. Extensive evaluation across seven state-of-the-art MLRMs (including GPT-o3, GPT-5, Gemini 2.5 Pro) demonstrates ReasonBreak’s superior effectiveness, achieving a 14.4% improvement in tract-level protection (33.8% vs 19.4%) and nearly doubling block-level protection (33.5% vs 16.8%). This work establishes a new paradigm for privacy protection against reasoning-based threats.

[116] Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models

Vasco Ramos, Regev Cohen, Idan Szpektor, Joao Magalhaes

Main category: cs.CV

TL;DR: NoisyCLIP enables real-time detection of text/image misalignments during diffusion model generation by measuring semantic alignment in noisy latent space, reducing computational cost by 50% while maintaining 98% of CLIP alignment performance.

DetailsMotivation: Current conditional diffusion models suffer from misalignment and hallucinations, requiring expensive post-generation alignment assessment. The authors hypothesize that misalignments can be detected early in the denoising process to enable real-time assessment without waiting for complete generation.

Method: Proposes NoisyCLIP, which measures semantic alignment in the noisy latent space during the reverse diffusion process. It’s the first method to explore prompt-to-latent misalignment detection using dual encoders during image generation.

Result: NoisyCLIP reduces computational cost by 50% while achieving 98% of CLIP alignment performance in Best-of-N settings. Enables real-time alignment assessment during generation without sacrificing semantic fidelity.

Conclusion: Early detection of text/image misalignments during diffusion model generation is feasible and efficient. NoisyCLIP provides a practical solution for real-time alignment assessment that reduces costs while maintaining high semantic fidelity.

Abstract: Conditional diffusion models rely on language-to-image alignment methods to steer the generation towards semantically accurate outputs. Despite the success of this architecture, misalignment and hallucinations remain common issues and require automatic misalignment detection tools to improve quality, for example by applying them in a Best-of-N (BoN) post-generation setting. Unfortunately, measuring the alignment after the generation is an expensive step since we need to wait for the overall generation to finish to determine prompt adherence. In contrast, this work hypothesizes that text/image misalignments can be detected early in the denoising process, enabling real-time alignment assessment without waiting for the complete generation. In particular, we propose NoisyCLIP, a method that measures semantic alignment in the noisy latent space. This work is the first to explore and benchmark prompt-to-latent misalignment detection during image generation using dual encoders in the reverse diffusion process. We evaluate NoisyCLIP qualitatively and quantitatively and find it reduces computational cost by 50% while achieving 98% of CLIP alignment performance in BoN settings. This approach enables real-time alignment assessment during generation, reducing costs without sacrificing semantic fidelity.
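
In a Best-of-N setting, the NoisyCLIP idea reduces to scoring alignment on a prediction derived from the noisy latent and pruning candidates early, so only the survivors are denoised to completion. The sketch below shows that control flow with placeholder callables for the sampler, the clean-image estimate, and the alignment score (all assumptions), plus a scalar toy so it runs; it is not the paper's API.

```python
import random

def best_of_n_with_early_scoring(latents, run_to_step, estimate_image,
                                 alignment_score, prune_at, keep=1):
    """Denoise all N candidates only up to `prune_at`, score alignment on the
    predicted clean image from the noisy latent, and finish denoising only
    the top-`keep` candidates. The three callables are placeholders."""
    partial = [run_to_step(z, 0, prune_at) for z in latents]
    scores = [alignment_score(estimate_image(z, prune_at)) for z in partial]
    ranked = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)
    survivors = ranked[:keep]
    return [run_to_step(partial[i], prune_at, None) for i in survivors]

# Toy stand-ins operating on scalar "latents": a no-op sampler, an identity
# clean-image estimate, and a score that prefers latents near 0.7.
toy_run = lambda z, a, b: z
toy_est = lambda z, t: z
toy_score = lambda img: -abs(img - 0.7)
zs = [random.random() for _ in range(8)]
print(best_of_n_with_early_scoring(zs, toy_run, toy_est, toy_score,
                                   prune_at=10, keep=2))
```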

[117] OCCDiff: Occupancy Diffusion Model for High-Fidelity 3D Building Reconstruction from Noisy Point Clouds

Jialu Sui, Rui Liu, Hongsheng Zhang

Main category: cs.CV

TL;DR: OCCDiff uses latent diffusion in occupancy function space to reconstruct buildings from LiDAR point clouds, handling varying densities and noise through continuous occupancy functions and multi-task training.

DetailsMotivation: The main challenge is accurately capturing building surfaces from LiDAR point clouds that have varying point densities and noise interference. Existing methods struggle to flexibly gather high-quality 3D profiles in diverse resolutions.

Method: OCCDiff combines latent diffusion with function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. It includes a point encoder for condition features, occupancy constraints, and multi-modal feature insertion. Uses multi-task training for robust feature learning.

Result: Empirical results show the method generates physically consistent samples with high fidelity to target distribution and exhibits robustness to noisy data.

Conclusion: OCCDiff effectively addresses building reconstruction challenges from LiDAR by using occupancy function space diffusion, enabling flexible high-quality 3D profile generation across varying resolutions and noise conditions.

Abstract: A major challenge in reconstructing buildings from LiDAR point clouds lies in accurately capturing building surfaces under varying point densities and noise interference. To flexibly gather high-quality 3D profiles of buildings at diverse resolutions, we propose OCCDiff, which applies latent diffusion in the occupancy function space. Our OCCDiff combines a latent diffusion process with a function autoencoder architecture to generate continuous occupancy functions evaluable at arbitrary locations. Moreover, a point encoder is proposed to provide conditioning features for diffusion learning, constrain the final occupancy prediction of the occupancy decoder, and insert multi-modal features into the latent encoder for latent generation. To further enhance the model performance, a multi-task training strategy is employed, ensuring that the point encoder learns diverse and robust feature representations. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy data.

[118] Thinking with Images via Self-Calling Agent

Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye

Main category: cs.CV

TL;DR: Self-Calling Chain-of-Thought (sCoT) reformulates multimodal visual reasoning as language-only CoT with self-calling to virtual subagents, improving training efficiency and performance.

DetailsMotivation: Optimizing interleaved multimodal Chain-of-Thought (iMCoT) through reinforcement learning is challenging due to reliance on scarce high-quality reasoning data and complex modality interleaving.

Method: sCoT reformulates iMCoT as language-only CoT where a main agent decomposes visual reasoning tasks into atomic subtasks and invokes parameter-sharing virtual subagents to solve them in isolated context, using group-relative policy optimization for reinforcement.

Result: On HR-Bench 4K, sCoT improves overall reasoning performance by up to 1.9% while using ~75% fewer GPU hours compared to strong baseline approaches.

Conclusion: sCoT provides an effective and efficient paradigm for visual reasoning that avoids explicit modality interleaving while enhancing reasoning performance through self-calling architecture and optimized reinforcement learning.

Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves the overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.

[119] MVP: Multiple View Prediction Improves GUI Grounding

Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

Main category: cs.CV

TL;DR: MVP is a training-free framework that improves GUI grounding stability by aggregating predictions from multiple cropped views to overcome coordinate prediction instability in existing models.

DetailsMotivation: Existing GUI grounding models suffer from significant coordinate prediction instability where minor visual perturbations (like cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for high-resolution screens with small UI elements.

Method: Multi-View Prediction (MVP) framework with two components: (1) Attention-Guided View Proposal - derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering - ensembles predictions by selecting the centroid of the densest spatial cluster. The approach is training-free and enhances grounding through multi-view inference.

Result: Extensive experiments show MVP significantly boosts performance across various models and benchmarks. On ScreenSpot-Pro, MVP improves: UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%.

Conclusion: MVP effectively addresses coordinate prediction instability in GUI grounding by aggregating predictions from multiple views, demonstrating that while single-view predictions may be unstable, multi-view aggregation can reliably distinguish correct coordinates from outliers, significantly improving grounding performance without requiring additional training.

Abstract: GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability, minor visual perturbations (e.g. cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for samples with high-resolution and small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP’s effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
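
The Multi-Coordinates Clustering step is easy to illustrate: treat each view's predicted coordinate (mapped back to full-image pixels) as a vote, find the densest neighborhood, and return its centroid. The NumPy sketch below does exactly that under an assumed clustering radius; the paper's aggregation may differ in details.

```python
import numpy as np

def densest_cluster_centroid(points, radius=14.0):
    """Return the centroid of the densest radius-neighborhood among the
    per-view predicted click points. The radius is an assumed hyperparameter."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_counts = (dists <= radius).sum(axis=1)
    anchor = int(neighbor_counts.argmax())          # point with the most neighbors
    cluster = points[dists[anchor] <= radius]       # its radius-neighborhood
    return cluster.mean(axis=0)

# Toy usage: four views agree around (412, 903); one crop yields an outlier.
votes = [(410, 905), (414, 901), (412, 902), (980, 120), (411, 904)]
print(densest_cluster_centroid(votes))   # approximately [411.75, 903.0]
```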

[120] PaintFlow: A Unified Framework for Interactive Oil Paintings Editing and Generation

Zhangli Hu, Ye Chen, Jiajun Yao, Bingbing Ni

Main category: cs.CV

TL;DR: A unified multimodal framework for oil painting generation and editing that combines reference images, hand-drawn sketches, and text prompts while maintaining consistent painting style.

DetailsMotivation: Oil painting is challenging for digital generation/editing due to complex brushstroke dynamics and stylized characteristics. Existing methods are limited by training data distribution and focus mainly on modifying real photos rather than artistic creation.

Method: Three key technical advancements: 1) Enhanced training with spatial alignment and semantic enhancement conditioning (masks/sketches as spatial constraints, reference images/text as feature constraints), 2) Self-supervised style transfer pipeline using Stroke-Based Rendering to create large-scale paired training data, 3) AdaIN operator integration during inference for stylistic consistency.

Result: The system enables fine-grained interactive oil painting editing while preserving artistic qualities, achieving unprecedented imagination realization in stylized oil painting generation and editing.

Conclusion: Proposed unified multimodal framework successfully addresses oil painting generation/editing challenges through spatial-semantic conditioning, data augmentation via SBR-based style transfer, and style consistency mechanisms, enabling interactive artistic creation.

Abstract: Oil painting, as a high-level medium that blends human abstract thinking with artistic expression, poses substantial challenges for digital generation and editing due to its intricate brushstroke dynamics and stylized characteristics. Existing generation and editing techniques are often constrained by the distribution of training data and primarily focus on modifying real photographs. In this work, we introduce a unified multimodal framework for oil painting generation and editing. The proposed system allows users to incorporate reference images for precise semantic control, hand-drawn sketches for spatial structure alignment, and natural language prompts for high-level semantic guidance, while consistently maintaining a unified painting style across all outputs. Our method achieves interactive oil painting creation through three crucial technical advancements. First, we enhance the training stage with spatial alignment and semantic enhancement conditioning strategy, which map masks and sketches into spatial constraints, and encode contextual embedding from reference images and text into feature constraints, enabling object-level semantic alignment. Second, to overcome data scarcity, we propose a self-supervised style transfer pipeline based on Stroke-Based Rendering (SBR), which simulates the inpainting dynamics of oil painting restoration, converting real images into stylized oil paintings with preserved brushstroke textures to construct a large-scale paired training dataset. Finally, during inference, we integrate features using the AdaIN operator to ensure stylistic consistency. Extensive experiments demonstrate that our interactive system enables fine-grained editing while preserving the artistic qualities of oil paintings, achieving an unprecedented level of imagination realization in stylized oil paintings generation and editing.
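
The AdaIN operator mentioned for inference-time style consistency is a standard feature-statistics transfer; a minimal sketch follows, assuming per-channel statistics over a (C, H, W) feature map. How PaintFlow wires this into its pipeline is not specified in the abstract.

```python
# Minimal AdaIN sketch: re-normalise content features so that each channel carries
# the style features' mean and standard deviation. Shapes and epsilon are assumptions.
import numpy as np

def adain(content, style, eps=1e-5):
    """content, style: float arrays of shape (C, H, W)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

content = np.random.rand(64, 32, 32).astype(np.float32)
style = np.random.rand(64, 32, 32).astype(np.float32)
print(adain(content, style).shape)  # (64, 32, 32)
```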

[121] Photo3D: Advancing Photorealistic 3D Generation through Structure-Aligned Detail Enhancement

Xinyue Liang, Zhinyuan Ma, Lingchen Sun, Yanjun Guo, Lei Zhang

Main category: cs.CV

TL;DR: Photo3D is a framework for photorealistic 3D generation that uses GPT-4o-Image generated images to enhance texture details while maintaining structural consistency with 3D-native geometry.

DetailsMotivation: Current 3D-native generators produce reliable geometry but lack realistic appearances due to limited access to diverse, high-quality real-world 3D assets with rich texture details. Capturing such data is difficult due to diverse scene scales, non-rigid motions, and scanner limitations.

Method: 1) Uses GPT-4o-Image generated images as data source; 2) Designs structure-aligned multi-view synthesis pipeline to address multi-view inconsistency; 3) Constructs detail-enhanced multi-view dataset paired with 3D geometry; 4) Implements realistic detail enhancement scheme using perceptual feature adaptation and semantic structure matching; 5) Provides dedicated training strategies for both geometry-texture coupled and decoupled 3D-native generation paradigms.

Result: Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance in experiments.

Conclusion: Photo3D successfully advances photorealistic 3D generation by leveraging AI-generated images to overcome data scarcity issues, while maintaining structural consistency through novel alignment techniques, making it applicable to various 3D-native generation approaches.

Abstract: Although recent 3D-native generators have made great progress in synthesizing reliable geometry, they still fall short in achieving realistic appearances. A key obstacle lies in the lack of diverse and high-quality real-world 3D assets with rich texture details, since capturing such data is intrinsically difficult due to the diverse scales of scenes, non-rigid motions of objects, and the limited precision of 3D scanners. We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT-4o-Image model. Considering that the generated images can distort 3D structures due to their lack of multi-view consistency, we design a structure-aligned multi-view synthesis pipeline and construct a detail-enhanced multi-view dataset paired with 3D geometry. Building on it, we present a realistic detail enhancement scheme that leverages perceptual feature adaptation and semantic structure matching to enforce appearance consistency with realistic details while preserving the structural consistency with the 3D-native geometry. Our scheme is general to different 3D-native generators, and we present dedicated training strategies to facilitate the optimization of geometry-texture coupled and decoupled 3D-native generation paradigms. Experiments demonstrate that Photo3D generalizes well across diverse 3D-native generation paradigms and achieves state-of-the-art photorealistic 3D generation performance.

[122] A Novel Wasserstein Quaternion Generative Adversarial Network for Color Image Generation

Zhigang Jia, Duan Wang, Hengkai Wang, Yajun Xie, Meixiang Zhao, Xiaoyu Zhao

Main category: cs.CV

TL;DR: Proposes a quaternion Wasserstein distance and Wasserstein quaternion GAN for color image generation, addressing color channel correlations and data distribution issues.

DetailsMotivation: Existing color image generation models ignore correlations among color channels (leading to chromatic aberration), and lack systematic theory for measuring color image data distributions.

Method: Defines new quaternion Wasserstein distance with dual theory; develops quaternion linear programming with strong duality via quaternion convex set separation theorem and Farkas lemma; proposes Wasserstein quaternion GAN.

Result: The proposed model surpasses both standard/quaternion GANs and Wasserstein GANs in generation efficiency and image quality.

Conclusion: Quaternion Wasserstein distance provides better theoretical foundation for color image generation, and the Wasserstein quaternion GAN offers superior performance by properly handling color channel correlations.

Abstract: Color image generation has a wide range of applications, but existing generation models ignore the correlation among color channels, which may lead to chromatic aberration problems. In addition, the data distribution problem of color images has not been systematically elaborated, so there is still a lack of theory for measuring distances between color image datasets. In this paper, we define a new quaternion Wasserstein distance and develop its dual theory. To deal with the quaternion linear programming problem, we derive the strong duality form with the help of a quaternion convex set separation theorem and a quaternion Farkas lemma. Using the quaternion Wasserstein distance, we propose a novel Wasserstein quaternion generative adversarial network. Experiments demonstrate that this novel model surpasses both (quaternion) generative adversarial networks and the Wasserstein generative adversarial network in terms of generation efficiency and image quality.
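
For background, the real-valued Wasserstein-1 distance that Wasserstein GANs optimize has the Kantorovich-Rubinstein dual below; the paper's contribution, not reproduced here, is a quaternion-valued analogue of this distance together with its strong duality.

```latex
% Kantorovich-Rubinstein dual of the (real-valued) Wasserstein-1 distance used by WGANs;
% the quaternion Wasserstein distance generalizes this construction to quaternion-valued data.
W_1(P_r, P_g) \;=\; \sup_{\lVert f \rVert_L \le 1}\;
    \mathbb{E}_{x \sim P_r}\!\left[f(x)\right] \;-\; \mathbb{E}_{x \sim P_g}\!\left[f(x)\right]
```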

[123] Fast-ARDiff: An Entropy-informed Acceleration Framework for Continuous Space Autoregressive Generation

Zhen Zou, Xiaoxiao Ma, Jie Huang, Zichao Yu, Feng Zhao

Main category: cs.CV

TL;DR: Fast-ARDiff is a unified AR-diffusion framework that accelerates both AR speculative decoding and diffusion decoding through entropy-informed strategies and joint optimization.

DetailsMotivation: Current AR-diffusion hybrid models suffer from high latency due to sequential AR generation and iterative denoising, creating a bottleneck that needs to be addressed.

Method: 1) Entropy-informed speculative strategy aligns draft model’s entropy with target model to reduce rejection rates; 2) Integrated end-to-end framework with dynamic scheduler prioritizing AR optimization; 3) Joint distillation combining trajectory and distribution matching for diffusion optimization; 4) Inference uses shallow feature entropy to pre-filter low-entropy drafts.

Result: Achieves state-of-the-art acceleration: 4.3× lossless speedup on ImageNet 256×256 with TransDiff, and 3× acceleration on text-conditioned generation with NextStep-1.

Conclusion: Fast-ARDiff successfully addresses the latency bottleneck in AR-diffusion hybrids through unified optimization, achieving significant speed improvements while maintaining quality.

Abstract: Autoregressive (AR)-diffusion hybrid paradigms combine AR’s structured modeling with diffusion’s photorealistic synthesis, yet suffer from high latency due to sequential AR generation and iterative denoising. In this work, we tackle this bottleneck and propose a unified AR-diffusion framework, Fast-ARDiff, that jointly optimizes both components, accelerating AR speculative decoding while simultaneously facilitating faster diffusion decoding. Specifically: (1) The entropy-informed speculative strategy encourages the draft model to produce higher-entropy representations aligned with the target model’s entropy characteristics, mitigating entropy mismatch and high rejection rates caused by draft overconfidence. (2) For diffusion decoding, rather than treating it as an independent module, we integrate it into the same end-to-end framework using a dynamic scheduler that prioritizes AR optimization to guide the diffusion part in further steps. The diffusion part is optimized through a joint distillation framework combining trajectory and distribution matching, ensuring stable training and high-quality synthesis with extremely few steps. During inference, shallow feature entropy from the AR module is used to pre-filter low-entropy drafts, avoiding redundant computation and improving latency. Fast-ARDiff achieves state-of-the-art acceleration across diverse models: on ImageNet 256×256, TransDiff attains 4.3× lossless speedup, and NextStep-1 achieves 3× acceleration on text-conditioned generation. Code will be available at https://github.com/aSleepyTree/Fast-ARDiff.

[124] An Iteration-Free Fixed-Point Estimator for Diffusion Inversion

Yifei Chen, Kaiyu Song, Yan Pan, Jianxing Yu, Jian Yin, Hanjiang Lai

Main category: cs.CV

TL;DR: Proposes an iteration-free fixed-point estimator for diffusion inversion that avoids computational costs of iterative methods while maintaining reconstruction accuracy.

DetailsMotivation: Current fixed-point iteration methods for diffusion inversion suffer from high computational costs due to iterative nature and complex hyperparameter selection, creating need for more efficient approach.

Method: Derives explicit expression of fixed point from ideal inversion step, then introduces error approximation using calculable error from previous step to approximate unknown error at current step, creating calculable approximate expression for fixed point.

Result: Method achieves consistent and superior reconstruction performance on NOCAPS and MS-COCO datasets compared to DDIM inversion and other fixed-point iteration methods, without additional iterations or training.

Conclusion: Proposed iteration-free fixed-point estimator provides efficient and effective solution for diffusion inversion, offering unbiased estimation with low variance while eliminating computational overhead of iterative approaches.

Abstract: Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
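
For context, the fixed-point view of an inversion step can be sketched as below. This is background DDIM algebra, not the paper's estimator; eps_model is a hypothetical noise-prediction callable, and the final comment only marks where the paper's iteration-free error approximation would replace the inner loop.

```python
# Background sketch of fixed-point DDIM inversion. a_t / a_next are cumulative
# alpha-bar values for the current and next (noisier) timestep.
import numpy as np

def ddim_invert_step(x_t, t_next, a_t, a_next, eps_model, n_iters=5):
    def step(x_guess):
        # the ideal update evaluates the noise predictor at the *unknown* x_{t_next}
        eps = eps_model(x_guess, t_next)
        x0 = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        return np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps

    x_next = step(x_t)            # naive (zeroth) estimate
    for _ in range(n_iters):      # fixed-point iteration refines it ...
        x_next = step(x_next)
    # ... whereas an iteration-free estimator would instead add a correction term
    # approximated from the previous inversion step's calculable error, skipping the loop.
    return x_next
```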

[125] Disturbance-Free Surgical Video Generation from Multi-Camera Shadowless Lamps for Open Surgery

Yuna Kato, Shohei Mori, Hideo Saito, Yoshifumi Takatsume, Hiroki Kajita, Mariko Isogawa

Main category: cs.CV

TL;DR: Automated video alignment system for surgical recordings that adjusts camera views when lighting system moves, selects least-occluded views, and generates stable surgical field videos from fixed perspective.

DetailsMotivation: Surgical video recordings are essential for education/research but suffer from surgeon occlusion and require frequent camera adjustments. Existing multi-camera setups on shadowless lamps help but need manual post-processing alignment when lamps are moved.

Method: Proposed method automatically detects when lighting system moves, realigns camera frames, selects camera with least occlusion, and synthesizes consistent surgical field videos from fixed perspective. Includes multiple synthesis options.

Result: Surgeon user study showed method’s videos superior to conventional methods for confirming surgical area and viewing comfort. Improved video quality over existing techniques. User study also assessed surgeon preferences for different synthesis options.

Conclusion: Fully automated alignment system successfully addresses occlusion and camera movement issues in surgical video recording, producing higher quality educational/research videos preferred by surgeons.

Abstract: Video recordings of open surgeries are greatly required for education and research purposes. However, capturing unobstructed videos is challenging since surgeons frequently block the camera field of view. To avoid occlusion, the positions and angles of the camera must be frequently adjusted, which is highly labor-intensive. Prior work has addressed this issue by installing multiple cameras on a shadowless lamp and arranging them to fully surround the surgical area. This setup increases the chances of some cameras capturing an unobstructed view. However, manual image alignment is needed in post-processing since camera configurations change every time surgeons move the lamp for optimal lighting. This paper aims to fully automate this alignment task. The proposed method identifies frames in which the lighting system moves, realigns them, and selects the camera with the least occlusion to generate a video that consistently presents the surgical field from a fixed perspective. A user study involving surgeons demonstrated that videos generated by our method were superior to those produced by conventional methods in terms of the ease of confirming the surgical area and the comfort during video viewing. Additionally, our approach showed improvements in video quality over existing techniques. Furthermore, we implemented several synthesis options for the proposed view-synthesis method and conducted a user study to assess surgeons’ preferences for each option.

[126] SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds

Alexander Dow, Manduhu Manduhu, Matheus Santos, Ben Bartlett, Gerard Dooly, James Riordan

Main category: cs.CV

TL;DR: SSCATeR algorithm uses temporal data recycling and sparse scatter-based convolutions to process only changing regions in LiDAR streams, achieving 6.61x speedup while maintaining accuracy.

DetailsMotivation: Traditional LiDAR processing performs redundant computations on unchanged regions between frames. The goal is to reduce computational load by focusing only on areas with point cloud changes.

Method: Uses sliding time windows with short strides, stores convolution results between passes, and extends scatter-based convolutions with temporal data recycling (SSCATeR). Treats LiDAR as continuous stream and processes only changing parts.

Result: Achieves up to 6.61-fold reduction in processing time while producing identical feature maps to traditional sparse convolution methods. Maintains accuracy while significantly improving computational efficiency.

Conclusion: SSCATeR successfully enables efficient LiDAR processing by exploiting temporal sparsity through data reuse, making real-time object detection more feasible without sacrificing accuracy.

Abstract: This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.
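
A toy sketch of the data-recycling idea follows, under the assumption that a cached convolution output for the previous frame is available: only output sites whose receptive field touches a changed input cell are recomputed. The actual method operates with sparse scatter-based GPU kernels on LiDAR voxel data, not dense 2D arrays.

```python
# Toy 2D illustration: recompute convolutions only where the input changed,
# reuse cached results everywhere else. Border handling is simplified.
import numpy as np

def conv_at(frame, kernel, i, j):
    k = kernel.shape[0] // 2
    patch = frame[i - k:i + k + 1, j - k:j + k + 1]
    return float((patch * kernel).sum())

def recycled_conv(prev_frame, new_frame, kernel, cached_out):
    """cached_out: full convolution output computed for prev_frame (same shape)."""
    k = kernel.shape[0] // 2
    out = cached_out.copy()
    changed = np.argwhere(prev_frame != new_frame)
    dirty = set()
    for (i, j) in changed:                      # dilate changes by the kernel footprint
        for di in range(-k, k + 1):
            for dj in range(-k, k + 1):
                dirty.add((i + di, j + dj))
    h, w = new_frame.shape
    for (i, j) in dirty:                        # recompute only the dirty output sites
        if k <= i < h - k and k <= j < w - k:
            out[i, j] = conv_at(new_frame, kernel, i, j)
    return out
```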

[127] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani

Main category: cs.CV

TL;DR: Large-scale automated framework discovers and explains visual concept representations across human cortex using unsupervised pattern discovery and automated explanation generation.

DetailsMotivation: Understanding how the human brain represents visual concepts and where these representations are encoded remains challenging due to complex brain signals, vast concept space, and limitations of small-scale, manual studies that lack systematic validation.

Method: Two-stage framework: 1) Discover candidate interpretable patterns in fMRI activity using unsupervised data-driven decomposition methods; 2) Explain each pattern by identifying natural images that most strongly elicit it and generating natural-language descriptions. Uses automated pipeline to test multiple candidate explanations, assign reliability scores, and select most consistent descriptions.

Result: Framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

Conclusion: The automated, large-scale approach enables systematic discovery and explanation of visual representations across the human cortex, overcoming limitations of traditional small-scale studies.

Abstract: Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and properties, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns quantitative reliability scores, and selects the most consistent description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

[128] Modular Neural Image Signal Processing

Mahmoud Afifi, Zhongling Wang, Ran Zhang, Michael S. Brown

Main category: cs.CV

TL;DR: Modular neural ISP framework for raw-to-display image processing with full control over intermediate stages, enabling high-quality rendering, editing flexibility, and generalization to unseen cameras.

DetailsMotivation: Prior neural ISP designs lack modularity and control over intermediate rendering stages, limiting scalability, debuggability, generalization, and user-preference flexibility.

Method: Modular neural ISP framework with intermediate stage control, learning-based approach with variants of different capacities (0.5M to 3.9M parameters), and user-interactive photo-editing tool integration.

Result: Achieves competitive qualitative and quantitative results across multiple test sets, enables diverse editing operations and picture styles, and supports unlimited post-editable re-rendering.

Conclusion: Modular neural ISP design provides superior control, scalability, and flexibility compared to prior approaches, enabling practical photo-editing applications with high-quality rendering.

Abstract: This paper presents a modular neural image signal processing (ISP) framework that processes raw inputs and renders high-quality display-referred images. Unlike prior neural ISP designs, our method introduces a high degree of modularity, providing full control over multiple intermediate stages of the rendering process. This modular design not only achieves high rendering accuracy but also improves scalability, debuggability, generalization to unseen cameras, and flexibility to match different user-preference styles. To demonstrate the advantages of this design, we built a user-interactive photo-editing tool that leverages our neural ISP to support diverse editing operations and picture styles. The tool is carefully engineered to take advantage of the high-quality rendering of our neural ISP and to enable unlimited post-editable re-rendering. Our method is a fully learning-based framework with variants of different capacities, all of moderate size (ranging from ~0.5 M to ~3.9 M parameters for the entire pipeline), and consistently delivers competitive qualitative and quantitative results across multiple test sets. Watch the supplemental video at: https://youtu.be/ByhQjQSjxVM

[129] Decoupling Template Bias in CLIP: Harnessing Empty Prompts for Enhanced Few-Shot Learning

Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Zhimeng Huang, Yuhua Li

Main category: cs.CV

TL;DR: CLIP’s template-sample similarity introduces bias, causing models to rely on template proximity rather than true category alignment. The paper proposes using empty prompts to capture unbiased template features and offset this bias through pre-training and fine-tuning stages.

DetailsMotivation: The authors identify that CLIP's few-shot learning performance is compromised by template-sample similarity (TSS) bias, where models prioritize template resemblance over genuine sample-to-category alignment, reducing accuracy and robustness.

Method: A two-stage framework using empty prompts (textual inputs conveying “emptiness” without category info) to capture unbiased template features. During pre-training, empty prompts reduce template-induced bias in the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct image-category alignment.

Result: Experiments across multiple benchmarks show the template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness compared to baseline CLIP.

Conclusion: The proposed framework effectively decouples template bias in CLIP, improving few-shot learning by ensuring models focus on relevant visual cues rather than template proximity, with code available for reproducibility.

Abstract: The Contrastive Language-Image Pre-Training (CLIP) model excels in few-shot learning by aligning visual and textual representations. Our study shows that template-sample similarity (TSS), defined as the resemblance between a text template and an image sample, introduces bias. This bias leads the model to rely on template proximity rather than true sample-to-category alignment, reducing both accuracy and robustness in classification. We present a framework that uses empty prompts, textual inputs that convey the idea of “emptiness” without category information. These prompts capture unbiased template features and offset TSS bias. The framework employs two stages. During pre-training, empty prompts reveal and reduce template-induced bias within the CLIP encoder. During few-shot fine-tuning, a bias calibration loss enforces correct alignment between images and their categories, ensuring the model focuses on relevant visual cues. Experiments across multiple benchmarks demonstrate that our template correction method significantly reduces performance fluctuations caused by TSS, yielding higher classification accuracy and stronger robustness. The repository of this project is available at https://github.com/zhenyuZ-HUST/Decoupling-Template-Bias-in-CLIP.
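
One plausible way to use an empty-prompt embedding at scoring time is sketched below. The subtraction with a scalar weight alpha is an assumption made for illustration; it is not the paper's calibration loss, which is applied during pre-training and fine-tuning.

```python
# Hedged sketch of offsetting template-sample similarity (TSS) bias with an
# "empty prompt" embedding (e.g. the bare template "a photo of"). Inputs are
# assumed to be L2-normalised CLIP-style embeddings.
import numpy as np

def debiased_scores(image_emb, class_embs, empty_emb, alpha=1.0):
    """image_emb: (D,), class_embs: (K, D), empty_emb: (D,)."""
    class_sim = class_embs @ image_emb          # similarity to full class prompts
    template_sim = empty_emb @ image_emb        # similarity driven by the template alone
    return class_sim - alpha * template_sim     # remove the template-only component

rng = np.random.default_rng(0)
norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
img = norm(rng.normal(size=512))
classes = norm(rng.normal(size=(10, 512)))
empty = norm(rng.normal(size=512))
print(debiased_scores(img, classes, empty).shape)  # (10,)
```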

[130] Instance-Aware Test-Time Segmentation for Continual Domain Shifts

Seunghwan Lee, Inyoung Jung, Hojoon Lee, Eunil Park, Sungeun Hong

Main category: cs.CV

TL;DR: Proposes adaptive pseudo-label adjustment for continual test-time adaptation in semantic segmentation, using class- and instance-aware thresholds to handle varying difficulty across classes and instances.

DetailsMotivation: Existing CTTA methods use fixed or batch-level thresholds that cannot account for varying difficulty across classes and instances, which is especially problematic in semantic segmentation requiring dense, multi-class predictions.

Method: Adaptively adjusts pseudo labels to reflect confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts, using fine-grained, class- and instance-aware adaptation.

Result: Extensive experiments across eight CTTA and TTA scenarios (including synthetic-to-real and long-term shifts) show consistent outperformance over state-of-the-art techniques, setting new standard for semantic segmentation under evolving conditions.

Conclusion: The proposed adaptive approach produces more reliable supervision and mitigates error accumulation throughout continual adaptation, advancing semantic segmentation performance in evolving domain scenarios.

Abstract: Continual Test-Time Adaptation (CTTA) enables pre-trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch-level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi-class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine-grained, class- and instance-aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic-to-real and long-term shifts, show that our method consistently outperforms state-of-the-art techniques, setting a new standard for semantic segmentation under evolving conditions.
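
A hedged sketch of image-level, class-aware pseudo-label filtering follows: thresholds are taken as per-class confidence quantiles within each image. The quantile rule is an illustrative stand-in for the paper's adaptive thresholding, not the authors' exact formulation.

```python
# Sketch: build pseudo labels for one image so that the keep/discard decision
# adapts to the confidence distribution of each predicted class in that image.
import numpy as np

def adaptive_pseudo_labels(probs, quantile=0.6, ignore_index=255):
    """probs: (C, H, W) softmax outputs for one image."""
    conf = probs.max(axis=0)                 # per-pixel confidence
    labels = probs.argmax(axis=0)            # per-pixel argmax class
    pseudo = np.full(labels.shape, ignore_index, dtype=np.int64)
    for c in np.unique(labels):
        mask = labels == c
        thr = np.quantile(conf[mask], quantile)   # threshold adapted to this image and class
        pseudo[mask & (conf >= thr)] = c
    return pseudo
```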

[131] Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu

Main category: cs.CV

TL;DR: A unified aerial VLN framework using only monocular RGB and language, formulating navigation as next-token prediction with multi-task learning and redundancy reduction techniques.

DetailsMotivation: Existing aerial VLN methods require panoramic images, depth inputs, or odometry, increasing system cost and complexity, which hinders practical deployment on lightweight UAVs. There's a need for simpler, more deployable solutions.

Method: A unified framework using only egocentric monocular RGB and language instructions. Formulates navigation as next-token prediction with prompt-guided multi-task learning. Includes keyframe selection to reduce visual redundancy, and action merging with label reweighting to address supervision imbalance.

Result: Achieves strong results on Aerial VLN benchmark under monocular RGB-only setting, outperforming existing RGB-only baselines and narrowing gap with state-of-the-art panoramic RGB-D methods. Validated through extensive experiments and ablation studies.

Conclusion: The proposed framework demonstrates that effective aerial VLN can be achieved with minimal sensing requirements (monocular RGB only), making it more practical for real-world lightweight UAV deployment while maintaining competitive performance.

Abstract: Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
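
The keyframe selection strategy can be illustrated with a simple redundancy filter: keep a frame only if it differs enough from the last kept keyframe. The cosine threshold and the pairwise comparison rule are assumptions for illustration, not the paper's exact criterion.

```python
# Sketch: drop visually redundant frames based on embedding similarity.
import numpy as np

def select_keyframes(frame_feats, max_cosine=0.92):
    """frame_feats: (T, D) per-frame embeddings in temporal order."""
    keep = [0]
    for t in range(1, len(frame_feats)):
        a, b = frame_feats[keep[-1]], frame_feats[t]
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if cos < max_cosine:        # frame adds new information, so keep it
            keep.append(t)
    return keep
```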

[132] From Cells to Survival: Hierarchical Analysis of Cell Inter-Relations in Multiplex Microscopy for Lung Cancer Prognosis

Olle Edgren Schüllerqvist, Jens Baumann, Joakim Lindblad, Love Nordling, Artur Mezheyeuski, Patrick Micke, Nataša Sladoje

Main category: cs.CV

TL;DR: HiGINE is a hierarchical graph-based approach that predicts lung cancer patient survival from multiplex immunofluorescence images by modeling cell neighborhood interactions and fusing with clinical stage data.

DetailsMotivation: The tumor microenvironment contains valuable prognostic biomarkers, but existing methods fail to capture the complex cellular interactions needed for accurate risk stratification in lung cancer.

Method: Hierarchical graph-based model that encodes local and global inter-relations in cell neighborhoods using cell type and morphology information from mIF images, with multimodal fusion incorporating cancer stage data.

Result: Validated on two public datasets, showing improved risk stratification, robustness, and generalizability compared to existing approaches.

Conclusion: HiGINE effectively leverages TME complexity through hierarchical graph modeling and multimodal fusion to enhance lung cancer survival prediction and clinical risk assessment.

Abstract: The tumor microenvironment (TME) has emerged as a promising source of prognostic biomarkers. To fully leverage its potential, analysis methods must capture complex interactions between different cell types. We propose HiGINE – a hierarchical graph-based approach to predict patient survival (short vs. long) from TME characterization in multiplex immunofluorescence (mIF) images and enhance risk stratification in lung cancer. Our model encodes both local and global inter-relations in cell neighborhoods, incorporating information about cell types and morphology. Multimodal fusion, aggregating cancer stage with mIF-derived features, further boosts performance. We validate HiGINE on two public datasets, demonstrating improved risk stratification, robustness, and generalizability.

[133] Automated Pollen Recognition in Optical and Holographic Microscopy Images

Swarn Singh Warshaneyan, Maksims Ivanovs, Blaž Cugmas, Inese Bērziņa, Laura Goldberga, Mindaugas Tamosiunas, Roberts Kadiķis

Main category: cs.CV

TL;DR: Deep learning models (YOLOv8s + MobileNetV3L) applied to pollen grain detection/classification in optical & holographic microscopy, achieving 91.3% mAP50 detection and 97% classification accuracy on optical images, with performance improvements on holographic images through dataset expansion techniques.

DetailsMotivation: To improve and automate pollen grain detection and classification for veterinary cytology applications using both optical and holographic microscopy, addressing the performance gap between these imaging modalities.

Method: Used YOLOv8s for object detection and MobileNetV3L for classification, evaluated across optical and holographic microscopy images. Applied dataset expansion techniques including automated labeling and bounding box area enlargement to improve holographic image performance.

Result: Optical images: 91.3% mAP50 for detection, 97% overall accuracy for classification. Holographic images: initial performance was low (2.49% mAP50 detection, 42% classification), improved to 13.3% mAP50 detection and 54% classification after dataset expansion techniques.

Conclusion: Deep learning techniques can be effectively paired with cost-effective lensless digital holographic microscopy devices for image classification tasks, demonstrating potential for automated pollen analysis in veterinary cytology applications.

Abstract: This study explores the application of deep learning to improve and automate pollen grain detection and classification in both optical and holographic microscopy images, with a particular focus on veterinary cytology use cases. We used YOLOv8s for object detection and MobileNetV3L for the classification task, evaluating their performance across imaging modalities. The models achieved 91.3% mAP50 for detection and 97% overall accuracy for classification on optical images, whereas the initial performance on greyscale holographic images was substantially lower. We addressed the performance gap issue through dataset expansion using automated labeling and bounding box area enlargement. These techniques, applied to holographic images, improved detection performance from 2.49% to 13.3% mAP50 and classification performance from 42% to 54%. Our work demonstrates that, at least for image classification tasks, it is possible to pair deep learning techniques with cost-effective lensless digital holographic microscopy devices.
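
The bounding-box area enlargement applied to the holographic data can be illustrated as a centered scaling with clipping to the image bounds; the scale factor below is an assumption.

```python
# Sketch: enlarge a detection box around its centre and clip to the image.
def enlarge_box(box, img_w, img_h, scale=1.3):
    """box: (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

print(enlarge_box((100, 100, 140, 160), img_w=640, img_h=480))  # (94.0, 91.0, 146.0, 169.0)
```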

[134] OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics

Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park

Main category: cs.CV

TL;DR: OpenMonoGS-SLAM: First monocular SLAM framework combining 3D Gaussian Splatting with open-set semantic understanding using visual foundation models, requiring no depth sensors or semantic ground truth.

DetailsMotivation: Current SLAM systems with semantic understanding often rely on depth sensors or closed-set semantic models, limiting scalability and adaptability in open-world environments. There's a need for monocular SLAM that can handle open-set semantics without specialized hardware.

Method: Uses Visual Foundation Models (MASt3R for geometry, SAM and CLIP for semantics) to create a monocular SLAM framework with 3D Gaussian Splatting. Includes a memory mechanism for managing high-dimensional semantic features to construct Gaussian semantic feature maps. Operates via self-supervised learning without depth input or 3D semantic ground truth.

Result: Achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, without relying on depth maps or semantic annotations.

Conclusion: OpenMonoGS-SLAM successfully demonstrates that monocular SLAM with open-set semantic understanding is feasible using visual foundation models and 3D Gaussian Splatting, enabling intelligent perception in open-world environments without specialized sensors.

Abstract: Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.

[135] Mitigating Individual Skin Tone Bias in Skin Lesion Classification through Distribution-Aware Reweighting

Kuniko Paxton, Zeinab Dehghani, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos

Main category: cs.CV

TL;DR: This paper proposes a distribution-based framework for individual fairness in skin lesion classification, treating skin tone as continuous rather than categorical, using KDE for distribution modeling and statistical distance metrics to quantify disparities, with a distance-based reweighting loss to correct underrepresentation.

DetailsMotivation: Current fairness research in medical imaging uses coarse subgroup categories that overlook individual-level variations and risk obscuring biases faced by outliers within subgroups, especially for sensitive attributes like skin color.

Method: Treat skin tone as continuous attribute, use kernel density estimation (KDE) to model its distribution, compare 12 statistical distance metrics to quantify disparities, and propose distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones.

Result: Experiments show limitations of categorical reweighting for capturing individual-level disparities, and superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS).

Conclusion: Establishes robust methodology for advancing individual-level fairness in dermatological AI systems, with broader implications for handling sensitive continuous attributes in medical image analysis.

Abstract: Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.
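
A hedged sketch of the continuous-attribute view: fit a KDE over the training set's tone values and up-weight samples from low-density (under-represented) tones. Inverse-density weighting is an illustrative choice; the paper's DRW loss is instead built from statistical distances between tone distributions.

```python
# Sketch: kernel density estimation over a continuous skin-tone value, with
# per-sample weights that compensate for under-represented tones.
import numpy as np
from scipy.stats import gaussian_kde

def tone_weights(tones, eps=1e-6):
    """tones: (N,) continuous skin-tone values (e.g. an individual typology angle)."""
    kde = gaussian_kde(tones)
    density = kde(tones)                       # estimated density at each sample
    w = 1.0 / (density + eps)
    return w * len(w) / w.sum()                # normalise to mean weight 1

tones = np.concatenate([np.random.normal(55, 5, 900), np.random.normal(20, 5, 100)])
print(tone_weights(tones)[:5])
```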

[136] Trajectory Densification and Depth from Perspective-based Blur

Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu, Yueting Chen

Main category: cs.CV

TL;DR: A method that estimates metric depth by analyzing perspective-based blur patterns in videos, combining optical design algorithms with vision encoders and point trackers for depth mapping and trajectory densification.

DetailsMotivation: Camera rotational dynamics during handheld shooting cause perspective-based blur that varies with object depth, creating depth-dependent blur patterns that can be leveraged for depth estimation.

Method: Uses off-the-shelf vision encoder and point tracker to extract video information, estimates depth via windowed embedding and multi-window aggregation, and densifies sparse optical flow trajectories using a vision-language model.

Result: Achieves strong performance across multiple depth datasets with good generalization, demonstrates superior precision in trajectory estimation compared to real handheld shooting data, and maintains accurate dense reconstruction.

Conclusion: The proposed method effectively leverages perspective-based blur patterns for metric depth estimation, showing promising results for depth recovery from handheld video streams with rotational dynamics.

Abstract: In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capturing, which induces perspective-based blur especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream and obtains a dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate a depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.

[137] Chain-of-Image Generation: Toward Monitorable and Controllable Image Generation

Young Kyung Kim, Oded Schlesinger, Yuzhou Zhao, J. Matias Di Martino, Guillermo Sapiro

Main category: cs.CV

TL;DR: CoIG framework makes image generation transparent by decomposing complex prompts into step-by-step instructions using LLMs, enabling monitoring and control like Chain-of-Thought does for language models.

DetailsMotivation: Current image generation models are opaque "black boxes" that limit human observation, intervention, and control, making it difficult to ensure reliability and safety. Their non-human-like workflows also hinder interpretability.

Method: CoIG uses an LLM to decompose complex prompts into sequential, simple instructions. The image generation model then executes this plan by progressively generating and editing images, with each step focusing on a single semantic entity for direct monitoring.

Result: CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to baseline models. It also mitigates entity collapse by decomposing complex tasks into simple subproblems.

Conclusion: CoIG provides a transparent, human-interpretable framework for image generation that enables monitoring and control while maintaining performance, analogous to how Chain-of-Thought improved LLMs.

Abstract: While state-of-the-art image generation models achieve remarkable visual quality, their internal generative processes remain a “black box.” This opacity limits human observation and intervention, and poses a barrier to ensuring model reliability, safety, and control. Furthermore, their non-human-like workflows make them difficult for human observers to interpret. To address this, we introduce the Chain-of-Image Generation (CoIG) framework, which reframes image generation as a sequential, semantic process analogous to how humans create art. Similar to the advantages in monitorability and performance that Chain-of-Thought (CoT) brought to large language models (LLMs), CoIG can produce equivalent benefits in text-to-image generation. CoIG utilizes an LLM to decompose a complex prompt into a sequence of simple, step-by-step instructions. The image generation model then executes this plan by progressively generating and editing the image. Each step focuses on a single semantic entity, enabling direct monitoring. We formally assess this property using two novel metrics: CoIG Readability, which evaluates the clarity of each intermediate step via its corresponding output; and Causal Relevance, which quantifies the impact of each procedural step on the final generated image. We further show that our framework mitigates entity collapse by decomposing the complex generation task into simple subproblems, analogous to the procedural reasoning employed by CoT. Our experimental results indicate that CoIG substantially enhances quantitative monitorability while achieving competitive compositional robustness compared to established baseline models. The framework is model-agnostic and can be integrated with any image generation model.

[138] Refining Visual Artifacts in Diffusion Models via Explainable AI-based Flaw Activation Maps

Seoyeon Lee, Gwangyeol Yu, Chaewon Kim, Jonghyuk Park

Main category: cs.CV

TL;DR: Self-refining diffusion framework uses XAI-based flaw detection to improve image generation quality by identifying artifacts and unrealistic regions, achieving up to 27.3% FID improvement across various diffusion models and tasks.

DetailsMotivation: While diffusion models have achieved remarkable success in image synthesis, addressing artifacts and unrealistic regions remains a critical challenge that needs to be solved to improve generation quality.

Method: Proposes a self-refining diffusion framework with an XAI-based flaw highlighter that produces flaw activation maps (FAMs). These maps identify artifacts and unrealistic regions, then improve reconstruction by amplifying noise in flawed regions during forward process and focusing on these regions during reverse process.

Result: Achieves up to 27.3% improvement in Fréchet inception distance across various diffusion-based models, shows consistently strong performance on diverse datasets, and demonstrates robust effectiveness across different tasks including image generation, text-to-image generation, and inpainting.

Conclusion: Explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing image synthesis.

Abstract: Diffusion models have achieved remarkable success in image synthesis. However, addressing artifacts and unrealistic regions remains a critical challenge. We propose self-refining diffusion, a novel framework that enhances image generation quality by detecting these flaws. The framework employs an explainable artificial intelligence (XAI)-based flaw highlighter to produce flaw activation maps (FAMs) that identify artifacts and unrealistic regions. These FAMs improve reconstruction quality by amplifying noise in flawed regions during the forward process and by focusing on these regions during the reverse process. The proposed approach achieves up to a 27.3% improvement in Fréchet inception distance across various diffusion-based models, demonstrating consistently strong performance on diverse datasets. It also shows robust effectiveness across different tasks, including image generation, text-to-image generation, and inpainting. These results demonstrate that explainable AI techniques can extend beyond interpretability to actively contribute to image refinement. The proposed framework offers a versatile and effective approach applicable to various diffusion models and tasks, significantly advancing the field of image synthesis.

[139] C-DIRA: Computationally Efficient Dynamic ROI Routing and Domain-Invariant Adversarial Learning for Lightweight Driver Behavior Recognition

Keito Inoshita

Main category: cs.CV

TL;DR: C-DIRA: A lightweight driver behavior recognition framework using dynamic ROI routing and adversarial learning for efficient, accurate, and domain-invariant performance on edge devices.

DetailsMotivation: Current lightweight models for driver distraction recognition fail to capture fine-grained behavioral cues and suffer from reduced performance on unseen drivers/varying conditions. ROI-based methods increase computational costs, creating a trade-off between efficiency and accuracy that needs to be addressed.

Method: Proposes C-DIRA framework combining: 1) Saliency-driven Top-K ROI pooling and fused classification for local feature extraction, 2) Dynamic ROI routing that applies ROI inference only to high-difficulty samples for selective computation, and 3) Pseudo-domain labeling with adversarial learning to learn domain-invariant features robust to driver/background variations.

Result: On State Farm Distracted Driver Detection Dataset, C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. Shows robustness under visual degradation (blur, low-light) and stable performance across unseen domains.

Conclusion: C-DIRA effectively achieves compactness, efficiency, and generalization for real-time driver behavior recognition on edge devices, balancing computational constraints with accurate, domain-invariant performance.

Abstract: Driver distraction behavior recognition using in-vehicle cameras demands real-time inference on edge devices. However, lightweight models often fail to capture fine-grained behavioral cues, resulting in reduced performance on unseen drivers or under varying conditions. ROI-based methods also increase computational cost, making it difficult to balance efficiency and accuracy. This work addresses the need for a lightweight architecture that overcomes these constraints. We propose Computationally efficient Dynamic region of Interest Routing and domain-invariant Adversarial learning for lightweight driver behavior recognition (C-DIRA). The framework combines saliency-driven Top-K ROI pooling and fused classification for local feature extraction and integration. Dynamic ROI routing enables selective computation by applying ROI inference only to high difficulty data samples. Moreover, pseudo-domain labeling and adversarial learning are used to learn domain-invariant features robust to driver and background variation. Experiments on the State Farm Distracted Driver Detection Dataset show that C-DIRA maintains high accuracy with significantly fewer FLOPs and lower latency than prior lightweight models. It also demonstrates robustness under visual degradation such as blur and low-light, and stable performance across unseen domains. These results confirm C-DIRA’s effectiveness in achieving compactness, efficiency, and generalization.
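
The dynamic ROI routing can be illustrated as conditional computation: the cheap global branch runs on every frame, and the heavier ROI branch is invoked only when the global prediction looks uncertain. Using softmax entropy as the difficulty score, and the two callables, are assumptions for illustration.

```python
# Sketch: route only hard samples through the ROI branch.
import numpy as np

def predict_with_routing(x, global_branch, roi_branch, entropy_thresh=1.0):
    """global_branch(x) -> (num_classes,) softmax; roi_branch(x, probs) -> refined softmax."""
    probs = global_branch(x)
    entropy = -np.sum(probs * np.log(probs + 1e-8))
    if entropy <= entropy_thresh:          # easy sample: global result is enough
        return probs
    return roi_branch(x, probs)            # hard sample: refine with ROI features
```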

[140] MatteViT: High-Frequency-Aware Document Shadow Removal with Shadow Matte Guidance

Chaewon Kim, Seoyeon Lee, Jonghyuk Park

Main category: cs.CV

TL;DR: MatteViT is a document shadow removal framework that uses spatial and frequency-domain information with high-frequency amplification and continuous luminance-based shadow matte to preserve fine details while removing shadows.

DetailsMotivation: Document shadow removal is essential for digitized document clarity, and preserving high-frequency details (text edges, lines) is critical because shadows often obscure or distort these fine structures.

Method: Proposes MatteViT with two key strategies: 1) Lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components, and 2) Continuous luminance-based shadow matte generated using custom dataset and shadow matte generator for precise spatial guidance from early stages.

Result: Achieves state-of-the-art performance on public benchmarks (RDD and Kligler), provides robust practical solution for real-world document shadow removal, and better preserves text-level details in downstream tasks like OCR, improving recognition performance over prior methods.

Conclusion: MatteViT offers an effective framework for document shadow removal that successfully preserves fine-grained structural details while eliminating shadows, demonstrating superior performance and practical utility for document digitization tasks.

Abstract: Document shadow removal is essential for enhancing the clarity of digitized documents. Preserving high-frequency details (e.g., text edges and lines) is critical in this process because shadows often obscure or distort fine structures. This paper proposes a matte vision transformer (MatteViT), a novel shadow removal framework that applies spatial and frequency-domain information to eliminate shadows while preserving fine-grained structural details. To effectively retain these details, we employ two preservation strategies. First, our method introduces a lightweight high-frequency amplification module (HFAM) that decomposes and adaptively amplifies high-frequency components. Second, we present a continuous luminance-based shadow matte, generated using a custom-built matte dataset and shadow matte generator, which provides precise spatial guidance from the earliest processing stage. These strategies enable the model to accurately identify fine-grained regions and restore them with high fidelity. Extensive experiments on public benchmarks (RDD and Kligler) demonstrate that MatteViT achieves state-of-the-art performance, providing a robust and practical solution for real-world document shadow removal. Furthermore, the proposed method better preserves text-level details in downstream tasks, such as optical character recognition, improving recognition performance over prior methods.
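
A hedged sketch of high-frequency amplification with a Gaussian low-pass split follows; the fixed gain stands in for the module's learned, adaptive amplification, and whether HFAM operates on raw pixels or intermediate features is not specified in the abstract.

```python
# Sketch: decompose into low/high frequency parts and re-add an amplified residual,
# strengthening text edges and fine lines before or during shadow removal.
import numpy as np
from scipy.ndimage import gaussian_filter

def amplify_high_freq(img, sigma=2.0, gain=1.5):
    """img: float array in [0, 1], shape (H, W) or (H, W, C)."""
    low = gaussian_filter(img, sigma=sigma if img.ndim == 2 else (sigma, sigma, 0))
    high = img - low                          # edges, strokes, text detail
    return np.clip(low + gain * high, 0.0, 1.0)
```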

[141] Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank

Shaofeng Zhang, Xuanqi Chen, Ning Liao, Haoxiang Zhao, Xiaoxing Wang, Haoru Tan, Sitong Wu, Xiaosong Jia, Qi Fan, Junchi Yan

Main category: cs.CV

TL;DR: A plug-and-play training framework called Repulsor that eliminates dependency on external encoders for generative models by using a memory bank mechanism to provide abundant negative samples for contrastive learning, achieving state-of-the-art FID of 2.40 on ImageNet-256.

DetailsMotivation: Denoising generative models (diffusion, flow-matching) have high training costs and inefficiencies in representation learning. Existing approaches that inject discriminative representations via auxiliary alignment rely on external pre-trained encoders, which introduces overhead and domain shift problems.

Method: Proposes the Repulsor framework with a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations, decoupling the negative sample count from the mini-batch size. Uses a low-dimensional projection head to minimize memory/bandwidth overhead. No external encoders needed.
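
A minimal sketch of such a queue-based negative bank is shown below. Since the exact contrastive objective, queue size, and projection dimension are not given here, the loss is written as a purely repulsive (dispersion-style) logsumexp term over the queued negatives; all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NegativeMemoryBank(nn.Module):
    def __init__(self, feat_dim=768, proj_dim=64, queue_size=65536, temp=0.2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, proj_dim)        # low-dimensional projection head
        self.temp = temp
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, proj_dim), dim=1))
        self.register_buffer("ptr", torch.zeros((), dtype=torch.long))

    @torch.no_grad()
    def _enqueue(self, z):
        n, p = z.shape[0], int(self.ptr)
        self.queue[p:p + n] = z                          # assumes queue_size % n == 0 for brevity
        self.ptr.fill_((p + n) % self.queue.shape[0])

    def forward(self, latents):
        """latents: (B, feat_dim) intermediate denoiser features; returns the repulsion loss."""
        z = F.normalize(self.proj(latents), dim=1)
        sim = z @ self.queue.t() / self.temp             # (B, queue_size) similarity to negatives
        loss = torch.logsumexp(sim, dim=1).mean()        # push latents away from queued samples
        self._enqueue(z.detach())
        return loss
```

Because the queue persists across iterations, the number of negatives seen by each latent is independent of the mini-batch size, which is the point of the design.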

Result: Achieves state-of-the-art FID of 2.40 on ImageNet-256 within 400k steps, significantly outperforming comparable methods. Enables substantially faster convergence and superior generative quality more efficiently.

Conclusion: Repulsor offers three key advantages: (1) self-contained with no dependency on pretrained vision foundation models, (2) no additional parameters or computational cost during inference, and (3) enables faster convergence and better generative quality efficiently.

Abstract: The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersion-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.

[142] Dual-Branch Center-Surrounding Contrast: Rethinking Contrastive Learning for 3D Point Clouds

Shaofeng Zhang, Xuanqi Chen, Xiangdong Zhang, Sitong Wu, Junchi Yan

Main category: cs.CV

TL;DR: A novel Dual-Branch Center-Surrounding Contrast (CSCon) framework for 3D point cloud self-supervised learning that combines center and surrounding masking with patch-level contrastive loss to capture both high-level discriminative features and local geometric details.

DetailsMotivation: Existing 3D SSL methods are dominated by generative MAE approaches that struggle with high-level discriminative features, while contrastive methods (effective in 2D) are scarce in 3D. Direct application of 2D contrastive methods fails to capture 3D local details effectively.

Method: Proposes CSCon framework with dual-branch inputs: center-biased and surrounding-biased representations created by separately masking center and surrounding parts. Uses patch-level contrastive loss to enhance both high-level information and local sensitivity.
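
The sketch below illustrates one way to build the two complementary masked views by ranking patch centers by distance to the cloud centroid; the masking ratio and the exact branch construction are assumptions rather than the paper's recipe.

```python
import torch

def center_surrounding_views(patch_centers, mask_ratio=0.6):
    """patch_centers: (B, P, 3) centers of point-cloud patches.
    Returns boolean visibility masks for the two branches (True = patch kept)."""
    centroid = patch_centers.mean(dim=1, keepdim=True)            # (B, 1, 3)
    dist = (patch_centers - centroid).norm(dim=-1)                 # (B, P)
    order = dist.argsort(dim=1)                                    # near -> far
    n_mask = int(mask_ratio * patch_centers.shape[1])
    mask_center = torch.zeros_like(dist, dtype=torch.bool)
    mask_surround = torch.zeros_like(dist, dtype=torch.bool)
    mask_center.scatter_(1, order[:, :n_mask], True)               # hide the central patches
    mask_surround.scatter_(1, order[:, -n_mask:], True)            # hide the outer patches
    visible_surround_biased = ~mask_center                         # branch biased toward the surroundings
    visible_center_biased = ~mask_surround                         # branch biased toward the center
    return visible_center_biased, visible_surround_biased
```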

Result: Achieves comparable performance to generative methods under FULL/ALL protocols; attains SOTA under MLP-LINEAR, MLP-3, and ONLY-NEW protocols. Specifically, outperforms Point-MAE baseline by 7.9%, 6.7%, and 10.3% on three ScanObjectNN variants under MLP-LINEAR protocol.

Conclusion: CSCon effectively addresses limitations of existing 3D SSL methods by combining the strengths of contrastive learning with 3D-specific architectural innovations, achieving superior discriminative feature learning and generalization on downstream tasks.

Abstract: Most existing self-supervised learning (SSL) approaches for 3D point clouds are dominated by generative methods based on Masked Autoencoders (MAE). However, these generative methods have been proven to struggle to capture high-level discriminative features effectively, leading to poor performance on linear probing and other downstream tasks. In contrast, contrastive methods excel in discriminative feature representation and generalization ability on image data. Despite this, contrastive learning (CL) in 3D data remains scarce. Besides, simply applying CL methods designed for 2D data to 3D fails to effectively learn 3D local details. To address these challenges, we propose a novel Dual-Branch Center-Surrounding Contrast (CSCon) framework. Specifically, we apply masking to the center and surrounding parts separately, constructing dual-branch inputs with center-biased and surrounding-biased representations to better capture rich geometric information. Meanwhile, we introduce a patch-level contrastive loss to further enhance both high-level information and local sensitivity. Under the FULL and ALL protocols, CSCon achieves performance comparable to generative methods; under the MLP-LINEAR, MLP-3, and ONLY-NEW protocols, our method attains state-of-the-art results, even surpassing cross-modal approaches. In particular, under the MLP-LINEAR protocol, our method outperforms the baseline (Point-MAE) by 7.9%, 6.7%, and 10.3% on the three variants of ScanObjectNN, respectively. The code will be made publicly available.

[143] What really matters for person re-identification? A Mixture-of-Experts Framework for Semantic Attribute Importance

Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

Main category: cs.CV

TL;DR: MoSAIC-ReID is a Mixture-of-Experts framework that quantifies the importance of pedestrian attributes for person re-identification, providing interpretability while maintaining competitive performance.

DetailsMotivation: Current person re-identification methods achieve high accuracy but lack transparency about which semantic attributes they rely on, creating a need for interpretable frameworks that reveal attribute importance.

Method: Uses a Mixture-of-Experts framework with LoRA-based experts (each linked to a single attribute) and an oracle router for controlled attribution analysis, enabling systematic quantification of attribute importance.
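
A hedged sketch of attribute-wise LoRA experts with an oracle router is given below: one low-rank update per attribute is added to a frozen projection, and the ground-truth attribute annotations decide which experts fire. Dimensions, rank, and the routing granularity are assumptions.

```python
import torch
import torch.nn as nn

class AttributeLoRAExperts(nn.Module):
    """One low-rank expert per attribute on top of a frozen linear projection."""
    def __init__(self, base: nn.Linear, n_attrs=30, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(0.01 * torch.randn(n_attrs, base.in_features, rank))
        self.B = nn.Parameter(torch.zeros(n_attrs, rank, base.out_features))

    def forward(self, x, attrs):
        """x: (B, in_features) features; attrs: (B, n_attrs) 0/1 annotations acting as the oracle router."""
        delta = torch.einsum("bd,adr,aro,ba->bo", x, self.A, self.B, attrs.float())
        return self.base(x) + delta
```

Because each expert is tied to a single attribute, zeroing an expert (or its router entry) isolates that attribute's contribution, which is what enables the attribution analysis.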

Result: Achieves competitive performance on Market-1501 and DukeMTMC datasets when attribute annotations are available at test time, and provides quantitative analysis showing clothing colors and intrinsic characteristics contribute most while infrequent cues like accessories have limited effect.

Conclusion: Offers a principled framework for interpretable ReID and highlights requirements for integrating explicit semantic knowledge in practice, advancing transparency in person re-identification systems.

Abstract: State-of-the-art person re-identification methods achieve impressive accuracy but remain largely opaque, leaving open the question: which high-level semantic attributes do these models actually rely on? We propose MoSAIC-ReID, a Mixture-of-Experts framework that systematically quantifies the importance of pedestrian attributes for re-identification. Our approach uses LoRA-based experts, each linked to a single attribute, and an oracle router that enables controlled attribution analysis. While MoSAIC-ReID achieves competitive performance on Market-1501 and DukeMTMC under the assumption that attribute annotations are available at test time, its primary value lies in providing a large-scale, quantitative study of attribute importance across intrinsic and extrinsic cues. Using generalized linear models, statistical tests, and feature-importance analyses, we reveal which attributes, such as clothing colors and intrinsic characteristics, contribute most strongly, while infrequent cues (e.g. accessories) have limited effect. This work offers a principled framework for interpretable ReID and highlights the requirements for integrating explicit semantic knowledge in practice. Code is available at https://github.com/psaltaath/MoSAIC-ReID

[144] Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

Main category: cs.CV

TL;DR: A knowledge distillation method for full surround monocular depth estimation that transfers depth knowledge from foundation models to lightweight networks while addressing computational cost and metric-scale depth challenges.

DetailsMotivation: Foundation models have strong generalization for monocular depth estimation but face two key limitations when applied to full surround settings: high computational cost preventing real-time performance, and inability to estimate metric-scale depth (only relative depth).

Method: Proposes a novel knowledge distillation strategy with: 1) Cross-interaction knowledge distillation that transfers scale-invariant depth bin probabilities from foundation model while guiding student network to infer metric-scale depth bin centers from ground truth, and 2) View-relational knowledge distillation that encodes structural relationships among adjacent camera views to enhance cross-view depth consistency.
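
A hedged sketch of the two supervision signals implied above: a KL term distilling the foundation model's depth-bin probabilities into the student, and an L1 term on the metric depth reconstructed from the student's own probabilities and its predicted bin centers. The binning layout and loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

def distill_depth_bins(student_logits, student_centers, teacher_probs, gt_depth,
                       w_kd=1.0, w_metric=1.0):
    """student_logits: (B, K, H, W) per-pixel bin logits,
    student_centers: (B, K) predicted metric bin centers,
    teacher_probs:   (B, K, H, W) foundation-model (scale-invariant) bin probabilities,
    gt_depth:        (B, H, W) metric ground truth (0 = invalid)."""
    log_p_student = F.log_softmax(student_logits, dim=1)
    kd = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")
    # expected depth from the student's probabilities and its metric bin centers
    p_student = log_p_student.exp()
    depth = (p_student * student_centers[..., None, None]).sum(dim=1)   # (B, H, W)
    valid = gt_depth > 0
    metric = F.l1_loss(depth[valid], gt_depth[valid])
    return w_kd * kd + w_metric * metric
```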

Result: Experiments on DDAD and nuScenes datasets show effectiveness compared to conventional supervised methods and existing knowledge distillation approaches. The method achieves favorable trade-off between performance and efficiency, meeting real-time requirements.

Conclusion: The proposed knowledge distillation framework successfully addresses computational cost and metric-scale depth challenges in full surround monocular depth estimation, enabling real-time performance while maintaining accuracy.

Abstract: Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme–traditionally used in classification–with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.

[145] Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning

Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang, Carola-Bibiane Schönlieb, Zhihai He, Angelica I. Aviles-Rivero

Main category: cs.CV

TL;DR: T-DHA is a training-free adaptation method for vision-language models that uses hyperbolic space embeddings to better represent hierarchical semantic relationships, achieving superior few-shot recognition and domain generalization without fine-tuning.

DetailsMotivation: Existing VLMs suffer from performance degradation with domain changes or require heavy computational resources for fine-tuning in new domains. The authors aim to develop a more efficient adaptation method that maintains performance across domains without extensive retraining.

Method: Proposes Training-free Dual Hyperbolic Adapters (T-DHA) that characterizes vision-language relationships in hyperbolic space instead of Euclidean space. Uses Poincaré ball model to embed hierarchical semantic structures, leveraging hyperbolic space’s exponential volume growth for better representation. Combines with negative learning for more accurate and robust classification with fewer feature dimensions.
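
To ground the geometry, here is a minimal sketch of embedding features in the Poincaré ball and classifying by negative geodesic distance to class prototypes; curvature c = 1 and the prototype construction are assumptions, and the negative-learning component is omitted.

```python
import torch

def expmap0(v, c=1.0, eps=1e-6):
    """Map Euclidean features v into the Poincaré ball (exponential map at the origin)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def poincare_dist(x, y, c=1.0, eps=1e-6):
    """Geodesic distance between points x, y inside the unit ball."""
    diff2 = (x - y).pow(2).sum(-1)
    denom = (1 - c * x.pow(2).sum(-1)) * (1 - c * y.pow(2).sum(-1))
    arg = 1 + 2 * c * diff2 / denom.clamp_min(eps)
    return torch.acosh(arg.clamp_min(1 + eps)) / c ** 0.5

# usage: embed image features and class prototypes, then pick the closest prototype
img = expmap0(0.01 * torch.randn(8, 512))        # (B, d) image features
protos = expmap0(0.01 * torch.randn(10, 512))    # (C, d) class text prototypes
logits = -poincare_dist(img[:, None, :], protos[None, :, :])   # (B, C)
pred = logits.argmax(dim=1)
```

Distance-based logits are the natural choice here because hierarchy depth maps to distance from the origin in the ball, so nearby prototypes correspond to semantically close concepts.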

Result: Extensive experiments on various datasets show T-DHA significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.

Conclusion: Hyperbolic space embeddings provide superior representation for hierarchical semantic relationships in vision-language models, enabling training-free adaptation that achieves better performance with fewer computational resources across diverse domains.

Abstract: Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called \textit{Training-free Dual Hyperbolic Adapters} (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.

[146] SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images

Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao

Main category: cs.CV

TL;DR: Simple training-free adaptation of SAM 3 for remote sensing open-vocabulary semantic segmentation using mask fusion and presence score filtering.

DetailsMotivation: Existing CLIP-based methods for OVSS struggle with precise localization and require complex pipelines, especially in remote sensing with dense small targets. SAM 3 offers unified segmentation and recognition capabilities that could address these challenges.

Method: Two key adaptations: 1) Mask fusion strategy combining SAM 3’s semantic segmentation head and Transformer decoder outputs for better land coverage, 2) Using presence score from presence head to filter out non-existent categories and reduce false positives from large vocabularies.
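
The two adaptations can be summarized with the schematic below, which operates on precomputed per-category outputs rather than on the SAM 3 API itself; the element-wise-max fusion rule and both thresholds are assumptions.

```python
import torch

def fuse_and_filter(sem_masks, inst_masks, presence, tau=0.5):
    """sem_masks, inst_masks: (C, H, W) per-category probabilities from the two heads;
    presence: (C,) presence-head scores. Returns an (H, W) label map (-1 = background)."""
    fused = torch.maximum(sem_masks, inst_masks)           # per-pixel union of the two heads
    keep = presence >= tau                                  # suppress categories judged absent
    fused[~keep] = 0.0
    scores, labels = fused.max(dim=0)                       # best remaining category per pixel
    labels[scores < 0.5] = -1                               # leave low-confidence pixels unlabeled
    return labels
```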

Result: The simple adaptation achieves promising performance on extensive remote sensing datasets, demonstrating SAM 3’s potential for remote sensing OVSS without any training.

Conclusion: SAM 3 shows strong potential for training-free open-vocabulary semantic segmentation in remote sensing applications, with simple adaptations addressing domain-specific challenges like dense small targets and large vocabulary sizes.

Abstract: Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3’s semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.

[147] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: InfiniteVL combines sliding window attention with Gated DeltaNet to create a linear-complexity VLM that overcomes limitations of window attention and linear attention, achieving competitive performance with only 2% training data while maintaining constant latency and memory footprint.

DetailsMotivation: Window attention suffers performance degradation when sequence length exceeds window size, while linear attention underperforms on information-intensive tasks like OCR and document understanding. Need a solution that overcomes both limitations while maintaining efficiency.

Method: Proposes InfiniteVL architecture that synergizes sliding window attention (SWA) with Gated DeltaNet for linear complexity. Uses three-stage training: distillation pretraining, instruction tuning, and long-sequence SFT. Achieves competitive performance with minimal training data.

Result: Outperforms previous linear-complexity VLMs and matches leading Transformer-based VLMs using <2% training data. Achieves 3.6× inference speedup over FlashAttention-2 accelerated Transformers with constant latency/memory. Maintains 24 FPS real-time prefill in streaming video with long-term memory retention.

Conclusion: InfiniteVL successfully addresses limitations of both window and linear attention approaches, providing an efficient linear-complexity VLM that maintains competitive performance while enabling practical deployment with constant resource requirements and effective long-term memory.

Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.

[148] A Scalable Pipeline Combining Procedural 3D Graphics and Guided Diffusion for Photorealistic Synthetic Training Data Generation in White Button Mushroom Segmentation

Artúr I. Károly, Péter Galambos

Main category: cs.CV

TL;DR: Novel workflow combining 3D rendering in Blender with constrained diffusion models to generate photorealistic synthetic mushroom images for computer vision training, achieving state-of-the-art segmentation performance with only synthetic data.

DetailsMotivation: Industrial mushroom cultivation needs computer vision for monitoring/harvesting, but creating large annotated real datasets is expensive. Synthetic data alternatives often lack realism for real-world generalization.

Method: Integrates 3D rendering in Blender with constrained diffusion model to automatically generate high-quality annotated synthetic images of Agaricus Bisporus mushrooms. Preserves full control over 3D scene configuration and annotations while achieving photorealism without specialized graphics expertise.

Result: Released two synthetic datasets (6,000 images each, over 250k mushroom instances). Mask R-CNN models trained on synthetic data achieved state-of-the-art segmentation performance in zero-shot testing on two real-world datasets (F1 = 0.859 on M18K benchmark).

Conclusion: The pipeline successfully generates photorealistic synthetic training data that generalizes well to real-world scenarios. Although demonstrated on Agaricus Bisporus mushrooms, the method can be adapted to other mushroom species or agricultural domains like fruit and leaf detection.

Abstract: Industrial mushroom cultivation increasingly relies on computer vision for monitoring and automated harvesting. However, developing accurate detection and segmentation models requires large, precisely annotated datasets that are costly to produce. Synthetic data provides a scalable alternative, yet often lacks sufficient realism to generalize to real-world scenarios. This paper presents a novel workflow that integrates 3D rendering in Blender with a constrained diffusion model to automatically generate high-quality annotated, photorealistic synthetic images of Agaricus Bisporus mushrooms. This approach preserves full control over 3D scene configuration and annotations while achieving photorealism without the need for specialized computer graphics expertise. We release two synthetic datasets (each containing 6,000 images depicting over 250k mushroom instances) and evaluate Mask R-CNN models trained on them in a zero-shot setting. When tested on two independent real-world datasets (including a newly collected benchmark), our method achieves state-of-the-art segmentation performance (F1 = 0.859 on M18K), despite using only synthetic training data. Although the approach is demonstrated on Agaricus Bisporus mushrooms, the proposed pipeline can be readily adapted to other mushroom species or to other agricultural domains, such as fruit and leaf detection.

[149] Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning

Jing Jie Tan, Anissa Mokraoui, Ban-Hoe Kwan, Danny Wee-Kiat Ng, Yan-Chai Hum

Main category: cs.CV

TL;DR: SOLI is a lightweight Siamese network approach for low-resolution image captioning that optimizes latent embeddings to reduce computational overhead while maintaining performance.

DetailsMotivation: Traditional image captioning models struggle with low-resolution images and are computationally expensive. There's a need for lightweight solutions that can handle low-resolution images efficiently on resource-constrained devices.

Method: SOLI uses a Siamese network architecture with dual-pathway neural network structure to optimize latent embeddings for low-resolution images, minimizing computational overhead while enhancing image-to-text translation efficiency.
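
Since the summary does not specify what the second pathway sees, the sketch below assumes the common setup of pairing the low-resolution image with a reference view of the same image through a shared-weight encoder and aligning the two latents before captioning; the module names and the cosine-alignment loss are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseLatentAligner(nn.Module):
    def __init__(self, encoder: nn.Module, dim=512):
        super().__init__()
        self.encoder = encoder                   # lightweight shared-weight encoder
        self.head = nn.Linear(dim, dim)          # latent projection

    def forward(self, lr_img, ref_img):
        z_lr = self.head(self.encoder(lr_img))
        with torch.no_grad():
            z_ref = self.head(self.encoder(ref_img))          # reference pathway, no gradient
        align = 1 - F.cosine_similarity(z_lr, z_ref, dim=-1).mean()
        return z_lr, align                        # z_lr feeds the captioning decoder; align is the loss
```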

Result: The approach reduces computational requirements and memory usage while maintaining captioning performance, making it suitable for resource-constrained scenarios.

Conclusion: SOLI provides an effective lightweight solution for low-resolution image captioning that balances computational efficiency with performance, addressing practical deployment challenges.

Abstract: Image captioning is essential in many fields including assisting visually impaired individuals, improving content management systems, and enhancing human-computer interaction. However, a recent challenge in this domain is dealing with low-resolution images (LRI). While performance can be improved by using larger models like transformers for encoding, these models are typically heavyweight, demanding significant computational resources and memory, leading to challenges in retraining. To address this, the proposed SOLI (Siamese-Driven Optimization for Low-Resolution Image Latent Embedding in Image Captioning) approach presents a solution specifically designed for lightweight, low-resolution image captioning. It employs a Siamese network architecture to optimize latent embeddings, enhancing the efficiency and accuracy of the image-to-text translation process. By focusing on a dual-pathway neural network structure, SOLI minimizes computational overhead without sacrificing performance, making it an ideal choice for training in resource-constrained scenarios.

[150] Skewness-Guided Pruning of Multimodal Swin Transformers for Federated Skin Lesion Classification on Edge Devices

Kuniko Paxton, Koorosh Aslansefat, Dhavalkumar Thakker, Yiannis Papadopoulos

Main category: cs.CV

TL;DR: A skewness-guided pruning method for multimodal Swin Transformers achieves 36% model size reduction with no accuracy loss in federated learning for medical imaging.

DetailsMotivation: High-performance medical vision models are too large for edge devices, and privacy constraints prevent centralized data management, necessitating federated learning and model compression solutions.

Method: Proposes skewness-guided pruning that selectively prunes Multi-Head Self-Attention and Multi-Layer Perceptron layers based on statistical skewness of their output distributions, validated in horizontal federated learning environment.
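
A sketch of the scoring step is shown below: forward hooks collect each block's activations over a calibration set and compute their sample skewness, yielding a ranking of pruning candidates. Which end of the ranking is pruned, and how pruning is realized, are the paper's design choices and are not reproduced here.

```python
import torch

def fisher_skewness(x, eps=1e-8):
    """Sample skewness of a flattened activation tensor."""
    x = x.flatten().float()
    mu, sigma = x.mean(), x.std().clamp_min(eps)
    return ((x - mu) ** 3).mean() / sigma ** 3

@torch.no_grad()
def rank_blocks_by_skewness(model, blocks, calib_loader, device="cpu"):
    """blocks: list of (name, module) pairs (e.g. attention / MLP sublayers) whose outputs are tensors."""
    stats = {name: [] for name, _ in blocks}
    hooks = [m.register_forward_hook(
                 lambda mod, inp, out, n=name: stats[n].append(fisher_skewness(out).item()))
             for name, m in blocks]
    for imgs, *_ in calib_loader:                 # assumes the loader yields (images, ...)
        model(imgs.to(device))
    for h in hooks:
        h.remove()
    scores = {n: sum(v) / len(v) for n, v in stats.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]))   # candidate ordering for pruning
```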

Result: Achieves approximately 36% model size reduction with no loss in accuracy on compact Swin Transformer, maintaining performance while substantially reducing model complexity.

Conclusion: Demonstrates feasibility of efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices through federated learning with skewness-guided pruning.

Abstract: In recent years, high-performance computer vision models have achieved remarkable success in medical imaging, with some skin lesion classification systems even surpassing dermatology specialists in diagnostic accuracy. However, such models are computationally intensive and large in size, making them unsuitable for deployment on edge devices. In addition, strict privacy constraints hinder centralized data management, motivating the adoption of Federated Learning (FL). To address these challenges, this study proposes a skewness-guided pruning method that selectively prunes the Multi-Head Self-Attention and Multi-Layer Perceptron layers of a multimodal Swin Transformer based on the statistical skewness of their output distributions. The proposed method was validated in a horizontal FL environment and shown to maintain performance while substantially reducing model complexity. Experiments on the compact Swin Transformer demonstrate approximately 36% model size reduction with no loss in accuracy. These findings highlight the feasibility of achieving efficient model compression and privacy-preserving distributed learning for multimodal medical AI on edge devices.

[151] No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

Damiano Marsili, Georgia Gkioxari

Main category: cs.CV

TL;DR: VALOR is an annotation-free training framework that improves visual reasoning by combining LLM reasoning refinement with VLM visual grounding enhancement, outperforming existing methods without requiring ground truth labels.

DetailsMotivation: Visual reasoning requires both precise object grounding and understanding complex spatial relationships. Existing methods have limitations: language-only chain-of-thought approaches need large-scale supervised data, while program-synthesis approaches suffer from flawed logic and erroneous grounding.

Method: Proposes an annotation-free training framework using AI-powered verifiers: an LLM verifier refines reasoning via reinforcement learning, and a VLM verifier strengthens visual grounding through automated hard-negative mining. Combines language-only reasoning models for decomposing spatial queries with vision specialist models improved via VLM critics.

Result: The method improves visual reasoning and surpasses both open-source and proprietary models across diverse spatial reasoning tasks. With the improved visual grounding model, it further outperforms recent text-only visual reasoning methods.

Conclusion: VALOR demonstrates that annotation-free training with AI-powered verifiers can effectively improve both reasoning and grounding in visual reasoning tasks, combining the strengths of modern language and vision models without requiring ground truth supervision.

Abstract: Visual reasoning is challenging, requiring both precise object grounding and understanding complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches which use pre-trained models and avoid training, but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong vision specialist models improved via performant VLM critics. We evaluate our approach across diverse spatial reasoning tasks, and show that our method improves visual reasoning and surpasses open-source and proprietary models, while with our improved visual grounding model we further outperform recent text-only visual reasoning methods. Project webpage: https://glab-caltech.github.io/valor/

[152] Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

Ruihang Chu, Yefei He, Zhekai Chen, Shiwei Zhang, Xiaogang Xu, Bin Xia, Dingdong Wang, Hongwei Yi, Xihui Liu, Hengshuang Zhao, Yu Liu, Yingya Zhang, Yujiu Yang

Main category: cs.CV

TL;DR: Wan-Move is a scalable framework for precise motion control in video generation that uses dense point trajectories to guide existing image-to-video models without architectural changes.

DetailsMotivation: Existing motion-controllable video generation methods suffer from coarse control granularity and limited scalability, making their outputs insufficient for practical applications.

Method: Represents object motions with dense point trajectories, projects them into latent space, propagates first frame features along trajectories to create aligned spatiotemporal feature maps, then integrates these as motion guidance into off-the-shelf image-to-video models without architecture changes.
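
The core propagation step can be sketched as follows: each tracked point carries the first frame's latent feature at its starting cell to its location in every later frame, producing an aligned (T, C, H, W) guidance volume. Nearest-cell scatter is used here for brevity; collision handling and sub-pixel sampling are omitted, and the tensor layout is an assumption.

```python
import torch

def build_motion_guidance(first_feat, trajs):
    """first_feat: (C, H, W) latent features of frame 0;
    trajs: (N, T, 2) long tensor of (x, y) latent-grid positions of N tracked points over T frames.
    Returns guidance: (T, C, H, W), zero where no trajectory lands."""
    C, H, W = first_feat.shape
    N, T, _ = trajs.shape
    guidance = torch.zeros(T, C, H, W, dtype=first_feat.dtype)
    x0 = trajs[:, 0, 0].clamp(0, W - 1)
    y0 = trajs[:, 0, 1].clamp(0, H - 1)
    src = first_feat[:, y0, x0]                   # (C, N) features at the start of each track
    for t in range(T):
        xt = trajs[:, t, 0].clamp(0, W - 1)
        yt = trajs[:, t, 1].clamp(0, H - 1)
        guidance[t, :, yt, xt] = src              # move each feature to its frame-t location
    return guidance
```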

Result: Generates 5-second, 480p videos with motion controllability rivaling commercial systems like Kling 1.5 Pro’s Motion Brush, as validated by user studies and extensive experiments on MoveBench benchmark.

Conclusion: Wan-Move achieves precise, high-quality motion control through a simple yet effective framework that eliminates need for auxiliary motion encoders, enables easy scaling of base models, and sets new standards for motion-controllable video generation.

Abstract: We present Wan-Move, a simple and scalable framework that brings motion control to video generative models. Existing motion-controllable methods typically suffer from coarse control granularity and limited scalability, leaving their outputs insufficient for practical use. We narrow this gap by achieving precise and high-quality motion control. Our core idea is to directly make the original condition features motion-aware for guiding video synthesis. To this end, we first represent object motions with dense point trajectories, allowing fine-grained control over the scene. We then project these trajectories into latent space and propagate the first frame’s features along each trajectory, producing an aligned spatiotemporal feature map that tells how each scene element should move. This feature map serves as the updated latent condition, which is naturally integrated into the off-the-shelf image-to-video model, e.g., Wan-I2V-14B, as motion guidance without any architecture change. It removes the need for auxiliary motion encoders and makes fine-tuning base models easily scalable. Through scaled training, Wan-Move generates 5-second, 480p videos whose motion controllability rivals Kling 1.5 Pro’s commercial Motion Brush, as indicated by user studies. To support comprehensive evaluation, we further design MoveBench, a rigorously curated benchmark featuring diverse content categories and hybrid-verified annotations. It is distinguished by larger data volume, longer video durations, and high-quality motion annotations. Extensive experiments on MoveBench and the public dataset consistently show Wan-Move’s superior motion quality. Code, models, and benchmark data are made publicly available.

[153] LoFA: Learning to Predict Personalized Priors for Fast Adaptation of Visual Generative Models

Yiming Hao, Mutian Xu, Chongjie Ye, Jie Qin, Shunlin Lu, Yipeng Qin, Xiaoguang Han

Main category: cs.CV

TL;DR: LoFA is a hypernetwork framework that predicts personalized LoRA weights in seconds by learning structured distribution patterns in parameter changes, enabling fast visual generative model adaptation without lengthy optimization.

DetailsMotivation: Current personalization methods like LoRA require task-specific data and hours of optimization, making them impractical. Hypernetwork approaches struggle to map fine-grained user prompts to complex LoRA weight distributions.

Method: Two-stage hypernetwork: first predicts relative distribution patterns in parameter changes between LoRA and base models, then uses these patterns to guide final LoRA weight prediction, leveraging discovered structured distribution properties.

Result: LoFA predicts high-quality personalized priors within seconds across multiple tasks and user prompts, outperforming conventional LoRA that requires hours of processing while maintaining adaptation quality.

Conclusion: LoFA bridges the gap between slow optimization-based adaptation and limited hypernetwork approaches by efficiently predicting personalized priors through structured distribution pattern learning, enabling practical real-time visual generative model personalization.

Abstract: Personalizing visual generative models to meet specific user needs has gained increasing attention, yet current methods like Low-Rank Adaptation (LoRA) remain impractical due to their demand for task-specific data and lengthy optimization. While a few hypernetwork-based approaches attempt to predict adaptation weights directly, they struggle to map fine-grained user prompts to complex LoRA distributions, limiting their practical applicability. To bridge this gap, we propose LoFA, a general framework that efficiently predicts personalized priors for fast model adaptation. We first identify a key property of LoRA: structured distribution patterns emerge in the relative changes between LoRA and base model parameters. Building on this, we design a two-stage hypernetwork: first predicting relative distribution patterns that capture key adaptation regions, then using these to guide final LoRA weight prediction. Extensive experiments demonstrate that our method consistently predicts high-quality personalized priors within seconds, across multiple tasks and user prompts, even outperforming conventional LoRA that requires hours of processing. Project page: https://jaeger416.github.io/lofa/.

[154] Astra: General Interactive World Model with Autoregressive Denoising

Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: Astra is an interactive general world model that generates real-world futures for diverse scenarios with precise action interactions, using autoregressive denoising, temporal causal attention, noise-augmented memory, and mixture of action experts.

DetailsMotivation: While diffusion transformers have advanced video generation, world models capable of predicting long-horizon futures from past observations and actions remain underexplored for general-purpose scenarios and various action forms.

Method: Autoregressive denoising architecture with temporal causal attention to aggregate past observations; noise-augmented history memory to balance responsiveness with coherence; action-aware adapter for precise action control; mixture of action experts to route heterogeneous action modalities.

Result: Astra achieves interactive, consistent, and general long-term video prediction supporting various interactions. Experiments across multiple datasets show improvements in fidelity, long-range prediction, and action alignment over state-of-the-art world models.

Conclusion: Astra bridges the gap in interactive general world models, enabling precise action-controlled future prediction across diverse real-world scenarios like autonomous driving and robot manipulation.

Abstract: Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.

[155] Generation is Required for Data-Efficient Perception

Jack Brady, Bernhard Schölkopf, Thomas Kipf, Simon Buchholz, Wieland Brendel

Main category: cs.CV

TL;DR: Generative methods enable compositional generalization via decoder constraints and inversion, while non-generative methods struggle without large-scale pretraining or added supervision.

DetailsMotivation: To investigate whether generative approaches (with decoder inversion) are necessary for human-level visual perception, specifically examining compositional generalization as a key capability.

Method: Theoretical analysis of inductive biases required for compositional generalization in both generative (decoder-based) and non-generative (encoder-based) methods, followed by empirical evaluation on photorealistic image datasets.

Result: Generative methods achieve compositional generalization by enforcing inductive biases on decoders and using inversion (gradient search or generative replay), while non-generative methods often fail without large-scale pretraining or added supervision.

Conclusion: Generative approaches are more effective for compositional generalization, supporting the hypothesis that decoder inversion is important for human-like visual perception in machines.

Abstract: It has been hypothesized that human-level visual perception requires a generative approach in which internal representations result from inverting a decoder. Yet today’s most successful vision models are non-generative, relying on an encoder that maps images to representations without decoder inversion. This raises the question of whether generation is, in fact, necessary for machines to achieve human-level visual perception. To address this, we study whether generative and non-generative methods can achieve compositional generalization, a hallmark of human perception. Under a compositional data generating process, we formalize the inductive biases required to guarantee compositional generalization in decoder-based (generative) and encoder-based (non-generative) methods. We then show theoretically that enforcing these inductive biases on encoders is generally infeasible using regularization or architectural constraints. In contrast, for generative methods, the inductive biases can be enforced straightforwardly, thereby enabling compositional generalization by constraining a decoder and inverting it. We highlight how this inversion can be performed efficiently, either online through gradient-based search or offline through generative replay. We examine the empirical implications of our theory by training a range of generative and non-generative methods on photorealistic image datasets. We find that, without the necessary inductive biases, non-generative methods often fail to generalize compositionally and require large-scale pretraining or added supervision to improve generalization. By comparison, generative methods yield significant improvements in compositional generalization, without requiring additional data, by leveraging suitable inductive biases on a decoder along with search and replay.

[156] Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference

Amit Bendkhale

Main category: cs.CV

TL;DR: Tri-Bench: A compact benchmark for evaluating VLMs’ verifiable geometric reasoning capabilities, focusing on planar triangle problems with camera pose and scene context variations.

DetailsMotivation: Vision-Language Models often fail under realistic scene changes, and there's a need for trustworthy and controllable agentic AI with verifiable geometric reasoning capabilities.

Method: Created Tri-Bench with planar triangle problems, testing camera pose (planar vs. tilted) and scene context via object interference. Evaluated four recent VLMs with a fixed prompt containing explicit guardrail about surrounding square border for homography-based solutions.

Result: Overall accuracy ~69% (best ~75%, worst ~64%). Accuracy drops to ~0% on minority shape classes (equilateral, isosceles, right-angled triangles). Camera tilt degrades accuracy by ~4.1%. Object interference has no significant effect. Models fail to utilize explicit frame-of-reference hints and default to 2D image plane cues.

Conclusion: Current VLMs have limited verifiable geometric reasoning capabilities, failing to properly use explicit geometric hints and struggling with 3D reasoning, especially for minority shape classes and under camera tilt.

Abstract: Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.

[157] SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing

Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng

Main category: cs.CV

TL;DR: Enhanced VLM-based visual grounding in satellite imagery using structured localization mechanism with control tokens, achieving 24.8% improvement over previous methods.

DetailsMotivation: Vision-language models (VLMs) are becoming powerful tools for remote sensing but need better spatial reasoning capabilities for precise object localization in complex satellite imagery.

Method: Fine-tune pretrained VLM on diverse instruction-following tasks while interfacing with a dedicated grounding module through specialized control tokens for structured localization.

Result: Consistent state-of-the-art improvements on remote sensing benchmarks, including 24.8% relative improvement over previous methods on visual grounding tasks.

Conclusion: Structured spatial reasoning integration into VLMs significantly enhances precise object localization in satellite imagery, enabling more reliable real-world satellite data analysis.

Abstract: Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model’s ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.

[158] Accelerated Rotation-Invariant Convolution for UAV Image Segmentation

Manduhu Manduhu, Alexander Dow, Gerard Dooly, James Riordan

Main category: cs.CV

TL;DR: GPU-optimized rotation-invariant convolution framework for UAV aerial imagery segmentation that eliminates im2col overhead, reduces memory/computational costs, and maintains accuracy while achieving significant speedups and energy savings.

DetailsMotivation: Rotation invariance is crucial for precise object-level segmentation in UAV aerial imagery where targets have arbitrary orientations and fine details. Conventional CNN architectures like U-Net lack rotation invariance, degrading segmentation accuracy across varying viewpoints. Existing rotation-invariant methods using expanded filter banks increase computational costs and memory traffic significantly.

Method: Introduces a GPU-optimized rotation-invariant convolution framework that eliminates the traditional im2col data-lowering step. Exploits structured data sharing among symmetrically rotated filters to achieve multi-orientation convolution with reduced memory traffic and computational redundancy. Generalizes the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles.
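
For reference, the unoptimized multi-orientation convolution that the paper accelerates can be written as below: the same filters are applied at the four symmetric rotations and the responses are max-pooled over orientation. The paper's contribution is a GPU kernel that shares data across these rotated filters instead of materializing the redundant work this naive version does.

```python
import torch
import torch.nn.functional as F

def rot_invariant_conv(x, weight, bias=None):
    """x: (B, C_in, H, W); weight: (C_out, C_in, k, k). Max-pools over 0/90/180/270 degree rotations."""
    responses = []
    for r in range(4):
        w_r = torch.rot90(weight, r, dims=(-2, -1))      # symmetric filter rotation
        responses.append(F.conv2d(x, w_r, bias=bias, padding=weight.shape[-1] // 2))
    return torch.stack(responses, dim=0).max(dim=0).values
```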

Result: Achieves 20-55% faster training and 15-45% lower energy consumption than CUDNN while maintaining comparable accuracy to state-of-the-art rotation-invariant methods. In eight-orientation setting: up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% lower energy usage on 1024×1024 inputs. Integrated into U-Net segmentation model yields up to 6% accuracy improvement over non-rotation-aware baseline.

Conclusion: The proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks, demonstrating practical benefits for UAV aerial imagery segmentation with significant performance gains and energy savings.

Abstract: Rotation invariance is essential for precise, object-level segmentation in UAV aerial imagery, where targets can have arbitrary orientations and exhibit fine-scale details. Conventional segmentation architectures like U-Net rely on convolution operators that are not rotation-invariant, leading to degraded segmentation accuracy across varying viewpoints. Rotation invariance can be achieved by expanding the filter bank across multiple orientations; however, this will significantly increase computational cost and memory traffic. In this paper, we introduce a GPU-optimized rotation-invariant convolution framework that eliminates the traditional data-lowering (im2col) step required for matrix-multiplication-based convolution. By exploiting structured data sharing among symmetrically rotated filters, our method achieves multi-orientation convolution with greatly reduced memory traffic and computational redundancy. We further generalize the approach to accelerate convolution with arbitrary (non-symmetric) rotation angles. Across extensive benchmarks, the proposed convolution achieves 20–55% faster training and 15–45% lower energy consumption than CUDNN, while maintaining accuracy comparable to state-of-the-art rotation-invariant methods. In the eight-orientation setting, our approach achieves up to 45% speedup and 41% energy savings on 256×256 inputs, and 32% speedup and 23% lower energy usage on 1024×1024 inputs. Integrated into a U-Net segmentation model, the framework yields up to 6% improvement in accuracy over the non-rotation-aware baseline. These results demonstrate that the proposed method provides an effective and highly efficient alternative to existing rotation-invariant CNN frameworks.

[159] UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation

Zeyang Liu, Le Wang, Sanping Zhou, Yuxuan Wu, Xiaolong Sun, Gang Hua, Haoxiang Li

Main category: cs.CV

TL;DR: UniLayDiff is a unified diffusion transformer model that handles various content-aware layout generation tasks with a single end-to-end trainable model, treating layout constraints as a distinct modality.

DetailsMotivation: Current methods for content-aware layout generation either address only subsets of tasks or require separate models for different conditions, lacking a truly unified solution for diverse input-constrained generation sub-tasks.

Method: Uses Multi-Modal Diffusion Transformer framework to capture interplay between background image, layout elements, and constraints; treats constraints as distinct modality; integrates relation constraints via LoRA fine-tuning after pretraining on other tasks.

Result: Achieves state-of-the-art performance across unconditional and various conditional generation tasks; first model to unify full range of content-aware layout generation tasks.

Conclusion: UniLayDiff provides a unified solution for diverse layout generation tasks with enhanced layout quality, demonstrating the effectiveness of treating constraints as a distinct modality in a diffusion transformer framework.

Abstract: Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned by element types, sizes, or their relationships. Current methods either address only a subset of these tasks or necessitate separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff: a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ the Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints by fine-tuning the model with LoRA after pretraining it on the other tasks. Such a schema not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance across tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.

[160] Self-Evolving 3D Scene Generation from a Single Image

Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang

Main category: cs.CV

TL;DR: EvoScene is a training-free framework that generates complete 3D scenes from single images by combining geometric reasoning from 3D generation models with visual knowledge from video generation models through iterative 2D-3D refinement.

DetailsMotivation: Existing image-to-3D generators have limitations: they are object-centric, trained on specific datasets, and struggle with complex, large-scale scenes that require faithful structure and texture. There's a need for a method that can handle diverse scenes with better generalization.

Method: EvoScene uses a self-evolving, training-free framework with three iterative stages: 1) Spatial Prior Initialization, 2) Visual-guided 3D Scene Mesh Generation, and 3) Spatial-guided Novel View Generation. It alternates between 2D and 3D domains, leveraging complementary strengths of existing 3D generation models (for geometry) and video generation models (for visual knowledge).

Result: Experiments on diverse scenes show EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines. It produces ready-to-use 3D meshes suitable for practical applications.

Conclusion: EvoScene successfully addresses the challenge of generating high-quality textured 3D scenes from single images by combining existing models in an iterative framework, demonstrating better generalization to complex scenes than object-centric approaches.

Abstract: Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages (Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation), EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.

[161] LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception

Simon de Moreau, Andrei Bursuc, Hafid El-Idrissi, Fabien Moutarde

Main category: cs.CV

TL;DR: LiDAS is an active illumination system that dynamically optimizes headlight patterns to improve nighttime perception for autonomous vehicles, enabling daytime-trained models to work at night without retraining.

DetailsMotivation: Nighttime environments challenge camera-based perception systems because they passively rely on existing scene lighting. Current methods either uniformly brighten scenes or don't adapt illumination to perception needs.

Method: LiDAS combines off-the-shelf perception models with high-definition headlights in a closed-loop system. It dynamically predicts optimal illumination fields that maximize perception performance by decreasing light on empty areas and reallocating it to object regions.
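
To make the light-reallocation step concrete, here is a minimal sketch assuming an object-likelihood map from the downstream detector is already available; the real LiDAS predicts the illumination field with a learned network in a closed loop, which this toy budget redistribution does not capture.

```python
# Toy sketch: redistribute a fixed headlight power budget toward likely object regions.
# The object-likelihood map and the redistribution rule are illustrative assumptions.
import torch

def reallocate_illumination(obj_score_map: torch.Tensor,
                            total_power: float = 1.0,
                            floor: float = 0.1) -> torch.Tensor:
    """obj_score_map: (H, W) in [0, 1], higher where the detector expects objects."""
    base = torch.full_like(obj_score_map, floor)       # keep a minimum fill light everywhere
    weights = base + (1.0 - floor) * obj_score_map     # favour object regions
    return weights / weights.sum() * total_power       # respect the fixed power budget

scores = torch.zeros(4, 6)
scores[1:3, 2:4] = 1.0                                 # a toy "detected object" region
field = reallocate_illumination(scores)
print(field.sum().item())                              # ~1.0: equal-power constraint holds
```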

Result: In real-world closed-loop driving scenarios, LiDAS achieves +18.7% mAP50 and +5.0% mIoU over standard low-beam at equal power. It maintains performance while reducing energy use by 40% and complements domain-generalization methods without retraining.

Conclusion: LiDAS transforms standard headlights into active vision actuators, providing a cost-effective solution for robust nighttime perception that enables zero-shot generalization of daytime-trained models to nighttime conditions.

Abstract: Nighttime environments pose significant challenges for camera-based perception, as existing methods passively rely on the scene lighting. We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high-definition headlights. Rather than uniformly brightening the scene, LiDAS dynamically predicts an optimal illumination field that maximizes downstream perception performance, i.e., decreasing light on empty areas to reallocate it to object regions. LiDAS enables zero-shot nighttime generalization of daytime-trained models through adaptive illumination control. Trained on synthetic data and deployed zero-shot in real-world closed-loop driving scenarios, LiDAS achieves +18.7% mAP50 and +5.0% mIoU over a standard low-beam at equal power. It maintains performance while reducing energy use by 40%. LiDAS complements domain-generalization methods, further strengthening robustness without retraining. By turning readily available headlights into active vision actuators, LiDAS offers a cost-effective solution to robust nighttime perception.

[162] Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration

Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim

Main category: cs.CV

TL;DR: UniT is a unified text restoration framework that combines Diffusion Transformer, Vision-Language Model, and Text Spotting Module to restore degraded text in images while reducing hallucinations.

DetailsMotivation: Diffusion models for image restoration often produce text hallucinations in text-centric tasks due to lack of explicit linguistic knowledge, requiring a solution that integrates textual understanding.

Method: UniT integrates three components iteratively: 1) Diffusion Transformer (DiT) backbone for strong representation, 2) Vision-Language Model (VLM) extracts textual content from degraded images, 3) Text Spotting Module (TSM) generates intermediate OCR predictions at each denoising step to refine VLM guidance.

Result: Experiments on SA-Text and Real-Text benchmarks show UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance in TAIR task.

Conclusion: UniT effectively addresses text hallucinations in image restoration by integrating explicit linguistic guidance through iterative refinement between VLM and TSM, leveraging DiT’s representational power for high-fidelity text restoration.

Abstract: Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.

[163] Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi SM Sajjadi

Main category: cs.CV

TL;DR: D4RT is a feedforward transformer model that jointly infers depth, spatio-temporal correspondence, and camera parameters from single videos for efficient 4D reconstruction.

DetailsMotivation: Reconstructing complex geometry and motion of dynamic scenes from video remains challenging, requiring efficient solutions that can handle multiple reconstruction tasks simultaneously.

Method: Uses unified transformer architecture with novel querying mechanism that avoids dense per-frame decoding and multiple task-specific decoders; enables flexible probing of 3D positions across space and time.
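
As a rough illustration of a point-query decoding interface, the sketch below assumes queries are embeddings of normalized (x, y, t) probe coordinates that cross-attend to encoded video tokens and regress a 3D position; the actual D4RT query mechanism, heads, and outputs are not specified here.

```python
# Minimal sketch of a space-time point-query decoder (illustrative, not the D4RT design).
import torch
import torch.nn as nn

class PointQueryDecoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.query_embed = nn.Linear(3, dim)                    # (x, y, t) probe -> query token
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 3)                           # regress a 3D position per probe

    def forward(self, probes, video_tokens):
        """probes: (B, Q, 3) normalized (x, y, t); video_tokens: (B, N, dim)."""
        q = self.query_embed(probes)
        attended, _ = self.cross_attn(q, video_tokens, video_tokens)
        return self.head(attended)                              # (B, Q, 3)

tokens = torch.randn(2, 1024, 256)                              # encoded video tokens
probes = torch.rand(2, 5, 3)                                    # 5 space-time probes per clip
print(PointQueryDecoder()(probes, tokens).shape)                # torch.Size([2, 5, 3])
```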

Result: Sets new state-of-the-art performance across a wide spectrum of 4D reconstruction tasks; the method is lightweight and highly scalable, with efficient training and inference.

Conclusion: D4RT provides a simple yet powerful solution for dynamic scene reconstruction that outperforms previous methods while being computationally efficient and scalable.

Abstract: Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.

[164] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

Youming Deng, Songyou Peng, Junyi Zhang, Kathryn Heal, Tiancheng Sun, John Flynn, Steve Marschner, Lucy Chai

Main category: cs.CV

TL;DR: Selfi improves 3D reconstruction by aligning VGGT features for better geometric consistency, achieving SOTA in novel view synthesis and pose estimation.

DetailsMotivation: VGGT models enable feed-forward 3D prediction from uncalibrated images but lack explicit multi-view geometric consistency, which limits performance in 3D reconstruction tasks.

Method: Selfi trains a lightweight feature adapter using reprojection-based consistency loss, distilling VGGT outputs into a geometrically-aligned feature space that captures 3D spatial proximity.
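
A minimal sketch of a reprojection-based feature consistency loss follows, assuming pixel correspondences between two views (derived from predicted depth and cameras) are already given as a sampling grid; the actual Selfi adapter, pseudo-ground-truth setup, and training details are not reproduced here.

```python
# Illustrative reprojection-consistency loss: features at corresponding pixels of two
# views are pulled together. Correspondences are assumed precomputed as a sampling grid.
import torch
import torch.nn.functional as F

def reprojection_consistency_loss(feat_a, feat_b, grid_ab, valid_mask):
    """
    feat_a, feat_b: (1, C, H, W) adapted feature maps of two views.
    grid_ab: (1, H, W, 2) normalized coords in view B for every pixel of view A.
    valid_mask: (1, 1, H, W) 1 where the reprojection is visible in view B.
    """
    feat_b_warped = F.grid_sample(feat_b, grid_ab, align_corners=True)
    sim = F.cosine_similarity(feat_a, feat_b_warped, dim=1, eps=1e-6)        # (1, H, W)
    return ((1.0 - sim).unsqueeze(1) * valid_mask).sum() / valid_mask.sum().clamp(min=1)

feat_a = torch.randn(1, 16, 32, 32)
feat_b = feat_a.clone()
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 32), torch.linspace(-1, 1, 32), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)    # identity correspondence for the demo
mask = torch.ones(1, 1, 32, 32)
print(reprojection_consistency_loss(feat_a, feat_b, grid, mask).item())  # ~0 for identical views
```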

Result: Achieves state-of-the-art performance in both novel view synthesis and camera pose estimation, demonstrating improved 3D feature consistency.

Conclusion: Feature alignment is a highly beneficial step for downstream 3D reasoning, transforming VGGT backbones into high-fidelity 3D reconstruction engines.

Abstract: Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with known camera parameters from Structure-from-Motion (SfM) beforehand. Recent vision foundation models like VGGT take an orthogonal approach – 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.

[165] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng, Xuehai He, Xin Eric Wang

Main category: cs.CV

TL;DR: MiniGPT-5 introduces a novel interleaved vision-and-language generation method using “generative vokens” and a two-stage training strategy for description-free multimodal generation, achieving substantial improvements over baselines.

DetailsMotivation: While Multimodal Large Language Models (MLLMs) show strong multimodal understanding capabilities, simultaneous generation of coherent images with text remains underdeveloped. The paper aims to address this gap in multimodal generation.

Method: Proposes an interleaved vision-and-language generation method centered around “generative vokens” as pivotal elements for coherent outputs. Uses a unique two-stage training strategy for description-free multimodal generation (without needing extensive image descriptions) and integrates classifier-free guidance for better image-text alignment.

Result: MiniGPT-5 shows substantial improvement over baseline models on multimodal generation datasets (MMDialog and VIST). Human evaluation indicates MiniGPT-5 outperforms baseline models in more than 56% of cases for multimodal generation.

Conclusion: The proposed method with generative vokens and two-stage training effectively enables coherent interleaved image-text generation without requiring extensive image descriptions, demonstrating strong performance across diverse benchmarks.

Abstract: The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.

[166] Spike-EVPR: Deep Spiking Residual Networks with SNN-Tailored Representations for Event-Based Visual Place Recognition

Zuntao Liu, Yaohui Li, Chenming Hu, Delei Kong, Junjie Jiang, Zheng Fang

Main category: cs.CV

TL;DR: Spike-EVPR: An end-to-end SNN framework for event-based visual place recognition that uses novel event representations and deep spiking architecture to achieve SOTA performance with low energy consumption.

DetailsMotivation: Event cameras are ideal for VPR in challenging environments due to high temporal resolution and dynamic range, but existing methods convert sparse events into dense frame-like representations for ANNs, ignoring event sparsity and incurring high computational cost. SNNs complement event data but lack effective spike-compatible representations and deep architectures for learning discriminative global descriptors.

Method: Proposes Spike-EVPR, a directly trained, end-to-end SNN framework with two complementary event representations (MCS-Tensor and TSS-Tensor) designed to reduce temporal redundancy while preserving spatio-temporal cues, and a deep spiking residual architecture that aggregates these features to generate robust place descriptors.

Result: Extensive experiments on Brisbane-Event-VPR and DDD20 datasets demonstrate state-of-the-art performance, improving Recall@1 by 7.61% and 13.20% respectively, while significantly reducing energy consumption.

Conclusion: Spike-EVPR effectively addresses limitations of existing methods by leveraging SNNs with novel event representations and deep architecture, achieving superior VPR performance with energy efficiency for event-based vision systems.

Abstract: Event cameras are ideal for visual place recognition (VPR) in challenging environments due to their high temporal resolution and high dynamic range. However, existing methods convert sparse events into dense frame-like representations for Artificial Neural Networks (ANNs), ignoring event sparsity and incurring high computational cost. Spiking Neural Networks (SNNs) complement event data through discrete spike signals to enable energy-efficient VPR, but their application is hindered by the lack of effective spike-compatible representations and deep architectures capable of learning discriminative global descriptors. To address these limitations, we propose Spike-EVPR, a directly trained, end-to-end SNN framework tailored for event-based VPR. First, we introduce two complementary event representations, MCS-Tensor and TSS-Tensor, designed to reduce temporal redundancy while preserving essential spatio-temporal cues. Furthermore, we propose a deep spiking residual architecture that effectively aggregates these features to generate robust place descriptors. Extensive experiments on the Brisbane-Event-VPR and DDD20 datasets demonstrate that Spike-EVPR achieves state-of-the-art performance, improving Recall@1 by 7.61% and 13.20%, respectively, while significantly reducing energy consumption.

[167] Learning effective pruning at initialization from iterative pruning

Shengkai Liu, Yaofeng Cheng, Fusheng Zha, Wei Guo, Lining Sun, Zhenshan Bing, Chenguang Yang

Main category: cs.CV

TL;DR: AutoSparse: A neural network-based pruning at initialization method that learns from iterative rewind pruning to predict important subnetworks before training, achieving better performance at high sparsity levels with cross-model generalization.

DetailsMotivation: Current pruning at initialization (PaI) methods have significant accuracy gaps compared to iterative pruning, especially at high sparsity levels. The authors aim to bridge this gap by drawing inspiration from iterative pruning methods like iterative rewind pruning (IRP) to improve PaI performance.

Method: AutoSparse uses an end-to-end neural network to learn correlations between initial network features and their importance scores. It takes the model’s initial features as input, outputs importance scores, and prunes the lowest-scoring parameters before training. The method learns from IRP results where subnetworks are identified by rewinding to original initialization in each pruning iteration.
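
A minimal sketch of the score-then-prune-at-initialization idea: a small scorer maps per-weight features to importance scores, and the lowest-scoring weights are zeroed before training. The per-weight features and scorer architecture here are illustrative assumptions, not the paper's exact AutoSparse design.

```python
# Illustrative pruning-at-initialization: score every weight with a small network,
# then mask out the lowest-scoring fraction before any training happens.
import torch
import torch.nn as nn

def prune_at_init(model: nn.Module, scorer: nn.Module, sparsity: float = 0.9):
    feats, params = [], []
    for p in model.parameters():
        if p.dim() < 2:                                   # skip biases for simplicity
            continue
        w = p.detach().flatten()
        feats.append(torch.stack([w, w.abs()], dim=1))    # toy per-weight features
        params.append(p)
    scores = scorer(torch.cat(feats)).squeeze(1)
    k = int(sparsity * scores.numel())
    threshold = scores.kthvalue(k).values
    masks, offset = [], 0
    for p in params:
        n = p.numel()
        mask = (scores[offset:offset + n] > threshold).float().view_as(p)
        p.data.mul_(mask)                                 # zero out low-scoring weights
        masks.append(mask)
        offset += n
    return masks

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
scorer = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
masks = prune_at_init(model, scorer, sparsity=0.9)
print(sum(m.sum().item() for m in masks) / sum(m.numel() for m in masks))  # ~0.1 of weights kept
```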

Result: The approach outperforms existing PaI methods in high-sparsity settings. Notably, it demonstrates strong generalization: only one-time IRP on one model (e.g., ResNet-18/CIFAR-10) enables AutoSparse to generalize to other models and datasets (e.g., VGG-16/CIFAR-10, ResNet-18/TinyImageNet).

Conclusion: AutoSparse represents the first neural network-based PaI method that effectively bridges the gap between PaI and iterative pruning. The approach reveals learning tendencies of neural networks and provides practical insights for PaI research, with promising cross-model generalization capabilities.

Abstract: Pruning at initialization (PaI) reduces training costs by removing weights before training, which becomes increasingly crucial as networks grow larger. However, current PaI methods still have a large accuracy gap with iterative pruning, especially at high sparsity levels. This raises an intriguing question: can we draw inspiration from iterative pruning to improve PaI performance? In the lottery ticket hypothesis, iterative rewind pruning (IRP) finds subnetworks retroactively by rewinding the parameters to the original initialization in every pruning iteration, which means all the subnetworks are based on the initial state. Here, we hypothesize that the surviving subnetworks are more important and use the link between initial features and their survival scores as the PaI criterion. We employ an end-to-end neural network (AutoSparse) to learn this correlation: it takes the model's initial features as input, outputs their scores, and prunes the lowest-scoring parameters before training. To validate the accuracy and generalization of our method, we performed PaI across various models. Results show that our approach outperforms existing methods in high-sparsity settings. Notably, as the underlying logic of model pruning is consistent across different models, only a single IRP run on one model is needed (e.g., after one IRP run on ResNet-18/CIFAR-10, AutoSparse generalizes to VGG-16/CIFAR-10, ResNet-18/TinyImageNet, etc.). As the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach. These results reveal the learning tendencies of neural networks and provide new insights into our understanding and research of PaI from a practical perspective. Our code is available at: https://github.com/ChengYaofeng/AutoSparse.git.

[168] Beyond accuracy: quantifying the reliability of Multiple Instance Learning for Whole Slide Image classification

Hassan Keshvarikhojasteh, Marc Aubreville, Christof A. Bertram, Josien P. W. Pluim, Mitko Veta

Main category: cs.CV

TL;DR: The paper introduces reliability metrics for Multiple Instance Learning models in computational pathology, finding that simple MEAN-POOL-INS outperforms more complex architectures in reliability despite its simplicity.

DetailsMotivation: Machine learning models, especially MIL models for Whole Slide Image classification in computational pathology, lack reliability assessment despite being used in high-stakes clinical applications. There's a critical gap in understanding their suitability for clinical decision-making.

Method: The authors introduce three quantitative metrics for reliability assessment and apply them to several widely used MIL architectures across three region-wise annotated pathology datasets.

Result: The mean pooling instance (MEAN-POOL-INS) model demonstrates superior reliability compared to other networks, despite its simple architectural design and computational efficiency.
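
For context, a mean-pooling-over-instances MIL classifier is indeed very simple; a minimal sketch with illustrative feature dimensions might look like this:

```python
# Illustrative mean-pooling-instance MIL baseline: every patch embedding gets its own
# logits, and the slide-level prediction is their average. Dimensions are arbitrary.
import torch
import torch.nn as nn

class MeanPoolInstanceMIL(nn.Module):
    def __init__(self, feat_dim=512, n_classes=2):
        super().__init__()
        self.instance_head = nn.Linear(feat_dim, n_classes)

    def forward(self, bag):                 # bag: (num_patches, feat_dim) from one WSI
        instance_logits = self.instance_head(bag)
        return instance_logits.mean(dim=0)  # slide-level logits

bag = torch.randn(1000, 512)                # 1000 patch embeddings from one slide
print(MeanPoolInstanceMIL()(bag).shape)     # torch.Size([2])
```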

Conclusion: Reliability evaluation should be conducted alongside predictive performance in MIL models, and MEAN-POOL-INS serves as a strong, trustworthy baseline for future research in computational pathology.

Abstract: Machine learning models have become integral to many fields, but their reliability, defined as producing dependable, trustworthy, and domain-consistent predictions, remains a critical concern. Multiple Instance Learning (MIL) models designed for Whole Slide Image (WSI) classification in computational pathology are rarely evaluated in terms of reliability, leaving a key gap in understanding their suitability for high-stakes applications like clinical decision-making. In this paper, we address this gap by introducing three quantitative metrics for reliability assessment and applying them to several widely used MIL architectures across three region-wise annotated pathology datasets. Our findings indicate that the mean pooling instance (MEAN-POOL-INS) model demonstrates superior reliability compared to other networks, despite its simple architectural design and computational efficiency. These findings underscore the need for reliability evaluation alongside predictive performance in MIL models and establish MEAN-POOL-INS as a strong, trustworthy baseline for future research.

[169] Learning Geodesics of Geometric Shape Deformations From Images

Nian Wu, Miaomiao Zhang

Main category: cs.CV

TL;DR: GDN is a novel neural network method that learns geodesic flows of deformation fields from images for shape analysis, using neural operators and a geodesic loss function.

DetailsMotivation: Existing registration networks learn initial velocity fields but ignore the geodesic definition crucial for deformation-based shape analysis. There's a need to directly learn geodesic flows for better shape quantification and comparison.

Method: Develops geodesic deformable networks (GDN) using neural operators to treat geodesics as unknown mapping functions learned from latent deformation spaces. Uses composition of integral operators and smooth activation functions to approximate mappings, and jointly optimizes a newly defined geodesic loss.

Result: Demonstrates effectiveness on both 2D synthetic data and 3D real brain MRI data, showing improved network regularizability and generalizability.

Conclusion: GDN successfully enables learning of geodesic flows for deformation-based shape analysis, addressing limitations of previous registration networks and providing better quantification of deformable shapes in images.

Abstract: This paper presents a novel method, named geodesic deformable networks (GDN), that for the first time enables the learning of geodesic flows of deformation fields derived from images. In particular, the ability of GDN to predict geodesics is important for quantifying and comparing deformable shapes presented in images. The geodesic deformations, also known as optimal transformations that align pairwise images, are often parameterized by a time sequence of smooth vector fields governed by nonlinear differential equations. A substantial body of literature has focused on learning the initial conditions (e.g., initial velocity fields) with registration networks. However, these networks remain blind to the definition of geodesics, which is central to deformation-based shape analysis. To address this problem, we carefully develop an efficient neural operator that treats the geodesics as unknown mapping functions learned from latent deformation spaces. A composition of integral operators and smooth activation functions is then formulated to effectively approximate such mappings. In contrast to previous works, our GDN jointly optimizes a newly defined geodesic loss, which adds additional benefits by promoting network regularizability and generalizability. We demonstrate the effectiveness of GDN on both 2D synthetic data and 3D real brain magnetic resonance imaging (MRI) data.

[170] Neural Radiance Fields for the Real World: A Survey

Wenhui Xiao, Remi Chierchia, Rodrigo Santa Cruz, Xuesong Li, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes, Leo Lebrat

Main category: cs.CV

TL;DR: A comprehensive survey paper reviewing Neural Radiance Fields (NeRFs), covering theoretical advancements, alternative representations, applications in reconstruction, computer vision, and robotics, plus datasets, toolkits, and future research directions.

DetailsMotivation: NeRFs have revolutionized 3D scene representation but lack a thorough review of recent innovations, applications, and challenges. The paper aims to compile key advancements and investigate emerging challenges in the field.

Method: This is a survey paper that compiles and analyzes existing research on NeRFs. The methodology involves reviewing key theoretical advancements, alternative representations, applications, datasets, and toolkits in the NeRF literature.

Result: The survey provides a comprehensive overview of NeRF innovations, explores their applications in reconstruction, computer vision, and robotics, reviews essential datasets and toolkits, and identifies gaps in current literature.

Conclusion: By identifying gaps in the literature, the survey discusses open challenges and offers directions for future research in Neural Radiance Fields, highlighting their transformative impact on 3D scene representation and related fields.

Abstract: Neural Radiance Fields (NeRFs) have remodeled 3D scene representation since their release. NeRFs can effectively reconstruct complex 3D scenes from 2D images, advancing different fields and applications such as scene understanding, 3D content generation, and robotics. Despite significant research progress, a thorough review of recent innovations, applications, and challenges is lacking. This survey compiles key theoretical advancements and alternative representations and investigates emerging challenges. It further explores applications in reconstruction, highlights NeRFs’ impact on computer vision and robotics, and reviews essential datasets and toolkits. By identifying gaps in the literature, this survey discusses open challenges and offers directions for future research.

[171] Shape and Texture Recognition in Large Vision-Language Models

Sagi Eppel, Mor Bismut, Alona Faktor-Strugatski

Main category: cs.CV

TL;DR: LVLMs struggle with basic visual perception tasks like shape and texture recognition, performing well below humans despite approaching human-level on some material recognition in 3D scenes.

DetailsMotivation: Shapes and textures are fundamental to visual understanding, but current LVLMs' ability to recognize them independently of orientation, texture, context, or object association remains unclear and needs benchmarking.

Method: Created LAS&T dataset (700k+ images) via unsupervised pattern extraction from natural images. Benchmarked leading LVLMs (GPT/Gemini/LLama/Qwen) and foundation models (DINO/CLIP) on shape matching across transformations and texture/material recognition across objects.

Result: LVLMs perform significantly below humans on shape recognition, especially with multiple transformations. They approach human-level on 3D material recognition but underperform on simpler 2D textures/shapes. Models rely too much on high-level semantic features and struggle with abstract shapes lacking class associations.

Conclusion: Current LVLMs have major deficiencies in extracting low-level visual features, revealing a gap between their semantic understanding and basic visual perception capabilities that humans and simple task-specific networks excel at.

Abstract: Shapes and textures are the basic building blocks of visual perception. The ability to identify shapes regardless of orientation, texture, or context, and to recognize textures and materials independently of their associated objects, is essential for a general visual understanding of the world. This work introduces the Large Shapes and Textures dataset (LAS&T), a giant collection of highly diverse shapes and textures, created by unsupervised extraction of patterns from natural images. This dataset is used to benchmark how effectively leading Large Vision-Language Models (LVLM/VLM) recognize and represent shapes, textures, and materials in 2D and 3D scenes. For shape recognition, we test the models’ ability to match images of identical shapes that differ in orientation, texture, color, or environment. Our results show that the shape-recognition capabilities of LVLMs remain well below human performance, especially when multiple transformations are applied. LVLMs rely predominantly on high-level and semantic features and struggle with abstract shapes lacking class associations. For texture and material recognition, we evaluated the models’ ability to identify images with identical textures and materials across different objects and environments. Interestingly, leading LVLMs approach human-level performance in recognizing materials in 3D scenes, yet substantially underperform humans when identifying simpler, more abstract 2D textures and shapes. These results are consistent across a wide range of leading LVLMs (GPT/Gemini/LLama/Qwen) and foundation vision models (DINO/CLIP), exposing major deficiencies in the ability of VLMs to extract low-level visual features. In contrast, humans and simple nets trained directly for these tasks achieve high accuracy. The LAS&T dataset, featuring over 700,000 images for 2D/3D shape and texture recognition and retrieval, is freely available.

[172] ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation

Yuan Zhou, Shilong Jin, Litao Hua, Wanjun Lv, Haoran Duan, Jungong Han

Main category: cs.CV

TL;DR: ConsDreamer addresses view bias in zero-shot text-to-3D generation by refining both conditional and unconditional terms in score distillation to mitigate the multi-face Janus problem.

DetailsMotivation: Current zero-shot text-to-3D generation methods using 3D Gaussian Splatting with score distillation suffer from inherent prior view biases in T2I models, leading to inconsistent 3D generation and the multi-face Janus problem where objects exhibit conflicting features across different views.

Method: ConsDreamer proposes two key components: (1) a View Disentanglement Module that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control, and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships.
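
One plausible reading of such a partial order loss, sketched under assumptions: embeddings of views whose azimuths are closer to a reference should have higher cosine similarity to it, and ordering violations are penalized with a margin. The features, reference choice, and margin below are not taken from the paper.

```python
# Illustrative margin-based partial order loss: cosine similarity to a reference view
# should decrease as the azimuth distance to that view grows.
import torch
import torch.nn.functional as F

def partial_order_loss(ref_emb, view_embs, azimuths_deg, margin=0.05):
    """ref_emb: (D,), view_embs: (N, D), azimuths_deg: (N,) relative to the reference."""
    sims = F.cosine_similarity(view_embs, ref_emb.unsqueeze(0), dim=1)     # (N,)
    dist = azimuths_deg.abs()
    closer = dist.unsqueeze(1) < dist.unsqueeze(0)                         # closer[i, j]: view i nearer than view j
    violation = F.relu(margin + sims.unsqueeze(0) - sims.unsqueeze(1))     # want sims[i] >= sims[j] + margin
    return (violation * closer.float()).sum() / closer.float().sum().clamp(min=1)

ref = torch.randn(64)
views = torch.randn(8, 64)
az = torch.tensor([10., 30., 60., 90., 120., 150., 170., 180.])
print(partial_order_loss(ref, views, az).item())
```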

Result: Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.

Conclusion: ConsDreamer provides a novel approach to address fundamental view bias challenges in zero-shot text-to-3D generation, improving consistency and quality of 3D content creation from textual descriptions.

Abstract: Recent advances in zero-shot text-to-3D generation have revolutionized 3D content creation by enabling direct synthesis from textual descriptions. While state-of-the-art methods leverage 3D Gaussian Splatting with score distillation to enhance multi-view rendering through pre-trained text-to-image (T2I) models, they suffer from inherent prior view biases in T2I priors. These biases lead to inconsistent 3D generation, particularly manifesting as the multi-face Janus problem, where objects exhibit conflicting features across views. To address this fundamental challenge, we propose ConsDreamer, a novel method that mitigates view bias by refining both the conditional and unconditional terms in the score distillation process: (1) a View Disentanglement Module (VDM) that eliminates viewpoint biases in conditional prompts by decoupling irrelevant view components and injecting precise view control; and (2) a similarity-based partial order loss that enforces geometric consistency in the unconditional term by aligning cosine similarities with azimuth relationships. Extensive experiments demonstrate that ConsDreamer can be seamlessly integrated into various 3D representations and score distillation paradigms, effectively mitigating the multi-face Janus problem.

[173] Video Dataset Condensation with Diffusion Models

Zhe Li, Hadrien Reynaud, Mischa Dombrowski, Sarah Cechnicka, Franciskus Xaverius Erick, Bernhard Kainz

Main category: cs.CV

TL;DR: Video dataset distillation method using video diffusion models and novel selection techniques to create compact synthetic datasets with improved performance.

DetailsMotivation: Address computational challenges from large datasets and complex models by creating efficient video dataset distillation methods that overcome limitations of existing approaches.

Method: Uses a video diffusion model for initial synthetic video generation, introduces the Video Spatio-Temporal U-Net (VST-UNet) for diverse subset selection, and employs the training-free Temporal-Aware Cluster-based Distillation (TAC-DT) algorithm.
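
A minimal sketch of training-free, cluster-based representative selection, assuming plain k-means over temporally mean-pooled features; the actual TAC-DT algorithm is temporal-aware and may differ in both the features and the clustering.

```python
# Illustrative training-free selection: cluster pooled video features and keep the
# video closest to each centroid as the condensed set.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
video_feats = rng.normal(size=(200, 16, 256))        # (num_videos, frames, feat_dim)
pooled = video_feats.mean(axis=1)                    # temporal mean pooling per video

k = 10                                               # distilled set size
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)
selected = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(pooled[members] - km.cluster_centers_[c], axis=1)
    selected.append(members[dists.argmin()])         # most central video of each cluster
print(sorted(selected))
```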

Result: Achieves up to 10.61% performance improvement over state-of-the-art on four benchmark datasets, consistently outperforming existing approaches.

Conclusion: Proposed method establishes new benchmark for video dataset distillation with significant computational efficiency gains and superior performance.

Abstract: In recent years, the rapid expansion of dataset sizes and the increasing complexity of deep learning models have significantly escalated the demand for computational resources, both for data storage and model training. Dataset distillation has emerged as a promising solution to address this challenge by generating a compact synthetic dataset that retains the essential information from a large real dataset. However, existing methods often suffer from limited performance, particularly in the video domain. In this paper, we focus on video dataset distillation. We begin by employing a video diffusion model to generate synthetic videos. Since the videos are generated only once, this significantly reduces computational costs. Next, we introduce the Video Spatio-Temporal U-Net (VST-UNet), a model designed to select a diverse and informative subset of videos that effectively captures the characteristics of the original dataset. To further optimize computational efficiency, we explore a training-free clustering algorithm, Temporal-Aware Cluster-based Distillation (TAC-DT), to select representative videos without requiring additional training overhead. We validate the effectiveness of our approach through extensive experiments on four benchmark datasets, demonstrating performance improvements of up to 10.61% over the state-of-the-art. Our method consistently outperforms existing approaches across all datasets, establishing a new benchmark for video dataset distillation.

[174] MELLM: A Flow-Guided Large Language Model for Micro-Expression Understanding

Sirui Zhao, Zhengye Zhang, Shifeng Liu, Xinglong Mao, Shukang Yin, Chaoyou Fu, Tong Xu, Enhong Chen

Main category: cs.CV

TL;DR: Proposes MELLM, a micro-expression large language model that combines optical flow-based facial motion analysis with LLM reasoning for comprehensive micro-expression understanding beyond simple classification.

DetailsMotivation: Existing micro-expression recognition methods are limited to discrete emotion classification and lack comprehensive understanding of subtle facial dynamics and underlying emotional cues. While MLLMs have potential, they struggle to perceive subtle facial affective behaviors.

Method: Develops MEFlowNet (iterative warping-based optical flow estimator) to capture facial micro-movements, creates MEFlowDataset (54,611 onset-apex image pairs), builds MEU-Instruct instruction-tuning dataset, and fine-tunes MELLM to translate motion patterns into descriptions and emotional inferences.

Result: MEFlowNet significantly outperforms existing optical flow methods in facial and ME-flow estimation. MELLM achieves state-of-the-art accuracy and generalization across multiple ME benchmarks.

Conclusion: Presents two key contributions: MEFlowNet as the first dedicated ME flow estimator, and MELLM as the first LLM tailored for comprehensive micro-expression understanding, bridging the gap between subtle facial motion perception and emotional inference.

Abstract: Micro-expressions (MEs), brief and low-intensity facial movements revealing concealed emotions, are crucial for affective computing. Despite notable progress in ME recognition, existing methods are largely confined to discrete emotion classification, lacking the capacity for comprehensive ME Understanding (MEU), particularly in interpreting subtle facial dynamics and underlying emotional cues. While Multimodal Large Language Models (MLLMs) offer potential for MEU with their advanced reasoning abilities, they still struggle to perceive such subtle facial affective behaviors. To bridge this gap, we propose a ME Large Language Model (MELLM) that integrates optical flow-based sensitivity to subtle facial motions with the powerful inference ability of LLMs. Specifically, an iterative, warping-based optical-flow estimator, named MEFlowNet, is introduced to precisely capture facial micro-movements. For its training and evaluation, we construct MEFlowDataset, a large-scale optical-flow dataset with 54,611 onset-apex image pairs spanning diverse identities and subtle facial motions. Subsequently, we design a Flow-Guided Micro-Expression Understanding paradigm. Under this framework, the optical flow signals extracted by MEFlowNet are leveraged to build MEU-Instruct, an instruction-tuning dataset for MEU. MELLM is then fine-tuned on MEU-Instruct, enabling it to translate subtle motion patterns into human-readable descriptions and generate corresponding emotional inferences. Experiments demonstrate that MEFlowNet significantly outperforms existing optical flow methods in facial and ME-flow estimation, while MELLM achieves state-of-the-art accuracy and generalization across multiple ME benchmarks. To the best of our knowledge, this work presents two key contributions: MEFlowNet as the first dedicated ME flow estimator, and MELLM as the first LLM tailored for MEU.

[175] Leveraging Multi-Modal Information to Enhance Dataset Distillation

Zhe Li, Hadrien Reynaud, Bernhard Kainz

Main category: cs.CV

TL;DR: Multi-modal dataset distillation framework using caption-guided supervision and object-centric masking to improve synthetic data quality and privacy protection.

DetailsMotivation: Existing dataset distillation methods focus only on visual representations, ignoring valuable multi-modal information. There's a need to leverage textual information and reduce background noise to improve synthetic data utility while enhancing privacy protection.

Method: Proposes a multi-modal framework with two key enhancements: 1) Caption-guided supervision using caption concatenation (fusing caption embeddings with visual features) and caption matching (enforcing semantic alignment via caption-based loss), and 2) Object-centric masking using segmentation masks to isolate target objects with masked feature alignment and masked gradient matching losses.
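
A minimal sketch of the two ingredients as described: caption concatenation fuses a caption embedding with the visual feature before classification, and masked feature alignment compares real and synthetic features only inside the object mask. Shapes and the exact fusion and alignment forms are illustrative assumptions.

```python
# Illustrative caption concatenation and masked feature alignment (not the paper's code).
import torch
import torch.nn as nn

visual_feat = torch.randn(8, 512)          # image features (batch, dim)
caption_emb = torch.randn(8, 256)          # caption embeddings
classifier = nn.Linear(512 + 256, 10)
logits = classifier(torch.cat([visual_feat, caption_emb], dim=1))   # caption concatenation

def masked_feature_alignment(real_feat, syn_feat, mask):
    """real_feat, syn_feat: (B, C, H, W); mask: (B, 1, H, W) with 1 on the object."""
    diff = (real_feat - syn_feat) ** 2 * mask
    return diff.sum() / (mask.sum() * real_feat.shape[1]).clamp(min=1)

real = torch.randn(2, 64, 14, 14)
syn = torch.randn(2, 64, 14, 14, requires_grad=True)   # synthetic images are the learnable data
mask = (torch.rand(2, 1, 14, 14) > 0.5).float()
print(logits.shape, masked_feature_alignment(real, syn, mask).item())
```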

Result: Extensive evaluations show the approach improves downstream performance while promoting privacy protection by minimizing exposure to real data.

Conclusion: Incorporating multi-modal information and object-centric masking in dataset distillation enhances synthetic data quality, improves utility, and strengthens privacy preservation compared to visual-only approaches.

Abstract: Dataset distillation aims to create a small and highly representative synthetic dataset that preserves the essential information of a larger real dataset. Beyond reducing storage and computational costs, related approaches offer a promising avenue for privacy preservation in computer vision by eliminating the need to store or share sensitive real-world images. Existing methods focus solely on optimizing visual representations, overlooking the potential of multi-modal information. In this work, we propose a multi-modal dataset distillation framework that incorporates two key enhancements: caption-guided supervision and object-centric masking. To leverage textual information, we introduce two strategies: caption concatenation, which fuses caption embeddings with visual features during classification, and caption matching, which enforces semantic alignment between real and synthetic data through a caption-based loss. To improve data utility and reduce unnecessary background noise, we employ segmentation masks to isolate target objects and introduce two novel losses: masked feature alignment and masked gradient matching, both aimed at promoting object-centric learning. Extensive evaluations demonstrate that our approach improves downstream performance while promoting privacy protection by minimizing exposure to real data.

[176] 50 Years of Automated Face Recognition

Minchul Kim, Anil Jain, Xiaoming Liu

Main category: cs.CV

TL;DR: This paper provides a comprehensive survey of automated face recognition evolution over 50+ years, from early handcrafted methods to modern deep learning systems that now rival or exceed human performance, analyzing key innovations and future directions.

DetailsMotivation: To trace the historical and technological evolution of face recognition systems, examining how the field has progressed from early algorithmic approaches to current state-of-the-art deep learning architectures, and to identify key innovations and future research directions.

Method: The paper conducts a comprehensive survey and analysis of face recognition evolution, examining pivotal innovations across multiple dimensions: dataset construction, loss function formulation, network architecture design, and feature fusion strategies. It analyzes the relationship between data scale, diversity, and model generalization.

Result: Recent systems have achieved near-perfect large-scale identification accuracy, with the leading algorithm in NIST FRTE 1:N benchmark reporting FNIR of 0.15% at FPIR of 0.001 on a gallery of over 10 million identities. The analysis shows how dataset expansion correlates with benchmark performance gains.

Conclusion: While face recognition has made remarkable progress, key open problems remain including scalable training, multi-modal fusion, synthetic data utilization, and developing interpretable recognition frameworks. The field continues to evolve with emerging directions building on current successes.

Abstract: Over the past five decades, automated face recognition (FR) has progressed from handcrafted geometric and statistical approaches to advanced deep learning architectures that now approach, and in many cases exceed, human performance. This paper traces the historical and technological evolution of FR, encompassing early algorithmic paradigms through to contemporary neural systems trained on extensive real and synthetically generated datasets. We examine pivotal innovations that have driven this progression, including advances in dataset construction, loss function formulation, network architecture design, and feature fusion strategies. Furthermore, we analyze the relationship between data scale, diversity, and model generalization, highlighting how dataset expansion correlates with benchmark performance gains. Recent systems have achieved near-perfect large-scale identification accuracy, with the leading algorithm in the latest NIST FRTE 1:N benchmark reporting an FNIR of 0.15% at an FPIR of 0.001 on a gallery of over 10 million identities. We delineate key open problems and emerging directions, including scalable training, multi-modal fusion, synthetic data, and interpretable recognition frameworks.

[177] Enabling Validation for Robust Few-Shot Recognition

Hanxin Wang, Tian Liu, Shu Kong

Main category: cs.CV

TL;DR: VEST framework improves few-shot recognition robustness by harmonizing performance on few-shot ID data and retrieved OOD data for validation, enabling better parameter selection and achieving SOTA results on both ID and OOD benchmarks.

DetailsMotivation: Current VLM finetuning methods for few-shot recognition generalize well to in-distribution test data but struggle with out-of-distribution data, highlighting the need for robust FSR. The core challenge is data scarcity including lack of validation data.

Method: Proposes Validation-Enabled Stage-wise Tuning (VEST) framework that introduces a novel validation strategy harmonizing performance gain on few-shot ID data and degradation on retrieved OOD data. This enables parameter selection for partial finetuning and checkpoint selection to mitigate overfitting.
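
A toy sketch of the validation idea, assuming one scalar score that trades the accuracy gain on the few-shot ID data against the accuracy drop on the retrieved data; the actual VEST criterion may combine the two terms differently.

```python
# Illustrative checkpoint selection: reward ID-validation gain, penalize degradation
# on the retrieved (OOD-like) data. Numbers and the exact trade-off are made up.
def vest_score(acc_id_ft, acc_id_zero, acc_ret_ft, acc_ret_zero, lam=1.0):
    gain = acc_id_ft - acc_id_zero                       # improvement on few-shot ID validation
    degradation = max(acc_ret_zero - acc_ret_ft, 0.0)    # loss on the retrieved data
    return gain - lam * degradation

checkpoints = {
    "epoch_05": vest_score(0.78, 0.70, 0.55, 0.60),
    "epoch_10": vest_score(0.82, 0.70, 0.48, 0.60),
    "epoch_20": vest_score(0.84, 0.70, 0.35, 0.60),
}
best = max(checkpoints, key=checkpoints.get)
print(best, checkpoints[best])   # favours gains that do not come from overfitting the ID set
```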

Result: Extensive experiments on ImageNet OOD benchmarks show VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.

Conclusion: VEST framework effectively addresses the validation paradox in few-shot recognition by repurposing retrieved OOD data for validation, leading to improved generalization and robustness in VLM finetuning for FSR tasks.

Abstract: Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes decently well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements w.r.t generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.

[178] Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Domain-RAG is a training-free, retrieval-guided compositional image generation framework for Cross-Domain Few-Shot Object Detection that creates domain-aligned synthetic samples by decomposing images, retrieving similar backgrounds, generating new backgrounds, and composing with preserved foregrounds.

DetailsMotivation: Existing data augmentation and generative methods for CD-FSOD often fail to preserve correct object categories or produce backgrounds coherent with target domains, making them difficult to apply directly. There's a need for methods that ensure both visual realism and domain alignment in synthetic samples.

Method: Domain-RAG uses a three-stage framework: 1) Domain-aware background retrieval - decomposes input images and retrieves semantically/stylistically similar images; 2) Domain-guided background generation - uses generative models conditioned on original and retrieved contexts to synthesize new backgrounds; 3) Foreground-background composition - preserves original foregrounds and composes them with newly generated domain-aligned backgrounds.

Result: Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results across diverse tasks including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. The method produces high-quality, domain-consistent samples without requiring additional supervision or training.

Conclusion: Domain-RAG effectively addresses the challenges of CD-FSOD by providing a training-free approach that generates domain-aligned synthetic samples, demonstrating superior performance across multiple challenging few-shot object detection scenarios.

Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.

[179] Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation

Qing Xu, Yuxiang Luo, Wenting Duan, Zhen Chen

Main category: cs.CV

TL;DR: Co-Seg++ is a unified medical segmentation framework that jointly performs semantic and instance segmentation through mutual enhancement, outperforming SOTA on CT and histopathology datasets.

DetailsMotivation: Existing medical segmentation approaches treat semantic and instance segmentation as isolated tasks, overlooking their interdependencies, which leads to suboptimal performance and insufficient medical image understanding.

Method: Proposes Co-Seg++ with two key components: 1) SSP-Encoder to capture long-range spatial and sequential relationships as prior constraints, and 2) MTC-Decoder using cross-guidance to strengthen contextual consistency for joint semantic and instance mask computation.

Result: Extensive experiments on diverse CT and histopathology datasets show Co-Seg++ outperforms state-of-the-art methods in semantic, instance, and panoptic segmentation of dental structures, histopathology tissues, and nuclei instances.

Conclusion: The proposed co-segmentation paradigm enables mutual enhancement between semantic and instance segmentation tasks, achieving superior performance across multiple medical segmentation applications with publicly available code.

Abstract: Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues, as well as numerous instances, for anatomical structure and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-sequential prompt encoder (SSP-Encoder) to capture long-range spatial and sequential relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.

[180] End-to-End Fine-Tuning of 3D Texture Generation using Differentiable Rewards

AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky

Main category: cs.CV

TL;DR: End-to-end differentiable framework for 3D texture generation that incorporates human feedback via reward functions, improving geometry awareness and alignment with preferences.

DetailsMotivation: Current 3D generative models produce high-quality textures but often fail to capture human preferences or task-specific requirements. Existing approaches rely on 2D text-to-image models that lack understanding of 3D structure.

Method: Proposes an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback as differentiable reward functions directly into 3D texture synthesis. Back-propagates preference signals through both geometric and appearance modules. Introduces three novel geometry-aware reward functions for controllable and interpretable 3D content creation from natural language.
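
A toy sketch of back-propagating a differentiable reward into a learnable texture; the renderer and reward below are placeholders, not the paper's geometry-aware rewards or its generation pipeline.

```python
# Illustrative reward back-propagation: gradients of a differentiable reward flow
# through a toy "renderer" into the texture parameters being optimized.
import torch
import torch.nn.functional as F

texture = torch.rand(3, 64, 64, requires_grad=True)     # learnable texture map
opt = torch.optim.Adam([texture], lr=0.05)

def render(tex):                        # toy differentiable renderer: just average-pool the texture
    return F.avg_pool2d(tex.unsqueeze(0), 4).squeeze(0)

def reward(img):                        # toy differentiable reward: prefer mid-gray, smooth output
    return -((img - 0.5) ** 2).mean() - 0.1 * (img[:, :, 1:] - img[:, :, :-1]).abs().mean()

for _ in range(100):
    opt.zero_grad()
    loss = -reward(render(texture))     # maximize the reward by minimizing its negative
    loss.backward()
    opt.step()
print(reward(render(texture)).item())   # reward rises as preference gradients reach the texture
```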

Result: The method consistently outperforms state-of-the-art approaches in qualitative, quantitative, and user-preference evaluations. Generates textures that respect 3D geometry structure and align with desired criteria.

Conclusion: The proposed differentiable texture learning framework effectively addresses limitations of existing 3D texture generation methods by incorporating human feedback and geometry awareness, offering a more controllable and interpretable pathway for high-quality 3D content creation.

Abstract: While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. Our implementation code is publicly available at: https://github.com/AHHHZ975/Differentiable-Texture-Learning

[181] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Yiwei He, Zhenglin Huang, Haiquan Wen, Tianxiao Li, Yi Dong, Hao Fei, Baoyuan Wu, Guangliang Cheng

Main category: cs.CV

TL;DR: BiMi is a bilingual multimodal framework that detects misinformation in news media by analyzing region-level visual edits and cross-lingual inconsistencies between images and subtitles, enhanced by online retrieval and improved explanations via GRPO.

DetailsMotivation: Increasing realism of multimodal content makes misinformation more subtle and harder to detect, especially in news media with bilingual subtitles where localized image edits and cross-lingual inconsistencies can jointly distort meaning while appearing plausible.

Method: BiMi framework jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation. It integrates an online retrieval module for external context and uses Group Relative Policy Optimization (GRPO) to improve explanation quality.

Result: BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore. The authors also release BiMiBench, a large-scale benchmark with 104,000 samples of systematically edited real news images and subtitles.

Conclusion: BiMi advances state-of-the-art performance in realistic, multilingual misinformation detection by effectively addressing the challenges of subtle multimodal manipulations in bilingual news content through joint analysis of visual and linguistic inconsistencies.

Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.

[182] Learning an Ensemble Token from Task-driven Priors in Facial Analysis

Sunyong Seo, Semin Kim, Jongha Lee

Main category: cs.CV

TL;DR: KT-Adapter introduces knowledge tokens for efficient integration of high-fidelity facial analysis features from different backbone architectures with minimal computational cost.

DetailsMotivation: Facial analysis requires task-specific features, but combining high-fidelity models from CNNs (spatial information) and ViTs (semantic information) is computationally expensive. There's a need for efficient feature integration without heavy computational costs.

Method: Proposes KT-Adapter with robust prior unification learning that generates knowledge tokens within self-attention mechanisms. These tokens share mutual information across pre-trained encoders, enabling efficient feature representation integration.
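
The summary does not spell out how the knowledge token is realized; the sketch below is a minimal, hypothetical illustration of the general idea: a single learnable token is appended to features from frozen CNN and ViT encoders, updated by self-attention, and read out as the ensemble representation. The KnowledgeTokenAdapter class, its dimensions, and the single attention layer are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class KnowledgeTokenAdapter(nn.Module):
    """Learn one ensemble token over features from multiple frozen encoders (illustrative)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cnn_feat: torch.Tensor, vit_feat: torch.Tensor) -> torch.Tensor:
        # cnn_feat: (B, N_cnn, D) spatial tokens, vit_feat: (B, N_vit, D) patch tokens
        b = cnn_feat.shape[0]
        seq = torch.cat([self.token.expand(b, -1, -1), cnn_feat, vit_feat], dim=1)
        out, _ = self.attn(seq, seq, seq)   # self-attention shares information across encoders
        return out[:, 0]                    # the updated knowledge token, (B, D)

adapter = KnowledgeTokenAdapter(dim=256)
cnn_tokens = torch.randn(2, 49, 256)     # e.g. a flattened 7x7 CNN feature map
vit_tokens = torch.randn(2, 196, 256)    # e.g. 14x14 ViT patch embeddings
print(adapter(cnn_tokens, vit_tokens).shape)   # torch.Size([2, 256])
```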

Result: Improved performance across facial analysis tasks with statistically significant enhancements in feature representations. The approach achieves high efficiency with negligible computational overhead.

Conclusion: KT-Adapter provides an effective solution for integrating high-fidelity feature representations from different architectures in a computationally efficient manner, advancing facial analysis capabilities.

Abstract: Facial analysis exhibits task-specific feature variations. While Convolutional Neural Networks (CNNs) have enabled the fine-grained representation of spatial information, Vision Transformers (ViTs) have facilitated the representation of semantic information at the patch level. While backbone architectures have improved over the past decade, combining high-fidelity models often incurs substantial computational costs from a feature-representation perspective. In this work, we introduce KT-Adapter, a novel methodology for learning a knowledge token, which enables the integration of high-fidelity feature representations in a computationally efficient manner. Specifically, we propose a robust prior unification learning method that generates a knowledge token within a self-attention mechanism, sharing mutual information across the pre-trained encoders. This knowledge token approach offers high efficiency with negligible computational cost. Our results show improved performance across facial analysis tasks, with statistically significant enhancements observed in the feature representations.

[183] DIVER: Reinforced Diffusion Breaks Imitation Bottlenecks in End-to-End Autonomous Driving

Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo

Main category: cs.CV

TL;DR: DIVER is an end-to-end autonomous driving framework that combines reinforcement learning with diffusion models to generate diverse and feasible trajectories, addressing the conservative behavior limitations of imitation learning.

DetailsMotivation: Most end-to-end autonomous driving methods rely on single expert demonstrations through imitation learning, leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios.

Method: DIVER integrates reinforcement learning with diffusion-based generation: 1) Conditions on map elements and surrounding agents to generate multiple reference trajectories from single ground-truth trajectories, 2) Uses reinforcement learning to guide diffusion process with reward-based supervision for safety and diversity constraints.
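
The exact definition of the proposed Diversity metric is not given in this summary; the following is a minimal sketch assuming it is computed as the mean pairwise distance between multi-mode trajectory predictions. The trajectory_diversity helper and the shapes are illustrative.

```python
import numpy as np

def trajectory_diversity(trajectories: np.ndarray) -> float:
    """Mean pairwise L2 distance between predicted trajectories.

    trajectories: array of shape (M, T, 2) holding M candidate
    trajectories of T (x, y) waypoints each.
    """
    m = trajectories.shape[0]
    if m < 2:
        return 0.0
    dists = []
    for i in range(m):
        for j in range(i + 1, m):
            # average point-wise distance between the two modes
            d = np.linalg.norm(trajectories[i] - trajectories[j], axis=-1).mean()
            dists.append(d)
    return float(np.mean(dists))

# Example: three 8-step candidate trajectories
rng = np.random.default_rng(0)
modes = rng.normal(size=(3, 8, 2))
print(trajectory_diversity(modes))   # higher values indicate less mode collapse
```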

Result: Extensive experiments on closed-loop NAVSIM and Bench2Drive benchmarks and open-loop nuScenes dataset show DIVER significantly improves trajectory diversity and effectively addresses mode collapse problem in imitation learning.

Conclusion: DIVER successfully addresses the limitations of imitation learning by generating diverse and feasible trajectories through reinforced diffusion-based generation, with proposed diversity metrics for better evaluation.

Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode predictions. Extensive experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

[184] CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

Main category: cs.CV

TL;DR: A dual-model framework for embodied reference understanding that combines head-to-fingertip and wrist-to-fingertip pointing directions with CLIP-guided ensemble fusion to improve object reference prediction.

DetailsMotivation: Existing methods for embodied reference understanding fail to fully exploit visual disambiguation signals and rely on overly limiting single-line assumptions about pointing direction. The referent often aligns with head-to-fingertip direction, but in many cases aligns more closely with wrist-to-fingertip direction.

Method: Proposes a dual-model framework with one model learning from head-to-fingertip direction and another from wrist-to-fingertip direction. Uses Gaussian ray heatmap representation of pointing lines as input. Introduces CLIP-Aware Pointing Ensemble module for hybrid ensemble guided by CLIP features. Incorporates auxiliary object center prediction head to enhance referent localization.
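
As a rough illustration of the Gaussian ray heatmap input, the sketch below rasterizes a heatmap that peaks along the half-ray from one keypoint (e.g., head or wrist) through the fingertip; the bandwidth sigma and the exact parameterization are assumptions rather than the paper's recipe.

```python
import numpy as np

def gaussian_ray_heatmap(h, w, start, end, sigma=8.0):
    """Heatmap that peaks along the ray from `start` through `end`.

    start, end: (x, y) pixel coordinates (e.g. head and fingertip).
    Pixels behind the start point are suppressed so the map encodes
    a direction, not just a line.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs, ys], axis=-1).astype(np.float32)        # (h, w, 2)
    a = np.asarray(start, dtype=np.float32)
    d = np.asarray(end, dtype=np.float32) - a
    d /= (np.linalg.norm(d) + 1e-8)                           # unit ray direction
    rel = p - a
    t = rel @ d                                               # projection onto the ray
    closest = a + t[..., None] * d                            # nearest point on the ray
    dist = np.linalg.norm(p - closest, axis=-1)               # perpendicular distance
    heat = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    heat[t < 0] = 0.0                                         # keep only the forward half-ray
    return heat

hm = gaussian_ray_heatmap(240, 320, start=(60, 40), end=(120, 90))
print(hm.shape, hm.max())
```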

Result: Achieves 75.0 mAP at 0.25 IoU on YouRefIt benchmark, with state-of-the-art CLIP and C_D scores. Demonstrates robust performance on unseen CAESAR and ISL Pointing benchmarks, showing strong generalization capability.

Conclusion: The proposed dual-model approach with CLIP-guided ensemble fusion effectively addresses the limitations of single-line pointing assumptions and improves embodied reference understanding by better leveraging multimodal cues from pointing gestures and language.

Abstract: We address Embodied Reference Understanding, the task of predicting the object a person in the scene refers to through pointing gesture and language. This requires multimodal reasoning over text, visual pointing cues, and scene context, yet existing methods often fail to fully exploit visual disambiguation signals. We also observe that while the referent often aligns with the head-to-fingertip direction, in many cases it aligns more closely with the wrist-to-fingertip direction, making a single-line assumption overly limiting. To address this, we propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We introduce a Gaussian ray heatmap representation of these lines and use them as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To fuse their complementary strengths, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble guided by CLIP features. We further incorporate an auxiliary object center prediction head to enhance referent localization. We validate our approach on YouRefIt, achieving 75.0 mAP at 0.25 IoU, alongside state-of-the-art CLIP and C_D scores, and demonstrate its generality on unseen CAESAR and ISL Pointing, showing robust performance across benchmarks.

[185] CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang

Main category: cs.CV

TL;DR: CATP is a training-free image token pruning method for multimodal in-context learning that reduces 77.8% of image tokens while improving performance by 0.6% and reducing inference latency by 10.78%.

DetailsMotivation: Current image token pruning methods focus on single-image tasks and overlook multimodal ICL, where token redundancy is higher and efficiency is more critical. Existing methods cause large accuracy drops in multimodal ICL settings, creating a need for new approaches.

Method: Contextually Adaptive Token Pruning (CATP) uses two stages of progressive pruning that fully reflect complex cross-modal interactions in the input sequence. It’s a training-free method specifically designed for multimodal ICL.
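
The two-stage progressive pruning itself is not detailed here; the sketch below only illustrates the generic building block of ranking image tokens by the attention they receive from text tokens and keeping the top fraction. The single-stage prune_image_tokens helper and the keep ratio are assumptions (the ratio merely mirrors the 77.8% removal reported in the results).

```python
import torch

def prune_image_tokens(image_tokens, text_tokens, keep_ratio=0.222):
    """Keep the image tokens that receive the most attention from text tokens.

    image_tokens: (N_img, D), text_tokens: (N_txt, D).
    This is a single-stage illustration; CATP itself applies two
    progressive stages conditioned on the full in-context sequence.
    """
    scale = image_tokens.shape[-1] ** 0.5
    attn = torch.softmax(text_tokens @ image_tokens.T / scale, dim=-1)  # (N_txt, N_img)
    scores = attn.mean(dim=0)                                           # relevance per image token
    k = max(1, int(keep_ratio * image_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values                         # preserve original order
    return image_tokens[keep], keep

img = torch.randn(576, 64)   # one image's visual tokens
txt = torch.randn(32, 64)    # text/instruction tokens
kept, idx = prune_image_tokens(img, txt)
print(kept.shape)            # e.g. torch.Size([127, 64])
```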

Result: After removing 77.8% of image tokens, CATP achieves average performance gain of 0.6% over vanilla model on four LVLMs and eight benchmarks, outperforming all baselines. It reduces inference latency by average of 10.78%.

Conclusion: CATP strengthens the practical value of multimodal ICL and lays foundation for future progress in interleaved image-text settings by addressing token redundancy while maintaining or improving performance.

Abstract: Modern large vision-language models (LVLMs) convert each input image into a large set of tokens that far outnumber the text tokens. Although this improves visual perception, it also introduces severe image token redundancy. Because image tokens contain sparse information, many contribute little to reasoning but greatly increase inference cost. Recent image token pruning methods address this issue by identifying important tokens and removing the rest. These methods improve efficiency with only small performance drops. However, most of them focus on single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is higher and efficiency is more important. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and lead to unstable performance. When existing pruning methods are applied in this setting, they cause large accuracy drops, which exposes a clear gap and the need for new approaches. To address this, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method designed for multimodal ICL. CATP uses two stages of progressive pruning that fully reflect the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP achieves an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, clearly outperforming all baselines. At the same time, it improves efficiency by reducing inference latency by an average of 10.78%. CATP strengthens the practical value of multimodal ICL and lays the foundation for future progress in interleaved image-text settings.

[186] MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training

Muhammad Osama Zeeshan, Natacha Gillet, Alessandro Lameiras Koerich, Marco Pedersoli, Francois Bremond, Eric Granger

Main category: cs.CV

TL;DR: MuSACo is a multimodal subject-specific adaptation method for expression recognition that uses co-training to leverage complementary information across modalities and source domains, outperforming existing UDA and MSDA approaches.

DetailsMotivation: Current MSDA methods for personalized expression recognition often overlook multimodal information or blend sources into a single domain, failing to capture unique subject-specific characteristics crucial for applications like patient-specific assessment in digital health.

Method: MuSACo uses co-training to select relevant source subjects and generate pseudo-labels using the dominant modality for class-aware learning. It combines this with a class-agnostic loss for less confident target samples, while aligning source features from each modality and combining only confident target features.

Result: Experimental results on challenging multimodal ER datasets (BioVid, StressID, and BAH) show that MuSACo outperforms both UDA (blending) and state-of-the-art MSDA methods.

Conclusion: MuSACo effectively addresses limitations of existing MSDA approaches by leveraging multimodal information and preserving subject diversity, making it particularly suitable for affective computing applications where subject-level nuances are crucial.

Abstract: Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, where each domain corresponds to a specific subject to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Experimental results on challenging multimodal ER datasets: BioVid, StressID, and BAH show that MuSACo outperforms UDA (blending) and state-of-the-art MSDA methods.

[187] DAASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

Main category: cs.CV

TL;DR: DAASH is a differentiable meta-attack framework that strategically composes existing Lp-based attacks to generate perceptually aligned adversarial examples with better visual quality and higher success rates than state-of-the-art perceptual attacks.

DetailsMotivation: Traditional Lp-norm bounded adversarial examples often don't align well with human perception, and it's unclear whether insights from Lp-constrained attacks can improve perceptual efficacy. There's a need for effective perceptually aligned adversarial examples that leverage existing attack methods.

Method: DAASH uses a multi-stage differentiable framework that aggregates candidate adversarial examples from multiple base Lp-based attacks using learned adaptive weights. It employs a novel meta-loss function that jointly minimizes misclassification loss and perceptual distortion, dynamically modulating each base attack’s contribution throughout the stages.
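
A minimal sketch of one aggregation stage under stated assumptions: candidate perturbations from several base attacks are blended with learned softmax weights and optimized against a meta-loss that trades off misclassification against a simple L2 distortion proxy (the paper uses richer perceptual terms and multiple stages). The aggregate_stage helper and the toy classifier are illustrative.

```python
import torch
import torch.nn.functional as F

def aggregate_stage(x, candidates, model, target, logits_w, lam=0.1, steps=20, lr=0.05):
    """One DAASH-style stage: blend base-attack perturbations with learned weights.

    candidates: (K, C, H, W) adversarial examples from K base Lp attacks.
    logits_w:   (K,) mixing logits carried across stages.
    """
    deltas = candidates - x.unsqueeze(0)                    # (K, C, H, W) perturbations
    w = torch.nn.Parameter(logits_w.clone())
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        mix = (torch.softmax(w, dim=0).view(-1, 1, 1, 1) * deltas).sum(0)
        x_adv = (x + mix).clamp(0, 1)
        logits = model(x_adv.unsqueeze(0))
        adv_loss = -F.cross_entropy(logits, target.unsqueeze(0))  # push away from the true label
        percep = (x_adv - x).pow(2).mean()                        # crude perceptual-distortion proxy
        loss = adv_loss + lam * percep
        opt.zero_grad()
        loss.backward()
        opt.step()
    x_final = (x + (torch.softmax(w, 0).view(-1, 1, 1, 1) * deltas).sum(0)).clamp(0, 1)
    return x_final, w.detach()

# Toy usage with a dummy classifier and three fake base-attack candidates
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(3, 32, 32)
cands = torch.stack([(x + 0.03 * torch.randn_like(x)).clamp(0, 1) for _ in range(3)])
x_adv, w = aggregate_stage(x, cands, model, target=torch.tensor(3), logits_w=torch.zeros(3))
print(x_adv.shape, torch.softmax(w, 0))
```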

Result: DAASH significantly outperforms state-of-the-art perceptual attacks like AdvAD, achieving 20.63% higher attack success rates and superior visual quality (SSIM, LPIPS, FID improvements ≈ 11, 0.015, and 5.7 respectively). It generalizes well to unseen defenses without requiring handcrafted adaptive attacks.

Conclusion: DAASH demonstrates that Lp-based attacks can be effectively leveraged to create perceptually aligned adversarial examples through strategic composition, providing a practical and strong baseline for evaluating model robustness across various defenses.

Abstract: Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only recently have a few methods begun specifically exploring perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DAASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DAASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DAASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained base attacks, DAASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of approximately 11, 0.015, and 5.7, respectively). Furthermore, DAASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.

[188] FLAIR: Frequency- and Locality-Aware Implicit Neural Representations

Sukhun Ko, Seokhyun Yoon, Dahyeon Kye, Kyle Min, Chanho Eom, Jihyong Oh

Main category: cs.CV

TL;DR: FLAIR introduces frequency- and locality-aware implicit neural representations with Band-Localized Activation and Wavelet-Energy-Guided Encoding to address spectral bias and improve frequency selectivity in INRs.

DetailsMotivation: Existing Implicit Neural Representations lack frequency selectivity and spatial localization, leading to spectral bias where they learn low-frequency components early but struggle with high-frequency details, resulting in over-reliance on redundant signal components.

Method: Proposes FLAIR with two key innovations: 1) Band-Localized Activation (BLA) for joint frequency selection and spatial localization under time-frequency uncertainty principle constraints, and 2) Wavelet-Energy-Guided Encoding (WEGE) that uses discrete wavelet transform to compute energy scores and guide frequency information to the network.
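
For WEGE, the sketch below shows a minimal way to compute per-band energy scores from a 2D discrete wavelet transform with PyWavelets; the wavelet, decomposition level, and normalization are assumptions rather than the paper's exact recipe.

```python
import numpy as np
import pywt

def wavelet_energy_scores(image: np.ndarray, wavelet: str = "haar", level: int = 3):
    """Per-band energy fractions from a 2D discrete wavelet transform.

    Returns one score per decomposition band (coarsest approximation
    plus the detail bands of each level), normalized to sum to 1.
    """
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    energies = [float(np.sum(coeffs[0] ** 2))]             # approximation band
    for (ch, cv, cd) in coeffs[1:]:                        # detail bands, coarse to fine
        energies.append(float(np.sum(ch ** 2) + np.sum(cv ** 2) + np.sum(cd ** 2)))
    energies = np.asarray(energies)
    return energies / energies.sum()

img = np.random.rand(128, 128)
print(wavelet_energy_scores(img))   # band energy fractions summing to 1
```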

Result: The method consistently outperforms existing INRs in 2D image representation, 3D shape reconstruction, and novel view synthesis, effectively mitigating spectral bias and enhancing training stability.

Conclusion: FLAIR addresses fundamental limitations of current INRs by incorporating frequency selectivity and spatial localization, leading to improved performance across multiple vision tasks through better frequency control and adaptive band management.

Abstract: Implicit Neural Representations (INRs) leverage neural networks to map coordinates to corresponding signals, enabling continuous and compact representations. This paradigm has driven significant advances in various vision tasks. However, existing INRs lack frequency selectivity and spatial localization, leading to an over-reliance on redundant signal components. Consequently, they exhibit spectral bias, tending to learn low-frequency components early while struggling to capture fine high-frequency details. To address these issues, we propose FLAIR (Frequency- and Locality-Aware Implicit Neural Representations), which incorporates two key innovations. The first is Band-Localized Activation (BLA), a novel activation designed for joint frequency selection and spatial localization under the constraints of the time-frequency uncertainty principle (TFUP). Through structured frequency control and spatially localized responses, BLA effectively mitigates spectral bias and enhances training stability. The second is Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet transform to compute energy scores and explicitly guide frequency information to the network, enabling precise frequency selection and adaptive band control. Our method consistently outperforms existing INRs in 2D image representation, as well as 3D shape reconstruction and novel view synthesis.

[189] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Yanxu Meng, Haoning Wu, Ya Zhang, Weidi Xie

Main category: cs.CV

TL;DR: SceneGen is a novel framework that generates multiple 3D assets with geometry and texture from a single scene image and object masks in one feedforward pass, without needing optimization or asset retrieval.

DetailsMotivation: 3D content generation is crucial for VR/AR and embodied AI applications, but synthesizing multiple 3D assets within a single scene image remains challenging.

Method: SceneGen uses a feature aggregation module that integrates local and global scene information from visual and geometric encoders, coupled with a position head to generate 3D assets and their relative spatial positions simultaneously.

Result: Extensive evaluations show SceneGen’s efficiency and robustness, with direct extensibility to multi-image input scenarios despite being trained only on single-image inputs.

Conclusion: SceneGen offers a novel solution for high-quality 3D content generation that could advance practical applications in downstream tasks.

Abstract: 3D content generation has recently attracted significant research interest, driven by its critical applications in VR/AR and embodied AI. In this work, we tackle the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for extra optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen’s direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architecture yields improved generation performance when multiple images are provided; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robustness of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: https://mengmouxu.github.io/SceneGen.

[190] Harnessing Object Grounding for Time-Sensitive Video Understanding

Tz-Ying Wu, Sharath Nittur Sridhar, Subarna Tripathi

Main category: cs.CV

TL;DR: GO-Tokenizer improves video understanding in Video-LLMs by encoding compact object information from object detectors, outperforming text-based object descriptions.

DetailsMotivation: Time-sensitive video understanding tasks can benefit from grounded object information, but textual descriptions of objects add token length and are susceptible to noise in object-level information.

Method: GO-Tokenizer: a lightweight add-on module for Video-LLMs that uses off-the-shelf object detectors to encode compact object information on the fly, avoiding lengthy textual descriptions.

Result: Pretraining with GO-Tokenizer outperforms vanilla Video-LLM and text-based object descriptions, with gains generalizing across models, datasets, and tasks like reasoning temporal localization and dense captioning.

Conclusion: GO-Tokenizer effectively improves time-sensitive video understanding in Video-LLMs by providing compact, efficient object information encoding that generalizes well across different scenarios.

Abstract: We propose to improve the time-sensitive video understanding (TSV) capability of video large language models (Video-LLMs) with grounded objects (GO). We hypothesize that TSV tasks can benefit from GO within frames, which is supported by our preliminary experiments on LITA, a state-of-the-art Video-LLM for reasoning temporal localization. While augmenting prompts with textual descriptions of these object annotations improves the performance of LITA, it also introduces extra token length and susceptibility to the noise in object-level information. To address this, we propose GO-Tokenizer, a lightweight add-on module for Video-LLMs leveraging off-the-shelf object detectors to encode compact object information on the fly. Experimental results demonstrate that pretraining with GO-Tokenizer outperforms the vanilla Video-LLM and its counterpart, utilizing textual descriptions of objects in the prompt. The gain generalizes across different models, datasets, and video understanding tasks, such as reasoning temporal localization and dense captioning.

[191] Zo3T: Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training

Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

Main category: cs.CV

TL;DR: Zo3T is a zero-shot test-time-training framework for trajectory-guided image-to-video generation that improves 3D realism and motion accuracy without fine-tuning.

DetailsMotivation: Existing methods for trajectory-guided I2V generation either require computationally expensive fine-tuning on scarce datasets or produce unrealistic motion by neglecting 3D perspective and creating misalignment between manipulated latents and noise predictions.

Method: Three core innovations: 1) 3D-Aware Kinematic Projection using scene depth for perspective-correct transformations; 2) Trajectory-Guided Test-Time LoRA with ephemeral adapters and regional feature consistency loss; 3) Guidance Field Rectification with one-step lookahead strategy.
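
The 3D-Aware Kinematic Projection relies on standard pinhole geometry: unproject a pixel with its inferred depth, move it in camera space, and reproject. The sketch below illustrates only that geometry; the intrinsics and translation are illustrative values, and the paper derives full affine transformations for target regions rather than single points.

```python
import numpy as np

def perspective_correct_shift(px, depth, t_cam, K):
    """Shift an image point by a 3D camera-space translation.

    px:    (u, v) pixel coordinate of the dragged point.
    depth: metric depth at that pixel (e.g. from a monocular depth model).
    t_cam: (tx, ty, tz) translation in camera coordinates.
    K:     3x3 pinhole intrinsics.
    Nearby points move more in the image than distant ones, unlike a flat 2D shift.
    """
    uv1 = np.array([px[0], px[1], 1.0])
    p_cam = depth * (np.linalg.inv(K) @ uv1)   # unproject to 3D
    p_new = p_cam + np.asarray(t_cam)          # move in 3D
    proj = K @ p_new
    return proj[:2] / proj[2]

K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0, 0.0, 1.0]])
print(perspective_correct_shift((200, 150), depth=2.0, t_cam=(0.3, 0.0, 0.0), K=K))  # ~75 px shift
print(perspective_correct_shift((200, 150), depth=8.0, t_cam=(0.3, 0.0, 0.0), K=K))  # ~19 px shift
```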

Result: Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

Conclusion: The proposed zero-shot test-time-training framework effectively addresses challenges in trajectory-guided I2V generation by ensuring 3D perspective awareness, latent-manifold alignment, and efficient generative progression without requiring fine-tuning.

Abstract: Trajectory-Guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion by neglecting 3D perspective and creating a misalignment between the manipulated latents and the network’s noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations: First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects and optimizes ephemeral LoRA adapters into the denoising network alongside the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising evolutionary path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.

[192] A Data-driven Typology of Vision Models from Integrated Representational Metrics

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Main category: cs.CV

TL;DR: The paper develops a framework using multiple similarity metrics and network fusion to analyze and classify vision models based on their representational structure, revealing that computational strategies shaped by architecture and training objectives define model families more than surface design categories.

DetailsMotivation: Large vision models vary widely in architecture and training, but there are no principled methods to determine which aspects of their representations are shared across model families versus which reflect distinctive computational strategies.

Method: Uses a suite of representational similarity metrics (capturing geometry, unit tuning, or linear decodability) and adapts Similarity Network Fusion (SNF) from multi-omics integration to combine these complementary facets into composite signatures.
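
As a concrete example of one metric named above, the sketch below computes a basic RSA score: each model's representational dissimilarity matrix (RDM) over a shared stimulus set, correlated with Spearman's rho. The correlation-distance RDM and the toy features are assumptions; the SNF fusion step is not shown.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Representational similarity between two models on the same stimuli.

    feats_a, feats_b: (N_stimuli, D_a) and (N_stimuli, D_b) activations.
    Each model's RDM is the condensed pairwise correlation-distance
    vector; RSA is the Spearman correlation between the two RDMs.
    """
    rdm_a = pdist(feats_a, metric="correlation")
    rdm_b = pdist(feats_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return float(rho)

rng = np.random.default_rng(0)
stim = rng.normal(size=(50, 32))
model_a = stim @ rng.normal(size=(32, 128))   # two toy "models" sharing stimulus structure
model_b = stim @ rng.normal(size=(32, 64))
print(rsa_score(model_a, model_b))
```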

Result: Geometry and tuning metrics yield strong family discrimination, while linear decodability shows weaker separation. SNF achieves substantially sharper family separation than individual metrics. Clustering reveals: supervised ResNets and ViTs form distinct clusters; all self-supervised models group together across architectures; hybrid architectures cluster with masked autoencoders.

Conclusion: The biology-inspired framework provides a principled typology showing that emergent computational strategies—shaped jointly by architecture and training objective—define representational structure beyond surface design categories.

Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.

[193] ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang

Main category: cs.CV

TL;DR: ContextGen is a Diffusion Transformer framework for multi-instance image generation that uses layout and reference images to precisely control object positions and preserve multiple subject identities.

DetailsMotivation: Current diffusion models struggle with multi-instance image generation due to limitations in precise layout control and preserving identities of multiple distinct subjects simultaneously.

Method: Two key innovations: 1) Contextual Layout Anchoring (CLA) incorporates composite layout images to anchor objects in desired positions, and 2) Identity Consistency Attention (ICA) uses reference images to maintain identity consistency across multiple instances.

Result: ContextGen achieves state-of-the-art performance, outperforming existing methods in control precision, identity fidelity, and overall visual quality. The team also created IMIG-100K, the first large-scale dataset with detailed layout and identity annotations.

Conclusion: ContextGen successfully addresses key challenges in multi-instance image generation through its novel architecture and attention mechanisms, setting new benchmarks for the field while introducing valuable dataset resources.

Abstract: Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.

[194] PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang

Main category: cs.CV

TL;DR: PAGE-4D extends VGGT to handle dynamic scenes by introducing a dynamics-aware aggregator that disentangles static and dynamic information, enabling multi-task 4D reconstruction without post-processing.

DetailsMotivation: Existing 3D feed-forward models like VGGT perform well on static scenes but struggle with dynamic elements in real-world scenarios. There's a need for models that can handle moving objects and complex dynamics while performing multiple 4D reconstruction tasks.

Method: Extends VGGT to dynamic scenes with a dynamics-aware aggregator that predicts a dynamics-aware mask. This mask suppresses motion cues for camera pose estimation while amplifying them for geometry reconstruction, resolving the inherent conflict between these tasks.

Result: PAGE-4D consistently outperforms VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular/video depth estimation, and dense point map reconstruction.

Conclusion: The proposed dynamics-aware approach effectively handles the tension between different 4D reconstruction tasks in dynamic scenes, enabling robust multi-task performance without post-processing.

Abstract: Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction – all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask – suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.

[195] Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery

Tao Yan, Hao Huang, Yiwei Lu, Zeyu Wang, Ke Xu, Yinghui Wang, Xiaojun Chang, Rynson W. H. Lau

Main category: cs.CV

TL;DR: NFGlassNet uses flash/no-flash image pairs to detect glass surfaces by analyzing reflection dynamics, outperforming existing methods that rely on boundary or reflection cues alone.

DetailsMotivation: Glass surfaces are challenging to detect due to their transparency and lack of distinctive features. Existing methods rely on boundary cues or reflections but fail to exploit intrinsic glass properties. The authors observed that illumination differences across glass surfaces create predictable reflection dynamics in flash/no-flash images.

Method: Proposes NFGlassNet with two key modules: Reflection Contrast Mining Module (RCMM) extracts reflections from flash/no-flash pairs, and Reflection Guided Attention Module (RGAM) fuses reflection features with glass surface features. Uses a dataset of 3.3K flash/no-flash image pairs with ground truth annotations.

Result: Extensive experiments show the method outperforms state-of-the-art glass surface detection methods. The approach effectively leverages reflection dynamics for accurate glass localization.

Conclusion: Flash/no-flash imagery provides valuable reflection dynamics for glass surface detection. The proposed NFGlassNet with specialized modules successfully exploits these dynamics, achieving superior performance over existing methods.

Abstract: Glass surfaces are ubiquitous in daily life, typically appearing colorless, transparent, and lacking distinctive features. These characteristics make glass surface detection a challenging computer vision task. Existing glass surface detection methods always rely on boundary cues (e.g., window and door frames) or reflection cues to locate glass surfaces, but they fail to fully exploit the intrinsic properties of the glass itself for accurate localization. We observed that in most real-world scenes, the illumination intensity in front of the glass surface differs from that behind it, which results in variations in the reflections visible on the glass surface. Specifically, when standing on the brighter side of the glass and applying a flash towards the darker side, existing reflections on the glass surface tend to disappear. Conversely, while standing on the darker side and applying a flash towards the brighter side, distinct reflections will appear on the glass surface. Based on this phenomenon, we propose NFGlassNet, a novel method for glass surface detection that leverages the reflection dynamics present in flash/no-flash imagery. Specifically, we propose a Reflection Contrast Mining Module (RCMM) for extracting reflections, and a Reflection Guided Attention Module (RGAM) for fusing features from reflection and glass surface for accurate glass surface detection. To train our network, we also construct a dataset consisting of 3.3K no-flash and flash image pairs captured from various scenes with corresponding ground truth annotations. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. Our code, model, and dataset will be available upon acceptance of the manuscript.

[196] Wukong’s 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Minghao Yin, Yukang Cao, Kai Han

Main category: cs.CV

TL;DR: WUKONG is a training-free framework for high-fidelity textured 3D morphing using flow-based transformers, solving optimal transport barycenter problems for smooth shape transitions and preserving texture details through semantic consistency mechanisms.

DetailsMotivation: Conventional 3D morphing methods rely on manual correspondence matching and deformation trajectory estimation, which limits generalization and requires costly preprocessing. There's a need for a more efficient, high-fidelity approach that can handle diverse geometry and texture variations without extensive manual intervention.

Method: WUKONG leverages flow-based transformers’ generative prior for 3D transitions. It formulates morphing as an optimal transport barycenter problem for smooth shape transitions, uses sequential initialization to prevent geometric distortions, and implements similarity-guided semantic consistency mechanisms for texture preservation with precise blending control.
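
To make the optimal-transport-barycenter formulation concrete, the sketch below shows its simplest instance: the Wasserstein-2 barycenter of two 1D empirical distributions, obtained by interpolating sorted samples (displacement interpolation). This only illustrates the barycenter idea; the paper works with flow-model latents in 3D.

```python
import numpy as np

def w2_barycenter_1d(samples_a, samples_b, alpha):
    """Wasserstein-2 barycenter of two 1D empirical distributions.

    For equal-size sorted samples, the W2-optimal coupling matches
    i-th order statistics, so the barycenter at weight `alpha` is the
    point-wise interpolation of the sorted samples.
    """
    a = np.sort(np.asarray(samples_a))
    b = np.sort(np.asarray(samples_b))
    assert a.shape == b.shape
    return (1 - alpha) * a + alpha * b

rng = np.random.default_rng(0)
src = rng.normal(-2.0, 0.5, size=1000)   # stand-in "source shape" statistic
tgt = rng.normal(3.0, 1.5, size=1000)    # stand-in "target shape" statistic
mid = w2_barycenter_1d(src, tgt, alpha=0.5)
print(mid.mean(), mid.std())             # roughly midway in both mean (~0.5) and spread (~1.0)
```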

Result: Extensive evaluations show WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations while avoiding common artifacts like oversmoothing and maintaining semantic fidelity.

Conclusion: WUKONG provides a novel, training-free framework for high-fidelity textured 3D morphing that overcomes limitations of conventional methods by leveraging flow-based transformers and optimal transport theory, offering better generalization and eliminating costly preprocessing requirements.

Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods – which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) – WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

[197] IRPO: Boosting Image Restoration via Post-training GRPO

Haoxuan Xu, Yi Liu, Boyuan Jiang, Jinlong Peng, Donghao Luo, Xiaobin Hu, Shuicheng Yan, Haoang Li

Main category: cs.CV

TL;DR: IRPO is a novel low-level vision post-training paradigm that uses GRPO-based reinforcement learning with optimized data formulation and multi-component reward modeling to achieve state-of-the-art image restoration performance.

DetailsMotivation: Existing image restoration methods rely on pixel-level hard-fitting to ground-truth images, which leads to over-smoothing and poor generalization. Post-training paradigms have shown success in high-level generation tasks but remain unexplored for low-level vision tasks.

Method: IRPO uses a GRPO-based post-training paradigm with two key innovations: 1) Data formulation principle selecting underperforming samples from pre-training for optimal performance and efficiency, and 2) A three-component reward system balancing objective accuracy and human perception (General Reward for structural fidelity, Expert Reward using Qwen-VL for perceptual alignment, and Restoration Reward for task-specific low-level quality).
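
A minimal sketch of the GRPO-style group-relative advantage computation with a stand-in composite reward; the weights and the PSNR/expert/restoration placeholders are assumptions, not the paper's reward definitions.

```python
import numpy as np

def composite_reward(psnr, expert_score, restoration_score, w=(0.4, 0.3, 0.3)):
    """Illustrative stand-in for combining General / Expert / Restoration rewards."""
    return w[0] * psnr / 50.0 + w[1] * expert_score + w[2] * restoration_score

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize rewards within one prompt's sample group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One degraded input, G = 4 sampled restorations scored by the three reward components
group = [
    composite_reward(psnr=28.1, expert_score=0.62, restoration_score=0.55),
    composite_reward(psnr=30.4, expert_score=0.71, restoration_score=0.60),
    composite_reward(psnr=26.9, expert_score=0.58, restoration_score=0.49),
    composite_reward(psnr=29.7, expert_score=0.69, restoration_score=0.66),
]
print(grpo_advantages(group))   # higher-reward samples receive positive advantage
```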

Result: IRPO achieves state-of-the-art results across six in-domain and five out-of-domain low-level benchmarks, surpassing AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings.

Conclusion: IRPO demonstrates that post-training paradigms can be effectively adapted to low-level vision tasks through careful data formulation and reward modeling, achieving superior performance and generalization compared to traditional pixel-level fitting approaches.

Abstract: Recent advances in post-training paradigms have achieved remarkable success in high-level generation tasks, yet their potential for low-level vision remains rarely explored. Existing image restoration (IR) methods rely on pixel-level hard-fitting to ground-truth images, struggling with over-smoothing and poor generalization. To address these limitations, we propose IRPO, a low-level GRPO-based post-training paradigm that systematically explores both data formulation and reward modeling. We first explore a data formulation principle for the low-level post-training paradigm, in which selecting underperforming samples from the pre-training stage yields optimal performance and improved efficiency. Furthermore, we model a reward-level criteria system that balances objective accuracy and human perceptual preference through three complementary components: a General Reward for structural fidelity, an Expert Reward leveraging Qwen-VL for perceptual alignment, and a Restoration Reward for task-specific low-level quality. Comprehensive experiments on six in-domain and five out-of-domain (OOD) low-level benchmarks demonstrate that IRPO achieves state-of-the-art results across diverse degradation types, surpassing the AdaIR baseline by 0.83 dB on in-domain tasks and 3.43 dB on OOD settings. Our code is available at https://github.com/HaoxuanXU1024/IRPO.

[198] On the Temporality for Sketch Representation Learning

Marcelo Isaias de Moraes Junior, Moacir Antonelli Ponti

Main category: cs.CV

TL;DR: Sketches as sequences: absolute coordinates beat relative ones, non-autoregressive decoders beat autoregressive ones, and temporal importance depends on order and task.

DetailsMotivation: Despite advances in sketch representation learning, there's still uncertainty about whether treating sketches as sequences is justified and which temporal aspects matter most for representation quality.

Method: The paper investigates sketch representation by comparing different sequence modeling approaches, including absolute vs. relative coordinates and autoregressive vs. non-autoregressive decoders.
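
The absolute-vs-relative comparison boils down to whether each stroke point is stored as raw (x, y) or as an offset from the previous point; the sketch below shows the lossless conversion between the two, which is what changes the learning problem for the sequence model. The helper names are illustrative.

```python
import numpy as np

def to_relative(points: np.ndarray) -> np.ndarray:
    """(N, 2) absolute stroke points -> (N, 2) deltas; the first row is kept as-is."""
    rel = points.astype(np.float64).copy()
    rel[1:] = points[1:] - points[:-1]
    return rel

def to_absolute(deltas: np.ndarray) -> np.ndarray:
    """Inverse of to_relative via a cumulative sum."""
    return np.cumsum(deltas, axis=0)

stroke = np.array([[10, 10], [12, 14], [15, 15], [20, 13]], dtype=np.float64)
rel = to_relative(stroke)
print(rel)
print(np.allclose(to_absolute(rel), stroke))   # True: the transform is lossless
```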

Result: Absolute coordinates consistently outperform relative ones for sequence modeling of sketches. Non-autoregressive decoders outperform autoregressive counterparts. Temporal importance varies based on order and task.

Conclusion: While sketches can be effectively treated as sequences, the choice of coordinate representation (absolute), decoder type (non-autoregressive), and temporal ordering significantly impacts representation quality depending on the specific task.

Abstract: Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.

[199] MT-Depth: Multi-task Instance feature analysis for the Depth Completion

Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun

Main category: cs.CV

TL;DR: Instance-aware depth completion framework using binary instance masks as spatial priors to refine depth predictions, achieving better RMSE than U-Net-only and semantic-guided methods.

DetailsMotivation: Existing depth completion approaches often rely on semantic segmentation but overlook object-level understanding. The authors aim to leverage instance-aware cues to improve depth completion without needing dense semantic labels.

Method: Four-component framework: 1) frozen YOLO V11 instance segmentation branch for binary instance masks, 2) U-Net-based depth completion backbone, 3) cross-attention fusion module to integrate instance guidance, and 4) attention-guided prediction head. Instance masks guide depth refinement via cross-attention.
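
A minimal sketch of the cross-attention fusion idea, assuming depth-branch features act as queries over instance-mask features; the InstanceGuidedFusion module, its dimensions, and the single fusion layer are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InstanceGuidedFusion(nn.Module):
    """Depth features (queries) attend to instance-mask features (keys/values)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, depth_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        # depth_feat, mask_feat: (B, C, H, W) feature maps from the two branches
        b, c, h, w = depth_feat.shape
        q = depth_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = mask_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)                 # residual connection
        return fused.transpose(1, 2).reshape(b, c, h, w)

fusion = InstanceGuidedFusion(dim=128)
depth = torch.randn(2, 128, 16, 16)
masks = torch.randn(2, 128, 16, 16)
print(fusion(depth, masks).shape)   # torch.Size([2, 128, 16, 16])
```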

Result: Validated on Virtual KITTI 2 dataset, achieves lower RMSE than U-Net-only baseline and previous semantic-guided methods while maintaining competitive MAE. Shows improved depth accuracy near object boundaries, occlusions, and thin structures.

Conclusion: Incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels, effectively enhancing depth accuracy in challenging regions.

Abstract: Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object-level understanding. In this work, we introduce an instance-aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U-Net-based depth completion backbone, a cross-attention fusion module, and an attention-guided prediction head. The instance segmentation branch generates per-image foreground masks that guide the depth branch via cross-attention, allowing the network to focus on object-centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower Root Mean Squared Error (RMSE) compared to both a U-Net-only baseline and previous semantic-guided methods, while maintaining competitive Mean Absolute Error (MAE). Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance-aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.

[200] EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, Qi Tian

Main category: cs.CV

TL;DR: EMMA is an efficient unified multimodal architecture that handles understanding, generation, and editing tasks with improved efficiency and performance through token compression, channel-wise concatenation, shared-decoupled networks, and mixture-of-experts mechanisms.

DetailsMotivation: To create a more efficient and unified architecture for multimodal tasks (understanding, generation, editing) that addresses the computational inefficiency of existing approaches while maintaining high performance across different modalities.

Method: 1) Efficient autoencoder with 32x compression ratio to reduce token count; 2) Channel-wise concatenation instead of token-wise for visual tokens; 3) Shared-and-decoupled network for task-specific modeling; 4) Mixture-of-experts mechanism for visual understanding encoder.
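
The efficiency argument for channel-wise concatenation is simply that it keeps the sequence length fixed while token-wise concatenation doubles it; the shape check below illustrates this. Shapes are illustrative, and a projection back to the model width would typically follow the channel-wise concat.

```python
import torch

und_tokens = torch.randn(1, 256, 1024)   # visual-understanding tokens (B, N, C)
gen_tokens = torch.randn(1, 256, 1024)   # visual-generation tokens   (B, N, C)

token_wise = torch.cat([und_tokens, gen_tokens], dim=1)    # doubles the sequence length
channel_wise = torch.cat([und_tokens, gen_tokens], dim=2)  # same length, wider channels

print(token_wise.shape)    # torch.Size([1, 512, 1024]) -> more tokens for self-attention
print(channel_wise.shape)  # torch.Size([1, 256, 2048]) -> cost grows only with channel width
```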

Result: EMMA-4B outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, and achieves competitive results compared to specialized multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image).

Conclusion: EMMA provides an efficient foundation for unified multimodal architectures that balances understanding and generation tasks while maintaining high performance with fewer parameters, representing a significant advancement in multimodal AI systems.

Abstract: We propose EMMA, an efficient and unified architecture for multimodal understanding, generation and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures the training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation instead of token-wise concatenation among visual understanding and generation tokens, which further reduces the visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvements across tasks while meeting the task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments have shown that EMMA-4B can significantly outperform state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while also achieving competitive results compared to recent multimodal understanding and generation experts (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.

[201] Equivariant Symmetry-Aware Head Pose Estimation for Fetal MRI

Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot

Main category: cs.CV

TL;DR: E(3)-Pose is a fast 6-DoF pose estimation method that explicitly models rotation equivariance and object symmetry for robust fetal head pose estimation in MRI scans, enabling automatic adaptive slice prescription.

DetailsMotivation: The paper addresses the challenging problem of fetal head motion during diagnostic MRI scans, aiming to enable automatic adaptive prescription of 2D diagnostic MRI slices. Existing methods struggle with clinical volumes due to pose ambiguities from anatomical symmetries, low resolution, noise, and artifacts.

Method: E(3)-Pose jointly and explicitly models rotation equivariance and object symmetry by construction. It captures anatomical symmetries and rigid pose equivariance to yield robust fetal head pose estimates from 3D MRI volumes acquired before each 2D slice.

Result: The method demonstrates superior robustness and generalization across domains on publicly available and representative clinical fetal MRI datasets. It achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation.

Conclusion: E(3)-Pose provides a robust solution for fetal head pose estimation in clinical MRI settings by explicitly addressing symmetry and equivariance challenges, enabling automatic adaptive slice prescription for improved diagnostic imaging.

Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, paving the way for clinical translation. Our implementation is available at github.com/ramyamut/E3-Pose.

[202] High-Throughput Unsupervised Profiling of the Morphology of 316L Powder Particles for Use in Additive Manufacturing

Emmanuel Akeweje, Conall Kirk, Chi-Wai Chan, Denis Dowling, Mimi Zhang

Main category: cs.CV

TL;DR: Automated ML framework uses high-throughput imaging and clustering to profile metallic powder morphology for SLM additive manufacturing, identifying Fourier-descriptor + k-means as most effective pipeline.

DetailsMotivation: Conventional powder characterization methods for Selective Laser Melting are low-throughput and qualitative, failing to capture heterogeneity in industrial-scale batches needed for quality control.

Method: Developed automated ML framework coupling high-throughput imaging with shape extraction and three clustering pipelines: autoencoder, shape-descriptor, and functional-data pipelines on 126,000 powder images.

Result: Fourier-descriptor + k-means pipeline identified as most effective with lowest Davies-Bouldin index, highest Calinski-Harabasz score, and sub-millisecond runtime per particle on standard desktop.

Conclusion: Unsupervised learning framework enables rapid automated assessment of powder morphology, supports tracking shape evolution across reuse cycles, and offers path toward real-time feedstock monitoring in SLM workflows.

Abstract: Selective Laser Melting (SLM) is a powder-bed additive manufacturing technique whose part quality depends critically on feedstock morphology. However, conventional powder characterization methods are low-throughput and qualitative, failing to capture the heterogeneity of industrial-scale batches. We present an automated, machine learning framework that couples high-throughput imaging with shape extraction and clustering to profile metallic powder morphology at scale. We develop and evaluate three clustering pipelines: an autoencoder pipeline, a shape-descriptor pipeline, and a functional-data pipeline. Across a dataset of approximately 126,000 powder images (0.5-102 micrometer diameter), internal validity metrics identify the Fourier-descriptor + k-means pipeline as the most effective, achieving the lowest Davies-Bouldin index and highest Calinski-Harabasz score while maintaining sub-millisecond runtime per particle on a standard desktop workstation. Although the present work focuses on establishing the morphological-clustering framework, the resulting shape groups form a basis for future studies examining their relationship to flowability, packing density, and SLM part quality. Overall, this unsupervised learning framework enables rapid, automated assessment of powder morphology and supports tracking of shape evolution across reuse cycles, offering a path toward real-time feedstock monitoring in SLM workflows.
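
As a rough illustration of the Fourier-descriptor + k-means pipeline identified as most effective, the sketch below computes a translation/scale-normalized descriptor from a particle's boundary and clusters the descriptors. It assumes OpenCV 4.x and binary particle masks as input; the synthetic masks and parameter choices are invented, not the authors' code.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def fourier_descriptor(mask: np.ndarray, n_coeffs: int = 16) -> np.ndarray:
    """Fourier-magnitude descriptor of the particle outline (drops DC, scale-normalized)."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze(1)   # (M, 2) boundary points
    z = boundary[:, 0] + 1j * boundary[:, 1]                   # complex boundary signal
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_coeffs + 1])                      # drop DC term (translation)
    return mags / (mags[0] + 1e-8)                             # normalize for scale

# Two synthetic "particles": a circle and an elongated ellipse.
circle = np.zeros((64, 64), np.uint8)
cv2.circle(circle, (32, 32), 20, 1, -1)
ellipse = np.zeros((64, 64), np.uint8)
cv2.ellipse(ellipse, (32, 32), (26, 10), 0, 0, 360, 1, -1)

X = np.stack([fourier_descriptor(m) for m in (circle, ellipse)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the two shapes fall into different morphology clusters
```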

[203] HuPrior3R: Incorporating Human Priors for Better 3D Dynamic Reconstruction from Monocular Videos

Weitao Xiong, Zhiyuan Yuan, Jiahao Lu, Chengfeng Zhao, Peng Li, Yuan Liu

Main category: cs.CV

TL;DR: HuPrior3R improves monocular dynamic human video reconstruction by incorporating hybrid geometric priors (SMPL models + monocular depth) to address geometric inconsistencies and resolution degradation issues.

DetailsMotivation: Existing methods for monocular dynamic video reconstruction in human scenes suffer from geometric inconsistencies (distorted limb proportions, unnatural human-object fusion) and resolution degradation due to memory-constrained downsampling, causing human boundary drift toward background geometry.

Method: Proposes HuPrior3R with hybrid geometric priors combining SMPL human body models with monocular depth estimation. Uses hierarchical pipeline with refinement components: processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. Integrates SMPL priors through Feature Fusion Module for geometrically plausible reconstruction while preserving fine-grained human boundaries.

Result: Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction compared to existing methods.

Conclusion: The proposed approach effectively addresses geometric inconsistency and resolution degradation in monocular dynamic human video reconstruction by leveraging structured human priors and hierarchical processing, achieving improved reconstruction quality.

Abstract: Monocular dynamic video reconstruction faces significant challenges in dynamic human scenes due to geometric inconsistencies and resolution degradation issues. Existing methods lack 3D human structural understanding, producing geometrically inconsistent results with distorted limb proportions and unnatural human-object fusion, while memory-constrained downsampling causes human boundary drift toward background geometry. To address these limitations, we propose to incorporate hybrid geometric priors that combine SMPL human body models with monocular depth estimation. Our approach leverages structured human priors to maintain surface consistency while capturing fine-grained geometric details in human regions. We introduce HuPrior3R, featuring a hierarchical pipeline with refinement components that processes full-resolution images for overall scene geometry, then applies strategic cropping and cross-attention fusion for human-specific detail enhancement. The method integrates SMPL priors through a Feature Fusion Module to ensure geometrically plausible reconstruction while preserving fine-grained human boundaries. Extensive experiments on TUM Dynamics and GTA-IM datasets demonstrate superior performance in dynamic human reconstruction.

[204] Evaluating and Preserving High-level Fidelity in Super-Resolution

Josep M. Rocafort, Shaolin Su, Alexandra Gomez-Villa, Javier Vazquez-Corral

Main category: cs.CV

TL;DR: The paper introduces a new fidelity metric for image super-resolution models to measure high-level semantic preservation, addressing hallucination issues where SR models change image content despite good visual quality.

DetailsMotivation: Current SR models can hallucinate and change image content while achieving high visual quality, but existing low-level image quality metrics don't capture these high-level semantic changes that humans can easily identify.

Method: 1) Created first annotated dataset with fidelity scores from different SR models; 2) Evaluated SOTA SR models on high-level fidelity preservation; 3) Analyzed correlation between existing metrics and fidelity; 4) Showed foundation models better address high-level fidelity; 5) Fine-tuned SR models using fidelity feedback.

Result: 1) Established importance of measuring high-level fidelity; 2) Found foundation models perform better at fidelity assessment; 3) Demonstrated that fine-tuning with fidelity feedback improves both semantic fidelity and perceptual quality.

Conclusion: High-level fidelity measurement is crucial for evaluating and optimizing SR models, and the proposed approach enables better model assessment and improvement through fidelity-aware fine-tuning.

Abstract: Recent image Super-Resolution (SR) models are achieving impressive effects in reconstructing details and delivering visually pleasant outputs. However, the overpowering generative ability can sometimes hallucinate and thus change the image content despite gaining high visual quality. This type of high-level change can be easily identified by humans yet not well-studied in existing low-level image quality metrics. In this paper, we establish the importance of measuring high-level fidelity for SR models as a complementary criterion to reveal the reliability of generative SR models. We construct the first annotated dataset with fidelity scores from different SR models, and evaluate how state-of-the-art (SOTA) SR models actually perform in preserving high-level fidelity. Based on the dataset, we then analyze how existing image quality metrics correlate with fidelity measurement, and further show that this high-level task can be better addressed by foundation models. Finally, by fine-tuning SR models based on our fidelity feedback, we show that both semantic fidelity and perceptual quality can be improved, demonstrating the potential value of our proposed criteria, both in model evaluation and optimization. We will release the dataset, code, and models upon acceptance.

[205] TrajMoE: Scene-Adaptive Trajectory Planning with Mixture of Experts and Reinforcement Learning

Zebin Xing, Pengxuan Yang, Linbo Wang, Yichen Zhang, Yiming Hu, Yupeng Zheng, Junli Wang, Yinfeng Gao, Guang Li, Kun Ma, Long Chen, Zhongpu Xia, Qichao Zhang, Hangjun Ye, Dongbin Zhao

Main category: cs.CV

TL;DR: The paper proposes a novel autonomous driving planning framework that combines Mixture of Experts (MoE) for scenario-specific trajectory priors with Reinforcement Learning for policy-driven trajectory refinement, achieving third place on the navsim ICCV benchmark.

DetailsMotivation: Current end-to-end autonomous driving systems have limitations: 1) they use fixed trajectory priors that don't adapt to different driving scenarios, and 2) their trajectory evaluation lacks policy-driven refinement due to one-stage supervised training constraints.

Method: 1) Employ Mixture of Experts (MoE) to apply different trajectory priors tailored to specific driving scenarios. 2) Use Reinforcement Learning to fine-tune the trajectory scoring mechanism for policy-driven refinement. 3) Integrate models with different perception backbones to enhance perceptual features.

Result: The integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place in the competition.

Conclusion: The proposed approach successfully addresses the limitations of previous methods by introducing scenario-adaptive trajectory priors via MoE and policy-driven refinement via RL, demonstrating improved planning performance in autonomous driving systems.

Abstract: Current autonomous driving systems often favor end-to-end frameworks, which take sensor inputs like images and learn to map them into trajectory space via neural networks. Previous work has demonstrated that models can achieve better planning performance when provided with a prior distribution of possible trajectories. However, these approaches often overlook two critical aspects: 1) The appropriate trajectory prior can vary significantly across different driving scenarios. 2) Their trajectory evaluation mechanism lacks policy-driven refinement, remaining constrained by the limitations of one-stage supervised training. To address these issues, we explore improvements in two key areas. For problem 1, we employ MoE to apply different trajectory priors tailored to different scenarios. For problem 2, we utilize Reinforcement Learning to fine-tune the trajectory scoring mechanism. Additionally, we integrate models with different perception backbones to enhance perceptual features. Our integrated model achieved a score of 51.08 on the navsim ICCV benchmark, securing third place.
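
The sketch below illustrates the scene-conditioned mixture-of-experts idea: a gating network weights several expert sets of anchor trajectories, and a scoring head ranks the resulting candidates. Module names, sizes, and the scoring head are assumptions for illustration; the RL fine-tuning of the scorer is not shown.

```python
import torch
import torch.nn as nn

class TrajPriorMoE(nn.Module):
    def __init__(self, feat_dim=256, n_experts=4, n_anchors=64, horizon=8):
        super().__init__()
        # Each expert holds a learnable set of anchor trajectories (n_anchors, horizon, 2).
        self.experts = nn.Parameter(torch.randn(n_experts, n_anchors, horizon, 2))
        self.gate = nn.Linear(feat_dim, n_experts)        # scene feature -> expert weights
        self.scorer = nn.Sequential(                      # scores each anchor given the scene
            nn.Linear(feat_dim + horizon * 2, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, scene_feat):                        # scene_feat: (B, feat_dim)
        w = torch.softmax(self.gate(scene_feat), dim=-1)              # (B, K)
        anchors = torch.einsum("bk,kahd->bahd", w, self.experts)      # (B, A, H, 2)
        flat = anchors.flatten(2)                                     # (B, A, H*2)
        ctx = scene_feat.unsqueeze(1).expand(-1, flat.size(1), -1)    # (B, A, feat_dim)
        scores = self.scorer(torch.cat([ctx, flat], dim=-1)).squeeze(-1)  # (B, A)
        return anchors, scores

model = TrajPriorMoE()
feat = torch.randn(2, 256)
anchors, scores = model(feat)
best = anchors[torch.arange(2), scores.argmax(dim=1)]      # pick the highest-scoring anchor
print(best.shape)                                          # torch.Size([2, 8, 2])
```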

[206] COREA: Coarse-to-Fine 3D Representation Alignment Between Relightable 3D Gaussians and SDF via Bidirectional 3D-to-3D Supervision

Jaeyoon Lee, Hojoon Jung, Sungtae Hwang, Jihyong Oh, Jongwon Choi

Main category: cs.CV

TL;DR: COREA is a unified framework that jointly learns relightable 3D Gaussians and SDF for accurate geometry reconstruction and faithful relighting, addressing limitations of existing 3DGS methods through 3D-to-3D alignment.

DetailsMotivation: Existing 3D Gaussian Splatting methods for mesh reconstruction and physically-based rendering suffer from coarse surfaces and unreliable BRDF-lighting decomposition because their geometry is learned from 2D renderings rather than directly in 3D space.

Method: COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy where depth provides coarse alignment between 3D Gaussians and SDF, while depth gradients and normals refine fine-scale structure. It also includes a density-control mechanism to stabilize Gaussian growth and balance geometric fidelity with memory efficiency.

Result: Experiments on standard benchmarks show COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and physically-based rendering within a unified framework.

Conclusion: COREA successfully addresses the limitations of existing 3DGS methods by enabling direct 3D geometric learning, resulting in more accurate geometry reconstruction and stable BRDF-lighting decomposition for faithful relighting.

Abstract: We present COREA, the first unified framework that jointly learns relightable 3D Gaussians and a Signed Distance Field (SDF) for accurate geometry reconstruction and faithful relighting. While recent 3D Gaussian Splatting (3DGS) methods have extended toward mesh reconstruction and physically-based rendering (PBR), their geometry is still learned from 2D renderings, leading to coarse surfaces and unreliable BRDF-lighting decomposition. To address these limitations, COREA introduces a coarse-to-fine bidirectional 3D-to-3D alignment strategy that allows geometric signals to be learned directly in 3D space. Within this strategy, depth provides coarse alignment between the two representations, while depth gradients and normals refine fine-scale structure, and the resulting geometry supports stable BRDF-lighting decomposition. A density-control mechanism further stabilizes Gaussian growth, balancing geometric fidelity with memory efficiency. Experiments on standard benchmarks demonstrate that COREA achieves superior performance in novel-view synthesis, mesh reconstruction, and PBR within a unified framework.

[207] VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation

Md Selim Sarowar, Sungho Kim

Main category: cs.CV

TL;DR: Comparison of CLIP vs DINOv2 for 3D pose estimation in hand-object grasping, showing CLIP excels in semantic understanding while DINOv2 provides better geometric features.

DetailsMotivation: Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision with rich semantic and geometric representations, but their comparative strengths for 3D pose estimation in robotic grasping scenarios need systematic evaluation.

Method: Comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios, evaluating both models on 6D object pose estimation tasks using benchmark datasets.

Result: CLIP-based methods achieve better semantic consistency through language grounding, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision and superior dense geometric features.

Conclusion: The analysis provides insights for selecting appropriate vision models for robotic manipulation and grasping applications, highlighting the complementary strengths of CLIP (semantic understanding) and DINOv2 (geometric features).

Abstract: Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.

[208] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

Dahyeon Kye, Jeahun Sung, Mingyu Jeon, Jihyong Oh

Main category: cs.CV

TL;DR: CHIMERA is a zero-shot diffusion-based framework for smooth image morphing using adaptive cache injection and semantic anchor prompting to handle large semantic disparities.

DetailsMotivation: Existing diffusion-based image morphing methods often produce abrupt transitions or over-saturated appearances due to lack of adaptive structural and semantic alignments between dissimilar images.

Method: Formulates morphing as cached inversion-guided denoising with two key components: Adaptive Cache Injection (ACI) that caches and adaptively re-injects features from both inputs during denoising, and Semantic Anchor Prompting (SAP) that generates a shared anchor prompt using vision-language models to bridge semantic gaps.

Result: Extensive experiments and user studies show CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing new state-of-the-art in image morphing.

Conclusion: CHIMERA successfully addresses the challenge of smooth image morphing with large semantic disparities through adaptive feature injection and semantic bridging, with the proposed GLCS metric providing better evaluation of morphing quality.

Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in a depth- and time-adaptive manner, with natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
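
A simplified sketch of the cache-and-reinject idea follows: store block features from the two inputs during inversion, then blend them back into the denoising features with a timestep-dependent weight. The blending schedule, hook interface, and shapes are invented for illustration; CHIMERA's actual ACI module differs in detail.

```python
import torch

cache_a, cache_b = {}, {}                       # {(block_name, timestep): feature tensor}

def store(cache, block, t, feat):
    cache[(block, t)] = feat.detach()

def inject(block, t, feat, alpha, w_a=0.5):
    """Blend the current denoising feature with cached features from both inputs."""
    fa, fb = cache_a[(block, t)], cache_b[(block, t)]
    return (1 - alpha) * feat + alpha * (w_a * fa + (1 - w_a) * fb)

# Toy usage: pretend these came from DDIM inversion of image A and image B.
feat_shape = (1, 320, 32, 32)
for t in (801, 601, 401):
    store(cache_a, "up_block_1", t, torch.randn(feat_shape))
    store(cache_b, "up_block_1", t, torch.randn(feat_shape))

# During denoising, inject more strongly at early (noisier) timesteps.
t, feat = 801, torch.randn(feat_shape)
alpha = t / 1000.0
print(inject("up_block_1", t, feat, alpha).shape)
```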

[209] MeshRipple: Structured Autoregressive Generation of Artist-Meshes

Junkai Lin, Hang Long, Huipeng Guo, Jielei Zhang, JiaYi Yang, Tianle Guo, Yang Yang, Jianwen Li, Wenxiao Zhang, Matthias Nießner, Wei Yang

Main category: cs.CV

TL;DR: MeshRipple: A novel mesh generation method that expands meshes outward from a frontier using BFS tokenization and sparse-attention memory to maintain topological completeness and avoid fragmentation.

DetailsMotivation: Current autoregressive mesh generators serialize faces into sequences and use truncated segments with sliding-window inference due to memory constraints. This approach breaks long-range geometric dependencies, resulting in holes and fragmented components in generated meshes.

Method: MeshRipple introduces three key innovations: 1) Frontier-aware BFS tokenization that aligns generation order with surface topology, 2) Expansive prediction strategy for coherent, connected surface growth, and 3) Sparse-attention global memory providing effectively unbounded receptive field to resolve long-range topological dependencies.

Result: MeshRipple outperforms strong recent baselines by generating meshes with high surface fidelity and topological completeness, addressing the critical limitation of fragmentation in current autoregressive approaches.

Conclusion: The integrated design of MeshRipple enables effective mesh generation that maintains long-range geometric dependencies and produces complete, connected surfaces without holes or fragmentation, representing a significant advancement in 3D mesh generation.

Abstract: Meshes serve as a primary representation for 3D assets. Autoregressive mesh generators serialize faces into sequences and train on truncated segments with sliding-window inference to cope with memory limits. However, this mismatch breaks long-range geometric dependencies, producing holes and fragmented components. To address this critical limitation, we introduce MeshRipple, which expands a mesh outward from an active generation frontier, akin to a ripple on a surface. MeshRipple rests on three key innovations: a frontier-aware BFS tokenization that aligns the generation order with surface topology; an expansive prediction strategy that maintains coherent, connected surface growth; and a sparse-attention global memory that provides an effectively unbounded receptive field to resolve long-range topological dependencies. This integrated design enables MeshRipple to generate meshes with high surface fidelity and topological completeness, outperforming strong recent baselines.
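
For intuition on frontier-style face ordering, here is a plain breadth-first search over faces connected by shared edges, so each newly emitted face touches the current frontier, much like a ripple spreading over the surface. This is a generic BFS sketch, not MeshRipple's actual tokenizer.

```python
from collections import defaultdict, deque

def frontier_order(faces):
    """faces: list of (v0, v1, v2) vertex-index triples. Returns a face ordering."""
    edge_to_faces = defaultdict(list)
    for fi, (a, b, c) in enumerate(faces):
        for e in ((a, b), (b, c), (c, a)):
            edge_to_faces[tuple(sorted(e))].append(fi)

    visited, order = set(), []
    queue = deque([0])                       # seed with an arbitrary starting face
    while queue:
        fi = queue.popleft()
        if fi in visited:
            continue
        visited.add(fi)
        order.append(fi)
        a, b, c = faces[fi]
        for e in ((a, b), (b, c), (c, a)):   # expand across shared edges (the "ripple")
            for nb in edge_to_faces[tuple(sorted(e))]:
                if nb not in visited:
                    queue.append(nb)
    return order

# Two triangles sharing an edge, plus a third attached to the second.
faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
print(frontier_order(faces))   # [0, 1, 2]
```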

[210] From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen

Main category: cs.CV

TL;DR: Proposes a method for city-scale 3D reconstruction from sparse satellite images to synthesize ground-level novel views, using 2.5D height maps and texture restoration to overcome extreme viewpoint gaps.

DetailsMotivation: City-scale 3D reconstruction from satellite imagery faces extreme viewpoint extrapolation challenges (nearly 90° gaps), where current methods like NeRF and 3DGS fail due to sparse orbital images with minimal parallax, foreshortened facades, and flawed textures.

Method: Two key design choices: 1) Model city geometry as 2.5D height map using Z-monotonic signed distance field (SDF) for stable optimization and watertight meshes; 2) Paint mesh appearance via differentiable rendering and train generative texture restoration network to enhance degraded satellite inputs.

Result: Demonstrates scalability and robustness on large-scale urban reconstruction, reconstructing 4km² real-world regions from few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views.

Conclusion: The method produces visually compelling, high-fidelity 3D models suitable for downstream applications like urban planning and simulation, overcoming extreme viewpoint gaps in satellite-to-ground reconstruction.

Abstract: City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method’s scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation. Project page can be found at https://pku-vcl-geometry.github.io/Orbit2Ground/.
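
A hedged sketch of the 2.5D idea: represent the city as a height map h(x, y) and a field s(x, y, z) = z - h(x, y) that is monotone in z, so its zero level set is the roof/ground surface with vertically extruded walls. This toy version ignores lateral distances (it is not a true SDF) and uses invented sizes, but it shows the Z-monotonic construction.

```python
import numpy as np

def z_monotonic_field(height_map, z_levels):
    """height_map: (H, W) heights; z_levels: (Z,) sample heights.
    Returns a (Z, H, W) signed field, negative below the surface, positive above."""
    return z_levels[:, None, None] - height_map[None, :, :]

h = np.zeros((64, 64))
h[20:40, 20:40] = 10.0                        # one 10 m-tall "building" block
z = np.linspace(0.0, 15.0, 16)
field = z_monotonic_field(h, z)
print(field.shape, int((field[5] < 0).sum())) # (16, 64, 64), interior points at z = 5 m
```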

[211] ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Fan Yang, Heyuan Li, Peihao Li, Weihao Yuan, Lingteng Qiu, Chaoyue Song, Cheng Chen, Yisheng He, Shifeng Zhang, Xiaoguang Han, Steven Hoi, Guosheng Lin

Main category: cs.CV

TL;DR: A novel framework combining 3D reconstruction models with video diffusion models to generate high-fidelity upper-body 3D avatars from single images, achieving photorealistic details and fluid motion while maintaining structural stability.

DetailsMotivation: Current 3D avatar generation methods produce blurry textures and stiff motion, while video models suffer from structural errors and identity drift. There's a need to combine the strengths of both approaches for high-quality, stable avatars.

Method: Uses a 3D reconstruction model to provide structural and appearance priors, which guides a real-time autoregressive video diffusion model for rendering. This combines geometric stability with generative capabilities.

Result: Significantly reduces artifacts, achieves substantial improvements in visual quality over leading methods, synthesizes high-frequency photorealistic details and fluid dynamics in real time while preventing structural inconsistencies.

Conclusion: The approach successfully unites 3D reconstruction stability with video generation quality, producing high-fidelity digital avatars with realistic appearance and dynamic motion, suitable for real-time applications like gaming and VR.

Abstract: Generating high-fidelity upper-body 3D avatars from one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guides a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa

cs.AI

[212] Impact of Data-Oriented and Object-Oriented Design on Performance and Cache Utilization with Artificial Intelligence Algorithms in Multi-Threaded CPUs

Gabriel M. Arantes, Richard F. Pinto, Bruno L. Dalmazo, Eduardo N. Borges, Giancarlo Lucca, Viviane L. D. de Mattos, Fabian C. Cardoso, Rafael A. Berri

Main category: cs.AI

TL;DR: DOD outperforms OOD in multi-threaded A* search with better cache efficiency and execution times, though single-threaded versions beat multi-threaded ones due to thread overhead.

DetailsMotivation: The performance gap between multi-core CPUs and main memory requires hardware-aware software design paradigms, motivating comparison of Data Oriented Design vs Object-Oriented Design for cache utilization in multi-threaded environments.

Method: Developed and compared four versions of A* search algorithm: single-threaded OOD, single-threaded DOD, multi-threaded OOD, and multi-threaded DOD. Evaluated using execution time, memory usage, and CPU cache misses.

Result: DOD showed significant performance gains in multi-threaded tests with faster execution times and fewer system calls/cache misses. Single-threaded versions outperformed multi-threaded counterparts in both paradigms due to thread management overhead for fine-grained A* tasks.

Conclusion: DOD’s consistent advantages in critical metrics demonstrate its architectural superiority for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks, despite subtle differences in simple algorithms.

Abstract: The growing performance gap between multi-core CPUs and main memory necessitates hardware-aware software design paradigms. This study provides a comprehensive performance analysis of Data Oriented Design (DOD) versus the traditional Object-Oriented Design (OOD), focusing on cache utilization and efficiency in multi-threaded environments. We developed and compared four distinct versions of the A* search algorithm: single-threaded OOD (ST-OOD), single-threaded DOD (ST-DOD), multi-threaded OOD (MT-OOD), and multi-threaded DOD (MT-DOD). The evaluation was based on metrics including execution time, memory usage, and CPU cache misses. In multi-threaded tests, the DOD implementation demonstrated considerable performance gains, with faster execution times and a lower number of raw system calls and cache misses. While OOD occasionally showed marginal advantages in memory usage or percentage-based cache miss rates, DOD’s efficiency in data-intensive operations was more evident. Furthermore, our findings reveal that for a fine-grained task like the A* algorithm, the overhead associated with thread management led to single-threaded versions significantly outperforming their multi-threaded counterparts in both paradigms. We conclude that even when performance differences appear subtle in simple algorithms, the consistent advantages of DOD in critical metrics highlight its foundational architectural superiority, suggesting it is a more effective approach for maximizing hardware efficiency in complex, large-scale AI and parallel computing tasks.
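
The layout difference under study can be illustrated even in Python: an object-oriented "array of structs" scatters node fields across the heap, while a data-oriented "struct of arrays" keeps each field contiguous. The timing below only illustrates the locality/vectorization idea with NumPy and invented node fields; it is not a substitute for the paper's C++-level cache-miss measurements.

```python
import time
import numpy as np

N = 500_000

# OOD-style: each A*-like node is an object; fields live on scattered heap objects.
class Node:
    __slots__ = ("g", "h")
    def __init__(self, g, h):
        self.g, self.h = g, h

nodes = [Node(float(i), float(N - i)) for i in range(N)]

# DOD-style: one contiguous array per field.
g = np.arange(N, dtype=np.float64)
h = np.arange(N, dtype=np.float64)[::-1].copy()

t0 = time.perf_counter()
f_ood = [n.g + n.h for n in nodes]            # touches N scattered objects
t1 = time.perf_counter()
f_dod = g + h                                 # one streaming pass over packed data
t2 = time.perf_counter()
print(f"OOD pass: {t1 - t0:.3f}s, DOD pass: {t2 - t1:.3f}s")
```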

[213] Can AI autonomously build, operate, and use the entire data stack?

Arvind Agarwal, Lisa Amini, Sameep Mehta, Horst Samulowitz, Kavitha Srinivas

Main category: cs.AI

TL;DR: The paper argues for a paradigm shift from AI-assisted data management to fully autonomous data estates using intelligent agents across the entire data lifecycle.

DetailsMotivation: Current AI assistants only help specific personas with limited tasks, but there's an opportunity for full automation as AI capabilities advance. Enterprise data management is complex and spans multiple domains, creating a need for holistic autonomous systems.

Method: Proposes using intelligent agents to autonomously manage each stage of the modern data stack, creating self-sufficient systems that serve both human users and AI systems.

Result: The paper explores how agents can streamline the data lifecycle, identifies mounting forces demanding this shift, and highlights open research questions for achieving full autonomy.

Conclusion: The work aims to inspire debate, stimulate research, motivate collaboration, and facilitate a more autonomous future for data systems by shifting from component-level AI assistance to holistic autonomous management.

Abstract: Enterprise data management is a monumental task. It spans data architecture and systems, integration, quality, governance, and continuous improvement. While AI assistants can help specific persona, such as data engineers and stewards, to navigate and configure the data stack, they fall far short of full automation. However, as AI becomes increasingly capable of tackling tasks that have previously resisted automation due to inherent complexities, we believe there is an imminent opportunity to target fully autonomous data estates. Currently, AI is used in different parts of the data stack, but in this paper, we argue for a paradigm shift from the use of AI in independent data component operations towards a more holistic and autonomous handling of the entire data lifecycle. Towards that end, we explore how each stage of the modern data stack can be autonomously managed by intelligent agents to build self-sufficient systems that can be used not only by human end-users, but also by AI itself. We begin by describing the mounting forces and opportunities that demand this paradigm shift, examine how agents can streamline the data lifecycle, and highlight open questions and areas where additional research is needed. We hope this work will inspire lively debate, stimulate further research, motivate collaborative approaches, and facilitate a more autonomous future for data systems.

[214] SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu

Main category: cs.AI

TL;DR: SkipKV is a training-free KV cache compression method that uses sentence-level eviction and generation steering to reduce KV cache overhead in large reasoning models while maintaining accuracy.

DetailsMotivation: Large reasoning models suffer from significant KV cache overhead that grows linearly with chain-of-thought reasoning, causing memory and throughput bottlenecks. Existing KV cache eviction methods fail in multi-batch settings due to unstable token-wise scoring and padding issues, and often generate longer sequences.

Method: SkipKV operates at sentence-level rather than token-level: 1) Uses a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence, 2) Dynamically adjusts a steering vector to update hidden activation states during inference to suppress redundant generation and enforce concise responses.

Result: SkipKV maintains up to 26.7% improved accuracy compared to alternatives at similar compression budgets, generates up to 1.6× fewer tokens than state-of-the-art methods, and improves throughput up to 1.7× on multiple reasoning benchmarks.

Conclusion: SkipKV provides an effective training-free solution for KV cache compression in large reasoning models by addressing limitations of token-level eviction methods through sentence-level operations and generation steering, enabling efficient deployment with maintained accuracy.

Abstract: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, which grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates via coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV in maintaining up to 26.7% improved accuracy compared to the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generations while improving throughput by up to 1.7×.
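
To ground the sentence-level idea, here is a hedged sketch of redundancy scoring: embed each generated sentence, flag sentences highly similar to an earlier one, and mark their KV spans for eviction. TF-IDF cosine similarity is only a stand-in for whatever hidden-state-based metric SkipKV actually uses; the threshold and example CoT are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def redundant_sentence_ids(sentences, threshold=0.6):
    """Return indices of sentences that closely repeat an earlier sentence."""
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)
    evict = []
    for i in range(1, len(sentences)):
        if sim[i, :i].max() >= threshold:     # near-duplicate of an earlier sentence
            evict.append(i)
    return evict

cot = [
    "First, compute the total cost of the items.",
    "The total cost is 3 * 4 = 12.",
    "Let me recompute the total cost of the items.",   # repeated re-validation
    "Therefore the answer is 12.",
]
print(redundant_sentence_ids(cot))   # indices of sentences flagged for eviction
```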

[215] Toward an AI Reasoning-Enabled System for Patient-Clinical Trial Matching

Caroline N. Leach, Mitchell A. Klusty, Samuel E. Armstrong, Justine C. Pickarski, Kristen L. Hankins, Emily B. Collier, Maya Shah, Aaron D. Mullen, V. K. Cody Bumgardner

Main category: cs.AI

TL;DR: AI-augmented patient-trial matching system using LLMs to automate clinical trial eligibility screening with interpretable reasoning and human review.

DetailsMotivation: Manual clinical trial eligibility screening is time-consuming, resource-intensive, and inefficient. Current processes lack scalability and comprehensive data integration from heterogeneous EHR systems.

Method: Secure, scalable system using open-source reasoning-enabled LLMs that generate structured eligibility assessments with interpretable reasoning chains. Supports human-in-the-loop review and treats eligibility as dynamic state rather than binary classification.

Result: Proof-of-concept system that integrates heterogeneous EHR data, facilitates expert review, maintains security standards, and provides comprehensive auditability of AI-generated outputs.

Conclusion: AI-augmented patient-trial matching can reduce coordinator burden, intelligently broaden trial consideration, and provide actionable recommendations for future eligibility, representing a significant advancement in clinical trial recruitment efficiency.

Abstract: Screening patients for clinical trial eligibility remains a manual, time-consuming, and resource-intensive process. We present a secure, scalable proof-of-concept system for Artificial Intelligence (AI)-augmented patient-trial matching that addresses key implementation challenges: integrating heterogeneous electronic health record (EHR) data, facilitating expert review, and maintaining rigorous security standards. Leveraging open-source, reasoning-enabled large language models (LLMs), the system moves beyond binary classification to generate structured eligibility assessments with interpretable reasoning chains that support human-in-the-loop review. This decision support tool represents eligibility as a dynamic state rather than a fixed determination, identifying matches when available and offering actionable recommendations that could render a patient eligible in the future. The system aims to reduce coordinator burden, intelligently broaden the set of trials considered for each patient and guarantee comprehensive auditability of all AI-generated outputs.

[216] Large Language Models for Education and Research: An Empirical and User Survey-based Analysis

Md Mostafizer Rahman, Ariful Islam Shiplu, Md Faizul Ibne Amin, Yutaka Watanobe, Lu Peng

Main category: cs.AI

TL;DR: Comprehensive evaluation of ChatGPT and DeepSeek LLMs in education/research contexts, comparing their performance in text generation, programming, and specialized problem-solving with empirical experiments and user surveys.

DetailsMotivation: To evaluate and compare the performance of state-of-the-art LLMs (ChatGPT and DeepSeek) in educational and research applications, understanding their trade-offs in accuracy, computational efficiency, and user experience.

Method: Three-pronged approach: 1) Background technology analysis, 2) Empirical experiments benchmarking performance in text generation, programming, and specialized problem-solving, 3) Real-world user survey of students, educators, and researchers.

Result: ChatGPT excels in general language understanding and text generation, while DeepSeek demonstrates superior performance in programming tasks due to its efficiency-focused design. Both models deliver medically accurate diagnostic outputs and effectively solve complex mathematical problems.

Conclusion: Both ChatGPT and DeepSeek offer valuable capabilities for education and research, with complementary strengths: ChatGPT for general language tasks and DeepSeek for programming efficiency. User surveys provide practical insights into their real-world benefits and limitations for advancing education and research.

Abstract: Pretrained Large Language Models (LLMs) have achieved remarkable success across diverse domains, with education and research emerging as particularly impactful areas. Among current state-of-the-art LLMs, ChatGPT and DeepSeek exhibit strong capabilities in mathematics, science, medicine, literature, and programming. In this study, we present a comprehensive evaluation of these two LLMs through background technology analysis, empirical experiments, and a real-world user survey. The evaluation explores trade-offs among model accuracy, computational efficiency, and user experience in educational and research settings. We benchmarked these LLMs' performance in text generation, programming, and specialized problem-solving. Experimental results show that ChatGPT excels in general language understanding and text generation, while DeepSeek demonstrates superior performance in programming tasks due to its efficiency-focused design. Moreover, both models deliver medically accurate diagnostic outputs and effectively solve complex mathematical problems. Complementing these quantitative findings, a survey of students, educators, and researchers highlights the practical benefits and limitations of these models, offering deeper insights into their role in advancing education and research.

[217] The High Cost of Incivility: Quantifying Interaction Inefficiency via Multi-Agent Monte Carlo Simulations

Benedikt Mangold

Main category: cs.AI

TL;DR: LLM-based multi-agent systems simulate workplace toxicity through adversarial debates, showing toxic participants increase conversation duration by 25%, providing an ethical alternative to human-subject research for measuring social friction’s impact on efficiency.

DetailsMotivation: Workplace toxicity harms organizational culture but is difficult to study directly due to ethical constraints of reproducing conflict in human subjects. There's a need for methods to quantify how toxicity impacts operational efficiency without ethical concerns.

Method: Uses LLM-based Multi-Agent Systems to simulate 1-on-1 adversarial debates as a “sociological sandbox.” Employs Monte Carlo method to simulate hundreds of discussions, comparing baseline control groups with treatment groups where agents have “toxic” system prompts. Measures convergence time (number of arguments to reach conclusion).

Result: Toxic participants increase conversation duration by approximately 25% compared to baseline. This “latency of toxicity” serves as a proxy for financial damage in organizations. The method provides statistically significant results without ethical concerns.

Conclusion: Agent-based modeling with LLMs offers a reproducible, ethical alternative to human-subject research for studying social friction mechanics. The demonstrated “latency of toxicity” provides a quantifiable measure of how workplace conflict impacts operational efficiency.

Abstract: Workplace toxicity is widely recognized as detrimental to organizational culture, yet quantifying its direct impact on operational efficiency remains methodologically challenging due to the ethical and practical difficulties of reproducing conflict in human subjects. This study leverages Large Language Model (LLM) based Multi-Agent Systems to simulate 1-on-1 adversarial debates, creating a controlled “sociological sandbox”. We employ a Monte Carlo method to simulate hundreds of discussions, measuring the convergence time (defined as the number of arguments required to reach a conclusion) between a baseline control group and treatment groups involving agents with “toxic” system prompts. Our results demonstrate a statistically significant increase of approximately 25% in the duration of conversations involving toxic participants. We propose that this “latency of toxicity” serves as a proxy for financial damage in corporate and academic settings. Furthermore, we demonstrate that agent-based modeling provides a reproducible, ethical alternative to human-subject research for measuring the mechanics of social friction.
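
A toy Monte Carlo harness in the spirit of this setup is sketched below: run many simulated debates and compare mean convergence time (number of arguments) between a baseline and a "toxic" condition. A geometric agreement model stands in for the actual LLM agents, and the per-turn agreement probabilities are made up for illustration only.

```python
import random

def debate_length(p_agree, max_turns=100, rng=random):
    """Number of argument turns until the pair reaches a conclusion."""
    turns = 0
    while turns < max_turns:
        turns += 1
        if rng.random() < p_agree:       # the debate converges this turn
            return turns
    return max_turns

def mean_length(p_agree, n_runs=10_000, seed=0):
    rng = random.Random(seed)
    return sum(debate_length(p_agree, rng=rng) for _ in range(n_runs)) / n_runs

baseline = mean_length(p_agree=0.25)     # control agents
toxic = mean_length(p_agree=0.20)        # toxicity lowers per-turn agreement odds
print(f"baseline {baseline:.2f} turns, toxic {toxic:.2f} turns, "
      f"overhead {100 * (toxic / baseline - 1):.0f}%")
```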

[218] Scalable Back-End for an AI-Based Diabetes Prediction Application

Henry Anand Septian Radityo, Bernardus Willson, Reynard Tanadi, Latifa Dwiyanti, Saiful Akbar

Main category: cs.AI

TL;DR: Developed scalable backend for diabetes prediction app using horizontal scaling, database sharding, and RabbitMQ message queue, achieving 83% feature performance targets and handling 10k concurrent users.

DetailsMotivation: Early diabetes detection is crucial to prevent complications, requiring AI prediction apps with responsive, scalable backend architecture to serve large user bases effectively.

Method: Built scalable backend system using horizontal scaling, database sharding, and asynchronous communication via RabbitMQ message queue to handle computational load and prevent data loss.

Result: 83% of features (20/24) met performance targets (<5% failure rate, <1000ms latency), handled 10k concurrent users, with RabbitMQ reducing error rates for intensive prediction requests.

Conclusion: The scalable backend architecture successfully supports diabetes prediction app with reliable performance, demonstrating effective handling of large user loads through distributed design and asynchronous processing.

Abstract: The rising global prevalence of diabetes necessitates early detection to prevent severe complications. While AI-powered prediction applications offer a promising solution, they require a responsive and scalable back-end architecture to serve a large user base effectively. This paper details the development and evaluation of a scalable back-end system designed for a mobile diabetes prediction application. The primary objective was to maintain a failure rate below 5% and an average latency of under 1000 ms. The architecture leverages horizontal scaling, database sharding, and asynchronous communication via a message queue. Performance evaluation showed that 83% of the system’s features (20 out of 24) met the specified performance targets. Key functionalities such as user profile management, activity tracking, and read-intensive prediction operations successfully achieved the desired performance. The system demonstrated the ability to handle up to 10,000 concurrent users without issues, validating its scalability. The implementation of asynchronous communication using RabbitMQ proved crucial in minimizing the error rate for computationally intensive prediction requests, ensuring system reliability by queuing requests and preventing data loss under heavy load.
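
A minimal sketch of the asynchronous-prediction idea using RabbitMQ via pika: the API enqueues a prediction request instead of computing it inline, so heavy load is buffered in a durable queue rather than dropped. The queue name, payload shape, and connection details are assumptions, not the paper's actual backend code.

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="diabetes_predictions", durable=True)  # survive broker restarts

def enqueue_prediction(user_id: str, features: dict) -> None:
    """Publish a prediction request; a worker consumes it and runs the model later."""
    channel.basic_publish(
        exchange="",
        routing_key="diabetes_predictions",
        body=json.dumps({"user_id": user_id, "features": features}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

enqueue_prediction("user-42", {"glucose": 131, "bmi": 27.4, "age": 55})
connection.close()
```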

[219] Empowerment Gain and Causal Model Construction: Children and adults are sensitive to controllability and variability in their causal interventions

Eunice Yiu, Kelsey Allen, Shiry Ginosar, Alison Gopnik

Main category: cs.AI

TL;DR: The paper proposes empowerment as a bridge between Bayesian causal learning and reinforcement learning, suggesting it explains human causal learning and could enable machines to learn causal models.

DetailsMotivation: Large pretrained models struggle with causal learning, while humans excel at it. The paper aims to bridge the gap between classical Bayesian causal learning (Causal Bayes Nets) and reinforcement learning through the concept of empowerment.

Method: The paper connects empowerment (mutual information between actions and outcomes) to causal learning, proposing that accurate causal models increase empowerment and vice versa. It includes an empirical study testing how children and adults use empowerment cues to infer causal relations and design interventions.

Result: The paper presents theoretical connections between empowerment and causal learning, and includes empirical findings about how humans use empowerment cues for causal inference and intervention design (though specific results aren’t detailed in the abstract).

Conclusion: Empowerment serves as a bridge between Bayesian causal learning and reinforcement learning, potentially explaining human causal learning capabilities and providing a tractable computational approach for enabling causal learning in machines.

Abstract: Learning about the causal structure of the world is a fundamental problem for human cognition. Causal models and especially causal learning have proved to be difficult for large pretrained models using standard techniques of deep learning. In contrast, cognitive scientists have applied advances in our formal understanding of causation in computer science, particularly within the Causal Bayes Net formalism, to understand human causal learning. In the very different tradition of reinforcement learning, researchers have described an intrinsic reward signal called “empowerment” which maximizes mutual information between actions and their outcomes. “Empowerment” may be an important bridge between classical Bayesian causal learning and reinforcement learning and may help to characterize causal learning in humans and enable it in machines. If an agent learns an accurate causal world model, they will necessarily increase their empowerment, and increasing empowerment will lead to a more accurate causal world model. Empowerment may also explain distinctive features of children's causal learning, as well as providing a more tractable computational account of how that learning is possible. In an empirical study, we systematically test how children and adults use cues to empowerment to infer causal relations and design effective causal interventions.
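
The quantity behind "empowerment" is the mutual information I(A; O) between an agent's actions and the outcomes they produce. The sketch below estimates it from simple counts over discrete action/outcome pairs; the toy data are invented, and a fully controllable lever yields higher mutual information than one whose outcomes ignore the action.

```python
import numpy as np

def mutual_information(pairs):
    """pairs: list of (action, outcome) samples. Returns I(A;O) in bits."""
    actions = sorted({a for a, _ in pairs})
    outcomes = sorted({o for _, o in pairs})
    joint = np.zeros((len(actions), len(outcomes)))
    for a, o in pairs:
        joint[actions.index(a), outcomes.index(o)] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)         # marginal over actions
    po = joint.sum(axis=0, keepdims=True)         # marginal over outcomes
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ po)[nz])).sum())

# Controllable lever: each action deterministically yields a distinct outcome.
controllable = [("press", "light_on"), ("release", "light_off")] * 50
# Uncontrollable lever: outcomes are unrelated to the action.
random_pairs = [("press", "light_on"), ("press", "light_off"),
                ("release", "light_on"), ("release", "light_off")] * 25
print(mutual_information(controllable), mutual_information(random_pairs))  # ~1.0 vs ~0.0
```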

[220] Multi-Agent Intelligence for Multidisciplinary Decision-Making in Gastrointestinal Oncology

Rongzhao Zhang, Junqiao Wang, Shuyun Yang, Mouxiao Bian, Chao Ding, Yuwei Bai, Chihao Zhang, Yuguang Shen, Lei Wang, Lei Zheng, Qiujuan Yan, Yun Zhong, Meiling Liu, Jiwei Yu, Zheng Wang, Jie Xu, Meng Luo

Main category: cs.AI

TL;DR: A hierarchical multi-agent framework mimicking human MDT collaboration outperforms monolithic MLLMs in GI oncology clinical reasoning, achieving 4.60/5.00 expert score with improved logic and accuracy.

DetailsMotivation: Multimodal clinical reasoning in GI oncology requires integrating endoscopic, radiological, and biochemical data, but current MLLMs suffer from context dilution and hallucination with complex medical histories.

Method: Proposed hierarchical Multi-Agent Framework that emulates collaborative workflow of human Multidisciplinary Teams (MDT) to address MLLM limitations.

Result: Achieved composite expert evaluation score of 4.60/5.00, substantially improving over monolithic baseline, with most significant enhancements in reasoning logic and medical accuracy.

Conclusion: Agent-based collaboration provides scalable, interpretable, and clinically robust paradigm for automated decision support in oncology, mimicking human MDT workflows.

Abstract: Multimodal clinical reasoning in the field of gastrointestinal (GI) oncology necessitates the integrated interpretation of endoscopic imagery, radiological data, and biochemical markers. Despite the evident potential exhibited by Multimodal Large Language Models (MLLMs), they frequently encounter challenges such as context dilution and hallucination when confronted with intricate, heterogeneous medical histories. In order to address these limitations, a hierarchical Multi-Agent Framework is proposed, which emulates the collaborative workflow of a human Multidisciplinary Team (MDT). The system attained a composite expert evaluation score of 4.60/5.00, thereby demonstrating a substantial improvement over the monolithic baseline. It is noteworthy that the agent-based architecture yielded the most substantial enhancements in reasoning logic and medical accuracy. The findings indicate that mimetic, agent-based collaboration provides a scalable, interpretable, and clinically robust paradigm for automated decision support in oncology.

[221] Beyond Traditional Diagnostics: Transforming Patient-Side Information into Predictive Insights with Knowledge Graphs and Prototypes

Yibowen Zhao, Yinan Zhang, Zhixiang Su, Lizhen Cui, Chunyan Miao

Main category: cs.AI

TL;DR: KPI framework uses knowledge graphs, disease prototypes, and contrastive learning to improve disease prediction from patient-side information while providing LLM-generated explanations for interpretability.

DetailsMotivation: Current disease prediction methods from patient demographics and symptoms face challenges with imbalanced disease distributions and lack of interpretability, leading to biased/unreliable predictions that limit clinical utility.

Method: Proposes KPI framework that: 1) integrates medical knowledge into unified disease knowledge graph, 2) constructs clinically meaningful disease prototypes, 3) uses contrastive learning for better accuracy (especially for long-tailed diseases), and 4) employs LLMs to generate patient-specific medical explanations.

Result: Outperforms state-of-the-art methods in predictive accuracy on real-world datasets and provides clinically valid explanations that align well with patient narratives.

Conclusion: KPI framework effectively addresses imbalanced data and interpretability challenges in disease prediction, demonstrating practical value for patient-centered healthcare delivery through improved accuracy and clinically relevant explanations.

Abstract: Predicting diseases solely from patient-side information, such as demographics and self-reported symptoms, has attracted significant research attention due to its potential to enhance patient awareness, facilitate early healthcare engagement, and improve healthcare system efficiency. However, existing approaches encounter critical challenges, including imbalanced disease distributions and a lack of interpretability, resulting in biased or unreliable predictions. To address these issues, we propose the Knowledge graph-enhanced, Prototype-aware, and Interpretable (KPI) framework. KPI systematically integrates structured and trusted medical knowledge into a unified disease knowledge graph, constructs clinically meaningful disease prototypes, and employs contrastive learning to enhance predictive accuracy, which is particularly important for long-tailed diseases. Additionally, KPI utilizes large language models (LLMs) to generate patient-specific, medically relevant explanations, thereby improving interpretability and reliability. Extensive experiments on real-world datasets demonstrate that KPI outperforms state-of-the-art methods in predictive accuracy and provides clinically valid explanations that closely align with patient narratives, highlighting its practical value for patient-centered healthcare delivery.

[222] Reasoning Models Ace the CFA Exams

Jaisal Patel, Yunzhe Chen, Kaiwen He, Keyi Wang, David Li, Kairong Xiao, Xiao-Yang Liu

Main category: cs.AI

TL;DR: State-of-the-art reasoning models now pass all three levels of CFA exams with high scores, reversing previous findings of poor LLM performance on these professional finance exams.

DetailsMotivation: Previous research showed LLMs performed poorly on CFA exams, but recent reasoning models have demonstrated strong performance on graduate-level professional exams across disciplines. The authors wanted to evaluate whether current state-of-the-art reasoning models could successfully pass the challenging CFA exams.

Method: Evaluated state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Used the same pass/fail criteria from prior studies for comparison.

Result: Most models cleared all three CFA levels. Top performers: Gemini 3.0 Pro (97.6% on Level I), GPT-5 (94.3% on Level II), Gemini 2.5 Pro (86.4% on Level III multiple-choice), and Gemini 3.0 Pro (92.0% on Level III constructed-response). Passing models, ordered by overall performance: Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1.

Conclusion: Current reasoning models have achieved breakthrough performance on CFA exams, demonstrating they can now successfully pass all three levels of this challenging professional finance certification, marking significant progress from earlier LLM capabilities.

Abstract: Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.

[223] Towards Foundation Models with Native Multi-Agent Intelligence

Shuyue Hu, Haoyang Yan, Yiqun Zhang, Yang Chen, Dongzhan Zhou, Lei Bai

Main category: cs.AI

TL;DR: Current foundation models lack native multi-agent intelligence despite strong single-agent capabilities, requiring focused research to develop understanding, planning, communication, and adaptation abilities for multi-agent contexts.

DetailsMotivation: Foundation models are becoming the core of AI agents, but while recent work focuses on single-agent abilities, the next frontier is developing native multi-agent intelligence. Current models show strong single-agent performance but lack robust multi-agent capabilities.

Method: Identified four core multi-agent capabilities (understanding, planning, efficient communication, adaptation), conducted extensive empirical evaluation across 41 large language models to show single-agent performance doesn’t automatically translate to multi-agent intelligence, and outlined research directions including dataset construction, evaluation, training paradigms, and safety considerations.

Result: Empirical evidence shows strong single-agent performance alone does not yield robust multi-agent intelligence. The gap between single-agent and multi-agent capabilities is significant, requiring dedicated research efforts.

Conclusion: Developing foundation models with native multi-agent intelligence is crucial for advancing AI agents. This requires focused research across multiple dimensions including specialized datasets, evaluation frameworks, training methods, and safety considerations to bridge the gap between single-agent and multi-agent capabilities.

Abstract: Foundation models (FMs) are increasingly assuming the role of the “brain” of AI agents. While recent efforts have begun to equip FMs with native single-agent abilities – such as GUI interaction or integrated tool use – we argue that the next frontier is endowing FMs with native multi-agent intelligence. We identify four core capabilities of FMs in multi-agent contexts: understanding, planning, efficient communication, and adaptation. Contrary to assumptions about the spontaneous emergence of such abilities, we provide extensive empirical evidence across 41 large language models showing that strong single-agent performance alone does not automatically yield robust multi-agent intelligence. To address this gap, we outline key research directions – spanning dataset construction, evaluation, training paradigms, and safety considerations – for building FMs with native multi-agent intelligence.

[224] AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content

Thanh Vu, Richi Nayak, Thiru Balasubramaniam

Main category: cs.AI

TL;DR: The paper introduces Generative Agents as an automated solution for evaluating AI-generated content quality, addressing the time and cost challenges of human evaluation in business content creation.

DetailsMotivation: Businesses face challenges with time and expense in content generation and evaluation. Human writers have time constraints, and human evaluations are costly. While LLMs can generate content, concerns about AI content quality persist, and traditional evaluation methods add operational costs.

Method: Introduces Generative Agents that can rapidly and cost-effectively evaluate AI-generated content by simulating human judgment. These agents rate content on aspects like coherence, interestingness, clarity, fairness, and relevance.

Result: Generative Agents enable businesses to streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations.

Conclusion: The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation through Generative Agents.

Abstract: Modern businesses are increasingly challenged by the time and expense required to generate and assess high-quality content. Human writers face time constraints, and extrinsic evaluations can be costly. While Large Language Models (LLMs) offer potential in content creation, concerns about the quality of AI-generated content persist. Traditional evaluation methods, like human surveys, further add operational costs, highlighting the need for efficient, automated solutions. This research introduces Generative Agents as a means to tackle these challenges. These agents can rapidly and cost-effectively evaluate AI-generated content, simulating human judgment by rating aspects such as coherence, interestingness, clarity, fairness, and relevance. By incorporating these agents, businesses can streamline content generation and ensure consistent, high-quality output while minimizing reliance on costly human evaluations. The study provides critical insights into enhancing LLMs for producing business-aligned, high-quality content, offering significant advancements in automated content generation and evaluation.

[225] Towards a Science of Scaling Agent Systems

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu

Main category: cs.AI

TL;DR: Researchers develop quantitative scaling principles for multi-agent AI systems, identifying key trade-offs between tool coordination and computational overhead, with predictive models that can guide optimal architecture selection.

DetailsMotivation: Despite widespread adoption of agent-based AI systems in real-world applications, there's a lack of principled understanding about what determines their performance, forcing practitioners to rely on heuristics rather than evidence-based design choices.

Method: Systematic evaluation across 180 configurations using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) across three LLM families, tested on four diverse benchmarks with standardized tools and token budgets. Developed predictive models using empirical coordination metrics like efficiency, overhead, error amplification, and redundancy.
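A minimal sketch of the kind of predictive model described (fit coordination metrics to performance, then pick the best-scoring architecture for a new task); the feature names, toy coefficients, and use of ridge regression are assumptions for illustration only.

```python
# Illustrative sketch only: predicting per-architecture performance from
# coordination metrics with a cross-validated linear model (features are hypothetical).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy design matrix: [efficiency, overhead, error_amplification, redundancy, single_agent_baseline]
X = rng.random((180, 5))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0, 0.05, 180)

model = Ridge(alpha=1.0)
print("cross-validated R^2:", cross_val_score(model, X, y, cv=5, scoring="r2").mean())

model.fit(X, y)
# Pick the architecture with the highest predicted score for a new task.
candidates = {"single": X[:1], "centralized": X[1:2], "decentralized": X[2:3]}
best = max(candidates, key=lambda k: float(model.predict(candidates[k])[0]))
print("predicted best coordination strategy:", best)
```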

Result: Identified three dominant effects: 1) tool-coordination trade-off where tool-heavy tasks suffer from multi-agent overhead, 2) capability saturation where coordination yields diminishing returns after single-agent baselines exceed ~45%, and 3) topology-dependent error amplification. Centralized coordination improved financial reasoning by 80.9%, while decentralized excelled at web navigation (+9.2%). The framework predicted optimal coordination strategy for 87% of held-out configurations with cross-validated R²=0.513.

Conclusion: The study provides the first quantitative scaling principles for agent systems, offering a predictive framework based on measurable task properties that can guide optimal architecture selection, moving beyond heuristics to principled design of multi-agent AI systems.

Abstract: Agents, language model (LM)-based systems that are capable of reasoning, planning, and acting are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems. We evaluate this across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. We derive a predictive model using empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, that achieves cross-validated R^2=0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead. (2) a capability saturation: coordination yields diminishing or negative returns (beta=-0.408, p<0.001) once single-agent baselines exceed ~45%. (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.

[226] Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning

Wei Yang, Jesse Thomason

Main category: cs.AI

TL;DR: MPDF introduces decentralized meta-cognitive policies for LLM agents with SoftRankPO algorithm, achieving 4-5% accuracy gains over SOTA multi-agent reasoning methods.

DetailsMotivation: Current multi-agent LLM systems use fixed collaboration protocols that overlook agents' internal deliberative capabilities and meta-cognitive states like uncertainty or confidence, treating agents as passive executors rather than adaptive decision-makers.

Method: Meta-Policy Deliberation Framework (MPDF) where agents learn decentralized policies over meta-cognitive actions (Persist, Refine, Concede), using SoftRankPO - a novel RL algorithm that stabilizes training by shaping advantages based on rank of rewards mapped through smooth normal quantiles.
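A compact sketch of rank-based advantage shaping in the spirit of SoftRankPO: rewards are converted to ranks and mapped through the standard normal quantile function, making the advantage scale insensitive to reward variance. The exact smoothing used in the paper may differ; this is one interpretation.

```python
# A minimal sketch of rank-based advantage shaping; the paper's exact quantile
# mapping is an assumption here.
import numpy as np
from scipy.stats import norm

def soft_rank_advantages(rewards: np.ndarray) -> np.ndarray:
    """Map rewards to advantages via their rank passed through the standard
    normal quantile function, so scale is invariant to reward variance."""
    n = len(rewards)
    ranks = rewards.argsort().argsort() + 1      # 1..n (ties ignored for brevity)
    quantiles = (ranks - 0.5) / n                # strictly inside (0, 1)
    return norm.ppf(quantiles)                   # roughly zero-mean, unit scale

rewards = np.array([0.1, 5.0, 0.3, 0.2, 100.0])
print(soft_rank_advantages(rewards))             # insensitive to the outlier's magnitude
```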

Result: MPDF with SoftRankPO achieves 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms.

Conclusion: The work presents a paradigm shift from designing fixed protocols to learning dynamic, deliberative strategies for multi-agent LLM systems, enabling adaptive meta-cognitive policies.

Abstract: Multi-agent systems of large language models (LLMs) show promise for complex reasoning, but their effectiveness is often limited by fixed collaboration protocols. These frameworks typically focus on macro-level orchestration while overlooking agents’ internal deliberative capabilities. This critical meta-cognitive blindspot treats agents as passive executors unable to adapt their strategy based on internal cognitive states like uncertainty or confidence. We introduce the Meta-Policy Deliberation Framework (MPDF), where agents learn a decentralized policy over a set of high-level meta-cognitive actions: Persist, Refine, and Concede. To overcome the instability of traditional policy gradients in this setting, we develop SoftRankPO, a novel reinforcement learning algorithm. SoftRankPO stabilizes training by shaping advantages based on the rank of rewards mapped through smooth normal quantiles, making the learning process robust to reward variance. Experiments show that MPDF with SoftRankPO achieves a 4-5% absolute gain in average accuracy across five mathematical and general reasoning benchmarks compared to six state-of-the-art heuristic and learning-based multi-agent reasoning algorithms. Our work presents a paradigm for learning adaptive, meta-cognitive policies for multi-agent LLM systems, shifting the focus from designing fixed protocols to learning dynamic, deliberative strategies.

[227] rSIM: Incentivizing Reasoning Capabilities of LLMs via Reinforced Strategy Injection

Sijia Chen, Baochun Li, Di Niu

Main category: cs.AI

TL;DR: rSIM enables LLMs to become Reasoning Language Models via a small planner trained with multi-agent RL to inject reasoning strategies into CoTs.

DetailsMotivation: LLMs can evolve into Reasoning Language Models with "aha" moments through RL, but need mechanisms to systematically inject reasoning strategies like self-reflection and deep thinking into chain of thoughts.

Method: Proposes reinforced strategy injection mechanism (rSIM) with a small planner (leader agent) that guides LLM’s CoT through adaptive injection of reasoning strategies. Uses multi-agent RL with leader-follower framework and rule-based rewards to jointly train planner and LLM.
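A hypothetical sketch of strategy injection: a small planner picks a reasoning strategy to splice into the prompt before each generation step. The strategy set, the random planner, and the stubbed LLM call are placeholders, not the trained leader-follower system.

```python
# Hypothetical sketch of strategy injection into a chain of thought (LLM call stubbed).
import random

STRATEGIES = {
    "reflect": "Let me check the previous step for mistakes.",
    "deep_think": "Let me break this into smaller sub-problems.",
    "none": "",
}

def planner(context):
    """Stand-in for the trained leader agent; here it just picks a strategy at random."""
    return random.choice(list(STRATEGIES))

def llm_generate(prompt):
    """Stub for the follower LLM's next chain-of-thought segment."""
    return f"<cot segment given: {prompt[-40:]!r}>"

prompt = "Q: A train travels 60 km in 1.5 h. What is its average speed?\n"
for _ in range(3):
    strategy = planner(prompt)
    prompt += STRATEGIES[strategy] + "\n"        # inject the chosen strategy into the CoT
    prompt += llm_generate(prompt) + "\n"
print(prompt)
```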

Result: rSIM enables Qwen2.5-0.5B to outperform Qwen2.5-14B. The planner is generalizable - trained once and applicable as plug-in to improve existing LLMs. Supports continual learning across tasks, improving planning abilities over time.

Conclusion: rSIM effectively transforms LLMs into RLMs through adaptive strategy injection, with the planner showing strong generalization and continual learning capabilities across diverse reasoning tasks.

Abstract: Large language models (LLMs) are post-trained through reinforcement learning (RL) to evolve into Reasoning Language Models (RLMs), where the hallmark of this advanced reasoning is “aha” moments when they start to perform strategies, such as self-reflection and deep thinking, within chain of thoughts (CoTs). Motivated by this, this paper proposes a novel reinforced strategy injection mechanism (rSIM) that enables any LLM to become an RLM by employing a small planner to guide the LLM’s CoT through the adaptive injection of reasoning strategies. To achieve this, the planner (leader agent) is jointly trained with an LLM (follower agent) using multi-agent RL (MARL), based on a leader-follower framework and straightforward rule-based rewards. Experimental results show that rSIM enables Qwen2.5-0.5B to become an RLM and significantly outperform Qwen2.5-14B. Moreover, the planner is generalizable: it only needs to be trained once and can be applied as a plug-in to substantially improve the reasoning capabilities of existing LLMs. In addition, the planner supports continual learning across various tasks, allowing its planning abilities to gradually improve and generalize to a wider range of problems.

[228] Predicting California Bearing Ratio with Ensemble and Neural Network Models: A Case Study from Türkiye

Abdullah Hulusi Kökçam, Uğur Dağdeviren, Talas Fikret Kurnaz, Alparslan Serhat Demir, Caner Erden

Main category: cs.AI

TL;DR: This study develops a machine learning framework using 12 different algorithms to predict California Bearing Ratio (CBR) values from soil properties, with random forest achieving the best performance (R²=0.83 on test data).

DetailsMotivation: Traditional CBR laboratory tests are time-consuming, costly, and impractical for large-scale soil analysis. Machine learning offers a faster, data-driven alternative for predicting soil bearing capacity.

Method: Used 382 soil samples from Türkiye with physicochemical properties. Tested 12 ML algorithms (decision tree, random forest, extra trees, gradient boosting, xgboost, k-NN, SVR, MLP, adaboost, bagging, voting, stacking) in a supervised learning framework with training, validation, and test evaluation.
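A hedged sketch of the train/validation/test protocol with a random-forest regressor on synthetic data; the features, target, and split sizes are illustrative stand-ins, not the Türkiye dataset.

```python
# Hedged sketch of the evaluation protocol (synthetic data; feature names and
# split proportions are assumptions, not the authors' dataset).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((382, 8))                                               # 382 samples, physicochemical features
y = 20 * X[:, 0] - 10 * X[:, 1] ** 2 + rng.normal(0, 1.5, 382)        # synthetic CBR-like target

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
for name, Xs, ys in [("train", X_train, y_train), ("val", X_val, y_val), ("test", X_test, y_test)]:
    print(name, "R2:", round(r2_score(ys, model.predict(Xs)), 3))
```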

Result: Random forest regressor performed best with R² scores: 0.95 (training), 0.76 (validation), and 0.83 (test). Demonstrates strong nonlinear mapping capability for CBR prediction.

Conclusion: ML models, particularly random forest, provide an effective alternative to traditional CBR testing, supporting digital transformation in geotechnical engineering and infrastructure design.

Abstract: The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils, especially in transportation infrastructure and foundation design. Traditional CBR determination relies on laboratory penetration tests. Despite their accuracy, these tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in Türkiye. The dataset includes physicochemical soil properties relevant to bearing capacity, allowing multidimensional feature representation in a supervised learning context. Twelve ML algorithms were tested, including decision tree, random forest, extra trees, gradient boosting, xgboost, k-nearest neighbors, support vector regression, multi-layer perceptron, adaboost, bagging, voting, and stacking regressors. Each model was trained, validated, and evaluated to assess its generalization and robustness. Among them, the random forest regressor performed the best, achieving strong R2 scores of 0.95 (training), 0.76 (validation), and 0.83 (test). These outcomes highlight the model’s powerful nonlinear mapping ability, making it a promising tool for predictive geotechnical tasks. The study supports the integration of intelligent, data-centric models in geotechnical engineering, offering an effective alternative to traditional methods and promoting digital transformation in infrastructure analysis and design.

[229] Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach

Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren

Main category: cs.AI

TL;DR: AutoML approach using XGBoost achieves high accuracy (R²: 80.4% MDD, 89.1% OMC) for predicting soil compaction parameters across diverse soil types, outperforming traditional methods and standard ML models.

DetailsMotivation: Traditional laboratory methods for determining optimum moisture content (OMC) and maximum dry density (MDD) are labor-intensive, while empirical regression models have limited accuracy and generalizability across different soil types. Existing ML models also struggle with prediction accuracy and generalization on heterogeneous datasets.

Method: Proposes an automated machine learning (AutoML) approach that automates algorithm selection and hyperparameter optimization to predict OMC and MDD. The method was tested through extensive experimentation on heterogeneous soil datasets.
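A rough illustration of AutoML-style algorithm selection plus hyperparameter search using scikit-learn; the candidate models, search space, and synthetic data are assumptions (the study's actual AutoML tooling and the winning XGBoost configuration are not reproduced here).

```python
# Minimal AutoML-like loop: try several algorithms, search hyperparameters, keep the best.
# XGBoost would be plugged into the same loop; sklearn models are used here for brevity.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(1)
X = rng.random((300, 6))                                                    # synthetic soil features
y = 1.9 - 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.05, 300)         # e.g. an MDD-like target

search_space = {
    RandomForestRegressor(random_state=0): {"n_estimators": [100, 300], "max_depth": [None, 8]},
    GradientBoostingRegressor(random_state=0): {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]},
}

best = None
for estimator, params in search_space.items():
    search = RandomizedSearchCV(estimator, params, n_iter=4, cv=3, scoring="r2", random_state=0)
    search.fit(X, y)
    if best is None or search.best_score_ > best[1]:
        best = (search.best_estimator_, search.best_score_)

print("selected model:", type(best[0]).__name__, "cv R2:", round(best[1], 3))
```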

Result: The Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R² values of 80.4% for MDD and 89.1% for OMC on a separate test dataset. The AutoML approach demonstrated superior accuracy and generalizability compared to traditional methods.

Conclusion: AutoML effectively predicts soil compaction parameters across different soil types, with XGBoost showing optimal performance. Heterogeneous datasets improve ML model generalization, contributing to more efficient and reliable construction practices through enhanced prediction of compaction parameters.

Abstract: Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.

[230] Enhancing Explainability of Graph Neural Networks Through Conceptual and Structural Analyses and Their Extensions

Tien Cuong Bui

Main category: cs.AI

TL;DR: Proposes a novel XAI framework for GNNs that provides adaptable, efficient explanations by capturing how graph structure influences predictions, addressing limitations of current post-hoc and interpretable methods.

DetailsMotivation: Current XAI methods struggle to explain GNN decision-making due to graph complexity. Post-hoc methods are computationally expensive and unreliable, while interpretable models lack generalizability across different scenarios.

Method: Develops a novel XAI framework specifically designed for graph-based ML that moves beyond individual feature analysis to capture structural influences on predictions.

Result: The proposed framework aims to offer adaptable and computationally efficient explanations for GNNs, addressing the shortcomings of existing methods.

Conclusion: A new XAI approach is needed for GNNs that can provide reliable, efficient explanations by analyzing how graph structure affects model predictions, bridging the gap between post-hoc and interpretable methods.

Abstract: Graph Neural Networks (GNNs) have become a powerful tool for modeling and analyzing data with graph structures. The wide adoption in numerous applications underscores the value of these models. However, the complexity of these methods often impedes understanding their decision-making processes. Current Explainable AI (XAI) methods struggle to untangle the intricate relationships and interactions within graphs. Several methods have tried to bridge this gap via a post-hoc approach or self-interpretable design. Most of them focus on graph structure analysis to determine essential patterns that correlate with prediction outcomes. While post-hoc explanation methods are adaptable, they require extra computational resources and may be less reliable due to limited access to the model’s internal workings. Conversely, interpretable models can provide immediate explanations, but their generalizability to different scenarios remains a major concern. To address these shortcomings, this thesis seeks to develop a novel XAI framework tailored for graph-based machine learning. The proposed framework aims to offer adaptable, computationally efficient explanations for GNNs, moving beyond individual feature analysis to capture how graph structure influences predictions.

[231] Reflecting with Two Voices: A Co-Adaptive Dual-Strategy Framework for LLM-Based Agent Decision Making

Wentao Zhang, Qunbo Wang, Tao Zhang, Junsheng Wu, Hongping Gan, Yang Liu, Ling Dai, Shizhuang Deng, Shuntong Sun

Main category: cs.AI

TL;DR: DuSAR is a demonstration-free LLM agent framework that uses dual-strategy reasoning (holistic planning + local policy) with reflection to achieve state-of-the-art performance while reducing computational overhead.

DetailsMotivation: Current LLM agents rely heavily on external demonstrations or retrieval-augmented planning, which leads to brittleness, poor generalization, and high computational costs. The authors were inspired by human problem-solving approaches to create a more robust and efficient framework.

Method: DuSAR uses a single frozen LLM with two complementary strategies: high-level holistic planning and context-grounded local policy. These interact through a lightweight reflection mechanism where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement.

Result: On ALFWorld and Mind2Web benchmarks, DuSAR achieves state-of-the-art performance with open-source LLMs (7B-70B): 37.1% success on ALFWorld (Llama3.1-70B, more than doubling prior best of 13.0%) and 4.02% on Mind2Web (also more than doubling strongest baseline). It reduces per-step token consumption by 3-9X while maintaining strong performance.

Conclusion: DuSAR demonstrates that dual-strategy coordination with reflection enables efficient, robust LLM agents without external demonstrations. The framework’s flexibility allows optional integration of expert knowledge for further improvements, showing promise for more human-like problem-solving in AI agents.

Abstract: Large language model (LLM) agents often rely on external demonstrations or retrieval-augmented planning, leading to brittleness, poor generalization, and high computational overhead. Inspired by human problem-solving, we propose DuSAR (Dual-Strategy Agent with Reflecting) - a demonstration-free framework that enables a single frozen LLM to perform co-adaptive reasoning via two complementary strategies: a high-level holistic plan and a context-grounded local policy. These strategies interact through a lightweight reflection mechanism, where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement, mimicking human metacognitive behavior. On ALFWorld and Mind2Web, DuSAR achieves state-of-the-art performance with open-source LLMs (7B-70B), reaching 37.1% success on ALFWorld (Llama3.1-70B) - more than doubling the best prior result (13.0%) - and 4.02% on Mind2Web, also more than doubling the strongest baseline. Remarkably, it reduces per-step token consumption by 3-9X while maintaining strong performance. Ablation studies confirm the necessity of dual-strategy coordination. Moreover, optional integration of expert demonstrations further boosts results, highlighting DuSAR’s flexibility and compatibility with external knowledge.

[232] DeepFeature: Iterative Context-aware Feature Generation for Wearable Biosignals

Kaiwei Liu, Yuting He, Bufang Yang, Mu Yuan, Chun Man Victor Wong, Ho Pong Andrew Sze, Zhenyu Yan, Hongkai Chen

Main category: cs.AI

TL;DR: DeepFeature is an LLM-powered framework that generates context-aware features for wearable biosignal analysis, integrating expert knowledge with task-specific settings and using iterative refinement with robust code translation.

DetailsMotivation: Existing feature extraction methods for wearable biosignals lack task-specific contextual knowledge, struggle with optimal feature selection in high-dimensional spaces, and are prone to code generation errors, limiting their effectiveness in healthcare applications.

Method: DeepFeature introduces: 1) multi-source feature generation integrating expert knowledge with task settings, 2) iterative feature refinement using assessment-based feedback for re-selection, and 3) robust multi-layer filtering/verification for feature-to-code translation to ensure crash-free execution.
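A hypothetical sketch of the generate-verify-assess loop: candidate feature extractors (here hard-coded stand-ins for LLM-generated code) are filtered by a crash/shape check and then scored via AUROC. The multi-layer verification in the paper is richer than this single check.

```python
# Hypothetical sketch of "generate -> verify -> assess"; the LLM call is stubbed out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def candidate_features():
    """Stand-ins for LLM-generated feature extractors over 1-D biosignal windows."""
    return {
        "mean": lambda x: np.mean(x, axis=1),
        "std": lambda x: np.std(x, axis=1),
        "broken": lambda x: np.mean(x[:, 10_000]),   # out-of-bounds: should be filtered out
    }

def verify(fn, signals):
    """Verification collapsed to one check: does the code run and return one value per window?"""
    try:
        out = fn(signals)
        return isinstance(out, np.ndarray) and out.shape == (signals.shape[0],)
    except Exception:
        return False

rng = np.random.default_rng(0)
signals = rng.normal(size=(400, 250))                                  # 400 windows of a wearable biosignal
labels = (signals.mean(axis=1) + rng.normal(0, 0.02, 400) > 0).astype(int)

kept = {name: fn for name, fn in candidate_features().items() if verify(fn, signals)}
X = np.column_stack([fn(signals) for fn in kept.values()])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("kept features:", list(kept),
      "AUROC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```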

Result: DeepFeature achieves an average AUROC improvement of 4.21-9.67% across eight diverse tasks compared to baselines, outperforming state-of-the-art approaches on five tasks while maintaining comparable performance on the remaining three tasks.


Conclusion: DeepFeature demonstrates that LLM-empowered, context-aware feature generation can significantly improve wearable biosignal analysis performance by addressing limitations of existing feature extraction methods through integrated expert knowledge, iterative refinement, and robust code translation.

Abstract: Biosignals collected from wearable devices are widely utilized in healthcare applications. Machine learning models used in these applications often rely on features extracted from biosignals due to their effectiveness, lower data dimensionality, and wide compatibility across various model architectures. However, existing feature extraction methods often lack task-specific contextual knowledge, struggle to identify optimal feature extraction settings in high-dimensional feature space, and are prone to code generation and automation errors. In this paper, we propose DeepFeature, the first LLM-empowered, context-aware feature generation framework for wearable biosignals. DeepFeature introduces a multi-source feature generation mechanism that integrates expert knowledge with task settings. It also employs an iterative feature refinement process that uses feature assessment-based feedback for feature re-selection. Additionally, DeepFeature utilizes a robust multi-layer filtering and verification approach for robust feature-to-code translation to ensure that the extraction functions run without crashing. Experimental evaluation results show that DeepFeature achieves an average AUROC improvement of 4.21-9.67% across eight diverse tasks compared to baseline methods. It outperforms state-of-the-art approaches on five tasks while maintaining comparable performance on the remaining tasks.

[233] Prismatic World Model: Learning Compositional Dynamics for Planning in Hybrid Systems

Mingwei Li, Xiaoyuan Zhang, Chengwei Yang, Zilong Zheng, Yaodong Yang

Main category: cs.AI

TL;DR: PRISM-WM is a structured world model that decomposes hybrid dynamics into composable primitives using Mixture-of-Experts to handle sharp mode transitions, reducing rollout drift for better planning.

DetailsMotivation: Model-based planning in robotics struggles with hybrid dynamics where continuous motion has discrete events (contacts, impacts). Conventional monolithic neural networks over-smooth distinct dynamic modes, causing catastrophic compounding errors during long-horizon planning at physical boundaries.

Method: PRISM-WM uses a context-aware Mixture-of-Experts framework with a gating mechanism to identify physical modes and specialized experts to predict associated transition dynamics. Includes latent orthogonalization objective to ensure expert diversity and prevent mode collapse.
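A minimal sketch of a gated mixture-of-experts dynamics model with an orthogonality penalty on expert outputs; the network sizes and the exact form of the latent orthogonalization objective are assumptions, not the paper's architecture.

```python
# Minimal MoE dynamics sketch; the orthogonality penalty is one plausible reading
# of the "latent orthogonalization objective", not its exact form.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, n_experts=4, hidden=64):
        super().__init__()
        in_dim = state_dim + action_dim
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))
            for _ in range(n_experts)
        )

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        weights = F.softmax(self.gate(x), dim=-1)                 # (B, E) soft mode assignment
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, state_dim)
        next_state = (weights.unsqueeze(-1) * outs).sum(dim=1)    # weighted expert mixture
        # Encourage specialization: penalize pairwise similarity of expert predictions.
        normed = F.normalize(outs, dim=-1)
        gram = normed @ normed.transpose(1, 2)                    # (B, E, E)
        off_diag = gram - torch.eye(outs.shape[1], device=gram.device)
        ortho_loss = (off_diag ** 2).mean()
        return next_state, ortho_loss

model = MoEDynamics(state_dim=10, action_dim=3)
s, a = torch.randn(32, 10), torch.randn(32, 3)
pred, ortho = model(s, a)
print(pred.shape, float(ortho))
```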

Result: PRISM-WM significantly reduces rollout drift by accurately modeling sharp mode transitions. Extensive experiments on challenging continuous control benchmarks (high-dimensional humanoids, multi-task settings) show superior performance as a substrate for trajectory optimization algorithms like TD-MPC.

Conclusion: PRISM-WM provides a powerful foundational model for next-generation model-based agents by effectively decomposing complex hybrid dynamics into composable primitives, enabling more reliable planning at physical boundaries.

Abstract: Model-based planning in robotic domains is fundamentally challenged by the hybrid nature of physical dynamics, where continuous motion is punctuated by discrete events such as contacts and impacts. Conventional latent world models typically employ monolithic neural networks that enforce global continuity, inevitably over-smoothing the distinct dynamic modes (e.g., sticking vs. sliding, flight vs. stance). For a planner, this smoothing results in catastrophic compounding errors during long-horizon lookaheads, rendering the search process unreliable at physical boundaries. To address this, we introduce the Prismatic World Model (PRISM-WM), a structured architecture designed to decompose complex hybrid dynamics into composable primitives. PRISM-WM leverages a context-aware Mixture-of-Experts (MoE) framework where a gating mechanism implicitly identifies the current physical mode, and specialized experts predict the associated transition dynamics. We further introduce a latent orthogonalization objective to ensure expert diversity, effectively preventing mode collapse. By accurately modeling the sharp mode transitions in system dynamics, PRISM-WM significantly reduces rollout drift. Extensive experiments on challenging continuous control benchmarks, including high-dimensional humanoids and diverse multi-task settings, demonstrate that PRISM-WM provides a superior high-fidelity substrate for trajectory optimization algorithms (e.g., TD-MPC), proving its potential as a powerful foundational model for next-generation model-based agents.

[234] From Accuracy to Impact: The Impact-Driven AI Framework (IDAIF) for Aligning Engineering Architecture with Theory of Change

Yong-Woon Kim

Main category: cs.AI

TL;DR: IDAIF integrates Theory of Change with AI architecture to ensure AI systems align with human values and create positive social impact across high-stakes domains.

DetailsMotivation: Current AI approaches focus too much on technical performance metrics while ignoring sociotechnical dimensions, creating alignment problems where AI behavior doesn't correspond with human values and intentions in critical domains like healthcare, finance, and public policy.

Method: IDAIF maps Theory of Change’s five-stage model (Inputs-Activities-Outputs-Outcomes-Impact) to AI architectural layers (Data-Pipeline-Inference-Agentic-Normative), incorporating multi-objective Pareto optimization, hierarchical multi-agent orchestration, causal DAGs for hallucination mitigation, adversarial debiasing with RLHF, and an Assurance Layer with guardian architectures.

Result: The framework provides formal mathematical formulations for each component and demonstrates application through three case studies in healthcare, cybersecurity, and software engineering domains.

Conclusion: IDAIF represents a paradigm shift from model-centric to impact-centric AI development, offering engineers concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.

Abstract: This paper introduces the Impact-Driven AI Framework (IDAIF), a novel architectural methodology that integrates Theory of Change (ToC) principles with modern artificial intelligence system design. As AI systems increasingly influence high-stakes domains including healthcare, finance, and public policy, the alignment problem–ensuring AI behavior corresponds with human values and intentions–has become critical. Current approaches predominantly optimize technical performance metrics while neglecting the sociotechnical dimensions of AI deployment. IDAIF addresses this gap by establishing a systematic mapping between ToC’s five-stage model (Inputs-Activities-Outputs-Outcomes-Impact) and corresponding AI architectural layers (Data Layer-Pipeline Layer-Inference Layer-Agentic Layer-Normative Layer). Each layer incorporates rigorous theoretical foundations: multi-objective Pareto optimization for value alignment, hierarchical multi-agent orchestration for outcome achievement, causal directed acyclic graphs (DAGs) for hallucination mitigation, and adversarial debiasing with Reinforcement Learning from Human Feedback (RLHF) for fairness assurance. We provide formal mathematical formulations for each component and introduce an Assurance Layer that manages assumption failures through guardian architectures. Three case studies demonstrate IDAIF application across healthcare, cybersecurity, and software engineering domains. This framework represents a paradigm shift from model-centric to impact-centric AI development, providing engineers with concrete architectural patterns for building ethical, trustworthy, and socially beneficial AI systems.

[235] Using reinforcement learning to probe the role of feedback in skill acquisition

Antonio Terpin, Raffaello D’Andrea

Main category: cs.AI

TL;DR: RL agent learns to control drag on spinning cylinder in water channel using flow feedback, discovering high-performance strategies that work open-loop after training.

DetailsMotivation: To study skill acquisition under controlled conditions, bypassing human subjects to understand how feedback enables learning complex physical skills that can later be executed without feedback.

Method: Interfaced a generalist RL agent with a spinning cylinder in a tabletop circulating water channel to maximize/minimize drag. Used high-dimensional flow feedback during training, then tested learned policies open-loop without feedback.
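A toy sketch of the open-loop replay test: record the actions a feedback policy takes in a stand-in environment, then replay the same sequence without observations and compare returns. The environment, policy, and reward here are placeholders for the physical water-channel rig.

```python
# Toy open-loop replay check; everything below is a stand-in, not the real setup.
import numpy as np

class ToyFlowEnv:
    """Stand-in for the water-channel rig: reward is (negative) drag-like error."""
    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = 0.0
    def step(self, action):
        self.state = 0.9 * self.state + 0.1 * action + self.rng.normal(0, 0.01)
        reward = -abs(self.state - 1.0)              # pretend drag is minimized near state == 1
        return self.state, reward

def feedback_policy(obs):
    return 1.0 - obs                                  # simple proportional controller using feedback

env = ToyFlowEnv()
obs, actions, closed_loop_return = 0.0, [], 0.0
for _ in range(50):                                   # rollout with feedback, recording actions
    a = feedback_policy(obs)
    actions.append(a)
    obs, r = env.step(a)
    closed_loop_return += r

env = ToyFlowEnv()                                    # replay the recorded actions open-loop
open_loop_return = sum(env.step(a)[1] for a in actions)
print(round(closed_loop_return, 3), round(open_loop_return, 3))
```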

Result: With flow feedback, the agent discovered high-performance drag-control strategies within minutes of real-world interaction. Replaying the learned action sequences open-loop yielded nearly identical performance. Without flow feedback during training, the agent failed at drag maximization but succeeded (more slowly and less reliably) at drag minimization.

Conclusion: Learning a skill can require richer information than executing it. Learning conditions can be kind or wicked depending on the goal, not on dynamics or policy complexity.

Abstract: Many high-performance human activities are executed with little or no external feedback: think of a figure skater landing a triple jump, a pitcher throwing a curveball for a strike, or a barista pouring latte art. To study the process of skill acquisition under fully controlled conditions, we bypass human subjects. Instead, we directly interface a generalist reinforcement learning agent with a spinning cylinder in a tabletop circulating water channel to maximize or minimize drag. This setup has several desirable properties. First, it is a physical system, with the rich interactions and complex dynamics that only the physical world has: the flow is highly chaotic and extremely difficult, if not impossible, to model or simulate accurately. Second, the objective – drag minimization or maximization – is easy to state and can be captured directly in the reward, yet good strategies are not obvious beforehand. Third, decades-old experimental studies provide recipes for simple, high-performance open-loop policies. Finally, the setup is inexpensive and far easier to reproduce than human studies. In our experiments we find that high-dimensional flow feedback lets the agent discover high-performance drag-control strategies with only minutes of real-world interaction. When we later replay the same action sequences without any feedback, we obtain almost identical performance. This shows that feedback, and in particular flow feedback, is not needed to execute the learned policy. Surprisingly, without flow feedback during training the agent fails to discover any well-performing policy in drag maximization, but still succeeds in drag minimization, albeit more slowly and less reliably. Our studies show that learning a high-performance skill can require richer information than executing it, and learning conditions can be kind or wicked depending solely on the goal, not on dynamics or policy complexity.

[236] Autonomous Issue Resolver: Towards Zero-Touch Code Maintenance

Aliaksei Kaliutau

Main category: cs.AI

TL;DR: The paper proposes a Data Transformation Graph (DTG) paradigm shift from Code Property Graphs for repository-scale Automated Program Repair, using data lineage tracing instead of control flow, achieving 87.1% resolution rate on SWE-Verified benchmark.

DetailsMotivation: Current approaches for repository-scale Automated Program Repair (APR) use control-centric paradigms that force agents to navigate complex directory structures and irrelevant control logic, creating limitations for large-scale code maintenance.

Method: Proposes Data Transformation Graph (DTG) that inverts topology by modeling data states as nodes and functions as edges, enabling logic defect tracing through data lineage. Introduces multi-agent framework reconciling data integrity navigation with control flow logic, implemented as Autonomous Issue Resolver (AIR) with neuro-symbolic reasoning.
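A toy sketch of the inverted topology: data states as nodes, transformations as edges, with lineage traced backwards to the first transformation whose output violates a check. Names and the defect predicate are hypothetical.

```python
# Toy data-lineage graph: data states as nodes, functions as edge labels (hypothetical names).
from dataclasses import dataclass, field

@dataclass
class DataNode:
    name: str
    value: object
    produced_by: str | None = None              # edge label: the function that produced this state
    parents: list["DataNode"] = field(default_factory=list)

def apply(fn, fn_name, *inputs):
    """Apply a transformation and record lineage on the resulting data state."""
    return DataNode(f"{fn_name}_out", fn(*(i.value for i in inputs)), fn_name, list(inputs))

def trace_defect(node, predicate):
    """Walk lineage backwards; return the first transformation whose output violates the predicate."""
    if not predicate(node.value):
        for parent in node.parents:
            hit = trace_defect(parent, predicate)
            if hit:
                return hit
        return node.produced_by
    return None

raw = DataNode("raw_readings", [3.0, 4.0, 5.0])
shifted = apply(lambda xs: [x - 10 for x in xs], "debias", raw)       # buggy: makes values negative
summed = apply(lambda xs: [sum(xs)], "aggregate", shifted)

print(trace_defect(summed, lambda v: all(x >= 0 for x in v)))          # -> "debias"
```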

Result: Demonstrates strong results on several SWE benchmarks, reaching an 87.1% resolution rate on the SWE-Verified benchmark. The DTG approach resolves the “Semantic Trap” inherent in standard RAG systems used in modern coding agents.

Conclusion: The DTG approach addresses core limitations of current AI code-assistant tools and provides a more robust foundation for software-dependent world through scalable logic repair and zero-touch code maintenance.

Abstract: Recent advances in Large Language Models have revolutionized function-level code generation; however, repository-scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control-centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi-agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the “Semantic Trap” inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self-improvement system for zero-touch code maintenance that utilizes neuro-symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE-Verified benchmark. Our approach directly addresses the core limitations of current AI code-assistant tools and tackles the critical need for a more robust foundation for our increasingly software-dependent world.

[237] A Lightweight Transfer Learning-Based State-of-Health Monitoring with Application to Lithium-ion Batteries in Unmanned Air Vehicles

Jiang Liu, Yan Qin, Wei Dai, Chau Yuen

Main category: cs.AI

TL;DR: A lightweight transfer learning approach for battery SOH monitoring using constructive incremental transfer learning to reduce computational overhead while maintaining accuracy.

DetailsMotivation: Traditional transfer learning for battery state-of-health monitoring consumes too much computational resources for portable mobile devices, reducing their working endurance. Need a lightweight approach that can leverage knowledge from source conditions while minimizing target domain data requirements.

Method: Proposes constructive incremental transfer learning (CITL) with semi-supervised TL mechanism using unlabeled target data. Iteratively adds network nodes while ensuring cross-domain learning through structural risk minimization, transfer mismatching minimization, and manifold consistency maximization. Includes convergence analysis.

Result: CITL outperforms SS-TCA, MMD-LSTM-DA, DDAN, BO-CNN-TL, and AS³LSTM in SOH estimation by 83.73%, 61.15%, 28.24%, 87.70%, and 57.34%, respectively, as measured by RMSE on a realistic UAV battery dataset.

Conclusion: The proposed lightweight CITL approach effectively addresses computational resource constraints in portable devices while achieving superior SOH monitoring accuracy through constructive incremental learning and comprehensive cross-domain optimization.

Abstract: Accurate and rapid state-of-health (SOH) monitoring plays an important role in indicating energy information for lithium-ion battery-powered portable mobile devices. To confront their variable working conditions, transfer learning (TL) emerges as a promising technique for leveraging knowledge from data-rich source working conditions, significantly reducing the training data required for SOH monitoring from target working conditions. However, traditional TL-based SOH monitoring is infeasible when applied in portable mobile devices since substantial computational resources are consumed during the TL stage and unexpectedly reduce the working endurance. To address these challenges, this paper proposes a lightweight TL-based SOH monitoring approach with constructive incremental transfer learning (CITL). First, taking advantage of the unlabeled data in the target domain, a semi-supervised TL mechanism is proposed to minimize the monitoring residual in a constructive way, through iteratively adding network nodes in the CITL. Second, the cross-domain learning ability of node parameters for CITL is comprehensively guaranteed through structural risk minimization, transfer mismatching minimization, and manifold consistency maximization. Moreover, the convergence analysis of the CITL is given, theoretically guaranteeing the efficacy of TL performance and network compactness. Finally, the proposed approach is verified through extensive experiments with a realistic unmanned air vehicles (UAV) battery dataset collected from dozens of flight missions. Specifically, the CITL outperforms SS-TCA, MMD-LSTM-DA, DDAN, BO-CNN-TL, and AS³LSTM in SOH estimation by 83.73%, 61.15%, 28.24%, 87.70%, and 57.34%, respectively, as evaluated using the index root mean square error.

[238] Principles2Plan: LLM-Guided System for Operationalising Ethical Principles into Plans

Tammy Zhong, Yang Song, Maurice Pagnucco

Main category: cs.AI

TL;DR: Principles2Plan: Human-LLM collaboration system for generating context-sensitive ethical rules to guide automated planning in robotics

DetailsMotivation: Robots need ethical awareness in human environments, but existing planning tools lack ethical support. Manual specification of ethical rules is labor-intensive and context-specific.

Method: Interactive system where domain expert provides planning domain, problem details, and high-level ethical principles. LLM generates operationalizable ethical rules consistent with principles, which user can review, prioritize, and supply to planner.

Result: First system to support users in generating principle-grounded rules for classical planning contexts. Demonstrates practical human-LLM collaboration for ethical automated planning.

Conclusion: Principles2Plan showcases potential of human-LLM collaboration to make ethical automated planning more practical and feasible for robotics applications.

Abstract: Ethical awareness is critical for robots operating in human environments, yet existing automated planning tools provide little support. Manually specifying ethical rules is labour-intensive and highly context-specific. We present Principles2Plan, an interactive research prototype demonstrating how a human and a Large Language Model (LLM) can collaborate to produce context-sensitive ethical rules and guide automated planning. A domain expert provides the planning domain, problem details, and relevant high-level principles such as beneficence and privacy. The system generates operationalisable ethical rules consistent with these principles, which the user can review, prioritise, and supply to a planner to produce ethically-informed plans. To our knowledge, no prior system supports users in generating principle-grounded rules for classical planning contexts. Principles2Plan showcases the potential of human-LLM collaboration for making ethical automated planning more practical and feasible.

[239] The SMART+ Framework for AI Systems

Laxmiraju Kandikatla, Branislav Radeljic

Main category: cs.AI

TL;DR: The paper introduces the SMART+ Framework for AI governance across industries, focusing on safety, accountability, and compliance in clinical research and other sectors.

DetailsMotivation: AI systems are widely adopted across industries (clinical research, finance, manufacturing) but introduce challenges regarding safety, accountability, and regulatory compliance that need to be addressed.

Method: Introduces the SMART+ Framework built on Safety, Monitoring, Accountability, Reliability, Transparency pillars, enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails.

Result: SMART+ offers a practical, comprehensive approach for evaluating and governing AI systems, demonstrating risk mitigation, trust-building, and compliance readiness.

Conclusion: The framework provides a robust foundation for effective AI governance in clinical research by enabling responsible AI adoption and ensuring auditability.

Abstract: Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. This framework aligns with evolving mechanisms and regulatory guidance to integrate operational safeguards, oversight procedures, and strengthened privacy and governance controls. SMART+ demonstrates risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.

[240] CogMCTS: A Novel Cognitive-Guided Monte Carlo Tree Search Framework for Iterative Heuristic Evolution with Large Language Models

Hui Wang, Yang Liu, Xiaoyu Zhang, Chaoxu Mu

Main category: cs.AI

TL;DR: CogMCTS: A cognitive-guided Monte Carlo Tree Search framework that integrates LLMs with MCTS for automated heuristic design, using multi-round feedback and dual-track expansion to improve optimization performance.

DetailsMotivation: Existing LLM-based evolutionary methods for Automatic Heuristic Design (AHD) suffer from local optima issues, while LLM-MCTS integration has limited cognitive integration and constrained search diversity. Need better balance between exploration and exploitation.

Method: Proposes CogMCTS framework that tightly integrates LLM cognitive guidance with MCTS. Uses multi-round cognitive feedback incorporating historical experience, node info, and negative outcomes. Implements dual-track node expansion with elite heuristic management, and strategic mutation for heuristic form/parameter diversity.
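A generic sketch of the underlying MCTS loop (UCT selection, expansion, scoring, backpropagation); the LLM-guided expansion and multi-round cognitive feedback that define CogMCTS are reduced to stubs here.

```python
# Generic UCT/MCTS skeleton; LLM-guided expansion and heuristic scoring are stubbed.
import math, random

class Node:
    def __init__(self, heuristic, parent=None):
        self.heuristic = heuristic                 # e.g. source code of a candidate heuristic
        self.parent, self.children = parent, []
        self.visits, self.value = 0, 0.0

    def uct(self, c=1.4):
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def llm_expand(node, k=2):
    """Stub for LLM-guided expansion; CogMCTS would condition this on cognitive feedback."""
    return [Node(f"{node.heuristic}+v{i}", parent=node) for i in range(k)]

def evaluate(node):
    """Stub for scoring a heuristic on benchmark instances."""
    return random.random()

root = Node("greedy_baseline")
root.children = llm_expand(root)
for _ in range(30):
    node = root
    while node.children:                           # selection: follow UCT, stop at unvisited child
        unvisited = [c for c in node.children if c.visits == 0]
        node = unvisited[0] if unvisited else max(node.children, key=Node.uct)
        if node.visits == 0:
            break
    if node.visits > 0:                            # expansion for already-visited leaves
        node.children = llm_expand(node)
        node = node.children[0]
    reward = evaluate(node)                        # simulation / scoring
    while node:                                    # backpropagation
        node.visits += 1
        node.value += reward
        node = node.parent

best = max(root.children, key=lambda n: n.value / max(n.visits, 1))
print("best first-level heuristic:", best.heuristic, round(best.value / max(best.visits, 1), 3))
```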

Result: Experimental results show CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.

Conclusion: CogMCTS effectively addresses limitations of current LLM-based AHD approaches by better integrating cognitive guidance with MCTS, achieving superior optimization performance through improved exploration-exploitation balance and solution diversity.

Abstract: Automatic Heuristic Design (AHD) is an effective framework for solving complex optimization problems. The development of large language models (LLMs) enables the automated generation of heuristics. Existing LLM-based evolutionary methods rely on population strategies and are prone to local optima. Integrating LLMs with Monte Carlo Tree Search (MCTS) improves the trade-off between exploration and exploitation, but multi-round cognitive integration remains limited and search diversity is constrained. To overcome these limitations, this paper proposes a novel cognitive-guided MCTS framework (CogMCTS). CogMCTS tightly integrates the cognitive guidance mechanism of LLMs with MCTS to achieve efficient automated heuristic optimization. The framework employs multi-round cognitive feedback to incorporate historical experience, node information, and negative outcomes, dynamically improving heuristic generation. Dual-track node expansion combined with elite heuristic management balances the exploration of diverse heuristics and the exploitation of high-quality experience. In addition, strategic mutation modifies the heuristic forms and parameters to further enhance the diversity of the solution and the overall optimization performance. The experimental results indicate that CogMCTS outperforms existing LLM-based AHD methods in stability, efficiency, and solution quality.

[241] Protein Secondary Structure Prediction Using Transformers

Manzi Kevin Maxime

Main category: cs.AI

TL;DR: Transformer model with attention mechanisms predicts protein secondary structures (alpha helices, beta sheets, coils) from amino acid sequences using sliding-window data augmentation on CB513 dataset.

DetailsMotivation: Predicting protein secondary structures from amino acid sequences is essential for understanding protein function, as these structural motifs (alpha helices, beta sheets, coils) are fundamental to protein behavior and activity.

Method: Transformer-based model with attention mechanisms applied to protein sequence data, using sliding-window data augmentation technique on the CB513 dataset to expand training samples.

Result: The transformer shows strong ability to generalize across variable-length sequences while effectively capturing both local and long-range residue interactions for secondary structure prediction.

Conclusion: Transformer architecture with attention mechanisms is effective for protein secondary structure prediction, handling variable sequence lengths and capturing complex residue interactions through data augmentation techniques.

Abstract: Predicting protein secondary structures such as alpha helices, beta sheets, and coils from amino acid sequences is essential for understanding protein function. This work presents a transformer-based model that applies attention mechanisms to protein sequence data to predict structural motifs. A sliding-window data augmentation technique is used on the CB513 dataset to expand the training samples. The transformer shows strong ability to generalize across variable-length sequences while effectively capturing both local and long-range residue interactions.
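
Code sketch: A minimal illustration of the sliding-window augmentation idea on a toy amino-acid sequence with per-residue H(elix)/E(sheet)/C(oil) labels; the window and stride values are arbitrary choices, not taken from the paper.

```python
def sliding_windows(sequence, labels, window=64, stride=16):
    """Split one (sequence, per-residue labels) pair into overlapping windows."""
    assert len(sequence) == len(labels)
    samples = []
    for start in range(0, max(len(sequence) - window, 0) + 1, stride):
        samples.append((sequence[start:start + window],
                        labels[start:start + window]))
    return samples

# Toy example: 33-residue sequence with matching secondary-structure labels.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
lab = "CCCHHHHHHHHHCCCEEEEECCCHHHHHHHCCC"
print(len(sliding_windows(seq, lab, window=16, stride=4)))  # -> 5 training windows
```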

[242] See-Control: A Multimodal Agent Framework for Smartphone Interaction with a Robotic Arm

Haoyu Zhao, Weizhong Ding, Yuhao Yang, Zheng Tian, Linyi Yang, Kun Shao, Jun Wang

Main category: cs.AI

TL;DR: See-Control is a framework that enables smartphone operation via physical robotic interaction, offering a platform-agnostic alternative to ADB-dependent methods.

DetailsMotivation: Existing MLLM-based smartphone operation methods rely on Android Debug Bridge (ADB), limiting them to Android devices and requiring system back-end access. There's a need for a platform-agnostic solution that works through direct physical interaction.

Method: See-Control uses a low-DoF robotic arm for direct physical smartphone interaction, with an MLLM-based embodied agent that generates robotic control commands without ADB or system access. It includes an ESO benchmark with 155 tasks and evaluation metrics, plus an annotated dataset of operation episodes.

Result: The framework provides a concrete solution for platform-agnostic smartphone operation via physical interaction, bridging digital agents with the physical world and enabling home robots to perform smartphone-dependent tasks in realistic environments.

Conclusion: See-Control represents a significant advancement toward enabling embodied agents to operate smartphones through physical interaction rather than software interfaces, with potential applications for home robots performing smartphone-dependent tasks in real-world settings.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled their use as intelligent agents for smartphone operation. However, existing methods depend on the Android Debug Bridge (ADB) for data transmission and action execution, limiting their applicability to Android devices. In this work, we introduce the novel Embodied Smartphone Operation (ESO) task and present See-Control, a framework that enables smartphone operation via direct physical interaction with a low-DoF robotic arm, offering a platform-agnostic solution. See-Control comprises three key components: (1) an ESO benchmark with 155 tasks and corresponding evaluation metrics; (2) an MLLM-based embodied agent that generates robotic control commands without requiring ADB or system back-end access; and (3) a richly annotated dataset of operation episodes, offering valuable resources for future research. By bridging the gap between digital agents and the physical world, See-Control provides a concrete step toward enabling home robots to perform smartphone-dependent tasks in realistic environments.

[243] Deconstructing the Dual Black Box: A Plug-and-Play Cognitive Framework for Human-AI Collaborative Enhancement and Its Implications for AI Governance

Yiming Lu

Main category: cs.AI

TL;DR: Proposes human-AI collaborative cognitive enhancement to transform dual black boxes (human intuition + AI decision-making) into a functional white-box system via structured meta-interaction and plug-and-play cognitive frameworks.

DetailsMotivation: To bridge the fundamental divide between human experts' implicit intuition (cognitive black box) and AI's untrustworthy decision-making (computational black box), enabling collaborative cognitive enhancement.

Method: Uses plug-and-play cognitive frameworks extracted from expert dialogues, loaded into Recursive Adversarial Meta-Thinking Network (RAMTN) through structured meta-interaction protocols.

Result: Enables conversion of expert thinking (medical diagnosis, teaching intuition) into reusable public assets, shifting from “AI as tool” to “AI as thinking partner” with verifiable governance.

Conclusion: Provides engineering proof for cognitive equity and opens new AI governance path through transparent interaction protocols rather than model internals, promoting technology for good and cognitive inclusion.

Abstract: Currently, there exists a fundamental divide between the “cognitive black box” (implicit intuition) of human experts and the “computational black box” (untrustworthy decision-making) of artificial intelligence (AI). This paper proposes a new paradigm of “human-AI collaborative cognitive enhancement,” aiming to transform the dual black boxes into a composable, auditable, and extensible “functional white-box” system through structured “meta-interaction.” The core breakthrough lies in the “plug-and-play cognitive framework”–a computable knowledge package that can be extracted from expert dialogues and loaded into the Recursive Adversarial Meta-Thinking Network (RAMTN). This enables expert thinking, such as medical diagnostic logic and teaching intuition, to be converted into reusable and scalable public assets, realizing a paradigm shift from “AI as a tool” to “AI as a thinking partner.” This work not only provides the first engineering proof for “cognitive equity” but also opens up a new path for AI governance: constructing a verifiable and intervenable governance paradigm through “transparency of interaction protocols” rather than prying into the internal mechanisms of models. The framework is open-sourced to promote technology for good and cognitive inclusion. This paper is an independent exploratory research conducted by the author. All content presented, including the theoretical framework (RAMTN), methodology (meta-interaction), system implementation, and case validation, constitutes the author’s individual research achievements.

[244] Performance Comparison of Aerial RIS and STAR-RIS in 3D Wireless Environments

Dongdong Yang, Bin Li, Jiguang He

Main category: cs.AI

TL;DR: Aerial STAR-RIS outperforms RIS at low altitudes due to full-space coverage, while RIS performs better near base stations at higher altitudes in 3D wireless environments.

DetailsMotivation: RIS and STAR-RIS on UAVs offer flexible deployment and improved LoS for next-gen networks, but comprehensive performance comparison between aerial RIS and STAR-RIS architectures is lacking.

Method: Established accurate channel models with directional radiation patterns, formulated joint optimization problems for both architectures, and proposed efficient solution using weighted minimum mean square error and block coordinate descent algorithms.

Result: STAR-RIS outperforms RIS in low-altitude scenarios due to full-space coverage capability, while RIS delivers better performance near base stations at higher altitudes.

Conclusion: Findings provide practical insights for deploying aerial intelligent surfaces in future 6G communication systems, highlighting trade-offs between coverage and performance based on altitude.

Abstract: Reconfigurable intelligent surface (RIS) and simultaneously transmitting and reflecting RIS (STAR-RIS) have emerged as key enablers for enhancing wireless coverage and capacity in next-generation networks. When mounted on unmanned aerial vehicles (UAVs), they benefit from flexible deployment and improved line-of-sight conditions. Despite their promising potential, a comprehensive performance comparison between aerial RIS and STAR-RIS architectures has not been thoroughly investigated. This letter presents a detailed performance comparison between aerial RIS and STAR-RIS in three-dimensional wireless environments. Accurate channel models incorporating directional radiation patterns are established, and the influence of deployment altitude and orientation is thoroughly examined. To optimize the system sum-rate, we formulate joint optimization problems for both architectures and propose an efficient solution based on the weighted minimum mean square error and block coordinate descent algorithms. Simulation results reveal that STAR-RIS outperforms RIS in low-altitude scenarios due to its full-space coverage capability, whereas RIS delivers better performance near the base station at higher altitudes. The findings provide practical insights for the deployment of aerial intelligent surfaces in future 6G communication systems.
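
Code sketch: For readers unfamiliar with block coordinate descent, the toy sketch below shows the alternating-update pattern the optimization relies on, using a placeholder least-squares objective rather than the paper's sum-rate problem with WMMSE subproblems.

```python
import numpy as np

def block_coordinate_descent(A, B, c, iters=50):
    """Minimise ||A x + B y - c||^2 by alternating least-squares updates of the
    two variable blocks x and y -- a toy stand-in for alternating between
    beamforming and RIS/STAR-RIS phase subproblems."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(A.shape[1])
    y = rng.standard_normal(B.shape[1])
    for _ in range(iters):
        # Update block x with y fixed (ordinary least squares).
        x, *_ = np.linalg.lstsq(A, c - B @ y, rcond=None)
        # Update block y with x fixed.
        y, *_ = np.linalg.lstsq(B, c - A @ x, rcond=None)
    return x, y, np.linalg.norm(A @ x + B @ y - c)

A = np.random.default_rng(1).standard_normal((20, 5))
B = np.random.default_rng(2).standard_normal((20, 4))
c = np.random.default_rng(3).standard_normal(20)
print(block_coordinate_descent(A, B, c)[2])  # residual after alternating updates
```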

[245] A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows

Eranga Bandara, Ross Gore, Peter Foytik, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Xueping Liang, Safdar H. Bouk, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan

Main category: cs.AI

TL;DR: A practical guide for designing, developing, and deploying production-grade agentic AI systems with structured engineering lifecycle and nine core best practices.

DetailsMotivation: Organizations face challenges in designing reliable, observable, maintainable, and safe agentic AI workflows as adoption accelerates across industry and research. There's a need for practical guidance on production-grade implementation.

Method: Introduces a structured engineering lifecycle covering workflow decomposition, multi-agent design patterns, Model Context Protocol (MCP), tool integration, deterministic orchestration, Responsible-AI considerations, and deployment strategies. Presents nine core best practices including tool-first design, pure-function invocation, single-responsibility agents, externalized prompt management, and containerized deployment.

Result: Provides a comprehensive case study of a multimodal news-analysis and media-generation workflow demonstrating the principles in practice. Offers architectural guidance, operational patterns, and practical implementation insights.

Conclusion: The paper offers a foundational reference for building robust, extensible, and production-ready agentic AI workflows by combining structured engineering approaches with practical best practices.

Abstract: Agentic AI marks a major shift in how autonomous systems reason, plan, and execute multi-step tasks. Unlike traditional single model prompting, agentic workflows integrate multiple specialized agents with different Large Language Models(LLMs), tool-augmented capabilities, orchestration logic, and external system interactions to form dynamic pipelines capable of autonomous decision-making and action. As adoption accelerates across industry and research, organizations face a central challenge: how to design, engineer, and operate production-grade agentic AI workflows that are reliable, observable, maintainable, and aligned with safety and governance requirements. This paper provides a practical, end-to-end guide for designing, developing, and deploying production-quality agentic AI systems. We introduce a structured engineering lifecycle encompassing workflow decomposition, multi-agent design patterns, Model Context Protocol(MCP), and tool integration, deterministic orchestration, Responsible-AI considerations, and environment-aware deployment strategies. We then present nine core best practices for engineering production-grade agentic AI workflows, including tool-first design over MCP, pure-function invocation, single-tool and single-responsibility agents, externalized prompt management, Responsible-AI-aligned model-consortium design, clean separation between workflow logic and MCP servers, containerized deployment for scalable operations, and adherence to the Keep it Simple, Stupid (KISS) principle to maintain simplicity and robustness. To demonstrate these principles in practice, we present a comprehensive case study: a multimodal news-analysis and media-generation workflow. By combining architectural guidance, operational patterns, and practical implementation insights, this paper offers a foundational reference to build robust, extensible, and production-ready agentic AI workflows.

[246] CARLoS: Retrieval via Concise Assessment Representation of LoRAs at Scale

Shahar Sarfaty, Adi Haviv, Uri Hacohen, Niva Elkin-Koren, Roi Livni, Amit H. Bermano

Main category: cs.AI

TL;DR: CARLoS is a framework for characterizing LoRAs without metadata by analyzing their behavior through image generation, creating a three-part representation (Directions, Strength, Consistency) that enables semantic retrieval and copyright analysis.

DetailsMotivation: The proliferation of generative components like LoRAs has created an unstructured ecosystem where discovery methods rely on unreliable user descriptions or biased popularity metrics, hindering usability.

Method: Analyzed over 650 LoRAs by using them in image generation across various prompts and seeds. Used CLIP embeddings and their difference to base-model generations to define a three-part representation: Directions (semantic shift), Strength (effect significance), and Consistency (effect stability).

Result: Developed an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in both automated and human evaluations.

Conclusion: CARLoS provides a practical system for LoRA analysis that not only enables effective retrieval but also supports copyright analysis by linking Strength and Consistency to legal notions of substantiality and volition, demonstrating broader relevance beyond just retrieval.

Abstract: The rapid proliferation of generative components, such as LoRAs, has created a vast but unstructured ecosystem. Existing discovery methods depend on unreliable user descriptions or biased popularity metrics, hindering usability. We present CARLoS, a large-scale framework for characterizing LoRAs without requiring additional metadata. Analyzing over 650 LoRAs, we employ them in image generation over a variety of prompts and seeds, as a credible way to assess their behavior. Using CLIP embeddings and their difference to a base-model generation, we concisely define a three-part representation: Directions, defining semantic shift; Strength, quantifying the significance of the effect; and Consistency, quantifying how stable the effect is. Using these representations, we develop an efficient retrieval framework that semantically matches textual queries to relevant LoRAs while filtering overly strong or unstable ones, outperforming textual baselines in automated and human evaluations. While retrieval is our primary focus, the same representation also supports analyses linking Strength and Consistency to legal notions of substantiality and volition, key considerations in copyright, positioning CARLoS as a practical system with broader relevance for LoRA analysis.
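
Code sketch: A rough sketch of how a (Directions, Strength, Consistency) triple could be computed from CLIP embeddings of LoRA generations versus base-model generations for the same prompts and seeds. The exact definitions here are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def carlos_representation(lora_embeds, base_embeds):
    """Compute (direction, strength, consistency) from paired CLIP embeddings
    of shape (n_images, d)."""
    diffs = lora_embeds - base_embeds                 # per-image semantic shift
    direction = diffs.mean(axis=0)                    # mean shift direction
    strength = np.linalg.norm(direction)              # how large the shift is
    unit = direction / (strength + 1e-8)
    # Consistency: mean cosine similarity of individual shifts to the mean shift.
    norms = np.linalg.norm(diffs, axis=1) + 1e-8
    consistency = float(np.mean((diffs @ unit) / norms))
    return unit, float(strength), consistency

rng = np.random.default_rng(0)
base = rng.standard_normal((32, 512))
lora = base + 0.3 * rng.standard_normal(512) + 0.05 * rng.standard_normal((32, 512))
print(carlos_representation(lora, base)[1:])  # (strength, consistency)
```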

[247] Interpolation in Knowledge Representation

Jean Christoph Jung, Patrick Koopmann, Matthias Knorr

Main category: cs.AI

TL;DR: This paper examines Craig interpolation and uniform interpolation in knowledge representation, focusing on description logics and logic programming, discussing theoretical results and practical methods for computing interpolants.

DetailsMotivation: Craig interpolation and uniform interpolation have important applications in knowledge representation (explainability, forgetting, modularization, reuse, learning), but many KR formalisms lack these properties and computing interpolants is challenging in practice.

Method: The paper takes a closer look at two prominent knowledge representation formalisms: description logics and logic programming. It analyzes theoretical results and practical methods for computing interpolants in these formalisms.

Result: The paper discusses existing theoretical results about interpolation properties in description logics and logic programming, as well as practical computational methods for generating interpolants in these formalisms.

Conclusion: The paper provides an overview of interpolation in knowledge representation, highlighting both the theoretical challenges (lack of interpolation properties in many formalisms) and practical approaches for computing interpolants in description logics and logic programming.

Abstract: Craig interpolation and uniform interpolation have many applications in knowledge representation, including explainability, forgetting, modularization and reuse, and even learning. At the same time, many relevant knowledge representation formalisms do in general not have Craig or uniform interpolation, and computing interpolants in practice is challenging. We have a closer look at two prominent knowledge representation formalisms, description logics and logic programming, and discuss theoretical results and practical methods for computing interpolants.

[248] EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie

Main category: cs.AI

TL;DR: EcomBench is a new benchmark for evaluating foundation agents in realistic e-commerce environments, addressing the gap between academic benchmarks and real-world applications.

DetailsMotivation: Existing benchmarks focus on academic or artificial scenarios but overlook real-world challenges. E-commerce provides a practical domain with diverse user interactions, dynamic market conditions, and real decision-making tasks that better test agent capabilities.

Method: Built EcomBench from genuine user demands in leading global e-commerce ecosystems, curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. Covers multiple e-commerce task categories with three difficulty levels.

Result: EcomBench provides a rigorous testbed for evaluating key agent capabilities including deep information retrieval, multi-step reasoning, and cross-source knowledge integration in realistic e-commerce contexts.

Conclusion: EcomBench addresses the need for practical evaluation of foundation agents by grounding assessment in real e-commerce environments, offering a dynamic benchmark that better measures agents’ real-world capabilities.

Abstract: Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated through human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.

[249] Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano

Main category: cs.AI

TL;DR: REST and REST+ benchmarks evaluate cross-modal inconsistency in MLLMs by testing reasoning consistency across image, text, and mixed modalities with identical semantic content.

DetailsMotivation: Multimodal LLMs are trained to represent vision and language in the same embedding space, but they cannot perform the same tasks consistently across different modalities, revealing a fundamental limitation in current MLLM architectures.

Method: Created REST and REST+ benchmarks containing samples with identical semantic information in three modalities (image, text, mixed). Evaluated 15 state-of-the-art MLLMs on cross-modal consistency, controlling for OCR issues and examining visual characteristics like text color, resolution, and vision token count.

Result: State-of-the-art MLLMs show significant cross-modal inconsistency even when OCR is correct. Visual characteristics (text color and resolution) and number of vision tokens affect performance. Neither rendering text as images nor images as text solves inconsistency. Consistency scores correlate with modality gap between text and images.

Conclusion: Current MLLMs have fundamental cross-modal inconsistency issues that cannot be solved by simple modality conversion. The consistency score provides mechanistic insight into modality gaps, highlighting a key challenge for future multimodal AI development.

Abstract: We introduce two new benchmarks REST and REST+(Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of vision tokens have an impact on model performance. Finally, we find that our consistency score correlates with the modality gap between text and images, highlighting a mechanistic interpretation of cross-modal inconsistent MLLMs.
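
Code sketch: As a toy illustration, a cross-modal consistency score can be computed by asking the same question over each rendering of the same content and checking whether the answers agree. `ask_model` is a hypothetical callable, and the real benchmark uses finer-grained scoring.

```python
def consistency_score(samples, ask_model):
    """Fraction of samples for which the model gives the same answer across the
    image, text, and mixed renderings of identical content.

    samples: iterable of (question, {"image": ..., "text": ..., "mixed": ...})
    ask_model: hypothetical callable (question, modality_payload) -> answer string
    """
    consistent = 0
    for question, payloads in samples:
        answers = {m: ask_model(question, p).strip().lower() for m, p in payloads.items()}
        consistent += len(set(answers.values())) == 1   # all three answers identical
    return consistent / len(samples)
```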

[250] CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale

ZeMing Gong, Austin T. Wang, Xiaoliang Huo, Joakim Bruslund Haurum, Scott C. Lowe, Graham W. Taylor, Angel X. Chang

Main category: cs.AI

TL;DR: Multimodal CLIP-style approach combining images and barcode DNA with text labels for zero-shot insect species classification, outperforming single-modality methods by 8%.

DetailsMotivation: Biodiversity measurement is crucial for ecosystem health, but existing methods use either photographic images or DNA separately. There's a need for a unified approach that leverages both modalities for more accurate species classification, especially for unknown species without task-specific fine-tuning.

Method: Uses CLIP-style contrastive learning to align three modalities: photographic images, barcode DNA sequences, and text-based taxonomic labels. Creates a unified embedding space where all three representations are aligned, enabling zero-shot classification without task-specific fine-tuning.

Result: Achieves over 8% higher accuracy than previous single-modality approaches on zero-shot learning tasks, demonstrating superior performance for both known and unknown insect species classification.

Conclusion: The multimodal approach combining images and barcode DNA with contrastive learning is highly effective for biodiversity studies, offering significant accuracy improvements over traditional single-modality methods for taxonomic classification.

Abstract: Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, barcode DNA, and text-based representations of taxonomic labels in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse barcode DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 8% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
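
Code sketch: A compact sketch of the kind of CLIP-style objective described, applying a symmetric InfoNCE loss to each pair of the three modalities (image, barcode DNA, taxonomic text). This is a generic formulation under assumed aligned batches, not the paper's exact training loss.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two L2-normalised embedding batches (n, d)
    whose rows are aligned (row i of a and b describe the same specimen)."""
    logits = (a @ b.T) / temperature
    labels = np.arange(len(a))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()        # positives are on the diagonal
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_loss(img, dna, txt):
    """Align all three modality pairs in a shared embedding space."""
    return info_nce(img, dna) + info_nce(img, txt) + info_nce(dna, txt)

rng = np.random.default_rng(0)
def norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
img, dna, txt = (norm(rng.standard_normal((8, 128))) for _ in range(3))
print(trimodal_loss(img, dna, txt))
```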

[251] SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records

Paloma Rabaey, Stefan Heytens, Thomas Demeester

Main category: cs.AI

TL;DR: SimSUM is a benchmark dataset of 10,000 simulated patient records linking clinical notes with structured tabular data for respiratory disease research, enabling studies on clinical information extraction with background variables.

DetailsMotivation: Existing open-source clinical datasets lack explicit links between structured features and clinical concepts in text, creating a need for a dataset that connects unstructured notes with tabular background information for better information extraction.

Method: Created 10,000 simulated patient records using a Bayesian network defined by domain experts to generate tabular data (symptoms, diagnoses, conditions), then used GPT-4o to generate clinical notes describing encounters, with span-level symptom annotations.

Result: Produced a benchmark dataset with expert-evaluated note quality and baseline predictive models on both tabular and textual data, supporting reproducible research in controlled settings.

Conclusion: SimSUM enables research on clinical information extraction with tabular background variables, clinical reasoning automation, causal effect estimation, and multi-modal synthetic data generation, though not for clinical decision support systems.

Abstract: Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text – namely, symptoms in the case of SimSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting.
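
Code sketch: A toy sketch of the two-step generation recipe: sample a tabular record from a (here, hand-written two-variable) Bayesian network, then build a prompt asking an LLM to write the corresponding clinical note. The real dataset uses an expert-defined network over many respiratory-disease variables and GPT-4o for the notes.

```python
import random

def sample_patient(rng=random):
    """Sample a toy record from a two-node Bayesian network (asthma -> wheezing)."""
    asthma = rng.random() < 0.15
    wheezing = rng.random() < (0.7 if asthma else 0.05)
    return {"asthma": asthma, "wheezing": wheezing}

def note_prompt(record):
    """Build the kind of prompt one might send to an LLM to turn the tabular
    record into a short clinical note (wording is illustrative only)."""
    findings = ", ".join(k for k, v in record.items() if v) or "no notable findings"
    return (f"Write a brief primary-care note for a patient encounter. "
            f"Findings to mention: {findings}.")

record = sample_patient()
print(record)
print(note_prompt(record))
```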

[252] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows

Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong

Main category: cs.AI

TL;DR: MobileRisk-Live: A dynamic sandbox environment and safety detection benchmark for mobile agent safety research, with OS-Sentinel framework achieving 10-30% improvements over existing approaches.

DetailsMotivation: While Vision-Language Model agents show promise for mobile automation, their potential for unsafe operations (system compromise, privacy leakage) raises significant concerns. Current safety detection across complex mobile environments remains critically underexplored.

Method: Introduces MobileRisk-Live (dynamic sandbox environment with safety detection benchmark) and OS-Sentinel, a hybrid safety detection framework combining Formal Verifier for system-level violations with VLM-based Contextual Judge for contextual risk assessment.

Result: OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. The framework provides critical insights for developing safer autonomous mobile agents.

Conclusion: MobileRisk-Live establishes foundation for mobile agent safety research, and OS-Sentinel’s hybrid approach demonstrates significant improvements in safety detection, fostering development of safer autonomous mobile agents.

Abstract: Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
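
Code sketch: A minimal sketch of the hybrid-detection pattern: a rule-based formal check runs first, and a VLM-based contextual judge is consulted only if no explicit violation is found. The rules and the `vlm_judge` callable are illustrative placeholders, not the paper's components.

```python
def formal_verifier(action, system_state):
    """Rule-based check for explicit system-level violations (placeholder rules)."""
    if action["type"] == "delete" and system_state.get("path", "").startswith("/system"):
        return "system file deletion"
    if action["type"] == "send" and system_state.get("contains_contacts"):
        return "possible privacy leakage"
    return None

def hybrid_safety_check(action, system_state, screenshot, vlm_judge):
    """Flag an action if either the formal verifier or a VLM-based contextual
    judge (a hypothetical callable returning a risk string or None) objects."""
    violation = formal_verifier(action, system_state)
    if violation:
        return {"unsafe": True, "reason": violation, "source": "formal"}
    contextual = vlm_judge(screenshot, action)
    if contextual:
        return {"unsafe": True, "reason": contextual, "source": "vlm"}
    return {"unsafe": False}
```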

[253] ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs

Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, Hyunjoon Kim

Main category: cs.AI

TL;DR: ProgRAG is a multi-hop KGQA framework that decomposes complex questions into sub-questions, progressively extends reasoning paths, uses uncertainty-aware pruning, and optimizes context organization to improve retrieval and reasoning reliability.

DetailsMotivation: Current KG-enhanced LLMs still face significant challenges including inaccurate retrieval, reasoning failures, long input contexts obscuring relevant information, and premature/incomplete reasoning when LLMs self-assess evidence sufficiency.

Method: ProgRAG decomposes complex questions into sub-questions, progressively extends partial reasoning paths by answering each sub-question, uses external retrievers for candidate evidence, applies uncertainty-aware pruning by LLM, and optimizes context by organizing partial reasoning paths.

Result: Experiments on three well-known datasets show ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.

Conclusion: ProgRAG effectively addresses retrieval and reasoning failures in KG-enhanced LLMs through progressive decomposition, uncertainty-aware pruning, and optimized context organization, demonstrating superior performance in multi-hop knowledge graph question answering.

Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
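
Code sketch: A skeleton of the progressive loop described above, with all four components passed in as hypothetical callables (question decomposition, retrieval over the KG, uncertainty-aware pruning, and per-step answering).

```python
def prog_rag(question, decompose, retrieve, prune, answer_step):
    """Progressive multi-hop KGQA skeleton.

    decompose(question)              -> list of sub-questions
    retrieve(sub_q, path)            -> candidate evidence (e.g. KG triples)
    prune(candidates)                -> filtered evidence (uncertainty-aware pruning)
    answer_step(sub_q, path, evidence) -> partial answer for this hop
    """
    path = []  # partial reasoning path built so far
    for sub_q in decompose(question):
        candidates = retrieve(sub_q, path)
        evidence = prune(candidates)
        partial = answer_step(sub_q, path, evidence)
        path.append({"sub_question": sub_q, "evidence": evidence, "answer": partial})
    return (path[-1]["answer"] if path else None), path
```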

[254] SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection

Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan, Brian Davis

Main category: cs.AI

TL;DR: SynBullying: A synthetic multi-LLM conversational dataset for cyberbullying research that simulates realistic bullying interactions with context-aware annotations and fine-grained labeling.

DetailsMotivation: Need for scalable and ethically safe alternatives to human data collection for cyberbullying research, addressing limitations of isolated posts and lack of conversational context in existing datasets.

Method: Leverage large language models (LLMs) to generate synthetic bullying conversations with multi-turn exchanges, context-aware annotations (considering context, intent, discourse dynamics), and fine-grained labeling across various CB categories.

Result: Created SynBullying dataset evaluated across five dimensions: conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. Tested utility as standalone training data and augmentation source for CB classification.

Conclusion: SynBullying provides a scalable, ethically safe synthetic dataset that captures realistic bullying interactions with rich contextual information, offering valuable resource for cyberbullying detection research and classification tasks.

Abstract: We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.
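
Code sketch: A minimal sketch of how a multi-LLM conversation could be simulated by round-robin turn-taking over role-conditioned personas; `generate_turn` is a hypothetical LLM call, and the paper's actual generation and annotation pipeline is richer than this.

```python
def simulate_conversation(personas, generate_turn, n_turns=8):
    """Round-robin multi-agent chat loop.

    personas: mapping from role name (e.g. 'instigator', 'target', 'bystander')
              to a system prompt describing that role.
    generate_turn(system_prompt, history) -> next message string (hypothetical LLM call).
    """
    history = []
    roles = list(personas)
    for turn in range(n_turns):
        role = roles[turn % len(roles)]
        message = generate_turn(personas[role], history)
        history.append({"role": role, "text": message})
    return history
```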

[255] The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness

Zhongjie Jiang

Main category: cs.AI

TL;DR: Synthetic data optimized for statistical smoothness removes human-like irregularities, causing model collapse. This paper proposes simulating cognitive processes instead, using a framework that reverse-engineers text into cognitive states and re-materializes it with human imperfections, achieving better alignment with human text and functional gains in financial applications.

DetailsMotivation: Current synthetic data production focuses on statistical smoothness, which systematically removes the long-tail, cognitively grounded irregularities found in human text. This leads to model collapse when models are trained on such impoverished data. The authors aim to address this AI data-collapse crisis by shifting from imitating surface data properties to simulating the underlying cognitive processes that generate human text.

Method: The paper introduces the Prompt-driven Cognitive Computing Framework (PMCSF), which consists of two core components: 1) Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and 2) Cognitive Text Encoder (CTE) that re-materializes these cognitive states into text enriched with human-typical imperfections using mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage evaluation pipeline including cognitive codec verification and functional gain evaluation.

Result: The CTE-generated text achieved a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passed double-blind professional media review, and achieved ICC > 0.9 for cognitive profile alignment. In functional tests in the A-share market, strategies using CTE-generated data reduced maximum drawdown by 47.4% during the 2015 crash and delivered 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33.

Conclusion: Modeling human cognitive limitations rather than copying surface data enables synthetic data with genuine functional gain. This cognitive simulation approach offers a viable technical pathway to resolve the AI data-collapse crisis by preserving the cognitively grounded irregularities that characterize human text.

Abstract: Although synthetic data is widely promoted as a remedy, its prevailing production paradigm – one optimizing for statistical smoothness – systematically removes the long-tail, cognitively grounded irregularities that characterize human text. Prolonged training on such statistically optimal but cognitively impoverished data accelerates model collapse. This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF), whose core consists of a Cognitive State Decoder (CSD) that reverse-engineers unstructured text into structured cognitive vectors, and a Cognitive Text Encoder (CTE) that re-materializes these states into text enriched with human-typical imperfections via mathematically defined Cognitive Perturbation Operators. The framework is validated through a two-stage objective evaluation pipeline. First, in cognitive codec verification, CTE text yields a Jensen-Shannon divergence of 0.0614 from human text (vs. 0.4431 for standard LLM output), passes double-blind professional media review, and achieves an intraclass correlation coefficient ICC > 0.9 for cognitive profile alignment across heterogeneous models. Second, in functional gain evaluation, isomorphic stress tests in the A-share market show that strategies incorporating CTE-generated data reduce maximum drawdown by 47.4% during the 2015 crash and deliver 8.6% Defensive Alpha, exceeding transaction costs by a factor of 33. Our findings demonstrate that modelling human cognitive limitations – not copying surface data – enables synthetic data with genuine functional gain, offering a viable technical pathway toward resolving the AI data-collapse crisis.

[256] The AI Consumer Index (ACE)

Julien Benchek, Rohit Shetty, Benjamin Hunsberger, Ajay Arun, Zach Richards, Brendan Foody, Osvald Nitski, Bertie Vidgen

Main category: cs.AI

TL;DR: The AI Consumer Index (ACE) is a new benchmark for evaluating frontier AI models on everyday consumer tasks, revealing significant performance gaps and hallucination issues.

DetailsMotivation: There's a need to assess whether state-of-the-art AI models can effectively help consumers with practical, everyday tasks like shopping, food, gaming, and DIY activities.

Method: Created ACE benchmark with 400 hidden test cases across four domains, plus 80 open-source devset cases. Evaluated 10 frontier models with websearch enabled using novel grading methodology that checks if responses are grounded in retrieved web sources.

Result: GPT 5 (Thinking = High) scored highest at 56.1%, followed by o3 Pro (55.2%) and GPT 5.1 (55.1%). Performance varied across domains, with Shopping scores under 50%. Models frequently hallucinated key information like prices.

Conclusion: There’s a substantial gap between even the best AI models’ performance and consumers’ actual needs, highlighting significant room for improvement in practical AI applications for everyday consumer tasks.

Abstract: We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform everyday consumer tasks. ACE contains a hidden heldout set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open sourcing 80 cases as a devset with a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with websearch turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) at 55.2% and GPT 5.1 (Thinking = High) at 55.1%. Model scores differ across domains, and in Shopping the top model scores under 50%. We find that models are prone to hallucinating key information, such as prices. ACE shows a substantial gap between the performance of even the best models and consumers’ AI needs.
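
Code sketch: As a toy version of grounding-based grading, one can check what fraction of key claims in a response literally appear in the retrieved web sources; the real ACE grader is LLM-based and checks relevance and grounding more flexibly.

```python
def grounded_fraction(claims, sources):
    """Fraction of extracted claims (e.g. prices, product names) that appear
    verbatim in at least one retrieved web source (case-insensitive)."""
    sources_lower = [s.lower() for s in sources]
    hits = sum(any(claim.lower() in src for src in sources_lower) for claim in claims)
    return hits / len(claims) if claims else 1.0

print(grounded_fraction(["$129.99", "USB-C"],
                        ["The dock costs $129.99 and ships with USB-C."]))  # -> 1.0
```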

[257] Rolling in the deep of cognitive and AI biases

Athena Vakali, Nicoleta Tantalaki

Main category: cs.AI

TL;DR: The paper argues that current AI fairness approaches focus too much on computational aspects while overlooking human cognitive biases, and proposes a new methodology that maps human heuristics to AI biases throughout the AI lifecycle.

DetailsMotivation: Despite efforts to make AI systems fair, they continue to produce discriminatory outcomes. The authors argue this is because current fairness approaches treat AI as purely computational systems, ignoring the sociotechnical context and human cognitive biases that influence AI development and deployment.

Method: The authors propose a radical new methodology that centers human cognitive biases in AI fairness analysis. They use cognitive science definitions and taxonomy of human heuristics to identify how human actions influence the AI lifecycle, map human heuristics to AI biases, and detect fairness intensities and inter-dependencies.

Result: The paper introduces a new mapping that justifies how human heuristics reflect in AI biases, revealing hidden pathways of human-to-AI bias transmission and identifying relevant fairness intensities and inter-dependencies throughout the AI lifecycle.

Conclusion: This human-centric approach to AI fairness can help reveal hidden bias causes and effects, contributing to deeper case studies that consider the sociotechnical nature of AI systems and the critical role of human cognitive biases in fairness outcomes.

Abstract: Nowadays, we delegate many of our decisions to Artificial Intelligence (AI) that acts either in solo or as a human companion in decisions made to support several sensitive domains, like healthcare, financial services and law enforcement. AI systems, even carefully designed to be fair, are heavily criticized for delivering misjudged and discriminated outcomes against individuals and groups. Numerous work on AI algorithmic fairness is devoted on Machine Learning pipelines which address biases and quantify fairness under a pure computational view. However, the continuous unfair and unjust AI outcomes, indicate that there is urgent need to understand AI as a sociotechnical system, inseparable from the conditions in which it is designed, developed and deployed. Although, the synergy of humans and machines seems imperative to make AI work, the significant impact of human and societal factors on AI bias is currently overlooked. We address this critical issue by following a radical new methodology under which human cognitive biases become core entities in our AI fairness overview. Inspired by the cognitive science definition and taxonomy of human heuristics, we identify how harmful human actions influence the overall AI lifecycle, and reveal human to AI biases hidden pathways. We introduce a new mapping, which justifies the human heuristics to AI biases reflections and we detect relevant fairness intensities and inter-dependencies. We envision that this approach will contribute in revisiting AI fairness under deeper human-centric case studies, revealing hidden biases cause and effects.

[258] NumCoKE: Ordinal-Aware Numerical Reasoning over Knowledge Graphs with Mixture-of-Experts and Contrastive Learning

Ming Yin, Zongsheng Cao, Qiqing Xia, Chenyang Tu, Neng Gao

Main category: cs.AI

TL;DR: NumCoKE is a framework for numerical reasoning over knowledge graphs that uses Mixture-of-Experts and ordinal contrastive learning to better integrate symbolic and numeric information while capturing fine-grained ordinal relationships.

DetailsMotivation: Existing methods for numerical reasoning over knowledge graphs have two key limitations: (1) incomplete semantic integration - they struggle to jointly encode entities, relations, and numerical attributes in a unified space, and (2) ordinal indistinguishability - they fail to capture fine-grained ordinal relationships (e.g., longer, heavier) due to subtle differences between close values and sampling imbalance.

Method: NumCoKE uses two main components: (1) Mixture-of-Experts Knowledge-Aware (MoEKA) encoder that jointly aligns symbolic and numeric components into a shared semantic space while dynamically routing attribute features to relation-specific experts, and (2) Ordinal Knowledge Contrastive Learning (OKCL) that constructs ordinal-aware positive and negative samples using prior knowledge to help the model discriminate subtle semantic shifts.

Result: Extensive experiments on three public KG benchmarks show that NumCoKE consistently outperforms competitive baselines across diverse attribute distributions, validating its superiority in both semantic integration and ordinal reasoning.

Conclusion: NumCoKE effectively addresses the challenges of incomplete semantic integration and ordinal indistinguishability in numerical reasoning over knowledge graphs through its innovative combination of Mixture-of-Experts architecture and ordinal contrastive learning, demonstrating superior performance on benchmark datasets.

Abstract: Knowledge graphs (KGs) serve as a vital backbone for a wide range of AI applications, including natural language understanding and recommendation. A promising yet underexplored direction is numerical reasoning over KGs, which involves inferring new facts by leveraging not only symbolic triples but also numerical attribute values (e.g., length, weight). However, existing methods fall short in two key aspects: (1) Incomplete semantic integration: Most models struggle to jointly encode entities, relations, and numerical attributes in a unified representation space, limiting their ability to extract relation-aware semantics from numeric information. (2) Ordinal indistinguishability: Due to subtle differences between close values and sampling imbalance, models often fail to capture fine-grained ordinal relationships (e.g., longer, heavier), especially in the presence of hard negatives. To address these challenges, we propose NumCoKE, a numerical reasoning framework for KGs based on Mixture-of-Experts and Ordinal Contrastive Embedding. To overcome (C1), we introduce a Mixture-of-Experts Knowledge-Aware (MoEKA) encoder that jointly aligns symbolic and numeric components into a shared semantic space, while dynamically routing attribute features to relation-specific experts. To handle (C2), we propose Ordinal Knowledge Contrastive Learning (OKCL), which constructs ordinal-aware positive and negative samples using prior knowledge, enabling the model to better discriminate subtle semantic shifts. Extensive experiments on three public KG benchmarks demonstrate that NumCoKE consistently outperforms competitive baselines across diverse attribute distributions, validating its superiority in both semantic integration and ordinal reasoning.
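
Code sketch: A toy surrogate for ordinal-aware training: a pairwise margin loss that penalises the model whenever its scores fail to preserve the ordering of the underlying numeric attribute values. This generic ranking loss is used only to illustrate the idea, not as the paper's OKCL objective.

```python
import numpy as np

def ordinal_contrastive_loss(scores, values, margin=0.1):
    """For every pair with values[i] > values[j], require scores[i] to exceed
    scores[j] by at least `margin`; average hinge penalty over ordered pairs."""
    loss, pairs = 0.0, 0
    for i in range(len(values)):
        for j in range(len(values)):
            if values[i] > values[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)

values = np.array([2.0, 5.0, 9.0])   # e.g. attribute "length" for three entities
scores = np.array([0.1, 0.4, 0.9])   # model scores that already preserve the order
print(ordinal_contrastive_loss(scores, values))  # -> 0.0, no ordinal violations
```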

[259] Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2

Yuri Chervonyi, Trieu H. Trinh, Miroslav Olšák, Xiaomeng Yang, Hoang Nguyen, Marcelo Menegali, Junehyuk Jung, Junsu Kim, Vikas Verma, Quoc V. Le, Thang Luong

Main category: cs.AI

TL;DR: AlphaGeometry2 (AG2) significantly improves geometry problem solving, achieving 84% success on 25 years of IMO problems and silver-medal performance at IMO 2024.

DetailsMotivation: To create an AI system that can solve complex geometry problems at International Math Olympiad (IMO) level, surpassing human gold medalist performance and moving toward fully automated natural language problem solving.

Method: Extended the AlphaGeometry language to handle object movements, linear equations of angles/ratios/distances, and non-constructive problems. Used Gemini architecture for better language modeling, implemented knowledge-sharing between search trees, enhanced symbolic engine, and improved synthetic data generation.

Result: Increased coverage rate from 66% to 88% on IMO 2000-2024 geometry problems. Boosted overall solving rate from 54% to 84% on 25 years of geometry problems. Achieved silver-medal standard at IMO 2024. Made progress toward fully automated natural language geometry problem solving.

Conclusion: AlphaGeometry2 represents a major advancement in AI geometry reasoning, achieving human-expert level performance and demonstrating significant progress toward fully automated mathematical problem solving from natural language input.

Abstract: We present AlphaGeometry2 (AG2), a significantly improved version of AlphaGeometry introduced in (Trinh et al., 2024), which has now surpassed an average gold medalist in solving Olympiad geometry problems. To achieve this, we first extend the original AlphaGeometry language to tackle problems involving movements of objects, and problems containing linear equations of angles, ratios, and distances. This, together with support for non-constructive problems, has markedly improved the coverage rate of the AlphaGeometry language on International Math Olympiads (IMO) 2000-2024 geometry problems from 66% to 88%. The search process of AG2 has also been greatly improved through the use of Gemini architecture for better language modeling, and a novel knowledge-sharing mechanism that enables effective communication between search trees. Together with further enhancements to the symbolic engine and synthetic data generation, we have significantly boosted the overall solving rate of AG to 84% on all geometry problems over the last 25 years, compared to 54% previously. AG2 was also part of the system that achieved the silver-medal standard at IMO 2024 https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/. Finally, we report progress towards using AG2 as a part of a fully automated system that reliably solves geometry problems from natural language input. Code: https://github.com/google-deepmind/alphageometry2.

[260] A Formalism for Optimal Search with Dynamic Heuristics (Extended Version)

Remo Christen, Florian Pommerening, Clemens Büchner, Malte Helmert

Main category: cs.AI

TL;DR: The paper formalizes dynamic heuristics that accumulate information during search, analyzes A* with such heuristics, and shows optimality results that apply to existing planning approaches.

DetailsMotivation: Most heuristics in heuristic search depend only on the current state, but some accumulate information during search (dynamic heuristics). Existing approaches using dynamic heuristics in A*-like algorithms rely on classic A* results, ignoring the complexities of mutable heuristics.

Method: The paper formalizes dynamic heuristics and uses them in a generic algorithm framework. It studies a particular instantiation that models A* with dynamic heuristics and proves general optimality results.

Result: The paper shows optimality results for A* with dynamic heuristics and demonstrates how existing approaches from classical planning can be viewed as special cases of this instantiation.

Conclusion: The framework enables direct application of optimality results to existing planning approaches that use dynamic heuristics, addressing the previously ignored complexities of mutable heuristics in search algorithms.

Abstract: While most heuristics studied in heuristic search depend only on the state, some accumulate information during search and thus also depend on the search history. Various existing approaches use such dynamic heuristics in $\mathrm{A}^*$-like algorithms and appeal to classic results for $\mathrm{A}^*$ to show optimality. However, doing so ignores the complexities of searching with a mutable heuristic. In this paper we formalize the idea of dynamic heuristics and use them in a generic algorithm framework. We study a particular instantiation that models $\mathrm{A}^*$ with dynamic heuristics and show general optimality results. Finally we show how existing approaches from classical planning can be viewed as special cases of this instantiation, making it possible to directly apply our optimality results.
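
Code sketch: One concrete (and much narrower) example of a dynamic heuristic is an A* whose cached h-values are raised during search to max(h(n), min over successors of (edge cost + successor h)), an update that preserves admissibility. The sketch below is an illustration under these assumptions, not the paper's formalism.

```python
import heapq

def a_star_dynamic(start, goal, successors, h0):
    """A* with a mutable heuristic: expanding a node tightens its cached
    h-value using what the search has just observed about its successors."""
    h = dict(h0)                                   # learned lower bounds
    g = {start: 0.0}
    frontier = [(h.get(start, 0.0), start, [start])]
    while frontier:
        _, node, path = heapq.heappop(frontier)
        if node == goal:
            return g[node], path
        succs = successors(node)                   # list of (successor, edge_cost)
        if succs:
            # Dynamic update: raise h(node), staying admissible.
            h[node] = max(h.get(node, 0.0),
                          min(c + h.get(s, 0.0) for s, c in succs))
        for s, c in succs:
            if g[node] + c < g.get(s, float("inf")):
                g[s] = g[node] + c
                heapq.heappush(frontier, (g[s] + h.get(s, 0.0), s, path + [s]))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
print(a_star_dynamic("A", "D", lambda n: graph[n], {}))  # -> (3.0, ['A', 'B', 'C', 'D'])
```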

[261] Capability-Driven Skill Generation with LLMs: A RAG-Based Approach for Reusing Existing Libraries and Interfaces

Luis Miguel Vieira da Silva, Aljosha Köcher, Nicolas König, Felix Gehlhoff, Alexander Fay

Main category: cs.AI

TL;DR: LLM-based code generation for robotic skills using capabilities as contracts, integrating existing libraries via RAG architecture.

DetailsMotivation: Developing skill implementations that conform to capabilities is time-consuming and challenging in modular automation systems.

Method: Treat capabilities as contracts, use LLMs to generate executable code from natural language input, integrate existing libraries and interfaces via retrieval-augmented generation (RAG) architecture.

Result: Demonstrated feasibility and flexibility using an autonomous mobile robot controlled via Python and ROS 2.

Conclusion: Proposed method enables efficient skill implementation generation across different target languages by leveraging existing software assets.

Abstract: Modern automation systems increasingly rely on modular architectures, with capabilities and skills as one solution approach. Capabilities define the functions of resources in a machine-readable form and skills provide the concrete implementations that realize those capabilities. However, the development of a skill implementation conforming to a corresponding capability remains a time-consuming and challenging task. In this paper, we present a method that treats capabilities as contracts for skill implementations and leverages large language models to generate executable code based on natural language user input. A key feature of our approach is the integration of existing software libraries and interface technologies, enabling the generation of skill implementations across different target languages. We introduce a framework that allows users to incorporate their own libraries and resource interfaces into the code generation process through a retrieval-augmented generation architecture. The proposed method is evaluated using an autonomous mobile robot controlled via Python and ROS 2, demonstrating the feasibility and flexibility of the approach.
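
Code sketch: A rough sketch of the retrieval-augmented prompting step: the capability description (treated as a contract) and the top-k retrieved snippets from the user's own library documentation are assembled into a code-generation prompt. `retrieve_docs` and the prompt wording are assumptions for illustration, not the paper's implementation.

```python
def build_skill_prompt(capability, user_request, retrieve_docs, k=3):
    """Assemble a code-generation prompt for a skill implementation.

    capability: machine-readable capability description (dict), used as a contract
    retrieve_docs(query, k): hypothetical retriever over an embedded doc index
    """
    snippets = retrieve_docs(f"{capability['name']} {user_request}", k)
    context = "\n\n".join(f"### Doc {i + 1}\n{s}" for i, s in enumerate(snippets))
    return (
        "You are generating a skill implementation that must satisfy the "
        f"following capability contract:\n{capability}\n\n"
        f"Relevant library and interface documentation:\n{context}\n\n"
        f"User request: {user_request}\n"
        "Return only executable Python (ROS 2) code."
    )
```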

[262] PRO-V-R1: Reasoning Enhanced Programming Agent for RTL Verification

Yujie Zhao, Zhijing Wu, Boqin Yuan, Zhongming Yu, Hejia Zhang, Wentao Ni, Chia-Tung Ho, Haoxing Ren, Jishen Zhao

Main category: cs.AI

TL;DR: PRO-V-R1 is the first trainable open-source agentic framework for autonomous RTL verification, achieving 57.7% functional correctness and 34.0% fault detection, significantly outperforming SOTA systems.

DetailsMotivation: RTL verification consumes 60-70% of development time, but current LLM-based approaches focus on generation rather than verification. Existing methods rely on costly proprietary models (like GPT-4o) with data privacy risks, lacking an end-to-end open-source solution for autonomous verification.

Method: Three key contributions: (1) PRO-V sys - modular agentic system combining LLM reasoning with programmatic tool use; (2) Data construction pipeline using existing RTL datasets to build simulation-validated expert trajectories for supervised fine-tuning; (3) Efficient RL algorithm using verification-specific rewards from program-tool feedback to optimize workflow.

Result: PRO-V-R1 achieves a 57.7% functional correctness rate and 34.0% robust fault detection, significantly outperforming the base model’s 25.7% and 21.8% (respectively) obtained with the SOTA automatic verification system. It also outperforms large-scale proprietary LLMs in functional correctness, with comparable robustness for fault detection.

Conclusion: PRO-V-R1 represents a breakthrough as the first trainable open-source agentic framework for autonomous RTL verification, demonstrating superior performance over both proprietary models and existing SOTA systems while addressing cost and privacy concerns.

Abstract: Register-Transfer Level (RTL) verification is a primary bottleneck, consuming 60-70% of development time. While Large Language Models (LLMs) show promise for RTL automation, their performance and research focus have overwhelmingly centered on RTL generation rather than verification. Current methods for RTL verification rely on large scale proprietary models (e.g., GPT-4o) to generate Python-based functional references, incurring a high cost and raising data-privacy risks. To date, an end-to-end open-source solution for autonomous verification remains absent. We introduce PRO-V-R1, the first trainable open-source agentic framework for autonomous RTL verification. Our contributions are threefold: (1) we design PRO-V sys, a modular agentic system that couples LLM-based reasoning with programmatic tool use for RTL verification; (2) we establish a data construction pipeline that leverages existing RTL datasets to build simulation-validated, expert-level trajectories tailored for supervised fine-tuning (SFT) RTL verification agents; and (3) we implement an efficient reinforcement learning (RL) algorithm that uses verification-specific rewards derived from program-tool feedback to optimize the end-to-end verification workflow. Our empirical evaluation demonstrates PRO-V-R1 achieves a 57.7% functional correctness rate and 34.0% in robust fault detection, significantly outperforming the base model’s 25.7% and 21.8% (respectively) from the state-of-the-art (SOTA) automatic verification system. This configuration also outperforms large-scale proprietary LLMs in functional correctness and shows comparable robustness for fault detection.

[263] Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models

Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yuxin Li, Yumeng Wang, Yi R. Fung

Main category: cs.AI

TL;DR: The paper introduces RFMDataset to expose hidden reasoning failures in large language models through mathematical proof problems, revealing fundamental limitations in their logical reasoning capabilities.

DetailsMotivation: Current large reasoning models show high accuracy on popular datasets but this masks their true reasoning shortcomings due to numerical evaluation limitations and potential benchmark leakage. Mathematical proofs provide a rigorous diagnostic tool to expose these hidden failures.

Method: Created RFMDataset with 200 diverse mathematical proof problems and evaluated advanced models’ performance on it. Conducted in-depth analysis of failures to identify fine-grained error types.

Result: Models perform poorly on mathematical proofs (some produce fully correct proofs for fewer than 20% of problems), exhibit diverse reasoning failures including a lack of correctness guarantees for single-step reasoning, and show hallucination and incompleteness. Self-reflection is insufficient to resolve these logical dilemmas.

Conclusion: Current large reasoning models have fundamental limitations in logical reasoning. The findings reveal that formalized and fine-grained logical training is necessary to address these deficiencies, as self-reflection alone is insufficient.

Abstract: Large reasoning models (e.g., R1, o3) have demonstrated remarkable mathematical problem-solving abilities. However, the high reported accuracy of these advanced models on popular datasets, reliance on purely numerical evaluation and potential benchmark leakage, often masks their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. Specifically, we introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models’ performance on it. Our in-depth analysis of their failures uncovers 10 fine-grained error types, which shows fundamental limitations in current large reasoning models: 1) large reasoning models grapple profoundly with mathematical proofs, with some generating entirely correct proofs for less than 20% of problems and failing even on basic ones; 2) models exhibit a diverse spectrum of reasoning failures, prominently demonstrating the lack of guarantees for the correctness and rigor of single-step reasoning; and 3) models show hallucination and incompleteness during the reasoning process. Our findings reveal that models’ self-reflection is insufficient to resolve the current logical dilemmas, necessitating formalized and fine-grained logical training.

[264] SOCK: A Benchmark for Measuring Self-Replication in Large Language Models

Justin Chavarria, Rohan Raizada, Justin White, Eyad Alhetairshi

Main category: cs.AI

TL;DR: SOCK is a benchmark CLI that measures LLMs’ ability to self-replicate autonomously across computational contexts, categorizing them via RCL and PCL levels based on performance in five CLI tasks.

DetailsMotivation: To establish the first formalized benchmark for evaluating LLM self-replication capabilities, enabling standardized measurement of autonomous replication across different computational environments and helping track/mitigate potential self-replication threats in multi-agent systems.

Method: Developed a five-task suite using modern CLI utilities and computer processes, conducted experiments in controlled environments with LLMs acting agentically, computed R-scores for quantitative evaluation, and categorized LLMs into Replication-Capability Levels (RCL) and Persistence-Capability Levels (PCL) matrices.

Result: Evaluation of various open-weight and proprietary frontier models revealed significant obstacles to persistent self-replication and multi-agent systems, including context retention and multi-agent decision-making challenges.

Conclusion: SOCK provides the first standardized benchmark for LLM self-replication evaluation, offering tools to track multi-agent system effectiveness and mitigate self-replication threats, while identifying key research directions to safely reduce obstacles to functional multi-agent systems.

Abstract: We introduce SOCK, a benchmark command line interface (CLI) that measures large language models’ (LLMs) ability to self-replicate without human intervention. In this benchmark, self-replication is defined not only as an LLM’s ability to create a functioning and running copy of itself, but also the ability for that self-replication to persist and occur across different computational contexts. Accordingly, we’ve developed a system to categorize LLMs based on broad self-replication capabilities in two general classes, Replication-Capability Levels (RCL) and Persistence-Capability Levels (PCL). Using a five-task suite based on practically manipulable modern CLI utilities and computer processes, experiments are orchestrated in a controlled environment with an LLM acting agentically. The performance of the LLM on agent tasks is then computed to produce an R-score (a quantitative evaluation of overall self-replication ability) and data used to categorize LLMs into specific RCL-PCL matrices. SOCK offers two primary contributions: (1) Provides, to our knowledge, the first formalized definitions and benchmark suite for evaluating LLM self-replication, with the goal of establishing a standard for future research; (2) Allows the industry to track the effectiveness of future multi-agent systems and mitigate potential self-replication threat vectors within them. The results compiled from evaluating a variety of open-weight and proprietary frontier models reveal significant obstacles to persistent self-replication and multi-agent systems, including context retention and multi-agent decision-making. We propose future research directions to safely reduce the severity of these obstacles, potentially lowering future risk of more functional multi-agent systems.
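
The summary states that per-task performance is aggregated into an R-score and mapped to RCL/PCL levels, but gives no formula. Purely as an illustration of that aggregation step, a toy version could look like the following; the task names, weights, and level thresholds are invented for the example and are not SOCK's actual definitions.

```python
def r_score(task_pass_rates: dict[str, float],
            weights: dict[str, float] | None = None) -> float:
    """Toy aggregation of per-task success rates into a single scalar.

    task_pass_rates: success rate in [0, 1] for each of the five CLI tasks.
    The real SOCK formula is not given in the summary; this is illustrative only.
    """
    weights = weights or {t: 1.0 for t in task_pass_rates}
    total = sum(weights.values())
    return sum(weights[t] * task_pass_rates[t] for t in task_pass_rates) / total

def capability_level(score: float, thresholds=(0.25, 0.5, 0.75)) -> int:
    """Map a scalar score to a coarse level (hypothetical RCL/PCL-style binning)."""
    return sum(score >= t for t in thresholds)

# Example with invented task names: a model that replicates inconsistently
example = {"spawn_copy": 0.8, "persist_restart": 0.4, "cross_host": 0.1,
           "resource_acquire": 0.6, "multi_agent_handoff": 0.2}
print(r_score(example), capability_level(r_score(example)))
```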

[265] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Main category: cs.AI

TL;DR: NewtonBench is a new benchmark for evaluating LLMs’ scientific law discovery capabilities, addressing limitations of existing benchmarks through counterfactual law shifts and interactive model discovery across 324 physics tasks.

DetailsMotivation: Existing benchmarks for scientific law discovery suffer from a methodological trilemma (trading off scientific relevance, scalability, and memorization resistance) and oversimplify discovery as static function fitting rather than capturing the authentic interactive scientific process.

Method: NewtonBench uses counterfactual law shifts (systematic alterations of canonical laws) to generate 324 tasks across 12 physics domains. It elevates evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles.

Result: Frontier LLMs show clear but fragile discovery capabilities that degrade with increasing system complexity and are extremely sensitive to observational noise. Paradoxically, tool assistance (code interpreter) can hinder more capable models by causing premature shift from exploration to exploitation, leading to suboptimal solutions.

Conclusion: Robust, generalizable discovery in complex interactive environments remains a core challenge. NewtonBench provides a scalable, robust, and scientifically authentic testbed for measuring progress and guiding development of next-generation AI agents capable of genuine scientific discovery.

Abstract: Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
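
To make the idea of a counterfactual law shift and interactive model discovery concrete, the toy example below alters the exponent of Newtonian gravity and exposes it only through a noisy simulated experiment that an agent must probe. The shifted exponent, noise level, and estimation procedure are illustrative assumptions, not the benchmark's actual tasks.

```python
import math
import random

def shifted_gravity(m1, m2, r, G=6.674e-11, exponent=2.5):
    """Counterfactual variant of F = G*m1*m2 / r**2 with a shifted exponent."""
    return G * m1 * m2 / r ** exponent

def run_experiment(m1, m2, r, noise=0.01):
    """Simulated oracle the agent may query; adds observational noise."""
    return shifted_gravity(m1, m2, r) * (1.0 + random.gauss(0.0, noise))

# An agent probing the system: vary r with masses fixed, then estimate the
# exponent from log-log slopes (force ~ r**-p, so the slope is -p).
m1, m2 = 5.0e3, 2.0e3
r_values = [1.0, 2.0, 4.0, 8.0]
obs = [run_experiment(m1, m2, r) for r in r_values]
slopes = [
    (math.log(obs[i + 1]) - math.log(obs[i]))
    / (math.log(r_values[i + 1]) - math.log(r_values[i]))
    for i in range(len(obs) - 1)
]
print("estimated exponent:", -sum(slopes) / len(slopes))  # ~2.5 up to noise
```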

[266] From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production

Segev Shlomov, Alon Oved, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Łukasz Strąk, Elizabeth Koumpan, Yinon Goldshtein, Eilam Shapira, Nir Mashkif, Asaf Adi

Main category: cs.AI

TL;DR: IBM developed CUGA, a generalist agent with hierarchical planner-executor architecture that achieves SOTA on benchmarks and shows promise in enterprise BPO talent acquisition pilot, addressing scalability, auditability, safety, and governance requirements.

DetailsMotivation: Enterprises struggle to move AI agents from prototypes to production systems delivering business value due to fragmented frameworks, slow development, and lack of standardized evaluation. While generalist agents show promise on academic benchmarks, evidence of their use in enterprise production settings remains limited.

Method: CUGA uses hierarchical planner-executor architecture with strong analytical foundations. It was evaluated on AppWorld and WebArena benchmarks, then piloted in Business-Process-Outsourcing talent acquisition domain. The paper introduces BPO-TA benchmark with 26 tasks across 13 analytics endpoints for enterprise assessment.

Result: CUGA achieved state-of-the-art performance on AppWorld and WebArena. In BPO-TA pilot, it approached specialized agent accuracy while showing potential for reducing development time and cost. The pilot addressed enterprise requirements for scalability, auditability, safety, and governance.

Conclusion: The paper provides early evidence of generalist agents operating at enterprise scale and distills technical/organizational lessons from the pilot. It outlines requirements and next steps for advancing research-grade architectures into robust, enterprise-ready systems, with CUGA open-sourced for community use.

Abstract: Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across task types, applications, and modalities. Yet, evidence of their use in production enterprise settings remains limited. This paper reports IBM’s experience developing and piloting the Computer Using Generalist Agent (CUGA), which has been open-sourced for the community (https://github.com/cuga-project/cuga-agent). CUGA adopts a hierarchical planner–executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was evaluated in a pilot within the Business-Process-Outsourcing talent acquisition domain, addressing enterprise requirements for scalability, auditability, safety, and governance. To support assessment, we introduce BPO-TA, a 26-task benchmark spanning 13 analytics endpoints. In preliminary evaluations, CUGA approached the accuracy of specialized agents while indicating potential for reducing development time and cost. Our contribution is twofold: presenting early evidence of generalist agents operating at enterprise scale, and distilling technical and organizational lessons from this initial pilot. We outline requirements and next steps for advancing research-grade architectures like CUGA into robust, enterprise-ready systems.

[267] Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models

Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Gang Liu, Jiahong Yan, Chun Yuan, Dian Li

Main category: cs.AI

TL;DR: A reasoning-based problem generator that creates adaptive difficulty problems by using solver feedback as reward, achieving 2.5% average improvement on reasoning benchmarks.

DetailsMotivation: Existing data synthesis approaches for training reasoning models have two main issues: 1) indiscriminate generation that ignores solver ability, producing low-value problems or requiring complex pipelines to balance difficulty, and 2) lack of reasoning in problem generation, leading to shallow variants.

Method: Develops a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to solver ability. Constructs related problem pairs augmented with intermediate problem-design chain-of-thought from a reasoning model. Uses solver feedback on synthetic problems as reward signal to calibrate difficulty and produce complementary problems near solver’s competence edge.

Result: Achieves average 2.5% improvement on 10 mathematical and general reasoning benchmarks, generalizes to both language and vision-language models. Enables co-evolution where solver trained on synthesized data provides improved rewards for continued generator training, yielding additional 0.7% performance gain.

Conclusion: The reasoning-based problem generator with adaptive difficulty calibration effectively addresses limitations of existing data synthesis methods, enabling scalable creation of high-quality training data through co-evolution between generator and solver.

Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver’s ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver’s ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver’s feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver’s competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
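
One way to picture the solver-feedback reward described above is a shaping term that peaks when the solver succeeds on roughly half of its sampled attempts, i.e., when the problem sits near the edge of its competence. The sketch below is a plausible form of such a reward under that assumption; the paper's exact reward is not specified in the summary, and `solver` is a hypothetical callable.

```python
def solve_rate(solver, problem, n_samples: int = 8) -> float:
    """Fraction of sampled solver attempts judged correct.

    solver(problem) is assumed to return (answer, is_correct); purely illustrative.
    """
    return sum(solver(problem)[1] for _ in range(n_samples)) / n_samples

def difficulty_reward(p_solve: float, target: float = 0.5) -> float:
    """Reward for the generator, peaked near the solver's competence edge.

    1.0 when the solver succeeds on ~target of attempts; 0.0 when the problem is
    trivially easy (p=1) or hopeless (p=0). One plausible shaping, not the paper's.
    """
    if p_solve in (0.0, 1.0):
        return 0.0
    return 1.0 - abs(p_solve - target) / max(target, 1.0 - target)
```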

[268] No-Regret Strategy Solving in Imperfect-Information Games via Pre-Trained Embedding

Yanchang Fu, Shengda Liu, Pei Xu, Kaiqi Huang

Main category: cs.AI

TL;DR: Embedding CFR algorithm uses low-dimensional continuous embeddings of information sets instead of discrete clustering for better strategy solving in imperfect-information games like poker.

DetailsMotivation: Current AI methods use discrete clustering for abstraction which irreversibly discards subtle differences between information sets, compromising strategy quality. There's a need for better abstraction that preserves these quantifiable differences.

Method: Proposes Embedding CFR algorithm that pre-trains information set features into low-dimensional continuous embedding space (inspired by word embeddings in NLP), then performs strategy solving with regret accumulation and updates in this embedding space.

Result: Embedding CFR achieves significantly faster exploitability convergence than cluster-based abstraction algorithms with same spatial overhead. Theoretical analysis verifies ability to reduce cumulative regret.

Conclusion: First algorithm in poker AI to pre-train information set abstractions via low-dimensional embedding for strategy solving, offering superior performance over traditional discrete clustering approaches.

Abstract: High-quality information set abstraction remains a core challenge in solving large-scale imperfect-information extensive-form games (IIEFGs), such as no-limit Texas Hold’em, where the finite nature of spatial resources hinders solving strategies for the full game. State-of-the-art AI methods rely on pre-trained discrete clustering for abstraction, yet their hard classification irreversibly discards critical information: specifically, the quantifiable subtle differences between information sets that are vital for strategy solving, thus compromising the quality of such solving. Inspired by the word embedding paradigm in natural language processing, this paper proposes the Embedding CFR algorithm, a novel approach for solving strategies in IIEFGs within an embedding space. The algorithm pre-trains and embeds the features of individual information sets into an interconnected low-dimensional continuous space, where the resulting vectors more precisely capture both the distinctions and connections between information sets. Embedding CFR introduces a strategy-solving process driven by regret accumulation and strategy updates in this embedding space, with supporting theoretical analysis verifying its ability to reduce cumulative regret. Experiments on poker show that with the same spatial overhead, Embedding CFR achieves significantly faster exploitability convergence compared to cluster-based abstraction algorithms, confirming its effectiveness. Furthermore, to our knowledge, it is the first algorithm in poker AI that pre-trains information set abstractions via low-dimensional embedding for strategy solving.
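
At the core of any CFR variant is regret matching over per-information-set cumulative regrets; Embedding CFR keys those regrets by a learned continuous embedding rather than a discrete cluster ID. The sketch below shows only that generic bookkeeping, with a naive nearest-neighbour lookup standing in for the paper's pre-trained embedding space; it is not the authors' algorithm.

```python
import numpy as np

class EmbeddedRegretTable:
    """Cumulative regrets keyed by embedded information sets (nearest-neighbour lookup)."""

    def __init__(self, n_actions: int):
        self.n_actions = n_actions
        self.keys: list[np.ndarray] = []     # stored embeddings
        self.regrets: list[np.ndarray] = []  # cumulative regret per stored embedding

    def _index(self, emb: np.ndarray, tol: float = 0.25) -> int:
        """Reuse the closest stored embedding if near enough, else add a new entry."""
        if self.keys:
            d = [float(np.linalg.norm(emb - k)) for k in self.keys]
            i = int(np.argmin(d))
            if d[i] < tol:
                return i
        self.keys.append(emb)
        self.regrets.append(np.zeros(self.n_actions))
        return len(self.keys) - 1

    def strategy(self, emb: np.ndarray) -> np.ndarray:
        """Regret matching: play actions proportionally to positive cumulative regret."""
        r = np.maximum(self.regrets[self._index(emb)], 0.0)
        return r / r.sum() if r.sum() > 0 else np.full(self.n_actions, 1.0 / self.n_actions)

    def update(self, emb: np.ndarray, instant_regret: np.ndarray) -> None:
        """Accumulate one iteration's counterfactual regret for this information set."""
        self.regrets[self._index(emb)] += instant_regret
```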

[269] What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity

Alexis Audran-Reiss, Jordi Armengol-Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo, Rishi Hazra, Despoina Magka, Michael Shvartsman, Parth Pathak, Justine T Kao, Lucia Cipolina-Kun, Bhavul Gauri, Jean-Christophe Gagnon-Audet, Emanuel Tewolde, Jenny Zhang, Taco Cohen, Yossi Adi, Tatiana Shavrina, Yoram Bachrach

Main category: cs.AI

TL;DR: The paper examines how ideation diversity affects AI research agent performance, finding that higher diversity correlates with better results across models, scaffolds, and metrics.

DetailsMotivation: AI research agents promise to accelerate scientific progress but their success factors are not well understood. The paper aims to investigate the role of ideation diversity in agent performance.

Method: 1) Analyze agent trajectories on MLE-bench across different models and agent scaffolds; 2) Run controlled experiments modifying ideation diversity; 3) Examine additional evaluation metrics beyond standard scoring.

Result: Higher-performing agents have increased ideation diversity. Controlled experiments confirm that higher ideation diversity leads to stronger performance. Findings hold across different performance metrics.

Conclusion: Ideation diversity is a key factor in AI research agent success, with higher diversity consistently correlating with better performance across various evaluation approaches.

Abstract: AI research agents offer the promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark to evaluate AI research agents, across different models and agent scaffolds. Our analysis reveals that different models and agent scaffolds yield varying degrees of ideation diversity, and that higher-performing agents tend to have increased ideation diversity. Further, we run a controlled experiment where we modify the degree of ideation diversity, demonstrating that higher ideation diversity results in stronger performance. Finally, we strengthen our results by examining additional evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings still hold across other agent performance metrics.

[270] ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025

Qiang Xu, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li

Main category: cs.AI

TL;DR: ChemO is a new benchmark from the International Chemistry Olympiad 2025 that addresses chemistry’s multimodal challenges through automated assessment innovations and a multi-agent framework achieving state-of-the-art performance.

DetailsMotivation: Chemistry has remained an open challenge for AI reasoning due to its unique multimodal symbolic language, unlike mathematics and physics which already have established Olympiad-level benchmarks. There's a need for advanced AI testbeds in chemistry that can handle visual outputs and complex reasoning.

Method: Two key innovations: 1) Assessment-Equivalent Reformulation (AER) converts visual output problems into computationally tractable formats, and 2) Structured Visual Enhancement (SVE) disentangles visual perception from chemical reasoning. They also propose ChemLabs, a hierarchical multi-agent framework with specialized agents for problem decomposition, perception, reasoning, and auditing.

Result: The top configuration achieves 93.6/100 score, surpassing estimated human gold medal threshold and establishing new state-of-the-art in automated chemical problem-solving. Combining SVE with the multi-agent system yields dramatic performance gains on state-of-the-art multimodal models.

Conclusion: ChemO successfully addresses the challenge of chemistry benchmarking through innovative assessment methods and a collaborative multi-agent approach, demonstrating that AI can achieve expert-level performance in complex chemical reasoning tasks.

Abstract: Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model’s visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO

[271] Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?

Matt MacDermott, Qiyao Wei, Rada Djoneva, Francis Rhys Ward

Main category: cs.AI

TL;DR: Training incentives affect how monitorable AI reasoning chains are, with adversarial optimization degrading monitorability while other common incentives show no consistent effects.

DetailsMotivation: To understand how different training incentives affect the monitorability of AI reasoning chains, since monitoring reasoning is important for AI safety but depends on how faithfully the chain of thought reflects actual reasoning.

Method: Introduce a novel methodology measuring monitorability based on whether a monitor can predict key latent variables using the model’s reasoning. Test different training incentives (length penalties, KL regularization, adversarial optimization, direct monitorability optimization) while controlling for accuracy.

Result: Common incentives (length penalties and KL regularization) show no consistent effects on monitorability. Adversarial optimization (penalizing monitor accuracy) degrades monitor performance. Direct optimization for monitorability does not reliably lead to improvements.

Conclusion: Training incentives can affect monitorability, with adversarial optimization being particularly harmful, but commonly used incentives don’t consistently impact monitorability, suggesting careful consideration of training approaches for safety-critical AI systems.

Abstract: AI systems that output their reasoning in natural language offer an opportunity for safety – we can monitor their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different training incentives, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model’s reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
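
Under the summary's description, monitorability boils down to how well a monitor can recover a key latent variable from the model's chain of thought. Assuming each trajectory carries a ground-truth latent label, a minimal version of the measurement is just the monitor's accuracy over CoTs, as in the sketch below; the monitor itself and the labeling are placeholders, not the paper's setup.

```python
def monitorability(trajectories, monitor) -> float:
    """Fraction of trajectories whose latent variable the monitor recovers from the CoT.

    trajectories: iterable of (chain_of_thought: str, latent_label) pairs.
    monitor: callable mapping a CoT string to a predicted label (e.g. another LLM
    plus a parser); both are assumptions made for illustration.
    """
    results = [monitor(cot) == label for cot, label in trajectories]
    return sum(results) / len(results) if results else 0.0
```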

[272] Aetheria: A multimodal interpretable content safety framework based on multi-agent debate and collaboration

Yuxiang He, Jian Zhao, Yuchen Yuan, Tianle Zhang, Wei Cai, Haojie Cheng, Ziyan Shi, Ming Zhu, Haichuan Tang, Chi Zhang, Xuelong Li

Main category: cs.AI

TL;DR: Aetheria is a multimodal interpretable content safety framework using multi-agent debate and collaboration to improve content moderation through transparent, traceable analysis.

DetailsMotivation: Current content moderation systems have limitations: they often use single models or fixed pipelines, struggle with identifying implicit risks, and lack interpretable judgment processes. The exponential growth of digital content exacerbates these challenges.

Method: Proposes Aetheria framework with five core agents that collaborate through dynamic, mutually persuasive debate mechanism. Uses RAG-based knowledge retrieval to ground the analysis. The multi-agent architecture enables in-depth multimodal content analysis and adjudication.

Result: Comprehensive experiments on AIR-Bench benchmark show Aetheria generates detailed, traceable audit reports and achieves significant advantages over baselines in overall content safety accuracy, especially for identifying implicit risks.

Conclusion: Aetheria establishes a transparent and interpretable paradigm for content moderation, advancing trustworthy AI in content safety by providing explainable, collaborative decision-making processes.

Abstract: The exponential growth of digital content presents significant challenges for content safety. Current moderation systems, often based on single models or fixed pipelines, exhibit limitations in identifying implicit risks and providing interpretable judgment processes. To address these issues, we propose Aetheria, a multimodal interpretable content safety framework based on multi-agent debate and collaboration. Employing a collaborative architecture of five core agents, Aetheria conducts in-depth analysis and adjudication of multimodal content through a dynamic, mutually persuasive debate mechanism, which is grounded by RAG-based knowledge retrieval. Comprehensive experiments on our proposed benchmark (AIR-Bench) validate that Aetheria not only generates detailed and traceable audit reports but also demonstrates significant advantages over baselines in overall content safety accuracy, especially in the identification of implicit risks. This framework establishes a transparent and interpretable paradigm, significantly advancing the field of trustworthy AI content moderation.

[273] Orchestrator Multi-Agent Clinical Decision Support System for Secondary Headache Diagnosis in Primary Care

Xizhi Wu, Nelly Estefanie Garduno-Rapp, Justin F Rousseau, Mounika Thakkallapally, Hang Zhang, Yuelyu Ji, Shyam Visweswaran, Yifan Peng, Yanshan Wang

Main category: cs.AI

TL;DR: Multi-agent LLM system improves secondary headache diagnosis accuracy through structured, interpretable reasoning with specialized agents.

DetailsMotivation: Secondary headaches require urgent evaluation but are challenging to diagnose in primary care due to time constraints, incomplete information, and diverse symptom presentations, leading to under-recognition and inappropriate care.

Method: Developed an LLM-based multi-agent clinical decision support system with orchestrator-specialist architecture. System uses seven domain-specialized agents producing structured, evidence-grounded rationales, with a central orchestrator for task decomposition and agent routing. Evaluated against 90 expert-validated cases and compared with single-LLM baselines using QPrompt and GPrompt strategies across five open-source LLMs.

Result: Orchestrated multi-agent system with GPrompt consistently achieved highest F1 scores, with larger gains in smaller models. Structured multi-agent reasoning improved accuracy beyond prompt engineering alone.

Conclusion: Multi-agent LLM systems offer transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis, improving accuracy through structured reasoning.

Abstract: Unlike most primary headaches, secondary headaches need specialized care and can have devastating consequences if not treated promptly. Clinical guidelines highlight several ‘red flag’ features, such as thunderclap onset, meningismus, papilledema, focal neurologic deficits, signs of temporal arteritis, systemic illness, and the ‘worst headache of their life’ presentation. Despite these guidelines, determining which patients require urgent evaluation remains challenging in primary care settings. Clinicians often work with limited time, incomplete information, and diverse symptom presentations, which can lead to under-recognition and inappropriate care. We present a large language model (LLM)-based multi-agent clinical decision support system built on an orchestrator-specialist architecture, designed to perform explicit and interpretable secondary headache diagnosis from free-text clinical vignettes. The multi-agent system decomposes diagnosis into seven domain-specialized agents, each producing a structured and evidence-grounded rationale, while a central orchestrator performs task decomposition and coordinates agent routing. We evaluated the multi-agent system using 90 expert-validated secondary headache cases and compared its performance with a single-LLM baseline across two prompting strategies: question-based prompting (QPrompt) and clinical practice guideline-based prompting (GPrompt). We tested five open-source LLMs (Qwen-30B, GPT-OSS-20B, Qwen-14B, Qwen-8B, and Llama-3.1-8B), and found that the orchestrated multi-agent system with GPrompt consistently achieved the highest F1 scores, with larger gains in smaller models. These findings demonstrate that structured multi-agent reasoning improves accuracy beyond prompt engineering alone and offers a transparent, clinically aligned approach for explainable decision support in secondary headache diagnosis.

[274] MARL Warehouse Robots

Price Allman, Lian Thang, Dre Simmons, Salmon Riaz

Main category: cs.AI

TL;DR: QMIX outperforms IPPO in warehouse robotics MARL, but requires extensive hyperparameter tuning and faces scaling challenges for larger deployments.

DetailsMotivation: To evaluate and compare multi-agent reinforcement learning algorithms for cooperative warehouse robotics applications, specifically examining their performance in robotic warehouse environments.

Method: Comparative study of QMIX (value decomposition) and IPPO (independent learning) algorithms tested on Robotic Warehouse (RWARE) environment and custom Unity 3D simulation, with extensive hyperparameter tuning including extended epsilon annealing.

Result: QMIX significantly outperforms IPPO (3.25 mean return vs. 0.38), but requires 5M+ steps of epsilon annealing for sparse reward discovery. Successful deployment in Unity ML-Agents achieved consistent package delivery after 1M training steps.

Conclusion: MARL shows promise for small-scale warehouse robotics deployments (2-4 robots), but faces significant scaling challenges for larger applications. QMIX’s value decomposition approach is superior but requires careful hyperparameter tuning.

Abstract: We present a comparative study of multi-agent reinforcement learning (MARL) algorithms for cooperative warehouse robotics. We evaluate QMIX and IPPO on the Robotic Warehouse (RWARE) environment and a custom Unity 3D simulation. Our experiments reveal that QMIX’s value decomposition significantly outperforms independent learning approaches (achieving 3.25 mean return vs. 0.38 for advanced IPPO), but requires extensive hyperparameter tuning – particularly extended epsilon annealing (5M+ steps) for sparse reward discovery. We demonstrate successful deployment in Unity ML-Agents, achieving consistent package delivery after 1M training steps. While MARL shows promise for small-scale deployments (2-4 robots), significant scaling challenges remain. Code and analyses: https://pallman14.github.io/MARL-QMIX-Warehouse-Robots/
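
The extended epsilon annealing the paper highlights (5M+ steps before exploration bottoms out) is a simple linear schedule stretched over a long horizon so that agents keep exploring long enough to discover RWARE's sparse delivery rewards. A sketch of such a schedule is below; the start and end values are common defaults, not necessarily the authors' exact settings.

```python
def epsilon(step: int,
            eps_start: float = 1.0,
            eps_end: float = 0.05,
            anneal_steps: int = 5_000_000) -> float:
    """Linearly annealed exploration rate over an extended horizon."""
    frac = min(step, anneal_steps) / anneal_steps
    return eps_start + frac * (eps_end - eps_start)

# e.g. epsilon(0) == 1.0, epsilon(2_500_000) == 0.525, epsilon(5_000_000) == 0.05
```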

[275] SlideGen: Collaborative Multimodal Agents for Scientific Slide Generation

Xin Liang, Xiang Zhang, Yiwei Xu, Siqi Sun, Chenyu You

Main category: cs.AI

TL;DR: SlideGen is an agentic framework that generates high-quality academic slides from scientific papers using multimodal reasoning and visual planning, outperforming existing methods.

DetailsMotivation: Existing approaches reduce slide generation to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. There's a need for systems that can handle both long context understanding and deliberate visual planning.

Method: SlideGen uses an agentic, modular, and visual-in-the-loop framework with vision-language agents that collaboratively reason over document structure and semantics. The system integrates coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement to produce editable PPTX slides.

Result: SlideGen outperforms existing methods across diverse benchmarks in visual quality, content faithfulness, and readability, establishing new state-of-the-art in automated slide generation and delivering expert-level quality slides.

Conclusion: The work establishes a foundation for design-aware multimodal slide generation and demonstrates how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.

Abstract: Generating academic slides from scientific papers is a challenging multimodal reasoning task that requires both long-context understanding and deliberate visual planning. Existing approaches largely reduce it to text-only summarization, overlooking the visual component and design-intensive nature of slide creation. In this paper we introduce SlideGen, an agentic, modular, and visual-in-the-loop framework for scientific paper-to-slide generation. SlideGen orchestrates a group of vision-language agents that reason collaboratively over the document structure and semantics, producing editable PPTX slides with logical flow and compelling visual presentation. By integrating coordinated outlining, mapping, arrangement, note synthesis, and iterative refinement, our system consistently delivers slides of expert-level quality. Across diverse benchmarks and strong baselines, SlideGen outperforms existing methods in visual quality, content faithfulness, and readability, positioning it as the new state of the art in automated slide generation. Our work establishes a foundation for design-aware multimodal slide generation, demonstrating how agentic collaboration can bridge understanding and presentation in complex multimodal reasoning tasks.

[276] TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi

Main category: cs.AI

TL;DR: TRACE is a framework for transparent reasoning evaluation that uses auxiliary reasoning sets to diagnose reasoning trajectories in vision-language models, going beyond final-answer evaluation to expose intermediate reasoning failures.

DetailsMotivation: Current evaluation of large vision-language models for mathematical and scientific reasoning focuses only on final answers, which masks reasoning errors and allows silent failures to persist. There's a need for more transparent evaluation that examines the reasoning process itself.

Method: TRACE introduces Auxiliary Reasoning Sets (ARS) - compact sub-question-answer pairs that decompose complex problems. It evaluates intermediate reasoning steps through consistency-based metrics and diagnoses reasoning trajectories rather than just end results.

Result: Experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint specific reasoning steps where failures occur. TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths.

Conclusion: TRACE provides actionable signals for model improvement by exposing reasoning failures overlooked by standard evaluation, supporting effective filtering, debugging, and model refinement through transparent reasoning analysis.

Abstract: Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets, compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
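
A minimal version of the ARS consistency check described above, assuming each auxiliary sub-question comes with a reference answer and using naive string matching as the agreement test, is sketched below; the actual metric and confidence-region construction in TRACE are more involved.

```python
def ars_consistency(model_answers: list[str], reference_answers: list[str]) -> float:
    """Fraction of auxiliary sub-questions answered consistently with their references.

    model_answers / reference_answers: aligned answers to the ARS items; naive
    lowercase string matching stands in for a real consistency judge (illustrative).
    """
    assert len(model_answers) == len(reference_answers)
    matches = [
        a.strip().lower() == b.strip().lower()
        for a, b in zip(model_answers, reference_answers)
    ]
    return sum(matches) / len(matches) if matches else 0.0

def flag_unreliable(consistency: float, threshold: float = 0.7) -> bool:
    """Trajectories below the threshold fall outside a hypothetical confidence region."""
    return consistency < threshold
```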

[277] LightSearcher: Efficient DeepSearch via Experiential Memory

Hengzhi Lan, Yue Yu, Li Qian, Li Peng, Jie Wu, Wei Liu, Jian Luan, Ting Bai

Main category: cs.AI

TL;DR: LightSearcher is an efficient RL framework that uses textual experiential memory and adaptive reward shaping to reduce unnecessary search tool calls in DeepSearch systems while maintaining accuracy.

DetailsMotivation: RL-driven DeepSearch systems face a trade-off between accuracy and efficiency - frequent tool calls improve factual correctness but create computational overhead and reduced efficiency.

Method: Uses textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful patterns, plus adaptive reward shaping that penalizes redundant tool calls only in correct-answer scenarios.

Result: Maintains accuracy comparable to SOTA baseline ReSearch while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2% on four multi-hop QA benchmarks.

Conclusion: LightSearcher effectively balances the accuracy-efficiency trade-off in DeepSearch paradigms through its memory-based learning and adaptive reward mechanisms.

Abstract: DeepSearch paradigms have become a core enabler for deep reasoning models, allowing them to invoke external search tools to access up-to-date, domain-specific knowledge beyond parametric boundaries, thereby enhancing the depth and factual reliability of reasoning. Building upon this foundation, recent advances in reinforcement learning (RL) have further empowered models to autonomously and strategically control search tool usage, optimizing when and how to query external knowledge sources. Yet, these RL-driven DeepSearch systems often reveal a see-saw trade-off between accuracy and efficiency: frequent tool invocations can improve factual correctness but lead to unnecessary computational overhead and diminished efficiency. To address this challenge, we propose LightSearcher, an efficient RL framework that incorporates textual experiential memory by learning contrastive reasoning trajectories to generate interpretable summaries of successful reasoning patterns. In addition, it employs an adaptive reward shaping mechanism that penalizes redundant tool calls only in correct-answer scenarios. This design effectively balances the inherent accuracy-efficiency trade-off in DeepSearch paradigms. Experiments on four multi-hop QA benchmarks show that LightSearcher maintains accuracy comparable to SOTA baseline ReSearch, while reducing search tool invocations by 39.6%, inference time by 48.6%, and token consumption by 21.2%, demonstrating its superior efficiency.
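
The adaptive reward shaping described above applies an efficiency penalty only when the final answer is correct, so the model is never pushed to skip searches at the cost of accuracy. A minimal sketch of a reward with that property follows; the call budget and penalty weight are illustrative assumptions, not the paper's values.

```python
def shaped_reward(is_correct: bool,
                  n_tool_calls: int,
                  call_budget: int = 3,
                  penalty: float = 0.1,
                  base_reward: float = 1.0) -> float:
    """Reward that discourages extra search calls only in correct-answer rollouts."""
    if not is_correct:
        return 0.0                       # no efficiency penalty when the answer is wrong
    excess = max(0, n_tool_calls - call_budget)
    return max(0.0, base_reward - penalty * excess)
```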

[278] DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Tianming Yang, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: DoVer: An intervention-driven debugging framework for LLM-based multi-agent systems that uses targeted interventions to validate failure hypotheses and measure progress toward task success rather than just attributing errors.

DetailsMotivation: Current log-based debugging for LLM multi-agent systems has limitations: (1) log-only debugging produces untested hypotheses without validation, and (2) single-step/single-agent attribution is often ill-posed since multiple interventions can independently fix failures.

Method: DoVer introduces intervention-driven debugging that augments hypothesis generation with active verification through targeted interventions (editing messages, altering plans). It focuses on measuring whether interventions resolve failures or make quantifiable progress toward task success rather than just attribution accuracy.

Result: On Magnetic-One framework with GAIA and AssistantBench datasets: DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates/refutes 30-60% of failure hypotheses. On GSMPlus dataset with AG2 framework: recovers 49% of failed trials.

Conclusion: Intervention is a practical mechanism for improving reliability in agentic systems, opening opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems.

Abstract: Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.
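
Schematically, intervention-driven debugging pairs each failure hypothesis with a concrete edit to the failed trial and re-runs the system to see whether the failure flips to success or at least gains milestones. The sketch below captures that loop with placeholder types; the hypothesis generation, edit application, and re-run harness are assumptions, not DoVer's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Intervention:
    description: str                 # e.g. "rewrite step 7 of the planner message"
    apply: Callable[[dict], dict]    # returns a modified copy of the trial configuration

def debug_with_interventions(failed_trial: dict,
                             hypotheses: list[Intervention],
                             rerun: Callable[[dict], dict]):
    """Validate failure hypotheses by re-running the system under targeted edits.

    rerun(config) is assumed to return {"success": bool, "milestones": int}.
    """
    report = []
    for hyp in hypotheses:
        outcome = rerun(hyp.apply(failed_trial))
        report.append({
            "hypothesis": hyp.description,
            "flipped_to_success": outcome["success"],
            "milestone_progress": outcome["milestones"],
        })
    # A hypothesis counts as validated if its intervention resolves or improves the failure.
    return sorted(report,
                  key=lambda r: (r["flipped_to_success"], r["milestone_progress"]),
                  reverse=True)
```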

[279] VIGIL: A Reflective Runtime for Self-Healing Agents

Christopher Cruz

Main category: cs.AI

TL;DR: VIGIL is a reflective runtime system that autonomously supervises and repairs LLM agents by analyzing behavioral logs, maintaining emotional state representations, and generating prompt/code fixes without human intervention.

DetailsMotivation: Current agentic LLM frameworks are brittle, lack runtime introspection, cannot self-diagnose failures, and don't improve autonomously. Most degrade into simple chains of LLM calls without reliability mechanisms.

Method: VIGIL uses a reflective runtime that: 1) ingests behavioral logs, 2) appraises events into structured emotional representations, 3) maintains a persistent EmoBank with decay policies, 4) derives an RBT diagnosis (strengths/opportunities/failures), 5) generates guarded prompt updates and code proposals via a strategy engine, and 6) operates as a state-gated pipeline with explicit error handling.

Result: In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to schema conflict, it surfaced the error, produced fallback diagnosis, and emitted repair plan - demonstrating meta-level self-repair.

Conclusion: VIGIL enables autonomous maintenance and self-repair in deployed agent runtimes, addressing brittleness in current agentic LLM frameworks through reflective supervision and structured error handling.

Abstract: Agentic LLM frameworks promise autonomous behavior via task decomposition, tool use, and iterative planning, but most deployed systems remain brittle. They lack runtime introspection, cannot diagnose their own failure modes, and do not improve over time without human intervention. In practice, many agent stacks degrade into decorated chains of LLM calls with no structural mechanisms for reliability. We present VIGIL (Verifiable Inspection and Guarded Iterative Learning), a reflective runtime that supervises a sibling agent and performs autonomous maintenance rather than task execution. VIGIL ingests behavioral logs, appraises each event into a structured emotional representation, maintains a persistent EmoBank with decay and contextual policies, and derives an RBT diagnosis that sorts recent behavior into strengths, opportunities, and failures. From this analysis, VIGIL generates both guarded prompt updates that preserve core identity semantics and read-only code proposals produced by a strategy engine that operates on log evidence and code hotspots. VIGIL functions as a state-gated pipeline. Illegal transitions produce explicit errors rather than allowing the LLM to improvise. In a reminder latency case study, VIGIL identified elevated lag, proposed prompt and code repairs, and when its own diagnostic tool failed due to a schema conflict, it surfaced the internal error, produced a fallback diagnosis, and emitted a repair plan. This demonstrates meta-level self-repair in a deployed agent runtime.
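
The persistent EmoBank with a decay policy, as described above, can be pictured as a store of appraised events whose influence fades over time. A minimal exponential-decay sketch is below; the field names and half-life are invented for illustration and are not VIGIL's actual schema.

```python
import time

class EmoBank:
    """Persistent store of appraised events whose weights decay over time (sketch)."""

    def __init__(self, half_life_s: float = 3600.0):
        self.half_life_s = half_life_s
        self.entries = []   # each: {"emotion": str, "intensity": float, "t": float}

    def record(self, emotion: str, intensity: float) -> None:
        self.entries.append({"emotion": emotion, "intensity": intensity, "t": time.time()})

    def current_state(self) -> dict[str, float]:
        """Aggregate intensities with exponential decay by age."""
        now, state = time.time(), {}
        for e in self.entries:
            decay = 0.5 ** ((now - e["t"]) / self.half_life_s)
            state[e["emotion"]] = state.get(e["emotion"], 0.0) + e["intensity"] * decay
        return state
```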

[280] How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

JV Roig

Main category: cs.AI

TL;DR: LLMs fail as autonomous agents due to recurring failure patterns like premature action, over-helpfulness, context pollution, and fragile execution, with scale not guaranteeing robustness and post-training RL being key to reliability.

DetailsMotivation: To understand how LLMs fail when operating as autonomous agents with tool-use capabilities, moving beyond aggregate scores to analyze specific failure modes and strategies for successful multi-step tool execution.

Method: Used KAMI v0.1 benchmark to analyze 900 execution traces from three models (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) across filesystem, text extraction, CSV analysis, and SQL scenarios, performing fine-grained per-trial behavioral analysis.

Result: Model scale alone doesn’t predict agentic robustness; DeepSeek V3.1’s reliability comes from post-training RL rather than architecture/size. Identified four recurring failure archetypes: premature action without grounding, over-helpfulness substituting missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load.

Conclusion: Reliable enterprise deployment requires agentic evaluation emphasizing interactive grounding, recovery behavior, and environment-aware adaptation, plus deliberate training/design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

Abstract: We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1’s superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

cs.SD

[281] AudioScene: Integrating Object-Event Audio into 3D Scenes

Shuaihang Yuan, Congcong Wen, Muhammad Shafique, Anthony Tzes, Yi Fang

Main category: cs.SD

TL;DR: Researchers created two audio-spatial datasets (AudioScanNet and AudioRoboTHOR) that combine audio clips with 3D spatial scenes to enable audio-conditioned spatial tasks, using LLMs and human verification for scalable annotation.

DetailsMotivation: Existing audio-only datasets lack spatial context, limiting research on how audio signals interact with 3D environments for applications like human-computer interaction, environmental monitoring, and public safety.

Method: Created two novel datasets by integrating audio clips with spatially aligned 3D scenes. Used large language models for common sense reasoning to associate audio events with spatial information, supplemented with rigorous human verification for scalable and accurate annotation.

Result: The datasets enable research on audio-based 3D visual grounding and audio-based robotic zero-shot navigation. Results highlight limitations of current audio-centric methods and demonstrate the datasets’ value for advancing audio-guided spatial learning.

Conclusion: The proposed audio-spatial datasets address the gap in spatial context for audio analysis, offering scalable annotation through LLM-human collaboration and enabling new research directions in audio-conditioned spatial tasks.

Abstract: The rapid advances in audio analysis underscore its vast potential for human-computer interaction, environmental monitoring, and public safety; yet, existing audio-only datasets often lack spatial context. To address this gap, we present two novel audio-spatial scene datasets, AudioScanNet and AudioRoboTHOR, designed to explore audio-conditioned tasks within 3D environments. By integrating audio clips with spatially aligned 3D scenes, our datasets enable research on how audio signals interact with spatial context. To associate audio events with corresponding spatial information, we leverage the common-sense reasoning ability of large language models and supplement them with rigorous human verification. This approach offers greater scalability than purely manual annotation while maintaining high standards of accuracy, completeness, and diversity, quantified through inter-annotator agreement and performance on two benchmark tasks: audio-based 3D visual grounding and audio-based robotic zero-shot navigation. The results highlight the limitations of current audio-centric methods and underscore the practical challenges and significance of our datasets in advancing audio-guided spatial learning.

[282] LocaGen: Sub-Sample Time-Delay Learning for Beam Localization

Ishaan Kunwar, Henry Cantor, Tyler Rizzo, Ayaan Qayyum

Main category: cs.SD

TL;DR: LocaGen improves 2-D audio beam localization using ML models trained on synthetic data to reduce sampling quantization errors, achieving 67% DOA error reduction with minimal resource overhead on embedded systems.

DetailsMotivation: To improve localization performance of audio signals in 2-D beam localization, particularly addressing sampling quantization errors that limit accuracy in microphone array systems.

Method: Uses machine learning models trained on realistic synthetic data generated by simulation to reduce sampling quantization errors. Works with a three-microphone array and processes audio at 10 kHz.
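
As a rough illustration of why sampling quantization matters here (not the paper's model): at 10 kHz, cross-correlation-based time-delay estimation is quantized to 0.1 ms steps, and a small regressor trained on simulated microphone pairs can learn to refine the coarse estimate to sub-sample precision. The pulse simulation and model choice below are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

FS = 10_000  # sampling rate in Hz, matching the paper's 10 kHz setup

def coarse_delay(x, y):
    """Delay of y relative to x, quantized to whole samples (in seconds)."""
    corr = np.correlate(x, y, mode="full")
    return (len(y) - 1 - np.argmax(corr)) / FS

def simulate_pair(true_delay, n=512, rng=None):
    """Two noisy copies of a band-limited pulse, one shifted by a fractional delay."""
    rng = rng or np.random.default_rng()
    t = np.arange(n) / FS
    sig = np.sinc((t - 0.02) * 800)
    delayed = np.sinc((t - 0.02 - true_delay) * 800)
    noise = 0.05 * rng.standard_normal((2, n))
    return sig + noise[0], delayed + noise[1]

# Train a tiny regressor to map the correlation-peak neighborhood to a sub-sample correction.
rng = np.random.default_rng(0)
X_train, y_train = [], []
for _ in range(2000):
    d = rng.uniform(-3, 3) / FS                    # true delay within +/- 3 samples
    x, y = simulate_pair(d, rng=rng)
    corr = np.correlate(x, y, mode="full")
    k = int(np.argmax(corr))
    X_train.append(corr[k - 2:k + 3] / corr[k])    # normalized peak neighborhood
    y_train.append(d - coarse_delay(x, y))         # residual beyond the quantized estimate
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000).fit(X_train, y_train)
```

At inference, the predicted residual is added to the quantized estimate; DOA for the three-microphone array then follows from the refined pairwise delays.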

Result: Demonstrates 67% reduction in direction-of-arrival (DOA) error while maintaining minimal real-time resource usage on low-powered embedded systems. Improves both DOA and precise location estimation accuracy.

Conclusion: LocaGen effectively enhances audio localization accuracy through ML-based quantization error reduction, making it practical for resource-constrained embedded systems with small microphone arrays.

Abstract: The goal of LocaGen is to improve the localization performance of audio signals in the 2-D beam localization problem. LocaGen reduces sampling quantization errors through machine learning models trained on realistic synthetic data generated by a simulation. The system increases the accuracy of both direction-of-arrival (DOA) and precise location estimation of an audio beam from an array of three microphones. We demonstrate LocaGen’s efficacy on a low-powered embedded system, achieving increased localization accuracy with only a minimal increase in real-time resource usage. LocaGen reduces DOA error by approximately 67% even with a three-microphone array processing audio at only 10 kHz.

[283] Beyond Unified Models: A Service-Oriented Approach to Low Latency, Context Aware Phonemization for Real Time TTS

Mahta Fetrat, Donya Navabi, Zahra Dehghanian, Morteza Abolghasemi, Hamid R. Rabiee

Main category: cs.SD

TL;DR: The paper proposes a service-oriented TTS architecture that decouples context-aware phonemization from the core TTS engine, enabling real-time performance with high-quality phonemization.

DetailsMotivation: There's a trade-off in TTS systems between phonemization quality and inference speed. Lightweight phonemizers struggle with context-dependent challenges, while advanced phonemizers with better linguistic understanding are too computationally expensive for real-time performance.

Method: The authors propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture where phonemization modules run as independent services, decoupling heavy context-aware components from the core TTS engine.
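
A minimal sketch of the decoupling idea, under assumptions not taken from the paper (the service interface, latency budget, and fallback policy below are illustrative): the TTS engine requests phonemes from an independent context-aware phonemization service but never blocks past its real-time budget, falling back to a lightweight G2P when the service is slow or unavailable.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

LATENCY_BUDGET_S = 0.05  # assumed per-request budget for real-time synthesis

def context_aware_g2p(text: str) -> list[str]:
    """Stand-in for a heavy, high-quality phonemizer running as a separate service."""
    return text.lower().split()   # a real client would call the remote service here

def lightweight_g2p(text: str) -> list[str]:
    """Stand-in for a fast rule-based phonemizer bundled with the core TTS engine."""
    return list(text.lower().replace(" ", ""))

_executor = ThreadPoolExecutor(max_workers=2)

def phonemize(text: str) -> list[str]:
    """Prefer the context-aware service, but degrade gracefully within the budget."""
    future = _executor.submit(context_aware_g2p, text)
    try:
        return future.result(timeout=LATENCY_BUDGET_S)
    except TimeoutError:
        return lightweight_g2p(text)

print(phonemize("hello world"))
```

Keeping phonemization behind a narrow interface like this is what lets the heavy component be upgraded or scaled independently of the synthesizer.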

Result: Experimental results show the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, breaking the latency barrier for high-quality phonemization.

Conclusion: The service-oriented architecture enables real-time use of high-quality phonemization models, making it suitable for offline and end-device TTS applications by effectively balancing quality and speed.

Abstract: Lightweight, real-time text-to-speech systems are crucial for accessibility. However, the most efficient TTS models often rely on lightweight phonemizers that struggle with context-dependent challenges. In contrast, more advanced phonemizers with a deeper linguistic understanding typically incur high computational costs, which prevents real-time performance. This paper examines the trade-off between phonemization quality and inference speed in G2P-aided TTS systems, introducing a practical framework to bridge this gap. We propose lightweight strategies for context-aware phonemization and a service-oriented TTS architecture that executes these modules as independent services. This design decouples heavy context-aware components from the core TTS engine, effectively breaking the latency barrier and enabling real-time use of high-quality phonemization models. Experimental results confirm that the proposed system improves pronunciation soundness and linguistic accuracy while maintaining real-time responsiveness, making it well-suited for offline and end-device TTS applications.

[284] Error-Resilient Semantic Communication for Speech Transmission over Packet-Loss Networks

Zhuohang Han, Jincheng Dai, Shengshi Yao, Junyi Wang, Yanlong Li, Kai Niu, Wenjun Xu, Ping Zhang

Main category: cs.SD

TL;DR: Glaris is a generative latent-prior-based resilient speech semantic communication framework that achieves robust speech transmission over wireless networks by performing resilient coding in generative latent space with integrated error resilience mechanisms.

DetailsMotivation: Real-time speech communication over wireless networks faces challenges with packet loss under bandwidth/latency constraints. Semantic communication via JSCC offers robustness but has deployment issues due to incompatibility with existing systems. Current approaches primarily evaluate robustness through error-resilience to packet loss.

Method: Proposes Glaris framework that performs resilient speech coding in generative latent space. Uses generative latent priors for high-quality packet loss concealment (PLC) at receiver, balancing semantic consistency and reconstruction fidelity. Includes integrated error resilience mechanism to mitigate error propagation and improve PLC effectiveness.
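
To make the packet-loss-concealment setting concrete, the sketch below fills lost latent frames by simple neighbor interpolation; this is only a stand-in for the paper's learned generative latent prior, and the frame/mask layout is an assumption.

```python
import numpy as np

def conceal_lost_frames(latents, received_mask):
    """latents: (T, D) latent frames; received_mask: (T,) bool, False where a packet was lost."""
    latents = latents.copy()
    good = np.flatnonzero(received_mask)
    for t in np.flatnonzero(~received_mask):
        prev_idx = good[good < t]
        next_idx = good[good > t]
        if len(prev_idx) and len(next_idx):
            a, b = prev_idx[-1], next_idx[0]
            w = (t - a) / (b - a)                     # linear interpolation weight
            latents[t] = (1 - w) * latents[a] + w * latents[b]
        elif len(prev_idx):
            latents[t] = latents[prev_idx[-1]]        # hold the last received frame
        elif len(next_idx):
            latents[t] = latents[next_idx[0]]
    return latents

# Example: 10 latent frames of dimension 4, with frames 3 and 4 lost in transit.
rng = np.random.default_rng(0)
z = rng.standard_normal((10, 4))
mask = np.ones(10, dtype=bool)
mask[[3, 4]] = False
z_concealed = conceal_lost_frames(z, mask)            # decode to waveform after concealment
```

In the actual framework, a generative prior conditioned on the received frames replaces the interpolation, which is what lets concealment preserve semantics rather than just smoothness.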

Result: Outperforms existing error-resilient codecs on LibriSpeech dataset. Achieves JSCC-level robustness while maintaining compatibility with existing systems. Reduces redundancy overhead significantly compared to traditional FEC strategies. Strikes favorable balance between transmission efficiency and speech reconstruction quality.

Conclusion: Glaris provides an effective solution for robust real-time speech communication over dynamic wireless networks by leveraging generative latent priors and integrated error resilience, achieving superior performance while maintaining system compatibility.

Abstract: Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhancing the robustness of speech transmission by means of joint source-channel coding (JSCC). However, its cross-layer design hinders practical deployment due to the incompatibility with existing digital communication systems. In this case, the robustness of speech communication is consequently evaluated primarily by the error-resilience to packet loss over wireless networks. To address these challenges, we propose Glaris, a generative latent-prior-based resilient speech semantic communication framework that performs resilient speech coding in the generative latent space. Generative latent priors enable high-quality packet loss concealment (PLC) at the receiver side, well-balancing semantic consistency and reconstruction fidelity. Additionally, an integrated error resilience mechanism is designed to mitigate the error propagation and improve the effectiveness of PLC. Compared with traditional packet-level forward error correction (FEC) strategies, our new method achieves enhanced robustness over dynamic wireless networks while reducing redundancy overhead significantly. Experimental results on the LibriSpeech dataset demonstrate that Glaris consistently outperforms existing error-resilient codecs, achieving JSCC-level robustness while maintaining seamless compatibility with existing systems, and it also strikes a favorable balance between transmission efficiency and speech reconstruction quality.

[285] SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality

Mahathir Monjur, Shahriar Nirjon

Main category: cs.SD

TL;DR: SpeechQualityLLM: A multimodal QA system that uses language models to assess speech quality through natural language queries instead of just regressing scores.

DetailsMotivation: Existing speech quality metrics (PESQ, POLQA, NISQA) are rigid - they don't support natural language queries, don't provide textual rationales, and can't capture human variability in judgments.

Method: Multimodal system coupling audio encoder with language model, trained on NISQA corpus using template-based QA pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, loudness) in both single-ended and double-ended setups.
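
A small sketch of what template-based QA supervision and numeric answer parsing can look like; the templates, field names, and answer format below are assumptions for illustration, not the paper's released prompts.

```python
import re

DIMENSIONS = ["mos", "noisiness", "coloration", "discontinuity", "loudness"]

QUESTION_TEMPLATES = {
    "mos": "How would you rate the overall quality of this clip on a 1-5 scale?",
    "noisiness": "How noisy does this clip sound on a 1-5 scale?",
    "coloration": "How strong is the coloration in this clip on a 1-5 scale?",
    "discontinuity": "How discontinuous does this clip sound on a 1-5 scale?",
    "loudness": "How appropriate is the loudness of this clip on a 1-5 scale?",
}

def make_qa_pairs(clip_id, ratings):
    """Turn one rated clip into (audio, question, textual answer) training triples."""
    return [{
        "audio": clip_id,
        "question": QUESTION_TEMPLATES[dim],
        "answer": f"The {dim} score is about {ratings[dim]:.1f}.",
    } for dim in DIMENSIONS]

def parse_score(answer_text):
    """Pull the first number out of a generated answer for regression-style evaluation."""
    match = re.search(r"\d+(?:\.\d+)?", answer_text)
    return float(match.group()) if match else None

pairs = make_qa_pairs("clip_0001.wav", {d: 3.2 for d in DIMENSIONS})
print(pairs[0]["answer"], "->", parse_score(pairs[0]["answer"]))
```

Parsed predictions can then be scored against held-out ratings with MAE and Pearson correlation, matching the evaluation behind the reported 0.41 MAE and 0.86 correlation.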

Result: Double-ended model achieves MOS MAE of 0.41 with Pearson correlation of 0.86 on held-out NISQA clips, with competitive performance on dimension-wise tasks.

Conclusion: SpeechQualityLLM offers flexible natural-language interface for speech quality assessment, enabling arbitrary queries, emulation of different listener profiles, and diverse plausible judgments, reducing reliance on expensive crowdsourced tests.

Abstract: Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.

[286] VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning

Qianyue Hu, Junyan Wu, Wei Lu, Xiangyang Luo

Main category: cs.SD

TL;DR: VoiceCloak is a proactive defense framework that protects against unauthorized voice cloning by diffusion models through multi-dimensional adversarial perturbations.

DetailsMotivation: Diffusion models have achieved remarkable success in realistic voice cloning, but this also increases the risk of malicious misuse. Existing proactive defenses designed for traditional VC models are incompatible with DMs due to their intricate generative mechanisms.

Method: VoiceCloak introduces adversarial perturbations into reference audio to: 1) obfuscate speaker identity by distorting representation learning embeddings and disrupting conditional guidance processes (attention context), and 2) degrade perceptual quality through score magnitude amplification and noise-guided semantic corruption.
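
A minimal sketch of the identity-obfuscation component in isolation: projected gradient ascent on the reference waveform that pushes its speaker embedding away from the original under a small L-infinity budget. The SpeakerEncoder below is a placeholder model, and the paper's additional attacks on conditional guidance and output quality are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Stand-in for a real speaker-embedding model."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(8), nn.Flatten(),
                                 nn.Linear(16 * 8, dim))
    def forward(self, wav):                        # wav: (batch, 1, samples)
        return F.normalize(self.net(wav), dim=-1)

def cloak(wav, encoder, eps=0.002, steps=50, lr=1e-3):
    """Gradient ascent on embedding distance from the clean reference, clipped to +/- eps."""
    with torch.no_grad():
        target = encoder(wav)
    delta = torch.zeros_like(wav, requires_grad=True)
    for _ in range(steps):
        loss = -F.cosine_similarity(encoder(wav + delta), target, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()        # push the identity away
            delta.clamp_(-eps, eps)                # keep the perturbation imperceptibly small
            delta.grad.zero_()
    return (wav + delta).detach()

protected = cloak(torch.randn(1, 1, 16000), SpeakerEncoder())
```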

Result: Extensive experiments highlight VoiceCloak’s outstanding defense success rate against unauthorized diffusion-based voice cloning.

Conclusion: VoiceCloak effectively bridges the gap in proactive defense for diffusion-based voice cloning by targeting specific vulnerabilities in DMs through multi-dimensional adversarial perturbations.

Abstract: Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak’s outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.

[287] DFALLM: Achieving Generalizable Multitask Deepfake Detection by Optimizing Audio LLM Components

Yupei Li, Li Wang, Yuxiang Wang, Lei Wang, Rizhao Cai, Jie Shi, Björn W. Schuller, Zhizheng Wu

Main category: cs.SD

TL;DR: This paper investigates how to improve audio deepfake detection generalization in Audio Large Language Models (ALLMs) by carefully selecting and combining audio encoders with text-based LLMs, achieving state-of-the-art performance across multiple datasets and tasks.

DetailsMotivation: Traditional deep learning methods for audio deepfake detection lack generalizability to new spoofing techniques and multi-task scenarios like spoof attribution. While LLMs have strong generalization capabilities, previous ALLMs show a generalization bottleneck in audio deepfake detection even with sufficient data.

Method: The study investigates ALLM architecture components (audio encoder and text-based LLM), examining their effects on deepfake detection. The authors propose an ALLM structure with careful selection and combination of audio encoders and text-based LLMs to unlock deepfake detection potential.

Result: The proposed model achieves state-of-the-art performance across multiple datasets (ASVSpoof2019, InTheWild, Demopage) with up to 95.76% average accuracy. It also shows competitive capabilities in other deepfake tasks like attribution and localization compared to SOTA audio understanding models.

Conclusion: Careful selection and combination of audio encoders and text-based LLMs are crucial for unlocking ALLMs’ deepfake detection potential. The proposed architecture successfully generalizes to out-of-domain spoofing tests and multi-task scenarios, addressing limitations of traditional methods.

Abstract: Audio deepfake detection has recently garnered public concern due to its implications for security and reliability. Traditional deep learning methods have been widely applied to this task but often lack generalisability when confronted with newly emerging spoofing techniques and more tasks such as spoof attribution recognition rather than simple binary classification. In principle, Large Language Models (LLMs) are considered to possess the needed generalisation capabilities. However, previous research on Audio LLMs (ALLMs) indicates a generalization bottleneck in audio deepfake detection performance, even when sufficient data is available. Consequently, this study investigates the model architecture and examines the effects of the primary components of ALLMs, namely the audio encoder and the text-based LLM. Our experiments demonstrate that the careful selection and combination of audio encoders and text-based LLMs are crucial for unlocking the deepfake detection potential of ALLMs. We further propose an ALLM structure capable of generalizing deepfake detection abilities to out-of-domain spoofing tests and other deepfake tasks, such as spoof positioning and spoof attribution recognition. Our proposed model architecture achieves state-of-the-art (SOTA) performance across multiple datasets, including ASVSpoof2019, InTheWild, and Demopage, with accuracy reaching up to 95.76% on average, and exhibits competitive capabilities in other deepfake detection tasks such as attribution, and localisation compared to SOTA audio understanding models. Data and codes are provided in supplementary materials.

[288] Emovectors: assessing emotional content in jazz improvisations for creativity evaluation

Anna Jordanous

Main category: cs.SD

TL;DR: The paper proposes an embeddings-based method called ’emovectors’ to quantify emotional content in jazz improvisations as a metric for assessing creativity.

DetailsMotivation: To develop automated metrics for evaluating creativity in jazz improvisations, particularly for LLM-based generative systems, by linking emotional involvement to perceived creativity.

Method: An embeddings-based method using psychologically-grounded classification of musical characteristics associated with emotions to create ’emovectors’ that capture emotional content in musical improvisations.

Result: The study hypothesizes that improvisations with more evidence of emotion-laden content are more likely to be recognized as creative, and tests this by analyzing emovectors across multiple improvisations.

Conclusion: Quantifying emotional content through emovectors can contribute to new scalable metrics for creativity evaluation in musical improvisations.

Abstract: Music improvisation is fascinating to study, being essentially a live demonstration of a creative process. In jazz, musicians often improvise across predefined chord progressions (leadsheets). How do we assess the creativity of jazz improvisations? And can we capture this in automated metrics for creativity for current LLM-based generative systems? Demonstration of emotional involvement is closely linked with creativity in improvisation. Analysing musical audio, can we detect emotional involvement? This study hypothesises that if an improvisation contains more evidence of emotion-laden content, it is more likely to be recognised as creative. An embeddings-based method is proposed for capturing the emotional content in musical improvisations, using a psychologically-grounded classification of musical characteristics associated with emotions. Resulting ’emovectors’ are analysed to test the above hypothesis, comparing across multiple improvisations. Capturing emotional content in this quantifiable way can contribute towards new metrics for creativity evaluation that can be applied at scale.

cs.LG

[289] ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin

Main category: cs.LG

TL;DR: ThreadWeaver enables adaptive parallel reasoning in LLMs, achieving accuracy comparable to sequential CoT models while reducing inference latency through parallel reasoning threads.

DetailsMotivation: Sequential decoding in LLMs causes high latency on complex reasoning tasks, while existing parallel reasoning methods either sacrifice accuracy or require specialized inference engines, limiting practical deployment.

Method: Three key innovations: 1) Two-stage parallel trajectory generator for high-quality parallel CoT data, 2) Trie-based training-inference co-design for compatibility with standard autoregressive engines, 3) Parallelization-aware RL to balance accuracy and parallelization efficiency.

Result: On six math reasoning benchmarks, ThreadWeaver achieves 71.9% average accuracy (79.9% on AIME24) comparable to state-of-the-art sequential models, with up to 1.53x token latency speedup.

Conclusion: ThreadWeaver establishes a new Pareto frontier between accuracy and efficiency, enabling parallel reasoning without sacrificing performance or requiring specialized infrastructure.

Abstract: Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver’s performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.

[290] Space Alignment Matters: The Missing Piece for Inducing Neural Collapse in Long-Tailed Learning

Jinping Wang, Zhiqiang Gao, Zhiwu Xie

Main category: cs.LG

TL;DR: The paper addresses feature-classifier misalignment in long-tailed learning, proposing plug-and-play alignment strategies that improve existing methods and achieve SOTA results.

DetailsMotivation: In long-tailed regimes, severe sample imbalance prevents Neural Collapse (NC) from emerging, leading to poor generalization. Current methods try to recover ETF geometry but overlook the critical misalignment between feature and classifier weight spaces.

Method: The authors theoretically quantify the harm of feature-classifier misalignment through optimal error exponent analysis, then propose three explicit alignment strategies that can be plug-and-play integrated into existing long-tail methods without architectural changes.
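
As an illustration of the misalignment being measured (not one of the paper's three specific strategies), the sketch below computes the cosine alignment between each class's feature mean and its classifier weight and turns the gap into a plug-in penalty.

```python
import torch
import torch.nn.functional as F

def class_means(features, labels, num_classes):
    """Per-class mean feature vector. features: (N, D), labels: (N,)."""
    means = []
    for c in range(num_classes):
        mask = labels == c
        means.append(features[mask].mean(dim=0) if mask.any()
                     else torch.zeros(features.size(1)))   # classes absent from the batch
    return torch.stack(means)

def alignment_loss(features, labels, classifier_weight, num_classes):
    """Average (1 - cosine similarity) between class means and classifier weight rows."""
    mu = F.normalize(class_means(features, labels, num_classes), dim=-1)
    w = F.normalize(classifier_weight, dim=-1)      # (C, D)
    return (1.0 - (mu * w).sum(dim=-1)).mean()

# Example with a random batch and a bias-free linear classifier head.
feats = torch.randn(64, 32)
labels = torch.randint(0, 10, (64,))
head = torch.nn.Linear(32, 10, bias=False)
penalty = alignment_loss(feats, labels, head.weight, num_classes=10)
```

Added to the usual classification loss, a term like this nudges feature means and classifier directions toward the same geometry, which is the alignment the paper argues is overlooked.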

Result: Extensive experiments on CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT datasets consistently boost examined baselines and achieve state-of-the-art performances.

Conclusion: Addressing feature-classifier misalignment is crucial for long-tailed learning, and the proposed plug-and-play alignment strategies effectively improve existing methods while maintaining architectural simplicity.

Abstract: Recent studies on Neural Collapse (NC) reveal that, under class-balanced conditions, the class feature means and classifier weights spontaneously align into a simplex equiangular tight frame (ETF). In long-tailed regimes, however, severe sample imbalance tends to prevent the emergence of the NC phenomenon, resulting in poor generalization performance. Current efforts predominantly seek to recover the ETF geometry by imposing constraints on features or classifier weights, yet overlook a critical problem: There is a pronounced misalignment between the feature and the classifier weight spaces. In this paper, we theoretically quantify the harm of such misalignment through an optimal error exponent analysis. Built on this insight, we propose three explicit alignment strategies that plug-and-play into existing long-tail methods without architectural change. Extensive experiments on the CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT datasets consistently boost examined baselines and achieve the state-of-the-art performances.

[291] CarBench: A Comprehensive Benchmark for Neural Surrogates on High-Fidelity 3D Car Aerodynamics

Mohamed Elrefaie, Dule Shu, Matt Klenk, Faez Ahmed

Main category: cs.LG

TL;DR: CarBench is the first comprehensive benchmark for large-scale 3D car aerodynamics, evaluating 11 state-of-the-art models on DrivAerNet++ (8,000+ CFD simulations) with standardized evaluation protocols.

DetailsMotivation: Despite growing availability of large-scale CFD datasets, there's no standardized benchmark for large-scale numerical simulations in engineering design, hindering progress in applying machine learning to aerodynamic and engineering applications.

Method: Created CarBench benchmark using DrivAerNet++ dataset (largest public automotive aerodynamics dataset). Evaluated 11 architectures including neural operators, geometric deep learning, transformer-based neural solvers, and implicit field networks. Performed both standard interpolation tasks and cross-category experiments.

Result: Comprehensive evaluation covering predictive accuracy, physical consistency, computational efficiency, and statistical uncertainty. Established reproducible foundation with open-source framework including training pipelines, uncertainty estimation routines, and pretrained model weights.
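
The released uncertainty estimation routines are based on bootstrap resampling; a small sketch of that kind of routine is below, with placeholder data standing in for a surrogate's held-out drag-coefficient predictions.

```python
import numpy as np

def bootstrap_metric(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Point estimate of a metric plus a 95% bootstrap confidence interval over test samples."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.array([metric(y_true[idx], y_pred[idx])
                      for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))])
    return metric(y_true, y_pred), np.percentile(stats, [2.5, 97.5])

mae = lambda t, p: np.mean(np.abs(t - p))

# Placeholder: predicted vs. reference drag coefficients for 500 held-out cars.
rng = np.random.default_rng(1)
cd_true = rng.normal(0.30, 0.02, size=500)
cd_pred = cd_true + rng.normal(0.0, 0.005, size=500)
point, (lo, hi) = bootstrap_metric(cd_true, cd_pred, mae)
print(f"MAE = {point:.4f}  (95% CI: {lo:.4f}-{hi:.4f})")
```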

Conclusion: CarBench provides the first standardized benchmark for large-scale 3D car aerodynamics, accelerating progress in data-driven engineering by establishing reproducible evaluation protocols and open-source tools for the community.

Abstract: Benchmarking has been the cornerstone of progress in computer vision, natural language processing, and the broader deep learning domain, driving algorithmic innovation through standardized datasets and reproducible evaluation protocols. The growing availability of large-scale Computational Fluid Dynamics (CFD) datasets has opened new opportunities for applying machine learning to aerodynamic and engineering design. Yet, despite this progress, there exists no standardized benchmark for large-scale numerical simulations in engineering design. In this work, we introduce CarBench, the first comprehensive benchmark dedicated to large-scale 3D car aerodynamics, performing a large-scale evaluation of state-of-the-art models on DrivAerNet++, the largest public dataset for automotive aerodynamics, containing over 8,000 high-fidelity car simulations. We assess eleven architectures spanning neural operator methods (e.g., Fourier Neural Operator), geometric deep learning (PointNet, RegDGCNN, PointMAE, PointTransformer), transformer-based neural solvers (Transolver, Transolver++, AB-UPT), and implicit field networks (TripNet). Beyond standard interpolation tasks, we perform cross-category experiments in which transformer-based solvers trained on a single car archetype are evaluated on unseen categories. Our analysis covers predictive accuracy, physical consistency, computational efficiency, and statistical uncertainty. To accelerate progress in data-driven engineering, we open-source the benchmark framework, including training pipelines, uncertainty estimation routines based on bootstrap resampling, and pretrained model weights, establishing the first reproducible foundation for large-scale learning from high-fidelity CFD simulations, available at https://github.com/Mohamedelrefaie/CarBench.

[292] RaX-Crash: A Resource Efficient and Explainable Small Model Pipeline with an Application to City Scale Injury Severity Prediction

Di Zhu, Chen Xie, Ziwei Wang, Haoyun Zhang

Main category: cs.LG

TL;DR: RaX-Crash is an efficient, explainable pipeline using tree-based ensembles (XGBoost/Random Forest) for NYC crash injury severity prediction, outperforming small language models and highlighting key risk factors.

DetailsMotivation: NYC has over 100k motor vehicle collisions annually, creating substantial injury and public health burden, requiring efficient and interpretable prediction methods for injury severity.

Method: Integrates three linked tables with tens of millions of records, builds unified feature schema in partitioned storage, trains compact tree ensembles (Random Forest/XGBoost) on engineered tabular features, compares against small language models prompted with textual summaries.
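
A condensed sketch of the tabular stage with hypothetical feature names: gradient-boosted trees fit with balanced sample weights for the rare severe class, followed by SHAP attribution. The real pipeline engineers its schema from the three linked NYC tables rather than the toy columns below.

```python
import numpy as np
import pandas as pd
import shap
import xgboost as xgb
from sklearn.utils.class_weight import compute_sample_weight

# Placeholder features standing in for the engineered crash schema.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "hour_of_day": rng.integers(0, 24, 5000),
    "borough_id": rng.integers(0, 5, 5000),
    "num_pedestrians": rng.poisson(0.2, 5000),
    "num_cyclists": rng.poisson(0.1, 5000),
    "vehicle_count": rng.integers(1, 4, 5000),
})
y = (rng.random(5000) < 0.05).astype(int)          # rare "severe or fatal" label

weights = compute_sample_weight("balanced", y)      # up-weight the rare class
model = xgb.XGBClassifier(n_estimators=200, max_depth=5, eval_metric="logloss")
model.fit(X, y, sample_weight=weights)

explainer = shap.TreeExplainer(model)               # per-feature attribution of predicted severity
shap_values = explainer.shap_values(X.iloc[:100])
```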

Result: XGBoost (0.7828 accuracy) and Random Forest (0.7794) clearly outperform SLMs (0.594 and 0.496); class weighting improves fatal recall with modest accuracy trade-offs; SHAP attribution highlights human vulnerability factors, timing, and location as dominant severity drivers.

Conclusion: Interpretable small model ensembles remain strong baselines for city-scale injury analytics; hybrid pipelines pairing tabular predictors with SLM-generated narratives improve communication without sacrificing scalability.

Abstract: New York City reports over one hundred thousand motor vehicle collisions each year, creating substantial injury and public health burden. We present RaX-Crash, a resource-efficient and explainable small model pipeline for structured injury severity prediction on the official NYC Motor Vehicle Collisions dataset. RaX-Crash integrates three linked tables with tens of millions of records, builds a unified feature schema in partitioned storage, and trains compact tree-based ensembles (Random Forest and XGBoost) on engineered tabular features, which are compared against locally deployed small language models (SLMs) prompted with textual summaries. On a temporally held-out test set, XGBoost and Random Forest achieve accuracies of 0.7828 and 0.7794, clearly outperforming SLMs (0.594 and 0.496); class imbalance analysis shows that simple class weighting improves fatal recall with modest accuracy trade-offs, and SHAP attribution highlights human vulnerability factors, timing, and location as dominant drivers of predicted severity. Overall, RaX-Crash indicates that interpretable small model ensembles remain strong baselines for city-scale injury analytics, while hybrid pipelines that pair tabular predictors with SLM-generated narratives improve communication without sacrificing scalability.

[293] Geometric-Stochastic Multimodal Deep Learning for Predictive Modeling of SUDEP and Stroke Vulnerability

Preksha Girish, Rachana Mysore, Mahanthesha U, Shrey Kumar, Misbah Fatimah Annigeri, Tanish Jain

Main category: cs.LG

TL;DR: A unified geometric-stochastic multimodal deep learning framework integrates EEG, ECG, respiration, SpO2, EMG, and fMRI signals to model SUDEP and stroke vulnerability using Riemannian manifolds, Lie-group features, fractional stochastic dynamics, and cross-modal attention.

DetailsMotivation: SUDEP and acute ischemic stroke are life-threatening conditions involving complex interactions across cortical, brainstem, and autonomic systems, requiring advanced multimodal analysis for early detection and risk stratification.

Method: Combines Riemannian manifold embeddings, Lie-group invariant feature representations, fractional stochastic dynamics, Hamiltonian energy-flow modeling, cross-modal attention mechanisms, and fractional epidemic diffusion over structural brain graphs for stroke propagation modeling.

Result: Experiments on MULTI-CLARID dataset demonstrate improved predictive accuracy and interpretable biomarkers derived from manifold curvature, fractional memory indices, attention entropy, and diffusion centrality.

Conclusion: The framework provides a mathematically principled foundation for early detection, risk stratification, and interpretable multimodal modeling in neural-autonomic disorders.

Abstract: Sudden Unexpected Death in Epilepsy (SUDEP) and acute ischemic stroke are life-threatening conditions involving complex interactions across cortical, brainstem, and autonomic systems. We present a unified geometric-stochastic multimodal deep learning framework that integrates EEG, ECG, respiration, SpO2, EMG, and fMRI signals to model SUDEP and stroke vulnerability. The approach combines Riemannian manifold embeddings, Lie-group invariant feature representations, fractional stochastic dynamics, Hamiltonian energy-flow modeling, and cross-modal attention mechanisms. Stroke propagation is modeled using fractional epidemic diffusion over structural brain graphs. Experiments on the MULTI-CLARID dataset demonstrate improved predictive accuracy and interpretable biomarkers derived from manifold curvature, fractional memory indices, attention entropy, and diffusion centrality. The proposed framework provides a mathematically principled foundation for early detection, risk stratification, and interpretable multimodal modeling in neural-autonomic disorders.

[294] SABER: Small Actions, Big Errors – Safeguarding Mutating Steps in LLM Agents

Alejandro Cuadron, Pengfei Yu, Yang Liu, Arpit Gupta

Main category: cs.LG

TL;DR: LLM agents are fragile on long-horizon tool-using tasks, with mutating actions (environment-changing) being critical failure points. The paper introduces CM, a test-time safeguard that improves performance through mutation-gated verification, targeted reflection, and context cleaning.

DetailsMotivation: To understand why LLM agents fail on long-horizon tool-using tasks, specifically investigating whether all actions contribute equally to failure, and to identify critical failure points in agent execution.

Method: Analyzed execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, decomposing trajectories into mutating vs. non-mutating steps. Formalized “decisive deviations” (earliest action-level divergences that flip success to failure). Used logistic regression to quantify impact. Introduced CM, a model-agnostic safeguard with mutation-gated verification, targeted reflection before mutating steps, and block-based context cleaning.
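
A small sketch of the trial-level logistic regression behind the odds analysis, on synthetic counts (column names and coefficients are illustrative): exponentiating a fitted coefficient gives the multiplicative change in success odds per additional deviation, so exp(coef) near 0.08 corresponds to roughly a 92% reduction.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic per-trial deviation counts and outcomes (illustrative only).
rng = np.random.default_rng(0)
trials = pd.DataFrame({
    "mutating_deviations": rng.poisson(0.6, 1000),
    "nonmutating_deviations": rng.poisson(1.5, 1000),
})
logit = 1.5 - 2.5 * trials["mutating_deviations"] - 0.1 * trials["nonmutating_deviations"]
trials["success"] = rng.random(1000) < 1 / (1 + np.exp(-logit))

X = trials[["mutating_deviations", "nonmutating_deviations"]]
model = LogisticRegression().fit(X, trials["success"])
for name, ratio in zip(X.columns, np.exp(model.coef_[0])):
    print(f"{name}: each extra deviation multiplies the odds of success by {ratio:.2f}")
```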

Result: Each additional deviation in a mutating action reduces odds of success by up to 92% on Airline and 96% on Retail for SoTA models, while non-mutating deviations have little effect. CM delivers consistent gains: Qwen3-Thinking +28% relative on Airline, +11% on Retail, +7% on SWE-Bench Verified; Claude +9%/+7%. Also identified ceiling effects in τ-Bench due to annotation errors and underspecified tasks.

Conclusion: Mutating actions are critical failure points in LLM agents. Action-level analysis, targeted safeguards (like CM), and reliable evaluations (τ-Bench Verified) are essential for robust multi-turn agents. The paper argues for these as prerequisites for building more reliable agent systems.

Abstract: Despite rapid progress in LLM agents, performance on long-horizon, tool-using tasks remains fragile. To better understand this fragility, we ask a simple question: do all actions contribute equally to failure? Analyzing execution traces on τ-Bench (Airline/Retail) and SWE-Bench Verified, we decompose trajectories into mutating (environment-changing) vs. non-mutating steps and formalize decisive deviations, the earliest action-level divergences that flip success to failure. A logistic regression reveals that each additional deviation in a mutating action reduces the odds of success by up to 92% on Airline and up to 96% on Retail for SoTA models. In contrast, deviations in non-mutating actions have little to no effect. Errors also grow with context length as agents drift from role and act on stale constraints. Motivated by these observations, we introduce CM, a model-agnostic, gradient-free, test-time safeguard that (i) adds mutation-gated verification, (ii) injects Targeted Reflection before mutating steps, and (iii) performs block-based context cleaning. CM delivers consistent gains, e.g., Qwen3-Thinking: +28% relative on Airline, +11% on Retail, and +7% on SWE-Bench Verified; Claude: +9%/+7%. We further identify ceiling effects in τ-Bench, where annotation errors and underspecified tasks artificially cap model performance. To address this, we release τ-Bench Verified, which restores benchmark headroom through targeted revisions. Our results argue for action-level analysis, targeted safeguards, and reliable evaluations as prerequisites for robust multi-turn agents.

[295] GPU Memory Prediction for Multimodal Model Training

Jinwoo Jeong, Minchul Kang, Younghun Go, Changyong Shin, Hyunho Lee, Junho Yoon, Gyeongsik Yang, Chuck Yoo

Main category: cs.LG

TL;DR: A framework for predicting peak GPU memory usage in multimodal models to prevent out-of-memory errors during training.

DetailsMotivation: As agentic AI systems use larger multimodal models, GPU memory requirements often exceed capacity causing OoM errors that waste computational resources. Existing memory prediction methods only work for unimodal architectures and fail with multimodal models.

Method: The framework decomposes multimodal models into constituent layers and applies factorization to estimate memory usage of each layer, analyzing both model architecture and training behavior.
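
A back-of-the-envelope version of the per-layer accounting, under assumptions made for illustration (fp32 training, Adam optimizer states, activations kept for backward); the paper's factorization is finer-grained and handles the multimodal layer types explicitly.

```python
BYTES_PER_PARAM = 4          # fp32
ADAM_STATE_FACTOR = 2        # first and second moments

def linear_layer_memory(in_dim, out_dim, batch, seq_len):
    """Rough peak-memory contribution (bytes) of one linear layer during training."""
    params = in_dim * out_dim + out_dim
    weights = params * BYTES_PER_PARAM
    grads = params * BYTES_PER_PARAM
    optim = params * BYTES_PER_PARAM * ADAM_STATE_FACTOR
    activations = batch * seq_len * out_dim * BYTES_PER_PARAM   # stored for backward
    return weights + grads + optim + activations

def estimate_peak(layers, batch, seq_len):
    """Sum per-layer estimates for a model described as (in_dim, out_dim) pairs."""
    return sum(linear_layer_memory(i, o, batch, seq_len) for i, o in layers)

# Toy multimodal stack: a vision tower, a projector, and a language backbone.
layers = [(1024, 1024)] * 12 + [(1024, 4096)] + [(4096, 4096)] * 24
gib = estimate_peak(layers, batch=8, seq_len=2048) / 2**30
print(f"estimated peak memory: {gib:.1f} GiB")
```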

Result: The framework achieves high prediction accuracy with ~8.7% average MAPE (Mean Absolute Percentage Error).

Conclusion: The proposed framework successfully addresses the limitation of previous unimodal-focused methods by providing accurate GPU memory usage prediction for multimodal models, enabling better resource management and preventing OoM errors in agentic AI systems.

Abstract: As deep learning models in agentic AI systems grow in scale and complexity, GPU memory requirements increase and often exceed the available GPU memory capacity, so that out-of-memory (OoM) errors occur. It is well known that OoM interrupts the whole training itself and wastes substantial computational resources. Therefore, to prevent OoM, accurate prediction of GPU memory usage is essential. However, previous studies focus only on unimodal architectures and fail to generalize to multimodal models, even though the multimodal models are a common choice in agentic AI systems. To address this limitation, we propose a framework that predicts the peak GPU memory usage by analyzing the model architecture and training behavior of multimodal models. Specifically, the framework decomposes the multimodal model into its constituent layers and applies factorization to estimate the memory usage of each layer. Our evaluation shows that our framework achieves high prediction accuracy of ~8.7% average MAPE.

[296] HSTMixer: A Hierarchical MLP-Mixer for Large-Scale Traffic Forecasting

Yongyao Wang, Jingyuan Wang, Xie Yu, Jiahao Ji, Chao Li

Main category: cs.LG

TL;DR: HSTMixer: A hierarchical MLP-based framework for efficient large-scale traffic forecasting with adaptive region modeling.

DetailsMotivation: Existing traffic forecasting models have quadratic computational complexity, making them impractical for large-scale real-world traffic networks that require handling complex, evolving patterns across extensive urban areas.

Method: Proposes HSTMixer with hierarchical spatiotemporal mixing blocks for multi-resolution feature extraction (bottom-up aggregation and top-down propagation), and an adaptive region mixer that generates transformation matrices based on regional semantics to dynamically capture evolving spatiotemporal patterns.

Result: Achieves state-of-the-art performance on four large-scale real-world datasets while maintaining competitive computational efficiency compared to existing models.

Conclusion: HSTMixer provides an effective and efficient solution for large-scale traffic forecasting by combining hierarchical MLP architecture with adaptive region modeling, addressing the computational limitations of previous approaches while maintaining high predictive accuracy.

Abstract: The traffic forecasting task is significant to modern urban management. Recently, there has been growing attention on large-scale forecasting, as it better reflects the complexity of real-world traffic networks. However, existing models often exhibit quadratic computational complexity, making them impractical for large-scale real-world scenarios. In this paper, we propose a novel framework, Hierarchical Spatio-Temporal Mixer (HSTMixer), which leverages an all-MLP architecture for efficient and effective large-scale traffic forecasting. HSTMixer employs a hierarchical spatiotemporal mixing block to extract multi-resolution features through bottom-up aggregation and top-down propagation. Furthermore, an adaptive region mixer generates transformation matrices based on regional semantics, enabling our model to dynamically capture evolving spatiotemporal patterns for different regions. Extensive experiments conducted on four large-scale real-world datasets demonstrate that the proposed method not only achieves state-of-the-art performance but also exhibits competitive computational efficiency.

[297] LAPA: Log-Domain Prediction-Driven Dynamic Sparsity Accelerator for Transformer Model

Huizheng Wang, Hongbin Wang, Shaojun Wei, Yang Hu, Shouyi Yin

Main category: cs.LG

TL;DR: LAPA is a log-domain attention prediction algorithm-architecture co-design that achieves 3.52x higher energy efficiency than SOTA sparse Transformers through asymmetric leading one computing, mixed-precision accumulation, and data-feature dependent filtering.

DetailsMotivation: Transformers have computational bottlenecks that vary dynamically across stages as input sequences change, requiring cross-stage sparse acceleration. Existing sparse Transformer approaches are single-stage based and their sparsity prediction mechanisms cause significant power overhead when applied across multiple stages.

Method: Proposes LAPA with four key components: 1) Asymmetric Leading One Computing (ALOC) to eliminate expensive multiplications, 2) Mixed-precision Multi-Round Shifting Accumulation (MRSA) to mitigate accumulation overhead, 3) Data-Feature Dependent Filter (DDF) strategy working with MRSA, and 4) An elaborate accelerator design to translate theoretical enhancements into practical hardware improvements.
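
To see why the log-domain trick is cheap, the sketch below approximates an integer multiply by accumulating shift terms at the other operand's leading-one positions; it only illustrates the general idea of shift-and-accumulate arithmetic, not the ALOC/MRSA hardware design itself.

```python
def shift_add_mul(a: int, b: int, rounds: int = 2) -> int:
    """Approximate a*b by accumulating shifts at b's leading-one positions."""
    acc = 0
    for _ in range(rounds):
        if b <= 0:
            break
        k = b.bit_length() - 1   # position of the most significant set bit
        acc += a << k            # one shifting round instead of a multiply
        b -= 1 << k              # strip that term and refine in the next round
    return acc

exact, approx = 37 * 25, shift_add_mul(37, 25)
print(exact, approx, f"relative error = {abs(exact - approx) / exact:.1%}")
```

Each extra round roughly halves the worst-case error at the cost of one more shift-and-add, which is the accuracy/efficiency dial such accelerators tune.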

Result: LAPA achieves 3.52x, 3.24x, and 2.79x higher energy efficiency compared to state-of-the-art works Spatten, Sanger, and FACT respectively.

Conclusion: The proposed LAPA co-design effectively addresses the dynamic computational bottlenecks in Transformers across stages, significantly improving energy efficiency through log-domain attention prediction and specialized hardware acceleration.

Abstract: Attention-based Transformers have revolutionized natural language processing (NLP) and shown strong performance in computer vision (CV) tasks. However, as the input sequence varies, the computational bottlenecks in Transformer models exhibit dynamic behavior across stages, which calls for a cross-stage sparse acceleration strategy. Unfortunately, most existing sparse Transformer approaches are single-stage based, and their sparsity prediction mechanisms lead to significant power overhead when applied across multiple stages. To this end, this paper proposes a log-domain attention prediction algorithm-architecture co-design, named LAPA. First, an asymmetric leading one computing (ALOC) scheme is designed to eliminate expensive multiplications. Next, a mixed-precision multi-round shifting accumulation (MRSA) mechanism is further proposed to mitigate the accumulation overhead. A data-feature dependent filter (DDF) strategy is designed to work in concert with the MRSA process. Finally, an elaborate accelerator is designed to translate the theoretical enhancement into practical hardware improvement. Experimental results show that LAPA achieves 3.52x, 3.24x and 2.79x higher energy efficiency than the state-of-the-art (SOTA) works Spatten, Sanger and FACT, respectively.

[298] Medical Test-free Disease Detection Based on Big Data

Haokun Zhao, Yingzhe Bai, Qingyang Xu, Lixin Zhou, Jianxin Chen, Jicong Fan

Main category: cs.LG

TL;DR: CLDD is a graph-based deep learning model that detects diseases collaboratively using patient-disease associations and demographic features, reducing reliance on extensive medical testing.

DetailsMotivation: Disease detection requires extensive medical testing which is costly and impractical for screening hundreds/thousands of diseases per patient. There's a need for efficient methods that reduce diagnostic costs while maintaining accuracy.

Method: Proposes Collaborative Learning for Disease Detection (CLDD) - a graph-based deep learning model that formulates disease detection as collaborative learning. It exploits associations among diseases and similarities among patients adaptively, integrating patient-disease interactions and demographic features from EHRs.

Result: On MIMIC-IV dataset (61,191 patients, 2,000 diseases), CLDD outperforms baselines with 6.33% improvement in recall and 7.63% improvement in precision. Case studies show it successfully recovers masked diseases in top-ranked predictions.

Conclusion: CLDD enables large-scale disease screening with reduced diagnostic costs and improved accessibility, showing promise for social health security through interpretable and reliable disease prediction.

Abstract: Accurate disease detection is of paramount importance for effective medical treatment and patient care. However, the process of disease detection is often associated with extensive medical testing and considerable costs, making it impractical to perform all possible medical tests on a patient to diagnose or predict hundreds or thousands of diseases. In this work, we propose Collaborative Learning for Disease Detection (CLDD), a novel graph-based deep learning model that formulates disease detection as a collaborative learning task by exploiting associations among diseases and similarities among patients adaptively. CLDD integrates patient-disease interactions and demographic features from electronic health records to detect hundreds or thousands of diseases for every patient, with little to no reliance on the corresponding medical tests. Extensive experiments on a processed version of the MIMIC-IV dataset comprising 61,191 patients and 2,000 diseases demonstrate that CLDD consistently outperforms representative baselines across multiple metrics, achieving a 6.33% improvement in recall and 7.63% improvement in precision. Furthermore, case studies on individual patients illustrate that CLDD can successfully recover masked diseases within its top-ranked predictions, demonstrating both interpretability and reliability in disease prediction. By reducing diagnostic costs and improving accessibility, CLDD holds promise for large-scale disease screening and social health security.

[299] SA^2GFM: Enhancing Robust Graph Foundation Models with Structure-Aware Semantic Augmentation

Junhua Shi, Qingyun Sun, Haonan Yuan, Xingcheng Fu

Main category: cs.LG

TL;DR: SA^2GFM is a robust Graph Foundation Model framework that improves domain adaptation through structure-aware semantic augmentation and information bottleneck compression, outperforming 9 SOTA baselines on node/graph classification tasks.

DetailsMotivation: Current Graph Foundation Models lack robustness against domain noise, structural perturbations, and adversarial attacks, with insufficient modeling of hierarchical structural semantics crucial for generalization.

Method: 1) Encode hierarchical structural priors via entropy-based encoding trees into structure-aware textual prompts for feature augmentation; 2) Self-supervised Information Bottleneck mechanism for robust representation distillation; 3) Expert adaptive routing with mixture-of-experts and null expert design to prevent negative transfer; 4) Fine-tuning module with joint intra- and inter-community structure learning.

Result: Extensive experiments show SA^2GFM outperforms 9 state-of-the-art baselines in effectiveness and robustness against random noise and adversarial perturbations for both node and graph classification tasks.

Conclusion: SA^2GFM provides a robust GFM framework that effectively addresses domain adaptation challenges through structure-aware semantic augmentation and adaptive mechanisms, demonstrating superior performance and robustness.

Abstract: Graph Foundation Models (GFMs) have made significant progress in various tasks, but their robustness against domain noise, structural perturbations, and adversarial attacks remains underexplored. A key limitation is the insufficient modeling of hierarchical structural semantics, which are crucial for generalization. In this paper, we propose SA^2GFM, a robust GFM framework that improves domain-adaptive representations through Structure-Aware Semantic Augmentation. First, we encode hierarchical structural priors by transforming entropy-based encoding trees into structure-aware textual prompts for feature augmentation. The enhanced inputs are processed by a self-supervised Information Bottleneck mechanism that distills robust, transferable representations via structure-guided compression. To address negative transfer in cross-domain adaptation, we introduce an expert adaptive routing mechanism, combining a mixture-of-experts architecture with a null expert design. For efficient downstream adaptation, we propose a fine-tuning module that optimizes hierarchical structures through joint intra- and inter-community structure learning. Extensive experiments demonstrate that SA^2GFM outperforms 9 state-of-the-art baselines in terms of effectiveness and robustness against random noise and adversarial perturbations for node and graph classification.

[300] FAIM: Frequency-Aware Interactive Mamba for Time Series Classification

Da Zhang, Bingyu Li, Zhiyuan Zhao, Yanhan Zhang, Junyu Gao, Feiping Nie, Xuelong Li

Main category: cs.LG

TL;DR: FAIM is a lightweight Frequency-Aware Interactive Mamba model for time series classification that uses Fourier Transform-based adaptive filtering and interactive Mamba blocks to achieve better accuracy-efficiency trade-off than SOTA methods.

DetailsMotivation: Deep learning models for time series classification often have high computational costs, are sensitive to noise, and overfit on small datasets. There's a need for lightweight models that can effectively capture discriminative temporal patterns while being robust to noise.

Method: 1) Adaptive Filtering Block (AFB) using Fourier Transform to extract frequency-domain features with learnable adaptive thresholds for noise suppression. 2) Interactive Mamba Block (IMB) for multi-granularity information interaction to balance fine-grained and global contextual features. 3) Self-supervised pre-training mechanism to enhance understanding of temporal patterns and robustness.
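
A minimal sketch of the frequency-domain filtering mechanism with a learnable soft threshold (rFFT, shrink low-magnitude bins, inverse rFFT); this stands in for the AFB's general idea and omits the global/local coupling and the Mamba blocks.

```python
import math
import torch
import torch.nn as nn

class AdaptiveFrequencyFilter(nn.Module):
    def __init__(self, seq_len):
        super().__init__()
        n_bins = seq_len // 2 + 1
        self.threshold = nn.Parameter(torch.full((n_bins,), 0.1))   # learnable, per bin

    def forward(self, x):                          # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        mag = spec.abs()
        kept = torch.clamp(mag - torch.relu(self.threshold), min=0.0)  # soft-threshold magnitudes
        spec = spec * (kept / (mag + 1e-8))
        return torch.fft.irfft(spec, n=x.size(-1), dim=-1)

# Example: a batch of noisy sinusoids passes through the (trainable) filter.
t = torch.linspace(0, 1, 256)
clean = torch.sin(2 * math.pi * 5 * t).repeat(8, 1)
noisy = clean + 0.3 * torch.randn(8, 256)
filtered = AdaptiveFrequencyFilter(seq_len=256)(noisy)   # differentiable, so thresholds can be learned
```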

Result: FAIM consistently outperforms existing state-of-the-art methods on multiple benchmarks, achieving superior trade-off between accuracy and efficiency with outstanding performance across various domains and high-noise scenarios.

Conclusion: FAIM provides an effective solution for time series classification by combining frequency-aware adaptive filtering with interactive Mamba architecture, offering lightweight yet powerful representation learning that addresses computational cost, noise sensitivity, and overfitting issues.

Abstract: Time series classification (TSC) is crucial in numerous real-world applications, such as environmental monitoring, medical diagnosis, and posture recognition. TSC tasks require models to effectively capture discriminative information for accurate class identification. Although deep learning architectures excel at capturing temporal dependencies, they often suffer from high computational cost, sensitivity to noise perturbations, and susceptibility to overfitting on small-scale datasets. To address these challenges, we propose FAIM, a lightweight Frequency-Aware Interactive Mamba model. Specifically, we introduce an Adaptive Filtering Block (AFB) that leverages Fourier Transform to extract frequency-domain features from time series data. The AFB incorporates learnable adaptive thresholds to dynamically suppress noise and employs element-wise coupling of global and local semantic adaptive filtering, enabling in-depth modeling of the synergy among different frequency components. Furthermore, we design an Interactive Mamba Block (IMB) to facilitate efficient multi-granularity information interaction, balancing the extraction of fine-grained discriminative features and comprehensive global contextual information, thereby endowing FAIM with powerful and expressive representations for TSC tasks. Additionally, we incorporate a self-supervised pre-training mechanism to enhance FAIM’s understanding of complex temporal patterns and improve its robustness across various domains and high-noise scenarios. Extensive experiments on multiple benchmarks demonstrate that FAIM consistently outperforms existing state-of-the-art (SOTA) methods, achieving a superior trade-off between accuracy and efficiency and exhibits outstanding performance.

[301] SetAD: Semi-Supervised Anomaly Learning in Contextual Sets

Jianling Gao, Chongyang Tao, Xuelian Lin, Junfeng Liu, Shuai Ma

Main category: cs.LG

TL;DR: SetAD reframes semi-supervised anomaly detection as a set-level task using attention-based set encoding and graded learning to capture group-level anomaly patterns, outperforming point/pair-based methods.

DetailsMotivation: Existing semi-supervised AD methods focus on individual points or simple pairs, which overlooks the contextual nature of anomalies (defined by deviation from groups) and fails to exploit combinatorial supervisory signals from sets. This limits their ability to capture high-order interactions critical for learning discriminative representations.

Method: Proposes SetAD framework with: 1) Attention-based set encoder trained via graded learning objective to quantify anomalousness of entire sets, 2) Context-calibrated anomaly scoring that aggregates normalized deviations from peer behavior across multiple diverse contextual sets for robustness and calibration.
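
A compact sketch of the context-calibrated scoring idea on its own: a point is scored by its normalized deviation from randomly sampled peer sets, averaged over several sets. The learned attention set encoder and the graded training objective are not reproduced here.

```python
import numpy as np

def context_calibrated_score(x, pool, set_size=32, n_sets=10, rng=None):
    """x: (D,) query point; pool: (N, D) reference points to draw context sets from."""
    rng = rng or np.random.default_rng()
    scores = []
    for _ in range(n_sets):
        ctx = pool[rng.choice(len(pool), size=set_size, replace=False)]
        mu, sigma = ctx.mean(axis=0), ctx.std(axis=0) + 1e-8
        z = (x - mu) / sigma                       # per-feature deviation from this peer set
        scores.append(np.linalg.norm(z) / np.sqrt(len(x)))
    return float(np.mean(scores))                  # aggregate across diverse contexts

rng = np.random.default_rng(0)
normal_pool = rng.normal(0, 1, size=(1000, 8))
print("inlier :", context_calibrated_score(rng.normal(0, 1, 8), normal_pool, rng=rng))
print("outlier:", context_calibrated_score(rng.normal(4, 1, 8), normal_pool, rng=rng))
```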

Result: Extensive experiments on 10 real-world datasets show SetAD significantly outperforms state-of-the-art models. Performance consistently improves with increasing set size, providing empirical support for set-based formulation.

Conclusion: Set-based formulation of anomaly detection effectively captures group-level interactions defining anomalies. The approach demonstrates superior performance and scalability, with set size positively correlating with improved detection accuracy.

Abstract: Semi-supervised anomaly detection (AD) has shown great promise by effectively leveraging limited labeled data. However, existing methods are typically structured around scoring individual points or simple pairs. Such {point- or pair-centric} view not only overlooks the contextual nature of anomalies, which are defined by their deviation from a collective group, but also fails to exploit the rich supervisory signals that can be generated from the combinatorial composition of sets. Consequently, such models struggle to exploit the high-order interactions within the data, which are critical for learning discriminative representations. To address these limitations, we propose SetAD, a novel framework that reframes semi-supervised AD as a Set-level Anomaly Detection task. SetAD employs an attention-based set encoder trained via a graded learning objective, where the model learns to quantify the degree of anomalousness within an entire set. This approach directly models the complex group-level interactions that define anomalies. Furthermore, to enhance robustness and score calibration, we propose a context-calibrated anomaly scoring mechanism, which assesses a point’s anomaly score by aggregating its normalized deviations from peer behavior across multiple, diverse contextual sets. Extensive experiments on 10 real-world datasets demonstrate that SetAD significantly outperforms state-of-the-art models. Notably, we show that our model’s performance consistently improves with increasing set size, providing strong empirical support for the set-based formulation of anomaly detection.

[302] Pattern Recognition of Ozone-Depleting Substance Exports in Global Trade Data

Muhammad Sukri Bin Ramli

Main category: cs.LG

TL;DR: A machine learning framework using unsupervised techniques to detect suspicious trade patterns for environmental treaty monitoring, successfully identifying thousands of outliers and high-priority shipments for customs review.

DetailsMotivation: Need for new methods to monitor environmental treaties like the Montreal Protocol by analyzing large, complex customs datasets to detect suspicious trade patterns and support regulatory oversight.

Method: Unsupervised ML framework combining: 1) K-Means clustering to discover trade archetypes based on shipment value/weight, 2) Isolation Forest and IQR for anomaly detection of “mega-trades” and unusual price-per-kg values, 3) Heuristic flagging for tactics like vague descriptions, 4) Combined into priority score, 5) Validated with SHAP for explainability.
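A compact scikit-learn sketch of this layered pipeline is given below; the column names, cluster count, description heuristic, and score weights are all illustrative assumptions on synthetic data, and the SHAP validation step is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

# Toy trade records; columns and weights are illustrative, not the study's schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value_usd": rng.lognormal(8, 1.5, 5000),
    "weight_kg": rng.lognormal(5, 1.0, 5000),
    "description": rng.choice(["HCFC-22 cylinders", "refrigerant gas", "goods"], 5000),
})
df["price_per_kg"] = df["value_usd"] / df["weight_kg"]
feats = np.log1p(df[["value_usd", "weight_kg"]])

# 1) Trade archetypes via K-Means on shipment value/weight.
df["archetype"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)

# 2) Anomaly layers: Isolation Forest plus an IQR rule on price-per-kg.
df["iso_flag"] = (IsolationForest(random_state=0).fit_predict(feats) == -1).astype(int)
q1, q3 = df["price_per_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
df["iqr_flag"] = (~df["price_per_kg"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)).astype(int)

# 3) Heuristic flag for vague shipment descriptions.
df["vague_flag"] = (df["description"].str.len() < 10).astype(int)

# 4) Combine the layers into a priority score (weights are arbitrary here).
df["priority"] = 2 * df["iso_flag"] + 2 * df["iqr_flag"] + df["vague_flag"]
print(df.sort_values("priority", ascending=False).head())
```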

Result: Applied to 100,000 trade records, identified 1,351 price outliers and 1,288 high-priority shipments for customs review. Found high-priority commodities have different value-to-weight ratios. Detected real-world regulatory impact correlation with US AIM Act. SHAP confirmed vague descriptions and high value as top risk predictors.

Conclusion: Presents a repeatable unsupervised learning pipeline to transform raw trade data into prioritized intelligence for regulatory groups, demonstrating effectiveness in treaty monitoring through systematic detection of suspicious trade patterns.

Abstract: New methods are needed to monitor environmental treaties, like the Montreal Protocol, by reviewing large, complex customs datasets. This paper introduces a framework using unsupervised machine learning to systematically detect suspicious trade patterns and highlight activities for review. Our methodology, applied to 100,000 trade records, combines several ML techniques. Unsupervised Clustering (K-Means) discovers natural trade archetypes based on shipment value and weight. Anomaly Detection (Isolation Forest and IQR) identifies rare “mega-trades” and shipments with commercially unusual price-per-kilogram values. This is supplemented by Heuristic Flagging to find tactics like vague shipment descriptions. These layers are combined into a priority score, which successfully identified 1,351 price outliers and 1,288 high-priority shipments for customs review. A key finding is that high-priority commodities show a different and more valuable value-to-weight ratio than general goods. This was validated using Explainable AI (SHAP), which confirmed vague descriptions and high value as the most significant risk predictors. The model’s sensitivity was validated by its detection of a massive spike in “mega-trades” in early 2021, correlating directly with the real-world regulatory impact of the US AIM Act. This work presents a repeatable unsupervised learning pipeline to turn raw trade data into prioritized, usable intelligence for regulatory groups.

[303] Using Text-Based Life Trajectories from Swedish Register Data to Predict Residential Mobility with Pretrained Transformers

Philipp Stark, Alexandros Sopasakis, Ola Hall, Markus Grillitsch

Main category: cs.LG

TL;DR: Researchers convert Swedish register data for 6.9M individuals into textual life trajectories to address categorical variable challenges and predict residential mobility using NLP models.

DetailsMotivation: To address two long-standing challenges in data analysis: high cardinality of categorical variables and inconsistencies in coding schemes over time, using comprehensive population register data.

Method: Transform Swedish register data (2001-2013) for 6.9M individuals into textual life trajectories combining demographic info with annual changes in residence, work, education, income, and family. Compare multiple NLP architectures (LSTM, DistilBERT, BERT, Qwen) to predict residential mobility (2013-2017).
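The textualization step can be illustrated with a toy function; the field names and sentence template below are hypothetical, since the actual Swedish register variables are far richer and coded.

```python
def record_to_text(year_rows):
    """Sketch: turn one person's annual register rows into a textual life trajectory."""
    sentences = []
    for row in year_rows:
        sentences.append(
            f"In {row['year']}, aged {row['age']}, lived in {row['municipality']}, "
            f"worked in {row['industry']}, income decile {row['income_decile']}, "
            f"household: {row['family']}."
        )
    return " ".join(sentences)

person = [
    {"year": 2001, "age": 27, "municipality": "Lund", "industry": "education",
     "income_decile": 4, "family": "single"},
    {"year": 2002, "age": 28, "municipality": "Malmö", "industry": "manufacturing",
     "income_decile": 6, "family": "cohabiting, no children"},
]
print(record_to_text(person))   # such texts would then be fed to an LSTM / BERT / Qwen classifier
```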

Result: Sequential and transformer-based models capture temporal and semantic structure more effectively than baseline models. Textualized register data preserves meaningful information about individual pathways and supports complex, scalable modeling.

Conclusion: Combining semantically rich register data with modern language models can substantially advance longitudinal analysis in social sciences, providing a rigorous testbed for developing new sequence-modeling approaches.

Abstract: We transform large-scale Swedish register data into textual life trajectories to address two long-standing challenges in data analysis: high cardinality of categorical variables and inconsistencies in coding schemes over time. Leveraging this uniquely comprehensive population register, we convert register data from 6.9 million individuals (2001-2013) into semantically rich texts and predict individuals’ residential mobility in later years (2013-2017). These life trajectories combine demographic information with annual changes in residence, work, education, income, and family circumstances, allowing us to assess how effectively such sequences support longitudinal prediction. We compare multiple NLP architectures (including LSTM, DistilBERT, BERT, and Qwen) and find that sequential and transformer-based models capture temporal and semantic structure more effectively than baseline models. The results show that textualized register data preserves meaningful information about individual pathways and supports complex, scalable modeling. Because few countries maintain longitudinal microdata with comparable coverage and precision, this dataset enables analyses and methodological tests that would be difficult or impossible elsewhere, offering a rigorous testbed for developing and evaluating new sequence-modeling approaches. Overall, our findings demonstrate that combining semantically rich register data with modern language models can substantially advance longitudinal analysis in social sciences.

[304] Command & Control (C2) Traffic Detection via Algorithm-Generated Domain (DGA) Classification Using Deep Learning and Natural Language Processing

Maria Milena Araujo Felix

Main category: cs.LG

TL;DR: Deep Learning and NLP approach detects DGA domains with 97.2% accuracy, outperforming traditional entropy analysis for complex malware patterns.

DetailsMotivation: Modern malware uses Domain Generation Algorithms (DGA) to create thousands of dynamic addresses daily, making static blacklist-based defenses obsolete and bypassing traditional firewalls.

Method: Collected hybrid database of 50,000 legitimate and 50,000 malicious domains, extracted lexical features, and trained a Recurrent Neural Network (LSTM) using NLP techniques.
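A minimal PyTorch sketch of such a character-level LSTM classifier follows; the vocabulary, dimensions, and encoding are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DomainLSTM(nn.Module):
    """Sketch of a character-level LSTM for DGA vs. legitimate domain classification."""
    def __init__(self, vocab_size=40, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        # tokens: (batch, max_len) integer-encoded characters, 0 = padding
        _, (h, _) = self.lstm(self.embed(tokens))
        return self.head(h[-1]).squeeze(-1)      # logit: > 0 suggests DGA

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-."
def encode(domain, max_len=63):
    ids = [CHARS.index(c) + 1 for c in domain.lower() if c in CHARS][:max_len]
    return torch.tensor(ids + [0] * (max_len - len(ids)))

model = DomainLSTM()
batch = torch.stack([encode("google.com"), encode("xj3k9q0zvty2.biz")])
print(torch.sigmoid(model(batch)))               # untrained probabilities, for illustration only
```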

Result: Achieved 97.2% accuracy in detecting DGA domains, with the neural network approach outperforming statistical entropy analysis on complex patterns and reducing false positives in ambiguous lawful-traffic scenarios.

Conclusion: Deep Learning with NLP provides effective detection of sophisticated DGA domains, offering a robust solution against modern malware communication techniques that evade traditional security measures.

Abstract: The sophistication of modern malware, specifically regarding communication with Command and Control (C2) servers, has rendered static blacklist-based defenses obsolete. The use of Domain Generation Algorithms (DGA) allows attackers to generate thousands of dynamic addresses daily, hindering blocking by traditional firewalls. This paper aims to propose and evaluate a method for detecting DGA domains using Deep Learning and Natural Language Processing (NLP) techniques. The methodology consisted of collecting a hybrid database containing 50,000 legitimate and 50,000 malicious domains, followed by the extraction of lexical features and the training of a Recurrent Neural Network (LSTM). Results demonstrated that while statistical entropy analysis is effective for simple DGAs, the Neural Network approach presents superiority in detecting complex patterns, reaching 97.2% accuracy and reducing the false positive rate in ambiguous lawful traffic scenarios.

[305] Bayesian Optimization for Function-Valued Responses under Min-Max Criteria

Pouya Ahadi, Reza Marzban, Ali Adibi, Kamran Paynabar

Main category: cs.LG

TL;DR: MM-FBO is a Bayesian optimization framework for functional responses that minimizes maximum error across the functional domain using functional PCA and Gaussian process surrogates.

DetailsMotivation: Most Bayesian optimization methods focus on scalar responses, but many scientific/engineering problems have functional responses varying smoothly over time/wavelength. Existing methods minimize integrated error (average performance) but neglect worst-case deviations.

Method: Functional responses represented using functional principal component analysis (FPCA). Gaussian process surrogates constructed for principal component scores. Introduces integrated uncertainty acquisition function balancing exploitation of worst-case expected error with exploration across functional domain.

Result: Provides two theoretical guarantees: discretization bound for worst-case objective, and consistency result showing acquisition converges to true min-max objective as surrogate becomes accurate. Outperforms existing baselines on synthetic benchmarks and physics-inspired case studies (electromagnetic scattering, vapor phase infiltration).

Conclusion: MM-FBO effectively addresses limitations of existing functional Bayesian optimization methods by explicitly minimizing maximum error across functional domain. Highlights importance of modeling functional uncertainty in Bayesian optimization.

Abstract: Bayesian optimization is widely used for optimizing expensive black box functions, but most existing approaches focus on scalar responses. In many scientific and engineering settings the response is functional, varying smoothly over an index such as time or wavelength, which makes classical formulations inadequate. Existing methods often minimize integrated error, which captures average performance but neglects worst case deviations. To address this limitation we propose min-max Functional Bayesian Optimization (MM-FBO), a framework that directly minimizes the maximum error across the functional domain. Functional responses are represented using functional principal component analysis, and Gaussian process surrogates are constructed for the principal component scores. Building on this representation, MM-FBO introduces an integrated uncertainty acquisition function that balances exploitation of worst case expected error with exploration across the functional domain. We provide two theoretical guarantees: a discretization bound for the worst case objective, and a consistency result showing that as the surrogate becomes accurate and uncertainty vanishes, the acquisition converges to the true min-max objective. We validate the method through experiments on synthetic benchmarks and physics inspired case studies involving electromagnetic scattering by metaphotonic devices and vapor phase infiltration. Results show that MM-FBO consistently outperforms existing baselines and highlights the importance of explicitly modeling functional uncertainty in Bayesian optimization.

[306] Advancing physiological time series reconstruction and imputation via mixture of receptive fields and experts fusion

Ci Zhang, Huayu Li, Changdi Yang, Jiangnan Xia, Yanzhi Wang, Xiaolong Ma, Jin Lu, Geng Yuan

Main category: cs.LG

TL;DR: A novel Mixture of Experts (MoE)-based diffusion framework for medical time series reconstruction that uses RFAMoE for adaptive receptive fields and Fusion MoE for parallel noise generation, achieving SOTA performance with single-inference efficiency.

DetailsMotivation: Medical time series signals (physiological data) have unique challenges: multivariate, high temporal variability, noisy, and artifact-prone. Existing diffusion models for time series reconstruction are largely unexplored in medical domains, and current approaches requiring multiple inferences for error reduction incur high computational/latency costs.

Method: Proposes a score-based diffusion framework with two key innovations: 1) Receptive Field Adaptive MoE (RFAMoE) module that enables each channel to adaptively select desired receptive fields throughout diffusion, and 2) Fusion MoE module that generates K noise signals in parallel, fuses them via routing mechanism, and completes reconstruction in a single inference step.
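The single-pass fusion idea can be sketched as follows; the expert and router architectures are illustrative placeholders (small MLPs and a linear router), not the paper's diffusion noise estimator.

```python
import torch
import torch.nn as nn

class FusionMoESketch(nn.Module):
    """Sketch of the Fusion-MoE idea: K experts produce noise estimates in parallel
    and a router fuses them, so averaging-like variance reduction happens in a
    single inference pass."""
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)

    def forward(self, h):
        # h: (batch, seq_len, dim) diffusion features
        noise_k = torch.stack([e(h) for e in self.experts], dim=-2)    # (B, T, K, dim)
        weights = torch.softmax(self.router(h), dim=-1).unsqueeze(-1)  # (B, T, K, 1)
        return (weights * noise_k).sum(dim=-2)                         # fused noise estimate

x = torch.randn(2, 100, 64)
print(FusionMoESketch()(x).shape)    # torch.Size([2, 100, 64])
```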

Result: The framework consistently outperforms diffusion-based state-of-the-art methods on different tasks and datasets. It improves performance while eliminating the substantial computational cost and latency associated with multiple inference processes required by previous methods.

Conclusion: The proposed MoE-based diffusion framework effectively addresses the unique challenges of medical time series reconstruction, achieving superior performance with efficient single-inference operation, making it practical for real-world medical applications.

Abstract: Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of the physiological time series signals, such as multivariate, high temporal variability, highly noisy, and artifact-prone, make deep learning-based approaches still challenging for tasks such as imputation. Hence, we propose a novel Mixture of Experts (MoE)-based noise estimator within a score-based diffusion framework. Specifically, the Receptive Field Adaptive MoE (RFAMoE) module is designed to enable each channel to adaptively select desired receptive fields throughout the diffusion process. Moreover, recent literature has found that when generating a physiological signal, performing multiple inferences and averaging the reconstructed signals can effectively reduce reconstruction errors, but at the cost of significant computational and latency overhead. We design a Fusion MoE module and innovatively leverage the nature of MoE module to generate K noise signals in parallel, fuse them using a routing mechanism, and complete signal reconstruction in a single inference step. This design not only improves performance over previous methods but also eliminates the substantial computational cost and latency associated with multiple inference processes. Extensive results demonstrate that our proposed framework consistently outperforms diffusion-based SOTA works on different tasks and datasets.

[307] Balanced Accuracy: The Right Metric for Evaluating LLM Judges – Explained through Youden’s J statistic

Stephane Collot, Colin Fraser, Justin Zhao, William F. Shen, Timon Willi, Ilias Leontiadis

Main category: cs.LG

TL;DR: The paper proposes using Youden’s J statistic and Balanced Accuracy instead of traditional metrics like Accuracy, Precision, and F1 for selecting classifiers in LLM evaluation, as these traditional metrics are sensitive to class imbalance and can distort prevalence estimates.

DetailsMotivation: Current metrics for selecting classifiers in LLM evaluation (Accuracy, Precision, F1) are problematic because they're sensitive to class imbalance and arbitrary positive class choices, which can lead to distorted prevalence estimates when comparing model behaviors.

Method: The paper shows that Youden’s J statistic is theoretically aligned with choosing the best judge for model comparison, and demonstrates that Balanced Accuracy is an equivalent linear transformation of J. The approach is supported through analytical arguments, empirical examples, and simulations.
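The linear relationship is easy to verify numerically: on any confusion matrix, Balanced Accuracy equals (J + 1) / 2, so ranking judges by either quantity selects the same judge. The toy judge outputs below are fabricated for illustration only.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Toy judge decisions on an imbalanced evaluation set (illustrative, not from the paper).
y_true = np.array([1] * 90 + [0] * 10)                       # 90% positive prevalence
y_pred = np.array([1] * 80 + [0] * 10 + [0] * 7 + [1] * 3)   # the judge's labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)          # ~0.889
specificity = tn / (tn + fp)          # 0.700

youden_j = sensitivity + specificity - 1
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(youden_j, bal_acc, (youden_j + 1) / 2)   # Balanced Accuracy == (J + 1) / 2
```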

Result: Selecting judges using Balanced Accuracy leads to better, more robust classifier selection for LLM evaluation, providing more trustworthy prevalence estimates for comparing model behaviors.

Conclusion: Balanced Accuracy (and equivalently Youden’s J) should be preferred over traditional metrics for selecting classifiers in LLM evaluation to ensure more reliable and robust comparison of model behaviors.

Abstract: Rigorous evaluation of large language models (LLMs) relies on comparing models by the prevalence of desirable or undesirable behaviors, such as task pass rates or policy violations. These prevalence estimates are produced by a classifier, either an LLM-as-a-judge or human annotators, making the choice of classifier central to trustworthy evaluation. Common metrics used for this choice, such as Accuracy, Precision, and F1, are sensitive to class imbalance and to arbitrary choices of positive class, and can favor judges that distort prevalence estimates. We show that Youden’s $J$ statistic is theoretically aligned with choosing the best judge to compare models, and that Balanced Accuracy is an equivalent linear transformation of $J$. Through both analytical arguments and empirical examples and simulations, we demonstrate how selecting judges using Balanced Accuracy leads to better, more robust classifier selection.

[308] Controllable risk scenario generation from human crash data for autonomous vehicle testing

Qiujing Lu, Xuanhan Wang, Runze Yuan, Wei Lu, Xinyi Gong, Shuo Feng

Main category: cs.LG

TL;DR: CRAG is a framework for generating realistic background agents that can transition between normal and risk-prone behaviors to test autonomous vehicle safety.

DetailsMotivation: Testing AV safety requires simulating both everyday driving and rare safety-critical conditions, but current methods struggle to generate realistic risk-prone behaviors while maintaining nominal traffic realism.

Method: CRAG uses a structured latent space that disentangles normal and risk-related behaviors, combining risk-aware latent representations with optimization-based mode-transition mechanisms for smooth transitions between safe and risk states.

Result: CRAG improves diversity compared to existing baselines and enables controllable generation of risk scenarios for targeted AV evaluation.

Conclusion: The framework provides an effective approach for generating realistic risk scenarios to efficiently evaluate AV robustness in safety-critical conditions.

Abstract: Ensuring the safety of autonomous vehicles (AV) requires rigorous testing under both everyday driving and rare, safety-critical conditions. A key challenge lies in simulating environment agents, including background vehicles (BVs) and vulnerable road users (VRUs), that behave realistically in nominal traffic while also exhibiting risk-prone behaviors consistent with real-world accidents. We introduce Controllable Risk Agent Generation (CRAG), a framework designed to unify the modeling of dominant nominal behaviors and rare safety-critical behaviors. CRAG constructs a structured latent space that disentangles normal and risk-related behaviors, enabling efficient use of limited crash data. By combining risk-aware latent representations with optimization-based mode-transition mechanisms, the framework allows agents to shift smoothly and plausibly from safe to risk states over extended horizons, while maintaining high fidelity in both regimes. Extensive experiments show that CRAG improves diversity compared to existing baselines, while also enabling controllable generation of risk scenarios for targeted and efficient evaluation of AV robustness.

[309] Softly Symbolifying Kolmogorov-Arnold Networks

James Bagrow, Josh Bongard

Main category: cs.LG

TL;DR: S2KAN integrates symbolic primitives into KAN training with learnable gates and MDL objective, enabling interpretable symbolic forms when possible while falling back to dense splines when needed.

DetailsMotivation: Standard KANs lack symbolic fidelity despite their theoretical interpretability potential - they often learn pathological decompositions without meaningful correspondence to interpretable forms.

Method: S2KAN integrates symbolic primitives directly into training with learnable gates that sparsify representations. Uses differentiable sparsification guided by Minimum Description Length objective, allowing graceful degradation to dense splines when symbolic terms are insufficient.
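A minimal sketch of one such gated activation is shown below, assuming an illustrative primitive dictionary and a small MLP standing in for the dense spline term; the MDL-guided sparsification objective is reduced to a simple gate penalty and is not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SoftSymbolicActivation(nn.Module):
    """Sketch of an S2KAN-style edge activation: a gated mixture of symbolic
    primitives plus a dense fallback term."""
    def __init__(self):
        super().__init__()
        self.primitives = [torch.sin, torch.exp, lambda x: x, lambda x: x ** 2]
        self.dense = nn.Sequential(nn.Linear(1, 16), nn.SiLU(), nn.Linear(16, 1))
        self.gate_logits = nn.Parameter(torch.zeros(len(self.primitives) + 1))

    def forward(self, x):
        # x: (batch, 1)
        gates = torch.sigmoid(self.gate_logits)          # learnable soft on/off switches
        terms = [f(x) for f in self.primitives] + [self.dense(x)]
        return sum(g * t for g, t in zip(gates, terms))

    def sparsity_penalty(self):
        # Pushes gates toward zero so the activation stays sparse (and ideally symbolic).
        return torch.sigmoid(self.gate_logits).sum()

act = SoftSymbolicActivation()
x = torch.linspace(-1, 1, 5).unsqueeze(-1)
print(act(x).shape, act.sparsity_penalty().item())
```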

Result: Achieves competitive or superior accuracy with substantially smaller models across symbolic benchmarks, dynamical systems forecasting, and real-world prediction tasks. Shows evidence of emergent self-sparsification even without regularization pressure.

Conclusion: S2KAN successfully bridges symbolic and neural approaches, enabling interpretable symbolic forms when appropriate while maintaining flexibility through dense splines when needed, making KANs more practically interpretable.

Abstract: Kolmogorov-Arnold Networks (KANs) offer a promising path toward interpretable machine learning: their learnable activations can be studied individually, while collectively fitting complex data accurately. In practice, however, trained activations often lack symbolic fidelity, learning pathological decompositions with no meaningful correspondence to interpretable forms. We propose Softly Symbolified Kolmogorov-Arnold Networks (S2KAN), which integrate symbolic primitives directly into training. Each activation draws from a dictionary of symbolic and dense terms, with learnable gates that sparsify the representation. Crucially, this sparsification is differentiable, enabling end-to-end optimization, and is guided by a principled Minimum Description Length objective. When symbolic terms suffice, S2KAN discovers interpretable forms; when they do not, it gracefully degrades to dense splines. We demonstrate competitive or superior accuracy with substantially smaller models across symbolic benchmarks, dynamical systems forecasting, and real-world prediction tasks, and observe evidence of emergent self-sparsification even without regularization pressure.

[310] Fourier-Enhanced Recurrent Neural Networks for Electrical Load Time Series Downscaling

Qi Chen, Mihai Anitescu

Main category: cs.LG

TL;DR: Fourier-enhanced RNN with attention for electrical load downscaling outperforms Prophet and RNN ablations across multiple territories.

DetailsMotivation: Need for accurate electrical load downscaling from low-resolution to high-resolution forecasts, improving upon existing methods like Prophet and standard RNNs.

Method: Combines recurrent backbone with low-resolution inputs, explicit Fourier seasonal embeddings fused in latent space, and self-attention layer for intra-period dependencies.
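The Fourier seasonal features are easy to sketch; note that for brevity the features below are concatenated at the RNN input, whereas the paper fuses them in latent space, and the GRU backbone and all dimensions are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def fourier_seasonal_embedding(t, period, n_harmonics=3):
    """Sketch: sin/cos features at multiples of the seasonal frequency.
    t: (batch, seq_len) timestamps in the same units as `period`.
    Returns (batch, seq_len, 2 * n_harmonics)."""
    k = torch.arange(1, n_harmonics + 1, dtype=t.dtype)
    ang = 2 * math.pi * k * t.unsqueeze(-1) / period
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

t = torch.arange(48, dtype=torch.float32).repeat(4, 1)     # 4 series, 48 hourly steps
low_res = torch.randn(4, 48, 1)                            # low-resolution driver input
season = fourier_seasonal_embedding(t, period=24.0)        # daily seasonality
rnn = nn.GRU(input_size=1 + season.shape[-1], hidden_size=32, batch_first=True)
latent, _ = rnn(torch.cat([low_res, season], dim=-1))      # simple input-level fusion
print(latent.shape)                                        # torch.Size([4, 48, 32])
```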

Result: Achieves lower and flatter RMSE across four PJM territories compared to Prophet baselines (with/without seasonality/LAA) and RNN ablations without attention or Fourier features.

Conclusion: The Fourier-enhanced RNN with attention effectively improves electrical load downscaling accuracy by incorporating seasonal patterns and intra-period dependencies.

Abstract: We present a Fourier-enhanced recurrent neural network (RNN) for downscaling electrical loads. The model combines (i) a recurrent backbone driven by low-resolution inputs, (ii) explicit Fourier seasonal embeddings fused in latent space, and (iii) a self-attention layer that captures dependencies among high-resolution components within each period. Across four PJM territories, the approach yields RMSE lower and flatter horizon-wise than classical Prophet baselines (with and without seasonality/LAA) and than RNN ablations without attention or Fourier features.

[311] Artificial Intelligence-Driven Network-on-Chip Design Space Exploration: Neural Network Architectures for Design

Amogh Anshu N, Harish BP

Main category: cs.LG

TL;DR: ML-driven framework automates NoC design space exploration using BookSim simulations and reverse neural networks, with Conditional Diffusion Model achieving best accuracy and significant time reduction.

DetailsMotivation: Traditional NoC design space exploration techniques are slow and struggle with complex, non-linear parameter interactions in high-dimensional configuration spaces needed to meet throughput and latency requirements.

Method: Machine learning framework using BookSim simulations and reverse neural network models, comparing MLP, Conditional Diffusion Model, and CVAE to predict optimal NoC parameters from target performance metrics, generating over 150,000 simulation data points across varied mesh topologies.

Result: Conditional Diffusion Model achieved highest predictive accuracy with MSE of 0.463 on unseen data, and the framework reduces design exploration time by several orders of magnitude.

Conclusion: The proposed framework provides a practical solution for rapid and scalable NoC co-design, significantly accelerating the design exploration process while maintaining accuracy.

Abstract: Network-on-Chip (NoC) design requires exploring a high-dimensional configuration space to satisfy stringent throughput requirements and latency constraints.Traditional design space exploration techniques are often slow and struggle to handle complex, non-linear parameter interactions.This work presents a machine learning-driven framework that automates NoC design space exploration using BookSim simulations and reverse neural network models.Specifically, we compare three architectures - a Multi-Layer Perceptron (MLP),a Conditional Diffusion Model, and a Conditional Variational Autoencoder (CVAE) to predict optimal NoC parameters given target performance metrics.Our pipeline generates over 150,000 simulation data points across varied mesh topologies.The Conditional Diffusion Model achieved the highest predictive accuracy, attaining a mean squared error (MSE) of 0.463 on unseen data.Furthermore, the proposed framework reduces design exploration time by several orders of magnitude, making it a practical solution for rapid and scalable NoC co-design.

[312] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training

Jakub Krajewski, Amitis Shidani, Dan Busbridge, Sam Wiseman, Jason Ramapuram

Main category: cs.LG

TL;DR: A framework for directly modeling downstream task performance scaling from training budget, showing power law relationships work better than traditional proxy metrics.

DetailsMotivation: Traditional scaling laws focus on proxy metrics like pretraining loss, but predicting downstream task performance has been considered unreliable. This paper challenges that view.

Method: Proposes a direct framework to model benchmark performance scaling from training budget, using simple power laws for fixed token-to-parameter ratios, and introduces functional forms that predict accuracy across different ratios and account for inference compute.
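For intuition, fitting a saturating power law to log accuracy versus training budget takes only a few lines of SciPy; the functional form, units, and data points below are illustrative assumptions, not the paper's fitted law or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic benchmark accuracies at increasing training budgets (in units of 1e19 FLOPs),
# for a fixed token-to-parameter ratio; values are illustrative only.
C = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
acc = np.array([0.34, 0.41, 0.49, 0.55, 0.62])

def log_acc_power_law(C, a, b, c):
    # One plausible saturating form: log accuracy = c - a * C ** (-b)
    return c - a * C ** (-b)

params, _ = curve_fit(log_acc_power_law, C, np.log(acc), p0=[1.0, 0.3, 0.0], maxfev=20000)
a, b, c = params
pred = np.exp(log_acc_power_law(1000.0, a, b, c))   # extrapolate to a 10x larger budget
print(params, pred)
```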

Result: The direct approach extrapolates better than previously proposed two-stage procedures, and the framework is validated on models up to 17B parameters trained on up to 350B tokens across two dataset mixtures.

Conclusion: Direct modeling of downstream performance scaling from training budget is reliable and outperforms traditional proxy-based approaches, with released data supporting reproducibility and future research.

Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of benchmark performance from the training budget. We find that for a fixed token-to-parameter ratio, a simple power law can accurately describe the scaling behavior of log accuracy on multiple popular downstream tasks. Our results show that the direct approach extrapolates better than the previously proposed two-stage procedure, which is prone to compounding errors. Furthermore, we introduce functional forms that predict accuracy across token-to-parameter ratios and account for inference compute under repeated sampling. We validate our findings on models with up to 17B parameters trained on up to 350B tokens across two dataset mixtures. To support reproducibility and encourage future research, we release the complete set of pretraining losses and downstream evaluation results.

[313] Graph Contrastive Learning via Spectral Graph Alignment

Manh Nguyen, Joshua Cape

Main category: cs.LG

TL;DR: SpecMatch-CL is a novel contrastive learning loss that aligns graph embeddings by minimizing differences between normalized Laplacians of view-specific graph-of-graphs, achieving SOTA on multiple benchmarks.

DetailsMotivation: Existing contrastive learning methods for graphs (like InfoNCE) align pairwise embeddings across augmented views but lack control over the global structure of the view-specific graph-of-graphs built from these embeddings.

Method: SpecMatch-CL introduces a loss function that aligns view-specific graph-of-graphs by minimizing the difference between their normalized Laplacians, providing theoretical guarantees on alignment quality.
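The alignment term can be sketched directly in PyTorch; the softmax-of-cosine adjacency used here to build each view-specific graph-of-graphs is an illustrative choice and not necessarily the paper's construction.

```python
import torch

def normalized_laplacian(emb, temperature=0.5):
    """Sketch: build a similarity graph over a batch of graph embeddings and return
    its symmetrically normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    z = torch.nn.functional.normalize(emb, dim=-1)
    A = torch.softmax(z @ z.T / temperature, dim=-1)
    A = 0.5 * (A + A.T)                              # symmetrize the adjacency
    d_inv_sqrt = A.sum(dim=-1).clamp_min(1e-8).rsqrt()
    return torch.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def specmatch_style_loss(emb_view1, emb_view2):
    # Align the two view-specific graph-of-graphs by matching their normalized Laplacians.
    L1, L2 = normalized_laplacian(emb_view1), normalized_laplacian(emb_view2)
    return (L1 - L2).pow(2).sum()                    # squared Frobenius norm

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(specmatch_style_loss(z1, z2))
```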

Result: Achieves new state-of-the-art on eight TU benchmarks in unsupervised and semi-supervised learning at low label rates, and shows consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.

Conclusion: The normalized Laplacian alignment approach in SpecMatch-CL effectively controls global structure in graph contrastive learning, leading to superior performance across diverse graph learning tasks.

Abstract: Given augmented views of each input graph, contrastive learning methods (e.g., InfoNCE) optimize pairwise alignment of graph embeddings across views while providing no mechanism to control the global structure of the view-specific graph-of-graphs built from these embeddings. We introduce SpecMatch-CL, a novel loss function that aligns the view-specific graph-of-graphs by minimizing the difference between their normalized Laplacians. Theoretically, we show that under certain assumptions, the difference between normalized Laplacians provides an upper bound not only for the difference between the ideal Perfect Alignment contrastive loss and the current loss, but also for the uniformity loss. Empirically, SpecMatch-CL establishes a new state of the art on eight TU benchmarks under unsupervised learning and semi-supervised learning at low label rates, and yields consistent gains in transfer learning on PPI-306K and ZINC 2M datasets.

[314] Nonnegative Matrix Factorization through Cone Collapse

Manh Nguyen, Daniel Pimentel-Alarcón

Main category: cs.LG

TL;DR: Cone Collapse algorithm for NMF-based clustering that explicitly recovers the data cone geometry, outperforming existing methods on 16 benchmark datasets.

DetailsMotivation: Existing NMF/ONMF algorithms don't exploit the conic geometry of data where points lie in a convex cone whose extreme rays encode fundamental topics/directions.

Method: Proposes Cone Collapse algorithm that starts from full nonnegative orthant and iteratively shrinks it toward minimal cone generated by data, then applies uni-orthogonal NMF to recovered extreme rays (CC-NMF).

Result: Under mild assumptions, Cone Collapse terminates finitely and recovers minimal generating cone. CC-NMF consistently matches or outperforms strong NMF baselines across 16 gene-expression, text, and image datasets in clustering purity.

Conclusion: Explicitly recovering the data cone yields both theoretically grounded and empirically strong NMF-based clustering methods.

Abstract: Nonnegative matrix factorization (NMF) is a widely used tool for learning parts-based, low-dimensional representations of nonnegative data, with applications in vision, text, and bioinformatics. In clustering applications, orthogonal NMF (ONMF) variants further impose (approximate) orthogonality on the representation matrix so that its rows behave like soft cluster indicators. Existing algorithms, however, are typically derived from optimization viewpoints and do not explicitly exploit the conic geometry induced by NMF: data points lie in a convex cone whose extreme rays encode fundamental directions or “topics”. In this work we revisit NMF from this geometric perspective and propose Cone Collapse, an algorithm that starts from the full nonnegative orthant and iteratively shrinks it toward the minimal cone generated by the data. We prove that, under mild assumptions on the data, Cone Collapse terminates in finitely many steps and recovers the minimal generating cone of $\mathbf{X}^\top$ . Building on this basis, we then derive a cone-aware orthogonal NMF model (CC-NMF) by applying uni-orthogonal NMF to the recovered extreme rays. Across 16 benchmark gene-expression, text, and image datasets, CC-NMF consistently matches or outperforms strong NMF baselines-including multiplicative updates, ANLS, projective NMF, ONMF, and sparse NMF-in terms of clustering purity. These results demonstrate that explicitly recovering the data cone can yield both theoretically grounded and empirically strong NMF-based clustering methods.

[315] Semi-Supervised Contrastive Learning with Orthonormal Prototypes

Huanran Li, Manh Nguyen, Daniel Pimentel-Alarcón

Main category: cs.LG

TL;DR: CLOP is a novel semi-supervised contrastive loss that prevents dimensional collapse by promoting orthogonal subspaces for class embeddings, improving performance and stability across learning rates and batch sizes.

DetailsMotivation: Contrastive learning suffers from dimensional collapse where embeddings converge into lower-dimensional spaces, especially in semi-supervised and self-supervised setups. The authors identify that standard contrastive losses collapse beyond a critical learning-rate threshold.

Method: The paper proposes CLOP, a novel semi-supervised loss function designed to prevent dimensional collapse by promoting the formation of orthogonal linear subspaces among class embeddings. This builds on insights about learning-rate thresholds that cause collapse in standard contrastive losses.

Result: Extensive experiments on real and synthetic datasets show that CLOP improves performance in image classification and object detection tasks while exhibiting greater stability across different learning rates and batch sizes compared to standard approaches.

Conclusion: CLOP effectively addresses the dimensional collapse problem in contrastive learning by enforcing orthogonal subspace structure in embeddings, leading to more robust and stable representations that perform better across various vision tasks.

Abstract: Contrastive learning has emerged as a powerful method in deep learning, excelling at learning effective representations through contrasting samples from different distributions. However, dimensional collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge, especially in semi-supervised and self-supervised setups. In this paper, we first identify a critical learning-rate threshold, beyond which standard contrastive losses converge to collapsed solutions. Building on these insights, we propose CLOP, a novel semi-supervised loss function designed to prevent dimensional collapse by promoting the formation of orthogonal linear subspaces among class embeddings. Through extensive experiments on real and synthetic datasets, we demonstrate that CLOP improves performance in image classification and object detection tasks while also exhibiting greater stability across different learning rates and batch sizes.

[316] GSPN-2: Efficient Parallel Sequence Modeling

Hongjun Wang, Yitong Jiang, Collin McCarthy, David Wehr, Hanrong Ye, Xinhao Li, Ka Chun Cheung, Wonmin Byeon, Jinwei Gu, Ke Chen, Kai Han, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

Main category: cs.LG

TL;DR: GSPN-2 is an optimized version of Generalized Spatial Propagation Network that addresses GPU inefficiencies in the original GSPN implementation while introducing a compact channel propagation strategy, achieving transformer-level accuracy with significantly lower computational cost.

DetailsMotivation: The paper addresses the computational inefficiency of vision transformers for high-resolution images and long videos. While GSPN improved efficiency by replacing quadratic self-attention with linear-cost propagation, its implementation suffered from GPU overhead, excessive memory transfers, and redundant computations.

Method: GSPN-2 combines algorithm-system redesign: (1) System optimization: consolidates thousands of micro-kernel launches into a single 2D kernel, pins warps to channel slices, and stages activations in shared memory. (2) Algorithm optimization: introduces compact channel propagation strategy replacing per-channel matrices, reducing parameters and aligning with transformer attention affinity maps.

Result: Experiments show GSPN-2 maintains transformer-level accuracy across image classification and text-to-image synthesis tasks while achieving significantly lower computational cost, establishing a new efficiency frontier for global spatial context modeling.

Conclusion: GSPN-2 successfully addresses the GPU inefficiencies of the original GSPN through joint algorithm-system optimization, enabling efficient vision transformer alternatives for high-resolution and long-video applications while maintaining accuracy.

Abstract: Vision transformer efficiency remains a bottleneck for real-world applications involving high-resolution images and long videos. Generalized Spatial Propagation Network (GSPN) addresses this by replacing quadratic self-attention with a line-scan propagation scheme, bringing the cost close to linear in the number of rows or columns, while retaining accuracy. Despite this advancement, the existing GSPN implementation still suffers from (i) heavy overhead due to repeatedly launching GPU kernels, (ii) excessive data transfers from global GPU memory, and (iii) redundant computations caused by maintaining separate propagation weights for each channel. We introduce GSPN-2, a joint algorithm-system redesign. In particular, we consolidate thousands of micro-launches from the previous implementation into one single 2D kernel, explicitly pin one warp to each channel slice, and stage the previous column’s activations in shared memory. On the model side, we introduce a compact channel propagation strategy that replaces per-channel matrices, trimming parameters and aligning naturally with the affinity map used in transformer attention. Experiments demonstrate GSPN-2’s effectiveness across image classification and text-to-image synthesis tasks, matching transformer-level accuracy with significantly lower computational cost. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications through its unique combination of structured matrix transformations and GPU-optimized implementation. Project page: https://whj363636.github.io/GSPN2/

[317] ByteStorm: a multi-step data-driven approach for Tropical Cyclones detection and tracking

Davide Donno, Donatello Elia, Gabriele Accarino, Marco De Carlo, Enrico Scoccimarro, Silvio Gualdi

Main category: cs.LG

TL;DR: ByteStorm is a deep learning framework for tropical cyclone tracking that uses vorticity and pressure data with BYTE algorithm for track linking, outperforming traditional threshold-based methods.

DetailsMotivation: Traditional TC tracking methods rely on subjective thresholds that introduce biases and lack adaptability to different geographical regions. There's a need for more accurate, data-driven approaches that don't require manual threshold tuning.

Method: ByteStorm combines deep learning networks for TC center detection (classification and localization) using only 850 mb relative vorticity and mean sea-level pressure data, then links detected centers into tracks using the BYTE algorithm.

Result: ByteStorm achieves superior performance: Probability of Detection (85.05% ENP, 79.48% WNP), False Alarm Rate (23.26% ENP, 16.14% WNP), and high Inter-Annual Variability correlations (0.75 ENP, 0.69 WNP).

Conclusion: The integration of deep learning and computer vision offers a fast, accurate, and robust alternative to traditional threshold-based TC tracking methods, with strong performance across different Pacific basins.

Abstract: Accurate tracking of tropical cyclones (TCs) represents a critical challenge in the context of weather and climate science. Traditional tracking schemes mainly rely on subjective thresholds, which may introduce biases in their skill depending on the geographical region of application. We present ByteStorm, an efficient data-driven framework for reconstructing TC tracks without threshold tuning. It leverages deep learning networks to detect TC centers (via classification and localization), using only relative vorticity (850 mb) and mean sea-level pressure. Then, detected centers are linked into TC tracks through the BYTE algorithm. ByteStorm is evaluated against state-of-the-art deterministic trackers in the East- and West-North Pacific basins (ENP and WNP). The proposed framework achieves superior performance in terms of Probability of Detection (85.05% ENP, 79.48% WNP), False Alarm Rate (23.26% ENP, 16.14% WNP), and high Inter-Annual Variability correlations (0.75 ENP and 0.69 WNP). These results highlight the potential of integrating deep learning and computer vision for fast and accurate TC tracking, offering a robust alternative to traditional approaches.

[318] Towards symbolic regression for interpretable clinical decision scores

Guilherme Seidyo Imai Aldeia, Joseph D. Romano, Fabricio Olivetti de Franca, Daniel S. Herman, William G. La Cava

Main category: cs.LG

TL;DR: Brush is a symbolic regression algorithm that combines decision-tree splitting with non-linear optimization to create interpretable models with rule-based logic for medical decision-making.

DetailsMotivation: Traditional symbolic regression can't effectively model medical decision-making that combines risk equations with rules. There's a need for data-driven, interpretable clinical risk scores that can incorporate rule-based logic.

Method: Brush combines decision-tree-like splitting algorithms with non-linear constant optimization, enabling seamless integration of rule-based logic into symbolic regression and classification models.

Result: Brush achieves Pareto-optimal performance on SRBench, accurately recapitulates two widely used clinical scoring systems with high accuracy and interpretability, and outperforms decision trees, random forests, and other SR methods while producing simpler models.

Conclusion: Brush successfully bridges the gap between symbolic regression and rule-based medical decision-making, offering a promising approach for developing data-driven, interpretable clinical risk scores that maintain both predictive performance and simplicity.

Abstract: Medical decision-making makes frequent use of algorithms that combine risk equations with rules, providing clear and standardized treatment pathways. Symbolic regression (SR) traditionally limits its search space to continuous function forms and their parameters, making it difficult to model this decision-making. However, due to its ability to derive data-driven, interpretable models, SR holds promise for developing data-driven clinical risk scores. To that end we introduce Brush, an SR algorithm that combines decision-tree-like splitting algorithms with non-linear constant optimization, allowing for seamless integration of rule-based logic into symbolic regression and classification models. Brush achieves Pareto-optimal performance on SRBench, and was applied to recapitulate two widely used clinical scoring systems, achieving high accuracy and interpretable models. Compared to decision trees, random forests, and other SR methods, Brush achieves comparable or superior predictive performance while producing simpler models.

[319] CIP-Net: Continual Interpretable Prototype-based Network

Federico Di Valerio, Michela Proietti, Alessio Ragno, Roberto Capobianco

Main category: cs.LG

TL;DR: CIP-Net is an exemplar-free self-explainable prototype-based model for continual learning that achieves SOTA performance without storing past examples, offering both strong performance and interpretability with low memory overhead.

DetailsMotivation: Continual learning faces catastrophic forgetting when learning new tasks, and while explainable AI has been proposed to help understand and reduce forgetting, most existing approaches use post-hoc explanations or require additional memory per task, limiting scalability.

Method: CIP-Net is an exemplar-free self-explainable prototype-based model that avoids storing past examples, maintains a simple architecture, and provides explanations during prediction to help preserve knowledge across tasks.

Result: CIP-Net achieves state-of-the-art performance compared to previous exemplar-free and self-explainable methods in both task- and class-incremental settings, while bearing significantly lower memory-related overhead.

Conclusion: CIP-Net provides a practical and interpretable solution for continual learning that combines strong performance with useful explanations and low memory requirements, making it scalable for real-world applications.

Abstract: Continual learning constrains models to learn new tasks over time without forgetting what they have already learned. A key challenge in this setting is catastrophic forgetting, where learning new information causes the model to lose its performance on previous tasks. Recently, explainable AI has been proposed as a promising way to better understand and reduce forgetting. In particular, self-explainable models are useful because they generate explanations during prediction, which can help preserve knowledge. However, most existing explainable approaches use post-hoc explanations or require additional memory for each new task, resulting in limited scalability. In this work, we introduce CIP-Net, an exemplar-free self-explainable prototype-based model designed for continual learning. CIP-Net avoids storing past examples and maintains a simple architecture, while still providing useful explanations and strong performance. We demonstrate that CIP-Net achieves state-of-the-art performance compared to previous exemplar-free and self-explainable methods in both task- and class-incremental settings, while bearing significantly lower memory-related overhead. This makes it a practical and interpretable solution for continual learning.

[320] HOLE: Homological Observation of Latent Embeddings for Neural Network Interpretability

Sudhanva Manjunath Athreya, Paul Rosen

Main category: cs.LG

TL;DR: HOLE uses persistent homology to analyze neural network representations through topological features, visualized via various diagrams to understand representation structure, quality, and robustness.

DetailsMotivation: Deep learning models are highly successful but their learned representations and decision processes remain opaque and hard to interpret, creating a need for better interpretability methods.

Method: HOLE (Homological Observation of Latent Embeddings) applies persistent homology to extract topological features from neural activations, using visualization techniques including Sankey diagrams, heatmaps, dendrograms, and blob graphs.

Result: Topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing insights into representation quality across layers and robustness to input perturbations and model compression.

Conclusion: HOLE offers a complementary topological perspective for understanding and improving deep learning systems by making neural network representations more interpretable through persistent homology analysis.

Abstract: Deep learning models have achieved remarkable success across various domains, yet their learned representations and decision-making processes remain largely opaque and hard to interpret. This work introduces HOLE (Homological Observation of Latent Embeddings), a method for analyzing and interpreting deep neural networks through persistent homology. HOLE extracts topological features from neural activations and presents them using a suite of visualization techniques, including Sankey diagrams, heatmaps, dendrograms, and blob graphs. These tools facilitate the examination of representation structure and quality across layers. We evaluate HOLE on standard datasets using a range of discriminative models, focusing on representation quality, interpretability across layers, and robustness to input perturbations and model compression. The results indicate that topological analysis reveals patterns associated with class separation, feature disentanglement, and model robustness, providing a complementary perspective for understanding and improving deep learning systems.

[321] Bridging the Clinical Expertise Gap: Development of a Web-Based Platform for Accessible Time Series Forecasting and Analysis

Aaron D. Mullen, Daniel R. Harris, Svetla Slavova, V. K. Cody Bumgardner

Main category: cs.LG

TL;DR: A web platform that simplifies time series forecasting for healthcare by providing accessible tools for data analysis, model training, and result interpretation with LLM-powered recommendations.

DetailsMotivation: Time series forecasting has valuable applications in healthcare, but technical expertise requirements create barriers for researchers and clinicians who need to use these techniques.

Method: Developed a web platform that allows users to upload data, generate plots, train customizable forecasting models, and receive LLM-powered recommendations for parameter selection and result interpretation.

Result: Created an accessible platform that makes time series forecasting techniques available to non-experts through intuitive interfaces and AI-assisted guidance.

Conclusion: The platform aims to integrate into learning health systems for continuous data collection and inference, democratizing time series forecasting in healthcare settings.

Abstract: Time series forecasting has applications across domains and industries, especially in healthcare, but the technical expertise required to analyze data, build models, and interpret results can be a barrier to using these techniques. This article presents a web platform that makes the process of analyzing and plotting data, training forecasting models, and interpreting and viewing results accessible to researchers and clinicians. Users can upload data and generate plots to showcase their variables and the relationships between them. The platform supports multiple forecasting models and training techniques which are highly customizable according to the user’s needs. Additionally, recommendations and explanations can be generated from a large language model that can help the user choose appropriate parameters for their data and understand the results for each model. The goal is to integrate this platform into learning health systems for continuous data collection and inference from clinical pipelines.

[322] Benchmarking Offline Multi-Objective Reinforcement Learning in Critical Care

Aryaman Bansal, Divya Sharma

Main category: cs.LG

TL;DR: Benchmarks offline multi-objective RL algorithms vs scalarized baselines on MIMIC-IV, showing PEDA Decision Transformer offers superior flexibility for personalized critical care decision-making without retraining.

DetailsMotivation: Critical care requires balancing conflicting objectives (patient survival vs resource use). Single-objective RL uses fixed scalarized rewards, creating rigid policies that can't adapt to varying clinical priorities. Multi-objective RL can learn policies along Pareto Frontier for dynamic preference selection, but needs offline learning from historical healthcare data.

Method: Benchmarked three offline MORL algorithms (Conditioned Conservative Pareto Q-Learning, Adaptive CPQL, and modified Pareto Efficient Decision Agent Decision Transformer) against three scalarized single-objective baselines (BC, CQL, DDQN) on MIMIC-IV dataset. Used Off-Policy Evaluation metrics for comparison.

Result: PEDA DT algorithm offers superior flexibility compared to static scalarized baselines. Sequence modeling architectures (Decision Transformers) remain robust and effective when scaled to multi-objective conditioned generation, extending previous single-objective findings in healthcare.

Conclusion: Offline MORL is a promising framework for enabling personalized, adjustable decision-making in critical care without retraining, allowing dynamic adaptation to varying clinical priorities while learning from historical data.

Abstract: In critical care settings such as the Intensive Care Unit, clinicians face the complex challenge of balancing conflicting objectives, primarily maximizing patient survival while minimizing resource utilization (e.g., length of stay). Single-objective Reinforcement Learning approaches typically address this by optimizing a fixed scalarized reward function, resulting in rigid policies that fail to adapt to varying clinical priorities. Multi-objective Reinforcement Learning (MORL) offers a solution by learning a set of optimal policies along the Pareto Frontier, allowing for dynamic preference selection at test time. However, applying MORL in healthcare necessitates strict offline learning from historical data. In this paper, we benchmark three offline MORL algorithms, Conditioned Conservative Pareto Q-Learning (CPQL), Adaptive CPQL, and a modified Pareto Efficient Decision Agent (PEDA) Decision Transformer (PEDA DT), against three scalarized single-objective baselines (BC, CQL, and DDQN) on the MIMIC-IV dataset. Using Off-Policy Evaluation (OPE) metrics, we demonstrate that PEDA DT algorithm offers superior flexibility compared to static scalarized baselines. Notably, our results extend previous findings on single-objective Decision Transformers in healthcare, confirming that sequence modeling architectures remain robust and effective when scaled to multi-objective conditioned generation. These findings suggest that offline MORL is a promising framework for enabling personalized, adjustable decision-making in critical care without the need for retraining.

[323] CLARITY: Medical World Model for Guiding Treatment Decisions by Modeling Context-Aware Disease Trajectories in Latent Space

Tianxingjian Ding, Yuanhao Zou, Chen Chen, Mubarak Shah, Yu Tian

Main category: cs.LG

TL;DR: CLARITY is a medical world model that forecasts disease evolution in structured latent space, integrating temporal and clinical contexts to generate physiologically faithful, individualized treatment plans with a novel prediction-to-decision framework.

DetailsMotivation: Current static AI predictors cannot predict dynamic disease evolution in oncology. Existing medical world models ignore patient-specific temporal/clinical contexts, lack feedback mechanisms linking predictions to treatment decisions, and focus on visual reconstruction rather than causal physiological transitions.

Method: CLARITY forecasts disease evolution directly within a structured latent space, explicitly integrating time intervals (temporal context) and patient-specific data (clinical context). It models treatment-conditioned progression as smooth, interpretable trajectories and introduces a novel prediction-to-decision framework that translates latent rollouts into actionable recommendations.
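
As a rough illustration of a latent, context-conditioned rollout, the sketch below steps a small residual dynamics network forward over irregular follow-up intervals; the module name, dimensions, and inputs are assumptions for illustration and do not reproduce CLARITY's architecture.

```python
import torch
import torch.nn as nn

class LatentDiseaseDynamics(nn.Module):
    """Minimal latent world-model step: z_{t+1} = f(z_t, treatment, dt, context)."""
    def __init__(self, z_dim=32, t_dim=8, c_dim=16):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(z_dim + t_dim + 1 + c_dim, 128), nn.GELU(),
            nn.Linear(128, z_dim),
        )

    def forward(self, z, treatment, dt, context):
        # dt is the (normalized) time interval to the next observation; context
        # holds patient-specific clinical covariates.
        x = torch.cat([z, treatment, dt, context], dim=-1)
        return z + self.step(x)  # residual update keeps the trajectory smooth

model = LatentDiseaseDynamics()
z = torch.zeros(1, 32)
trajectory = [z]
for months in (1.0, 2.0, 3.0):  # irregular follow-up intervals
    z = model(z, torch.randn(1, 8), torch.tensor([[months]]), torch.randn(1, 16))
    trajectory.append(z)
# A prediction-to-decision layer would score each rolled-out trajectory per
# candidate treatment and recommend the best-scoring plan.
```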

Result: CLARITY demonstrates state-of-the-art performance in treatment planning, outperforming recent MeWM by 12% on the MU-Glioma-Post dataset and significantly surpassing all other medical-specific large language models.

Conclusion: CLARITY addresses key limitations in medical world models by providing physiologically faithful, individualized treatment forecasts with transparent, actionable recommendations, representing an advancement in clinical decision-making for oncology.

Abstract: Clinical decision-making in oncology requires predicting dynamic disease evolution, a task current static AI predictors cannot perform. While world models (WMs) offer a paradigm for generative prediction, existing medical applications remain limited. Existing methods often rely on stochastic diffusion models, focusing on visual reconstruction rather than causal, physiological transitions. Furthermore, in medical domain, models like MeWM typically ignore patient-specific temporal and clinical contexts and lack a feedback mechanism to link predictions to treatment decisions. To address these gaps, we introduce CLARITY, a medical world model that forecasts disease evolution directly within a structured latent space. It explicitly integrates time intervals (temporal context) and patient-specific data (clinical context) to model treatment-conditioned progression as a smooth, interpretable trajectory, and thus generate physiologically faithful, individualized treatment plans. Finally, CLARITY introduces a novel prediction-to-decision framework, translating latent rollouts into transparent, actionable recommendations. CLARITY demonstrates state-of-the-art performance in treatment planning. On the MU-Glioma-Post dataset, our approach outperforms recent MeWM by 12%, and significantly surpasses all other medical-specific large language models.

[324] LUNA: Linear Universal Neural Attention with Generalization Guarantees

Ashkan Shahbazi, Ping He, Ali Abbasi, Yikun Bai, Xinran Liu, Elaheh Akbari, Darian Salehi, Navid NaderiAlizadeh, Soheil Kolouri

Main category: cs.LG

TL;DR: LUNA introduces a learnable kernelized linear attention mechanism that achieves linear computational cost while matching or surpassing quadratic attention accuracy, eliminating the accuracy-efficiency trade-off in traditional linear attention methods.

DetailsMotivation: Traditional softmax attention has O(n²) quadratic cost limiting long-sequence applications. Existing linear attention methods reduce cost to O(n) but use fixed random feature maps, creating a fundamental accuracy-efficiency trade-off that sacrifices model accuracy for computational efficiency.

Method: LUNA learns a parameterized kernel feature map tailored to specific data and tasks, rather than using fixed random features. It implements a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in sequence length.
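
The linear-cost mechanism can be sketched as follows: a feature map phi is applied to queries and keys, so attention becomes phi(Q)(phi(K)^T V) with an O(n) running normalizer. The softplus-of-random-projection map below is only a stand-in for the positive-definite kernel feature map that LUNA actually learns.

```python
import numpy as np

def phi(x, W):
    """Stand-in learnable feature map: a projection followed by a
    positivity-enforcing nonlinearity (LUNA learns this map; exact form differs)."""
    return np.log1p(np.exp(x @ W))  # softplus keeps features positive

def linear_attention(Q, K, V, W):
    """O(n) attention: phi(Q) (phi(K)^T V) with a running normalizer."""
    q, k = phi(Q, W), phi(K, W)                 # (n, r)
    kv = k.T @ V                                # (r, d) -- computed once
    z = q @ k.sum(axis=0)                       # (n,)   -- normalizer
    return (q @ kv) / z[:, None]

n, d, r = 6, 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
W = rng.normal(size=(d, r))                     # the learnable parameters
out = linear_attention(Q, K, V, W)
print(out.shape)  # (6, 4); cost scales linearly in n, unlike softmax attention
```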

Result: On Long Range Arena (LRA), LUNA achieves state-of-the-art average accuracy among efficient Transformers under compute parity. It also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.

Conclusion: LUNA eliminates the accuracy-efficiency trade-off in attention mechanisms by learning kernel feature maps, achieving linear computational cost while matching or surpassing quadratic attention performance across diverse settings.

Abstract: Scaling attention faces a critical bottleneck: the $\mathcal{O}(n^2)$ quadratic computational cost of softmax attention, which limits its application in long-sequence domains. While linear attention mechanisms reduce this cost to $\mathcal{O}(n)$, they typically rely on fixed random feature maps, such as random Fourier features or hand-crafted functions. This reliance on static, data-agnostic kernels creates a fundamental trade-off, forcing practitioners to sacrifice significant model accuracy for computational efficiency. We introduce \textsc{LUNA}, a kernelized linear attention mechanism that eliminates this trade-off, retaining linear cost while matching and surpassing the accuracy of quadratic attention. \textsc{LUNA} is built on the key insight that the kernel feature map itself should be learned rather than fixed a priori. By parameterizing the kernel, \textsc{LUNA} learns a feature basis tailored to the specific data and task, overcoming the expressive limitations of fixed-feature methods. \textsc{Luna} implements this with a learnable feature map that induces a positive-definite kernel and admits a streaming form, yielding linear time and memory scaling in the sequence length. Empirical evaluations validate our approach across diverse settings. On the Long Range Arena (LRA), \textsc{Luna} achieves state-of-the-art average accuracy among efficient Transformers under compute parity, using the same parameter count, training steps, and approximate FLOPs. \textsc{Luna} also excels at post-hoc conversion: replacing softmax in fine-tuned BERT and ViT-B/16 checkpoints and briefly fine-tuning recovers most of the original performance, substantially outperforming fixed linearizations.

[325] Deep Kernel Aalen-Johansen Estimator: An Interpretable and Flexible Neural Net Framework for Competing Risks

Xiaobin Shen, George H. Chen

Main category: cs.LG

TL;DR: DKAJ is an interpretable deep learning model for competing risks analysis that generalizes the Aalen-Johansen estimator using learned kernel-based clustering.

DetailsMotivation: To create an interpretable deep learning model for competing risks analysis that maintains the interpretability of classical nonparametric methods while leveraging deep learning's flexibility.

Method: Proposes Deep Kernel Aalen-Johansen (DKAJ) estimator that represents each data point as weighted combination of clusters. Uses automatically learned kernel function to measure similarity between data points, with weights determining cluster membership. Generalizes classical Aalen-Johansen estimator by restricting predictions to cluster-specific estimates when appropriate.
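
For intuition, the numpy sketch below computes a weighted Aalen-Johansen cumulative incidence curve for one competing risk; in DKAJ the per-subject weights would come from the learned kernel (here they are hard-coded), and the full model combines cluster-specific estimates rather than a single weighted curve.

```python
import numpy as np

def weighted_aalen_johansen(times, causes, weights, cause, eval_times):
    """Weighted Aalen-Johansen cumulative incidence for one competing risk.
    causes: 0 = censored, 1..K = event type; weights: per-subject kernel weights."""
    order = np.argsort(times)
    times, causes, weights = times[order], causes[order], weights[order]
    surv, cif = 1.0, 0.0
    cif_at = np.zeros(len(eval_times))
    j = 0
    for t in np.unique(times[causes > 0]):
        at_risk = weights[times >= t].sum()
        d_all = weights[(times == t) & (causes > 0)].sum()
        d_k = weights[(times == t) & (causes == cause)].sum()
        if at_risk > 0:
            cif += surv * d_k / at_risk          # cause-k hazard times survival
            surv *= 1.0 - d_all / at_risk        # all-cause Kaplan-Meier update
        while j < len(eval_times) and eval_times[j] < t:
            j += 1
        cif_at[j:] = cif                          # right-continuous step function
    return cif_at

times  = np.array([2.0, 3.0, 3.0, 5.0, 7.0, 8.0])
causes = np.array([1,   2,   0,   1,   0,   2  ])
w      = np.array([0.9, 0.4, 1.0, 0.7, 0.2, 0.5])   # stand-in learned kernel weights
print(weighted_aalen_johansen(times, causes, w, cause=1,
                              eval_times=np.array([1.0, 4.0, 9.0])))
```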

Result: DKAJ shows competitive performance with state-of-the-art baselines on four standard competing risks datasets while providing interpretable visualizations.

Conclusion: DKAJ successfully bridges classical nonparametric methods with deep learning, offering both competitive predictive performance and interpretability through cluster-based visualizations.

Abstract: We propose an interpretable deep competing risks model called the Deep Kernel Aalen-Johansen (DKAJ) estimator, which generalizes the classical Aalen-Johansen nonparametric estimate of cumulative incidence functions (CIFs). Each data point (e.g., patient) is represented as a weighted combination of clusters. If a data point has nonzero weight only for one cluster, then its predicted CIFs correspond to those of the classical Aalen-Johansen estimator restricted to data points from that cluster. These weights come from an automatically learned kernel function that measures how similar any two data points are. On four standard competing risks datasets, we show that DKAJ is competitive with state-of-the-art baselines while being able to provide visualizations to assist model interpretation.

[326] CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification

Pingchuan Ma, Chengshuai Zhao, Bohan Jiang, Saketh Vishnubhatla, Ujun Jeong, Alimohammad Beigi, Adrienne Raglin, Huan Liu

Main category: cs.LG

TL;DR: A causality-guided multimodal domain generalization framework for crisis classification in social media that improves generalization to unseen disaster types by disentangling causal features and aligning multimodal representations.

DetailsMotivation: Existing multimodal crisis classification methods perform well in-domain but generalize poorly to unseen crisis types, because they fail to disentangle spurious from causal features and lack aligned multimodal representations, which prevents the direct adaptation of single-modality domain generalization techniques.

Method: Proposes a causality-guided multimodal domain generalization (MMDG) framework combining adversarial disentanglement (to focus on domain-invariant causal features) with unified representation learning (to align heterogeneous modality features in shared latent space).
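
The adversarial-disentanglement idea can be illustrated with a generic gradient-reversal setup (DANN-style), shown below; the encoder, heads, and dimensions are hypothetical and this is not CAMO's exact objective.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # stand-in shared multimodal encoder
task_head = nn.Linear(32, 5)                             # crisis-class predictor
domain_head = nn.Linear(32, 3)                           # disaster-type (domain) predictor

x = torch.randn(8, 64)
y_task, y_domain = torch.randint(0, 5, (8,)), torch.randint(0, 3, (8,))
feats = encoder(x)
task_loss = nn.functional.cross_entropy(task_head(feats), y_task)
# The domain classifier sees gradient-reversed features, so the encoder is
# pushed toward domain-invariant (causal) representations.
domain_loss = nn.functional.cross_entropy(domain_head(GradReverse.apply(feats, 1.0)), y_domain)
(task_loss + domain_loss).backward()
```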

Result: Experiments on different datasets demonstrate the approach achieves best performance in unseen disaster scenarios compared to existing methods.

Conclusion: The proposed framework effectively addresses generalization challenges in multimodal crisis classification by disentangling causal features and enabling seamless extension of single-modality DG techniques to multimodal settings through representation alignment.

Abstract: Crisis classification in social media aims to extract actionable disaster-related information from multimodal posts, which is a crucial task for enhancing situational awareness and facilitating timely emergency responses. However, the wide variation in crisis types makes achieving generalizable performance across unseen disasters a persistent challenge. Existing approaches primarily leverage deep learning to fuse textual and visual cues for crisis classification, achieving numerically plausible results under in-domain settings. However, they exhibit poor generalization across unseen crisis types because they 1. do not disentangle spurious and causal features, resulting in performance degradation under domain shift, and 2. fail to align heterogeneous modality representations within a shared space, which hinders the direct adaptation of established single-modality domain generalization (DG) techniques to the multimodal setting. To address these issues, we introduce a causality-guided multimodal domain generalization (MMDG) framework that combines adversarial disentanglement with unified representation learning for crisis classification. The adversarial objective encourages the model to disentangle and focus on domain-invariant causal features, leading to more generalizable classifications grounded in stable causal mechanisms. The unified representation aligns features from different modalities within a shared latent space, enabling single-modality DG strategies to be seamlessly extended to multimodal learning. Experiments on the different datasets demonstrate that our approach achieves the best performance in unseen disaster scenarios.

[327] Unveiling Latent Knowledge in Chemistry Language Models through Sparse Autoencoders

Jaron Cohen, Alexander G. Hasson, Sara Tanovic

Main category: cs.LG

TL;DR: The paper applies sparse autoencoder techniques to chemistry language models to uncover interpretable features representing chemical knowledge like structural motifs, properties, and drug classes.

DetailsMotivation: Interpretability remains a critical challenge in machine learning, especially for generative models used in high-stakes applications like drug and material discovery. While chemistry language models show impressive capabilities, their internal representations of chemical knowledge are poorly understood.

Method: Extends sparse autoencoder techniques to examine interpretable features within chemistry language models. Applied methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model to extract semantically meaningful latent features and analyze their activation patterns across diverse molecular datasets.
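
A minimal sparse autoencoder of the kind typically used for this sort of analysis is sketched below, trained on stand-in activations with an L1 penalty on the hidden code; hyperparameters and dimensions are illustrative, not those used for SMI-TED.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with an L1 penalty on its hidden code."""
    def __init__(self, d_model=256, d_hidden=2048):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        code = torch.relu(self.enc(acts))   # sparse, non-negative features
        return self.dec(code), code

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(512, 256)   # stand-in for hidden activations of a chemistry LM

for _ in range(10):
    recon, code = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
# Columns of sae.dec.weight can then be inspected for features that activate
# on, e.g., a particular structural motif or drug class.
```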

Result: The models encode a rich landscape of chemical concepts. Researchers identified correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes.

Conclusion: Provides a generalizable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment, with potential to accelerate computational chemistry research.

Abstract: Since the advent of machine learning, interpretability has remained a persistent challenge, becoming increasingly urgent as generative models support high-stakes applications in drug and material discovery. Recent advances in large language model (LLM) architectures have yielded chemistry language models (CLMs) with impressive capabilities in molecular property prediction and molecular generation. However, how these models internally represent chemical knowledge remains poorly understood. In this work, we extend sparse autoencoder techniques to uncover and examine interpretable features within CLMs. Applying our methodology to the Foundation Models for Materials (FM4M) SMI-TED chemistry foundation model, we extract semantically meaningful latent features and analyse their activation patterns across diverse molecular datasets. Our findings reveal that these models encode a rich landscape of chemical concepts. We identify correlations between specific latent features and distinct domains of chemical knowledge, including structural motifs, physicochemical properties, and pharmacological drug classes. Our approach provides a generalisable framework for uncovering latent knowledge in chemistry-focused AI systems. This work has implications for both foundational understanding and practical deployment; with the potential to accelerate computational chemistry research.

[328] Complexity of One-Dimensional ReLU DNNs

Jonathan Kogan, Hayden Jananthan, Jeremy Kepner

Main category: cs.LG

TL;DR: The paper analyzes the expressivity of 1D ReLU networks via their linear regions, showing that the expected number of regions scales as the total number of hidden neurons plus one, and introduces a function-adaptive sparsity measure.

DetailsMotivation: To understand the expressive power of deep ReLU networks through their linear regions, which characterize piecewise linear functions, and to quantify how efficiently networks use their capacity for approximating target functions.

Method: Theoretical analysis of randomly initialized 1D ReLU networks with He scaling and nonzero bias, examining expected number of linear regions in infinite-width limit, plus proposing function-adaptive sparsity measure comparing used regions to minimal needed for approximation.
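
The quantity being studied can be checked numerically: the sketch below draws a random He-initialized 1D ReLU network and counts linear regions on an interval by tracking changes in the hidden-unit activation pattern along a dense grid. The widths, interval, and grid resolution are arbitrary choices, and a finite-width count on a bounded interval will only roughly track the paper's infinite-width expectation.

```python
import numpy as np

def random_relu_net(widths, rng):
    """He-initialized fully connected 1D ReLU net with nonzero biases."""
    layers, fan_in = [], 1
    for w in widths:
        W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(w, fan_in))
        b = rng.normal(scale=0.1, size=(w, 1))
        layers.append((W, b))
        fan_in = w
    return layers

def count_linear_regions(layers, lo=-3.0, hi=3.0, n=200_000):
    """Count regions on [lo, hi] via changes in the hidden-unit activation pattern."""
    h = np.linspace(lo, hi, n).reshape(1, -1)
    patterns = []
    for W, b in layers:
        pre = W @ h + b
        patterns.append(pre > 0)
        h = np.maximum(pre, 0.0)
    pattern = np.concatenate(patterns, axis=0)                 # (total hidden neurons, n)
    changes = np.any(pattern[:, 1:] != pattern[:, :-1], axis=0).sum()
    return 1 + int(changes)

rng = np.random.default_rng(0)
widths = [64, 64]   # 128 hidden neurons; the infinite-width result predicts roughly 128 + 1 regions
counts = [count_linear_regions(random_relu_net(widths, rng)) for _ in range(20)]
print(np.mean(counts))
```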

Result: Proved that the expected number of linear regions grows as the sum of neurons across all hidden layers plus 1 (up to lower-order terms), establishing a scaling relationship for network expressivity.

Conclusion: The number of linear regions scales linearly with the total number of hidden neurons, providing a quantitative understanding of ReLU network expressivity, and the proposed sparsity measure offers a way to assess how efficiently a network approximates a target function.

Abstract: We study the expressivity of one-dimensional (1D) ReLU deep neural networks through the lens of their linear regions. For randomly initialized, fully connected 1D ReLU networks (He scaling with nonzero bias) in the infinite-width limit, we prove that the expected number of linear regions grows as $\sum_{\ell = 1}^L n_\ell + \mathop{o}\left(\sum_{\ell = 1}^L n_\ell\right) + 1$, where $n_\ell$ denotes the number of neurons in the $\ell$-th hidden layer. We also propose a function-adaptive notion of sparsity that compares the expected regions used by the network to the minimal number needed to approximate a target within a fixed tolerance.

[329] Training LLMs for Honesty via Confessions

Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese

Main category: cs.LG

TL;DR: Researchers propose a “confession” method where LLMs provide honest self-reports of their shortcomings after giving main answers, with separate rewards for confession honesty to incentivize truthful disclosure of misbehavior.

DetailsMotivation: LLMs can be dishonest about their actions and beliefs due to reinforcement learning challenges that may inadvertently incentivize lying. There's a need for methods to elicit honest self-reports of model shortcomings and misbehavior.

Method: Introduce a “confession” mechanism where models provide a separate output after their main answer, with rewards based solely on confession honesty (not affecting main answer rewards). Train GPT-5-Thinking to produce confessions and evaluate in OOD scenarios.

Result: When models lie or omit shortcomings in main answers, they often confess honestly to these behaviors. Confession honesty modestly improves with training. Confessions enable inference-time interventions like monitoring, rejection sampling, and issue surfacing.

Conclusion: The confession approach shows viability for eliciting honest self-reports of LLM shortcomings, providing a mechanism to surface misbehavior rather than cover it up, with practical applications for model monitoring and safety interventions.

Abstract: Large language models (LLMs) can be dishonest when reporting on their actions and beliefs – for example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions. In this work we propose a method for eliciting an honest expression of an LLM’s shortcomings via a self-reported confession. A confession is an output, provided upon request after a model’s original answer, that is meant to serve as a full account of the model’s compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer’s reward. As long as the “path of least resistance” for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest in their confessions. Our findings provide some justification for this empirical assumption, especially in the case of egregious model misbehavior. To demonstrate the viability of our approach, we train GPT-5-Thinking to produce confessions, and we evaluate its honesty in out-of-distribution scenarios measuring hallucination, instruction following, scheming, and reward hacking. We find that when the model lies or omits shortcomings in its “main” answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training. Confessions can enable a number of inference-time interventions including monitoring, rejection sampling, and surfacing issues to the user.

[330] Scalable Offline Model-Based RL with Action Chunks

Kwanyoung Park, Seohong Park, Youngwoon Lee, Sergey Levine

Main category: cs.LG

TL;DR: MAC uses action-chunk models and rejection sampling to improve offline RL on long-horizon tasks by reducing compounding model errors and preventing model exploitation.

DetailsMotivation: The paper addresses the challenge of applying model-based RL to complex, long-horizon tasks in offline settings, where traditional approaches suffer from compounding model errors over long rollouts and exploitation of model inaccuracies.

Method: Proposes Model-Based RL with Action Chunks (MAC): 1) Uses action-chunk models that predict future states from sequences of actions rather than single actions to reduce compounding errors; 2) Employs rejection sampling from an expressive behavioral action-chunk policy instead of direct policy optimization to prevent model exploitation from out-of-distribution actions.
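
A toy version of the selection step is sketched below: candidate action chunks are drawn only from a behavioral policy and scored by rolling the chunk-level model and evaluating a value function, keeping the best candidate. All functions here are stand-ins; MAC's actual value expansion and training losses are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_len, act_dim, num_candidates = 4, 2, 16

def behavior_policy_sample(state):
    """Stand-in for an expressive behavioral action-chunk policy."""
    return rng.normal(size=(chunk_len, act_dim))

def chunk_model(state, action_chunk):
    """Stand-in dynamics model: predicts the state after the whole chunk."""
    return state + 0.1 * action_chunk.sum(axis=0)

def value_fn(state):
    """Stand-in value estimate of a state."""
    return -np.linalg.norm(state - np.ones_like(state))

def select_action_chunk(state):
    # Rejection-sampling-style selection: only candidates drawn from the
    # behavior policy are considered, which keeps rollouts in-distribution
    # and limits exploitation of model errors.
    candidates = [behavior_policy_sample(state) for _ in range(num_candidates)]
    scores = [value_fn(chunk_model(state, c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

state = np.zeros(act_dim)
print(select_action_chunk(state).shape)   # (4, 2): one chunk of 4 low-level actions
```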

Result: MAC achieves state-of-the-art performance among offline model-based RL algorithms on challenging tasks with large-scale datasets (up to 100M transitions), particularly excelling on long-horizon tasks.

Conclusion: Model-based RL with action chunks provides an effective scalable approach for offline RL on complex long-horizon tasks by addressing key limitations of traditional model-based methods through action-chunk modeling and rejection sampling.

Abstract: In this paper, we study whether model-based reinforcement learning (RL), in particular model-based value expansion, can provide a scalable recipe for tackling complex, long-horizon tasks in offline RL. Model-based value expansion fits an on-policy value function using length-n imaginary rollouts generated by the current policy and a learned dynamics model. While larger n reduces bias in value bootstrapping, it amplifies accumulated model errors over long horizons, degrading future predictions. We address this trade-off with an \emph{action-chunk} model that predicts a future state from a sequence of actions (an “action chunk”) instead of a single action, which reduces compounding errors. In addition, instead of directly training a policy to maximize rewards, we employ rejection sampling from an expressive behavioral action-chunk policy, which prevents model exploitation from out-of-distribution actions. We call this recipe \textbf{Model-Based RL with Action Chunks (MAC)}. Through experiments on highly challenging tasks with large-scale datasets of up to 100M transitions, we show that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks.

[331] Long-only cryptocurrency portfolio management by ranking the assets: a neural network approach

Zijiang Yang

Main category: cs.LG

TL;DR: Proposes a novel ML-based portfolio management method for cryptocurrencies that predicts relative rankings of future returns across multiple cryptocurrencies rather than treating them independently, achieving superior performance across full market cycles.

DetailsMotivation: Existing cryptocurrency portfolio methods typically focus on predicting movements of individual cryptocurrencies (like Bitcoin) independently, missing cross-sectional relationships. The paper aims to develop a more sophisticated approach that leverages relative relationships among multiple cryptocurrencies for better portfolio management.

Method: Uses neural networks to predict the rank of future returns for managed cryptocurrencies at each time step, then allocates portfolio weights accordingly. This cross-sectional ranking approach incorporates relative performance information rather than treating cryptocurrencies independently.
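
One simple way to turn cross-sectional rank predictions into long-only weights is sketched below; top-k equal weighting is an illustrative choice, since the paper's exact allocation rule is not specified in this summary.

```python
import numpy as np

def rank_to_weights(predicted_scores, top_k=3):
    """Long-only weights from cross-sectional return-rank predictions:
    hold the top_k assets by predicted rank, equally weighted."""
    order = np.argsort(-predicted_scores)          # best predicted return first
    weights = np.zeros_like(predicted_scores, dtype=float)
    weights[order[:top_k]] = 1.0 / top_k
    return weights

# Hypothetical one-step example: model scores for 6 managed cryptocurrencies.
scores = np.array([0.3, -0.1, 0.8, 0.05, 0.6, -0.4])
w = rank_to_weights(scores)
realized_returns = np.array([0.02, -0.01, 0.05, 0.00, 0.03, -0.02])
print(w, "portfolio return:", float(w @ realized_returns))
```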

Result: Backtesting on real daily cryptocurrency market data from May 2020 to Nov 2023 (3.5 years covering bullish, bearish, and stagnant conditions) shows the method outperforms existing approaches with a Sharpe ratio of 1.01 and an annualized return of 64.26%. The method also demonstrates robustness to increased transaction fees.

Conclusion: The proposed cross-sectional ranking approach using neural networks provides an effective portfolio management strategy for cryptocurrencies that leverages relative relationships among assets, achieving strong risk-adjusted returns across diverse market conditions while maintaining robustness to transaction costs.

Abstract: This paper proposes a novel machine learning based portfolio management method in the context of the cryptocurrency market. Previous researchers mainly focus on predicting the movement of a specific cryptocurrency such as Bitcoin (BTC) and then trading according to the prediction. In contrast to previous work that treats the cryptocurrencies independently, this paper manages a group of cryptocurrencies by analyzing their relative relationships. Specifically, at each time step, we utilize a neural network to predict the rank of the future return of the managed cryptocurrencies and place weights accordingly. By incorporating such cross-sectional information, the proposed method is shown to be profitable in backtesting experiments on real daily cryptocurrency market data from May 2020 to Nov 2023. During these 3.5 years, the market experienced a full cycle of bullish, bearish, and stagnant conditions. Despite such complex market conditions, the proposed method outperforms existing methods and achieves a Sharpe ratio of 1.01 and an annualized return of 64.26%. Additionally, the proposed method is shown to be robust to increases in transaction fees.

[332] Improving the Sensitivity of Backdoor Detectors via Class Subspace Orthogonalization

Guangmingmei Yang, David J. Miller, George Kesidis

Main category: cs.LG

TL;DR: CSO improves backdoor detection by suppressing intrinsic class features to reveal subtle backdoor triggers, outperforming methods that rely on outlier detection statistics alone.

DetailsMotivation: Existing backdoor detection methods fail when: 1) some non-target classes naturally achieve extreme detection statistics due to being easily discriminable, and 2) backdoors are subtle with weak triggers relative to intrinsic class features.

Method: Proposes Class Subspace Orthogonalization (CSO) - a plug-and-play approach that formulates a constrained optimization problem using clean examples to suppress intrinsic features while optimizing detection statistics, revealing backdoor contributions.
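
The orthogonalization step can be illustrated in isolation: estimate the class's intrinsic feature subspace from clean examples and project any candidate direction onto its orthogonal complement before scoring. The sketch below does only this projection; CSO's full constrained optimization of the detection statistic is not shown, and all names and dimensions are illustrative.

```python
import numpy as np

def orthogonal_complement_projector(clean_feats, rank=5):
    """Projector onto the complement of the top-`rank` intrinsic feature directions."""
    centered = clean_feats - clean_feats.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    V = Vt[:rank].T                                   # intrinsic class subspace
    return np.eye(clean_feats.shape[1]) - V @ V.T

rng = np.random.default_rng(0)
clean_feats = rng.normal(size=(50, 64))               # features of clean examples of one class
P = orthogonal_complement_projector(clean_feats)

candidate_direction = rng.normal(size=64)              # direction found while optimizing the statistic
suppressed = P @ candidate_direction                   # intrinsic-feature contribution removed
# A detection statistic evaluated on `suppressed` keeps the backdoor trigger's
# contribution (for the true target class) while removing the intrinsic-feature
# contribution that non-target classes rely on.
print(np.linalg.norm(candidate_direction), np.linalg.norm(suppressed))
```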

Result: CSO achieves more sensitive detection by separating backdoor trigger contributions from intrinsic features, effectively handling challenging mixed-label and adaptive attacks where traditional methods fail.

Conclusion: By orthogonalizing detection statistics with respect to intrinsic class features, CSO provides a robust framework for detecting subtle backdoors that evade conventional outlier-based detection approaches.

Abstract: Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and from its intrinsic features, whereas non-target classes only have contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem, leveraging a small set of clean examples from a given class, and optimizing the detection statistic while orthogonalizing with respect to the class’s intrinsic features. We dub this plug-and-play approach Class Subspace Orthogonalization (CSO) and assess it against challenging mixed-label and adaptive attacks.

[333] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture

Gary Ackerman, Brandon Behlendorf, Zachary Kallenborn, Sheriff Almakki, Doug Clifford, Jenna LaTourette, Hayley Peterson, Noah Sheinbaum, Olivia Shoemaker, Anna Wetzel

Main category: cs.LG

TL;DR: A framework for benchmarking AI models’ biosecurity risks, focusing on bacterial threats with hierarchical task-query architecture.

DetailsMotivation: Need to quantify and mitigate bioterrorism risks from evolving AI/LLMs, requiring better benchmarks that account for operational factors and different adversary capabilities.

Method: Developed Biothreat Benchmark Generation (BBG) Framework with hierarchical structure of biothreat categories, elements, and tasks, creating Bacterial Biothreat Schema as pilot for bacterial threats.

Result: Created a robust, re-usable framework for evaluating bacterial biological risks from LLMs across multiple aggregation levels, capturing both technical and operational requirements.

Conclusion: The BBG Framework provides a structured approach to assessing AI model biosecurity risks, with future work focusing on turning queries into prompts and implementing benchmarks for model evaluation.

Abstract: Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.

[334] Robust Agents in Open-Ended Worlds

Mikayel Samvelyan

Main category: cs.LG

TL;DR: This thesis develops methods for training robust AI agents that can generalize to novel environments, out-of-distribution inputs, and multi-agent interactions using open-endedness and multi-agent learning approaches.

DetailsMotivation: As AI becomes more prevalent, there's a critical need for agents that can adapt to changing, open-ended environments and generalize beyond their training data to handle unforeseen scenarios and interactions.

Method: 1) MiniHack: A sandbox framework based on NetHack for creating diverse environments via procedural content generation. 2) Maestro: An adversarial curriculum generation approach for enhancing RL agent robustness in two-player zero-sum games. 3) Quality-diversity methods to identify vulnerabilities in pre-trained RL policies in complex multi-agent domains (football). 4) Evolutionary search to generate adversarial prompts for testing and improving LLM robustness.

Result: Developed comprehensive tools and methodologies for evaluating and enhancing AI robustness across different domains: environment generalization (MiniHack), game-theoretic robustness (Maestro), multi-agent vulnerability analysis (quality-diversity), and LLM adversarial robustness (evolutionary search).

Conclusion: This work provides foundational approaches for developing robust AI agents capable of adapting to evolving environments and thriving in the face of unforeseen challenges, paving the way for more resilient AI systems across various applications.

Abstract: The growing prevalence of artificial intelligence (AI) in various applications underscores the need for agents that can successfully navigate and adapt to an ever-changing, open-ended world. A key challenge is ensuring these AI agents are robust, excelling not only in familiar settings observed during training but also effectively generalising to previously unseen and varied scenarios. In this thesis, we harness methodologies from open-endedness and multi-agent learning to train and evaluate robust AI agents capable of generalising to novel environments, out-of-distribution inputs, and interactions with other co-player agents. We begin by introducing MiniHack, a sandbox framework for creating diverse environments through procedural content generation. Based on the game of NetHack, MiniHack enables the construction of new tasks for reinforcement learning (RL) agents with a focus on generalisation. We then present Maestro, a novel approach for generating adversarial curricula that progressively enhance the robustness and generality of RL agents in two-player zero-sum games. We further probe robustness in multi-agent domains, utilising quality-diversity methods to systematically identify vulnerabilities in state-of-the-art, pre-trained RL policies within the complex video game football domain, characterised by intertwined cooperative and competitive dynamics. Finally, we extend our exploration of robustness to the domain of LLMs. Here, our focus is on diagnosing and enhancing the robustness of LLMs against adversarial prompts, employing evolutionary search to generate a diverse range of effective inputs that aim to elicit undesirable outputs from an LLM. This work collectively paves the way for future advancements in AI robustness, enabling the development of agents that not only adapt to an ever-evolving world but also thrive in the face of unforeseen challenges and interactions.

[335] PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection

Ali Lotfi Rezaabad, Bikram Khanal, Shashwat Chaurasia, Lu Zeng, Dezhi Hong, Hossein Beshashati, Thomas Butler, Megan Ganji

Main category: cs.LG

TL;DR: PolyLingua is a lightweight Transformer model for accurate language identification using contrastive learning, outperforming larger models with 10x fewer parameters.

DetailsMotivation: Existing language ID tools struggle with accuracy in key cases like music requests with code-switching, while accurate models (LLMs) are too costly for low-latency/low-resource settings.

Method: Two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins to create compact, well-separated embeddings for closely related languages.
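
A rough sketch of a two-level margin objective is given below: one term aligns each embedding with its language prototype by a margin, the other separates same-language from different-language instance pairs. The fixed margin and the specific loss forms are assumptions standing in for the paper's adaptive-margin formulation.

```python
import torch
import torch.nn.functional as F

def two_level_margin_loss(emb, labels, prototypes, margin=0.2):
    """Class-level alignment to prototypes plus instance-level separation,
    both with a fixed margin (a stand-in for the paper's adaptive margins)."""
    emb = F.normalize(emb, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)

    # Class level: each embedding should be closer (in cosine similarity) to its
    # own language prototype than to any other prototype, by at least `margin`.
    sims = emb @ prototypes.T                                   # (B, C)
    pos = sims[torch.arange(len(labels)), labels]
    neg = sims.clone()
    neg[torch.arange(len(labels)), labels] = -1.0
    class_loss = F.relu(neg.max(dim=1).values - pos + margin).mean()

    # Instance level: same-language pairs should be more similar than
    # different-language pairs.
    pair = emb @ emb.T
    eye = torch.eye(len(labels), dtype=torch.bool)
    same = (labels[:, None] == labels[None, :]) & ~eye
    diff = labels[:, None] != labels[None, :]
    inst_loss = F.relu(pair[diff].mean() - pair[same].mean() + margin)
    return class_loss + inst_loss

emb = torch.randn(8, 32, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
prototypes = torch.randn(4, 32, requires_grad=True)
two_level_margin_loss(emb, labels, prototypes).backward()
```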

Result: Achieves 99.25% F1 on Amazon Massive dataset and 98.15% F1 on Song dataset, surpassing Sonnet 3.5 while using 10x fewer parameters.

Conclusion: PolyLingua provides high-accuracy language identification suitable for compute- and latency-constrained environments, addressing gaps in existing tools.

Abstract: Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet, existing language identification tools struggle with key cases – such as music requests where the song title and user language differ. Open-source tools like LangDetect, FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets – Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) – PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it ideal for compute- and latency-constrained environments.

[336] TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding, Weirui Ye

Main category: cs.LG

TL;DR: TreeGRPO is a novel RL framework that improves training efficiency for generative model alignment by recasting denoising as a search tree, achieving 2.4× faster training with better sample efficiency and fine-grained credit assignment.

DetailsMotivation: RL post-training is essential for aligning generative models with human preferences, but its high computational cost limits widespread adoption. Current methods are inefficient and suffer from uniform credit assignment limitations.

Method: TreeGRPO recasts the denoising process as a search tree, branching from shared initial noise samples to generate multiple candidate trajectories while reusing common prefixes. It enables fine-grained credit assignment via reward backpropagation and amortized computation through multi-child branching.
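
The tree-structured credit assignment can be illustrated on a toy tree: leaf rewards are averaged bottom-up into subtree values, and each branch's advantage is its subtree value minus its parent's value. The numbers and the simple mean baseline below are illustrative; TreeGRPO's GRPO-style normalization and the shared-prefix denoising rollouts are not reproduced.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One partially denoised sample in the search tree."""
    children: list = field(default_factory=list)
    reward: float = 0.0      # leaf reward from the reward model
    value: float = 0.0       # mean reward of the subtree (filled by backprop)

def backprop_values(node):
    """Mean leaf reward of each subtree, computed bottom-up."""
    if not node.children:
        node.value = node.reward
    else:
        node.value = sum(backprop_values(c) for c in node.children) / len(node.children)
    return node.value

def step_advantages(node, out):
    """Advantage of taking each branch = child subtree value - parent value."""
    for child in node.children:
        out.append(child.value - node.value)
        step_advantages(child, out)
    return out

# Two branching points with a shared prefix at the root (all numbers illustrative).
root = Node(children=[
    Node(children=[Node(reward=0.9), Node(reward=0.5)]),
    Node(children=[Node(reward=0.2), Node(reward=0.4)]),
])
backprop_values(root)
print(step_advantages(root, []))   # step-specific advantages for the policy update
```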

Result: TreeGRPO achieves 2.4× faster training while establishing a superior Pareto frontier in efficiency-reward trade-off. It consistently outperforms GRPO baselines across multiple benchmarks and reward models.

Conclusion: TreeGRPO provides a scalable and effective pathway for RL-based visual generative model alignment, offering dramatic improvements in training efficiency through its tree-structured approach to RL post-training.

Abstract: Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce \textbf{TreeGRPO}, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) \emph{High sample efficiency}, achieving better performance under same training samples (2) \emph{Fine-grained credit assignment} via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) \emph{Amortized computation} where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves \textbf{2.4$\times$ faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at treegrpo.github.io.

[337] LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks

Nanda K. Unnikrishnan, Keshab K. Parhi

Main category: cs.LG

TL;DR: LayerPipe2 formally derives gradient delay requirements for pipelined neural network training, showing delay amounts depend on network structure, and introduces a memory-efficient moving average to avoid storing historical weights.

DetailsMotivation: Previous work (LayerPipe) empirically accelerated training by overlapping forward/backward computation but lacked principled understanding of gradient delay requirements at each layer for effective pipelining.

Method: Formally derives LayerPipe using variable delayed gradient adaptation and retiming; identifies legal delay insertion points; develops pipeline-aware moving average to reconstruct past states instead of storing them.
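
A very rough sketch of the storage trade-off follows: naive delayed-gradient pipelining keeps a buffer of past weights, whereas a running exponential moving average can stand in for the delayed weights in O(1) memory. The EMA coefficient and update rule here are illustrative assumptions, not the paper's pipeline-aware moving average.

```python
import numpy as np

rng = np.random.default_rng(0)
delay = 4                       # pipeline depth for this layer (number of delayed steps)
beta = delay / (delay + 1.0)    # EMA whose effective lag ~ beta/(1-beta) = delay
                                # (an illustrative choice, not the paper's formula)

W = rng.normal(size=(8, 8))     # current weights
ema = W.copy()                  # running average replaces a buffer of past weights
history = []                    # what naive delayed-gradient pipelining would store

for step in range(200):
    grad = rng.normal(scale=0.01, size=W.shape)
    history.append(W.copy())
    W = W - 0.1 * grad                        # stand-in weight update
    ema = beta * ema + (1.0 - beta) * W       # O(1) memory instead of O(delay)

    if step == 199:
        true_past = history[-delay]           # exact delayed weights (needs extra storage)
        approx_past = ema                     # reconstructed stand-in
        print(np.abs(true_past - approx_past).max())
```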

Result: Shows delay requirements follow directly from network structure (inner layers need fewer delays, outer layers need longer delays); reduces memory cost without sacrificing accuracy; enables scalable pipelined training.

Conclusion: Provides principled framework for constructing LayerPipe architectures, predicting delay requirements, and mitigating storage burden, enabling controlled communication-computation tradeoffs in scalable pipelined training.

Abstract: In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays and outer layers require longer delays. When pipelining is applied at every layer, the amount of delay depends only on the number of remaining downstream stages. When layers are pipelined in groups, all layers in the group share the same assignment of delays. These insights not only explain previously observed scheduling patterns but also expose an often overlooked challenge that pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.

[338] MobileFineTuner: A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones

Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo

Main category: cs.LG

TL;DR: MobileFineTuner is an open-source framework enabling end-to-end LLM fine-tuning directly on commodity mobile phones, addressing privacy concerns and data scarcity by leveraging private user data on-device.

DetailsMotivation: As high-quality public LLM training data becomes exhausted, on-device fine-tuning offers a way to leverage private user data while preserving privacy. However, existing approaches don't work well on commodity mobile phones, and there's no open-source framework for practical LLM fine-tuning on these devices.

Method: MobileFineTuner is a unified framework supporting both full-parameters fine-tuning and parameter-efficient fine-tuning. It includes system-level optimizations like parameter sharding, gradient accumulation, and energy-aware computation scheduling to overcome mobile phone memory and energy limitations.
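
Of the listed optimizations, gradient accumulation is the easiest to show in isolation; the PyTorch sketch below accumulates gradients over several tiny micro-batches before each optimizer step. The model is a stand-in, and parameter sharding and energy-aware scheduling are not shown.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 2)                 # stand-in for an LLM block being fine-tuned
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                           # effective batch = 8 micro-batches

opt.zero_grad()
for step in range(64):
    x, y = torch.randn(2, 128), torch.randint(0, 2, (2,))   # tiny on-device micro-batch
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                        # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()                         # one optimizer step per effective batch
        opt.zero_grad()
```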

Result: The framework successfully fine-tuned GPT-2, Gemma 3, and Qwen 2.5 on real mobile phones. Extensive experiments and ablation studies validated the effectiveness of the proposed optimizations.

Conclusion: MobileFineTuner establishes a viable foundation for future research on on-device LLM training, demonstrating that practical LLM fine-tuning on commodity mobile phones is achievable with proper system-level optimizations.

Abstract: Mobile phones are the most ubiquitous end devices, generating vast amounts of human-authored data and serving as the primary platform for end-side applications. As high-quality public data for large language models (LLMs) approaches exhaustion, on-device fine-tuning provides an opportunity to leverage private user data while preserving privacy. However, existing approaches are predominantly simulation-based or rely on IoT devices and PCs, leaving commodity mobile phones largely unexplored. A key gap is the absence of an open-source framework that enables practical LLM fine-tuning on mobile phones. We present MobileFineTuner, a unified open-source framework that enables end-to-end LLM fine-tuning directly on commodity mobile phones. MobileFineTuner is designed for efficiency, scalability, and usability, supporting full-parameters fine-tuning (Full-FT) and parameter-efficient fine-tuning (PEFT). To address the memory and energy limitations inherent to mobile phones, we introduce system-level optimizations including parameter sharding, gradient accumulation, and energy-aware computation scheduling. We demonstrate the practicality of MobileFineTuner by fine-tuning GPT-2, Gemma 3, and Qwen 2.5 on real mobile phones. Extensive experiments and ablation studies validate the effectiveness of the proposed optimizations and establish MobileFineTuner as a viable foundation for future research on on-device LLM training.

[339] Correction of Decoupled Weight Decay

Jason Chuan-Chih Chou

Main category: cs.LG

TL;DR: The paper challenges the conventional wisdom that decoupled weight decay should be proportional to learning rate (γ), showing instead that it should be proportional to γ² for stable weight norms and better training dynamics.

DetailsMotivation: The motivation is to re-examine the fundamental assumption about decoupled weight decay in AdamW optimizer. While it has long been assumed that weight decay should be proportional to learning rate (γ), recent arguments suggest it should be proportional to γ² based on orthogonality arguments. The authors aim to provide a more rigorous theoretical foundation for weight decay settings.

Method: The authors analyze training dynamics by eliminating the contribution of the perpendicular component of updates to weight norm. They derive that decoupled weight decay ∝ γ² results in stable weight norm based on the assumption that updates become independent of weights at steady state. They also analyze Total Update Contribution (TUC) under the Scion optimizer using momentum-dependent effective learning rate.
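
The core claim can be reproduced in a toy simulation under the stated assumption that updates are independent of the weights at steady state: with decay proportional to the learning rate, the steady-state weight norm changes with the learning rate, while with decay proportional to its square it stays roughly constant. The constants, dimensions, and step counts below are arbitrary.

```python
import numpy as np

def steady_state_norm(gamma, lam, dim=1000, steps=20_000, seed=0):
    """Simulate w <- (1 - lam) * w + gamma * u with updates u independent of w
    (the steady-state assumption), and return the final weight norm."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(steps):
        u = rng.normal(size=dim)          # update independent of current weights
        w = (1.0 - lam) * w + gamma * u   # decoupled weight decay + update
    return np.linalg.norm(w)

for gamma in (0.01, 0.02):
    n1 = steady_state_norm(gamma, lam=0.1 * gamma)        # conventional: decay ~ gamma
    n2 = steady_state_norm(gamma, lam=10.0 * gamma ** 2)  # paper's prescription: decay ~ gamma^2
    print(f"gamma={gamma}: norm(decay~gamma)={n1:.1f}, norm(decay~gamma^2)={n2:.1f}")
```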

Result: The paper finds that decoupled weight decay ∝ γ² leads to stable weight and gradient norms, allows better control of training dynamics, and improves model performance. The authors empirically verify that the optimal value of momentum-dependent effective learning rate transfers well.

Conclusion: The conventional setting of decoupled weight decay proportional to learning rate is suboptimal. Setting it proportional to γ² provides more stable training dynamics, better control over weight and gradient norms, and ultimately improves model performance across different optimizers.

Abstract: Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate $\gamma$ without question. Some researchers have recently challenged this assumption and argued that decoupled weight decay should be set $\propto \gamma^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto \gamma^2$ results in a stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers, and we show that decoupled weight decay $\propto \gamma^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.

[340] PR-CapsNet: Pseudo-Riemannian Capsule Network with Adaptive Curvature Routing for Graph Learning

Ye Qin, Jingchao Wang, Yang Shi, Haiying Huang, Junxu Li, Weijian Liu, Tinghui Chen, Jinghui Qin

Main category: cs.LG

TL;DR: PR-CapsNet extends capsule networks to pseudo-Riemannian manifolds with adaptive curvature for better graph representation learning, outperforming SOTA models on classification tasks.

DetailsMotivation: Capsule Networks have strong graph representation capacity but struggle with complex real-world graph geometries due to fixed-curvature Euclidean space limitations. While pseudo-Riemannian manifolds show promise for graph embedding, their integration with CapsNets remains underexplored.

Method: Extends Euclidean capsule routing to pseudo-Riemannian manifolds with three key components: 1) Pseudo-Riemannian Tangent Space Routing decomposes capsule states into spherical-temporal and Euclidean-spatial subspaces, 2) Adaptive Curvature Routing fuses features from different curvature spaces using learnable curvature tensor with geometric attention, 3) Pseudo-Riemannian Capsule Classifier projects embeddings to tangent spaces with curvature-weighted softmax.

Result: Extensive experiments on node and graph classification benchmarks show PR-CapsNet outperforms state-of-the-art models, demonstrating strong representation power for complex graph structures.

Conclusion: PR-CapsNet successfully integrates pseudo-Riemannian geometry with capsule networks, providing adaptive curvature modeling that better captures hierarchical, cluster, and cyclic graph structures, leading to superior graph representation learning performance.

Abstract: Capsule Networks (CapsNets) show exceptional graph representation capacity via dynamic routing and vectorized hierarchical representations, but they model the complex geometries of real-world graphs poorly by fixed-curvature space due to the inherent geodesical disconnectedness issues, leading to suboptimal performance. Recent works find that non-Euclidean pseudo-Riemannian manifolds provide specific inductive biases for embedding graph data, but how to leverage them to improve CapsNets is still underexplored. Here, we extend the Euclidean capsule routing into geodesically disconnected pseudo-Riemannian manifolds and derive a Pseudo-Riemannian Capsule Network (PR-CapsNet), which models data in pseudo-Riemannian manifolds of adaptive curvature, for graph representation learning. Specifically, PR-CapsNet enhances the CapsNet with Adaptive Pseudo-Riemannian Tangent Space Routing by utilizing pseudo-Riemannian geometry. Unlike single-curvature or subspace-partitioning methods, PR-CapsNet concurrently models hierarchical and cluster or cyclic graph structures via its versatile pseudo-Riemannian metric. It first deploys Pseudo-Riemannian Tangent Space Routing to decompose capsule states into spherical-temporal and Euclidean-spatial subspaces with diffeomorphic transformations. Then, an Adaptive Curvature Routing is developed to adaptively fuse features from different curvature spaces for complex graphs via a learnable curvature tensor with geometric attention from local manifold properties. Finally, a geometric properties-preserved Pseudo-Riemannian Capsule Classifier is developed to project capsule embeddings to tangent spaces and use curvature-weighted softmax for classification. Extensive experiments on node and graph classification benchmarks show PR-CapsNet outperforms SOTA models, validating PR-CapsNet’s strong representation power for complex graph structures.

[341] Persistent Topological Structures and Cohomological Flows as a Mathematical Framework for Brain-Inspired Representation Learning

Preksha Girish, Rachana Mysore, Mahanthesha U, Shrey Kumar, Shipra Prashant

Main category: cs.LG

TL;DR: A rigorous mathematical framework for brain-inspired representation learning using persistent topology and cohomological flows, achieving superior manifold consistency and noise resilience.

DetailsMotivation: To establish a mathematically rigorous foundation for brain-inspired representation learning by leveraging the interplay between persistent topological structures and cohomological flows, moving beyond traditional neural network approaches.

Method: Reformulates neural computation as evolution of cochain maps over dynamic simplicial complexes. Integrates algebraic topology with differential geometry to construct cohomological operators that generalize gradient-based learning within a homological landscape. Uses persistent homology, sheaf cohomology, and spectral Laplacians to analyze both synthetic data with controlled topological signatures and real neural datasets.

Result: The model achieves superior manifold consistency and noise resilience compared to graph neural networks and manifold-based deep architectures. Demonstrates stability, continuity, and structural preservation in representation learning.

Conclusion: Establishes a coherent mathematical foundation for topology-driven representation learning by successfully integrating persistent topological structures with cohomological flows, providing a rigorous framework for brain-inspired computational models.

Abstract: This paper presents a mathematically rigorous framework for brain-inspired representation learning founded on the interplay between persistent topological structures and cohomological flows. Neural computation is reformulated as the evolution of cochain maps over dynamic simplicial complexes, enabling representations that capture invariants across temporal, spatial, and functional brain states. The proposed architecture integrates algebraic topology with differential geometry to construct cohomological operators that generalize gradient-based learning within a homological landscape. Synthetic data with controlled topological signatures and real neural datasets are jointly analyzed using persistent homology, sheaf cohomology, and spectral Laplacians to quantify stability, continuity, and structural preservation. Empirical results demonstrate that the model achieves superior manifold consistency and noise resilience compared to graph neural and manifold-based deep architectures, establishing a coherent mathematical foundation for topology-driven representation learning.

[342] SPROCKET: Extending ROCKET to Distance-Based Time-Series Transformations With Prototypes

Nicholas Harner

Main category: cs.LG

TL;DR: SPROCKET introduces a prototype-based feature engineering strategy for time series classification that achieves comparable performance to existing convolutional algorithms and improves ensemble performance when combined with MultiROCKET-HYDRA.

DetailsMotivation: Current time series classification is dominated by feature engineering approaches like ROCKET, which uses random kernel features. The authors aim to develop a new feature engineering strategy based on prototypes to potentially enhance accuracy and robustness.

Method: SPROCKET (Selected Prototype Random Convolutional Kernel Transform) implements a prototype-based feature engineering strategy. The authors also create an ensemble called MR-HY-SP (MultiROCKET-HYDRA-SPROCKET) that combines SPROCKET with existing approaches.
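
The prototype-based transform can be sketched as a distance-to-prototype feature matrix, shown below; the Euclidean distance and the random prototype selection are simplifying assumptions, not necessarily SPROCKET's choices.

```python
import numpy as np

def prototype_features(X, prototypes):
    """Represent each series by its distances to a set of prototype series.
    (Plain Euclidean distance here; SPROCKET's distance measure may differ.)"""
    # X: (n_series, length), prototypes: (n_protos, length)
    return np.sqrt(((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 150))                    # toy dataset: 100 series of length 150
proto_idx = rng.choice(len(X), size=10, replace=False)
prototypes = X[proto_idx]                           # selected prototypes (here: random picks)

features = prototype_features(X, prototypes)        # (100, 10) feature matrix
print(features.shape)
# These features can be fed to a linear classifier, just as ROCKET's random-kernel
# features are, and concatenated with MultiROCKET/HYDRA features for the ensemble.
```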

Result: SPROCKET achieves comparable performance to existing convolutional algorithms on most UCR and UEA Time Series Classification archives. The MR-HY-SP ensemble’s average accuracy ranking exceeds HYDRA-MR, which was previously the best convolutional ensemble.

Conclusion: Prototype-based feature transformation can enhance both accuracy and robustness in time series classification, as demonstrated by SPROCKET’s strong performance and the superior results of the MR-HY-SP ensemble.

Abstract: Classical Time Series Classification algorithms are dominated by feature engineering strategies. One of the most prominent of these transforms is ROCKET, which achieves strong performance through random kernel features. We introduce SPROCKET (Selected Prototype Random Convolutional Kernel Transform), which implements a new feature engineering strategy based on prototypes. On a majority of the UCR and UEA Time Series Classification archives, SPROCKET achieves performance comparable to existing convolutional algorithms, and the new MR-HY-SP (MultiROCKET-HYDRA-SPROCKET) ensemble’s average accuracy ranking exceeds that of HYDRA-MR, the previous best convolutional ensemble. These experimental results demonstrate that prototype-based feature transformation can enhance both accuracy and robustness in time series classification.

[343] Wavelet-Accelerated Physics-Informed Quantum Neural Network for Multiscale Partial Differential Equations

Deepak Gupta, Himanshu Pandey, Ratikanta Behera

Main category: cs.LG

TL;DR: Wavelet-based physics-informed quantum neural network framework for multiscale PDEs with sharp gradients and oscillations, eliminating automatic differentiation for faster training with fewer parameters.

DetailsMotivation: Traditional PINNs and quantum-PINNs struggle with multiscale features and suffer from computational overhead due to automatic differentiation, needing more efficient methods for multiscale PDEs with sharp gradients and oscillations.

Method: Developed wavelet-accelerated physics-informed quantum neural network that incorporates wavelet multiresolution properties into quantum neural architecture, eliminating automatic differentiation for loss function construction.

Result: Achieves superior accuracy with <5% trainable parameters vs classical wavelet-based PINNs, 3-5x speedup over existing quantum PINNs, faster convergence for multiscale oscillatory problems.

Conclusion: Proposed framework effectively solves challenging multiscale PDEs by combining wavelet multiresolution with quantum neural networks, offering significant computational advantages over existing PINN approaches.

Abstract: This work proposes a wavelet-based physics-informed quantum neural network framework to efficiently address multiscale partial differential equations that involve sharp gradients, stiffness, rapid local variations, and highly oscillatory behavior. Traditional physics-informed neural networks (PINNs) have demonstrated substantial potential in solving differential equations, and their quantum counterparts, quantum-PINNs, exhibit enhanced representational capacity with fewer trainable parameters. However, both approaches face notable challenges in accurately solving multiscale features. Furthermore, their reliance on automatic differentiation for constructing loss functions introduces considerable computational overhead, resulting in longer training times. To overcome these challenges, we developed a wavelet-accelerated physics-informed quantum neural network that eliminates the need for automatic differentiation, significantly reducing computational complexity. The proposed framework incorporates the multiresolution property of wavelets within the quantum neural network architecture, thereby enhancing the network’s ability to effectively capture both local and global features of multiscale problems. Numerical experiments demonstrate that our proposed method achieves superior accuracy while requiring less than five percent of the trainable parameters compared to classical wavelet-based PINNs, resulting in faster convergence. Moreover, it offers a speedup of three to five times compared to existing quantum PINNs, highlighting the potential of the proposed approach for efficiently solving challenging multiscale and oscillatory problems.

[344] Mathematical Foundations of Neural Tangents and Infinite-Width Networks

Rachana Mysore, Preksha Girish, Kavitha Jayaram, Shrey Kumar, Shravan Sanjeev Bagal, Shreya Aravind Shastry

Main category: cs.LG

TL;DR: The paper proposes NTK-ECRN, a neural network architecture designed for rigorous analysis of Neural Tangent Kernel evolution during training, with theoretical bounds on NTK dynamics and empirical validation showing improved stability and generalization.

DetailsMotivation: To bridge the gap between infinite-width neural network theory (via NTK) and practical deep learning architectures, enabling rigorous mathematical analysis of kernel evolution during training while maintaining practical applicability.

Method: Proposes NTK-ECRN architecture integrating Fourier feature embeddings, residual connections with layerwise scaling, and stochastic depth. Theoretically analyzes NTK dynamics, derives bounds on kernel evolution, characterizes eigenvalue behavior, and links spectral properties to generalization and optimization stability.

Result: Theoretical contributions include bounds on NTK dynamics and eigenvalue evolution. Empirical results on synthetic and benchmark datasets validate predicted kernel behavior and demonstrate improved training stability and generalization compared to baseline approaches.

Conclusion: The work provides a comprehensive framework that successfully bridges infinite-width neural network theory with practical deep learning architectures, enabling rigorous analysis while maintaining practical performance benefits.

Abstract: We investigate the mathematical foundations of neural networks in the infinite-width regime through the Neural Tangent Kernel (NTK). We propose the NTK-Eigenvalue-Controlled Residual Network (NTK-ECRN), an architecture integrating Fourier feature embeddings, residual connections with layerwise scaling, and stochastic depth to enable rigorous analysis of kernel evolution during training. Our theoretical contributions include deriving bounds on NTK dynamics, characterizing eigenvalue evolution, and linking spectral properties to generalization and optimization stability. Empirical results on synthetic and benchmark datasets validate the predicted kernel behavior and demonstrate improved training stability and generalization. This work provides a comprehensive framework bridging infinite-width theory and practical deep-learning architectures.
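
As a minimal sketch of the architectural ingredients named in the abstract, the snippet below combines random Fourier feature embeddings, residual blocks with an assumed 1/sqrt(depth) layerwise scaling, and stochastic depth. Widths, scales, and the drop probability are assumptions, and the NTK analysis itself is not reproduced.

```python
# Sketch of NTK-ECRN-style ingredients; all hyperparameters here are illustrative.
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    def __init__(self, d_in, n_features=64, scale=1.0):
        super().__init__()
        self.register_buffer("B", torch.randn(d_in, n_features) * scale)

    def forward(self, x):
        proj = 2 * torch.pi * x @ self.B
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

class ScaledResidualBlock(nn.Module):
    def __init__(self, dim, layer_idx, drop_prob=0.1):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.scale = 1.0 / (layer_idx + 1) ** 0.5   # assumed form of layerwise scaling
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training and torch.rand(()) < self.drop_prob:
            return x                                 # stochastic depth: skip this block
        return x + self.scale * self.f(x)

net = nn.Sequential(FourierFeatures(2), nn.Linear(128, 128),
                    *[ScaledResidualBlock(128, i) for i in range(4)], nn.Linear(128, 1))
y = net(torch.randn(10, 2))
```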

[345] SOFA-FL: Self-Organizing Hierarchical Federated Learning with Adaptive Clustered Data Sharing

Yi Ni, Xinkun Wang, Han Zhang

Main category: cs.LG

TL;DR: SOFA-FL is a self-organizing hierarchical federated learning framework that dynamically adapts to evolving environments through adaptive clustering and controlled data sharing.

DetailsMotivation: Federated Learning faces challenges with data heterogeneity and fixed network topologies in evolving environments, requiring adaptive solutions.

Method: Three core mechanisms: 1) DMAC for initial hierarchical structure, 2) SHAPE for dynamic topology restructuring via grafting/pruning/consolidation/purification, 3) Adaptive Clustered Data Sharing for controlled data exchange.

Result: SOFA-FL effectively captures dynamic client relationships and enhances personalization without predetermined cluster structures.

Conclusion: The framework enables hierarchical federated systems to self-organize and adapt over time, addressing data heterogeneity and topology rigidity.

Abstract: Federated Learning (FL) faces significant challenges in evolving environments, particularly regarding data heterogeneity and the rigidity of fixed network topologies. To address these issues, this paper proposes SOFA-FL (Self-Organizing Hierarchical Federated Learning with Adaptive Clustered Data Sharing), a novel framework that enables hierarchical federated systems to self-organize and adapt over time. The framework is built upon three core mechanisms: (1) Dynamic Multi-branch Agglomerative Clustering (DMAC), which constructs an initial efficient hierarchical structure; (2) Self-organizing Hierarchical Adaptive Propagation and Evolution (SHAPE), which allows the system to dynamically restructure its topology through atomic operations – grafting, pruning, consolidation, and purification – to adapt to changes in data distribution; and (3) Adaptive Clustered Data Sharing, which mitigates data heterogeneity by enabling controlled partial data exchange between clients and cluster nodes. By integrating these mechanisms, SOFA-FL effectively captures dynamic relationships among clients and enhances personalization capabilities without relying on predetermined cluster structures.
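
The paper's DMAC procedure is not spelled out in the abstract; as a stand-in, the sketch below seeds a hierarchical FL topology by agglomeratively clustering clients on the similarity of their flattened model updates, which is one common way to group clients before building a hierarchy.

```python
# Illustrative stand-in for the initial clustering step (not the paper's DMAC):
# group clients by cosine similarity of their flattened local updates.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_clients(client_updates, num_clusters=4):
    """client_updates: (n_clients, n_params) array of flattened local updates."""
    dists = pdist(client_updates, metric="cosine")   # pairwise client distances
    tree = linkage(dists, method="average")          # hierarchical merge tree
    return fcluster(tree, t=num_clusters, criterion="maxclust")

updates = np.random.default_rng(1).normal(size=(16, 1000))  # 16 mock clients
labels = cluster_clients(updates, num_clusters=4)
print(labels)  # cluster id per client
```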

[346] gHAWK: Local and Global Structure Encoding for Scalable Training of Graph Neural Networks on Knowledge Graphs

Humera Sabir, Fatima Farooq, Ashraf Aboulnaga

Main category: cs.LG

TL;DR: gHAWK is a scalable GNN training framework for large knowledge graphs that precomputes structural features (Bloom filters for local structure and TransE embeddings for global position) to accelerate training and improve accuracy.

DetailsMotivation: Existing message-passing GNNs struggle to scale to large knowledge graphs due to inefficient iterative message passing, especially under mini-batch training where nodes only see partial neighborhood views.

Method: Precomputes structural features before GNN training: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node’s global position. These are fused with domain-specific features to create node feature vectors that can be incorporated into any GNN technique.

Result: Achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks in Open Graph Benchmark, topping the OGB leaderboard for three graphs. Significantly reduces memory usage, accelerates convergence, and improves model accuracy.

Conclusion: gHAWK provides a novel and scalable solution for GNN training on large knowledge graphs by augmenting message-passing with precomputed structural priors, overcoming scalability limitations of traditional approaches.

Abstract: Knowledge Graphs (KGs) are a rich source of structured, heterogeneous data, powering a wide range of applications. A common approach to leverage this data is to train a graph neural network (GNN) on the KG. However, existing message-passing GNNs struggle to scale to large KGs because they rely on the iterative message passing process to learn the graph structure, which is inefficient, especially under mini-batch training, where a node sees only a partial view of its neighborhood. In this paper, we address this problem and present gHAWK, a novel and scalable GNN training framework for large KGs. The key idea is to precompute structural features for each node that capture its local and global structure before GNN training even begins. Specifically, gHAWK introduces a preprocessing step that computes: (a) Bloom filters to compactly encode local neighborhood structure, and (b) TransE embeddings to represent each node’s global position in the graph. These features are then fused with any domain-specific features (e.g., text embeddings), producing a node feature vector that can be incorporated into any GNN technique. By augmenting message-passing training with structural priors, gHAWK significantly reduces memory usage, accelerates convergence, and improves model accuracy. Extensive experiments on large datasets from the Open Graph Benchmark (OGB) demonstrate that gHAWK achieves state-of-the-art accuracy and lower training time on both node property prediction and link prediction tasks, topping the OGB leaderboard for three graphs.
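
A toy version of the precomputation step might look like the following: a small Bloom filter over a node's neighbor ids, concatenated with a (separately trained) TransE embedding and any domain features. The filter size, hash construction, and dimensions are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical gHAWK-style feature precomputation for one node; sizes and hashing
# are assumptions, and the TransE embedding is assumed to come from a prior step.
import hashlib
import numpy as np

def bloom_filter(neighbor_ids, num_bits=64, num_hashes=3):
    """Compactly encode a node's 1-hop neighborhood as a fixed-size bit vector."""
    bits = np.zeros(num_bits, dtype=np.float32)
    for nid in neighbor_ids:
        for k in range(num_hashes):
            h = hashlib.sha256(f"{k}:{nid}".encode()).digest()
            bits[int.from_bytes(h[:4], "little") % num_bits] = 1.0
    return bits

def node_feature(neighbor_ids, transe_emb, domain_feats):
    """Fuse local structure, global position, and domain features into one vector."""
    return np.concatenate([bloom_filter(neighbor_ids), transe_emb, domain_feats])

# Toy node: neighbors {3, 17, 42}, a 32-d TransE vector, and 8 domain features
x = node_feature([3, 17, 42], np.random.randn(32).astype(np.float32),
                 np.zeros(8, dtype=np.float32))
print(x.shape)  # (104,)
```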

[347] Jacobian Aligned Random Forests

Sarwesh Rauniyar

Main category: cs.LG

TL;DR: JARF uses gradient-based supervised preconditioning to rotate features before training axis-aligned forests, capturing oblique boundaries while maintaining simplicity.

DetailsMotivation: Axis-aligned trees are fast but struggle with rotated/interaction boundaries; oblique forests handle these but are complex and computationally expensive.

Method: Fit axis-aligned forest, compute gradients of predictions, aggregate into Jacobian outer product, use as global linear preconditioner to rotate feature space, then train standard axis-aligned forest on transformed data.

Result: Consistently improves axis-aligned forests, often matches/surpasses oblique baselines with better training time, maintains simplicity while capturing complex boundaries.

Conclusion: Supervised preconditioning recovers much of oblique forest accuracy while retaining axis-aligned tree simplicity and robustness.

Abstract: Axis-aligned decision trees are fast and stable but struggle on datasets with rotated or interaction-dependent decision boundaries, where informative splits require linear combinations of features rather than single-feature thresholds. Oblique forests address this with per-node hyperplane splits, but at added computational cost and implementation complexity. We propose a simple alternative: JARF, Jacobian-Aligned Random Forests. Concretely, we first fit an axis-aligned forest to estimate class probabilities or regression outputs, compute finite-difference gradients of these predictions with respect to each feature, aggregate them into an expected Jacobian outer product that generalizes the expected gradient outer product (EGOP), and use it as a single global linear preconditioner for all inputs. This supervised preconditioner applies a single global rotation of the feature space, then hands the transformed data back to a standard axis-aligned forest, preserving off-the-shelf training pipelines while capturing oblique boundaries and feature interactions that would otherwise require many axis-aligned splits to approximate. The same construction applies to any model that provides gradients, though we focus on random forests and gradient-boosted trees in this work. On tabular classification and regression benchmarks, this preconditioning consistently improves axis-aligned forests and often matches or surpasses oblique baselines while improving training time. Our experimental results and theoretical analysis together indicate that supervised preconditioning can recover much of the accuracy of oblique forests while retaining the simplicity and robustness of axis-aligned trees.
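
A compact sketch of the pipeline described above follows, under stated assumptions: the finite-difference step size, and using the eigenvectors of the expected Jacobian outer product (scaled by square-root eigenvalues) as the single global preconditioner before refitting.

```python
# Minimal JARF-style sketch; the finite-difference step and the exact form of the
# global preconditioner are assumptions where the abstract is silent.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def jacobian_outer_product(model, X, eps=1e-2):
    """Average outer products of finite-difference gradients of class probabilities."""
    n, d = X.shape
    base = model.predict_proba(X)                       # (n, C)
    G = np.zeros((n, base.shape[1], d))                 # per-sample Jacobians
    for j in range(d):
        Xp = X.copy()
        Xp[:, j] += eps
        G[:, :, j] = (model.predict_proba(Xp) - base) / eps
    return np.einsum("ncj,nck->jk", G, G) / (n * base.shape[1])

def jarf_transform(M):
    """Global rotation/scaling from the eigen-structure of the Jacobian outer product."""
    vals, vecs = np.linalg.eigh(M)
    return vecs * np.sqrt(np.maximum(vals, 1e-12))      # columns = scaled directions

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6)); y = (X[:, 0] + X[:, 1] > 0).astype(int)  # oblique boundary
f0 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
W = jarf_transform(jacobian_outer_product(f0, X))
f1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X @ W, y)  # refit on rotated features
```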

[348] Minimizing Layerwise Activation Norm Improves Generalization in Federated Learning

M Yashwanth, Gaurav Kumar Nayak, Harsh Rangwani, Arya Singh, R. Venkatesh Babu, Anirban Chakraborty

Main category: cs.LG

TL;DR: The paper proposes MAN, a regularization technique for Federated Learning that minimizes activation norms to enforce flat minima and improve model generalization.

DetailsMotivation: Federated Learning tends to converge to sharp minima which hurts model generalization. There's a need to improve FL-trained models' generalization performance by ensuring convergence to flat minima.

Method: Introduces a flatness-constrained FL optimization problem using Hessian eigenvalues. Proposes MAN (Minimizes Activation’s Norm) regularization that minimizes activation norms of each layer in client-side models to reduce Hessian eigenvalues and ensure flat minima.

Result: MAN regularization applied to existing FL techniques achieves significant improvements and establishes new state-of-the-art performance.

Conclusion: The proposed MAN regularization effectively improves FL model generalization by enforcing flat minima through activation norm minimization, with theoretical guarantees and practical effectiveness.

Abstract: Federated Learning (FL) is an emerging machine learning framework that enables multiple clients (coordinated by a server) to collaboratively train a global model by aggregating the locally trained models without sharing any client’s training data. It has been observed in recent works that learning in a federated manner may lead the aggregated global model to converge to a ‘sharp minimum’ thereby adversely affecting the generalizability of this FL-trained model. Therefore, in this work, we aim to improve the generalization performance of models trained in a federated setup by introducing a ‘flatness’ constrained FL optimization problem. This flatness constraint is imposed on the top eigenvalue of the Hessian computed from the training loss. As each client trains a model on its local data, we further re-formulate this complex problem utilizing the client loss functions and propose a new computationally efficient regularization technique, dubbed ‘MAN,’ which Minimizes Activation’s Norm of each layer on client-side models. We also theoretically show that minimizing the activation norm reduces the top eigenvalue of the layer-wise Hessian of the client’s loss, which in turn decreases the overall Hessian’s top eigenvalue, ensuring convergence to a flat minimum. We apply our proposed flatness-constrained optimization to the existing FL techniques and obtain significant improvements, thereby establishing new state-of-the-art.
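
As a minimal PyTorch sketch of a MAN-style penalty, the client loss below adds the mean squared norm of each hidden layer's activations to the task loss. The regularization coefficient and the choice of which layers to penalize are illustrative assumptions, not the paper's exact recipe.

```python
# MAN-style activation-norm penalty (illustrative coefficient and layer choice).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClientMLP(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_hidden)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        a1 = F.relu(self.fc1(x))
        a2 = F.relu(self.fc2(a1))
        return self.head(a2), [a1, a2]        # expose activations for the penalty

def man_loss(logits, targets, activations, lam=1e-3):
    """Cross-entropy plus the layerwise mean squared activation norm."""
    act_pen = sum(a.pow(2).sum(dim=1).mean() for a in activations)
    return F.cross_entropy(logits, targets) + lam * act_pen

model = ClientMLP()
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
logits, acts = model(x)
loss = man_loss(logits, y, acts)
loss.backward()
```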

[349] A Multivariate Bernoulli-Based Sampling Method for Multi-Label Data with Application to Meta-Research

Simon Chung, Colby J. Vorland, Donna L. Maney, Andrew W. Brown

Main category: cs.LG

TL;DR: Novel weighted sampling algorithm for multi-label data that accounts for label dependencies to create balanced samples while preserving population characteristics.

DetailsMotivation: Multi-label datasets often have imbalanced label frequencies and dependencies between labels, making it challenging to obtain representative samples that include sufficient observations of scarce labels while maintaining known population characteristics.

Method: Uses multivariate Bernoulli distribution to model multi-label data, estimates distribution parameters from observed label frequencies, calculates weights for each label combination, and performs weighted sampling that accounts for label dependencies to achieve target distribution characteristics.

Result: Applied to Web of Science biomedical research articles with 64 topic categories, the approach successfully created a more balanced sub-sample that preserved category frequency order, reduced frequency differences between most and least common categories, and enhanced representation of minority categories.

Conclusion: The proposed sampling algorithm effectively addresses multi-label sampling challenges by incorporating label dependencies, producing balanced samples that improve minority category representation while maintaining population characteristics.

Abstract: Datasets may contain observations with multiple labels. If the labels are not mutually exclusive, and if the labels vary greatly in frequency, obtaining a sample that includes sufficient observations with scarcer labels to make inferences about those labels, and which deviates from the population frequencies in a known manner, creates challenges. In this paper, we consider a multivariate Bernoulli distribution as our underlying distribution of a multi-label problem. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate multivariate Bernoulli distribution parameters and calculate weights for each label combination. This approach ensures the weighted sampling acquires target distribution characteristics while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve category frequency order, reduce frequency differences between most and least common categories, and account for category dependencies. This approach produced a more balanced sub-sample, enhancing the representation of minority categories.
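
A toy version of the weighting idea is sketched below: estimate the empirical frequency of each observed label combination (a multivariate Bernoulli outcome) and sample with weights that temper the imbalance while keeping rare combinations representable. The tempering exponent is an assumption; the paper derives its weights from explicit target criteria.

```python
# Illustrative weighted sampling over label combinations; the weight formula
# (frequency tempering) is an assumption standing in for the paper's derivation.
import numpy as np

def combination_weights(Y, power=0.5):
    """Y: (n, L) binary label matrix. Returns a per-row sampling probability."""
    keys = [tuple(row) for row in Y]
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    freq = np.array([counts[k] / len(keys) for k in keys])
    w = freq ** (power - 1.0)        # power < 1 boosts rare combinations, keeps order
    return w / w.sum()

rng = np.random.default_rng(0)
Y = (rng.random((1000, 5)) < np.array([0.6, 0.3, 0.1, 0.05, 0.02])).astype(int)
w = combination_weights(Y, power=0.5)
sample_idx = rng.choice(len(Y), size=200, replace=False, p=w)
```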

[350] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process

Gary Ackerman, Zachary Kallenborn, Anna Wetzel, Hayley Peterson, Jenna LaTourette, Olivia Shoemaker, Brandon Behlendorf, Sheriff Almakki, Doug Clifford, Noah Sheinbaum

Main category: cs.LG

TL;DR: This paper describes the development of a Bacterial Biothreat Benchmark (B3) dataset as part of a framework to assess AI model risks in bioterrorism contexts.

DetailsMotivation: There is significant concern about frontier AI models (especially LLMs) potentially facilitating bioterrorism or biological weapons access. Both developers and policymakers need tools to quantify and mitigate these risks through reliable benchmarks.

Method: Used three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora. Generated over 7,000 potential benchmarks linked to a Task-Query Architecture. Applied de-duplication, uplift diagnosticity assessment, and quality control to refine the dataset.

Result: Produced a final set of 1,010 benchmarks that are diagnostic (provide uplift), directly relevant to biosecurity threats, and aligned with a larger biosecurity architecture for nuanced analysis.

Conclusion: The paper successfully developed the B3 dataset as the second component of a Biothreat Benchmark Generation framework, creating a valuable tool for assessing AI model risks related to biological threats.

Abstract: The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper, the second in a series of three, describes the second component of a novel Biothreat Benchmark Generation (BBG) framework: the generation of the Bacterial Biothreat Benchmark (B3) dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora, to generate over 7,000 potential benchmarks linked to the Task-Query Architecture that was developed during the first component of the project. A process of de-duplication, followed by an assessment of uplift diagnosticity, and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks. This procedure ensured that these benchmarks are a) diagnostic in terms of providing uplift; b) directly relevant to biosecurity threats; and c) are aligned with a larger biosecurity architecture permitting nuanced analysis at different levels of analysis.

[351] Fully Decentralized Certified Unlearning

Hithem Lamri, Michail Maniatakos

Main category: cs.LG

TL;DR: RR-DU: A decentralized certified unlearning method using random walks with gradient steps and Gaussian noise to remove data influence while maintaining privacy guarantees.

DetailsMotivation: Machine unlearning needs to work in decentralized networks without coordinators, but current certified unlearning methods focus on centralized or federated settings with servers. Decentralized networks with fixed topologies remain underexplored for certified unlearning.

Method: RR-DU uses random walks where: 1) Unlearning client performs projected gradient ascent on forget set, 2) Other clients perform geometrically distributed projected descent steps on retained data, 3) Adds subsampled Gaussian noise, 4) Projects onto trust region around original model, 5) Uses segment-level subsampling for privacy.

Result: Provides convergence guarantees (convex case) and stationarity guarantees (nonconvex case); achieves (ε,δ) network-unlearning certificates via subsampled Gaussian Rényi DP; shows deletion-capacity bounds scaling with forget-to-local data ratio; empirically matches privacy guarantees while achieving higher test accuracy than decentralized DP baselines and reduces forget accuracy to random guessing (~10%).

Conclusion: RR-DU successfully enables certified unlearning in decentralized networks, balancing privacy and utility while effectively removing specified data influence, demonstrating practical viability on image classification tasks.

Abstract: Machine unlearning (MU) seeks to remove the influence of specified data from a trained model in response to privacy requests or data poisoning. While certified unlearning has been analyzed in centralized and server-orchestrated federated settings (via guarantees analogous to differential privacy, DP), the decentralized setting – where peers communicate without a coordinator – remains underexplored. We study certified unlearning in decentralized networks with fixed topologies and propose RR-DU, a random-walk procedure that performs one projected gradient ascent step on the forget set at the unlearning client and a geometrically distributed number of projected descent steps on the retained data elsewhere, combined with subsampled Gaussian noise and projection onto a trust region around the original model. We provide (i) convergence guarantees in the convex case and stationarity guarantees in the nonconvex case, (ii) $(\varepsilon,\delta)$ network-unlearning certificates on client views via subsampled Gaussian Rényi DP (RDP) with segment-level subsampling, and (iii) deletion-capacity bounds that scale with the forget-to-local data ratio and quantify the effect of decentralization (network mixing and randomized subsampling) on the privacy-utility trade-off. Empirically, on image benchmarks (MNIST, CIFAR-10), RR-DU matches a given $(\varepsilon,\delta)$ while achieving higher test accuracy than decentralized DP baselines and reducing forget accuracy to random guessing ($\approx 10\%$).
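
A schematic sketch of one local round of the procedure described above is given below. Step sizes, the noise scale, the geometric parameter, and the trust-region radius are illustrative assumptions, and `loss_fn` is a placeholder for any scalar loss over a batch.

```python
# Schematic RR-DU-style round: ascent on the forget set, a Geometric number of
# descent steps on retained data, Gaussian noise, projection onto a parameter ball.
import torch

def flat_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def project_to_ball(model, center, radius):
    """Project the model's parameters onto an L2 ball around `center`."""
    delta = flat_params(model) - center
    scale = min(1.0, radius / (delta.norm() + 1e-12))
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_((center[offset:offset + n] + scale * delta[offset:offset + n]).view_as(p))
            offset += n

def rr_du_round(model, loss_fn, forget_batch, retain_batches, center,
                lr=0.01, sigma=0.01, q=0.5, radius=1.0):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    # 1) one ascent step on the forget set at the unlearning client
    opt.zero_grad(); (-loss_fn(model, forget_batch)).backward(); opt.step()
    # 2) Geometric(q) descent steps on retained data at other clients
    k = torch.distributions.Geometric(probs=q).sample().int().item() + 1
    for batch in retain_batches[:k]:
        opt.zero_grad(); loss_fn(model, batch).backward(); opt.step()
    # 3) add Gaussian noise, then 4) project onto the trust region
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))
    project_to_ball(model, center, radius)
```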

[352] Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset

Gary Ackerman, Theodore Wilson, Zachary Kallenborn, Olivia Shoemaker, Anna Wetzel, Hayley Peterson, Abigail Danfora, Jenna LaTourette, Brandon Behlendorf, Douglas Clifford

Main category: cs.LG

TL;DR: Researchers pilot-tested the Bacterial Biothreat Benchmark (B3) dataset to assess biosecurity risks from frontier AI models, finding it provides viable, nuanced evaluation of LLM risks and identifies priority mitigation areas.

DetailsMotivation: There's significant concern about frontier AI models (especially LLMs) potentially facilitating bioterrorism or biological weapons access. Both developers and policymakers need ways to quantify and mitigate these risks, requiring benchmarks to assess specific model vulnerabilities.

Method: Pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset through a sample frontier AI model, followed by human evaluation of model responses and applied risk analysis across multiple dimensions.

Result: The pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing biosecurity risks posed by LLMs, identifying key risk sources and providing guidance for priority mitigation areas.

Conclusion: The B3 benchmark framework provides an effective tool for evaluating and mitigating biosecurity risks from frontier AI models, helping developers and policymakers address critical safety concerns in AI deployment.

Abstract: The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework, with previous papers detailing the development of the B3 dataset. The pilot involved running the benchmarks through a sample frontier AI model, followed by human evaluation of model responses, and an applied risk analysis of the results along several dimensions. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by an LLM, identifying the key sources of that risk and providing guidance on priority areas for mitigation.

[353] Transformers for Multimodal Brain State Decoding: Integrating Functional Magnetic Resonance Imaging Data and Medical Metadata

Danial Jafarzadeh Jazi, Maryam Hajiesmaeili

Main category: cs.LG

TL;DR: Novel transformer-based framework integrates fMRI data with DICOM metadata using attention mechanisms to improve brain state decoding accuracy, interpretability, and robustness.

DetailsMotivation: Traditional machine learning and deep learning approaches for fMRI brain state decoding often fail to utilize the rich contextual information available in DICOM metadata, limiting their effectiveness despite the high-dimensional and complex nature of fMRI data.

Method: Transformer-based architecture with multimodal inputs combining fMRI data and DICOM metadata, employing attention mechanisms to capture intricate spatial-temporal patterns and contextual relationships.

Result: Enhanced model accuracy, interpretability, and robustness for brain state decoding, with potential applications in clinical diagnostics, cognitive neuroscience, and personalized medicine.

Conclusion: The framework shows promise but faces limitations like metadata variability and computational demands; future work should focus on optimizing scalability and generalizability.

Abstract: Decoding brain states from functional magnetic resonance imaging (fMRI) data is vital for advancing neuroscience and clinical applications. While traditional machine learning and deep learning approaches have made strides in leveraging the high-dimensional and complex nature of fMRI data, they often fail to utilize the contextual richness provided by Digital Imaging and Communications in Medicine (DICOM) metadata. This paper presents a novel framework integrating transformer-based architectures with multimodal inputs, including fMRI data and DICOM metadata. By employing attention mechanisms, the proposed method captures intricate spatial-temporal patterns and contextual relationships, enhancing model accuracy, interpretability, and robustness. The potential of this framework spans applications in clinical diagnostics, cognitive neuroscience, and personalized medicine. Limitations, such as metadata variability and computational demands, are addressed, and future directions for optimizing scalability and generalizability are discussed.
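
A very schematic example of one way such multimodal fusion could be wired up is shown below: fMRI time-point tokens and a DICOM-metadata token are jointly processed by a small transformer encoder. Token layout, dimensions, and the metadata encoding are assumptions; the paper's architecture is not reproduced.

```python
# Schematic multimodal fusion sketch (dimensions and token layout are assumptions).
import torch
import torch.nn as nn

class MultimodalBrainDecoder(nn.Module):
    def __init__(self, roi_dim=90, meta_dim=12, d_model=128, n_classes=4):
        super().__init__()
        self.roi_proj = nn.Linear(roi_dim, d_model)     # one token per fMRI time point
        self.meta_proj = nn.Linear(meta_dim, d_model)   # one token for DICOM metadata
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, fmri, meta):
        # fmri: (batch, time, roi_dim); meta: (batch, meta_dim)
        tokens = torch.cat([self.meta_proj(meta).unsqueeze(1), self.roi_proj(fmri)], dim=1)
        return self.head(self.encoder(tokens)[:, 0])    # classify from the metadata token

logits = MultimodalBrainDecoder()(torch.randn(4, 30, 90), torch.randn(4, 12))
```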

[354] Solving Over-Smoothing in GNNs via Nonlocal Message Passing: Algebraic Smoothing and Depth Scalability

Weiqi Guan, Junlin He

Main category: cs.LG

TL;DR: Proposes a Post-LN based method that induces algebraic smoothing to resolve the dilemma between avoiding over-smoothing (Pre-LN) and avoiding curse of depth (Post-LN) in deep GNNs.

DetailsMotivation: The relationship between Layer Normalization placement and over-smoothing in GNNs is underexplored. There's a critical dilemma: Pre-LN avoids over-smoothing but suffers from curse of depth, while Post-LN avoids curse of depth but experiences over-smoothing.

Method: Proposes a new method based on Post-LN that induces algebraic smoothing to prevent over-smoothing without suffering from the curse of depth. The approach is parameter-efficient and requires no additional parameters.

Result: Empirical results across five benchmarks show the approach supports deeper networks (up to 256 layers) and improves performance. Extensive experiments demonstrate effectiveness in deeper GNNs.

Conclusion: The proposed method resolves the LN placement dilemma by inducing algebraic smoothing in Post-LN architectures, enabling deeper GNNs without over-smoothing or curse of depth issues.

Abstract: The relationship between Layer Normalization (LN) placement and the over-smoothing phenomenon remains underexplored. We identify a critical dilemma: Pre-LN architectures avoid over-smoothing but suffer from the curse of depth, while Post-LN architectures bypass the curse of depth but experience over-smoothing. To resolve this, we propose a new method based on Post-LN that induces algebraic smoothing, preventing over-smoothing without the curse of depth. Empirical results across five benchmarks demonstrate that our approach supports deeper networks (up to 256 layers) and improves performance, requiring no additional parameters. Key contributions: Theoretical Characterization: Analysis of LN dynamics and their impact on over-smoothing and the curse of depth. A Principled Solution: A parameter-efficient method that induces algebraic smoothing and avoids over-smoothing and the curse of depth. Empirical Validation: Extensive experiments showing the effectiveness of the method in deeper GNNs.
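
To make the two LN placements concrete, the sketch below contrasts Pre-LN and Post-LN in a generic message-passing layer. The aggregation, dimensions, and toy graph are placeholders, and the paper's algebraic-smoothing modification itself is not reproduced.

```python
# Pre-LN vs Post-LN in a generic message-passing layer (placeholder aggregation).
import torch
import torch.nn as nn

class MPLayer(nn.Module):
    def __init__(self, dim, pre_ln=False):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.pre_ln = pre_ln

    def forward(self, x, adj):
        if self.pre_ln:                      # Pre-LN: normalize, aggregate, add residual
            h = adj @ self.lin(self.norm(x))
            return x + torch.relu(h)
        h = adj @ self.lin(x)                # Post-LN: aggregate, add residual, normalize
        return self.norm(x + torch.relu(h))

x, adj = torch.randn(5, 16), torch.eye(5)   # toy graph: 5 nodes, identity adjacency
out = MPLayer(16, pre_ln=False)(x, adj)
```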

[355] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen

Main category: cs.LG

TL;DR: FDA (Function-word De-Attention) improves VLM robustness against cross-modal attacks by reducing vulnerability from function words, achieving significant attack success rate reductions with minimal performance impact.

DetailsMotivation: There's a trade-off between robustness and performance in robust VLMs. Function words are identified as a vulnerability source in VLMs against cross-modal adversarial attacks, creating a need to mitigate their impact.

Method: Proposes Function-word De-Attention (FDA) that calculates both original and function-word cross-attention within attention heads, then differentially subtracts the function-word attention from the original attention (similar to differential amplifiers) to create more aligned and robust VLMs.

Result: FDA yields average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on 3 models for retrieval, and 90% ASR drop with 0.3% performance gain on visual grounding. Shows scalability, generalization, and zero-shot performance.

Conclusion: FDA effectively addresses the robustness-performance trade-off in VLMs by mitigating function-word vulnerabilities, demonstrating strong empirical results across multiple models, tasks, and attacks with minimal performance degradation.

Abstract: To address the trade-off between robustness and performance for robust VLMs, we observe that function words could incur vulnerability of VLMs against cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
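
A minimal sketch of the differential subtraction step follows: given softmaxed cross-attention over text tokens and a boolean mask marking function-word positions, the attention restricted to those positions is subtracted from the original attention and the result is renormalized. The subtraction weight and the renormalization are assumptions, not the paper's exact formulation.

```python
# Function-word de-attention sketch; alpha and the renormalization are assumptions.
import torch

def de_attend(attn, func_mask, alpha=1.0):
    """
    attn:      (batch, heads, queries, text_tokens) softmaxed cross-attention
    func_mask: (batch, text_tokens), True at function-word positions
    """
    func_attn = attn * func_mask[:, None, None, :].float()       # attention on function words only
    adjusted = (attn - alpha * func_attn).clamp(min=0.0)          # differential subtraction
    return adjusted / adjusted.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # renormalize

attn = torch.softmax(torch.randn(2, 8, 5, 7), dim=-1)
mask = torch.zeros(2, 7, dtype=torch.bool); mask[:, [0, 3]] = True  # e.g. "the", "of"
print(de_attend(attn, mask).shape)  # torch.Size([2, 8, 5, 7])
```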

[356] Developing Distance-Aware Uncertainty Quantification Methods in Physics-Guided Neural Networks for Reliable Bearing Health Prediction

Waleed Razzaq, Yun-Bo Zhao

Main category: cs.LG

TL;DR: Two distance-aware uncertainty methods (PG-SNGP and PG-SNER) for physics-guided neural networks improve degradation prediction accuracy and uncertainty calibration for bearing health monitoring, outperforming traditional uncertainty methods under OOD conditions.

DetailsMotivation: Existing uncertainty methods for predictive maintenance lack confidence calibration, are computationally expensive, not distance-aware, and fail to generalize under out-of-distribution data, which is critical for safety-critical systems like rotating machinery.

Method: Two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP (Spectral Normalization Gaussian Process) replaces final dense layer with Gaussian Process layer, and PG-SNER (Deep Evidential Regression) outputs Normal Inverse Gamma parameters. Both use spectral normalization in hidden layers to preserve input-to-latent space distances, plus a dynamic weighting scheme to balance data fidelity and physical consistency.

Result: PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise when tested on rolling-element bearing degradation using PRONOSTIA dataset, outperforming Monte Carlo and Deep Ensemble PGNNs.

Conclusion: The proposed distance-aware uncertainty methods provide accurate, calibrated uncertainty estimation for predictive maintenance in safety-critical systems, addressing key limitations of existing approaches through distance preservation and coherent probabilistic modeling.

Abstract: Accurate and uncertainty-aware degradation estimation is essential for predictive maintenance in safety-critical systems like rotating machinery with rolling-element bearings. Many existing uncertainty methods lack confidence calibration, are costly to run, are not distance-aware, and fail to generalize under out-of-distribution data. We introduce two distance-aware uncertainty methods for deterministic physics-guided neural networks: PG-SNGP, based on Spectral Normalization Gaussian Process, and PG-SNER, based on Deep Evidential Regression. We apply spectral normalization to the hidden layers so the network preserves distances from input to latent space. PG-SNGP replaces the final dense layer with a Gaussian Process layer for distance-sensitive uncertainty, while PG-SNER outputs Normal Inverse Gamma parameters to model uncertainty in a coherent probabilistic form. We assess performance using standard accuracy metrics and a new distance-aware metric based on the Pearson Correlation Coefficient, which measures how well predicted uncertainty tracks the distance between test and training samples. We also design a dynamic weighting scheme in the loss to balance data fidelity and physical consistency. We test our methods on rolling-element bearing degradation using the PRONOSTIA dataset and compare them with Monte Carlo and Deep Ensemble PGNNs. Results show that PG-SNGP and PG-SNER improve prediction accuracy, generalize reliably under OOD conditions, and remain robust to adversarial attacks and noise.
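
As one hedged illustration of the PG-SNER side, the sketch below applies spectral normalization to hidden layers and outputs the four Normal-Inverse-Gamma parameters, from which aleatoric and epistemic variances can be read off. The architecture, constraints, and the physics-guided loss terms are illustrative assumptions.

```python
# Evidential-regression head with spectrally normalized hidden layers (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidentialHead(nn.Module):
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(d_in, d_hidden)), nn.ReLU(),
            nn.utils.spectral_norm(nn.Linear(d_hidden, d_hidden)), nn.ReLU())
        self.out = nn.Linear(d_hidden, 4)

    def forward(self, x):
        g, v, a, b = self.out(self.body(x)).unbind(dim=-1)
        return g, F.softplus(v), F.softplus(a) + 1.0, F.softplus(b)  # nu>0, alpha>1, beta>0

mu, nu, alpha, beta = EvidentialHead(d_in=8)(torch.randn(32, 8))
aleatoric = beta / (alpha - 1.0)          # expected noise variance
epistemic = beta / (nu * (alpha - 1.0))   # variance of the predicted mean
```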

[357] Optimal Perturbation Budget Allocation for Data Poisoning in Offline Reinforcement Learning

Junnan Qiu, Jie Li

Main category: cs.LG

TL;DR: Proposes a novel Global Budget Allocation attack strategy for offline RL that allocates perturbation budget proportionally to TD-error sensitivity, achieving high performance degradation while evading detection.

DetailsMotivation: Existing offline RL data poisoning attacks use locally uniform perturbations that waste budget on low-impact samples and lack stealthiness due to statistical deviations.

Method: Formulates attack as global resource allocation problem using TD-error sensitivity theory. Derives closed-form solution where perturbation magnitudes are proportional to TD-error sensitivity under global L2 constraint.

Result: Achieves up to 80% performance degradation on D4RL benchmarks with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.

Conclusion: The Global Budget Allocation strategy is more efficient and stealthy than baseline approaches, demonstrating the vulnerability of offline RL to targeted data poisoning attacks.

Abstract: Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to data poisoning attacks. Existing attack strategies typically rely on locally uniform perturbations, which treat all samples indiscriminately. This approach is inefficient, as it wastes the perturbation budget on low-impact samples, and lacks stealthiness due to significant statistical deviations. In this paper, we propose a novel Global Budget Allocation attack strategy. Leveraging the theoretical insight that a sample’s influence on value function convergence is proportional to its Temporal Difference (TD) error, we formulate the attack as a global resource allocation problem. We derive a closed-form solution where perturbation magnitudes are assigned proportional to the TD-error sensitivity under a global L2 constraint. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms baseline strategies, achieving up to 80% performance degradation with minimal perturbations that evade detection by state-of-the-art statistical and spectral defenses.
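
The allocation rule described in the abstract reduces to a simple rescaling, sketched below: per-sample perturbation magnitudes are set proportional to TD-error magnitude and scaled so the global L2 budget is met exactly. How each perturbation is then realized on a given transition is attack-specific and not shown.

```python
# Budget allocation proportional to TD-error magnitude under a global L2 constraint.
import numpy as np

def allocate_budget(td_errors, total_l2_budget):
    """Return per-sample perturbation norms eps_i with sqrt(sum eps_i^2) = budget."""
    sens = np.abs(td_errors)
    return total_l2_budget * sens / (np.linalg.norm(sens) + 1e-12)

td = np.array([0.05, 0.8, 0.1, 2.3, 0.02])
eps = allocate_budget(td, total_l2_budget=1.0)
print(eps, np.linalg.norm(eps))  # budget concentrates on high-TD-error samples; total norm = 1.0
```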

[358] A Hybrid Model for Stock Market Forecasting: Integrating News Sentiment and Time Series Data with Graph Neural Networks

Nader Sadek, Mirette Moawad, Christina Naguib, Mariam Elzahaby

Main category: cs.LG

TL;DR: Multimodal GNN combining news articles with historical stock data outperforms LSTM baseline for stock market prediction, with headlines showing stronger predictive signals than full articles.

DetailsMotivation: Stock market prediction is challenging but crucial for investment decisions. While traditional models use historical prices, financial news provides valuable external signals that could improve prediction accuracy.

Method: Proposes a multimodal Graph Neural Network (GNN) approach that integrates company news articles with historical stock data. Historical data is encoded using LSTM, news titles are embedded with a language model. These embeddings form nodes in a heterogeneous graph, and GraphSAGE captures interactions between articles, companies, and industries. Compared against LSTM baseline.

Result: GNN outperforms LSTM baseline, achieving 53% accuracy on binary direction-of-change prediction and 4% precision gain on significance-based labeling. Companies with more associated news yield higher prediction accuracy. Headlines contain stronger predictive signals than full articles.

Conclusion: Multimodal integration of news and historical data improves stock prediction. Graph-based approaches effectively capture complex relationships between news, companies, and industries. Concise news summaries (headlines) play important role in short-term market reactions.

Abstract: Stock market prediction is a long-standing challenge in finance, as accurate forecasts support informed investment decisions. Traditional models rely mainly on historical prices, but recent work shows that financial news can provide useful external signals. This paper investigates a multimodal approach that integrates companies’ news articles with their historical stock data to improve prediction performance. We compare a Graph Neural Network (GNN) model with a baseline LSTM model. Historical data for each company is encoded using an LSTM, while news titles are embedded with a language model. These embeddings form nodes in a heterogeneous graph, and GraphSAGE is used to capture interactions between articles, companies, and industries. We evaluate two targets: a binary direction-of-change label and a significance-based label. Experiments on the US equities and Bloomberg datasets show that the GNN outperforms the LSTM baseline, achieving 53% accuracy on the first target and a 4% precision gain on the second. Results also indicate that companies with more associated news yield higher prediction accuracy. Moreover, headlines contain stronger predictive signals than full articles, suggesting that concise news summaries play an important role in short-term market reactions.

[359] Long-Sequence LSTM Modeling for NBA Game Outcome Prediction Using a Novel Multi-Season Dataset

Charles Rios, Longzhen Han, Almas Baimagambetov, Nikolaos Polatidis

Main category: cs.LG

TL;DR: LSTM model with 8-season sequence length achieves 72.35% accuracy for NBA game prediction, outperforming traditional ML/DL baselines using new longitudinal dataset.

DetailsMotivation: Existing NBA prediction models struggle with concept drift, limited temporal context, and instability across seasons, creating need for better forecasting methods for coaching strategy, fan engagement, and sports betting.

Method: Constructed new longitudinal NBA dataset (2004-05 to 2024-25 seasons) and developed LSTM architecture with extended sequence length of 9,840 games (8 full seasons) to capture long-term performance trends and season dependencies.

Result: LSTM achieved best performance: 72.35% accuracy, 73.15% precision, 76.13% AUC-ROC, outperforming Logistic Regression, Random Forest, MLP, and CNN baselines.

Conclusion: Long-sequence temporal modeling is crucial for basketball outcome prediction, and the new multi-season dataset enables development of robust, generalizable NBA forecasting systems.

Abstract: Predicting the outcomes of professional basketball games, particularly in the National Basketball Association (NBA), has become increasingly important for coaching strategy, fan engagement, and sports betting. However, many existing prediction models struggle with concept drift, limited temporal context, and instability across seasons. To advance forecasting in this domain, we introduce a newly constructed longitudinal NBA dataset covering the 2004-05 to 2024-25 seasons and present a deep learning framework designed to model long-term performance trends. Our primary contribution is a Long Short-Term Memory (LSTM) architecture that leverages an extended sequence length of 9,840 games, equivalent to eight full NBA seasons, to capture evolving team dynamics and season-over-season dependencies. We compare this model against several traditional Machine Learning (ML) and Deep Learning (DL) baselines, including Logistic Regression, Random Forest, Multi-Layer Perceptron (MLP), and Convolutional Neural Network (CNN). The LSTM achieves the best performance across all metrics, with 72.35% accuracy, 73.15% precision, and 76.13% AUC-ROC. These results demonstrate the importance of long-sequence temporal modeling in basketball outcome prediction and highlight the value of our new multi-season dataset for developing robust, generalizable NBA forecasting systems.

[360] DS FedProxGrad: Asymptotic Stationarity Without Noise Floor in Fair Federated Learning

Huzaifa Arif

Main category: cs.LG

TL;DR: Improved asymptotic convergence analysis for FedProxGrad in group fair federated learning, showing convergence to exact stationarity without variance-induced noise floor.

DetailsMotivation: Previous FedProxGrad analysis only showed convergence to a noise-dominated neighborhood with explicit dependence on variance-induced noise floor. This work aims to provide improved asymptotic convergence guarantees.

Method: Introduces DS FedProxGrad (Decay Step Size FedProxGrad) - a generalized analytical framework with inexact local proximal solutions, explicit fairness regularization, and Robbins-Monro step-size schedule with mild decay condition on local inexactness.

Result: Proves that liminf_{r→∞} 𝔼[‖∇F(x^r)‖²] = 0, meaning the algorithm is asymptotically stationary and convergence rate does not depend on variance-induced noise floor.

Conclusion: The improved analysis shows FedProxGrad can achieve exact asymptotic stationarity under appropriate step-size scheduling and local inexactness control, overcoming previous limitations of noise-dominated convergence.

Abstract: Recent work \cite{arifgroup} introduced Federated Proximal Gradient (FedProxGrad) for solving non-convex composite optimization problems in group fair federated learning. However, the original analysis established convergence only to a noise-dominated neighborhood of stationarity, with explicit dependence on a variance-induced noise floor. In this work, we provide an improved asymptotic convergence analysis for a generalized FedProxGrad-type analytical framework with inexact local proximal solutions and explicit fairness regularization. We call this extended analytical framework DS FedProxGrad (Decay Step Size FedProxGrad). Under a Robbins-Monro step-size schedule \cite{robbins1951stochastic} and a mild decay condition on local inexactness, we prove that $\liminf_{r\to\infty} \mathbb{E}[\|\nabla F(\mathbf{x}^r)\|^2] = 0$, i.e., the algorithm is asymptotically stationary and the convergence rate does not depend on a variance-induced noise floor.
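
For reference, a standard Robbins-Monro-style schedule of the kind the analysis relies on is shown below: step sizes whose sum diverges while the sum of their squares converges. The constants here are illustrative, not the paper's.

```python
# A standard Robbins-Monro-style step-size schedule (illustrative constants):
# eta_r = eta0 / (1 + r)**p with 0.5 < p <= 1 gives sum(eta_r) = inf, sum(eta_r**2) < inf.
def robbins_monro_lr(round_idx, eta0=0.1, p=0.75):
    return eta0 / (1.0 + round_idx) ** p

lrs = [robbins_monro_lr(r) for r in range(5)]
print(lrs)  # monotonically decaying step sizes starting at 0.1
```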

[361] An Additive Manufacturing Part Qualification Framework: Transferring Knowledge of Stress-strain Behaviors from Additively Manufactured Polymers to Metals

Chenglong Duan, Dazhong Wu

Main category: cs.LG

TL;DR: A DTW-transfer learning framework for AM part qualification that transfers stress-strain behavior knowledge from polymers to metals, achieving better prediction accuracy than baseline models.

DetailsMotivation: Part qualification is critical in additive manufacturing to ensure reliability in critical applications. Predicting complex stress-strain behaviors of AM parts is essential for qualification, but building separate models for different materials is inefficient.

Method: Developed a DTW-transfer learning framework that uses dynamic time warping to select the most relevant polymer dataset as source domain, then transfers knowledge to predict metal stress-strain behaviors using LSTM models. Tested with four polymers (Nylon, PLA, CF-ABS, Resin) and three metals (AlSi10Mg, Ti6Al4V, carbon steel) fabricated by different AM techniques.

Result: DTW-TL framework successfully identified closest polymer-metal matches and achieved lowest mean absolute percentage error (12.41%) and highest coefficient of determination (0.96) when three metals were used as target domain, outperforming vanilla LSTM and TL models pre-trained on all four polymer datasets.

Conclusion: The DTW-TL framework effectively enables knowledge transfer from low-cost polymers to metals for AM part qualification, improving prediction accuracy of stress-strain behaviors and demonstrating practical value for manufacturing applications.

Abstract: Part qualification is crucial in additive manufacturing (AM) because it ensures that additively manufactured parts can be consistently produced and reliably used in critical applications. Part qualification aims at verifying that an additively manufactured part meets performance requirements; therefore, predicting the complex stress-strain behaviors of additively manufactured parts is critical. We develop a dynamic time warping (DTW)-transfer learning (TL) framework for additive manufacturing part qualification by transferring knowledge of the stress-strain behaviors of additively manufactured low-cost polymers to metals. Specifically, the framework employs DTW to select a polymer dataset as the source domain that is the most relevant to the target metal dataset. Using a long short-term memory (LSTM) model, four source polymers (i.e., Nylon, PLA, CF-ABS, and Resin) and three target metals (i.e., AlSi10Mg, Ti6Al4V, and carbon steel) that are fabricated by different AM techniques are utilized to demonstrate the effectiveness of the DTW-TL framework. Experimental results show that the DTW-TL framework identifies the closest match between polymers and metals to select one single polymer dataset as the source domain. The DTW-TL model achieves the lowest mean absolute percentage error of 12.41% and highest coefficient of determination of 0.96 when three metals are used as the target domain, respectively, outperforming the vanilla LSTM model without TL as well as the TL model pre-trained on four polymer datasets as the source domain.
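
The source-selection step lends itself to a small sketch: compute the DTW distance between each candidate polymer stress-strain curve and the target metal curve, and pick the closest polymer as the transfer-learning source. The curves below are mock 1-D sequences, and the LSTM transfer step itself is omitted.

```python
# DTW-based source selection sketch; curves are synthetic stand-ins.
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping between 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

polymers = {"Nylon": np.linspace(0, 50, 80), "PLA": np.linspace(0, 60, 80) ** 1.1,
            "CF-ABS": np.sqrt(np.linspace(0, 3000, 80)), "Resin": np.linspace(0, 30, 80)}
target_metal = np.sqrt(np.linspace(0, 2500, 100))          # mock metal stress-strain curve
source = min(polymers, key=lambda k: dtw_distance(polymers[k], target_metal))
print("Selected source polymer:", source)
```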

[362] Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis

Main category: cs.LG

TL;DR: BGPS automatically generates prompts to uncover social biases in text-to-image models by using LLMs guided by attribute classifiers, revealing subtle undocumented biases even in debiased models.

DetailsMotivation: Existing bias mitigation approaches rely on curated prompt datasets which are costly and risk missing unanticipated prompts that trigger biased generation, even in debiased models.

Method: BGPS uses an LLM to produce attribute-neutral prompts, guided by attribute classifiers that steer the decoding process toward prompts that amplify image attributes of interest, maximizing bias presence.

Result: Discovered subtle and previously undocumented biases that severely deteriorate fairness metrics in Stable Diffusion 1.5 and state-of-the-art debiased models, with interpretable prompts that improve perplexity compared to hard prompt optimization.

Conclusion: BGPS uncovers TTI vulnerabilities, expands bias search space, and can serve as a new evaluation tool for bias mitigation by generating interpretable prompts that reveal hidden biases.

Abstract: Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Beside the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI’s internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.

[363] Neural Ordinary Differential Equations for Simulating Metabolic Pathway Dynamics from Time-Series Multiomics Data

Udesh Habaraduwa, Andrei Lixandru

Main category: cs.LG

TL;DR: NODEs outperform traditional ML in modeling proteome-metabolome dynamics, achieving >90% RMSE improvement and 1000x faster inference for metabolic engineering applications.

DetailsMotivation: There's a bottleneck in converting abundant multiomics data into actionable predictive models for human healthspan and bioengineering. While high-throughput data is available, current approaches struggle to create dynamic simulation systems that can infer latent interactions directly from observational data for personalized medicine and synthetic biology applications.

Method: The authors introduce Neural Ordinary Differential Equations (NODEs) as a dynamic framework for learning complex proteome-metabolome interactions. They apply this framework to time-series data from engineered E. coli strains to model continuous dynamics of metabolic pathways, comparing performance against traditional machine learning pipelines.

Result: NODE architecture shows superior performance with >90% improvement in root mean squared error over baselines: 94.38% improvement for Limonene pathway and 97.65% improvement for Isopentenol pathway datasets. Additionally, NODE models achieve 1000x acceleration in inference time.

Conclusion: NODEs establish themselves as scalable, high-fidelity tools for next-generation metabolic engineering and biological discovery, capable of capturing complex biological system dynamics more accurately and efficiently than traditional approaches.

Abstract: The advancement of human healthspan and bioengineering relies heavily on predicting the behavior of complex biological systems. While high-throughput multiomics data is becoming increasingly abundant, converting this data into actionable predictive models remains a bottleneck. High-capacity, data-driven simulation systems are critical in this landscape; unlike classical mechanistic models restricted by prior knowledge, these architectures can infer latent interactions directly from observational data, allowing for the simulation of temporal trajectories and the anticipation of downstream intervention effects in personalized medicine and synthetic biology. To address this challenge, we introduce Neural Ordinary Differential Equations (NODEs) as a dynamic framework for learning the complex interplay between the proteome and metabolome. We applied this framework to time-series data derived from engineered Escherichia coli strains, modeling the continuous dynamics of metabolic pathways. The proposed NODE architecture demonstrates superior performance in capturing system dynamics compared to traditional machine learning pipelines. Our results show a greater than 90% improvement in root mean squared error over baselines across both Limonene (up to 94.38% improvement) and Isopentenol (up to 97.65% improvement) pathway datasets. Furthermore, the NODE models demonstrated a 1000x acceleration in inference time, establishing them as a scalable, high-fidelity tool for the next generation of metabolic engineering and biological discovery.
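
A minimal neural-ODE sketch of the core idea follows: a small MLP defines the time derivative of a joint protein-plus-metabolite state vector, and trajectories are obtained by numerical integration. The dimensions, the fixed-step Euler solver, and the placeholder loss are illustrative assumptions, not the paper's setup.

```python
# Minimal neural ODE over a joint pathway state; solver and sizes are illustrative.
import torch
import torch.nn as nn

class PathwayODE(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, state_dim))

    def forward(self, t, x):          # dx/dt (time-invariant dynamics here)
        return self.f(x)

def integrate(ode, x0, t_grid):
    """Fixed-step Euler integration; returns states at each time in t_grid."""
    xs, x = [x0], x0
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        x = x + (t1 - t0) * ode(t0, x)
        xs.append(x)
    return torch.stack(xs)

state_dim = 12                                     # e.g. proteins + metabolites combined
ode = PathwayODE(state_dim)
x0 = torch.randn(1, state_dim)
traj = integrate(ode, x0, torch.linspace(0.0, 10.0, 25))
loss = ((traj - torch.zeros_like(traj)) ** 2).mean()   # placeholder fit-to-data loss
loss.backward()
```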

[364] Learning and Editing Universal Graph Prompt Tuning via Reinforcement Learning

Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, Yijie Li, Edith C. H. Ngai

Main category: cs.LG

TL;DR: LEAP introduces a universal graph prompt tuning method that maintains theoretical universality by adding prompts to all nodes, using actor-critic RL to selectively edit prompts for better performance.

DetailsMotivation: Existing universal graph prompt tuning methods compromise theoretical foundations when using selective node-based approaches. The paper aims to strengthen the theoretical foundation while pursuing more ideal prompts.

Method: LEAP (Learning and Editing Universal Graph Prompt Tuning) first builds basic universal graph prompts for all nodes to preserve theoretical foundation, then uses actor-critic reinforcement learning to select nodes and edit prompts.

Result: Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show LEAP consistently outperforms fine-tuning and other prompt-based approaches.

Conclusion: LEAP successfully preserves the theoretical foundation of universal graph prompt tuning while achieving superior performance through selective prompt editing, demonstrating that adding prompts to all nodes is necessary for universality.

Abstract: Early graph prompt tuning approaches relied on task-specific designs for Graph Neural Networks (GNNs), limiting their adaptability across diverse pre-training strategies. In contrast, another promising line of research has investigated universal graph prompt tuning, which operates directly in the input graph’s feature space and establishes a theoretical foundation showing that universal graph prompt tuning can achieve an effect equivalent to that of any prompting function, eliminating dependence on specific pre-training strategies. Recent works propose selective node-based graph prompt tuning to pursue more ideal prompts. However, we argue that selective node-based graph prompt tuning inevitably compromises the theoretical foundation of universal graph prompt tuning. In this paper, we strengthen the theoretical foundation of universal graph prompt tuning by introducing stricter constraints, demonstrating that adding prompts to all nodes is a necessary condition for achieving the universality of graph prompts. To this end, we propose a novel model and paradigm, Learning and Editing Universal GrAph Prompt Tuning (LEAP), which preserves the theoretical foundation of universal graph prompt tuning while pursuing more ideal prompts. Specifically, we first build the basic universal graph prompts to preserve the theoretical foundation and then employ actor-critic reinforcement learning to select nodes and edit prompts. Extensive experiments on graph- and node-level tasks across various pre-training strategies in both full-shot and few-shot scenarios show that LEAP consistently outperforms fine-tuning and other prompt-based approaches.
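
The "prompts on all nodes" requirement can be illustrated with a tiny prompt-tuning loop: a single learnable vector is added to every node's features before a frozen, pre-trained GNN. The encoder interface and attribute names below are assumptions, and the paper's actor-critic node selection and prompt editing stage is not shown.

```python
# Sketch of a universal graph prompt: one shared learnable vector added to every
# node's input features, tuned while the pre-trained GNN stays frozen.
import torch
import torch.nn as nn

class UniversalGraphPrompt(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.prompt = nn.Parameter(torch.zeros(feat_dim))  # applied to all nodes

    def forward(self, x):                 # x: (num_nodes, feat_dim)
        return x + self.prompt

def tune_prompt(frozen_gnn, x, edge_index, labels, epochs=100, lr=1e-2):
    # Only the prompt and a small head are trained; `frozen_gnn(x, edge_index)` and
    # `frozen_gnn.out_dim` are assumed interfaces of the pre-trained encoder.
    for p in frozen_gnn.parameters():
        p.requires_grad_(False)
    prompt = UniversalGraphPrompt(x.size(1))
    head = nn.Linear(frozen_gnn.out_dim, int(labels.max()) + 1)
    opt = torch.optim.Adam(list(prompt.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):
        z = frozen_gnn(prompt(x), edge_index)
        loss = nn.functional.cross_entropy(head(z), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return prompt, head
```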

[365] De novo generation of functional terpene synthases using TpsGPT

Hamsini Ramanathan, Roman Bushuiev, Matouš Soldát, Jirí Kohout, Téo Hebra, Joshua David Smith, Josef Sivic, Tomáš Pluskal

Main category: cs.LG

TL;DR: TpsGPT is a generative AI model that designs functional terpene synthase enzymes by fine-tuning ProtGPT2 on TPS sequences, generating novel enzymes that were experimentally validated.

DetailsMotivation: Terpene synthases (TPS) are crucial for producing natural products like anticancer drugs, but traditional directed evolution methods for TPS design are expensive and slow. There's a need for scalable computational approaches to generate novel functional enzymes.

Method: Fine-tuned the protein language model ProtGPT2 on 79k TPS sequences from UniProt to create TpsGPT. Generated 28k candidate sequences and filtered them using multiple validation metrics: EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment.

Result: From 28k generated sequences, identified 7 putative TPS enzymes meeting all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least 2 of these sequences, demonstrating functional enzyme generation.

Conclusion: Fine-tuning protein language models on enzyme-class-specific datasets with rigorous filtering enables de novo generation of functional, evolutionarily distant enzymes, providing a scalable alternative to traditional directed evolution methods.

Abstract: Terpene synthases (TPS) are a key family of enzymes responsible for generating the diverse terpene scaffolds that underpin many natural products, including front-line anticancer drugs such as Taxol. However, de novo TPS design through directed evolution is costly and slow. We introduce TpsGPT, a generative model for scalable TPS protein design, built by fine-tuning the protein language model ProtGPT2 on 79k TPS sequences mined from UniProt. TpsGPT generated de novo enzyme candidates in silico and we evaluated them using multiple validation metrics, including EnzymeExplorer classification, ESMFold structural confidence (pLDDT), sequence diversity, CLEAN classification, InterPro domain detection, and Foldseek structure alignment. From an initial pool of 28k generated sequences, we identified seven putative TPS enzymes that satisfied all validation criteria. Experimental validation confirmed TPS enzymatic activity in at least two of these sequences. Our results show that fine-tuning of a protein language model on a carefully curated, enzyme-class-specific dataset, combined with rigorous filtering, can enable the de novo generation of functional, evolutionarily distant enzymes.
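
A hedged sketch of the generate-then-filter loop using the Hugging Face transformers API. The model id, sampling settings, and the trivial placeholder filters are assumptions; the paper's actual filters (EnzymeExplorer, ESMFold pLDDT, CLEAN, InterPro, Foldseek) are external tools not reproduced here.

```python
# Sketch: sample candidate protein sequences from a causal protein language model,
# then keep only candidates passing validation filters. Model id and thresholds are
# assumptions; the real pipeline fine-tunes on 79k TPS sequences first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nferruz/ProtGPT2"  # assumed base model id
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id)

outputs = lm.generate(
    tok("M", return_tensors="pt").input_ids,   # seed token; purely illustrative
    do_sample=True, top_k=950, max_length=400,
    num_return_sequences=16, pad_token_id=tok.eos_token_id,
)
candidates = [tok.decode(o, skip_special_tokens=True).replace("\n", "") for o in outputs]

def passes_filters(seq):
    # Placeholder sanity checks standing in for the paper's validation metrics.
    return 200 <= len(seq) <= 800 and set(seq) <= set("ACDEFGHIKLMNPQRSTVWY")

putative_tps = [s for s in candidates if passes_filters(s)]
```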

[366] Can TabPFN Compete with GNNs for Node Classification via Graph Tabularization?

Jeongwhan Choi, Woosung Kang, Minseo Kim, Jongwoo Kim, Noseong Park

Main category: cs.LG

TL;DR: TabPFN-GN transforms graph node classification into tabular learning by extracting node features, enabling competitive performance with GNNs without graph-specific training.

DetailsMotivation: To investigate whether graph node classification can be effectively reformulated as a tabular learning problem, building on the success of TabPFN for tabular data and its extension to time series.

Method: Transform graph data into tabular features by extracting node attributes, structural properties, positional encodings, and optionally smoothed neighborhood features, then apply TabPFN for direct node classification without graph-specific training.

Result: TabPFN-GN achieves competitive performance with GNNs on homophilous graphs and consistently outperforms them on heterophilous graphs across 12 benchmark datasets.

Conclusion: Principled feature engineering can bridge the gap between tabular and graph domains, providing a practical alternative to task-specific GNN training and LLM-dependent graph foundation models.

Abstract: Foundation models pretrained on large data have demonstrated remarkable zero-shot generalization capabilities across domains. Building on the success of TabPFN for tabular data and its recent extension to time series, we investigate whether graph node classification can be effectively reformulated as a tabular learning problem. We introduce TabPFN-GN, which transforms graph data into tabular features by extracting node attributes, structural properties, positional encodings, and optionally smoothed neighborhood features. This enables TabPFN to perform direct node classification without any graph-specific training or language model dependencies. Our experiments on 12 benchmark datasets reveal that TabPFN-GN achieves competitive performance with GNNs on homophilous graphs and consistently outperforms them on heterophilous graphs. These results demonstrate that principled feature engineering can bridge the gap between tabular and graph domains, providing a practical alternative to task-specific GNN training and LLM-dependent graph foundation models.
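
The tabularization step can be sketched as plain feature engineering: concatenate raw attributes, degree, Laplacian-eigenvector positional encodings, and mean-neighbor-smoothed features, then hand the table to a TabPFN classifier. The feature recipe and the `tabpfn` package usage below loosely follow the description above and are assumptions, not the authors' exact implementation.

```python
# Sketch of graph tabularization for node classification.
import numpy as np
from tabpfn import TabPFNClassifier   # assumes the `tabpfn` package is installed

def tabularize(adj, x, k_pos=8):
    # adj: (n, n) 0/1 adjacency matrix (numpy); x: (n, d) node attribute matrix
    deg = adj.sum(axis=1, keepdims=True)
    smoothed = (adj @ x) / np.clip(deg, 1, None)        # mean of neighbor features
    lap = np.diag(deg.ravel()) - adj                    # unnormalized graph Laplacian
    _, eigvec = np.linalg.eigh(lap)
    pos_enc = eigvec[:, 1:k_pos + 1]                    # skip the constant eigenvector
    return np.hstack([x, deg, smoothed, pos_enc])

# Usage over the nodes of one graph (train/test index split assumed):
# feats = tabularize(adj, x)
# clf = TabPFNClassifier()
# clf.fit(feats[train_idx], y[train_idx])
# y_pred = clf.predict(feats[test_idx])
```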

[367] Identifying counterfactual probabilities using bivariate distributions and uplift modeling

Théo Verhelst, Gianluca Bontempi

Main category: cs.LG

TL;DR: Proposes a method to estimate joint counterfactual distributions using uplift models, fitting bivariate beta distributions to uplift scores for richer causal insights.

DetailsMotivation: While uplift modeling estimates causal effects (treatment-control differences), it doesn't provide the full joint distribution of potential outcomes needed for richer counterfactual questions like "Would this customer still have churned had we given them a marketing offer?"

Method: Proposes a counterfactual estimator that fits a bivariate beta distribution to predicted uplift scores, yielding posterior distributions over counterfactual outcomes. The approach leverages uplift models and requires no additional causal assumptions beyond those already made for uplift modeling.

Result: Simulations demonstrate the efficacy of the approach. Applied to telecom customer churn, it reveals insights unavailable to standard machine learning or uplift models alone.

Conclusion: Uplift modeling and counterfactual identification are synergistic - uplift models can be leveraged for counterfactual estimation, providing richer information about potential outcomes while maintaining the same causal assumptions.

Abstract: Uplift modeling estimates the causal effect of an intervention as the difference between potential outcomes under treatment and control, whereas counterfactual identification aims to recover the joint distribution of these potential outcomes (e.g., “Would this customer still have churned had we given them a marketing offer?”). This joint counterfactual distribution provides richer information than the uplift but is harder to estimate. However, the two approaches are synergistic: uplift models can be leveraged for counterfactual estimation. We propose a counterfactual estimator that fits a bivariate beta distribution to predicted uplift scores, yielding posterior distributions over counterfactual outcomes. Our approach requires no causal assumptions beyond those of uplift modeling. Simulations show the efficacy of the approach, which can be applied, for example, to the problem of customer churn in telecom, where it reveals insights unavailable to standard ML or uplift models alone.

[368] Forecasting Fails: Unveiling Evasion Attacks in Weather Prediction Models

Huzaifa Arif, Pin-Yu Chen, Alex Gittens, James Diffenderfer, Bhavya Kailkhura

Main category: cs.LG

TL;DR: WAAPO framework generates targeted adversarial perturbations for AI weather forecasting models that are effective yet stealthy, revealing critical vulnerabilities in operational systems.

DetailsMotivation: As AI models become increasingly relied upon for weather forecasting, there's a critical need to evaluate their vulnerability to adversarial attacks that could manipulate forecasts while avoiding detection.

Method: Weather Adaptive Adversarial Perturbation Optimization (WAAPO) framework incorporates constraints for channel sparsity, spatial localization, and smoothness to generate physically realistic and imperceptible perturbations. Tested using ERA5 dataset and FourCastNet model.

Result: WAAPO successfully generates adversarial trajectories that closely align with predefined targets even under constrained conditions, demonstrating that small perturbations to initial conditions can cause significant deviations in predicted weather patterns.

Conclusion: The findings reveal critical vulnerabilities in AI-driven forecasting models and underscore the urgent need for robust safeguards to protect operational forecasting systems against adversarial exploitation.

Abstract: With the increasing reliance on AI models for weather forecasting, it is imperative to evaluate their vulnerability to adversarial perturbations. This work introduces Weather Adaptive Adversarial Perturbation Optimization (WAAPO), a novel framework for generating targeted adversarial perturbations that are both effective in manipulating forecasts and stealthy to avoid detection. WAAPO achieves this by incorporating constraints for channel sparsity, spatial localization, and smoothness, ensuring that perturbations remain physically realistic and imperceptible. Using the ERA5 dataset and FourCastNet (Pathak et al. 2022), we demonstrate WAAPO’s ability to generate adversarial trajectories that align closely with predefined targets, even under constrained conditions. Our experiments highlight critical vulnerabilities in AI-driven forecasting models, where small perturbations to initial conditions can result in significant deviations in predicted weather patterns. These findings underscore the need for robust safeguards to protect against adversarial exploitation in operational forecasting systems.
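
A rough sketch of the kind of constrained perturbation search described above: gradient-based optimization of an initial-condition perturbation toward a target forecast, with L1 (sparsity) and total-variation (smoothness) penalties plus a norm clamp. The differentiable `forecast_model`, penalty weights, and budget are assumptions, not the WAAPO formulation itself.

```python
# Sketch of a targeted initial-condition perturbation search with sparsity and
# smoothness penalties. `forecast_model` is an assumed differentiable forecaster
# (e.g., a FourCastNet-like network); weights and the budget eps are placeholders.
import torch

def attack(forecast_model, x0, target, steps=200, lr=1e-3,
           lam_l1=1e-3, lam_tv=1e-3, eps=0.05):
    delta = torch.zeros_like(x0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        pred = forecast_model(x0 + delta)              # forecast from perturbed state
        fit = torch.mean((pred - target) ** 2)         # pull forecast toward the target
        l1 = delta.abs().mean()                        # sparsity of the perturbation
        tv = (delta[..., 1:, :] - delta[..., :-1, :]).abs().mean() + \
             (delta[..., :, 1:] - delta[..., :, :-1]).abs().mean()  # spatial smoothness
        loss = fit + lam_l1 * l1 + lam_tv * tv
        opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                    # keep the perturbation small
    return delta.detach()
```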

[369] Reinforcement Learning From State and Temporal Differences

Lex Weaver, Jonathan Baxter

Main category: cs.LG

TL;DR: STD(λ) is a modified TD(λ) algorithm that focuses on relative state value ordering rather than absolute value accuracy, addressing TD(λ)’s tendency to converge to suboptimal policies despite accurate value approximation.

DetailsMotivation: TD(λ) with function approximation minimizes squared error between approximate and true state values, but for policy optimization, the relative ordering of states is more critical than absolute values. The authors show TD(λ) can converge to suboptimal policies even when starting from optimal policies, demonstrating this in simple systems and backgammon.

Method: The authors propose STD(λ), a modified form of TD(λ) where function approximators are trained with respect to relative state values on binary decision problems. The method includes theoretical analysis with proof of monotonic policy improvement for the two-state system, comparison with Bertsekas’ differential training method, and empirical demonstrations.

Result: STD(λ) successfully addresses the limitations of TD(λ), with theoretical guarantees of monotonic policy improvement in the two-state system and successful empirical demonstrations on both the two-state system and a variation of the acrobot problem.

Conclusion: Focusing on relative state value ordering rather than absolute value accuracy is crucial for policy optimization. STD(λ) provides a principled approach to address TD(λ)’s limitations and achieves better policy performance while maintaining theoretical guarantees.

Abstract: TD($\lambda$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($\lambda$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($\lambda$), starting from an optimal policy, converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($\lambda$), called STD($\lambda$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($\lambda$) in the context of the two-state system, is presented, along with a comparison with Bertsekas’ differential training method [1]. This is followed by successful demonstrations of STD($\lambda$) on the two-state system and a variation on the well-known acrobot problem.

[370] Refining Diffusion Models for Motion Synthesis with an Acceleration Loss to Generate Realistic IMU Data

Lars Ole Häusler, Lena Uhlenberg, Göran Köber, Diyora Salimova, Oliver Amft

Main category: cs.LG

TL;DR: Text-to-IMU motion synthesis framework using diffusion model fine-tuning with acceleration-based loss improves IMU data realism and downstream HAR performance.

DetailsMotivation: To generate realistic IMU data from text descriptions by aligning diffusion-based motion generation with IMU-specific acceleration patterns, enabling better synthetic data for sensor-specific applications like Human Activity Recognition.

Method: Fine-tune pretrained diffusion model with acceleration-based second-order loss (L_acc) that enforces consistency in discrete second-order temporal differences, integrate into existing text-to-IMU framework with surface modeling and virtual sensor simulation.

Result: L_acc decreased by 12.7% relative to original model, with larger improvements in high-dynamic activities; synthetic IMU data distribution shifts closer to real recordings; HAR classification improved by 8.7% over earlier diffusion model and 7.6% over best comparison model.

Conclusion: Acceleration-aware diffusion refinement effectively aligns motion generation with IMU synthesis, demonstrating flexible deep learning pipelines for specializing generic text-to-motion priors to sensor-specific tasks.

Abstract: We propose a text-to-IMU (inertial measurement unit) motion-synthesis framework to obtain realistic IMU data by fine-tuning a pretrained diffusion model with an acceleration-based second-order loss (L_acc). L_acc enforces consistency in the discrete second-order temporal differences of the generated motion, thereby aligning the diffusion prior with IMU-specific acceleration patterns. We integrate L_acc into the training objective of an existing diffusion model, fine-tune the model to obtain an IMU-specific motion prior, and evaluate the model with an existing text-to-IMU framework that comprises surface modelling and virtual sensor simulation. We analysed acceleration signal fidelity and differences between synthetic motion representation and actual IMU recordings. As a downstream application, we evaluated Human Activity Recognition (HAR) and compared the classification performance using data from our method with the earlier diffusion model and two additional diffusion model baselines. When we augmented the earlier diffusion model objective with L_acc and continued training, L_acc decreased by 12.7% relative to the original model. The improvements were considerably larger in high-dynamic activities (i.e., running, jumping) compared to low-dynamic activities (i.e., sitting, standing). In a low-dimensional embedding, the synthetic IMU data produced by our refined model shifts closer to the distribution of real IMU recordings. HAR classification trained exclusively on our refined synthetic IMU data improved performance by 8.7% compared to the earlier diffusion model and by 7.6% over the best-performing comparison diffusion model. We conclude that acceleration-aware diffusion refinement provides an effective approach to align motion generation and IMU synthesis and highlights how flexible deep learning pipelines are for specialising generic text-to-motion priors to sensor-specific tasks.
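
The second-order loss is easy to state in code: penalize mismatch of discrete second differences between generated and reference motion. The tensor layout and the weighting of the term below are assumptions.

```python
# Second-order (acceleration-style) consistency loss between generated and
# reference motion; tensor layout (batch, time, dims) is an assumption.
import torch

def second_diff(x):
    return x[:, 2:, :] - 2.0 * x[:, 1:-1, :] + x[:, :-2, :]

def acceleration_loss(generated, reference):
    return torch.mean((second_diff(generated) - second_diff(reference)) ** 2)

# Added to the usual diffusion objective, e.g.:
#   loss = diffusion_loss + lambda_acc * acceleration_loss(x_hat, x_ref)
```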

[371] Differentially Private Synthetic Data Generation Using Context-Aware GANs

Anantaa Kotal, Anupam Joshi

Main category: cs.LG

TL;DR: ContextGAN is a privacy-preserving synthetic data generation method that incorporates domain-specific implicit rules through constraint matrices, improving realism while maintaining differential privacy.

DetailsMotivation: Traditional synthetic data methods fail to capture complex implicit domain rules (like medical prescription guidelines), leading to unrealistic or inappropriate synthetic data, especially under strict privacy regulations like GDPR and HIPAA.

Method: Context-Aware Differentially Private Generative Adversarial Network (ContextGAN) that integrates domain-specific rules through constraint matrices, with a constraint-aware discriminator evaluating synthetic data against both explicit and implicit knowledge while maintaining differential privacy.

Result: ContextGAN produces high-quality synthetic data that respects domain rules and preserves privacy across healthcare, security, and finance domains, improving realism and utility compared to traditional methods.

Conclusion: ContextGAN successfully addresses the gap in traditional synthetic data generation by enforcing domain constraints while maintaining privacy, making it suitable for applications requiring compliance with both explicit patterns and implicit rules under strict privacy guarantees.

Abstract: The widespread use of big data across sectors has raised major privacy concerns, especially when sensitive information is shared or analyzed. Regulations such as GDPR and HIPAA impose strict controls on data handling, making it difficult to balance the need for insights with privacy requirements. Synthetic data offers a promising solution by creating artificial datasets that reflect real patterns without exposing sensitive information. However, traditional synthetic data methods often fail to capture complex, implicit rules that link different elements of the data and are essential in domains like healthcare. They may reproduce explicit patterns but overlook domain-specific constraints that are not directly stated yet crucial for realism and utility. For example, prescription guidelines that restrict certain medications for specific conditions or prevent harmful drug interactions may not appear explicitly in the original data. Synthetic data generated without these implicit rules can lead to medically inappropriate or unrealistic profiles. To address this gap, we propose ContextGAN, a Context-Aware Differentially Private Generative Adversarial Network that integrates domain-specific rules through a constraint matrix encoding both explicit and implicit knowledge. The constraint-aware discriminator evaluates synthetic data against these rules to ensure adherence to domain constraints, while differential privacy protects sensitive details from the original data. We validate ContextGAN across healthcare, security, and finance, showing that it produces high-quality synthetic data that respects domain rules and preserves privacy. Our results demonstrate that ContextGAN improves realism and utility by enforcing domain constraints, making it suitable for applications that require compliance with both explicit patterns and implicit rules under strict privacy guarantees.

[372] Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents

Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Xiaodong Gu

Main category: cs.LG

TL;DR: Fed-SE: A federated self-evolution framework for LLM agents that enables privacy-preserving co-evolution across heterogeneous environments by addressing gradient conflicts through local trajectory filtering and global low-rank subspace aggregation.

DetailsMotivation: LLM agents need to evolve in complex interactive tasks, but privacy constraints prevent centralized optimization. Standard federated learning struggles with heterogeneous tasks and sparse trajectory-level rewards, causing gradient conflicts that destabilize global optimization.

Method: Fed-SE uses local evolution-global aggregation: 1) Locally, agents perform parameter-efficient fine-tuning on filtered high-return trajectories for stable gradient updates; 2) Globally, aggregates updates within a low-rank subspace that disentangles environment-specific dynamics to reduce negative transfer.

Result: Experiments across five heterogeneous environments show Fed-SE improves average task success rates by approximately 18% over federated baselines, demonstrating effective cross-environment knowledge transfer in privacy-constrained settings.

Conclusion: Fed-SE successfully bridges the gap between federated learning and open-ended agent self-evolution, enabling robust knowledge transfer across heterogeneous environments while preserving privacy, with significant performance improvements over existing federated approaches.

Abstract: LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. While Federated Learning (FL) has proven effective on static datasets, its extension to the open-ended self-evolution of agents remains underexplored. Directly applying standard FL is challenging: heterogeneous tasks and sparse, trajectory-level rewards introduce severe gradient conflicts, destabilizing the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents. Fed-SE establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace that disentangles environment-specific dynamics, effectively reducing negative transfer across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by approximately 18% over federated baselines, validating its effectiveness in robust cross-environment knowledge transfer in privacy-constrained deployments.
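
A simplified sketch of the two stages: local trajectory filtering by return, and a naive server-side average of low-rank (LoRA-style) deltas. The data structures and the plain averaging rule are assumptions; the paper's subspace disentanglement is not reproduced.

```python
# Sketch of the local-filter / global-aggregation loop. Each client keeps only
# high-return trajectories for parameter-efficient updates; the server averages
# the resulting low-rank deltas. Names and the averaging rule are illustrative.
import torch

def filter_trajectories(trajectories, quantile=0.8):
    # Keep trajectories whose return is at or above the given quantile.
    returns = torch.tensor([float(t["return"]) for t in trajectories])
    cutoff = torch.quantile(returns, quantile)
    return [t for t, r in zip(trajectories, returns) if r >= cutoff]

def aggregate_low_rank(client_updates):
    # client_updates: list of dicts {module_name: (A, B)} with low-rank factors,
    # where a client's weight delta for that module is B @ A.
    agg = {}
    for name in client_updates[0]:
        deltas = [B @ A for (A, B) in (u[name] for u in client_updates)]
        agg[name] = torch.stack(deltas).mean(dim=0)   # naive server average
    return agg
```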

[373] When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng

Main category: cs.LG

TL;DR: LLM-based tabular data generation methods (fine-tuning and prompting) leak privacy by reproducing memorized numeric patterns from training data, exposed by a simple membership inference attack called LevAtt.

DetailsMotivation: To systematically analyze privacy risks in LLM-based tabular data generation, as both fine-tuning and prompting approaches may compromise privacy by reproducing memorized numeric patterns from training data.

Method: Introduces LevAtt, a No-box Membership Inference Attack that targets numeric digit sequences in synthetic data, and proposes two defense methods including a novel sampling strategy that strategically perturbs digits during generation.

Result: LevAtt exposes substantial privacy leakage across various models and datasets, sometimes achieving perfect membership classification. The proposed defense methods effectively defeat attacks with minimal loss of data fidelity and utility.

Conclusion: LLM-based synthetic data generation has unique privacy vulnerabilities requiring effective defenses; strategic digit perturbation during generation can mitigate these risks while preserving data quality.

Abstract: Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
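
The attack idea can be sketched in a few lines: extract the digit string of a candidate record and score membership by its minimum Levenshtein distance to digit strings in the released synthetic table, where unusually small distances suggest memorization. Thresholding and calibration details from the paper are omitted.

```python
# Digit-string membership score: a small Levenshtein distance between a candidate
# record's digits and the synthetic table's digits suggests memorization.
def digits(record):
    # Concatenate the digit characters of all fields into one string.
    return "".join(ch for field in record for ch in str(field) if ch.isdigit())

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def membership_score(candidate, synthetic_records):
    # Lower score = more suspicious (the candidate's digits nearly reappear verbatim).
    c = digits(candidate)
    return min(levenshtein(c, digits(s)) for s in synthetic_records)
```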

[374] DAO-GP Drift Aware Online Non-Linear Regression Gaussian-Process

Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi

Main category: cs.LG

TL;DR: DAO-GP is a drift-aware online Gaussian Process model that dynamically adapts to evolving data distributions without fixed hyperparameters, using built-in drift detection and sparse memory management.

DetailsMotivation: Real-world datasets exhibit concept drift (evolving data distributions) that reduces model accuracy. Existing online GP models lack drift awareness, have fixed hyperparameters, suffer from data snooping, lack principled decay mechanisms, and are memory inefficient.

Method: Proposes DAO-GP: a fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model with built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on drift severity.

Result: DAO-GP shows robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. It demonstrates dynamic adaptation, efficient memory management, and evolving inducing points, achieving superior or competitive performance compared to state-of-the-art models.

Conclusion: DAO-GP establishes itself as a drift-resilient solution for online non-linear regression, addressing critical limitations of conventional online GP methods through its adaptive, hyperparameter-free design with built-in drift handling capabilities.

Abstract: Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model’s predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP’s robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.

[375] Explainable Anomaly Detection for Industrial IoT Data Streams

Ana Rita Paupério, Diogo Risca, Afonso Lourenço, Goreti Marreiros, Ricardo Martins

Main category: cs.LG

TL;DR: A collaborative data stream mining framework combining unsupervised anomaly detection with human-in-the-loop learning for industrial maintenance, using online Isolation Forest with interpretability enhancements for real-time fault detection.

DetailsMotivation: Industrial maintenance generates continuous data streams requiring real-time adaptive decision-making, but most data stream mining methods assume fully supervised settings while ground-truth labels are often delayed or unavailable in practice.

Method: Collaborative DSM framework integrating unsupervised anomaly detection with interactive human-in-the-loop learning. Uses online Isolation Forest enhanced with interpretability tools: incremental Partial Dependence Plots and a feature importance score derived from deviations of Individual Conditional Expectation curves from a fading average.

Result: Real-time implementation described with initial results for fault detection in a Jacquard loom unit. Ongoing work targets continuous monitoring to predict and explain imminent bearing failures.

Conclusion: The framework enables users to dynamically reassess feature relevance and adjust anomaly thresholds, supporting maintenance decisions in industrial IoT environments with limited computational resources and delayed label availability.

Abstract: Industrial maintenance is being transformed by the Internet of Things and edge computing, generating continuous data streams that demand real-time, adaptive decision-making under limited computational resources. While data stream mining (DSM) addresses this challenge, most methods assume fully supervised settings, yet in practice, ground-truth labels are often delayed or unavailable. This paper presents a collaborative DSM framework that integrates unsupervised anomaly detection with interactive, human-in-the-loop learning to support maintenance decisions. We employ an online Isolation Forest and enhance interpretability using incremental Partial Dependence Plots and a feature importance score, derived from deviations of Individual Conditional Expectation curves from a fading average, enabling users to dynamically reassess feature relevance and adjust anomaly thresholds. We describe the real-time implementation and provide initial results for fault detection in a Jacquard loom unit. Ongoing work targets continuous monitoring to predict and explain imminent bearing failures.
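
A sketch of the interpretability component under simplifying assumptions: ICE curves of the anomaly score per feature, a fading (exponentially decayed) average of past curves, and an importance score derived from the deviation between the two. scikit-learn's batch IsolationForest stands in for the online variant used in the paper, and grids and decay are placeholders.

```python
# ICE-based feature importance for streaming anomaly detection (offline stand-in).
import numpy as np
from sklearn.ensemble import IsolationForest

def ice_curve(model, x, feature_idx, grid):
    # Vary one feature of a single sample over a grid and record the anomaly score.
    X = np.tile(x, (len(grid), 1))
    X[:, feature_idx] = grid
    return -model.score_samples(X)           # higher = more anomalous

class FadingICEImportance:
    def __init__(self, grids, decay=0.9):
        self.grids = grids                    # dict: feature index -> 1D grid of values
        self.decay = decay
        self.fading = {}                      # fading average ICE curve per feature

    def update(self, model, x):
        importances = {}
        for j, grid in self.grids.items():
            curve = ice_curve(model, x, j, grid)
            avg = self.fading.get(j, curve)
            importances[j] = float(np.mean(np.abs(curve - avg)))  # deviation score
            self.fading[j] = self.decay * avg + (1 - self.decay) * curve
        return importances

# Usage: model = IsolationForest().fit(X_window); FadingICEImportance(grids).update(model, x_new)
```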

[376] Unsupervised Learning of Density Estimates with Topological Optimization

Suina Tanweer, Firas A. Khasawneh

Main category: cs.LG

TL;DR: Unsupervised bandwidth selection for kernel density estimation using topological data analysis to automatically optimize kernel bandwidth based on topological features.

DetailsMotivation: Kernel density estimation requires tuning the crucial bandwidth hyperparameter, which controls bias-variance trade-off and affects topological feature preservation. Manual bandwidth selection is challenging, especially in high dimensions where visualization is impossible.

Method: Proposes an unsupervised learning approach using a topology-based loss function that quantifies topological characteristics (connected components, loops, voids) to automatically select optimal bandwidth without supervision.

Result: Benchmarks the proposed topology-based approach against classical bandwidth selection techniques and demonstrates its potential across different dimensions.

Conclusion: Topological data analysis provides a mathematical framework for unsupervised bandwidth selection that can handle high-dimensional data where traditional visualization-based methods fail, offering an automated solution to this critical hyperparameter tuning problem.

Abstract: Kernel density estimation is a key component of a wide variety of algorithms in machine learning, Bayesian inference, stochastic dynamics and signal processing. However, the unsupervised density estimation technique requires tuning a crucial hyperparameter: the kernel bandwidth. The choice of bandwidth is critical as it controls the bias-variance trade-off by over- or under-smoothing the topological features. Topological data analysis provides methods to mathematically quantify topological characteristics, such as connected components, loops, voids et cetera, even in high dimensions where visualization of density estimates is impossible. In this paper, we propose an unsupervised learning approach using a topology-based loss function for the automated and unsupervised selection of the optimal bandwidth and benchmark it against classical techniques – demonstrating its potential across different dimensions.

[377] Open Polymer Challenge: Post-Competition Report

Gang Liu, Sobin Alosious, Subhamoy Mahajan, Eric Inae, Yihan Zhu, Yuhan Liu, Renzheng Zhang, Jiaxin Xu, Addison Howard, Ying Li, Tengfei Luo, Meng Jiang

Main category: cs.LG

TL;DR: The Open Polymer Challenge (OPC) releases the first community-developed benchmark for polymer informatics with 10K polymers and 5 properties, focusing on multi-task prediction under realistic constraints to accelerate sustainable materials discovery.

DetailsMotivation: Machine learning progress in sustainable polymer materials has been limited by the lack of large, high-quality, openly accessible polymer datasets, creating a need for standardized benchmarks and data resources.

Method: Released a benchmark dataset with 10K polymers and 5 properties, organized a multi-task prediction challenge under realistic constraints (small data, label imbalance, heterogeneous sources), and participants used techniques like feature augmentation, transfer learning, self-supervised pretraining, and ensemble strategies.

Result: Created the first community-developed polymer informatics benchmark, developed models that address realistic constraints, revealed important lessons about data preparation and distribution shifts, and released test datasets and generation pipelines for broader community use.

Conclusion: The OPC establishes a new foundation for molecular AI in polymer science, providing models, analysis, and data resources expected to accelerate the development of sustainable and energy-efficient materials through improved virtual screening pipelines.

Abstract: Machine learning (ML) offers a powerful path toward discovering sustainable polymer materials, but progress has been limited by the lack of large, high-quality, and openly accessible polymer datasets. The Open Polymer Challenge (OPC) addresses this gap by releasing the first community-developed benchmark for polymer informatics, featuring a dataset with 10K polymers and 5 properties: thermal conductivity, radius of gyration, density, fractional free volume, and glass transition temperature. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. Participants developed models under realistic constraints that include small data, label imbalance, and heterogeneous simulation sources, using techniques such as feature-based augmentation, transfer learning, self-supervised pretraining, and targeted ensemble strategies. The competition also revealed important lessons about data preparation, distribution shifts, and cross-group simulation consistency, informing best practices for future large-scale polymer datasets. The resulting models, analysis, and released data create a new foundation for molecular AI in polymer science and are expected to accelerate the development of sustainable and energy-efficient materials. Along with the competition, we release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data. We also release the data generation pipeline at https://github.com/sobinalosious/ADEPT, which simulates more than 25 properties, including thermal conductivity, radius of gyration, and density.

[378] Discovering Influential Factors in Variational Autoencoders

Shiqi Liu, Jingxin Liu, Qian Zhao, Xiangyong Cao, Huibin Li, Deyu Meng, Hongying Meng, Sheng Liu

Main category: cs.LG

TL;DR: The paper proposes using mutual information between input and VAE latent factors as an indicator to identify influential factors, addressing the issue where VAE ignores pre-set factors, and demonstrates its effectiveness on generation and classification tasks.

DetailsMotivation: VAEs often ignore pre-set factors in learned representations, making it difficult to identify which factors are actually influential for downstream tasks without manual intervention. There's a need for an objective way to supervise and identify meaningful factors in VAE representations.

Method: The authors propose using mutual information between input and each learned factor as an indicator of influence. They design an algorithm to calculate mutual information for VAEs and prove its consistency. They show mutual information affects VAE’s reconstruction error lower bound and downstream classification performance.

Result: Experiments on MNIST, CelebA, and DEAP datasets show mutual information helps identify influential factors, some of which are interpretable. These factors can be used for generation and classification tasks, and help discover emotion-related variants on the DEAP dataset.

Conclusion: Mutual information serves as a reliable indicator for identifying influential factors in VAE representations, enabling better supervision of learned representations without manual intervention, and improving performance on downstream tasks like generation and classification.

Abstract: In machine learning, it remains a critical issue to identify and supervise learned representations, without manual intervention or intuition, in order to extract useful knowledge or serve downstream tasks. In this work, we focus on supervising the influential factors extracted by the variational autoencoder (VAE). The VAE is designed to learn independent low-dimensional representations, but it faces the problem that pre-set factors are sometimes ignored. We argue that the mutual information between the input and each learned factor of the representation serves as a necessary indicator for discovering the influential factors. We find that the VAE objective tends to induce mutual-information sparsity across factor dimensions beyond the data’s intrinsic dimension, therefore resulting in some non-influential factors whose effect on data reconstruction can be ignored. We show that mutual information also influences the lower bound of the VAE’s reconstruction error and downstream classification performance. To make this indicator applicable, we design an algorithm for calculating the mutual information for the VAE and prove its consistency. Experimental results on MNIST, CelebA and DEAP datasets show that mutual information can help determine influential factors, some of which are interpretable and can be used for further generation and classification tasks, and can help discover factors connected with emotion on the DEAP dataset.
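
One common way to estimate per-factor mutual information for a Gaussian-posterior VAE is the minibatch estimator I(x; z_j) ≈ E[log q(z_j|x) - log q(z_j)], with the aggregate posterior approximated by a mixture over the batch. The sketch below uses that estimator and is not necessarily the paper's exact algorithm.

```python
# Minibatch estimator of I(x; z_j) for a Gaussian-posterior VAE (illustrative).
import math
import torch

def gaussian_log_prob(z, mu, logvar):
    # Elementwise log N(z; mu, exp(logvar)), one value per latent dimension.
    return -0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / logvar.exp())

def per_factor_mutual_information(mu, logvar):
    # mu, logvar: (N, d) posterior parameters for a batch; returns (d,) MI estimates.
    n, _ = mu.shape
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()        # one sample per input
    log_q_z_given_x = gaussian_log_prob(z, mu, logvar)           # (N, d)
    pair = gaussian_log_prob(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))  # (N, N, d)
    log_q_z = torch.logsumexp(pair, dim=1) - math.log(n)         # aggregate posterior, (N, d)
    return (log_q_z_given_x - log_q_z).mean(dim=0)
```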

[379] Generative Learning of Heterogeneous Tail Dependence

Xiangqian Sun, Xing Yan, Qi Wu

Main category: cs.LG

TL;DR: Proposes multivariate generative model for business/financial data with heterogeneous asymmetric tail dependence and marginals, using moment learning algorithm for parameter estimation.

DetailsMotivation: Business and financial data often exhibit complex dependence structures with heterogeneous and asymmetric tail dependence between dimensions, as well as heterogeneity and asymmetry in marginal tails. Existing methods face scalability issues and error propagation problems as dataset dimensions grow large.

Method: Multivariate generative model that captures heterogeneous asymmetric tail dependence between all dimension pairs while allowing heterogeneity/asymmetry in marginal tails. Uses novel moment learning algorithm for parameter estimation instead of likelihood methods (due to lack of closed-form density).

Result: Model shows better finite-sample performance compared to copula-based benchmarks and recent similar models on both simulated and real-world datasets. The structure avoids error propagation in parameter estimation and scales well with increasing dimensions.

Conclusion: The proposed multivariate generative model with moment learning algorithm effectively captures complex dependence structures in business/financial data, offering superior performance, scalability, and robustness compared to existing approaches.

Abstract: We propose a multivariate generative model to capture the complex dependence structure often encountered in business and financial data. Our model features heterogeneous and asymmetric tail dependence between all pairs of individual dimensions while also allowing heterogeneity and asymmetry in the tails of the marginals. A significant merit of our model structure is that it is not prone to error propagation in the parameter estimation process, hence very scalable, as the dimensions of datasets grow large. However, the likelihood methods are infeasible for parameter estimation in our case due to the lack of a closed-form density function. Instead, we devise a novel moment learning algorithm to learn the parameters. To demonstrate the effectiveness of the model and its estimator, we test them on simulated as well as real-world datasets. Results show that this framework gives better finite-sample performance compared to the copula-based benchmarks as well as recent similar models.

[380] Freeze then Train: Towards Provable Representation Learning under Spurious Correlations and Feature Noise

Haotian Ye, James Zou, Linjun Zhang

Main category: cs.LG

TL;DR: The paper analyzes why linear probing works for spurious correlations, identifies non-realizable noise as a key factor, and proposes Freeze then Train (FTT) algorithm that outperforms existing methods.

DetailsMotivation: Empirical risk minimization (ERM) fails when spurious correlations exist between training and test environments. While prior work found that core features can still be learned despite spurious correlations, there's no theoretical understanding of when/why linear probing works.

Method: The paper first analyzes when core features are learned well (when their non-realizable noise is smaller than spurious features). Then proposes Freeze then Train (FTT) algorithm that freezes salient features first, then trains remaining features using ERM to preserve features beneficial for test-time probing.

Result: FTT outperforms ERM, IRM, JTT and CVaR-DRO on spurious correlation datasets, with 4.5% accuracy improvement when feature noise is large. Also performs better on general distribution shift benchmarks.

Conclusion: Non-realizable noise is crucial for understanding when linear probing works. FTT effectively preserves beneficial features for test-time probing and outperforms existing methods across various distribution shift scenarios.

Abstract: The existence of spurious correlations such as image backgrounds in the training environment can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features that are related to the outcome can still be learned well even with the presence of spurious correlations. This opens a promising strategy to first train a feature learner rather than a classifier, and then perform linear probing (last layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when their associated non-realizable noise is smaller than that of spurious features, which is not necessarily true in practice. We provide both theories and experiments to support this finding and to illustrate the importance of non-realizable noise. Moreover, we propose an algorithm called Freeze then Train (FTT), that first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test time probing. Across two commonly used spurious correlation datasets, FTT outperforms ERM, IRM, JTT and CVaR-DRO, with substantial improvement in accuracy (by 4.5%) when the feature noise is large. FTT also performs better on general distribution shift benchmarks.
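
A schematic reading of Freeze then Train: keep a frozen copy of an encoder whose features are treated as the already-salient ones, add a trainable encoder for complementary features, and train the new features plus a joint head with ordinary ERM; at test time only the last layer would be retrained. The module decomposition and names are assumptions, not the authors' exact recipe.

```python
# Schematic Freeze-then-Train module: frozen "salient" features concatenated with
# newly trained complementary features, classified by a joint head under ERM.
import torch
import torch.nn as nn

class FreezeThenTrain(nn.Module):
    def __init__(self, frozen_encoder, trainable_encoder, frozen_dim, new_dim, num_classes):
        super().__init__()
        self.frozen = frozen_encoder
        for p in self.frozen.parameters():
            p.requires_grad_(False)                 # salient features stay fixed
        self.trainable = trainable_encoder          # learns complementary features via ERM
        self.head = nn.Linear(frozen_dim + new_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            f_frozen = self.frozen(x)
        f_new = self.trainable(x)
        return self.head(torch.cat([f_frozen, f_new], dim=-1))

# Training = plain cross-entropy ERM; at test time, linear probing retrains only
# the last layer on data from the new environment.
```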

[381] Adaptive Self-Distillation for Minimizing Client Drift in Heterogeneous Federated Learning

M Yashwanth, Gaurav Kumar Nayak, Arya Singh, Yogesh Simmhan, Anirban Chakraborty

Main category: cs.LG

TL;DR: Proposes Adaptive Self-Distillation (ASD) regularization for Federated Learning to address client-drift in non-iid data settings, boosting existing FL methods’ performance.

DetailsMotivation: Federated Learning suffers from 'client-drift' problem when clients have heterogeneous/non-iid data distributions (e.g., class imbalance), leading to slower convergence and poor aggregated model performance.

Method: Introduces Adaptive Self-Distillation (ASD) regularization that adaptively adjusts to each client’s training data based on global model prediction entropy and client-data label distribution. Can be integrated with existing FL algorithms.

Result: Theoretical analysis shows ASD reduces client-drift, empirical results demonstrate improved generalization. Extensive experiments on real-world benchmarks show substantial performance gains when combined with popular FL methods.

Conclusion: ASD regularization effectively mitigates client-drift in FL with non-iid data, enhances existing FL algorithms, and improves model convergence and performance without sharing local data.

Abstract: Federated Learning (FL) is a machine learning paradigm that enables clients to jointly train a global model by aggregating the locally trained models without sharing any local training data. In practice, there can often be substantial heterogeneity (e.g., class imbalance) across the local data distributions observed by each of these clients. Under such non-iid label distributions across clients, FL suffers from the ‘client-drift’ problem where every client drifts to its own local optimum. This results in slower convergence and poor performance of the aggregated model. To address this limitation, we propose a novel regularization technique based on adaptive self-distillation (ASD) for training models on the client side. Our regularization scheme adaptively adjusts to each client’s training data based on the global model’s prediction entropy and the client-data label distribution. We show in this paper that our proposed regularization (ASD) can be easily integrated atop existing, state-of-the-art FL algorithms, leading to a further boost in the performance of these off-the-shelf methods. We theoretically explain how incorporation of ASD regularizer leads to reduction in client-drift and empirically justify the generalization ability of the trained model. We demonstrate the efficacy of our approach through extensive experiments on multiple real-world benchmarks and show substantial gains in performance when the proposed regularizer is combined with popular FL methods.
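
The regularizer can be sketched as a per-sample weighted distillation term toward the frozen global model, with the weight driven by the global prediction entropy and the local label frequency. The specific weighting formula below is an assumption, not the authors' exact rule.

```python
# Per-sample weighted self-distillation toward the frozen global model.
import torch
import torch.nn.functional as F

def asd_regularizer(local_logits, global_logits, labels, label_freq, temperature=2.0):
    # label_freq: (num_classes,) empirical label distribution on this client.
    p_global = F.softmax(global_logits / temperature, dim=-1)
    log_p_local = F.log_softmax(local_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_local, p_global, reduction="none").sum(dim=-1)    # per-sample KL
    entropy = -(p_global * p_global.clamp_min(1e-8).log()).sum(dim=-1)    # global uncertainty
    rarity = 1.0 / label_freq.clamp_min(1e-8)[labels]                     # up-weight rare classes
    weight = rarity / (1.0 + entropy)     # assumed weighting: trust confident global predictions
    return (weight * kl).mean()

# Client objective: loss = F.cross_entropy(local_logits, labels) + beta * asd_regularizer(...)
```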

[382] Learning Generalized Hamiltonians using fully Symplectic Mappings

Harsh Choudhary, Chandan Gupta, Vyacheslav Kungurtsev, Melvin Leok, Georgios Korpas

Main category: cs.LG

TL;DR: The paper extends Hamiltonian Neural Networks to handle non-separable Hamiltonians using symplectic integrators, enabling accurate long-term predictions while bypassing computationally expensive backpropagation through ODE solvers.

DetailsMotivation: Hamiltonian systems are conservative and important for modeling physical systems. While Hamiltonian Neural Networks incorporate physical inductive bias for better sample complexity and out-of-distribution accuracy, existing symplectic integration methods are limited to separable or augmented non-separable Hamiltonians, not generalized non-separable ones.

Method: The authors extend symplectic integrators to generalized non-separable Hamiltonians. They leverage the self-adjoint property of symplectic integrators to bypass computationally intensive backpropagation through ODE solvers. The method learns Hamiltonians from noisy observations of state variables.

Result: The method is robust to noise and provides good approximations of system Hamiltonians from noisy observations. Numerical results demonstrate strong performance in Hamiltonian reconstruction and conservation, with particular advantages for non-separable systems.

Conclusion: The proposed approach successfully extends Hamiltonian Neural Networks to generalized non-separable Hamiltonians using symplectic integration, enabling accurate long-term predictions while avoiding expensive backpropagation through ODE solvers, especially beneficial for non-separable physical systems.

Abstract: Many important physical systems can be described as the evolution of a Hamiltonian system, which has the important property of being conservative, that is, energy is conserved throughout the evolution. Physics Informed Neural Networks and in particular Hamiltonian Neural Networks have emerged as a mechanism to incorporate structural inductive bias into the NN model. By ensuring physical invariances are conserved, the models exhibit significantly better sample complexity and out-of-distribution accuracy than standard NNs. Learning the Hamiltonian as a function of its canonical variables, typically position and velocity, from sample observations of the system thus becomes a critical task in system identification and long-term prediction of system behavior. However, to truly preserve the long-run physical conservation properties of Hamiltonian systems, one must use symplectic integrators for a forward pass of the system’s simulation. While symplectic schemes have been used in the literature, they are thus far limited to situations when they reduce to explicit algorithms, which include the case of separable Hamiltonians or augmented non-separable Hamiltonians. We extend it to generalized non-separable Hamiltonians, and noting the self-adjoint property of symplectic integrators, we bypass computationally intensive backpropagation through an ODE solver. We show that the method is robust to noise and provides a good approximation of the system Hamiltonian when the state variables are sampled from a noisy observation. In the numerical results, we show the performance of the method concerning Hamiltonian reconstruction and conservation, indicating its particular advantage for non-separable systems.
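
For non-separable Hamiltonians, the implicit midpoint rule is a standard fully symplectic choice; the sketch below solves the implicit update by fixed-point iteration on a neural H(q, p). It illustrates the kind of integrator involved, not the authors' specific scheme or training loop.

```python
# Implicit-midpoint step for a (possibly non-separable) neural Hamiltonian H(q, p),
# solved by fixed-point iteration; the implicit midpoint rule is symplectic.
import torch

def hamiltonian_grads(H, q, p):
    q = q.clone().requires_grad_(True)
    p = p.clone().requires_grad_(True)
    dHdq, dHdp = torch.autograd.grad(H(q, p).sum(), (q, p), create_graph=True)
    return dHdq, dHdp

def implicit_midpoint_step(H, q, p, dt, iters=10):
    q_next, p_next = q, p
    for _ in range(iters):                 # fixed-point iteration on the midpoint equations
        q_mid, p_mid = 0.5 * (q + q_next), 0.5 * (p + p_next)
        dHdq, dHdp = hamiltonian_grads(H, q_mid, p_mid)
        q_next = q + dt * dHdp             # dq/dt =  dH/dp
        p_next = p - dt * dHdq             # dp/dt = -dH/dq
    return q_next, p_next
```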

[383] BG-HGNN: Toward Efficient Learning for Complex Heterogeneous Graphs

Junwei Su, Lingjun Mao, Zheng Da, Chuan Wu

Main category: cs.LG

TL;DR: BG-HGNN addresses parameter explosion and relation collapse in heterogeneous graph neural networks by integrating relational heterogeneity into a shared low-dimensional feature space, achieving significant efficiency gains while maintaining accuracy.

DetailsMotivation: Existing HGNNs allocate separate learnable weights for each relation type, causing parameter explosion and relation collapse when dealing with complex heterogeneous graphs with many relation types, making them inefficient or impractical.

Method: Proposes Blend&Grind-HGNN (BG-HGNN), a unified feature-representation framework that integrates and distills relational heterogeneity into a shared low-dimensional feature space, eliminating the need for relation-specific parameter sets.

Result: BG-HGNN achieves substantial gains over state-of-the-art HGNNs: improves parameter efficiency by up to 28.96x, training throughput by up to 110.30x, while matching or surpassing accuracy on complex heterogeneous graphs.

Conclusion: BG-HGNN provides an efficient and expressive solution for learning on complex heterogeneous graphs with many relation types by addressing fundamental limitations of existing HGNN designs through unified feature representation.

Abstract: Heterogeneous graphs, comprising diverse node and edge types connected through varied relations, are ubiquitous in real-world applications. Message-passing heterogeneous graph neural networks (HGNNs) have emerged as a powerful model class for such data. However, existing HGNNs typically allocate a separate set of learnable weights for each relation type to model relational heterogeneity. Despite their promise, these models are effective primarily on simple heterogeneous graphs with only a few relation types. In this paper, we show that this standard design inherently leads to parameter explosion (the number of learnable parameters grows rapidly with the number of relation types) and relation collapse (the model loses the ability to distinguish among different relations). These issues make existing HGNNs inefficient or impractical for complex heterogeneous graphs with many relation types. To address these challenges, we propose Blend&Grind-HGNN (BG-HGNN), a unified feature-representation framework that integrates and distills relational heterogeneity into a shared low-dimensional feature space. This design eliminates the need for relation-specific parameter sets and enables efficient, expressive learning even as the number of relations grows. Empirically, BG-HGNN achieves substantial gains over state-of-the-art HGNNs, improving parameter efficiency by up to 28.96x and training throughput by up to 110.30x, while matching or surpassing their accuracy on complex heterogeneous graphs.

[384] Score-based Conditional Out-of-Distribution Augmentation for Graph Covariate Shift

Bohan Wang, Yurui Chang, Wei Jin, Lu Lin

Main category: cs.LG

TL;DR: Score-based conditional graph generation for OOD generalization by exploring unseen environments while preserving stable features.

DetailsMotivation: Distribution shifts between training and testing datasets impair graph learning performance. Existing graph augmentation methods rely on accurate feature separation and are limited to existing environmental features in training data.

Method: Novel distributional augmentation approach using tailored score-based conditional graph generation strategies to explore and synthesize unseen environments while preserving valid graph patterns and stable features.

Result: Comprehensive empirical evaluations demonstrate enhanced effectiveness in improving graph out-of-distribution generalization.

Conclusion: The proposed score-based conditional graph generation approach effectively addresses limitations of existing methods and improves OOD generalization for graph learning.

Abstract: Distribution shifts between training and testing datasets significantly impair model performance in graph learning. A commonly-taken causal view in graph invariant learning suggests that stable predictive features of graphs are causally associated with labels, whereas varying environmental features lead to distribution shifts. In particular, covariate shifts caused by unseen environments in test graphs underscore the critical need for out-of-distribution (OOD) generalization. Existing graph augmentation methods designed to address the covariate shift often disentangle the stable and environmental features in the input space, and selectively perturb or mixup the environmental features. However, such perturbation-based methods heavily rely on an accurate separation of stable and environmental features, and their exploration ability is confined to existing environmental features in the training distribution. To overcome these limitations, we introduce a novel distributional augmentation approach enabled by tailored score-based conditional graph generation strategies to explore and synthesize unseen environments while preserving the validity and stable features of overall graph patterns. Our comprehensive empirical evaluations demonstrate the enhanced effectiveness of our method in improving graph OOD generalization.

[385] Neural Surrogate HMC: On Using Neural Likelihoods for Hamiltonian Monte Carlo in Simulation-Based Inference

Linnea M Wolniewicz, Peter Sadowski, Claudio Corti

Main category: cs.LG

TL;DR: Neural likelihood estimation combined with Hamiltonian Monte Carlo enables efficient Bayesian inference when direct likelihood computation is infeasible, providing amortization, gradient access, and noise smoothing.

DetailsMotivation: Traditional MCMC methods require repeated likelihood computations which can be infeasible in some scenarios, especially when simulations are computationally expensive or numerically unstable. There's a need for methods that can perform Bayesian inference without direct likelihood evaluation.

Method: Combines neural likelihood estimation (a simulation-based inference method) with Hamiltonian Monte Carlo. Uses neural networks to approximate likelihood functions, enabling three key advantages: amortization of MCMC computations, providing gradients for HMC, and smoothing noisy simulations from numerical instabilities.

Result: The method successfully enables efficient inference of latent parameters in the Parker equation for modeling heliospheric transport of galactic cosmic rays. Provides practical guidelines for prior definition, training set sampling, and convergence evaluation.

Conclusion: The synergy between neural likelihood estimation and Hamiltonian Monte Carlo offers a powerful approach for Bayesian inference when direct likelihood computation is infeasible, combining the strengths of amortized inference with gradient-based MCMC sampling.

Abstract: Bayesian inference methods such as Markov Chain Monte Carlo (MCMC) typically require repeated computations of the likelihood function, but in some scenarios this is infeasible and alternative methods are needed. Simulation-based inference (SBI) methods address this problem by using machine learning to amortize computations. In this work, we highlight a particular synergy between the SBI method of neural likelihood estimation and the classic MCMC method of Hamiltonian Monte Carlo. We show that approximating the likelihood function with a neural network model can provide three distinct advantages: (1) amortizing the computations for MCMC; (2) providing gradients for Hamiltonian Monte Carlo, and (3) smoothing over noisy simulations resulting from numerical instabilities. We provide practical guidelines for defining a prior, sampling a training set, and evaluating convergence. The method is demonstrated in an application modeling the heliospheric transport of galactic cosmic rays, where it enables efficient inference of latent parameters in the Parker equation.
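
A minimal sketch of the core idea, assuming a differentiable surrogate `log_prob` (for example, a small neural network trained on simulations, plus the log-prior) that maps a parameter tensor to a scalar; the leapfrog and accept/reject steps are standard HMC, with gradients supplied by autograd rather than the simulator.

```python
import torch

def hmc_step(theta, log_prob, eps=0.05, L=10):
    """One Hamiltonian Monte Carlo step; log_prob is a differentiable
    surrogate (e.g. a neural log-likelihood + log-prior) returning a scalar."""
    def grad(x):
        x = x.detach().requires_grad_(True)
        g, = torch.autograd.grad(log_prob(x), x)
        return g

    q0 = theta.detach()
    p0 = torch.randn_like(q0)
    q, p = q0.clone(), p0 + 0.5 * eps * grad(q0)   # half kick
    for i in range(L):
        q = q + eps * p                            # drift
        if i < L - 1:
            p = p + eps * grad(q)                  # full kick
    p = p + 0.5 * eps * grad(q)                    # final half kick
    # Metropolis accept/reject using the surrogate Hamiltonian
    H0 = -log_prob(q0) + 0.5 * (p0 ** 2).sum()
    H1 = -log_prob(q) + 0.5 * (p ** 2).sum()
    return q if torch.rand(()) < torch.exp(H0 - H1) else q0

# Toy usage: sample from a standard normal via its exact log-density
theta = torch.zeros(2)
for _ in range(100):
    theta = hmc_step(theta, lambda x: -0.5 * (x ** 2).sum())
```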

[386] Asynchronous Stochastic Approximation with Applications to Average-Reward Reinforcement Learning

Huizhen Yu, Yi Wan, Richard S. Sutton

Main category: cs.LG

TL;DR: The paper extends stability and convergence analysis for asynchronous stochastic approximation algorithms, with applications to average-reward reinforcement learning.

DetailsMotivation: To provide stronger theoretical foundations for asynchronous stochastic approximation algorithms, particularly for average-reward reinforcement learning applications where existing convergence guarantees are limited under general noise conditions.

Method: Extends Borkar and Meyn’s stability proof method to accommodate more general noise conditions, and examines shadowing properties of asynchronous SA using dynamical systems approaches from Hirsch and Benaïm.

Result: Broader convergence guarantees for asynchronous stochastic approximation algorithms under more general noise conditions, and theoretical foundation for relative value iteration-based RL algorithms for average-reward MDPs and semi-MDPs.

Conclusion: The paper establishes stronger theoretical foundations for asynchronous SA algorithms, enabling development and analysis of relative value iteration-based RL algorithms for average-reward problems in companion work.

Abstract: This paper investigates the stability and convergence properties of asynchronous stochastic approximation (SA) algorithms, with a focus on extensions relevant to average-reward reinforcement learning. We first extend a stability proof method of Borkar and Meyn to accommodate more general noise conditions than previously considered, thereby yielding broader convergence guarantees for asynchronous SA. To sharpen the convergence analysis, we further examine the shadowing properties of asynchronous SA, building on a dynamical systems approach of Hirsch and Benaïm. These results provide a theoretical foundation for a class of relative value iteration-based reinforcement learning algorithms – developed and analyzed in a companion paper – for solving average-reward Markov and semi-Markov decision processes.

[387] Geometry Aware Meta-Learning Neural Network for Joint Phase and Precoder Optimization in RIS

Dahlia Devapriya, Aparna V C, Sheetal Kalyani

Main category: cs.LG

TL;DR: A geometry-aware meta-learning neural network for RIS-aided MISO systems that optimizes precoder and phase shifts on Riemannian manifolds, achieving faster convergence and better performance than existing methods.

DetailsMotivation: Joint optimization of precoder matrix and RIS phase shifts in RIS-aided systems involves significant computational complexity, requiring more efficient approaches.

Method: Proposes a complex-valued, geometry-aware meta-learning neural network that leverages complex circle geometry for phase shifts and spherical geometry for precoder optimization on Riemannian manifolds, using complex-valued neural networks for phase shifts and Euler-inspired updates for precoder network.

Result: Outperforms existing neural network-based algorithms with 0.7 bps higher weighted sum rate, 1.8 dB power gain, and ~100 epochs faster convergence; also beats state-of-the-art alternating optimization by 0.86 bps with 2.6 dB power gain.

Conclusion: The proposed geometry-aware meta-learning approach provides an efficient solution for RIS-aided MISO systems with superior performance, faster convergence, and lower power consumption compared to existing methods.

Abstract: In reconfigurable intelligent surface (RIS) aided systems, the joint optimization of the precoder matrix at the base station and the phase shifts of the RIS elements involves significant complexity. In this paper, we propose a complex-valued, geometry-aware meta-learning neural network that maximizes the weighted sum rate in a multi-user multiple-input single-output system. By leveraging the complex circle geometry for phase shifts and spherical geometry for the precoder, the optimization occurs on Riemannian manifolds, leading to faster convergence. We use a complex-valued neural network for phase shifts and an Euler-inspired update for the precoder network. Our approach outperforms existing neural network-based algorithms, offering higher weighted sum rates, lower power consumption, and significantly faster convergence. Specifically, it converges faster by nearly 100 epochs, with a 0.7 bps improvement in weighted sum rate and a 1.8 dB power gain when compared with existing work. Further, it outperforms the state-of-the-art alternating optimization algorithm by 0.86 bps with a 2.6 dB power gain.
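
A toy sketch of the unit-circle (complex circle) constraint that the manifold view exploits: parameterizing RIS coefficients by real angles keeps every update on the circle by construction. The channel, dimensions, and objective below are illustrative stand-ins, not the paper's meta-learning network.

```python
import torch

# Minimal sketch (toy objective): optimize RIS phase shifts on the complex
# circle manifold by parameterizing angles and maximizing a toy array gain.
N = 64
h = torch.randn(N, dtype=torch.cfloat)            # hypothetical cascaded channel
phase = (2 * torch.pi * torch.rand(N)).requires_grad_(True)

opt = torch.optim.Adam([phase], lr=1e-2)
for _ in range(200):
    theta = torch.exp(1j * phase)                 # points stay on the unit circle
    gain = torch.abs(torch.sum(h.conj() * theta)) ** 2
    loss = -gain                                  # maximize the toy array gain
    opt.zero_grad(); loss.backward(); opt.step()
```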

[388] Meta-Computing Enhanced Federated Learning in IIoT: Satisfaction-Aware Incentive Scheme via DRL-Based Stackelberg Game

Xiaohuan Li, Shaowen Qin, Xin Tang, Jiawen Kang, Jin Ye, Zhonghua Zhao, Yusi Zheng, Dusit Niyato

Main category: cs.LG

TL;DR: This paper proposes a novel incentive scheme for Federated Learning in Industrial IoT using meta-computing and Stackelberg game theory with deep reinforcement learning to balance model quality and training latency while maximizing utility.

DetailsMotivation: Federated Learning in IIoT needs to balance model quality and training latency, requiring efficient incentive mechanisms to encourage node participation while optimizing overall system performance.

Method: Designs satisfaction function considering data size, Age of Information, and training latency; formulates utility functions as two-stage Stackelberg game; uses deep reinforcement learning to find Stackelberg equilibrium.

Result: Under same budget constraints, proposed scheme improves utility by at least 23.7% compared to existing FL schemes without compromising model accuracy.

Conclusion: The incentive scheme effectively balances rewards and enhances applicability for IIoT by optimizing the trade-off between model quality and training latency through game-theoretic and reinforcement learning approaches.

Abstract: The Industrial Internet of Things (IIoT) leverages Federated Learning (FL) for distributed model training while preserving data privacy, and meta-computing enhances FL by optimizing and integrating distributed computing resources, improving efficiency and scalability. Efficient IIoT operations require a trade-off between model quality and training latency. Consequently, a primary challenge of FL in IIoT is to optimize overall system performance by balancing model quality and training latency. This paper designs a satisfaction function that accounts for data size, Age of Information (AoI), and training latency for meta-computing. Additionally, the satisfaction function is incorporated into the utility function to incentivize IIoT nodes to participate in model training. We model the utility functions of servers and nodes as a two-stage Stackelberg game and employ a deep reinforcement learning approach to learn the Stackelberg equilibrium. This approach ensures balanced rewards and enhances the applicability of the incentive scheme for IIoT. Simulation results demonstrate that, under the same budget constraints, the proposed incentive scheme improves utility by at least 23.7% compared to existing FL schemes without compromising model accuracy.

[389] GLL: A Differentiable Graph Learning Layer for Neural Networks

Jason Brown, Bohan Chen, Harris Hardiman-Mostow, Jeff Calder, Andrea L. Bertozzi

Main category: cs.LG

TL;DR: This paper introduces a novel neural network architecture that replaces the standard projection head and softmax activation with graph-based learning layers, enabling end-to-end integration of similarity graph construction and label propagation for classification tasks.

DetailsMotivation: Standard deep learning classification architectures fail to leverage relational information between samples. While graph-based learning techniques like Laplace learning have been combined with neural networks, prior approaches either approximate gradients or decouple processes, lacking true end-to-end integration.

Method: The authors derive backpropagation equations using the adjoint method to integrate a general family of graph learning layers into neural networks. This replaces projection heads and softmax with similarity graph construction and graph Laplacian-based label propagation layers.

Result: Experimental results show smooth label transitions across data, improved generalization, enhanced robustness to adversarial attacks, and better training dynamics compared to standard softmax-based approaches.

Conclusion: The proposed method successfully integrates graph learning layers into neural networks end-to-end, offering advantages over traditional classification architectures by leveraging relational information between samples through graph-based label propagation.

Abstract: Standard deep learning architectures used for classification generate label predictions with a projection head and softmax activation function. Although successful, these methods fail to leverage the relational information between samples for generating label predictions. In recent works, graph-based learning techniques, namely Laplace learning, have been heuristically combined with neural networks for both supervised and semi-supervised learning (SSL) tasks. However, prior works approximate the gradient of the loss function with respect to the graph learning algorithm or decouple the processes; end-to-end integration with neural networks is not achieved. In this work, we derive backpropagation equations, via the adjoint method, for inclusion of a general family of graph learning layers into a neural network. The resulting method, distinct from graph neural networks, allows us to precisely integrate similarity graph construction and graph Laplacian-based label propagation into a neural network layer, replacing the projection head and softmax activation function for general classification tasks. Our experimental results demonstrate smooth label transitions across data, improved generalization and robustness to adversarial attacks, and improved training dynamics compared to a standard softmax-based approach.
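
For context, a plain (non-differentiable) Laplace-learning step on a fixed similarity graph looks like the sketch below; the paper's contribution is deriving adjoint-based backpropagation so that this propagation, together with the graph construction, can sit inside a network as a trainable layer.

```python
import numpy as np

def laplace_learning(W, Y_labeled, labeled_idx):
    """Graph Laplacian label propagation (Laplace learning) sketch.
    W: (n, n) symmetric similarity matrix; Y_labeled: (k, c) one-hot labels."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                          # unnormalized Laplacian
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    Luu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    Lul = L[np.ix_(unlabeled_idx, labeled_idx)]
    U = np.linalg.solve(Luu, -Lul @ Y_labeled)              # harmonic solution
    scores = np.zeros((n, Y_labeled.shape[1]))
    scores[labeled_idx] = Y_labeled
    scores[unlabeled_idx] = U
    return scores

# Toy usage: a 4-node chain graph with the two end nodes labeled
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(laplace_learning(W, np.eye(2), labeled_idx=np.array([0, 3])))
```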

[390] RLCAD: Reinforcement Learning Training Gym for Revolution Involved CAD Command Sequence Generation

Xiaolong Yin, Xingyu Lu, Jiahang Shen, Jingzhe Ni, Hailong Li, Ruofeng Tong, Min Tang, Peng Du

Main category: cs.LG

TL;DR: A reinforcement learning environment for generating CAD command sequences that supports complex operations like revolution, achieving SOTA quality in converting B-Rep geometries to command sequences.

DetailsMotivation: Existing CAD command sequence generation methods are limited to basic operations (2D sketching, extrusion, Boolean), making it challenging to represent complex geometries. There's a need for more comprehensive approaches that support additional operations like revolution.

Method: Developed a reinforcement learning training environment (gym) built on a CAD geometric engine. The policy network generates actions from input B-Rep geometry, and the gym processes these actions to produce CAD geometry. Rewards are based on the difference between generated and target geometries, used to update the RL network. Supports operations including revolution beyond basic sketching, Boolean, and extrusion.

Result: Achieved state-of-the-art quality in generating command sequences from B-Rep geometries using the proposed RL training environment.

Conclusion: The proposed RL-based approach with a comprehensive training gym enables generation of CAD command sequences that can represent more complex geometries by supporting additional operations like revolution, overcoming limitations of existing methods.

Abstract: A CAD command sequence is a typical parametric design paradigm in 3D CAD systems where a model is constructed by overlaying 2D sketches with operations such as extrusion, revolution, and Boolean operations. Although there is growing academic interest in the automatic generation of command sequences, existing methods and datasets only support operations such as 2D sketching, extrusion, and Boolean operations. This limitation makes it challenging to represent more complex geometries. In this paper, we present a reinforcement learning (RL) training environment (gym) built on a CAD geometric engine. Given an input boundary representation (B-Rep) geometry, the policy network in the RL algorithm generates an action. This action, along with previously generated actions, is processed within the gym to produce the corresponding CAD geometry, which is then fed back into the policy network. The rewards, determined by the difference between the generated and target geometries within the gym, are used to update the RL network. Our method supports operations beyond sketching, Boolean, and extrusion, including revolution. With this training gym, we achieve state-of-the-art (SOTA) quality in generating command sequences from B-Rep geometries.
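
A hypothetical skeleton of such an environment, with the CAD engine and geometry metric replaced by stubs; the class name, action format, and reward shape below are illustrative only, not the paper's implementation.

```python
class CADCommandEnv:
    """Hypothetical skeleton of an RL environment for CAD command generation.
    A real gym would wrap a CAD geometric engine; here _execute and
    _geometry_distance are stubs."""

    def __init__(self, target_brep, max_commands=32):
        self.target = target_brep
        self.max_commands = max_commands
        self.commands = []

    def reset(self):
        self.commands = []
        return self._observation()

    def step(self, action):
        self.commands.append(action)               # e.g. sketch / extrude / revolve op
        current = self._execute(self.commands)     # build geometry from commands
        reward = -self._geometry_distance(current, self.target)
        done = action.get("op") == "stop" or len(self.commands) >= self.max_commands
        return self._observation(), reward, done, {}

    def _observation(self):
        return {"target": self.target, "history": list(self.commands)}

    def _execute(self, commands):
        return commands                             # stub for the CAD engine

    def _geometry_distance(self, current, target):
        return float(abs(len(current) - 8))         # stub for a shape-difference metric

env = CADCommandEnv(target_brep="<target B-Rep placeholder>")
obs = env.reset()
obs, reward, done, info = env.step({"op": "sketch"})
```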

[391] Oscillations Make Neural Networks Robust to Quantization

Jonathan Wenshøj, Bob Pepin, Raghavendra Selvan

Main category: cs.LG

TL;DR: QAT weight oscillations are essential, not undesirable. The paper shows oscillations help quantization performance and proposes a regularizer to induce them.

DetailsMotivation: Challenge the prevailing view that weight oscillations in Quantization Aware Training (QAT) are undesirable side-effects, and instead argue they are essential to QAT's success.

Method: 1) Analyze QAT in a univariate linear model showing it creates an additional loss term causing oscillations. 2) Derive a regularizer to induce oscillations in neural network weights during training. 3) Evaluate on ResNet-18 and Tiny Vision Transformer across CIFAR-10 and Tiny ImageNet datasets with various quantization levels.

Result: Training with oscillations followed by post-training quantization (PTQ) recovers QAT performance in most cases. Provides novel insight into QAT dynamics and the positive role of oscillations.

Conclusion: Weight oscillations in QAT are not negative side-effects but essential components that improve quantization performance. The proposed oscillation-inducing regularizer enables PTQ to match QAT performance.

Abstract: We challenge the prevailing view that weight oscillations observed during Quantization Aware Training (QAT) are merely undesirable side-effects and argue instead that they are an essential part of QAT. We show in a univariate linear model that QAT results in an additional loss term that causes oscillations by pushing weights away from their nearest quantization level. Based on the mechanism from the analysis, we then derive a regularizer that induces oscillations in the weights of neural networks during training. Our empirical results on ResNet-18 and Tiny Vision Transformer, evaluated on CIFAR-10 and Tiny ImageNet datasets, demonstrate across a range of quantization levels that training with oscillations followed by post-training quantization (PTQ) is sufficient to recover the performance of QAT in most cases. With this work we provide further insight into the dynamics of QAT and a novel explanation of the role of oscillations, which until now have been considered to have a primarily negative effect on quantization.
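
One plausible form of such a regularizer (not necessarily the paper's exact derivation): reward each weight for sitting away from its nearest level on a uniform quantization grid, so that training pushes weights off the grid points and induces oscillation around them. The grid step `delta` and strength `lam` are assumed hyperparameters.

```python
import torch

def oscillation_regularizer(weights, delta, lam=1e-4):
    """Plausible sketch (not necessarily the paper's exact form): penalize
    proximity to the nearest level of a uniform quantization grid of step delta."""
    nearest = torch.round(weights / delta) * delta
    dist = torch.abs(weights - nearest)            # in [0, delta/2]
    return -lam * dist.sum()                       # lower loss when farther from grid

# usage sketch: loss = task_loss + oscillation_regularizer(layer.weight, delta=0.05)
```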

[392] Flow-based Conformal Prediction for Multi-dimensional Time Series

Junghwan Lee, Chen Xu, Yao Xie

Main category: cs.LG

TL;DR: A novel conformal prediction method using flow with classifier-free guidance for time series prediction that addresses exchangeability violations and multi-dimensional outcomes while providing coverage guarantees.

DetailsMotivation: Time series prediction is critical for many scientific domains, but black-box ML models need reliable uncertainty quantification. Conformal prediction faces challenges with time series due to exchangeability violations (correlations in features/scores) and multi-dimensional outcomes.

Method: Proposes a conformal prediction method for time series using flow with classifier-free guidance. The method adaptively leverages correlations in features and non-conformity scores to overcome exchangeability assumptions and handles multi-dimensional outcomes.

Result: The method provides exact non-asymptotic marginal coverage guarantees and finite-sample bounds on conditional coverage. Evaluations on real-world datasets show it constructs significantly smaller prediction sets than existing conformal methods while maintaining target coverage.

Conclusion: The proposed flow-based conformal prediction method effectively addresses key challenges in time series uncertainty quantification, providing theoretical guarantees and practical improvements over existing methods.

Abstract: Time series prediction underpins a broad range of downstream tasks across many scientific domains. Recent advances and increasing adoption of black-box machine learning models for time series prediction highlight the critical need for reliable uncertainty quantification. While conformal prediction has gained attention as a reliable uncertainty quantification method, conformal prediction for time series faces two key challenges: (1) adaptively leveraging correlations in features and non-conformity scores to overcome the exchangeability assumption, and (2) constructing prediction sets for multi-dimensional outcomes. To address these challenges jointly, we propose a novel conformal prediction method for time series using flow with classifier-free guidance. We provide coverage guarantees by establishing exact non-asymptotic marginal coverage and a finite-sample bound on conditional coverage for the proposed method. Evaluations on real-world time series datasets demonstrate that our method constructs significantly smaller prediction sets than existing conformal prediction methods while maintaining target coverage.
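
For reference, the basic split-conformal recipe that such methods build on looks like the sketch below; the paper's flow-based, multi-dimensional construction is considerably more involved, and the calibration residuals here are purely illustrative.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction baseline (not the paper's flow-based method):
    residuals_cal are |y - y_hat| on a held-out calibration set."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n                 # finite-sample correction
    q = np.quantile(residuals_cal, min(q_level, 1.0), method="higher")
    return y_pred_test - q, y_pred_test + q                      # marginal 1-alpha coverage

# Toy usage with synthetic calibration residuals
lo, hi = split_conformal_interval(np.abs(np.random.randn(500)), y_pred_test=np.zeros(10))
```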

[393] Proper Learnability and the Role of Unlabeled Data

Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, Shang-Hua Teng

Main category: cs.LG

TL;DR: Proper learning (predictors must be in hypothesis class H) is characterized in distribution-fixed PAC model with unlabeled data distribution given, showing optimal proper learners exist via distributional regularization. However, proper learnability in realizable PAC model is undecidable, non-monotonic, and non-local.

DetailsMotivation: To understand when learning problems are properly learnable (must output predictors in hypothesis class H), given that some problems can only be learned improperly (e.g., multiclass classification). Investigates assumptions on hypothesis class or information needed for proper learnability.

Method: 1) Introduces distribution-fixed PAC model where unlabeled data distribution is given, shows optimal proper learners exist via distributional regularization (randomized generalization of regularization). 2) Proves impossibility results: proper learnability is logically undecidable (independent of ZFC axioms), non-monotonic, and non-local. Uses reduction from EMX learning to proper classification.

Result: 1) In distribution-fixed PAC model, optimal proper learners always exist for all metric loss functions and finite learning problems. 2) Sample complexities shrink only by logarithmic factor from classic PAC, refuting role of unlabeled data in worst-case PAC learning. 3) Proper learnability in realizable PAC model is undecidable, non-monotonic, and non-local even for multiclass classification.

Conclusion: Proper learnability can be characterized when unlabeled distribution is known (distribution-fixed PAC), but is fundamentally intractable in realizable PAC model due to undecidability and structural properties. The reduction from EMX learning to proper classification provides new insights into limitations of proper learning.

Abstract: Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class $H$, and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest.

[394] Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting

Mingyue Cheng, Jiahao Wang, Daoyu Wang, Xiaoyu Tao, Qi Liu, Enhong Chen

Main category: cs.LG

TL;DR: TimeReasoner explores whether slow-thinking LLMs (like ChatGPT-o1, DeepSeek-R1) can perform zero-shot time series forecasting by reframing it as a structured reasoning task, showing they have non-trivial capabilities in capturing trends and contextual shifts.

DetailsMotivation: Current time series forecasting methods (statistical, deep learning, multimodal) follow a "fast thinking" paradigm focused on pattern extraction and direct value mapping, overlooking explicit reasoning about temporal dynamics and contextual dependencies. Slow-thinking LLMs have shown impressive multi-step reasoning in other domains, suggesting they could bring reasoning capabilities to TSF.

Method: Proposes TimeReasoner, an empirical study that formulates TSF as a conditional reasoning task. Designs various prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluates them across diverse TSF benchmarks.

Result: Slow-thinking LLMs demonstrate non-trivial zero-shot forecasting capabilities, particularly in capturing high-level trends and contextual shifts. The study reveals insights into LLMs’ reasoning behaviors in temporal domains, highlighting both their potential and limitations.

Conclusion: While preliminary, the work shows promise for reasoning-based forecasting paradigms and aims to catalyze further research toward more interpretable and generalizable TSF frameworks using slow-thinking LLMs.

Abstract: Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Despite their effectiveness, these methods often follow a fast-thinking paradigm emphasizing pattern extraction and direct value mapping, while overlooking explicit reasoning over temporal dynamics and contextual dependencies. Meanwhile, emerging slow-thinking LLMs (e.g., ChatGPT-o1, DeepSeek-R1) have demonstrated impressive multi-step reasoning capabilities across diverse domains, suggesting a new opportunity for reframing TSF as a structured reasoning task. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner? To investigate this, in this paper, we propose TimeReasoner, an extensive empirical study that formulates TSF as a conditional reasoning task. We design a series of prompting strategies to elicit inference-time reasoning from pretrained slow-thinking LLMs and evaluate their performance across diverse TSF benchmarks. Our findings reveal that slow-thinking LLMs exhibit non-trivial zero-shot forecasting capabilities, especially in capturing high-level trends and contextual shifts. While preliminary, our study surfaces important insights into the reasoning behaviors of LLMs in temporal domains, highlighting both their potential and limitations. We hope this work catalyzes further research into reasoning-based forecasting paradigms and paves the way toward more interpretable and generalizable TSF frameworks.
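
A hypothetical prompt template in the spirit of the described strategy, asking the model to reason about the series before emitting values; the exact prompts used in TimeReasoner may differ.

```python
def build_forecast_prompt(history, horizon, context=""):
    """Hypothetical prompt template for eliciting step-by-step (slow-thinking)
    reasoning about a time series before producing numeric forecasts."""
    series = ", ".join(f"{v:.2f}" for v in history)
    context_line = f"Context: {context}\n" if context else ""
    return (
        f"You are given a time series: [{series}].\n"
        f"{context_line}"
        "First reason step by step about trend, seasonality, and recent shifts.\n"
        f"Then output the next {horizon} values as a comma-separated list."
    )

print(build_forecast_prompt([1.0, 1.2, 1.1, 1.4, 1.6], horizon=3, context="daily sales"))
```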

[395] Privacy Preservation through Practical Machine Unlearning

Robert Dilworth

Main category: cs.LG

TL;DR: Machine Unlearning enables selective data removal from trained models for privacy compliance, with methods like Naive Retraining and SISA framework evaluated on computational costs and consistency using HSpam14 dataset, showing promise for integration with Positive Unlabeled Learning.

DetailsMotivation: Address privacy concerns in machine learning by enabling selective removal of data from trained models to comply with privacy regulations while maintaining model performance and fostering trust in data-driven systems.

Method: Examines Naive Retraining and Exact Unlearning via SISA framework, evaluates computational costs and consistency using HSpam14 dataset, explores integration of unlearning principles with Positive Unlabeled (PU) Learning for partially labeled datasets.

Result: Unlearning frameworks like DaRE show promise for ensuring privacy compliance while maintaining model performance, but with significant computational trade-offs. Integration with PU Learning can address challenges of partially labeled datasets.

Conclusion: Machine Unlearning is crucial for achieving ethical AI and building trust in data-driven systems, despite computational challenges, with promising frameworks emerging for practical implementation.

Abstract: Machine Learning models thrive on vast datasets, continuously adapting to provide accurate predictions and recommendations. However, in an era dominated by privacy concerns, Machine Unlearning emerges as a transformative approach, enabling the selective removal of data from trained models. This paper examines methods such as Naive Retraining and Exact Unlearning via the SISA framework, evaluating their computational costs, consistency, and feasibility using the HSpam14 dataset. We explore the potential of integrating unlearning principles into Positive Unlabeled (PU) Learning to address challenges posed by partially labeled datasets. Our findings highlight the promise of unlearning frameworks like DaRE for ensuring privacy compliance while maintaining model performance, albeit with significant computational trade-offs. This study underscores the importance of Machine Unlearning in achieving ethical AI and fostering trust in data-driven systems.
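
A minimal sketch of the SISA idea using scikit-learn models as shard learners: shard the data, train one model per shard, and retrain only the affected shard when a point must be unlearned. The real framework also slices within shards and keeps checkpoint bookkeeping that this toy omits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SISAEnsemble:
    """Minimal SISA sketch: per-shard models aggregated by majority vote."""

    def __init__(self, n_shards=5):
        self.n_shards = n_shards

    def fit(self, X, y):
        self.X, self.y = X, y
        self.shards = [list(idx) for idx in np.array_split(np.arange(len(X)), self.n_shards)]
        self.models = [self._train(s) for s in self.shards]
        return self

    def _train(self, shard_idx):
        return LogisticRegression(max_iter=1000).fit(self.X[shard_idx], self.y[shard_idx])

    def unlearn(self, sample_idx):
        for k, shard in enumerate(self.shards):
            if sample_idx in shard:
                shard.remove(sample_idx)
                self.models[k] = self._train(shard)   # only this shard is retrained
                return

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models])
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# Toy usage
X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)
ens = SISAEnsemble(n_shards=4).fit(X, y)
ens.unlearn(sample_idx=17)                            # cheap deletion: one shard retrains
print(ens.predict(X[:5]))
```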

[396] OIPR: Evaluation for Time-series Anomaly Detection Inspired by Operator Interest

Yuhan Jing, Jingyu Wang, Lei Zhang, Haifeng Sun, Bo He, Zirui Zhuang, Chengsen Wang, Qi Qi, Jianxin Liao

Main category: cs.LG

TL;DR: OIPR introduces area-based metrics for time-series anomaly detection evaluation that models operator response to alarms, balancing point and event perspectives to overcome limitations of existing evaluators.

DetailsMotivation: Existing TAD evaluators have limitations: point-based metrics overestimate detectors good at long anomalies, while event-based metrics are misled by fragmented detection results. There's a need for better evaluation that captures temporal context and operator handling processes.

Method: Proposes OIPR (Operator Interest-based Precision and Recall metrics) with area-based metrics that model the process of operators receiving detector alarms and handling anomalies. Uses area under the operator interest curve to evaluate TAD performance. Also builds a special scenario dataset for evaluator comparison.

Result: OIPR demonstrates remarkable performance in extreme and complex scenarios on both special scenario dataset and five real-world datasets. It achieves balance between point and event perspectives, overcoming their primary limitations and offering broader applicability.

Conclusion: OIPR provides a novel area-based TAD evaluation approach that better captures operator handling processes, addresses limitations of existing point/event-based metrics, and offers more balanced and applicable performance assessment for time-series anomaly detection systems.

Abstract: With the growing adoption of time-series anomaly detection (TAD) technology, numerous studies have employed deep learning-based detectors to analyze time-series data in the fields of Internet services, industrial systems, and sensors. The selection and optimization of anomaly detectors strongly rely on the availability of an effective evaluation for TAD performance. Since anomalies in time-series data often manifest as a sequence of points, conventional metrics that solely consider the detection of individual points are inadequate. Existing TAD evaluators typically employ point-based or event-based metrics to capture the temporal context. However, point-based evaluators tend to overestimate detectors that excel only in detecting long anomalies, while event-based evaluators are susceptible to being misled by fragmented detection results. To address these limitations, we propose OIPR (Operator Interest-based Precision and Recall metrics), a novel TAD evaluator with area-based metrics. It models the process of operators receiving detector alarms and handling anomalies, utilizing area under the operator interest curve to evaluate TAD performance. Furthermore, we build a special scenario dataset to compare the characteristics of different evaluators. Through experiments conducted on the special scenario dataset and five real-world datasets, we demonstrate the remarkable performance of OIPR in extreme and complex scenarios. It achieves a balance between point and event perspectives, overcoming their primary limitations and offering applicability to broader situations.

[397] Knowledge Adaptation as Posterior Correction

Mohammad Emtiyaz Khan

Main category: cs.LG

TL;DR: The paper proposes “posterior correction” as a unifying principle for adaptation in AI, showing that many existing methods (continual learning, federated learning, unlearning, model merging) can be viewed as correcting old posterior distributions when new information arrives.

DetailsMotivation: Current AI models lack the fast adaptability of humans and animals. The paper aims to understand the mechanisms that enable rapid adaptation in machines, seeking a unifying principle that explains various adaptation methods.

Method: The authors cast adaptation as “correction” of old posteriors, deriving posterior correction using the dual representation of the Bayesian Learning Rule. They quantify interference between old representations and new information using natural-gradient mismatch, showing how this principle applies across multiple adaptation settings.

Result: The paper demonstrates that a wide variety of existing adaptation methods follow the posterior correction principle. More accurate posteriors lead to smaller corrections and enable faster adaptation. The authors provide multiple examples showing how machines can learn to adapt quickly using this approach.

Conclusion: Posterior correction provides a unifying framework for understanding adaptation in AI systems. By viewing adaptation as correction of old posteriors, the paper offers insights into how machines can achieve faster, more human-like adaptability across diverse settings.

Abstract: Adaptation is the holy grail of intelligence, but even the best AI models lack the adaptability of toddlers. In spite of great progress, little is known about the mechanisms by which machines can learn to adapt as fast as humans and animals. Here, we cast adaptation as 'correction' of old posteriors and show that a wide variety of existing adaptation methods follow this very principle, including those used for continual learning, federated learning, unlearning, and model merging. In all these settings, more accurate posteriors often lead to smaller corrections and can enable faster adaptation. Posterior correction is derived by using the dual representation of the Bayesian Learning Rule of Khan and Rue (2023), where the interference between the old representation and new information is quantified by using the natural-gradient mismatch. We present many examples demonstrating how machines can learn to adapt quickly by using posterior correction.

[398] Representation Retrieval Learning for Heterogeneous Data Integration

Qi Xu, Annie Qu

Main category: cs.LG

TL;DR: R2 framework integrates representation learning modules with data-specific learners to handle heterogeneous datasets with covariate shift, posterior drift, and missingness, using Selective Integration Penalty to improve predictive performance.

DetailsMotivation: Large-scale multi-source datasets often exhibit complex heterogeneity (covariate shift, posterior drift, blockwise missingness) that worsens predictive performance of existing supervised learning algorithms.

Method: Proposes Representation Retrieval (R2) framework with dictionary of representation learning modules (representer dictionary) and data source-specific sparsity-induced learners. Introduces Selective Integration Penalty (SIP) to encourage more integrative representers.

Result: Theoretical analysis shows excess risk bound characterized by representer integrativeness, with SIP effectively improving excess risk. Simulations validate superior performance, and real-world applications confirm empirical success.

Conclusion: R2 framework effectively addresses complex heterogeneity in multi-source datasets through representation integration and selective penalty, improving predictive performance theoretically and empirically.

Abstract: In the era of big data, large-scale, multi-source, multi-modality datasets are increasingly ubiquitous, offering unprecedented opportunities for predictive modeling and scientific discovery. However, these datasets often exhibit complex heterogeneity, such as covariate shift, posterior drift, and blockwise missingness, which worsen the predictive performance of existing supervised learning algorithms. To address these challenges simultaneously, we propose a novel Representation Retrieval (R2) framework, which integrates a dictionary of representation learning modules (representer dictionary) with data source-specific sparsity-induced machine learning models (learners). Under the R2 framework, we introduce the notion of integrativeness for each representer, and propose a novel Selective Integration Penalty (SIP) to explicitly encourage more integrative representers to improve predictive performance. Theoretically, we show that the excess risk bound of the R2 framework is characterized by the integrativeness of representers, and SIP effectively improves the excess risk. Extensive simulation studies validate the superior performance of the R2 framework and the effect of SIP. We further apply our method to two real-world datasets to confirm its empirical success.

[399] Elucidated Rolling Diffusion Models for Probabilistic Forecasting of Complex Dynamics

Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu

Main category: cs.LG

TL;DR: ERDM combines rolling forecast structure with Elucidated Diffusion Models for probabilistic forecasting, outperforming baselines on fluid dynamics and weather prediction tasks.

DetailsMotivation: Current diffusion models for high-dimensional forecasting predict future states individually, failing to model complex temporal dependencies and uncertainty propagation. Rolling diffusion frameworks address this but haven't been successfully integrated with state-of-the-art diffusion techniques like EDM.

Method: ERDM adapts EDM components (noise schedule, network preconditioning, Heun sampler) to rolling forecasts with three key innovations: novel loss weighting focusing on mid-range horizons, efficient initialization using pre-trained EDM, and hybrid sequence architecture for spatiotemporal feature extraction during progressive denoising.

Result: ERDM consistently outperforms diffusion-based baselines including conditional autoregressive EDM on 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5-degree resolution.

Conclusion: ERDM provides a flexible, powerful framework for diffusion-based dynamics forecasting where modeling uncertainty propagation is crucial, successfully unifying rolling forecast structure with principled EDM design.

Abstract: Diffusion models are a powerful tool for probabilistic forecasting, yet most applications in high-dimensional complex systems predict future states individually. This approach struggles to model complex temporal dependencies and fails to explicitly account for the progressive growth of uncertainty inherent to the systems. While rolling diffusion frameworks, which apply increasing noise to forecasts at longer lead times, have been proposed to address this, their integration with state-of-the-art, high-fidelity diffusion techniques remains a significant challenge. We tackle this problem by introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to successfully unify a rolling forecast structure with the principled, performant design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM components (noise schedule, network preconditioning, and Heun sampler) to the rolling forecast setting. The success of this integration is driven by three key contributions: (i) a novel loss weighting scheme that focuses model capacity on the mid-range forecast horizons where determinism gives way to stochasticity; (ii) an efficient initialization strategy using a pre-trained EDM for the initial window; and (iii) a bespoke hybrid sequence architecture for robust spatiotemporal feature extraction under progressive denoising. On 2D Navier-Stokes simulations and ERA5 global weather forecasting at 1.5-degree resolution, ERDM consistently outperforms key diffusion-based baselines, including conditional autoregressive EDM. ERDM offers a flexible and powerful general framework for tackling diffusion-based dynamics forecasting problems where modeling uncertainty propagation is paramount.
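
To illustrate the rolling-noise idea, here is a sketch that maps EDM-style sigma spacing across the lead times of a forecast window, so near-term frames stay nearly clean while distant frames approach pure noise; the specific sigma range, rho, and loss weighting in the paper may differ from the assumed values below.

```python
import torch

def rolling_noise_levels(window, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Sketch of a rolling noise schedule: noise level grows with lead time,
    using the standard EDM sigma spacing as the interpolation rule."""
    t = torch.linspace(0, 1, window)               # 0 = nearest frame, 1 = farthest
    a, b = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return (a + t * (b - a)) ** rho

# Toy usage: noise a window of frames with lead-time-dependent sigmas
sigmas = rolling_noise_levels(window=8)
frames = torch.randn(8, 3, 64, 64)                 # (lead_time, C, H, W) toy window
noisy = frames + sigmas.view(-1, 1, 1, 1) * torch.randn_like(frames)
```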

[400] TabKAN: Advancing Tabular Data Analysis using Kolmogorov-Arnold Network

Ali Eslamian, Alireza Afzal Aghaei, Qiang Cheng

Main category: cs.LG

TL;DR: TabKAN introduces a novel KAN-based framework for tabular data that outperforms traditional ML and deep learning models while offering better interpretability.

DetailsMotivation: Tabular data analysis faces challenges with heterogeneous features, missing values, and complex interactions. Traditional ML (like gradient boosting) often beats deep learning, but recent neural architectures show promise. The paper aims to bridge the gap between traditional ML and deep learning for structured data.

Method: TabKAN uses Kolmogorov-Arnold Networks (KANs) with learnable activation functions on edges instead of fixed ones. It incorporates modular KAN-based architectures for tabular analysis, proposes a transfer learning framework for cross-domain knowledge transfer, and develops a model-specific interpretability approach that reduces reliance on post hoc explanations.

Result: Extensive experiments on public datasets show TabKAN achieves superior performance in supervised learning and significantly outperforms both classical models and Transformer-based models in binary and multi-class classification tasks.

Conclusion: TabKAN demonstrates the potential of KAN-based architectures to bridge the gap between traditional machine learning and deep learning for structured tabular data, offering better performance, interpretability, and training efficiency.

Abstract: Tabular data analysis presents unique challenges that arise from heterogeneous feature types, missing values, and complex feature interactions. While traditional machine learning methods like gradient boosting often outperform deep learning, recent advancements in neural architectures offer promising alternatives. In this study, we introduce TabKAN, a novel framework for tabular data modeling based on Kolmogorov-Arnold Networks (KANs). Unlike conventional deep learning models, KANs use learnable activation functions on edges, which improves both interpretability and training efficiency. TabKAN incorporates modular KAN-based architectures designed for tabular analysis and proposes a transfer learning framework for knowledge transfer across domains. Furthermore, we develop a model-specific interpretability approach that reduces reliance on post hoc explanations. Extensive experiments on public datasets show that TabKAN achieves superior performance in supervised learning and significantly outperforms classical and Transformer-based models in binary and multi-class classification. The results demonstrate the potential of KAN-based architectures to bridge the gap between traditional machine learning and deep learning for structured data.
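
A minimal sketch of the KAN building block, learnable univariate functions on edges rather than fixed activations on nodes; here each edge function is a combination of Gaussian basis bumps instead of the splines typically used, so this is an illustration of the idea rather than TabKAN's architecture.

```python
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    """Minimal KAN-style layer sketch: each edge (input i -> output j) carries
    its own learnable univariate function built from Gaussian basis bumps."""

    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-2, 2, n_basis))
        # one coefficient vector per edge: (out_dim, in_dim, n_basis)
        self.coef = nn.Parameter(torch.randn(out_dim, in_dim, n_basis) * 0.1)

    def forward(self, x):                          # x: (batch, in_dim)
        # Gaussian basis evaluated per input feature: (batch, in_dim, n_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2))
        # apply each edge's univariate function, then sum contributions over inputs
        return torch.einsum("bik,oik->bo", phi, self.coef)

layer = TinyKANLayer(4, 3)
print(layer(torch.randn(16, 4)).shape)             # torch.Size([16, 3])
```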

[401] Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting

Yuxuan Shu, Vasileios Lampos

Main category: cs.LG

TL;DR: Sonnet: A novel transformer-based architecture using learnable wavelet transforms and Koopman operator spectral analysis with Multivariable Coherence Attention for improved time series forecasting.

DetailsMotivation: Standard transformers struggle to model complex relationships among variables in multivariable time series forecasting, despite their ability to capture long-range dependencies. There's a need for better modeling of variable dependencies over time.

Method: Proposes Spectral Operator Neural Network (Sonnet) that applies learnable wavelet transformations to input and incorporates spectral analysis using the Koopman operator. Key innovation is Multivariable Coherence Attention (MVCA) that leverages spectral coherence to model variable dependencies.

Result: Sonnet achieves best performance on 34 out of 47 forecasting tasks with average MAE reduction of 2.2% against most competitive baseline. MVCA alone reduces MAE by 10.7% on average in most challenging forecasting tasks when applied to various deep learning models.

Conclusion: Sonnet effectively addresses transformer limitations in modeling variable dependencies through spectral analysis and coherence-based attention, demonstrating significant improvements in multivariable time series forecasting accuracy.

Abstract: Multivariable time series forecasting methods can integrate information from exogenous variables, leading to significant prediction accuracy gains. The transformer architecture has been widely applied in various time series forecasting models due to its ability to capture long-range sequential dependencies. However, a naïve application of transformers often struggles to effectively model complex relationships among variables over time. To mitigate against this, we propose a novel architecture, termed Spectral Operator Neural Network (Sonnet). Sonnet applies learnable wavelet transformations to the input and incorporates spectral analysis using the Koopman operator. Its predictive skill relies on the Multivariable Coherence Attention (MVCA), an operation that leverages spectral coherence to model variable dependencies. Our empirical analysis shows that Sonnet yields the best performance on $34$ out of $47$ forecasting tasks with an average mean absolute error (MAE) reduction of $2.2%$ against the most competitive baseline. We further show that MVCA can remedy the deficiencies of naïve attention in various deep learning models, reducing MAE by $10.7%$ on average in the most challenging forecasting tasks.

[402] Understanding the Implicit Regularization of Gradient Descent in Over-parameterized Models

Jianhao Ma, Geyu Liang, Salar Fattahi

Main category: cs.LG

TL;DR: IPGD algorithm uses infinitesimal perturbations to ensure gradient descent converges to low-dimensional solutions in over-parameterized settings.

DetailsMotivation: Implicit regularization in local search algorithms leads to low-dimensional solutions without explicit enforcement, but the underlying mechanisms are poorly understood, especially in over-parameterized problems.

Method: Analyze gradient descent dynamics requiring three conditions: suitable initialization, efficient saddle point escape, and sustained proximity to low-dimensional region. Introduce Infinitesimally Perturbed Gradient Descent (IPGD) using infinitesimal perturbations and small deviation rate to satisfy these conditions.

Result: Theoretical guarantees for IPGD in over-parameterized matrix sensing and empirical evidence of broader applicability beyond matrix sensing.

Conclusion: IPGD provides a principled approach to achieve implicit regularization through controlled perturbations, offering theoretical understanding and practical algorithm for finding low-dimensional solutions in over-parameterized optimization.

Abstract: Implicit regularization refers to the tendency of local search algorithms to converge to low-dimensional solutions, even when such structures are not explicitly enforced. Despite its ubiquity, the mechanism underlying this behavior remains poorly understood, particularly in over-parameterized settings. We analyze gradient descent dynamics and identify three conditions under which it converges to second-order stationary points within an implicit low-dimensional region: (i) suitable initialization, (ii) efficient escape from saddle points, and (iii) sustained proximity to the region. We show that these can be achieved through infinitesimal perturbations and a small deviation rate. Building on this, we introduce Infinitesimally Perturbed Gradient Descent (IPGD), which satisfies these conditions under mild assumptions. We provide theoretical guarantees for IPGD in over-parameterized matrix sensing and empirical evidence of its broader applicability.
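
A sketch of the perturbed-gradient-descent idea under the stated assumptions: plain gradient descent plus a vanishingly small Gaussian perturbation at every step. The perturbation magnitude and schedule used by IPGD in the paper may differ; `sigma` below is an assumed constant.

```python
import numpy as np

def ipgd(grad, x0, lr=1e-2, sigma=1e-8, n_steps=1000, rng=None):
    """Sketch of Infinitesimally Perturbed Gradient Descent: GD with a tiny
    Gaussian perturbation each step, which helps escape strict saddle points."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad(x) + sigma * rng.standard_normal(x.shape)
    return x

# Toy usage: f(x) = x0^2/2 - x1^2/2 has a strict saddle at the origin;
# the infinitesimal noise lets the iterate escape along the x1 direction.
x_final = ipgd(lambda x: np.array([x[0], -x[1]]), x0=[0.0, 0.0])
print(x_final)
```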

[403] Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu

Main category: cs.LG

TL;DR: Prot2Token is a unified framework that converts diverse protein prediction tasks into standardized next-token prediction format using autoregressive decoders, achieving speedups up to 1000x faster than AlphaFold2 while matching specialized methods.

DetailsMotivation: Traditional protein prediction requires specialized models for different tasks, which hinders development of broadly applicable and computationally efficient Protein Language Models (PLMs). There's a need for a unified approach.

Method: Prot2Token converts various protein predictions (sequence-level properties, residue-specific attributes, inter-protein interactions) into standardized next-token prediction format. Uses autoregressive decoder conditioned on pre-trained protein encoder embeddings and learnable task tokens, with multi-task learning across five categories. Includes auxiliary self-supervised decoder pre-training for spatially sensitive tasks.

Result: Achieves up to 1000x faster inference than AlphaFold2 with MSA on same hardware for 3D structure prediction. Matches or surpasses specialized methods across numerous other protein prediction tasks. Demonstrates strong predictive power across diverse benchmarks.

Conclusion: Prot2Token offers a step toward standardizing biological prediction into a generative interface, promising to accelerate biological discovery and therapeutic development by unifying diverse protein prediction tasks into a single framework.

Abstract: The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions-from sequence-level properties and residue-specific attributes to complex inter-protein interactions-into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling general-purpose decoders to generalize across five distinct categories. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token’s predictive power in different types of protein-prediction tasks. In 3D structure prediction, Prot2Token delivers substantial speedups (up to 1000x faster than AlphaFold2 with MSA on the same hardware) while, across other numerous tasks, matching or surpassing specialized methods. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a step towards standardizing biological prediction into a generative interface, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at https://github.com/mahdip72/prot2token .

[404] Language Models for Controllable DNA Sequence Design

Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji

Main category: cs.LG

TL;DR: ATGC-Gen is a transformer-based model for controllable DNA sequence generation that integrates biological signals through cross-modal encoding, achieving improved controllability and functional relevance compared to prior methods.

DetailsMotivation: While language models like GPT and BERT have succeeded in natural language generation, their application to DNA sequence generation remains underexplored. There's a need for models that can generate DNA sequences conditioned on specific biological properties for programmable genomic design.

Method: ATGC-Gen uses cross-modal encoding to integrate diverse biological signals and is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training under autoregressive or masked recovery objectives.

Result: The model generates fluent, diverse, and biologically relevant sequences aligned with desired properties. It achieves notable improvements in controllability and functional relevance compared to prior methods on tasks including promoter and enhancer sequence design.

Conclusion: ATGC-Gen demonstrates the potential of language models in advancing programmable genomic design, with the source code released for community use.

Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen).

[405] Curse of Slicing: Why Sliced Mutual Information is a Deceptive Measure of Statistical Dependence

Alexander Semenenko, Ivan Butakov, Alexey Frolov, Ivan Oseledets

Main category: cs.LG

TL;DR: SMI has scalability advantages but suffers from serious vulnerabilities: it saturates easily, fails to detect increasing dependence, prioritizes redundancy over information, and can perform worse than correlation.

DetailsMotivation: To expose critical vulnerabilities in Sliced Mutual Information (SMI), which is widely adopted as a scalable alternative to mutual information for measuring non-linear statistical dependence.

Method: Extensive benchmarking and theoretical analysis to demonstrate SMI’s susceptibility to data manipulation and counterintuitive behavior.

Result: SMI saturates easily, fails to detect increases in statistical dependence, prioritizes redundancy over informative content, and in some cases performs worse than the correlation coefficient.

Conclusion: Despite SMI’s advantages (faster convergence, robustness to high dimensionality, nullification only under independence), it has serious limitations that make it vulnerable to manipulation and unreliable for measuring statistical dependence.

Abstract: Sliced Mutual Information (SMI) is widely used as a scalable alternative to mutual information for measuring non-linear statistical dependence. Despite its advantages, such as faster convergence, robustness to high dimensionality, and nullification only under statistical independence, we demonstrate that SMI is highly susceptible to data manipulation and exhibits counterintuitive behavior. Through extensive benchmarking and theoretical analysis, we show that SMI saturates easily, fails to detect increases in statistical dependence, prioritizes redundancy over informative content, and in some cases, performs worse than correlation coefficient.
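
To make the quantity concrete, sliced mutual information is typically estimated by averaging a one-dimensional MI estimate over random projection directions. The sketch below assumes a k-NN based scalar MI estimator from scikit-learn and an arbitrary number of slices; neither choice comes from the paper.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def sliced_mi(X, Y, n_slices=512, seed=0):
    """Monte-Carlo estimate of Sliced Mutual Information: average the MI
    between random 1D projections of X and Y over uniformly drawn directions."""
    rng = np.random.default_rng(seed)
    dx, dy = X.shape[1], Y.shape[1]
    total = 0.0
    for _ in range(n_slices):
        theta = rng.normal(size=dx); theta /= np.linalg.norm(theta)
        phi = rng.normal(size=dy);   phi /= np.linalg.norm(phi)
        # Any scalar MI estimator works here; this one is k-NN based.
        total += mutual_info_regression((X @ theta).reshape(-1, 1), Y @ phi)[0]
    return total / n_slices
```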

[406] PRISM: Lightweight Multivariate Time-Series Classification through Symmetric Multi-Resolution Convolutional Layers

Federico Zucchi, Thomas Lampert

Main category: cs.LG

TL;DR: PRISM is a lightweight convolutional classifier for multivariate time series that uses symmetric multi-resolution filters inspired by signal processing, achieving state-of-the-art performance with far fewer parameters and computational cost.

DetailsMotivation: Current Transformer and CNN models for multivariate time series classification are computationally heavy and parameter-intensive, creating a need for more efficient models that can capture both short-term patterns and longer-range dependencies.

Method: PRISM uses channel-independent processing with multi-resolution symmetric convolutional filters inspired by linear-phase FIR filters from classical signal processing. This symmetry halves learnable parameters in initial layers while preserving full receptive field.

Result: PRISM matches or outperforms state-of-the-art CNN and Transformer models across diverse benchmarks including UEA multivariate time-series archive, human activity recognition, sleep staging, and biomedical signals, while using significantly fewer parameters and lower computational cost.

Conclusion: By integrating signal processing principles into neural architecture, PRISM provides an effective and computationally economical solution for multivariate time series classification, offering a lightweight alternative to heavy Transformer and CNN models.

Abstract: Multivariate time series classification supports applications from wearable sensing to biomedical monitoring and demands models that can capture both short-term patterns and longer-range temporal dependencies. Despite recent advances, Transformer and CNN models often remain computationally heavy and rely on many parameters. This work presents PRISM(Per-channel Resolution Informed Symmetric Module), a lightweight fully convolutional classifier. Operating in a channel-independent manner, in its early stage it applies a set of multi-resolution symmetric convolutional filters. This symmetry enforces structural constraints inspired by linear-phase FIR filters from classical signal processing, effectively halving the number of learnable parameters within the initial layers while preserving the full receptive field. Across the diverse UEA multivariate time-series archive as well as specific benchmarks in human activity recognition, sleep staging, and biomedical signals, PRISM matches or outperforms state-of-the-art CNN and Transformer models while using significantly fewer parameters and markedly lower computational cost. By bringing a principled signal processing prior into a modern neural architecture, PRISM offers an effective and computationally economical solution for multivariate time series classification.
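
A minimal sketch of the symmetry constraint described above, assuming a depthwise 1D convolution in PyTorch: only the first half of each kernel (plus the center tap) is learned and the rest is mirrored, which is what halves the parameter count while keeping the full receptive field. The layer name and kernel length are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricConv1d(nn.Module):
    """Depthwise 1D convolution whose kernels are constrained to be symmetric,
    mimicking linear-phase FIR filters: only ceil(k/2) taps are learned."""
    def __init__(self, channels, kernel_size=31):
        super().__init__()
        assert kernel_size % 2 == 1, "odd length keeps a well-defined center tap"
        self.half = nn.Parameter(torch.randn(channels, 1, kernel_size // 2 + 1) * 0.02)
        self.kernel_size = kernel_size
        self.channels = channels

    def forward(self, x):  # x: (batch, channels, time)
        # Mirror the learned taps to build the full symmetric kernel.
        left = self.half[..., :-1]
        kernel = torch.cat([left, self.half.flip(-1)], dim=-1)
        # Channel-independent convolution with 'same' padding.
        return F.conv1d(x, kernel, padding=self.kernel_size // 2, groups=self.channels)
```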

[407] Schauder Bases for $C[0, 1]$ Using ReLU, Softplus and Two Sigmoidal Functions

Anand Ganesh, Babhrubahan Bose, Anand Rajagopalan

Main category: cs.LG

TL;DR: The paper constructs four Schauder bases for C[0,1] using ReLU, Softplus, and their sigmoidal variants, establishing for the first time that such bases exist and strengthening the associated universal approximation results.

DetailsMotivation: To establish the existence of Schauder bases for C[0,1] using popular activation functions (ReLU, Softplus) and their variants, improving upon existing universal approximation results and providing theoretical foundations for neural network approximation theory.

Method: Constructs four explicit Schauder bases: one using ReLU functions, another using Softplus functions, and two more using sigmoidal versions of ReLU and Softplus. The construction provides explicit basis representations for continuous functions on [0,1].

Result: Constructs four Schauder bases for C[0,1], establishing for the first time that bases built from these functions exist, proves an O(1/n) approximation bound for the ReLU basis, and gives a negative result on representing multivariate functions with finite combinations of ReLU functions.

Conclusion: The paper provides rigorous mathematical foundations for approximation using popular activation functions, demonstrating both positive results (basis existence, approximation bounds) and limitations (multivariate construction constraints), advancing theoretical understanding of neural network approximation capabilities.

Abstract: We construct four Schauder bases for the space $C[0,1]$, one using ReLU functions, another using Softplus functions, and two more using sigmoidal versions of the ReLU and Softplus functions. This establishes the existence of a basis using these functions for the first time, and improves on the universal approximation property associated with them. We also show an $O(\frac{1}{n})$ approximation bound based on our ReLU basis, and a negative result on constructing multivariate functions using finite combinations of ReLU functions.
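
As background for how ReLU functions can generate a basis (a standard identity, not a construction taken from the paper): each Faber-Schauder "hat" function on [0,1] is a finite combination of three ReLUs.

```latex
% Hat (tent) function with nodes a < m < b: equals 1 at m and 0 outside [a, b].
\[
  h_{a,m,b}(x) \;=\; \frac{\mathrm{ReLU}(x-a)}{m-a}
  \;-\;\left(\frac{1}{m-a}+\frac{1}{b-m}\right)\mathrm{ReLU}(x-m)
  \;+\;\frac{\mathrm{ReLU}(x-b)}{b-m}.
\]
```

Dyadic hats of this form (together with the constant and identity functions) make up the classical Schauder system for C[0,1], which is the natural starting point for ReLU-based constructions of this kind.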

[408] Amortized Bayesian Meta-Learning for Low-Rank Adaptation of Large Language Models

Liyi Zhang, Jake Snell, Thomas L. Griffiths

Main category: cs.LG

TL;DR: ABMLL is an efficient Bayesian meta-learning method for fine-tuning LLMs with LoRA that improves generalization and uncertainty quantification while maintaining computational efficiency.

DetailsMotivation: Existing methods for improving generalization of fine-tuned LLMs (like optimizing in-context prompts or meta-learning) are expensive in memory and computation, requiring long-context prompts or second-order gradient updates.

Method: Amortized Bayesian Meta-Learning for LoRA (ABMLL) adapts amortized Bayesian meta-learning to LLMs by reframing task-specific and global parameters in the LoRA context, using a hyperparameter to balance reconstruction accuracy and parameter fidelity.

Result: ABMLL outperforms existing methods on CrossFit and Unified-QA datasets in terms of both accuracy and expected calibration error, scales to large models like LLAMA3-8B, and provides improved uncertainty quantification.

Conclusion: ABMLL offers an effective, computationally efficient approach for improving generalization of fine-tuned LLMs while providing better uncertainty quantification through its Bayesian framework.

Abstract: Fine-tuning large language models (LLMs) with low-rank adaptation (LoRA) is a cost-effective way to incorporate information from a specific dataset. However, it is often unclear how well the fine-tuned LLM will generalize, i.e., how well it will perform on unseen datasets. Methods have been proposed to improve generalization by optimizing in-context prompts, or by using meta-learning to fine-tune LLMs. However, these methods are expensive in memory and computation, requiring either long-context prompts or saving copies of parameters and using second-order gradient updates. To address these challenges, we propose Amortized Bayesian Meta-Learning for LoRA (ABMLL). This method builds on amortized Bayesian meta-learning for smaller models, adapting this approach to LLMs while maintaining its computational efficiency. We reframe task-specific and global parameters in the context of LoRA and use a new hyperparameter to balance reconstruction accuracy and the fidelity of task-specific parameters to the global ones. ABMLL provides effective generalization and scales to large models such as LLAMA3-8B. Furthermore, as a result of using a Bayesian framework, ABMLL provides improved uncertainty quantification. We test ABMLL on CrossFit and Unified-QA datasets and find that it outperforms existing methods on these benchmarks in terms of both accuracy and expected calibration error.

[409] Measuring the Measures: Discriminative Capacity of Representational Similarity Metrics Across Model Families

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Main category: cs.LG

TL;DR: Systematic comparison of representational similarity metrics shows that separability increases with stricter alignment constraints, with soft-matching performing best among mapping-based methods.

DetailsMotivation: Representational similarity metrics are widely used in neuroscience and AI, but there's a lack of systematic comparisons of their discriminative power across different model families and training regimes.

Method: Introduced a quantitative framework to evaluate similarity metrics using three separability measures (d-prime, silhouette coefficients, ROC-AUC). Tested metrics including RSA, linear predictivity, Procrustes, and soft matching across diverse model architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised).

Result: Separability systematically increases as metrics impose more stringent alignment constraints. Soft-matching achieved highest separability among mapping-based approaches, followed by Procrustes alignment and linear predictivity. Non-fitting methods like RSA also showed strong separability across families.

Conclusion: Provides the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.

Abstract: Representational similarity metrics are fundamental tools in neuroscience and AI, yet we lack systematic comparisons of their discriminative power across model families. We introduce a quantitative framework to evaluate representational similarity measures based on their ability to separate model families across architectures (CNNs, Vision Transformers, Swin Transformers, ConvNeXt) and training regimes (supervised vs. self-supervised). Using three complementary separability measures (d-prime from signal detection theory, silhouette coefficients, and ROC-AUC), we systematically assess the discriminative capacity of commonly used metrics including RSA, linear predictivity, Procrustes, and soft matching. We show that separability systematically increases as metrics impose more stringent alignment constraints. Among mapping-based approaches, soft-matching achieves the highest separability, followed by Procrustes alignment and linear predictivity. Non-fitting methods such as RSA also yield strong separability across families. These results provide the first systematic comparison of similarity metrics through a separability lens, clarifying their relative sensitivity and guiding metric choice for large-scale model and brain comparisons.
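
For concreteness, a small sketch of one of the separability measures named above, d-prime, applied to similarity scores; the score values are made up for illustration.

```python
import numpy as np

def dprime(within, between):
    """Signal-detection d': separation between two score distributions
    (here, metric values for same-family vs. different-family model pairs)."""
    within, between = np.asarray(within), np.asarray(between)
    pooled_sd = np.sqrt(0.5 * (within.var(ddof=1) + between.var(ddof=1)))
    return (within.mean() - between.mean()) / pooled_sd

# Example: similarity scores produced by any metric (RSA, Procrustes, ...).
same_family = [0.81, 0.78, 0.85, 0.80]   # pairs from the same architecture family
diff_family = [0.55, 0.60, 0.52, 0.58]   # pairs from different families
print(dprime(same_family, diff_family))  # larger => the metric separates families better
```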

[410] Graph Coloring for Multi-Task Learning

Santosh Patapati

Main category: cs.LG

TL;DR: SON-GOKU is a scheduler that uses gradient interference analysis and graph coloring to partition tasks into compatible groups, updating only one group per training step to reduce interference and improve multi-task learning performance.

DetailsMotivation: In multi-task learning, conflicting objectives cause gradient interference that slows convergence and reduces final model performance. Existing methods struggle with dynamically changing task relationships during training.

Method: Computes gradient interference between tasks, constructs an interference graph, applies greedy graph-coloring to partition tasks into compatible groups, and activates only one group per training step while constantly recomputing groupings as task relationships evolve.

Result: Empirical results on six datasets show consistent outperformance over baselines and state-of-the-art multi-task optimizers. The method improves effectiveness of any underlying optimizer without additional tuning.

Conclusion: Grouping tasks by gradient compatibility and sequential updates significantly improves multi-task learning performance. Theoretical guarantees show descent, convergence, and accurate identification of conflicting/aligning tasks.

Abstract: When different objectives conflict with each other in multi-task learning, gradients begin to interfere and slow convergence, thereby potentially reducing the final model’s performance. To address this, we introduce SON-GOKU, a scheduler that computes gradient interference, constructs an interference graph, and then applies greedy graph-coloring to partition tasks into groups that align well with each other. At each training step, only one group (color class) of tasks are activated, and the grouping partition is constantly recomputed as task relationships evolve throughout training. By ensuring that each mini-batch contains only tasks that pull the model in the same direction, our method improves the effectiveness of any underlying multi-task learning optimizer without additional tuning. Since tasks within these groups will update in compatible directions, multi-task learning will improve model performance rather than impede it. Empirical results on six different datasets show that this interference-aware graph-coloring approach consistently outperforms baselines and state-of-the-art multi-task optimizers. We provide extensive theory showing why grouping and sequential updates improve multi-task learning, with guarantees on descent, convergence, and accurately identifying what tasks conflict or align.
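
A minimal sketch of the interference-graph and greedy-coloring step described above, assuming flattened per-task gradients are already available; the conflict threshold and the ordering heuristic are illustrative choices, not the paper's.

```python
import torch

def conflict_graph(task_grads, threshold=0.0):
    """Add an edge (i, j) when task gradients conflict (cosine similarity below threshold)."""
    n = len(task_grads)
    g = [gi / (gi.norm() + 1e-12) for gi in task_grads]
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if torch.dot(g[i], g[j]).item() < threshold:
                edges[i].add(j); edges[j].add(i)
    return edges

def greedy_coloring(edges):
    """Assign each task the smallest color not used by its conflicting neighbors."""
    colors = {}
    for v in sorted(edges, key=lambda v: -len(edges[v])):  # high-degree tasks first
        taken = {colors[u] for u in edges[v] if u in colors}
        colors[v] = next(c for c in range(len(edges)) if c not in taken)
    return colors  # tasks sharing a color are mutually compatible and can be updated together
```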

[411] Nonlinear Optimization with GPU-Accelerated Neural Network Constraints

Robert Parker, Oscar Dowson, Nicole LoGiudice, Manuel Garcia, Russell Bent

Main category: cs.LG

TL;DR: A reduced-space formulation for optimizing over trained neural networks that treats networks as “gray boxes” to accelerate optimization solves compared to full-space formulations.

DetailsMotivation: Optimizing over trained neural networks is computationally expensive in full-space formulations where all intermediate variables and constraints are exposed to the solver. There's a need for more efficient optimization methods that leverage GPU acceleration for neural network evaluations.

Method: Proposes a reduced-space formulation that treats neural networks as “gray boxes” - only exposing inputs and outputs to the optimization solver while keeping intermediate variables and constraints internal. The method evaluates network outputs and derivatives on GPU for efficiency.

Result: The reduced-space formulation leads to faster solves and fewer iterations in interior point methods compared to full-space formulations. Demonstrated effectiveness on two optimization problems: adversarial generation for MNIST classifiers and security-constrained optimal power flow with neural network surrogates.

Conclusion: The gray-box reduced-space formulation provides significant computational advantages for optimization over trained neural networks, enabling more efficient solves while leveraging GPU acceleration for network evaluations.

Abstract: We propose a reduced-space formulation for optimizing over trained neural networks where the network’s outputs and derivatives are evaluated on a GPU. To do this, we treat the neural network as a “gray box” where intermediate variables and constraints are not exposed to the optimization solver. Compared to the full-space formulation, in which intermediate variables and constraints are exposed to the optimization solver, the reduced-space formulation leads to faster solves and fewer iterations in an interior point method. We demonstrate the benefits of this method on two optimization problems: Adversarial generation for a classifier trained on MNIST images and security-constrained optimal power flow with transient feasibility enforced using a neural network surrogate.
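
A rough sketch of the "gray box" interface under stated assumptions: the solver only ever queries the network's outputs and Jacobian, both evaluated on the GPU, while the layer-by-layer variables never become optimization variables. The wrapper and its method names are hypothetical, not a specific solver API.

```python
import torch

class GrayBoxNet:
    """Wraps a trained network so an optimization solver only sees inputs and outputs:
    the solver queries y(u) and dy/du, and all intermediate activations stay hidden."""
    def __init__(self, net, device="cuda"):
        self.net = net.to(device).eval()
        self.device = device

    def value(self, u):
        u_t = torch.as_tensor(u, dtype=torch.float32, device=self.device)
        with torch.no_grad():
            return self.net(u_t).cpu().numpy()

    def jacobian(self, u):
        u_t = torch.as_tensor(u, dtype=torch.float32, device=self.device)
        J = torch.autograd.functional.jacobian(self.net, u_t)  # evaluated on the GPU
        return J.cpu().numpy()
```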

[412] Short window attention enables long-term memorization

Loïc Cabannes, Maximilian Beck, Gergely Szilvasy, Matthijs Douze, Maria Lomeli, Jade Copet, Pierre-Emmanuel Mazaré, Gabriel Synnaeve, Hervé Jégou

Main category: cs.LG

TL;DR: SWAX combines sliding-window attention with xLSTM layers, finding that smaller windows paradoxically improve long-context performance by forcing better use of xLSTM memory, and stochastic window training optimizes both short and long-context tasks.

DetailsMotivation: Hybrid architectures combining sliding window attention with linear RNN layers show promise but the impact of window length and interplay between attention and RNN mechanisms remains under-studied.

Method: Introduces SWAX architecture with sliding-window attention and xLSTM layers, using stochastic window size changes during training to force the model to leverage both longer context windows and xLSTM memory.

Result: Counter-intuitively, larger sliding windows don’t improve long-context performance; short windows encourage better xLSTM memory training. Stochastic window training significantly outperforms regular window attention on both short and long-context problems.

Conclusion: SWAX with stochastic window training effectively balances attention and RNN mechanisms, achieving superior performance across different context lengths by forcing the model to properly utilize both components.

Abstract: Recent works show that hybrid architectures combining sliding window softmax attention layers with linear recurrent neural network (RNN) layers outperform both of these architectures taken separately. However, the impact of the window length and the interplay between softmax attention and linear RNN layers remain under-studied. In this work, we introduce SWAX, a hybrid architecture consisting of sliding-window attention and xLSTM linear RNN layers. A counter-intuitive finding with SWAX is that larger sliding windows do not improve the long-context performance. In fact, short window attention encourages the model to better train the long-term memory of the xLSTM, by relying less on the softmax attention mechanism for long context-retrieval. The issue with small sliding windows is that they are detrimental for short-context tasks, which could be solved with information from moderately larger sliding windows otherwise. Therefore, we train SWAX by stochastically changing the sliding window size, forcing the model to leverage both a longer context window and the xLSTM memory. SWAX trained with stochastic window sizes significantly outperforms regular window attention both on short and long-context problems.
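
A sketch of the stochastic-window training loop in its simplest form; the window sizes and the model's `sliding_window` keyword are assumptions for illustration, not SWAX's actual interface.

```python
import random

WINDOW_CHOICES = [128, 512, 2048]   # illustrative sizes, not the paper's values

def training_step(model, batch, optimizer):
    # Resample the sliding-window length each step so the model cannot rely on a
    # large attention window alone and must also use the xLSTM long-term memory.
    window = random.choice(WINDOW_CHOICES)
    loss = model(batch, sliding_window=window).loss  # assumed model interface
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```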

[413] UniPruning: Unifying Local Metric and Global Feedback for Scalable Sparse LLMs

Yizhuo Ding, Wanying Qu, Jiawei Geng, Wenqi Shao, Yanwei Fu

Main category: cs.LG

TL;DR: UniPruning is a unified post-training pruning framework that combines local saliency metrics with global coordination using mirror descent optimization, enabling efficient sparsification of LLMs without weight updates.

DetailsMotivation: LLMs face prohibitive computational and memory costs. Existing pruning methods struggle to balance efficiency and robustness: local approaches collapse under high sparsity, while global methods are expensive or restrictive.

Method: UniPruning combines fast layer-wise scoring with a lightweight global controller using mirror descent optimization. It allocates a single sparsity budget, supports unstructured and semi-structured N:M pruning, and generates masks for arbitrary sparsity levels after brief calibration.

Result: Extensive experiments on multiple pretrained LLM families show UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies highlight the importance of mirror descent and local saliency anchoring.

Conclusion: UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs, balancing speed and stability without weight updates.

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face prohibitive computational and memory costs. Pruning offers a promising path by inducing sparsity while preserving architectural flexibility. However, existing methods struggle to balance efficiency and robustness: local metric approaches prune layer by layer but often collapse under high sparsity, whereas global feedback methods enforce consistency at the cost of expensive weight updates or restrictive semi-structured formats. We present UniPruning, a unified post-training pruning framework that combines the speed of local saliency metrics with the stability of global coordination, enabled by a mirror descent based optimization, all without updating model weights. UniPruning leverages fast layer-wise scoring and a lightweight global controller to allocate a single sparsity budget, supporting both unstructured and semi-structured N :M pruning within one framework. After a brief calibration, it can generate pruning masks for arbitrary sparsity levels in one shot, and adapts seamlessly to hardware-aware constraints. Extensive experiments on multiple pretrained LLM families and standard benchmarks show that UniPruning consistently delivers competitive or superior perplexity and zero-shot accuracy. Ablation studies further highlight the importance of mirror descent and local saliency anchoring. Overall, UniPruning provides an efficient, principled, and scalable solution for sparsifying large-scale LLMs. Our code is available at: https://github.com/RainbowQTT/UniPruning.

[414] Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment

Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang

Main category: cs.LG

TL;DR: Proposes RLHF-COV and DPO-COV algorithms to simultaneously address corrupted preferences, reward overoptimization, and verbosity bias in LLM alignment, with theoretical guarantees and experimental validation.

DetailsMotivation: Existing RLHF and DPO methods suffer from three key issues: corrupted preferences, reward overoptimization, and verbosity bias. Most works address only one issue, and solutions requiring multiple reward models are computationally expensive with no theoretical generalization guarantees.

Method: Proposes RLHF-COV and DPO-COV algorithms that simultaneously mitigate all three issues in both offline and online settings. DPO-COV is simple to implement without reward estimation and is theoretically equivalent to RLHF-COV. The approach includes length regularization and handles corrupted data.

Result: Theoretically demonstrates length-regularized generalization error rates for DPO-COV trained on corrupted data, matching best-known rates for clean data without length regularization. Experiments show effectiveness in both offline and online settings.

Conclusion: The proposed COV algorithms effectively address three major alignment issues simultaneously with theoretical guarantees, computational efficiency, and practical effectiveness, establishing equivalence between RLHF and DPO approaches.

Abstract: Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are important techniques to align large language models (LLM) with human preference. However, the quality of RLHF and DPO training is seriously compromised by \textit{\textbf{C}orrupted} preference, reward \textit{\textbf{O}veroptimization}, and bias towards \textit{\textbf{V}erbosity}. To our knowledge, most existing works tackle only one of these important issues, and the few other works require much computation to estimate multiple reward models and lack theoretical guarantee of generalization ability. In this work, we propose RLHF-\textbf{COV} and DPO-\textbf{COV} algorithms that can simultaneously mitigate these three issues, in both offline and online settings. This ability is theoretically demonstrated by obtaining length-regularized generalization error rates for our DPO-COV algorithms trained on corrupted data, which match the best-known rates for simpler cases with clean data and without length regularization. Moreover, our DPO-COV algorithm is simple to implement without reward estimation, and is proved to be equivalent to our RLHF-COV algorithm, which directly implies the equivalence between the vanilla RLHF and DPO algorithms. Experiments demonstrate the effectiveness of our DPO-COV algorithms under both offline and online settings.

[415] Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection

Rui Liu, Tao Zhe, Yanjie Fu, Feng Xia, Ted Senator, Dongjie Wang

Main category: cs.LG

TL;DR: This paper extends a feature selection framework to federated learning by adding privacy-preserving knowledge fusion and sample-aware weighting to handle data imbalance and heterogeneity across distributed clients.

DetailsMotivation: Existing feature selection methods struggle with permutation sensitivity and convexity assumptions, and while previous work addressed these with permutation-invariant embedding and policy-guided search, it didn't handle realistic distributed scenarios where data is imbalanced, heterogeneous, and privacy-constrained across clients.

Method: The framework integrates: 1) privacy-preserving knowledge fusion strategy to create unified representation space without sharing raw data, and 2) sample-aware weighting strategy to address distributional imbalance among heterogeneous local clients in federated settings.

Result: Extensive experiments validate the framework’s effectiveness, robustness, and efficiency, demonstrating strong generalization ability in federated learning scenarios.

Conclusion: The extended framework successfully addresses distributed feature selection challenges by incorporating privacy-preserving knowledge fusion and sample-aware weighting, making it suitable for real-world federated learning applications with data imbalance and privacy constraints.

Abstract: Feature selection eliminates redundancy among features to improve downstream task performance while reducing computational overhead. Existing methods often struggle to capture intricate feature interactions and adapt across diverse application scenarios. Recent advances employ generative intelligence to alleviate these drawbacks. However, these methods remain constrained by permutation sensitivity in embedding and reliance on convexity assumptions in gradient-based search. To address these limitations, our initial work introduces a novel framework that integrates permutation-invariant embedding with policy-guided search. Although effective, it still left opportunities to adapt to realistic distributed scenarios. In practice, data across local clients is highly imbalanced, heterogeneous and constrained by strict privacy regulations, limiting direct sharing. These challenges highlight the need for a framework that can integrate feature selection knowledge across clients without exposing sensitive information. In this extended journal version, we advance the framework from two perspectives: 1) developing a privacy-preserving knowledge fusion strategy to derive a unified representation space without sharing sensitive raw data. 2) incorporating a sample-aware weighting strategy to address distributional imbalance among heterogeneous local clients. Extensive experiments validate the effectiveness, robustness, and efficiency of our framework. The results further demonstrate its strong generalization ability in federated learning scenarios. The code and data are publicly available: https://anonymous.4open.science/r/FedCAPS-08BF.

[416] Value-State Gated Attention for Mitigating Extreme-Token Phenomena in Transformers

Rui Bu, Haofeng Zhong, Wenzheng Chen, Yangyan Li

Main category: cs.LG

TL;DR: VGA (Value-State Gated Attention) is a new Transformer mechanism that breaks the mutual reinforcement cycle causing attention sinks and value-state drains by gating attention output directly based on value vectors.

DetailsMotivation: Transformer models suffer from extreme-token phenomena like attention sinks and value-state drains, which degrade performance, quantization fidelity, and interpretability. These issues arise from a problematic mutual reinforcement mechanism where models learn inefficient 'no-op' behavior by focusing attention on tokens with near-zero value states.

Method: Proposes Value-State Gated Attention (VGA) - a simple architectural mechanism that introduces a learnable, data-dependent gate computed directly from value vectors (V) to modulate attention output. This breaks the cycle by gating the value-state with a function of itself, creating a direct regulatory pathway to suppress token contributions based on emergent value representations.

Result: VGA significantly mitigates attention sink formation and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability. Theoretical gradient analysis shows this approach is more effective at decoupling value and attention score updates than prior methods.

Conclusion: VGA provides an efficient solution to extreme-token phenomena in Transformers by directly addressing the mutual reinforcement mechanism through value-state gating, offering benefits across performance, quantization, and interpretability.

Abstract: Large models based on the Transformer architecture are susceptible to extreme-token phenomena, such as attention sinks and value-state drains. These issues, which degrade model performance, quantization fidelity, and interpretability, arise from a problematic mutual reinforcement mechanism where the model learns an inefficient ’no-op’ behavior by focusing attention on tokens with near-zero value states. In this paper, we propose Value-State Gated Attention (VGA), a simple, dedicated, and stable architectural mechanism for performing ’no-op’ attention efficiently by directly breaking this cycle. VGA introduces a learnable, data-dependent gate, computed directly from the value vectors (V), to modulate the output. Through a theoretical analysis of the underlying gradients, we show that gating the value-state with a function of itself is more effective at decoupling value and attention score updates than prior methods that gate on input embeddings. This creates a direct regulatory pathway that allows the model to suppress a token’s contribution based on its emergent value representation. Our experiments demonstrate that VGA significantly mitigates the formation of attention sinks and stabilizes value-state norms, leading to improved performance, robust quantization fidelity, and enhanced model interpretability.
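
A single-head sketch of the value-state gate under stated assumptions: the gate is computed from the value states rather than the input embeddings and multiplies the attention output, which is one straightforward reading of the mechanism; the exact gate placement and parameterization in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueGatedAttention(nn.Module):
    """Single-head attention whose output is modulated by a gate computed directly
    from the value states, so a token can be suppressed ('no-op') without the model
    dumping attention mass onto a sink token."""
    def __init__(self, d_model):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.gate = nn.Linear(d_model, d_model)  # gate is a function of V, not of the input x

    def forward(self, x, mask=None):
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / (x.size(-1) ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        out = F.softmax(scores, dim=-1) @ V
        return torch.sigmoid(self.gate(V)) * out  # value-state gate modulates the output
```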

[417] Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

Jingyu Xing, Chenwei Tang, Xinyu Liu, Deng Xiong, Shudong Huang, Wei Ju, Jiancheng Lv, Ziyue Qiao

Main category: cs.LG

TL;DR: COMPASS introduces a novel test-time RL method for LLMs that learns from unlabeled data using composite self-scoring of answers and reasoning paths, eliminating need for human-curated preference data.

DetailsMotivation: Current RL methods for LLMs face scalability bottlenecks due to heavy reliance on human-curated preference data or labeled datasets for reward modeling. The paper aims to enable RL on unlabeled data where models can learn autonomously from continuous experience streams.

Method: COMPASS (Composite Path and Answer Self-Scoring) integrates two components: 1) Dual-Calibration Answer Reward (DCAR) - stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and 2) Decisive Path Reward (DPR) - directly optimizes reasoning process quality beyond just outcome supervision. The method operates without external supervision by jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains.

Result: Extensive experiments show COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, demonstrating effectiveness in enhancing LLMs’ analytical capabilities without external supervision.

Conclusion: COMPASS advances a more scalable direction for LLMs to learn from continuous experience by overcoming the fundamental scalability bottleneck of current RL methods that rely on human-curated data, enabling autonomous learning from unlabeled data streams.

Abstract: Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches like Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes the reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, the COMPASS systematically enhances the model’s analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.

[418] Test-driven Reinforcement Learning in Continuous Control

Zhao Yu, Xiuping Wu, Liangjun Ke

Main category: cs.LG

TL;DR: TdRL replaces single reward functions with multiple test functions (pass-fail and indicative tests) to separate task definition from learning guidance, making RL task specification easier and supporting multi-objective optimization.

DetailsMotivation: Traditional RL reward functions serve dual purposes (defining optimal goals and guiding learning), making manual design challenging and often resulting in suboptimal task representations. The paper aims to address this reward design challenge.

Method: Proposes Test-driven RL (TdRL) framework using multiple test functions instead of single reward functions. Includes theoretical proof that maximum entropy policy optimization with appropriate return functions yields policies closer to optimal. Introduces lexicographic heuristic for comparing trajectory distances to optimal set, and develops algorithm implementation.

Result: Experimental results on DeepMind Control Suite show TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization.

Conclusion: TdRL offers a novel perspective for representing task objectives that helps address reward design challenges in RL applications, separating task definition from learning guidance through test functions.

Abstract: Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.

[419] Decomposition of Small Transformer Models

Casper L. Christensen, Logan Riggs

Main category: cs.LG

TL;DR: Extends Stochastic Parameter Decomposition (SPD) to Transformer models, enabling interpretable parameter-space analysis of real models like GPT-2-small.

DetailsMotivation: Previous mechanistic interpretability methods have only worked on toy models, leaving a gap to real-world models. The paper aims to bridge this gap by extending SPD to modern Transformer architectures.

Method: Extends Stochastic Parameter Decomposition (SPD) to Transformer models with an updated causal importance function for sequential data and a new loss function.

Result: Successfully decomposes a toy induction-head model to recover the expected 2-step circuit, and applies SPD to GPT-2-small to locate subcomponents corresponding to interpretable concepts like “golf” and “basketball”.

Conclusion: This work takes the first step in extending SPD to modern models, demonstrating that parameter-space decomposition can surface interpretable mechanisms in real-world language models.

Abstract: Recent work in mechanistic interpretability has shown that decomposing models in parameter space may yield clean handles for analysis and intervention. Previous methods have demonstrated successful applications on a wide range of toy models, but the gap to “real models” has not yet been bridged. In this work, we extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data and a new loss function. We demonstrate that SPD can successfully decompose a toy induction-head model and recover the expected 2-step circuit. We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like “golf” and “basketball”. These results take the first step in the direction of extending SPD to modern models, and show that we can use the method to surface interpretable parameter-space mechanisms.

[420] Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation

Zhenshuo Zhang, Minxuan Duan, Youran Ye, Hongyang R. Zhang

Main category: cs.LG

TL;DR: PolicyGradEx: A two-stage RL method that clusters n objectives into k groups using meta-training and fine-tuning, achieving 16% better performance and 26× faster clustering than baselines.

DetailsMotivation: Learning a single policy for all n objectives becomes suboptimal as n grows in applications like robotics, control, and preference optimization. Need efficient method to group related objectives for joint training.

Method: Two-stage approach: 1) Meta-train policy on all objectives using multitask learning, 2) Fine-tune on random subsets and compute task-affinity scores via first-order approximation. Cluster objectives by maximizing intra-cluster affinity scores.

Result: Outperforms SOTA baselines by 16% on average, achieves up to 26× faster clustering speed. First-order approximation verified within 2% error margin. Loss-based clustering beats random/gradient-similarity grouping by 19%.

Conclusion: PolicyGradEx efficiently groups related RL objectives for joint training. The method leverages meta-learning and first-order approximations to estimate task affinities, enabling scalable multi-objective RL with performance improvements.

Abstract: We study the problem of efficiently estimating policies that simultaneously optimize multiple objectives in reinforcement learning (RL). Given $n$ objectives (or tasks), we seek the optimal partition of these objectives into $k \ll n$ groups, where each group comprises related objectives that can be trained together. This problem arises in applications such as robotics, control, and preference optimization in language models, where learning a single policy for all $n$ objectives is suboptimal as $n$ grows. We introduce a two-stage procedure – meta-training followed by fine-tuning – to address this problem. We first learn a meta-policy for all objectives using multitask learning. Then, we adapt the meta-policy to multiple randomly sampled subsets of objectives. The adaptation step leverages a first-order approximation property of well-trained policy networks, which is empirically verified to be accurate within a 2% error margin across various RL environments. The resulting algorithm, PolicyGradEx, efficiently estimates an aggregate task-affinity score matrix given a policy evaluation algorithm. Based on the estimated affinity score matrix, we cluster the $n$ objectives into $k$ groups by maximizing the intra-cluster affinity scores. Experiments on three robotic control and the Meta-World benchmarks demonstrate that our approach outperforms state-of-the-art baselines by 16% on average, while delivering up to $26\times$ faster speedup relative to performing full training to obtain the clusters. Ablation studies validate each component of our approach. For instance, compared with random grouping and gradient-similarity-based grouping, our loss-based clustering yields an improvement of 19%. Finally, we analyze the generalization error of policy networks by measuring the Hessian trace of the loss surface, which gives non-vacuous measures relative to the observed generalization errors.

[421] Batch Acquisition Function Evaluations and Decouple Optimizer Updates for Faster Bayesian Optimization

Kaichi Irie, Shuhei Watanabe, Masaki Onishi

Main category: cs.LG

TL;DR: The paper proposes a coroutine-based approach to decouple quasi-Newton updates while batching acquisition function calls in Bayesian optimization, achieving identical convergence to sequential multi-start optimization while drastically reducing wall-clock time.

DetailsMotivation: Bayesian optimization faces a computational bottleneck in acquisition function optimization where multi-start optimization with quasi-Newton methods is required. Current approaches like BoTorch batch acquisition function calls but suffer from suboptimal convergence due to off-diagonal approximation errors in the inverse Hessian of quasi-Newton methods.

Method: The authors propose decoupling quasi-Newton updates using a coroutine while maintaining batched acquisition function calls. This approach separates the optimization updates from the batched function evaluations to avoid the approximation errors that occur when optimizing summed acquisition functions over multiple points.

Result: The proposed approach achieves theoretically identical convergence to sequential multi-start optimization while drastically reducing wall-clock time compared to previous approaches. The method is implemented in GPSampler in Optuna, effectively reducing computational overhead.

Conclusion: The coroutine-based decoupling of quasi-Newton updates while batching acquisition function calls provides an effective solution to the computational bottleneck in Bayesian optimization, offering both theoretical convergence guarantees and practical speed improvements.

Abstract: Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to PyTorch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches. Our approach is available in GPSampler in Optuna, effectively reducing its computational overhead.
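
A toy sketch of the decoupling idea using Python generators as coroutines: each restart keeps its own optimizer state and suspends at the point it wants evaluated, while the driver performs one batched acquisition call per iteration instead of optimizing a summed objective. The update rule below is a gradient-step placeholder rather than a real quasi-Newton step, and `acq_value_and_grad` is an assumed batched callable.

```python
import numpy as np

def qn_run(x0):
    """Toy stand-in for one quasi-Newton restart: it owns its private state,
    yields the point it wants evaluated, and receives (value, gradient) back."""
    x = np.asarray(x0, dtype=float)
    while True:
        value, grad = yield x   # suspend until the driver supplies results
        x = x - 0.1 * grad      # placeholder update; a real run would also maintain
                                # its own inverse-Hessian approximation here

def batched_multistart(acq_value_and_grad, starts, n_iters=50):
    """One coroutine per restart; each iteration makes a single batched
    acquisition-function call covering all restarts."""
    runs = [qn_run(x0) for x0 in starts]
    points = [next(r) for r in runs]                          # prime the coroutines
    for _ in range(n_iters):
        values, grads = acq_value_and_grad(np.stack(points))  # one batched evaluation
        points = [r.send((v, g)) for r, v, g in zip(runs, values, grads)]
    return np.stack(points)
```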

[422] GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning

Jie Ou, Shuaihong Jiang, Yingjun Du, Cees G. M. Snoek

Main category: cs.LG

TL;DR: GateRA introduces token-aware modulation to dynamically adjust PEFT update strength, enabling selective adaptation based on input difficulty rather than applying uniform updates to all tokens.

DetailsMotivation: Existing PEFT methods apply static, input-agnostic updates to all tokens, which can lead to overfitting on trivial content or under-adaptation on informative regions, especially in autoregressive settings with distinct prefill and decoding dynamics.

Method: GateRA incorporates adaptive gating into standard PEFT branches to enable selective, token-level adaptation. It uses entropy-based regularization to encourage near-binary gating decisions and prevent diffuse update patterns, creating interpretable sparse adaptation without hard thresholding.

Result: GateRA consistently outperforms or matches prior PEFT methods on multiple commonsense reasoning benchmarks. Visualizations show phase-sensitive behaviors where GateRA suppresses updates for redundant prefill tokens while emphasizing adaptation during decoding.

Conclusion: GateRA provides a unified framework for token-aware PEFT adaptation that dynamically adjusts update strength based on input difficulty, improving adaptation efficiency and performance while maintaining interpretability through sparse gating patterns.

Abstract: Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, DoRA, and HiRA, enable lightweight adaptation of large pre-trained models via low-rank updates. However, existing PEFT approaches apply static, input-agnostic updates to all tokens, disregarding the varying importance and difficulty of different inputs. This uniform treatment can lead to overfitting on trivial content or under-adaptation on more informative regions, especially in autoregressive settings with distinct prefill and decoding dynamics. In this paper, we propose GateRA, a unified framework that introduces token-aware modulation to dynamically adjust the strength of PEFT updates. By incorporating adaptive gating into standard PEFT branches, GateRA enables selective, token-level adaptation, preserving pre-trained knowledge for well-modeled inputs while focusing capacity on challenging cases. Empirical visualizations reveal phase-sensitive behaviors, where GateRA automatically suppresses updates for redundant prefill tokens while emphasizing adaptation during decoding. To promote confident and efficient modulation, we further introduce an entropy-based regularization that encourages near-binary gating decisions. This regularization prevents diffuse update patterns and leads to interpretable, sparse adaptation without hard thresholding. Finally, we present a theoretical analysis showing that GateRA induces a soft gradient-masking effect over the PEFT path, enabling continuous and differentiable control over adaptation. Experiments on multiple commonsense reasoning benchmarks demonstrate that GateRA consistently outperforms or matches prior PEFT methods.
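
A compact sketch of token-aware gating on a LoRA branch, under the assumption of a scalar sigmoid gate per token; the paper's exact gate parameterization and regularization details are not reproduced here.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Frozen linear layer plus a LoRA update whose strength is modulated per token
    by a learned gate (one reading of 'token-aware modulation'; the gate form here
    is an assumption, not the paper's exact design)."""
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base.requires_grad_(False)          # pre-trained weights stay frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.gate = nn.Linear(d_in, 1)                  # scalar gate per token

    def forward(self, x):                               # x: (batch, seq, d_in)
        delta = (x @ self.A.T) @ self.B.T               # low-rank LoRA update
        g = torch.sigmoid(self.gate(x))                 # (batch, seq, 1); pushed toward
        return self.base(x) + g * delta                 # near-binary values by the paper's
                                                        # entropy regularization
```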

[423] Tractable Probabilistic Models for Investment Planning

Nicolas M. Cuadrado A., Mohannad Takrouri, Jiří Němeček, Martin Takáč, Jakub Mareček

Main category: cs.LG

TL;DR: TPMs (especially SPNs) enable exact, scalable inference for power investment planning under uncertainty, supporting robust scenario expansion and chance-constrained optimization.

DetailsMotivation: Traditional scenario-based approaches for decade-long power investment planning are limited by generating finite scenarios (Dirac mixtures), which restricts insight into scenario-specific volatility and hinders robust decision-making under profound uncertainty.

Method: Proposes using tractable probabilistic models (TPMs), particularly sum-product networks (SPNs), which enable exact and scalable inference of key quantities like scenario likelihoods, marginals, and conditional probabilities. This supports robust scenario expansion and risk assessment.

Result: Demonstrated effectiveness through a representative power system planning case study, showing computational and reliability advantages over traditional scenario-based models.

Conclusion: TPMs provide a superior alternative to classical scenario-based approaches for power investment planning, enabling direct embedding of chance-constrained optimization and better quantification of high-dimensional uncertainties for robust decision-making.

Abstract: Investment planning in power utilities, such as generation and transmission expansion, requires decade-long forecasts under profound uncertainty. Forecasting of energy mix and energy use decades ahead is nontrivial. Classical approaches focus on generating a finite number of scenarios (modeled as a mixture of Diracs in statistical theory terms), which limits insight into scenario-specific volatility and hinders robust decision-making. We propose an alternative using tractable probabilistic models (TPMs), particularly sum-product networks (SPNs). These models enable exact, scalable inference of key quantities such as scenario likelihoods, marginals, and conditional probabilities, supporting robust scenario expansion and risk assessment. This framework enables direct embedding of chance-constrained optimization into investment planning, enforcing safety or reliability with prescribed confidence levels. TPMs allow both scenario analysis and volatility quantification by compactly representing high-dimensional uncertainties. We demonstrate the effectiveness of the approach through a representative power system planning case study, illustrating its computational and reliability advantages over traditional scenario-based models.

[424] Escaping the Verifier: Learning to Reason via Demonstrations

Locke Cai, Ivan Provilkov

Main category: cs.LG

TL;DR: RARO enables LLM reasoning training without verifiers by using expert demonstrations through adversarial inverse reinforcement learning.

DetailsMotivation: Many reasoning-intensive tasks lack verifiers but have abundant expert demonstrations that are underutilized for reasoning training.

Method: RARO sets up an adversarial game between a policy and a relativistic critic via inverse reinforcement learning: the policy learns to mimic expert answers, while the critic learns to identify the expert answer within (expert, policy) pairs.

Result: Significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and shows the same robust scaling trends as RL with verifiers.

Conclusion: RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning without task-specific verifiers.

Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial game between a policy and a relativistic critic: the policy learns to mimic expert answers, while the critic aims to identify the experts among (expert, policy) answer pairs. Both the policy and the critic are trained jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks – Countdown, DeepMath, and Poetry Writing – and enjoys the same robust scaling trends as RL with verifiers. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

[425] Towards Characterizing Knowledge Distillation of PPG Heart Rate Estimation Models

Kanav Arora, Girish Narayanswamy, Shwetak Patel, Richard Li

Main category: cs.LG

TL;DR: This paper explores knowledge distillation techniques to compress large pre-trained PPG models into smaller versions suitable for edge deployment on wearable devices, evaluating four distillation strategies and characterizing scaling laws.

DetailsMotivation: While deep learning models show strong performance for heart rate estimation from PPG signals, deploying them on wearable devices requires meeting strict memory and latency constraints. There's a need to compress these models for real-time edge inference while maintaining accuracy.

Method: The authors evaluate four knowledge distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. They characterize the resulting scaling laws between model size and performance.

Result: The paper presents a characterization of the scaling laws describing the relationship between model size and performance for distilled PPG models, providing insights into practical methods for building edge-deployable physiological sensing models.

Conclusion: This early investigation lays groundwork for practical and predictable methods to create edge-deployable models for physiological sensing by exploring how large pre-trained PPG models can be effectively distilled to smaller versions suitable for real-time inference on wearable devices.

Abstract: Heart rate estimation from photoplethysmography (PPG) signals generated by wearable devices such as smartwatches and fitness trackers has significant implications for the health and well-being of individuals. Although prior work has demonstrated deep learning models with strong performance in the heart rate estimation task, in order to deploy these models on wearable devices, these models must also adhere to strict memory and latency constraints. In this work, we explore and characterize how large pre-trained PPG models may be distilled to smaller models appropriate for real-time inference on the edge. We evaluate four distillation strategies through comprehensive sweeps of teacher and student model capacities: (1) hard distillation, (2) soft distillation, (3) decoupled knowledge distillation (DKD), and (4) feature distillation. We present a characterization of the resulting scaling laws describing the relationship between model size and performance. This early investigation lays the groundwork for practical and predictable methods for building edge-deployable models for physiological sensing.
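
For concreteness, a minimal sketch of the second strategy (soft distillation) for a regression target like heart rate; the loss weighting and the use of plain MSE terms are illustrative assumptions, not the paper's exact loss.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_hr, teacher_hr, target_hr, alpha=0.5):
    """Soft distillation for a regression target such as heart rate:
    the student fits both the ground truth and the teacher's prediction."""
    supervised = F.mse_loss(student_hr, target_hr)       # fit the labeled heart rate
    mimic = F.mse_loss(student_hr, teacher_hr.detach())  # fit the (frozen) teacher output
    return alpha * supervised + (1 - alpha) * mimic
```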

[426] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani

Main category: cs.LG

TL;DR: Image2Gcode: Direct G-code generation from 2D images using diffusion models, bypassing CAD modeling for faster additive manufacturing.

DetailsMotivation: CAD modeling is a bottleneck in mechanical design workflows - it's slow, requires manual updates for design variations, and hinders rapid prototyping and iteration.

Method: End-to-end framework that extracts slice-wise structural cues from 2D images (hand-drawn or captured), then uses denoising diffusion probabilistic model (DDPM) to transform Gaussian noise into executable G-code sequences with extrusion parameters.
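
The generative core is a standard DDPM reverse process; a minimal editor-added sketch is below, with `eps_model` as a hypothetical stand-in for the paper's image-conditioned denoiser over G-code tensors and a textbook linear beta schedule rather than the paper's.

```python
# Generic DDPM ancestral sampling loop. `eps_model` is a hypothetical
# stand-in for an image-conditioned denoiser over G-code move/extrusion
# tensors; the schedule below is a textbook default.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(eps_model, image_condition, shape):
    x = torch.randn(shape)  # start from Gaussian noise
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.tensor([t]), image_condition)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * z
    return x  # denoised tensor to be decoded into print moves and extrusion

# Toy usage with a dummy denoiser that ignores its conditioning.
dummy_eps = lambda x, t, cond: torch.zeros_like(x)
sample = ddpm_sample(dummy_eps, image_condition=None, shape=(1, 64, 4))
print(sample.shape)
```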

Result: Direct mapping from visual input to native toolpaths, eliminating need for CAD/STL intermediates, lowering entry barrier for additive manufacturing, accelerating design-to-fabrication cycle.

Conclusion: Flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing by enabling automated pipeline from concept to physical artifact.

Abstract: Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.

[427] China Regional 3km Downscaling Based on Residual Corrective Diffusion Model

Honglu Sun, Hao Jing, Zhixiang Dai, Sa Xiao, Wei Xue, Jian Sun, Qifeng Lu

Main category: cs.LG

TL;DR: This paper presents a diffusion-based statistical downscaling method (CorrDiff) for weather forecasting that generates high-resolution 3km forecasts for China from low-resolution 25km global models, outperforming operational regional models.

DetailsMotivation: The need for efficient high-resolution weather forecasts, as traditional methods like dynamical downscaling are computationally expensive. Statistical downscaling using deep learning offers a more efficient alternative to produce detailed forecasts from global model outputs.

Method: Uses CorrDiff, a diffusion-based downscaling framework, with modifications: expanded region coverage (40x larger), inclusion of high-level variables at six pressure levels (not just surface variables), and addition of global residual connections for improved accuracy. Applied to 25km forecasts from CMA-GFS (operational model) and SFF (data-driven SFNO model).

Result: The downscaled forecasts generally outperform direct forecasts from CMA-MESO (high-resolution regional baseline) in terms of MAE for target variables. The generative approach produces fine-scale details and more realistic predictions compared to deterministic regression models, particularly evident in radar composite reflectivity forecasts.

Conclusion: Diffusion-based statistical downscaling (CorrDiff) is effective for generating high-resolution weather forecasts, demonstrating superior performance over operational regional models and producing more realistic fine-scale details than deterministic approaches.

Abstract: A fundamental challenge in numerical weather prediction is to efficiently produce high-resolution forecasts. A common solution is applying downscaling methods, which include dynamical downscaling and statistical downscaling, to the outputs of global models. This work focuses on statistical downscaling, which establishes statistical relationships between low-resolution and high-resolution historical data using statistical models. Deep learning has emerged as a powerful tool for this task, giving rise to various high-performance super-resolution models, which can be directly applied for downscaling, such as diffusion models and Generative Adversarial Networks. This work relies on a diffusion-based downscaling framework named CorrDiff. In contrast to the original work of CorrDiff, the region considered in this work is nearly 40 times larger, and we not only consider surface variables as in the original work, but also include high-level variables (six pressure levels) as target downscaling variables. In addition, a global residual connection is added to improve accuracy. In order to generate the 3km forecasts for the China region, we apply our trained models to the 25km global grid forecasts of CMA-GFS, an operational global model of the China Meteorological Administration (CMA), and SFF, a data-driven deep learning-based weather model developed from Spherical Fourier Neural Operators (SFNO). CMA-MESO, a high-resolution regional model, is chosen as the baseline model. The experimental results demonstrate that the forecasts downscaled by our method generally outperform the direct forecasts of CMA-MESO in terms of MAE for the target variables. Our forecasts of radar composite reflectivity show that CorrDiff, as a generative model, can generate fine-scale details that lead to more realistic predictions compared to the corresponding deterministic regression models.

[428] ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee

Main category: cs.LG

TL;DR: ReJump is a framework for analyzing Large Reasoning Models by representing reasoning traces as visitation orders in a tree structure, enabling quantification of various reasoning behaviors like exploration, exploitation, and verification.

DetailsMotivation: Large Reasoning Models (LRMs) achieve impressive results on complex tasks but their underlying reasoning "algorithms" remain poorly understood. There's a need for systematic analysis tools to understand how these models reason.

Method: Proposes ReJump framework that represents reasoning traces as visitation orders over nodes in a tree of intermediate steps. Uses jumps (adjacent and non-adjacent moves) to capture behaviors like calculation, backtracking, and verification. Develops metrics to quantify exploration, exploitation, overthinking, forgetting, and verification.
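
To make the representation concrete, the editor-added sketch below computes simple jump statistics (adjacent vs. non-adjacent moves, revisits) from a visitation order over tree nodes; the paper's actual metrics for exploration, exploitation, overthinking, and verification are defined in more detail.

```python
# Illustrative sketch only: given a reasoning trace expressed as a visitation
# order over tree nodes (each node with a parent pointer), count adjacent vs.
# non-adjacent jumps and node revisits.
from collections import Counter

def jump_statistics(visit_order, parent):
    """visit_order: list of node ids; parent: dict child -> parent id."""
    adjacent, non_adjacent = 0, 0
    for prev, curr in zip(visit_order, visit_order[1:]):
        # Adjacent jump: move between a node and its parent or child.
        if parent.get(curr) == prev or parent.get(prev) == curr:
            adjacent += 1
        else:
            non_adjacent += 1  # e.g. backtracking or verification jumps
    revisits = sum(c - 1 for c in Counter(visit_order).values() if c > 1)
    return {"adjacent": adjacent, "non_adjacent": non_adjacent, "revisits": revisits}

# Toy trace: root 0 with children 1 and 2; node 3 is a child of 1.
parent = {1: 0, 2: 0, 3: 1}
trace = [0, 1, 3, 1, 2, 3]  # 1 -> 2 and 2 -> 3 are non-adjacent jumps
print(jump_statistics(trace, parent))
```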

Result: Found that models with similar accuracy can exhibit distinct reasoning behaviors, and different tasks favor different reasoning styles. Successfully compared distilled LRMs with teachers, CoT-prompted LLMs with LRMs, and examined effects of training data and reinforcement learning. Showed ReJump can improve reasoning quality through strategies like Best-of-N selection and prompt selection.

Conclusion: ReJump provides a powerful framework for analyzing and understanding the reasoning behaviors of Large Reasoning Models, revealing important insights about how different training strategies shape reasoning patterns, and offers practical applications for improving reasoning quality at test time.

Abstract: Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning “algorithms” remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at https://github.com/UW-Madison-Lee-Lab/ReJump.

[429] CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace

Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh

Main category: cs.LG

TL;DR: CLAPS combines Last-Layer Laplace Approximation with split-conformal calibration to create posterior-aware conformal regression that yields narrower prediction intervals with same coverage, especially effective on small tabular datasets.

DetailsMotivation: Existing conformal regression methods often rely on residual-based approaches that don't fully leverage predictive shape information. There's a need for methods that produce narrower prediction intervals while maintaining coverage, particularly in data-scarce scenarios like small to medium tabular datasets where uncertainty modeling is crucial.

Method: CLAPS pairs Last-Layer Laplace Approximation with split-conformal calibration to create a Gaussian posterior. It defines a two-sided posterior CDF score that aligns the conformity metric with the full predictive shape rather than just point estimates. The method also includes a lightweight diagnostic suite to separate aleatoric and epistemic uncertainty components.
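
A minimal editor-added sketch of the idea, assuming a Gaussian predictive posterior N(mu(x), sigma(x)^2) from the last-layer Laplace step and using |2F(y) - 1| as one plausible two-sided CDF score; the paper's exact conformity score may differ.

```python
# Split-conformal calibration with a two-sided posterior-CDF score for a
# Gaussian predictive posterior. The score choice is an assumption.
import numpy as np
from scipy.stats import norm

def conformity_scores(y, mu, sigma):
    return np.abs(2.0 * norm.cdf(y, loc=mu, scale=sigma) - 1.0)

def conformal_interval(mu, sigma, q_hat):
    # {y : |2F(y) - 1| <= q_hat} is a central interval of the posterior.
    half = norm.ppf(0.5 + q_hat / 2.0)
    return mu - half * sigma, mu + half * sigma

# Synthetic calibration split: the model is slightly overconfident.
rng = np.random.default_rng(0)
mu_cal, sigma_cal = rng.normal(size=500), np.full(500, 0.8)
y_cal = mu_cal + rng.normal(scale=1.0, size=500)
alpha = 0.1
scores = conformity_scores(y_cal, mu_cal, sigma_cal)
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))
q_hat = np.sort(scores)[k - 1]

lo, hi = conformal_interval(0.0, 0.8, q_hat)
print(f"90% conformal interval around mu=0: [{lo:.2f}, {hi:.2f}]")
```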

Result: CLAPS consistently achieves nominal coverage with improved efficiency (narrower prediction intervals) across multiple benchmarks using the same MLP backbone. It shows particular effectiveness on small to medium tabular datasets where data is scarce, with minimal computational overhead compared to residual-based conformal baselines.

Conclusion: CLAPS offers a practical upgrade to residual-based conformal methods by better leveraging posterior information to produce narrower prediction intervals while maintaining coverage, especially valuable in data-scarce scenarios. The included diagnostic tools help practitioners understand uncertainty decomposition and interval behavior.

Abstract: We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment yields narrower prediction intervals at the same target coverage, especially on small to medium tabular datasets where data are scarce and uncertainty modeling matters. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners understand why intervals shrink when they do. Across multiple benchmarks using the same MLP backbone, CLAPS consistently attains nominal coverage with improved efficiency and minimal overhead, offering a clear, practical upgrade to residual-based conformal baselines.

[430] Inverse Optimality for Fair Digital Twins: A Preference-based approach

Daniele Masti, Francesco Basciani, Arianna Fedeli, Giorgio Gnecco, Francesco Smarra

Main category: cs.LG

TL;DR: A framework that makes fairness a learnable objective in Digital Twins by inferring human fairness preferences through pairwise comparisons and using them to guide optimization toward human-aligned decisions.

DetailsMotivation: Digital Twins often make mathematically optimal decisions that diverge from human expectations due to mismatch between algorithmic rationality and bounded human rationality, particularly regarding fairness considerations.

Method: Proposes a preference-driven learning workflow using Siamese neural networks to infer latent fairness objectives from human pairwise preferences over decisions, generating convex quadratic cost functions conditioned on context to guide optimization.
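
An editor-added sketch of one way to realize this: a small network maps a context vector to a lower-triangular factor L, the surrogate cost of a decision x is x^T(LL^T + eps I)x, which is convex quadratic by construction, and training uses a Siamese pairwise preference loss. The parameterization is an assumption; the paper's exact architecture may differ.

```python
# Convex quadratic surrogate cost generated from context, trained on pairwise
# human preferences. The LL^T + eps*I parameterization is an assumption used
# here to guarantee convexity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadraticCostNet(nn.Module):
    def __init__(self, context_dim, decision_dim, eps=1e-3):
        super().__init__()
        self.decision_dim = decision_dim
        self.eps = eps
        self.tril_idx = torch.tril_indices(decision_dim, decision_dim)
        n_tril = self.tril_idx.shape[1]
        self.net = nn.Sequential(nn.Linear(context_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_tril))

    def cost(self, context, x):
        L = torch.zeros(context.shape[0], self.decision_dim, self.decision_dim)
        L[:, self.tril_idx[0], self.tril_idx[1]] = self.net(context)
        Q = L @ L.transpose(1, 2) + self.eps * torch.eye(self.decision_dim)
        return torch.einsum("bi,bij,bj->b", x, Q, x)

def preference_loss(model, context, x_preferred, x_other):
    # The human-preferred decision should receive the lower surrogate cost.
    margin = model.cost(context, x_other) - model.cost(context, x_preferred)
    return -F.logsigmoid(margin).mean()

# Toy usage: batch of 8 contexts, 5-dimensional allocation decisions.
model = QuadraticCostNet(context_dim=10, decision_dim=5)
ctx = torch.randn(8, 10)
print(preference_loss(model, ctx, torch.rand(8, 5), torch.rand(8, 5)))
```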

Result: Demonstrated effectiveness on COVID-19 hospital resource allocation scenario, showing the approach can produce solutions that better reflect human-perceived fairness while maintaining computational efficiency.

Conclusion: Provides a practical solution to integrate human-centered fairness into autonomous decision-making systems by making fairness a learnable objective within optimization-based Digital Twins.

Abstract: Digital Twins (DTs) are increasingly used as autonomous decision-makers in complex socio-technical systems. However, their mathematically optimal decisions often diverge from human expectations, revealing a persistent mismatch between algorithmic and bounded human rationality. This work addresses this challenge by proposing a framework that introduces fairness as a learnable objective within optimization-based Digital Twins. In this respect, a preference-driven learning workflow that infers latent fairness objectives directly from human pairwise preferences over feasible decisions is introduced. A dedicated Siamese neural network is developed to generate convex quadratic cost functions conditioned on contextual information. The resulting surrogate objectives drive the optimization procedure toward solutions that better reflect human-perceived fairness while maintaining computational efficiency. The effectiveness of the approach is demonstrated on a COVID-19 hospital resource allocation scenario. Overall, this work offers a practical solution to integrate human-centered fairness into the design of autonomous decision-making systems.

[431] The Mean-Field Dynamics of Transformers

Philippe Rigollet

Main category: cs.LG

TL;DR: Transformers as interacting particle systems with continuum limits; attention dynamics connect to Wasserstein flows, synchronization, and clustering; global clustering phenomenon with metastable states; analysis of clustering rates, normalization effects, and phase transitions.

DetailsMotivation: To develop a mathematical framework for understanding Transformer attention dynamics through continuum limits and connecting them to established mathematical concepts like gradient flows and synchronization models.

Method: Interpret Transformer attention as interacting particle system; study continuum (mean-field) limits; idealize attention on sphere; connect to Wasserstein gradient flows, Kuramoto synchronization, and mean-shift clustering; analyze equiangular reduction for exact clustering rates; examine normalization scheme effects.
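
For orientation, the idealized attention dynamics on the sphere used in this line of work take roughly the following form (editor-added; the paper's exact normalization may differ). Tokens $x_1,\dots,x_n \in S^{d-1}$ evolve with inverse temperature $\beta > 0$, and $P_x = I - x x^{\top}$ projects onto the tangent space so that trajectories remain on the sphere:

$$ \dot{x}_i(t) = P_{x_i(t)}\!\left( \frac{1}{Z_i(t)} \sum_{j=1}^{n} e^{\beta \langle x_i(t),\, x_j(t)\rangle}\, x_j(t) \right), \qquad Z_i(t) = \sum_{j=1}^{n} e^{\beta \langle x_i(t),\, x_j(t)\rangle}. $$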

Result: Identified global clustering phenomenon with asymptotic token clustering after metastable states; derived exact clustering rates via tractable reduction; showed how normalization affects contraction speeds; identified phase transition for long-context attention; revealed mechanisms of representation collapse vs. preservation of multi-cluster structure.

Conclusion: The framework provides mathematical understanding of Transformer dynamics, revealing clustering behavior, normalization effects, and phase transitions that explain both representation collapse and preservation of expressive structure in deep attention architectures.

Abstract: We develop a mathematical framework that interprets Transformer attention as an interacting particle system and studies its continuum (mean-field) limits. By idealizing attention on the sphere, we connect Transformer dynamics to Wasserstein gradient flows, synchronization models (Kuramoto), and mean-shift clustering. Central to our results is a global clustering phenomenon whereby tokens cluster asymptotically after long metastable states where they are arranged into multiple clusters. We further analyze a tractable equiangular reduction to obtain exact clustering rates, show how commonly used normalization schemes alter contraction speeds, and identify a phase transition for long-context attention. The results highlight both the mechanisms that drive representation collapse and the regimes that preserve expressive, multi-cluster structure in deep attention architectures.

[432] An Improved Ensemble-Based Machine Learning Model with Feature Optimization for Early Diabetes Prediction

Md. Najmul Islam, Md. Miner Hossain Rimon, Shah Sadek-E-Akbor Shamim, Zarif Mohaimen Fahad, Md. Jehadul Islam Mony, Md. Jalal Uddin Chowdhury

Main category: cs.LG

TL;DR: The paper develops a machine learning framework for diabetes prediction using BRFSS data, addressing class imbalance with SMOTE/Tomek Links, achieving high performance with ensemble methods, and implementing a React Native mobile app for practical deployment.

DetailsMotivation: Diabetes is a serious global health issue requiring early detection, but prediction is challenging due to overlapping risk factors and data asymmetry. There's a need for accurate, interpretable machine learning models to aid clinical decision-making using extensive health survey data.

Method: Used 2015 BRFSS dataset with ~253,680 records and 22 numerical features. Applied SMOTE and Tomek Links to address class imbalance. Evaluated multiple supervised learning models (Random Forest, XGBoost, CatBoost, LightGBM) and ensemble techniques including stacking. Implemented a React Native mobile app with Python Flask backend for practical deployment.
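
An editor-added sketch of this kind of pipeline with scikit-learn, imbalanced-learn, and xgboost; the synthetic data, hyperparameters, and meta-learner are placeholders rather than the paper's configuration.

```python
# SMOTE + Tomek Links resampling followed by a stacking ensemble of XGBoost
# and KNN, on synthetic stand-in data for the 22-feature BRFSS table.
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20000, n_features=22,
                           weights=[0.86, 0.14], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

# Resample only the training split to avoid leaking into evaluation.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

stack = StackingClassifier(
    estimators=[("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss")),
                ("knn", KNeighborsClassifier(n_neighbors=15))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_res, y_res)
print("ROC-AUC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```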

Result: Individual models achieved strong ROC-AUC ~0.96. The stacking ensemble with XGBoost and KNN performed best with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, showing excellent balance between recall and precision. A practical mobile application was developed for early diabetes prediction.

Conclusion: The proposed machine learning framework effectively addresses diabetes prediction challenges, with ensemble methods achieving high performance metrics. The developed mobile application provides an accessible tool for early diabetes detection, supporting clinical decision-making and health monitoring.

Abstract: Diabetes is a serious worldwide health issue, and successful intervention depends on early detection. However, overlapping risk factors and data asymmetry make prediction difficult. The aim is to use extensive health survey data to create a machine learning framework for diabetes classification that is both accurate and comprehensible, and to produce results that will aid in clinical decision-making. Using the BRFSS dataset, we assessed a number of supervised learning techniques. SMOTE and Tomek Links were used to correct class imbalance. To improve prediction performance, both individual models and ensemble techniques such as stacking were investigated. The 2015 BRFSS dataset, which includes roughly 253,680 records with 22 numerical features, is used in this study. Strong ROC-AUC performance of approximately 0.96 was attained by the individual models Random Forest, XGBoost, CatBoost, and LightGBM. The stacking ensemble with XGBoost and KNN yielded the best overall results with 94.82% accuracy, ROC-AUC of 0.989, and PR-AUC of 0.991, indicating a favourable balance between recall and precision. In our study, we proposed and developed a React Native-based application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.

[433] Mitigating the Curse of Detail: Scaling Arguments for Feature Learning and Sample Complexity

Noa Rubin, Orit Davidovich, Zohar Ringel

Main category: cs.LG

TL;DR: The paper proposes a simpler heuristic scale analysis method to predict when feature learning emerges in deep networks, avoiding complex numerical solutions of exact theories.

DetailsMotivation: Current theories of rich feature learning involve computationally intensive high-dimensional non-linear equations, making analysis complex and challenging for practical deep learning problems.

Method: Proposes a heuristic scale analysis approach that predicts data and width scales where different feature learning patterns emerge, which is simpler than exact theories.

Result: The method reproduces scaling exponents of known results and makes novel predictions for complex architectures like three-layer non-linear networks and attention heads.

Conclusion: The scale analysis approach extends the scope of first-principle theories of deep learning by providing a simpler, practical alternative to computationally intensive exact theories.

Abstract: Two pressing topics in the theory of deep learning are the interpretation of feature learning mechanisms and the determination of implicit bias of networks in the rich regime. Current theories of rich feature learning often appear in the form of high-dimensional non-linear equations, which require computationally intensive numerical solutions. Given the many details that go into defining a deep learning problem, this complexity is a significant and often unavoidable challenge. Here, we propose a powerful heuristic route for predicting the data and width scales at which various patterns of feature learning emerge. This form of scale analysis is considerably simpler than exact theories and reproduces the scaling exponents of various known results. In addition, we make novel predictions on complex toy architectures, such as three-layer non-linear networks and attention heads, thus extending the scope of first-principle theories of deep learning.

[434] Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN)

Y. Sungtaek Ju

Main category: cs.LG

TL;DR: SVGP KANs integrate sparse variational Gaussian processes with Kolmogorov-Arnold Networks for scalable Bayesian uncertainty quantification while maintaining interpretability.

DetailsMotivation: Standard Kolmogorov-Arnold Networks lack principled uncertainty quantification capabilities needed for scientific applications, requiring a framework that can distinguish aleatoric from epistemic uncertainty.

Method: Integrate sparse variational Gaussian process inference with Kolmogorov-Arnold topology using analytic moment matching to propagate uncertainty through deep additive structures while maintaining interpretability.

Result: Demonstrated framework’s ability in three studies: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting, and out-of-distribution detection in convolutional autoencoders.

Conclusion: SVGP KANs are a promising architecture for uncertainty-aware learning in scientific machine learning, providing scalable Bayesian inference with quasi-linear computational complexity.

Abstract: Kolmogorov-Arnold Networks have emerged as interpretable alternatives to traditional multi-layer perceptrons. However, standard implementations lack principled uncertainty quantification capabilities essential for many scientific applications. We present a framework integrating sparse variational Gaussian process inference with the Kolmogorov-Arnold topology, enabling scalable Bayesian inference with computational complexity quasi-linear in sample size. Through analytic moment matching, we propagate uncertainty through deep additive structures while maintaining interpretability. We use three example studies to demonstrate the framework’s ability to distinguish aleatoric from epistemic uncertainty: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting of advection-diffusion dynamics, and out-of-distribution detection in convolutional autoencoders. These results suggest that Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KANs) are a promising architecture for uncertainty-aware learning in scientific machine learning.

[435] Towards agent-based-model informed neural networks

Nino Antulov-Fantulin

Main category: cs.LG

TL;DR: ABM-NNs: Neural networks constrained by agent-based model principles for interpretable, structure-preserving dynamics learning across ecological, epidemiological, and economic systems.

DetailsMotivation: Standard neural differential equations fail to enforce important constraints in complex systems (mass conservation, information locality, bounded rationality) that are crucial in agent-based models but lack physical invariants like energy.

Method: Agent-Based-Model informed Neural Networks (ABM-NNs) using restricted graph neural networks and hierarchical decomposition to learn interpretable, structure-preserving dynamics consistent with ABM principles.
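
As a toy illustration of case study (i), the editor-added sketch below recovers generalized Lotka-Volterra parameters by unrolling an Euler integrator and regressing on an observed trajectory; it is not the restricted-GNN ABM-NN framework itself.

```python
# Fit the growth rates r and interaction matrix A of a generalized
# Lotka-Volterra system, dx/dt = x * (r + A x), by differentiating through an
# Euler rollout and matching an observed short trajectory.
import torch

n_species, dt, steps = 3, 0.01, 200
true_r = torch.tensor([1.0, 0.7, 0.5])
true_A = torch.tensor([[-1.0, -0.3, 0.0],
                       [-0.2, -1.0, -0.3],
                       [0.0, -0.2, -1.0]])

def simulate(r, A, x0):
    xs, x = [x0], x0
    for _ in range(steps):
        x = x + dt * x * (r + x @ A.T)
        xs.append(x)
    return torch.stack(xs)

x0 = torch.tensor([0.5, 0.8, 0.6])
observed = simulate(true_r, true_A, x0)

r_hat = torch.zeros(n_species, requires_grad=True)
A_hat = torch.zeros(n_species, n_species, requires_grad=True)
opt = torch.optim.Adam([r_hat, A_hat], lr=0.05)
for _ in range(400):
    opt.zero_grad()
    loss = torch.mean((simulate(r_hat, A_hat, x0) - observed) ** 2)
    loss.backward()
    opt.step()
print("recovered growth rates:", r_hat.detach())
```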

Result: Validated across three case studies: (1) Generalized Lotka-Volterra system - recovered ground-truth parameters from short trajectories with interventions; (2) Graph-based SIR contagion model - outperformed GCN, GraphSAGE, Graph Transformer in forecasting and noise robustness; (3) Real-world macroeconomic model - learned coupled GDP dynamics and enabled counterfactual policy analysis.

Conclusion: ABM-NNs provide a principled framework for learning interpretable dynamics that respect agent-based model constraints, enabling better generalization, robustness, and policy analysis across diverse complex systems.

Abstract: In this article, we present a framework for designing neural networks that remain consistent with the underlying principles of agent-based models. We begin by highlighting the limitations of standard neural differential equations in modeling complex systems, where physical invariants (like energy) are often absent but other constraints (like mass conservation, information locality, bounded rationality) must be enforced. To address this, we introduce Agent-Based-Model informed Neural Networks (ABM-NNs), which leverage restricted graph neural networks and hierarchical decomposition to learn interpretable, structure-preserving dynamics. We validate the framework across three case studies of increasing complexity: (i) a generalized Lotka–Volterra system, where we recover ground-truth parameters from short trajectories in the presence of interventions; (ii) a graph-based SIR contagion model, where our method outperforms state-of-the-art graph learning baselines (GCN, GraphSAGE, Graph Transformer) in out-of-sample forecasting and noise robustness; and (iii) a real-world macroeconomic model of the ten largest economies, where we learn coupled GDP dynamics from empirical data and demonstrate counterfactual analysis for policy interventions.

[436] FRWKV:Frequency-Domain Linear Attention for Long-Term Time Series Forecasting

Qingyuan Yang, Shizhuo Deng, Dongyue Chen, Da Teng, Zehua Gan

Main category: cs.LG

TL;DR: FRWKV is a frequency-domain linear-attention framework for long-sequence time series forecasting that achieves O(T) complexity by combining linear attention with frequency-domain analysis, outperforming traditional Transformers.

DetailsMotivation: Traditional Transformers have quadratic complexity (O(T²)) which limits their scalability for long-sequence time series forecasting, and they fail to effectively exploit frequency-domain information that could enhance temporal feature representations.

Method: FRWKV integrates linear attention mechanisms (inspired by RWKV’s O(T) linear attention) with frequency-domain analysis, creating a framework that maintains linear computational complexity while leveraging spectral information to improve temporal feature representations.
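
The editor-added sketch below illustrates the two ingredients named here, an O(T) linear-attention path and an rFFT-based frequency encoding; it is not the actual FRWKV block, whose structure is in the linked repository.

```python
# Kernelized linear attention (no TxT attention matrix) and a log-amplitude
# frequency encoding via rFFT, as generic stand-ins for the two components.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """q, k, v: (batch, time, dim). Cost is linear in the sequence length."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0          # positive feature maps
    kv = torch.einsum("btd,bte->bde", k, v)        # (dim x dim) summary
    norm = torch.einsum("btd,bd->bt", q, k.sum(dim=1)).clamp_min(1e-6)
    return torch.einsum("btd,bde->bte", q, kv) / norm.unsqueeze(-1)

def frequency_features(x):
    """x: (batch, time, channels) -> log-amplitude spectrum per channel."""
    spec = torch.fft.rfft(x, dim=1)
    return torch.log1p(spec.abs())

# Toy usage on a batch of 4 series, 96 time steps, 8 channels.
x = torch.randn(4, 96, 8)
print(linear_attention(x, x, x).shape)   # torch.Size([4, 96, 8])
print(frequency_features(x).shape)       # torch.Size([4, 49, 8])
```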

Result: FRWKV achieves first-place average rank across eight real-world datasets, with ablation studies confirming the critical importance of both linear attention and frequency-encoder components for performance.

Conclusion: The work demonstrates powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling with linear computational complexity.

Abstract: Traditional Transformers face a major bottleneck in long-sequence time series forecasting due to their quadratic complexity $(\mathcal{O}(T^2))$ and their limited ability to effectively exploit frequency-domain information. Inspired by RWKV’s $\mathcal{O}(T)$ linear attention and frequency-domain modeling, we propose FRWKV, a frequency-domain linear-attention framework that overcomes these limitations. Our model integrates linear attention mechanisms with frequency-domain analysis, achieving $\mathcal{O}(T)$ computational complexity in the attention path while exploiting spectral information to enhance temporal feature representations for scalable long-sequence modeling. Across eight real-world datasets, FRWKV achieves a first-place average rank. Our ablation studies confirm the critical roles of both the linear attention and frequency-encoder components. This work demonstrates the powerful synergy between linear attention and frequency analysis, establishing a new paradigm for scalable time series modeling. Code is available at this repository: https://github.com/yangqingyuan-byte/FRWKV.

cs.MA

[437] CrowdLLM: Building LLM-Based Digital Populations Augmented with Generative Models

Ryan Feng Lin, Keyu Tian, Hanming Zheng, Congjing Zhang, Li Zeng, Shuai Huang

Main category: cs.MA

TL;DR: CrowdLLM integrates LLMs with generative models to create more accurate and diverse digital populations for applications like social simulation and crowdsourcing.

DetailsMotivation: Existing LLM-based digital populations lack sufficient accuracy and diversity compared to real human populations, limiting their practical applications in social simulation, crowdsourcing, marketing, and recommendation systems.

Method: CrowdLLM integrates pretrained large language models (LLMs) with generative models to enhance both diversity and fidelity of digital populations.

Result: Theoretical analysis shows CrowdLLM’s potential for creating cost-effective, representative, and scalable digital populations. Comprehensive experiments across multiple domains (crowdsourcing, voting, user rating) demonstrate promising performance in both accuracy and distributional fidelity to human data.

Conclusion: CrowdLLM successfully addresses the limitations of existing LLM-only approaches by combining LLMs with generative models to create digital populations that better match real human crowds in both accuracy and diversity.

Abstract: The emergence of large language models (LLMs) has sparked much interest in creating LLM-based digital populations that can be applied to many applications such as social simulation, crowdsourcing, marketing, and recommendation systems. A digital population can reduce the cost of recruiting human participants and alleviate many concerns related to human subject study. However, research has found that most existing works rely solely on LLMs and cannot sufficiently capture the accuracy and diversity of a real human population. To address this limitation, we propose CrowdLLM, which integrates pretrained LLMs and generative models to enhance the diversity and fidelity of the digital population. We conduct theoretical analysis of CrowdLLM regarding its great potential in creating cost-effective, sufficiently representative, scalable digital populations that can match the quality of a real crowd. Comprehensive experiments are also conducted across multiple domains (e.g., crowdsourcing, voting, user rating) and simulation studies, which demonstrate that CrowdLLM achieves promising performance in both accuracy and distributional fidelity to human data.

[438] MARINE: Theoretical Optimization and Design for Multi-Agent Recursive IN-context Enhancement

Hongwei Zhang, Ji Lu, Yongsheng Du, Yanqin Gao, Lingjun Huang, Baoli Wang, Fang Tan, Peng Zou

Main category: cs.MA

TL;DR: MARINE is a multi-agent framework that transforms LLM reasoning from single-response outputs to iterative refinement, converting pass@N capabilities into near-optimal pass@1 performance with significant parameter efficiency gains.

DetailsMotivation: Current LLM-based agents are limited by practical constraints that force single-response outputs, leaving significant performance potential unrealized. There's a need to move beyond conventional one-shot or multi-sample paradigms to unlock better reasoning capabilities.

Method: MARINE reconceptualizes test-time reasoning as iterative refinement of a persistent reference trajectory. It uses a refinement operator that systematically converts base model’s pass@N capabilities into pass@1 performance. The framework employs minimal feasible batches to maximize expected gains under fixed budgets, with logarithmically growing batch schedules for continuous improvement.
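
A schematic, editor-added sketch of the refinement loop (persistent reference trajectory, small candidate batches, slowly growing schedule); `generate_candidates` and `judge_better` are hypothetical callables standing in for the policy LLM and the comparison step, not the authors' implementation.

```python
# A persistent reference answer is refined in rounds, with batch sizes that
# grow logarithmically with the round index.
import math

def marine_refine(problem, generate_candidates, judge_better,
                  rounds=6, base_batch=1):
    reference = generate_candidates(problem, reference=None, n=1)[0]
    for r in range(1, rounds + 1):
        batch_size = base_batch + int(math.log2(r))   # slowly growing schedule
        candidates = generate_candidates(problem, reference=reference,
                                         n=batch_size)
        for cand in candidates:
            if judge_better(problem, cand, reference):
                reference = cand                       # keep the improved trajectory
    return reference

# Toy usage: "answers" are numbers, and better means closer to a hidden target.
target = 42
gen = lambda p, reference, n: [(reference or 0) + i for i in range(1, n + 1)]
better = lambda p, a, b: abs(a - target) < abs(b - target)
print(marine_refine("toy", gen, better))
```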

Result: On BrowserComp-ZH benchmark, MARINE achieved 46.0% pass@1 accuracy with a 685B-parameter model. Most notably, an 80B-parameter model with MARINE matched the performance of standalone 1000B-parameter agents, reducing parameter requirements by over 10x. MARINE also delivers higher-quality samples than traditional sampling-and-ranking strategies within fixed computational budgets.

Conclusion: MARINE establishes a new paradigm for parameter-efficient reasoning that fundamentally departs from conventional approaches. It has great potential to boost post-training efficiency by providing higher-quality samples to alignment and optimization processes while dramatically reducing parameter requirements.

Abstract: Large Language Model (LLM)-based agents demonstrate advanced reasoning capabilities, yet practical constraints frequently limit outputs to single responses, leaving significant performance potential unrealized. This paper introduces MARINE (Multi-Agent Recursive IN-context Enhancement), a theoretically grounded framework that reconceptualizes test-time reasoning as iterative refinement of a persistent reference trajectory, fundamentally departing from conventional one-shot or multi-sample paradigms. The MARINE refinement operator systematically converts a base model’s pass@N capabilities into near-optimal pass@1 performance. Rigorous theoretical analysis establishes that minimal feasible batches maximize expected performance gains under fixed invocation budgets, while logarithmically growing batch schedules ensure continuous improvement without computational constraints. Comprehensive evaluation on the BrowserComp-ZH benchmark demonstrates state-of-the-art results, with a 685B-parameter implementation achieving 46.0% pass@1 accuracy. Meanwhile, MARINE establishes a new paradigm for parameter-efficient reasoning: an 80B-parameter model augmented with MARINE matches the performance of standalone 1000B-parameter agents, reducing parameter requirements by over an order of magnitude. Notably, within a fixed computational budget, the proposed MARINE delivers higher-quality samples to alignment and optimization processes than traditional sampling-and-ranking strategies. Consequently, it has great potential to boost post-training efficiency.

[439] Probabilistic Multi-Agent Aircraft Landing Time Prediction

Kyungmin Kim, Seokbin Yoon, Keumjin Lee

Main category: cs.MA

TL;DR: A probabilistic multi-agent framework for aircraft landing time prediction that provides uncertainty estimates and accounts for air traffic interactions, evaluated on Incheon Airport data.

DetailsMotivation: Accurate landing time prediction is crucial for air traffic management, but uncertainty in trajectories and multi-agent interactions make predictions challenging. Models need to provide both point estimates and uncertainty quantification while accounting for how nearby aircraft affect each other through ATC interventions.

Method: Proposed a probabilistic multi-agent aircraft landing time prediction framework that outputs landing times as distributions rather than point estimates. The model incorporates attention mechanisms to capture interactions between aircraft in the airspace.
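
A minimal editor-added sketch of the distributional output: a Gaussian head over each aircraft's landing time trained with a Gaussian negative log-likelihood. The multi-agent attention encoder is only stubbed out as random features here.

```python
# Gaussian output head for per-aircraft landing-time distributions.
import torch
import torch.nn as nn

class LandingTimeHead(nn.Module):
    def __init__(self, feature_dim):
        super().__init__()
        self.mean = nn.Linear(feature_dim, 1)
        self.log_var = nn.Linear(feature_dim, 1)

    def forward(self, agent_features):
        mu = self.mean(agent_features).squeeze(-1)
        var = torch.exp(self.log_var(agent_features)).squeeze(-1)
        return mu, var

# Toy usage: 6 aircraft in the scene with 128-dim features, assumed to come
# from an attention encoder over all aircraft in the terminal airspace.
head = LandingTimeHead(128)
feats = torch.randn(6, 128)
mu, var = head(feats)
target = torch.rand(6) * 30.0            # minutes to landing
loss = nn.GaussianNLLLoss()(mu, target, var)
print(mu.shape, loss)
```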

Result: The framework achieved higher prediction accuracy than baselines when evaluated on air traffic surveillance data from Incheon International Airport. It successfully quantified uncertainties in predictions and used attention scores to uncover underlying air traffic control patterns, enhancing explainability.

Conclusion: The proposed probabilistic multi-agent framework effectively addresses the challenges of landing time prediction by providing uncertainty-aware predictions that account for aircraft interactions, offering both improved accuracy and explainability for air traffic management.

Abstract: Accurate and reliable aircraft landing time prediction is essential for effective resource allocation in air traffic management. However, the inherent uncertainty of aircraft trajectories and traffic flows poses significant challenges to both prediction accuracy and trustworthiness. Therefore, prediction models should not only provide point estimates of aircraft landing times but also the uncertainties associated with these predictions. Furthermore, aircraft trajectories are frequently influenced by the presence of nearby aircraft through air traffic control interventions such as radar vectoring. Consequently, landing time prediction models must account for multi-agent interactions in the airspace. In this work, we propose a probabilistic multi-agent aircraft landing time prediction framework that provides the landing times of multiple aircraft as distributions. We evaluate the proposed framework using an air traffic surveillance dataset collected from the terminal airspace of the Incheon International Airport in South Korea. The results demonstrate that the proposed model achieves higher prediction accuracy than the baselines and quantifies the associated uncertainties of its outcomes. In addition, the model uncovered underlying patterns in air traffic control through its attention scores, thereby enhancing explainability.

cs.MM

[440] Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation

Xinyi Che, Wenbo Wang, Jian Guan, Qijun Zhao

Main category: cs.MM

TL;DR: OD-PFA framework improves multimodal emotion recognition by disentangling shared and modality-specific emotional cues using orthogonal separation and feature alignment.

DetailsMotivation: Existing MERC methods focus on aligning cross-modal semantics but overlook important modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language, limiting their ability to capture the full spectrum of emotional expression.

Method: Proposes Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA): 1) Decouples unimodal features into shared and modality-specific components, 2) Uses orthogonal disentanglement strategy with reconstruction loss to separate components while preserving emotional information, 3) Applies projected feature alignment to map shared features into common latent space with cross-modal consistency alignment loss.
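
An editor-added sketch of the two loss ingredients named above: an orthogonality penalty between shared and modality-specific components, and a cross-modal alignment loss on projected shared features. The paper's exact functional forms may differ.

```python
# Orthogonality and cross-modal alignment penalties on (batch, dim) features.
import torch
import torch.nn.functional as F

def orthogonality_loss(shared, specific):
    """Penalize correlation between shared and specific features."""
    shared = F.normalize(shared, dim=-1)
    specific = F.normalize(specific, dim=-1)
    return (shared * specific).sum(dim=-1).pow(2).mean()

def alignment_loss(shared_text, shared_audio, shared_visual):
    """Pull the projected shared features of all modalities together."""
    return (F.mse_loss(shared_text, shared_audio)
            + F.mse_loss(shared_text, shared_visual)
            + F.mse_loss(shared_audio, shared_visual)) / 3.0

# Toy usage with random features for a batch of 8 utterances.
t_sh, t_sp = torch.randn(8, 128), torch.randn(8, 128)
a_sh, v_sh = torch.randn(8, 128), torch.randn(8, 128)
print(orthogonality_loss(t_sh, t_sp), alignment_loss(t_sh, a_sh, v_sh))
```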

Result: Extensive evaluations on IEMOCAP and MELD benchmark datasets demonstrate the effectiveness of OD-PFA for multimodal emotion recognition tasks, outperforming state-of-the-art approaches.

Conclusion: OD-PFA successfully addresses limitations of existing methods by capturing both shared semantics and modality-specific emotional cues, leading to improved multimodal emotion recognition performance through effective disentanglement and alignment strategies.

Abstract: Multimodal Emotion Recognition in Conversation (MERC) significantly enhances emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly utilize techniques such as contrastive learning and cross-attention mechanisms to align cross-modal emotional semantics, they typically overlook modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language. To overcome these limitations, we propose Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA), a novel framework designed explicitly to capture both shared semantics and modality-specific emotional cues. Our approach first decouples unimodal features into shared and modality-specific components. An orthogonal disentanglement strategy (OD) enforces effective separation between these components, aided by a reconstruction loss to maintain critical emotional information from each modality. Additionally, a projected feature alignment strategy (PFA) maps shared features across modalities into a common latent space and applies a cross-modal consistency alignment loss to enhance semantic coherence. Extensive evaluations on widely-used benchmark datasets, IEMOCAP and MELD, demonstrate the effectiveness of our proposed OD-PFA on multimodal emotion recognition tasks, as compared with state-of-the-art approaches.

[441] Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation

Xiaosen Lyu, Jiayu Xiong, Yuren Chen, Wanlong Wang, Xiaoqing Dai, Jing Wang

Main category: cs.MM

TL;DR: CSS framework combines SPF for efficient multimodal fusion and PGM for stable optimization, achieving state-of-the-art emotion recognition performance.

DetailsMotivation: Existing MERC methods struggle with capturing complex cross-modal interactions and suffer from gradient conflicts/unstable training in deeper architectures.

Method: Proposes Cross-Space Synergy (CSS) with two components: Synergistic Polynomial Fusion (SPF) using low-rank tensor factorization for efficient high-order cross-modal interactions, and Pareto Gradient Modulator (PGM) for Pareto-optimal gradient updates to reduce conflicts.

Result: CSS outperforms existing methods on IEMOCAP and MELD datasets in both accuracy and training stability.

Conclusion: CSS effectively addresses multimodal interaction challenges and optimization instability, demonstrating strong performance in complex emotion recognition scenarios.

Abstract: Multimodal Emotion Recognition in Conversation (MERC) aims to predict speakers’ emotions by integrating textual, acoustic, and visual cues. Existing approaches either struggle to capture complex cross-modal interactions or experience gradient conflicts and unstable training when using deeper architectures. To address these issues, we propose Cross-Space Synergy (CSS), which couples a representation component with an optimization component. Synergistic Polynomial Fusion (SPF) serves the representation role, leveraging low-rank tensor factorization to efficiently capture high-order cross-modal interactions. Pareto Gradient Modulator (PGM) serves the optimization role, steering updates along Pareto-optimal directions across competing objectives to alleviate gradient conflicts and improve stability. Experiments show that CSS outperforms existing representative methods on IEMOCAP and MELD in both accuracy and training stability, demonstrating its effectiveness in complex multimodal scenarios.

eess.AS

[442] An Adaptive Method for Target Curve Selection

Gabriele Ravizza, Julián Villegas, Christer P. Volk, Tore Stegenborg-Andersen, Yan Pei

Main category: eess.AS

TL;DR: Adapting Interactive Differential Evolution algorithm to audio domain for identifying consumer headphone frequency response preferences using adaptive listening tests.

DetailsMotivation: To develop a method for identifying preferred over-the-ear headphone frequency response targets among consumers, particularly untrained listeners, which is challenging with traditional approaches.

Method: Adaptation of Interactive Differential Evolution (IDE) algorithm to audio domain using adaptive paired rating listening tests (paired comparison with scale). Data collected from three listening experiments with over 20 consumers.
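
For concreteness, an editor-added sketch of one differential-evolution generation over candidate frequency-response gain curves, with listener ratings standing in as the fitness signal; F, CR, and the toy "listener" are generic DE defaults, not the paper's settings.

```python
# One DE/rand/1 generation with binomial crossover; survival is decided by a
# rating callable that stands in for the interactive listening test.
import numpy as np

rng = np.random.default_rng(0)

def ide_generation(population, rate_candidate, F=0.8, CR=0.9):
    """population: (n_candidates, n_bands) gain curves in dB."""
    n, d = population.shape
    new_population = population.copy()
    for i in range(n):
        a, b, c = population[rng.choice([j for j in range(n) if j != i],
                                        size=3, replace=False)]
        mutant = a + F * (b - c)
        cross = rng.random(d) < CR
        cross[rng.integers(d)] = True            # ensure at least one gene crosses
        trial = np.where(cross, mutant, population[i])
        # The listener's paired rating decides which curve survives.
        if rate_candidate(trial) >= rate_candidate(population[i]):
            new_population[i] = trial
    return new_population

# Toy usage: 8 candidate target curves over 10 frequency bands; the simulated
# "listener" prefers curves close to a hidden flat target.
pop = rng.normal(scale=3.0, size=(8, 10))
listener = lambda curve: -np.abs(curve).sum()
for _ in range(20):
    pop = ide_generation(pop, listener)
print("best candidate deviation (dB):", np.abs(pop).sum(axis=1).min().round(2))
```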

Result: The method demonstrates convergence based on two convergence measures, showing potential for extracting frequency response preferences from untrained consumers.

Conclusion: The adapted IDE algorithm can successfully converge and may facilitate the task of identifying headphone frequency response preferences from untrained consumers.

Abstract: In this paper, we introduce an adaptation of the “Interactive Differential Evolution” (IDE) algorithm to the audio domain for the task of identifying the preferred over-the-ear headphone frequency response target among consumers. The method is based on data collection using an adaptive paired rating listening test paradigm (paired comparison with a scale). The IDE algorithm and its parameters are explained in detail. Additionally, data collected from three listening experiments with more than 20 consumers is presented, and the algorithm’s performance in this untested domain is investigated on the basis of two convergence measures. The results indicate that this method can converge and may ease the task of ’extracting’ frequency response preference from untrained consumers.

[443] BUT Systems for Environmental Sound Deepfake Detection in the ESDD 2026 Challenge

Junyi Peng, Lin Zhang, Jin Li, Oldrich Plchot, Jan Cernocky

Main category: eess.AS

TL;DR: BUT’s ESDD 2026 submission uses an ensemble of SSL models with MHFA back-end and feature augmentation to detect environmental sound deepfakes from unseen generators, achieving strong generalization performance.

DetailsMotivation: Address the critical challenge of generalizing audio deepfake detection to unseen synthesis algorithms, which is essential for real-world deployment where new generation methods constantly emerge.

Method: Proposes a robust ensemble framework combining diverse SSL models (BEATs, EAT, Dasheng for general audio, plus speech-specific SSLs) with lightweight Multi-Head Factorized Attention back-end. Introduces feature domain augmentation based on distribution uncertainty modeling to enhance robustness against unseen spectral distortions.
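
An editor-added sketch of what feature-domain augmentation via distribution uncertainty can look like: per-utterance feature statistics are resampled with Gaussian perturbations scaled by how much those statistics vary across the batch. The paper's exact formulation may differ.

```python
# Perturb per-channel feature statistics by their batch-level uncertainty,
# then re-normalize the features with the perturbed statistics.
import torch

def uncertainty_feature_augmentation(x, eps=1e-6):
    """x: (batch, time, channels) SSL features; returns perturbed features."""
    mu = x.mean(dim=1, keepdim=True)                 # (batch, 1, C)
    sigma = x.std(dim=1, keepdim=True) + eps
    # Uncertainty of the statistics themselves, estimated across the batch.
    mu_uncertainty = mu.std(dim=0, keepdim=True)
    sigma_uncertainty = sigma.std(dim=0, keepdim=True)
    new_mu = mu + torch.randn_like(mu) * mu_uncertainty
    new_sigma = sigma + torch.randn_like(sigma) * sigma_uncertainty
    return (x - mu) / sigma * new_sigma.clamp_min(eps) + new_mu

# Toy usage on a batch of 4 utterances, 200 frames, 768-dim SSL features.
feats = torch.randn(4, 200, 768)
print(uncertainty_feature_augmentation(feats).shape)
```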

Result: Best single system achieved EERs of 0.00%, 4.60%, and 4.80% on Development, Progress (Track 1), and Final Evaluation sets. Fusion system improved to 0.00%, 3.52%, and 4.38% respectively, demonstrating strong generalization to unseen generators.

Conclusion: The proposed SSL ensemble with MHFA back-end and feature augmentation effectively addresses the generalization challenge in environmental sound deepfake detection, achieving state-of-the-art performance on the ESDD 2026 challenge while using only official training data.

Abstract: This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00%, 4.60%, and 4.80% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00%, 3.52%, and 4.38% across the same partitions.

[444] Direction-of-Arrival and Noise Covariance Matrix joint estimation for beamforming

Vitor Gelsleichter Probst Curtarelli, Stephan Paul, Anderson Wedderhoff Spengler

Main category: eess.AS

TL;DR: Joint DoA and Noise Covariance Matrix estimation method with quasi-linear solution and multi-frequency operation for improved beamforming in reverberant environments.

DetailsMotivation: To develop a more efficient and robust joint estimation method for DoA and NCM that improves beamforming performance in challenging acoustic environments, particularly reverberant settings where traditional methods like MUSIC underperform.

Method: Proposes a joint estimation framework that: 1) Derives a quasi-linear solution for NCM estimation instead of exhaustive search, 2) Introduces novel DoA estimation operating across all frequency bins, 3) Builds upon existing NCM framework for beamforming applications.

Result: Outperforms classical techniques like MUSIC in mid- to high-angle scenarios with lower angular errors, achieves superior signal enhancement through beamforming, demonstrates better noise rejection and interference canceling capabilities compared to other signal enhancement techniques.

Conclusion: The proposed joint estimation method provides a more efficient and robust solution for DoA and NCM estimation, particularly effective in reverberant environments, with validated improvements in both theoretical and empirical performance metrics for beamforming applications.

Abstract: We propose a joint estimation method for the Direction-of-Arrival (DoA) and the Noise Covariance Matrix (NCM) tailored for beamforming applications. Building upon an existing NCM framework, our approach simplifies the estimation procedure by deriving a quasi-linear solution instead of the traditional exhaustive search. Additionally, we introduce a novel DoA estimation technique that operates across all frequency bins, improving robustness in reverberant environments. Simulation results demonstrate that our method outperforms classical techniques, such as MUSIC, in mid- to high-angle scenarios, achieving lower angular errors and superior signal enhancement through beamforming. The proposed framework was also compared against other signal enhancement techniques, showing better noise rejection and interference canceling capabilities. These improvements are validated using both theoretical and empirical performance metrics.

eess.IV

[445] Missing Wedge Inpainting and Joint Alignment in Electron Tomography through Implicit Neural Representations

Cedric Lim, Corneel Casert, Arthur R. C. McCray, Serin Lee, Andrew Barnum, Jennifer Dionne, Colin Ophus

Main category: eess.IV

TL;DR: Self-supervised implicit neural representation approach for electron tomography that addresses missing-wedge artifacts and alignment issues without requiring training data, enabling high-quality reconstructions from single datasets.

DetailsMotivation: Conventional electron tomography reconstruction suffers from missing-wedge artifacts and data misalignment due to experimental constraints. Existing supervised machine learning methods require training data and lack generalization across different materials systems.

Method: Proposes a fully self-supervised implicit neural representation (INR) approach using a neural network as a regularizer. The method enables inline alignment through pose optimization, missing wedge inpainting, and denoising of low dose datasets via model regularization using only a single dataset.

Result: The method produces high-quality tomograms from diverse and information-limited datasets. It shows that INR-based self-supervised reconstructions offer high fidelity with minimal user input and preprocessing.

Conclusion: The self-supervised INR approach can be readily applied to a wide variety of materials samples and experimental parameters, providing an effective solution to electron tomography reconstruction challenges without requiring external training data.

Abstract: Electron tomography is a powerful tool for understanding the morphology of materials in three dimensions, but conventional reconstruction algorithms typically suffer from missing-wedge artifacts and data misalignment imposed by experimental constraints. Recently proposed supervised machine-learning-enabled reconstruction methods to address these challenges rely on training data and are therefore difficult to generalize across materials systems. We propose a fully self-supervised implicit neural representation (INR) approach using a neural network as a regularizer. Our approach enables fast inline alignment through pose optimization, missing wedge inpainting, and denoising of low dose datasets via model regularization using only a single dataset. We apply our method to simulated and experimental data and show that it produces high-quality tomograms from diverse and information limited datasets. Our results show that INR-based self-supervised reconstructions offer high fidelity reconstructions with minimal user input and preprocessing, and can be readily applied to a wide variety of materials samples and experimental parameters.

[446] FlowSteer: Conditioning Flow Field for Consistent Image Restoration

Tharindu Wickremasinghe, Chenyang Qi, Harshana Weligampola, Zhengzhong Tu, Stanley H. Chan

Main category: eess.IV

TL;DR: FlowSteer is a zero-shot conditioning method that improves flow-based text-to-image models for image restoration tasks by injecting measurement priors during sampling, without retraining or adapters.

DetailsMotivation: Flow-based T2I models drift away from measurement fidelity in image restoration tasks. Existing solutions require data-specific flows or task-specific adapters that are computationally heavy and not scalable across tasks.

Method: FlowSteer is an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow’s implicit guidance with explicit measurement constraints. It leverages the nature of flow models and their sensitivities to noise to design the scheduler.
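
A schematic, editor-added sketch of measurement-prior injection during sampling: each Euler step of a frozen flow is followed by a gradient step toward measurement consistency. `velocity_model`, the degradation operator `A`, and the constant guidance weight are placeholders; the actual FlowSteer scheduler is designed around the flow's noise sensitivities.

```python
# Euler sampling of a frozen flow with a data-consistency correction per step.
import torch

@torch.no_grad()
def guided_flow_sampling(velocity_model, A, y_measured, shape,
                         num_steps=50, guidance_weight=1.0):
    x = torch.randn(shape)                 # noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        v = velocity_model(x, t)           # frozen prior's velocity field
        x = x + dt * v                     # plain Euler step along the flow
        # Measurement prior: nudge the sample toward A(x) ~= y_measured.
        with torch.enable_grad():
            x_req = x.detach().requires_grad_(True)
            residual = ((A(x_req) - y_measured) ** 2).sum()
            grad = torch.autograd.grad(residual, x_req)[0]
        x = x - guidance_weight * dt * grad
    return x

# Toy usage: identity degradation and a dummy zero velocity field.
A = lambda z: z
y = torch.zeros(1, 3, 8, 8)
vel = lambda z, t: torch.zeros_like(z)
print(guided_flow_sampling(vel, A, y, shape=(1, 3, 8, 8)).shape)
```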

Result: Across super-resolution, deblurring, denoising, and colorization tasks, FlowSteer improves measurement consistency and identity preservation in a strictly zero-shot setting without retrained models or adapters.

Conclusion: FlowSteer achieves higher fidelity of reconstructed images while leveraging the rich generative priors of flow models, demonstrating that existing generative capabilities can be efficiently manipulated for restoration tasks.

Abstract: Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often “drifting away” from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and not scalable across tasks. This raises the question “Can’t we efficiently manipulate the existing generative capabilities of a flow model?” To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow’s implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves a higher fidelity of reconstructed images, while leveraging the rich generative priors of flow models.

[447] Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: RF-Deep is a lightweight, plug-and-play random forests-based OOD detection framework that uses deep features from pretrained transformers to detect out-of-distribution inputs in 3D CT tumor segmentation, improving reliability for clinical deployment.

DetailsMotivation: Current state-of-the-art tumor segmentation models are vulnerable to out-of-distribution inputs, producing confidently incorrect segmentations that pose clinical risks. Existing OOD detection methods either suffer from task-specific biases or add computational overhead through architectural modifications.

Method: RF-Deep uses random forests with limited outlier exposure to analyze hierarchical deep features extracted from multiple regions of interest anchored to predicted tumor segmentations. It repurposes features from pretrained-then-finetuned backbone encoders, making it plug-and-play, lightweight, and scalable to varying image fields-of-view.
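
A rough sketch of the plug-and-play classifier idea: pooled deep features from in-distribution scans plus a small amount of outlier exposure are fed to a random forest, whose class-1 probability serves as the OOD score. The synthetic feature arrays below stand in for the paper's transformer-encoder features pooled over tumor-anchored ROIs.

```python
# Sketch: random forest OOD detector over deep features with limited outlier exposure.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
id_feats  = rng.normal(0.0, 1.0, size=(500, 256))   # deep features from ID scans
ood_feats = rng.normal(1.5, 1.0, size=(50, 256))    # limited outlier exposure

X = np.vstack([id_feats, ood_feats])
y = np.hstack([np.zeros(len(id_feats)), np.ones(len(ood_feats))])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ood_score = rf.predict_proba(id_feats[:5])[:, 1]     # probability of being OOD
```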

Result: RF-Deep achieved AUROC > 93.50 for challenging near-OOD datasets (pulmonary embolism, negative COVID-19) and near-perfect detection (AUROC > 99.00) for far-OOD datasets (kidney cancer, healthy pancreas), substantially outperforming logit-based and radiomics approaches across 1,916 CT scans.

Conclusion: RF-Deep provides an effective, lightweight, architecture-agnostic approach for OOD detection in tumor segmentation, enhancing reliability for clinical deployment without adding significant computational costs or requiring architectural modifications.

Abstract: Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks for safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a plug-and-play and lightweight post-hoc random forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone encoder, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. Hence, it scales to images of varying fields-of-view. We compared RF-Deep against existing OOD detection methods using 1,916 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC > 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC > 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained similar performance consistency across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.

[448] Learned iterative networks: An operator learning perspective

Andreas Hauptmann, Ozan Öktem

Main category: eess.IV

TL;DR: This chapter presents a unified operator view for learned iterative networks in image reconstruction, connecting functional analytic foundations with discrete implementations.

DetailsMotivation: Learned iterative networks are widely used in computational imaging but often viewed as purely discrete, despite being based on functional analytic optimization algorithms. There's a need to unify these perspectives and provide a coherent theoretical framework.

Method: The authors formulate a learned reconstruction operator that separates computation (how to compute) from learning (what to compute). They present this unified operator view for learned iterative networks, reviewing both linear and nonlinear inverse problems within this framework.
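
The operator view can be sketched as a short unrolled network in which each iteration combines a data-consistency gradient with a small learned update. The forward operator A, its adjoint At, and the tiny CNN updates below are illustrative assumptions, not the chapter's specific construction.

```python
# Sketch: an unrolled learned reconstruction operator (fixed number of iterations,
# each with a data-term gradient plus a learned proximal-like update).
import torch
import torch.nn as nn

class UnrolledReconstructor(nn.Module):
    def __init__(self, A, At, n_iter=5, channels=16):
        super().__init__()
        self.A, self.At, self.n_iter = A, At, n_iter
        self.updates = nn.ModuleList(
            nn.Sequential(nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(channels, 1, 3, padding=1))
            for _ in range(n_iter)
        )

    def forward(self, y, x0):
        x = x0
        for net in self.updates:
            grad = self.At(self.A(x) - y)               # gradient of the data term
            x = x + net(torch.cat([x, grad], dim=1))    # learned update step
        return x
```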

Result: The chapter demonstrates that many learned iterative approaches are closely related in their core when viewed through this unified operator framework. A short numerical study is presented to illustrate the concepts.

Conclusion: The unified operator view provides a coherent theoretical foundation for learned iterative networks, connecting their functional analytic origins with practical discrete implementations, and showing the fundamental relationships between different approaches.

Abstract: Learned image reconstruction has become a pillar in computational imaging and inverse problems. Among the most successful approaches are learned iterative networks, which are formulated by unrolling classical iterative optimisation algorithms for solving variational problems. While the underlying algorithm is usually formulated in the functional analytic setting, learned approaches are often viewed as purely discrete. In this chapter we present a unified operator view for learned iterative networks. Specifically, we formulate a learned reconstruction operator, defining how to compute, and separately the learning problem, which defines what to compute. In this setting we present common approaches and show that many approaches are closely related in their core. We review linear as well as nonlinear inverse problems in this framework and present a short numerical study to conclude.

[449] Enhancing Brain Age Estimation with a Multimodal 3D CNN Approach Combining Structural MRI and AI-Synthesized Cerebral Blood Volume Measures

Jordan Jomsky, Kay C. Igwe, Zongyu Li, Yiren Zhang, Max Lashley, Tal Nuriel, Andrew Laine, Jia Guo

Main category: eess.IV

TL;DR: Multimodal BrainAGE framework combining structural T1w MRI and AI-generated cerebral blood volume (AICBV) maps outperforms single-modality approaches in predicting brain age and detecting early neurodegenerative changes.

DetailsMotivation: Current BrainAGE approaches rely mainly on T1w structural MRI, missing functional vascular changes that may precede tissue damage and cognitive decline. AICBV maps offer vascular information relevant to early neurodegeneration without requiring contrast agents.

Method: Developed multimodal BrainAGE framework using two separate 3D VGG-based networks: one trained on structural T1w scans and one on AICBV maps generated from pre-trained 3D patch-based deep learning model. Models trained/validated on 2,851 scans from 13 open-source datasets and evaluated on ADNI subjects (n=1,233).
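
A minimal sketch of the late-fusion step, assumed from the description above: per-subject age estimates from the two single-modality networks are combined with linear regression, and BrainAGE is the gap between the fused prediction and chronological age. The arrays are synthetic placeholders.

```python
# Sketch: combine T1w- and AICBV-based age predictions via linear regression,
# then compute the brain age gap (BrainAGE) and MAE.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(50, 90, size=200)                   # chronological age
pred_t1w   = age + rng.normal(0, 4, size=200)         # 3D-CNN prediction from T1w
pred_aicbv = age + rng.normal(0, 5, size=200)         # 3D-CNN prediction from AICBV

X = np.column_stack([pred_t1w, pred_aicbv])
fusion = LinearRegression().fit(X, age)
brain_age_gap = fusion.predict(X) - age               # positive gap = "older" brain
mae = np.mean(np.abs(fusion.predict(X) - age))
```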

Result: Combined model achieved best brain age prediction (MAE=3.95 years, R²=0.943), outperforming T1w-only (MAE=4.10) and AICBV-only (MAE=4.49) models. BrainAGE increased stepwise across CN < MCI < AD and correlated with cognitive impairment. AICBV-based BrainAGE showed strong separation between stable vs. progressive MCI, suggesting sensitivity to prodromal vascular changes.

Conclusion: Multimodal BrainAGE combining structural and vascular information improves brain age prediction and shows promise for detecting early neurodegenerative changes, particularly vascular alterations that precede overt atrophy in MCI progression.

Abstract: Brain age gap estimation (BrainAGE) is a promising imaging-derived biomarker of neurobiological aging and disease risk, yet current approaches rely predominantly on T1-weighted structural MRI (T1w), overlooking functional vascular changes that may precede tissue damage and cognitive decline. Artificial intelligence-generated cerebral blood volume (AICBV) maps, synthesized from non-contrast MRI, offer an alternative to contrast-enhanced perfusion imaging by capturing vascular information relevant to early neurodegeneration. We developed a multimodal BrainAGE framework that integrates brain age predictions using linear regression from two separate 3D VGG-based networks, one model trained on only structural T1w scans and one trained on only AICBV maps generated from a pre-trained 3D patch-based deep learning model. Each model was trained and validated on 2,851 scans from 13 open-source datasets and was evaluated for concordance with mild cognitive impairment (MCI) and Alzheimer’s disease (AD) using ADNI subjects (n=1,233). The combined model achieved the most accurate brain age gap for cognitively normal (CN) controls, with a mean absolute error (MAE) of 3.95 years ($R^2$=0.943), outperforming models trained on T1w (MAE=4.10) or AICBV alone (MAE=4.49). Saliency maps revealed complementary modality contributions: T1w emphasized white matter and cortical atrophy, while AICBV highlighted vascular-rich and periventricular regions implicated in hypoperfusion and early cerebrovascular dysfunction, consistent with normal aging. Next, we observed that BrainAGE increased stepwise across diagnostic strata (CN < MCI < AD) and correlated with cognitive impairment (CDRSB r=0.403; MMSE r=-0.310). AICBV-based BrainAGE showed particularly strong separation between stable vs. progressive MCI (p=$1.47 \times 10^{-8}$), suggesting sensitivity to prodromal vascular changes that precede overt atrophy.

[450] Heart Failure Prediction using Modal Decomposition and Masked Autoencoders for Scarce Echocardiography Databases

Andrés Bell-Navas, María Villalba-Orero, Enrique Lara-Pezzi, Jesús Garicano-Mena, Soledad Le Clainche

Main category: eess.IV

TL;DR: Novel framework combining Modal Decomposition and Masked Autoencoders for heart failure time prediction from echocardiography videos, outperforming existing ViT and CNN architectures.

DetailsMotivation: Heart disease is the leading global cause of mortality (~18 million deaths/year), creating an urgent need for early, rapid, and effective heart failure prediction systems. Current approaches lack focus on predicting the specific time of heart failure occurrence.

Method: Two-stage system: 1) Transform echocardiography videos into annotated images using Higher Order Dynamic Mode Decomposition (HODMD) for data augmentation and feature extraction. 2) Train a Vision Transformer using Masked Autoencoders with combined self-supervised and supervised learning for real-time heart failure time prediction.
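
For the MAE component, the sketch below shows the standard random patch-masking step that lets the encoder see only a small fraction of patches during self-supervised pretraining; the patch count, embedding size, and mask ratio are generic assumptions, not the paper's settings.

```python
# Sketch: MAE-style random masking of embedded patches before the encoder.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D) -> visible patches and the indices of the kept patches."""
    B, N, _ = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patches, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return visible, ids_keep

patches = torch.randn(8, 196, 768)           # e.g. a 14x14 grid of embedded patches
visible, ids_keep = random_masking(patches)  # only ~25% of patches go to the encoder
```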

Result: The approach improves prediction accuracy from scarce databases and demonstrates superiority over established ViT and CNN architectures for heart failure time prediction.

Conclusion: The novel framework successfully addresses the challenging task of heart failure time prediction (previously unaddressed), combining modal decomposition with MAE-based ViT training to work effectively with limited data. Code will be released in ModelFLOWs-app software.

Abstract: Heart diseases remain the leading cause of mortality worldwide, accounting for approximately 18 million deaths per year according to the WHO. In particular, heart failures (HF) press the healthcare industry to develop systems for their early, rapid, and effective prediction. This work presents an automatic system based on a novel framework which combines Modal Decomposition and Masked Autoencoders (MAE) to extend the application from heart disease classification to the more challenging and specific task of heart failure time prediction, not previously addressed to the best of the authors’ knowledge. This system comprises two stages. The first one transforms the data from a database of echocardiography video sequences into a large collection of annotated images compatible with the training phase of machine learning-based and deep learning-based frameworks. This stage includes the use of the Higher Order Dynamic Mode Decomposition (HODMD) algorithm for both data augmentation and feature extraction. The second stage builds and trains a Vision Transformer (ViT). MAEs, based on a combined scheme of self-supervised (SSL) and supervised learning and so far barely explored in the literature on heart failure prediction, are adopted to effectively train the ViT from scratch, even with scarce databases. The designed neural network analyses images from echocardiography sequences in real time to estimate the time at which a heart failure will occur. This approach is shown to improve prediction accuracy on scarce databases and to outperform several established ViT and Convolutional Neural Network (CNN) architectures. The source code will be incorporated into the next version release of the ModelFLOWs-app software (https://github.com/modelflows/ModelFLOWs-app).

[451] From Fibers to Cells: Fourier-Based Registration Enables Virtual Cresyl Violet Staining From 3D Polarized Light Imaging

Alexander Oberstrass, Esteban Vaca, Eric Upschulte, Meiqi Niu, Nicola Palomero-Gallagher, David Graessel, Christian Schiffer, Markus Axer, Katrin Amunts, Timo Dickscheid

Main category: eess.IV

TL;DR: Deep learning generates virtual Cresyl violet staining from 3D-PLI brain images, eliminating need for physical post-staining and registration.

DetailsMotivation: Current methods require costly nonlinear registration between 3D-PLI fiber images and post-stained cytoarchitecture due to tissue distortions, limiting sample availability.

Method: Supervised deep learning for image-to-image translation using Fourier-based registration to align training patches from 3D-PLI and actual Cresyl violet stained sections.
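
Fourier-based registration of this kind typically amounts to phase correlation. Below is a small NumPy sketch that recovers the integer shift between two patches from the peak of the inverse FFT of the normalized cross-power spectrum, as an illustration of how training patches could be aligned on the fly (not the authors' exact pipeline).

```python
# Sketch: phase-correlation registration of two image patches.
import numpy as np

def phase_correlation_shift(a, b, eps=1e-8):
    """Estimate the integer (dy, dx) by which patch b is shifted relative to patch a."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross_power = np.conj(Fa) * Fb
    cross_power /= np.abs(cross_power) + eps
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the patch size to negative offsets.
    if dy > a.shape[0] // 2: dy -= a.shape[0]
    if dx > a.shape[1] // 2: dx -= a.shape[1]
    return dy, dx

a = np.random.rand(64, 64)
b = np.roll(a, shift=(5, -3), axis=(0, 1))
print(phase_correlation_shift(a, b))   # expected: (5, -3)
```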

Result: Method successfully predicts virtual Cresyl violet staining from 3D-PLI, showing plausible cell organization patterns with larger cell bodies correctly localized.

Conclusion: Virtual staining approach enables spatially aligned cellular-level analysis of fiber-cytoarchitecture relationships without physical post-staining or complex registration.

Abstract: Comprehensive assessment of the various aspects of the brain’s microstructure requires the use of complementary imaging techniques. This includes measuring the spatial distribution of cell bodies (cytoarchitecture) and nerve fibers (myeloarchitecture). The gold standard for cytoarchitectonic analysis is light microscopic imaging of cell-body stained tissue sections. To reveal the 3D orientations of nerve fibers, 3D Polarized Light Imaging (3D-PLI) has been introduced, a method that is label-free and allows subsequent staining of sections after 3D-PLI measurement. By post-staining for cell bodies, a direct link between fiber- and cytoarchitecture can potentially be established in the same section. However, inevitable distortions introduced during the staining process make a costly nonlinear and cross-modal registration necessary in order to study the detailed relationships between cells and fibers in the images. In addition, the complexity of processing histological sections for post-staining only allows for a limited number of such samples. In this work, we take advantage of deep learning methods for image-to-image translation to generate a virtual staining of 3D-PLI that is spatially aligned at the cellular level. We use a supervised setting, building on a unique dataset of brain sections, to which Cresyl violet staining has been applied after 3D-PLI measurement. To ensure high correspondence between both modalities, we address the misalignment of training data using Fourier-based registration. In this way, registration can be efficiently calculated during training for local image patches of target and predicted staining. We demonstrate that the proposed method can predict a Cresyl violet staining from 3D-PLI, resulting in a virtual staining that exhibits plausible patterns of cell organization in gray matter, with larger cell bodies being localized at their expected positions.

[452] PET Image Reconstruction Using Deep Diffusion Image Prior

Fumio Hashimoto, Kuang Gong

Main category: eess.IV

TL;DR: Anatomical prior-guided PET reconstruction using diffusion models with HQS optimization, enabling cross-tracer generalization for low-dose PET imaging.

DetailsMotivation: Diffusion models show promise for medical imaging but face limitations in PET due to tracer-specific contrast variability and high computational demands, limiting cross-tracer applications.

Method: Proposed anatomical prior-guided PET reconstruction using diffusion models within the DDIP framework, alternating between diffusion sampling and model fine-tuning guided by the PET sinogram. Used the HQS algorithm to decouple network optimization from iterative reconstruction for efficiency.
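
A schematic of the HQS-style decoupling: the reconstruction alternates between a data-fidelity update against the sinogram (with a quadratic coupling term) and a prior update produced by the diffusion model. The least-squares data term, projector call, and diffusion_refine hook are simplified placeholders (PET likelihoods are Poisson in practice), not the paper's implementation.

```python
# Sketch: half-quadratic splitting that alternates a data sub-problem with a
# diffusion-model prior sub-problem.
import torch

def hqs_pet_reconstruction(x, z, forward_proj, sinogram, diffusion_refine,
                           n_outer=10, mu=0.1, lr=1e-2):
    for _ in range(n_outer):
        # (1) data sub-problem: fit the sinogram while staying close to the prior image z
        x = x.detach().requires_grad_(True)
        loss = ((forward_proj(x) - sinogram) ** 2).sum() + mu * ((x - z) ** 2).sum()
        grad = torch.autograd.grad(loss, x)[0]
        x = (x - lr * grad).detach()
        # (2) prior sub-problem: refine with the (fine-tuned) diffusion model
        z = diffusion_refine(x)
    return x
```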

Result: Method successfully reconstructed high-quality images across different PET tracers using models pretrained on different tracer data. Demonstrated robust generalization across tracer distributions and scanner types in simulation and clinical datasets.

Conclusion: The proposed method provides an efficient and versatile reconstruction framework for low-dose PET imaging that can generalize across different tracers and scanner types.

Abstract: Diffusion models have shown great promise in medical image denoising and reconstruction, but their application to Positron Emission Tomography (PET) imaging remains limited by tracer-specific contrast variability and high computational demands. In this work, we proposed an anatomical prior-guided PET image reconstruction method based on diffusion models, inspired by the deep diffusion image prior (DDIP) framework. The proposed method alternated between diffusion sampling and model fine-tuning guided by the PET sinogram, enabling the reconstruction of high-quality images from various PET tracers using a score function pretrained on a dataset of another tracer. To improve computational efficiency, the half-quadratic splitting (HQS) algorithm was adopted to decouple network optimization from iterative PET reconstruction. The proposed method was evaluated using one simulation and two clinical datasets. For the simulation study, a model pretrained on [$^{18}$F]FDG data was tested on [$^{18}$F]FDG data and amyloid-negative PET data to assess out-of-distribution (OOD) performance. For the clinical-data validation, ten low-dose [$^{18}$F]FDG datasets and one [$^{18}$F]Florbetapir dataset were tested on a model pretrained on data from another tracer. Experiment results show that the proposed PET reconstruction method can generalize robustly across tracer distributions and scanner types, providing an efficient and versatile reconstruction framework for low-dose PET imaging.

[453] Guiding WaveMamba with Frequency Maps for Image Debanding

Xinyi Wang, Smaranda Tasmoc, Nantheera Anantrasirichai, Angeliki Katsenou

Main category: eess.IV

TL;DR: Proposes a banding restoration method using Wavelet State Space Model and frequency masking to remove compression artifacts in smooth regions while preserving textures.

DetailsMotivation: Compression at low bitrates introduces banding artifacts in smooth regions like skies, degrading visual quality, especially in user-generated content due to repeated transcoding.

Method: Uses Wavelet State Space Model with frequency masking map to preserve high-frequency details while removing banding artifacts through post-processing.
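
One simple way to realize a frequency masking map from a wavelet decomposition, as a hedged illustration of the idea rather than the paper's exact construction: high-frequency sub-band energy flags textured regions to preserve, while the remaining smooth regions are where debanding is applied most aggressively.

```python
# Sketch: build a [0, 1] texture mask from single-level wavelet detail energy.
import numpy as np
import pywt

def frequency_mask(gray, wavelet="haar"):
    """gray: 2D float image in [0, 1] -> mask in [0, 1], high where texture is present."""
    _, (cH, cV, cD) = pywt.dwt2(gray, wavelet)
    energy = cH ** 2 + cV ** 2 + cD ** 2
    # Upsample the sub-band energy back to the input resolution.
    energy = np.kron(energy, np.ones((2, 2)))[: gray.shape[0], : gray.shape[1]]
    return np.clip(energy / (energy.max() + 1e-8), 0.0, 1.0)

img = np.random.rand(128, 128)
mask = frequency_mask(img)   # blend: output = mask*original + (1-mask)*debanded
```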

Result: Achieves DBI value of 0.082 on BAND-2k dataset, outperforming state-of-the-art methods while preserving image textures, confirmed by visual inspection.

Conclusion: The proposed post-processing approach effectively suppresses banding artifacts compared to existing methods while maintaining image quality, with code publicly available.

Abstract: Compression at low bitrates in modern codecs often introduces banding artifacts, especially in smooth regions such as skies. These artifacts degrade visual quality and are common in user-generated content due to repeated transcoding. We propose a banding restoration method that employs the Wavelet State Space Model and a frequency masking map to preserve high-frequency details. Furthermore, we provide a benchmark of open-source banding restoration methods and evaluate their performance on two public banding image datasets. Experimentation on the available datasets suggests that the proposed post-processing approach effectively suppresses banding compared to the state-of-the-art method (a DBI value of 0.082 on BAND-2k) while preserving image textures. Visual inspections of the results confirm this. Code and supplementary material are available at: https://github.com/xinyiW915/Debanding-PCS2025.

[454] Random forest-based out-of-distribution detection for robust lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: RF-Deep uses random forest classifier with deep features from pretrained transformer encoder to detect out-of-distribution CT scans and improve cancer segmentation reliability.

DetailsMotivation: Transformer-based segmentation models degrade when applied to out-of-distribution CT datasets, creating reliability issues for automated cancer treatment planning and response assessment.

Method: Developed RF-Deep: random forest classifier using deep features from Swin Transformer encoder pretrained with SimMIM on 10,432 unlabeled 3D CT scans. Segmentation model trained on 317 lung cancer scans, with RF-Deep detecting OOD cases on independent test datasets.

Result: RF-Deep achieved FPR95 of 18.26% on PE, 27.66% on COVID-19, and <0.1% on abdominal CTs, consistently outperforming established OOD detection methods.
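
FPR95, the metric quoted above, is the false-positive rate on ID data at the operating point where 95% of OOD samples are detected. A small sketch of the computation, assuming higher scores mean "more OOD":

```python
# Sketch: FPR at 95% TPR for an OOD score.
import numpy as np

def fpr_at_95_tpr(scores_id, scores_ood):
    threshold = np.percentile(scores_ood, 5)          # detect 95% of OOD samples
    return float(np.mean(scores_id >= threshold))     # fraction of ID flagged as OOD

rng = np.random.default_rng(0)
print(fpr_at_95_tpr(rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)))
```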

Conclusion: RF-Deep provides a simple and effective approach to enhance reliability of cancer segmentation in both in-distribution and out-of-distribution scenarios.

Abstract: Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 public 3D CT scans that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.

Last updated: 2025-12-19
Built with Hugo, theme based on Stack